I think I saw a paper arguing that CMA-ES approximates the natural gradient, which is not the same gradient you see in typical NN training. Or at least that's how I understood it. (I have no background in data science or ML, I'm just a bored engineer.)
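The "search gradient from black-box evaluations" idea can be sketched with a much simpler cousin of CMA-ES: an isotropic evolution-strategies estimator with antithetic sampling (OpenAI-ES style). This is a hedged illustration, not CMA-ES itself — CMA-ES additionally adapts a full covariance matrix, which is where the natural-gradient connection comes in. All names here are made up for the example.

```python
import numpy as np

def es_gradient_estimate(f, theta, sigma=0.1, n_samples=50, rng=None):
    """Estimate a search gradient of f at theta using only black-box
    evaluations of f (antithetic sampling: two calls per sample)."""
    rng = rng or np.random.default_rng(0)
    d = theta.shape[0]
    grad = np.zeros(d)
    for _ in range(n_samples):
        eps = rng.standard_normal(d)
        # difference of the antithetic pair correlates fitness with direction
        grad += (f(theta + sigma * eps) - f(theta - sigma * eps)) * eps
    return grad / (2 * sigma * n_samples)

# sanity check on a quadratic, where the true gradient is 2 * theta
theta = np.array([1.0, -2.0])
g = es_gradient_estimate(lambda x: np.sum(x**2), theta, n_samples=2000)
```

Even on this 2-parameter toy problem the estimate is noisy at 2000 samples, which hints at the sample-count problem below.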

I haven't estimated the number of trials you would need for a 10M-parameter Shakespeare model, but to reach the same level as gradient descent I think it might be around 10M, i.e. the same ballpark as the number of parameters. Which makes some intuitive sense given how little you learn from each black-box function evaluation.
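The "one scalar per evaluation" intuition can be made concrete with forward finite differences, which are the information-maximal black-box gradient: each call reveals one number, so recovering all d gradient components takes d+1 calls. A minimal sketch (function and variable names are my own, for illustration):

```python
import numpy as np

def fd_gradient(f, theta, h=1e-5):
    """Forward-difference gradient: one black-box call per coordinate,
    plus one at theta itself -- d + 1 evaluations for d parameters."""
    base = f(theta)
    d = len(theta)
    g = np.empty(d)
    calls = 1
    for i in range(d):
        e = np.zeros(d)
        e[i] = h
        g[i] = (f(theta + e) - base) / h
        calls += 1
    return g, calls

# evaluation count scales linearly with parameter count
theta = np.random.default_rng(1).standard_normal(10_000)
g, calls = fd_gradient(lambda x: float(np.sum(x**2)), theta)
# calls is 10_001 here; scale that to 10M parameters per gradient step
```

Backprop gets all d components in roughly one forward-plus-backward pass, which is the gap a black-box optimizer has to somehow close.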

There's maybe some hope that there is hidden structure in the problem that needs far fewer than 10M effective degrees of freedom, so that a black-box optimizer might still find it. I don't have my hopes up, but I'm trying to poke at it.

I would think that if it turns out LLMs are not totally impossible with black-box optimization, then it would be good to find a reason to use it. E.g. objective functions that have no hope of a usable gradient, some weird architecture that can't be trained conventionally, or fine-tuning gradient-descent-trained models with those things. Feels like a solution looking for a problem.

I'm doing my project for fun and just seeing if it's possible at all.
