The model's weights can be sharded across multiple GPUs. A common training server might contain, for instance, eight A100 GPUs with 40 GB each (80 GB in the larger variant), for a total of 320 GB of working VRAM. Because the GPUs sit in the same machine, they can communicate with each other quickly enough to compute in coordination this way. This setup is _very_ expensive, of course; likely in the hundreds of thousands of dollars.
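To get a feel for the sharding arithmetic, here's a back-of-the-envelope sketch. The 8×40 GB and 8×80 GB server figures come from the example above; the 320 GB model size and the `overhead` fraction are hypothetical placeholders, since real deployments also need VRAM for activations, KV cache, and framework buffers.

```python
# Back-of-the-envelope: does a sharded model fit across N GPUs?
# Model size and overhead fraction are illustrative assumptions.

def fits(model_gb: float, n_gpus: int, gb_per_gpu: float,
         usable_fraction: float = 0.8) -> bool:
    """True if each GPU's equal share of the weights fits in its
    usable VRAM. `usable_fraction` reserves headroom for activations
    and framework buffers."""
    per_gpu_share = model_gb / n_gpus          # GB of weights per GPU
    usable = gb_per_gpu * usable_fraction      # GB actually available
    return per_gpu_share <= usable

# A hypothetical 320 GB model on the two server configs above:
print(fits(320, 8, 40))  # 40 GB share vs 32 GB usable -> False
print(fits(320, 8, 80))  # 40 GB share vs 64 GB usable -> True
```

In other words, a model whose weights exactly match the server's raw VRAM total doesn't actually fit, because some of each card's memory goes to things other than weights.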
If you're hoping to run the model yourself, you'll need enough money and expertise to rent and deploy it to a server with that many GPUs. Alternatively, volunteers and other researchers may quantize (compress) the model, making it easier to run on configurations with less VRAM.
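Why quantization helps is just arithmetic: weight memory scales with bits per weight. A minimal sketch (the 175B parameter count is a hypothetical example, and real memory use adds overhead beyond the weights themselves):

```python
# Rough weight-memory estimate at different precisions.
# Parameter count is an illustrative assumption; actual usage
# adds overhead for activations, KV cache, and buffers.

def weight_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate GB needed just to hold the weights."""
    bytes_per_weight = bits_per_weight / 8
    return params_billion * 1e9 * bytes_per_weight / 1e9

for bits in (16, 8, 4):
    print(f"175B params at {bits}-bit: ~{weight_gb(175, bits):.0f} GB")
```

Going from 16-bit to 4-bit cuts the weight footprint by 4x, which is what moves a model from "needs a multi-GPU server" toward "might fit on a single high-memory machine", at some cost in output quality.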
Running it on CPU may indeed be very slow, but it could still be fast enough for inference (just running the model), as opposed to training it. I've seen (limited) success reported on the maxed-out Mac lineup (~$4,500) using the beefy M1/M2 chips.