
A 34B model at Q4 quantization will use around 20 GB of memory.
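As a back-of-envelope check (assuming Q4-family quants average roughly 4.5 bits per weight once block scales are included, which is an approximation, not an exact figure for any specific quant):

```python
params = 34e9          # 34B parameters
bits_per_weight = 4.5  # rough average for Q4-family quants, incl. scale overhead

gb = params * bits_per_weight / 8 / 1e9
# ≈ 19 GB for the weights alone; the KV cache adds more on top,
# which is how you end up near 20 GB in practice.
```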

If it's running slowly, make sure Metal is actually being used[0]. If it happens not to be enabled, turning it on can boost throughput by as much as 50-100% in tokens/s.

I'm averaging 7 to 8 tokens/s on a 10-core M1 Max (24 GPU cores).

[0] If using llama-cpp-python (or text-generation-webui, ollama, etc.), try:

`pip uninstall llama-cpp-python && CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python`



Thank you. I had to reduce the context length (from 16k to 8k) to get this to work without crashing, and I'm seeing the ~100% speed-up you mentioned.

However, when I run the LLM, macOS becomes sluggish. I assume this is because the GPU is utilized to the point where hardware-accelerated rendering slows down due to insufficient resources.

I wonder if there's a way to avoid that slowdown?


I haven't noticed any slowdowns. Maybe check that threads/n_threads is set correctly for your machine (total cores minus 2: 10 cores → 8, 8 cores → 6).

n_gpu_layers should also be set to anything other than 0 (the default). I don't think the exact number matters for Metal, but I use 128.
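A minimal sketch of how these settings map onto the llama-cpp-python `Llama` constructor, assuming a local GGUF model at a hypothetical path and a build with Metal enabled:

```python
import os

# Heuristic from this thread: leave two cores free for the OS/UI
# so rendering doesn't starve while the model is generating.
n_threads = max(1, (os.cpu_count() or 4) - 2)

# Hypothetical usage; requires llama-cpp-python built with LLAMA_METAL=on
# and a real model file on disk:
#
# from llama_cpp import Llama
# llm = Llama(
#     model_path="models/model-q4.gguf",  # hypothetical path
#     n_ctx=8192,           # reduced context to avoid crashes
#     n_threads=n_threads,
#     n_gpu_layers=128,     # any nonzero value offloads layers to the GPU
# )
```

The `Llama` call is commented out so the snippet stands on its own; swap in your actual model path to run it.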



