>You don't need "very much" expert overlap to see aggregate gains at scale, you ...

zozbot234 · 2026-05-25T11:02:36 1779706956

An aggregate speedup of 2x is a lot, we don't need that in a local context. Local hardware is heavily constrained by power and thermals, not just bandwidth; so all we really care about is raising compute intensity for decode a little bit to relax the memory bandwidth constraint. The average factor will depend on just how sparse the model is and how far you can push parallelism, there isn't just one single answer.

EnPissant · 2026-05-25T18:13:08 1779732788

But you won't see 2x expert re-use, the speedup with 5 streams will be tiny.