This looks awesome!!! I’m curious about the ensemble: does it mean “train 8 different models and pick the best one”? That’s where my mind jumps first, but it also seems wrong, because then you could presumably just keep increasing the number of different models you train and keep getting wins.
Maybe some newer references are better, but my mind went to the Model Soups paper[1]:
The conventional recipe for maximizing model accuracy is to (1) train multiple models with various hyperparameters and (2) pick the individual model which performs best on a held-out validation set, discarding the remainder. In this paper, we revisit the second step of this procedure in the context of fine-tuning large pre-trained models, where fine-tuned models often appear to lie in a single low error basin. We show that averaging the weights of multiple models fine-tuned with different hyperparameter configurations often improves accuracy and robustness. Unlike a conventional ensemble, we may average many models without incurring any additional inference or memory costs -- we call the results "model soups."
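If I'm reading that right, the "soup" is literally just an element-wise mean over the fine-tuned checkpoints' weights, so inference still costs the same as a single model. A rough sketch of what I imagine that looks like (the function name and paths are made up, not from the paper's code):

```python
import torch

def make_soup(checkpoint_paths):
    """Element-wise average of the weights of several fine-tuned checkpoints."""
    state_dicts = [torch.load(p, map_location="cpu") for p in checkpoint_paths]
    soup = {}
    for name in state_dicts[0]:
        # Stack the same tensor from every checkpoint and take the mean.
        # (A real implementation would skip/handle integer buffers separately.)
        soup[name] = torch.stack([sd[name].float() for sd in state_dicts]).mean(dim=0)
    return soup  # a single model's worth of weights: model.load_state_dict(soup)
```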
It's the opposite of a MoE architecture in many ways. MoE splits every individual feed-forward layer into many tiny subnetworks, only a small number of which contribute to the layer output, and they get trained together to complement each other.
Ensembling makes multiple copies of the entire model, trains them independently on the same task, and then has every copy contribute to the output.
MoE reduces computation, ensembling increases it; MoE operates at per-layer granularity, ensembling on the whole model; MoE aims for specialization, ensembling for redundancy.
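To make the contrast concrete, here's a toy sketch (the expert count, top-k routing, and all names are illustrative, not taken from any particular implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    """One feed-forward layer split into experts; a router picks top_k of them per input."""
    def __init__(self, dim, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        self.router = nn.Linear(dim, num_experts)
        self.top_k = top_k

    def forward(self, x):  # x: (batch, dim)
        scores = self.router(x)                          # (batch, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)   # keep only the top_k experts per input
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                    # inputs routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out                                       # only top_k experts ran per input

def ensemble_predict(models, x):
    """Every independently trained copy sees every input; average their outputs."""
    return torch.stack([m(x) for m in models]).mean(dim=0)
```

The MoE layer only runs top_k of its experts on each input, while the ensemble runs every copy on everything and averages the results.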