It's the opposite of a MoE architecture in many ways. MoE splits each individual feed-forward layer into many small expert subnetworks; a router activates only a few of them for any given input, and the experts are trained jointly so they learn to complement each other.
Ensembling, by contrast, makes multiple copies of the entire model, trains them independently on the same task, and then has every copy contribute to the final output.
In short: MoE reduces computation where ensembling increases it; MoE operates at per-layer granularity where ensembling operates on the whole model; MoE encourages specialization where ensembling builds redundancy.
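The contrast can be sketched in a few lines of toy code. This is a minimal illustration, not either technique's real implementation: the "experts" and "models" are just random linear maps, the router is a single learned-weight matrix stand-in, and all names (`moe_layer`, `ensemble`, `router_w`) are invented for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_layer(x, experts, router_w, k=2):
    """Toy MoE feed-forward layer: route the input to its top-k experts
    and combine their outputs, weighted by softmax router scores."""
    scores = x @ router_w                    # one score per expert
    top = np.argsort(scores)[-k:]            # indices of the k best experts
    weights = np.exp(scores[top])
    weights /= weights.sum()                 # softmax over the chosen experts only
    return sum(w * (experts[i] @ x) for w, i in zip(weights, top))

def ensemble(x, models):
    """Toy ensemble: every independently trained model runs in full,
    and their outputs are averaged."""
    return np.mean([m @ x for m in models], axis=0)

d = 8
x = rng.normal(size=d)
experts = [rng.normal(size=(d, d)) for _ in range(4)]  # 4 experts, only 2 active
router_w = rng.normal(size=(d, 4))
models = [rng.normal(size=(d, d)) for _ in range(4)]   # 4 full models, all active

y_moe = moe_layer(x, experts, router_w, k=2)  # cost scales with k, not the expert count
y_ens = ensemble(x, models)                   # cost scales with the ensemble size
```

Note how the compute asymmetry shows up directly: the MoE layer only multiplies by the `k` selected expert matrices, while the ensemble multiplies by every model's matrix before averaging.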