The problem is that they have to be tested for 7B models at least to show promis...

tbalsam · on March 22, 2024

Based on my personal experience with model development over the years, I believe this is more a failure of the current mainline version of Transformers (the ++ version, I believe) to scale properly than an indicator of scale itself.

If that is the case, then it may well be possible to fix some of the scaling issues that are more apparent with smaller transformer models (maybe not, though). This is at least some of the reasoning that I've been applying when developing hlb-gpt, for example. It's also part of why I think changing how we use nonlinearities within the network might impact scaling: some of the activation spikes in the more linear regions of the network are used to control network behavior in a way not originally intended.

Agreed that it does require a ton of resources, though. But I do think that the problem can be solved on a smaller scale. If we don't have a cleanly logarithmic curve, then I think that something is deeply wrong with our base architecture. (However, of course, I may be missing something entirely here.)

quotemstr · on March 22, 2024

I wonder whether we're missing out on techniques that work well on large models but that don't show promise on small ones.

hackerlight · on March 23, 2024

More like we're missing out on techniques, full stop. Proving things at scale is GPU-expensive, which gatekeeps publication and therefore accessibility.