"Look, humans can't explain or understand how we drive, speak, translate, play chess, etc., so why should we expect to understand how the models that do these things work?"
I agree with you, but it's also amazing how much DeepMind has achieved by putting neuroscientists and machine learning experts in the same room and trying to make mechanisms that work inside the human brain work efficiently on metal.
If you look at this talk from 2010, Demis was already listing attention as an example (and attention was responsible for the recent improvement in protein-folding prediction): https://www.youtube.com/watch?v=F5PSyu7booU
As far as I'm aware, attention does not even attempt biological plausibility, nor was it in any way inspired by biology. The issues attention addresses are very specific to the sequential nature of so-called Recurrent Neural Networks. The first issue is known as exploding / vanishing gradients: as you keep multiplying some vector by matrices, you will either blow that vector up to infinity or squeeze it to zero, and the same happens to the derivatives. The second issue is that you cannot parallelize a sequential operation. Attention addresses both issues by removing recurrence, using a specific invented mathematical structure. There was no name for it, but "attention" gives a good intuition for what that mathematical structure is trying to do. Kind of like how quantum chromodynamics uses the term "colors" in a way that has nothing to do with light, photons or even the electromagnetic force.
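The exploding/vanishing point is easy to see in a tiny sketch (pure Python; the matrices and names here are purely illustrative, not from any real model):

```python
import math

def matvec(M, v):
    """Multiply a small matrix by a vector, the core op a recurrent net repeats."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

v = [1.0, 1.0]
shrink = [[0.5, 0.0], [0.0, 0.5]]  # largest singular value < 1
grow   = [[1.5, 0.0], [0.0, 1.5]]  # largest singular value > 1

u, w = v[:], v[:]
for _ in range(50):          # 50 "time steps" of a toy recurrence
    u = matvec(shrink, u)    # vanishes toward zero
    w = matvec(grow, w)      # explodes toward infinity

print(norm(u))  # tiny (~1e-15): gradients through this chain vanish
print(norm(w))  # huge (~1e9): gradients through this chain explode
```

Attention sidesteps this by replacing the 50-step multiplication chain with a single weighted sum over all positions at once, which is also why it parallelizes.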
>As far as I'm aware, attention does not even attempt biological plausibility, nor was it in any way inspired by biology.
It may not have been the intention, but associative memory is one of the only mechanisms that computational neuroscientists broadly agree on. There's been recent work on energy-based models that suggests biologically plausible mechanisms adjacent to attention. [0]
Absolutely, modern NN architectures have been inspired by biological ones, despite their massive differences.
Even in cases like attention, the modern version (the one that actually works in GPT-3, AlphaFold2, etc.) has little in common with either the English word or what we intuitively think of as attention. It's a formula with two matmuls and a softmax: softmax(AB)C. In particular, it doesn't necessarily look anywhere at all; it's just a weighted sum of the inputs. Nothing like the hard attention used by the human visual cortex.
It's not even that different from a convolution where you allow the weights to be a function of the input.
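A minimal sketch of that formula in pure Python (a toy, not any library's API): each softmax row is a set of "attention weights", and each output row is just a convex weighted sum of the rows of C.

```python
import math

def softmax(xs):
    m = max(xs)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(A, B, C):
    """softmax(AB)C: A is (queries x d), B is (d x keys), C is (keys x d_v)."""
    out = []
    for a in A:
        # one matmul: raw scores for this query against every key
        scores = [sum(a[k] * B[k][j] for k in range(len(a)))
                  for j in range(len(B[0]))]
        w = softmax(scores)          # weights are non-negative and sum to 1
        # second matmul: weighted sum of the rows of C
        out.append([sum(w[i] * C[i][j] for i in range(len(C)))
                    for j in range(len(C[0]))])
    return out

out = attention([[10.0, 0.0]],               # one query matching the first key
                [[1.0, 0.0], [0.0, 1.0]],
                [[1.0, 2.0], [3.0, 4.0]])
print(out)  # ≈ [[1.0, 2.0]]: nearly all weight lands on the first row of C
```

Note that nothing in this computation "looks" anywhere; the weights simply concentrate wherever the scores are large.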
So the inspiration might have come from humans, but the actual architectures have largely come from pure trial and error, with limited, difficult-to-explain intuition about what tends to work.
"This work provides evidence that attention layers can perform convolution and, indeed, they often learn to do so in practice. Specifically, we prove that a multi-head self-attention layer with sufficient number of heads is at least as expressive as any convolutional layer. Our numerical experiments then show that self-attention layers attend to pixel-grid patterns similarly to CNN layers, corroborating our analysis."
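A toy way to see the flavor of that claim (an illustrative sketch, not the paper's actual construction): if an attention row degenerates to a one-hot at a fixed relative offset, the weighted sum reduces exactly to a convolution with a delta kernel at that offset.

```python
def attend_one_hot(values, offset):
    """Attention where each position puts all its weight at a fixed offset."""
    n = len(values)
    out = []
    for i in range(n):
        j = min(max(i + offset, 0), n - 1)                   # clamp at edges
        weights = [1.0 if k == j else 0.0 for k in range(n)]  # one-hot row
        out.append(sum(w * v for w, v in zip(weights, values)))
    return out

def conv_delta(values, offset):
    """Convolution with a kernel that is 1 at `offset` and 0 elsewhere."""
    n = len(values)
    return [values[min(max(i + offset, 0), n - 1)] for i in range(n)]

vals = [1.0, 2.0, 3.0, 4.0, 5.0]
print(attend_one_hot(vals, -1))                       # each position copies its left neighbor
print(conv_delta(vals, -1) == attend_one_hot(vals, -1))  # True: same operation
```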
Most definitely. What I meant, though, is: how do we know mental attention is not essentially a big softmax (over some context, e.g. short- or long-term memory)?
The eye physically focusing on one thing at a time seems like a special case (and it isn't even true of all animals, e.g. many prey species), not a part of the brain's attention mechanism.
Actually, the retinal anisotropy is a feature distinct from attention. You can fixate on a point in a scene and then attend to any other point in the scene. This is in fact how attention experiments in both animals and humans are set up in the field, precisely to control for eye movements.