"Look, humans can't explain or understand how we drive, speak, translate, play chess, etc., so why should we expect to understand how the models that do these things work?"
I agree with you, but it's also amazing how much DeepMind has achieved by putting neuroscientists and machine learning experts in the same room and trying to make mechanisms that work inside the human brain work efficiently on metal.
If you look at this talk from 2010, Demis was already listing attention as an example (and attention was responsible for the recent improvement in protein-folding prediction): https://www.youtube.com/watch?v=F5PSyu7booU
As far as I'm aware, attention does not even attempt biological plausibility, nor was it in any way inspired by biology. The issues attention addresses are very specific to the sequential nature of so-called Recurrent Neural Networks. The first issue is known as exploding / vanishing gradients: as you keep multiplying some vector by matrices, you will either blow that vector up to infinity or squeeze it to zero, and the same happens to the derivatives. The second issue is that you cannot parallelize a sequential operation. Attention addresses both issues by removing recurrence, using a specific invented mathematical structure. There was no name for it, but "attention" gives a good intuition for what that mathematical structure is trying to do. Kind of like how quantum chromodynamics uses the term "colors" in a way that has nothing to do with light, photons or even the electromagnetic force.
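The exploding/vanishing point is easy to see in a tiny sketch (pure Python; the matrices and names here are purely illustrative, not from any real model):

```python
import math

def matvec(M, v):
    """Multiply a small matrix by a vector, the core op a recurrent net repeats."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

v = [1.0, 1.0]
shrink = [[0.5, 0.0], [0.0, 0.5]]  # largest singular value < 1
grow   = [[1.5, 0.0], [0.0, 1.5]]  # largest singular value > 1

u, w = v[:], v[:]
for _ in range(50):          # 50 "time steps" of a toy recurrence
    u = matvec(shrink, u)    # vanishes toward zero
    w = matvec(grow, w)      # explodes toward infinity

print(norm(u))  # tiny (~1e-15): gradients through this chain vanish
print(norm(w))  # huge (~1e9): gradients through this chain explode
```

Attention sidesteps this by replacing the 50-step multiplication chain with a single weighted sum over all positions at once, which is also why it parallelizes.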
>As far as I'm aware, attention does not even attempt biological plausibility, nor was it in any way inspired by biology.
It may not have been the intention, but associative memory is one of the only mechanisms that computational neuroscientists broadly agree on. There's been recent work on energy-based models that suggests biologically plausible mechanisms adjacent to attention. [0]
Absolutely, modern NN architectures have been inspired by biological ones, despite their massive differences.
Even in cases like attention, the modern version (the one that actually works in GPT-3, AlphaFold2, etc.) has little in common with either the English word or what we intuitively think of as attention. It's a formula with two matmuls and a softmax: softmax(AB)C. In particular, it doesn't necessarily look anywhere at all; it's just a weighted sum of the inputs. Nothing like the hard attention used by the human visual cortex.
It's not even that different from a convolution where you allow the weights to be a function of the input.
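A minimal sketch of that formula in pure Python (a toy, not any library's API): each softmax row is a set of "attention weights", and each output row is just a convex weighted sum of the rows of C.

```python
import math

def softmax(xs):
    m = max(xs)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(A, B, C):
    """softmax(AB)C: A is (queries x d), B is (d x keys), C is (keys x d_v)."""
    out = []
    for a in A:
        # one matmul: raw scores for this query against every key
        scores = [sum(a[k] * B[k][j] for k in range(len(a)))
                  for j in range(len(B[0]))]
        w = softmax(scores)          # weights are non-negative and sum to 1
        # second matmul: weighted sum of the rows of C
        out.append([sum(w[i] * C[i][j] for i in range(len(C)))
                    for j in range(len(C[0]))])
    return out

out = attention([[10.0, 0.0]],               # one query matching the first key
                [[1.0, 0.0], [0.0, 1.0]],
                [[1.0, 2.0], [3.0, 4.0]])
print(out)  # ≈ [[1.0, 2.0]]: nearly all weight lands on the first row of C
```

Note that nothing in this computation "looks" anywhere; the weights simply concentrate wherever the scores are large.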
So the inspiration might have come from humans, but the actual architectures have largely come from pure trial and error, with limited, difficult-to-explain intuition about what tends to work.
"This work provides evidence that attention layers can perform convolution and, indeed, they often learn to do so in practice. Specifically, we prove that a multi-head self-attention layer with sufficient number of heads is at least as expressive as any convolutional layer. Our numerical experiments then show that self-attention layers attend to pixel-grid patterns similarly to CNN layers, corroborating our analysis."
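A toy way to see the flavor of that claim (an illustrative sketch, not the paper's actual construction): if an attention row degenerates to a one-hot at a fixed relative offset, the weighted sum reduces exactly to a convolution with a delta kernel at that offset.

```python
def attend_one_hot(values, offset):
    """Attention where each position puts all its weight at a fixed offset."""
    n = len(values)
    out = []
    for i in range(n):
        j = min(max(i + offset, 0), n - 1)                   # clamp at edges
        weights = [1.0 if k == j else 0.0 for k in range(n)]  # one-hot row
        out.append(sum(w * v for w, v in zip(weights, values)))
    return out

def conv_delta(values, offset):
    """Convolution with a kernel that is 1 at `offset` and 0 elsewhere."""
    n = len(values)
    return [values[min(max(i + offset, 0), n - 1)] for i in range(n)]

vals = [1.0, 2.0, 3.0, 4.0, 5.0]
print(attend_one_hot(vals, -1))                       # each position copies its left neighbor
print(conv_delta(vals, -1) == attend_one_hot(vals, -1))  # True: same operation
```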
Most definitely. What I meant, though, is: how do we know mental attention is not essentially a big softmax (over some context, e.g. short- or long-term memory)?
The eye physically focusing on one thing at a time seems like a special case (and it isn't even true of all animals, e.g. many prey species), not a part of the brain's attention mechanism.
Actually, the retinal anisotropy is a feature distinct from attention. You can fixate on a point in a scene and then attend to any other point in the scene. This is in fact how attention experiments in both animals and humans are set up in the field, precisely to control for eye movements.