
Attention Is All You Need

https://arxiv.org/abs/1706.03762

It's from 2017 but I first read it this year. This is the paper that defined the "transformer" architecture for deep neural nets. Over the past few years, transformers have become a more and more common architecture, most notably in GPT-3, but also in domains beyond text generation. The fundamental principle behind the transformer is that it can detect patterns among an O(n) input size without requiring an O(n^2) size neural net.

If you are interested in GPT-3 and want to read something beyond the GPT-3 paper itself, I think this is the best paper to read to get an understanding of this transformer architecture.



“it can detect patterns among an O(n) input size without requiring an O(n^2) size neural net”

This might be misleading: the amount of computation for processing a sequence of size N with a vanilla transformer is still O(N^2). There has been recent work, however, that tries to make them scale better.
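To make the point concrete, here's a minimal NumPy sketch of vanilla scaled dot-product attention (this is an illustration, not the paper's reference code). The weight count is independent of sequence length, but the score matrix is N x N, which is where the quadratic cost lives:

```python
import numpy as np

def attention(Q, K, V):
    """Vanilla scaled dot-product attention for a length-N sequence.

    Q, K, V: (N, d) arrays. The `scores` matrix below is (N, N),
    so time and memory grow quadratically in N even though the
    model's parameters do not.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (N, N) -- the O(N^2) part
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                              # (N, d)

N, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
out = attention(Q, K, V)
```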


You raise an important point. There are too many proposed solutions to enumerate, but if I had to pick just one right now it would be "Rethinking Attention with Performers" [1]. Research into making transformers work better on higher-dimensional inputs is also moving fast and is worth following.

[1] https://arxiv.org/abs/2009.14794
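A rough sketch of the core Performer idea (positive random features approximating the softmax kernel; this simplifies the paper's full FAVOR+ construction, e.g. it omits orthogonal projections): replace exp(q.k/sqrt(d)) with a dot product of feature maps phi(q).phi(k), so attention can be computed in O(N*m*d) without ever forming an N x N matrix.

```python
import numpy as np

def performer_attention(Q, K, V, m=256, seed=0):
    """Sketch of kernelized linear attention a la Performers.

    phi is a positive random-feature map with E[phi(q).phi(k)] = exp(q.k),
    applied to inputs scaled by d^(-1/4) to recover the usual 1/sqrt(d)
    softmax temperature. No (N, N) matrix is ever materialized.
    """
    N, d = Q.shape
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((m, d))      # random projections, w ~ N(0, I)

    def phi(X):
        # positive features: exp(w.x - |x|^2 / 2) / sqrt(m)
        return np.exp(X @ W.T - 0.5 * (X**2).sum(-1, keepdims=True)) / np.sqrt(m)

    Qf, Kf = phi(Q / d**0.25), phi(K / d**0.25)
    KV = Kf.T @ V                        # (m, d): sequence length folded away
    Z = Qf @ Kf.sum(axis=0)              # per-query normalizer
    return (Qf @ KV) / Z[:, None]

N, d = 16, 8
rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
out = performer_attention(Q, K, V)
```

Because all features are positive, each output row is still a convex combination of the value rows, just like true softmax attention.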


It's clearly important but I found that paper hard to follow. The discussion in AIMA 4th edition was clearer. (Is there an even better explanation somewhere?)


I found it difficult to read too. Here's an annotated version, with code, which helps:

https://nlp.seas.harvard.edu/2018/04/03/attention.html


It's crazy to me to see what still feel like new developments (come on, it was just 2017!) making their way into mainstream general-purpose undergraduate textbooks like AIMA. Is this what getting old feels like? :-\

I start to understand what you always hear from older ICs about having to work to keep up, or else every undergrad coming out will know things you don't.


I would argue that input scaling is not fundamental to Transformers.

Recurrent neural network size is also independent of input sequence length.
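To illustrate the point (a generic sketch, not tied to any particular library): a vanilla Elman-style RNN cell has a parameter count fixed by its layer widths, and the same weights are reused at every timestep, so it handles sequences of any length.

```python
import numpy as np

def rnn_param_count(d_in, d_hidden):
    # Weights of a vanilla (Elman) cell: h_t = tanh(W_x x_t + W_h h_{t-1} + b).
    # The count depends only on layer widths, never on sequence length.
    return d_in * d_hidden + d_hidden * d_hidden + d_hidden

def run_rnn(xs, W_x, W_h, b):
    h = np.zeros(W_h.shape[0])
    for x in xs:                 # the same weights are reused at every step
        h = np.tanh(W_x @ x + W_h @ h + b)
    return h

d_in, d_h = 10, 32
rng = np.random.default_rng(0)
W_x = rng.standard_normal((d_h, d_in))
W_h = rng.standard_normal((d_h, d_h))
b = np.zeros(d_h)

# The same fixed-size cell processes a length-5 and a length-5000 sequence:
short = run_rnn(rng.standard_normal((5, d_in)), W_x, W_h, b)
long = run_rnn(rng.standard_normal((5000, d_in)), W_x, W_h, b)
```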

The successful removal of inductive bias is really what differentiates this from previous sequence-to-sequence neural networks.


Which inductive bias?


Presumably that the output at step (n) is conditioned only on the output of step (n-1).
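For concreteness, here's a minimal NumPy sketch (an illustration, not from the paper) of the causal mask a transformer decoder uses instead: position i attends directly to every earlier position j <= i, rather than seeing the prefix only through the previous step's state as a recurrence does.

```python
import numpy as np

# Causal (autoregressive) attention mask for N = 5 positions.
# Row i has ones in columns 0..i: direct access to the whole prefix,
# not just to position i - 1.
N = 5
mask = np.tril(np.ones((N, N), dtype=bool))
```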



