
Attention Is All You Need

https://arxiv.org/abs/1706.03762

It's from 2017 but I first read it this year. This is the paper that defined the "transformer" architecture for deep neural nets. Over the past few years, transformers have become a more and more common architecture, most notably in GPT-3, but also in domains beyond text generation. The fundamental principle behind the transformer is that it can detect patterns among an O(n) input size without requiring an O(n^2) size neural net.

If you are interested in GPT-3 and want to read something beyond the GPT-3 paper itself, I think this is the best paper to read to get an understanding of this transformer architecture.



“it can detect patterns among an O(n) input size without requiring an O(n^2) size neural net”

This might be misleading: the amount of computation for processing a sequence of size N with a vanilla transformer is still O(N^2). There has been recent work, however, that tries to make them scale better.
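To make the point concrete, here's a minimal NumPy sketch of vanilla scaled dot-product attention (this is an illustration, not the paper's reference code). The weight count is independent of sequence length, but the score matrix is N x N, which is where the quadratic cost lives:

```python
import numpy as np

def attention(Q, K, V):
    """Vanilla scaled dot-product attention for a length-N sequence.

    Q, K, V: (N, d) arrays. The `scores` matrix below is (N, N),
    so time and memory grow quadratically in N even though the
    model's parameters do not.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (N, N) -- the O(N^2) part
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                              # (N, d)

N, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
out = attention(Q, K, V)
```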


You raise an important point. There are too many proposed solutions to enumerate, but if I had to pick just one right now it would be "Rethinking Attention with Performers" [1]. Research into making transformers work better on higher-dimensional inputs is also moving fast and is worth following.

[1] https://arxiv.org/abs/2009.14794
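A rough sketch of the core Performer idea (positive random features approximating the softmax kernel; this simplifies the paper's full FAVOR+ construction, e.g. it omits orthogonal projections): replace exp(q.k/sqrt(d)) with a dot product of feature maps phi(q).phi(k), so attention can be computed in O(N*m*d) without ever forming an N x N matrix.

```python
import numpy as np

def performer_attention(Q, K, V, m=256, seed=0):
    """Sketch of kernelized linear attention a la Performers.

    phi is a positive random-feature map with E[phi(q).phi(k)] = exp(q.k),
    applied to inputs scaled by d^(-1/4) to recover the usual 1/sqrt(d)
    softmax temperature. No (N, N) matrix is ever materialized.
    """
    N, d = Q.shape
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((m, d))      # random projections, w ~ N(0, I)

    def phi(X):
        # positive features: exp(w.x - |x|^2 / 2) / sqrt(m)
        return np.exp(X @ W.T - 0.5 * (X**2).sum(-1, keepdims=True)) / np.sqrt(m)

    Qf, Kf = phi(Q / d**0.25), phi(K / d**0.25)
    KV = Kf.T @ V                        # (m, d): sequence length folded away
    Z = Qf @ Kf.sum(axis=0)              # per-query normalizer
    return (Qf @ KV) / Z[:, None]

N, d = 16, 8
rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
out = performer_attention(Q, K, V)
```

Because all features are positive, each output row is still a convex combination of the value rows, just like true softmax attention.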


It's clearly important but I found that paper hard to follow. The discussion in AIMA 4th edition was clearer. (Is there an even better explanation somewhere?)


I found it difficult to read too. Here's an annotated version, with code, which helps:

https://nlp.seas.harvard.edu/2018/04/03/attention.html


It's crazy to me to see what still feel like new developments (come on, it was just 2017!) making their way into mainstream general-purpose undergraduate textbooks like AIMA. Is this what getting old feels like? :-\

I start to understand what you always hear from older ICs about having to work to keep up, or else every undergrad coming out will know things you don't.


I would argue that input scaling is not fundamental to Transformers.

Recurrent neural network size is also independent of input sequence length.
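To illustrate the point (a generic sketch, not tied to any particular library): a vanilla Elman-style RNN cell has a parameter count fixed by its layer widths, and the same weights are reused at every timestep, so it handles sequences of any length.

```python
import numpy as np

def rnn_param_count(d_in, d_hidden):
    # Weights of a vanilla (Elman) cell: h_t = tanh(W_x x_t + W_h h_{t-1} + b).
    # The count depends only on layer widths, never on sequence length.
    return d_in * d_hidden + d_hidden * d_hidden + d_hidden

def run_rnn(xs, W_x, W_h, b):
    h = np.zeros(W_h.shape[0])
    for x in xs:                 # the same weights are reused at every step
        h = np.tanh(W_x @ x + W_h @ h + b)
    return h

d_in, d_h = 10, 32
rng = np.random.default_rng(0)
W_x = rng.standard_normal((d_h, d_in))
W_h = rng.standard_normal((d_h, d_h))
b = np.zeros(d_h)

# The same fixed-size cell processes a length-5 and a length-5000 sequence:
short = run_rnn(rng.standard_normal((5, d_in)), W_x, W_h, b)
long = run_rnn(rng.standard_normal((5000, d_in)), W_x, W_h, b)
```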

The successful removal of inductive bias is really what differentiates this from previous sequence-to-sequence neural networks.


Which inductive bias?


Presumably that the output at step (n) is conditioned only on the output of step (n-1).
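For concreteness, here's a minimal NumPy sketch (an illustration, not from the paper) of the causal mask a transformer decoder uses instead: position i attends directly to every earlier position j <= i, rather than seeing the prefix only through the previous step's state as a recurrence does.

```python
import numpy as np

# Causal (autoregressive) attention mask for N = 5 positions.
# Row i has ones in columns 0..i: direct access to the whole prefix,
# not just to position i - 1.
N = 5
mask = np.tril(np.ones((N, N), dtype=bool))
```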



