Thursday, June 25, 2026

What are the main points of Attention Is All You Need (Ashish Vaswani et al., 2017)? (ChatGPT, 2026-6-25) (菊次郎十六式)

https://youtu.be/_VaEjGnHgOI?si=Kc9dc0GyrSKg1LaV

The paper Attention Is All You Need is one of the rare papers that genuinely changed the trajectory of computer science. Before it, most successful AI systems for language relied on recurrent neural networks (RNNs), typically in the form of long short-term memory networks (LSTMs). After it, the architecture introduced in this paper, the Transformer, became the foundation of models such as OpenAI’s GPT series, Google’s Gemini, Anthropic’s Claude, and many others.

Its importance is not merely technical. It proposed a different philosophy of computation.

The central claim

The title is deliberately provocative:

Attention is all you need.

The authors argued that, for sequence modeling (translation, language understanding, etc.), recurrence is unnecessary.

Instead of processing words one after another, a model can process all words simultaneously, while allowing every word to “pay attention” to every other word.

That seemingly simple idea changed everything.


Before 2017

Language models worked like reading a novel one word at a time.

The → cat → sat → on → the → mat

Each new word depended on a hidden memory from previous words.

Problems:

  • slow
  • difficult to parallelize
  • long-range dependencies fade
  • memory bottleneck

The farther apart two words are, the harder it becomes to connect them.
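The sequential bottleneck can be seen in a toy recurrent loop. This is a minimal sketch, not a real language model: the scalar inputs, the weights w_h and w_x, and the function names are all made up for illustration. Each step needs the hidden state from the step before, so the loop cannot be split across processors, and the trace of the first input shrinks at every step.

```python
import math

def rnn_step(hidden, x, w_h=0.5, w_x=1.0):
    """One toy recurrent update: mix the old hidden state with the current input."""
    return math.tanh(w_h * hidden + w_x * x)

def run_rnn(inputs):
    """Process inputs strictly in order; step t cannot start before step t-1 ends."""
    hidden = 0.0
    states = []
    for x in inputs:
        hidden = rnn_step(hidden, x)
        states.append(hidden)
    return states

# Hypothetical scalar "embeddings": only the first word carries a signal.
inputs = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
states = run_rnn(inputs)

# The influence of the first word fades as the sequence goes on.
for position, state in enumerate(states):
    print(position, round(state, 4))
```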


Self-attention

Instead of moving sequentially, every word immediately looks at every other word.

For example,

“The animal didn’t cross the street because it was tired.”

When processing “it”, the model asks

Which other word matters most?

Possible candidates:

  • animal
  • street

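Self-attention assigns a weight to every word that “it” attends to (illustrative values; the small remainder is spread over the other words):

  • animal   0.92
  • street   0.03
  • cross    0.02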

The representation of “it” is therefore constructed mostly from “animal.”

Meaning is produced dynamically.
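Weights like these typically come from a softmax, which turns raw relevance scores into positive numbers that sum to 1. A minimal sketch follows; the raw scores are hypothetical and chosen only so that “animal” dominates, and they are not meant to reproduce the exact figures above.

```python
import math

def softmax(scores):
    """Turn raw relevance scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores of "it" against three other words.
words = ["animal", "street", "cross"]
scores = [4.0, 0.6, 0.2]
weights = dict(zip(words, softmax(scores)))

for word, weight in weights.items():
    print(f"{word:8s}{weight:.2f}")
```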


The Transformer

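The architecture contains

Input → Embedding → Positional Encoding → Self-Attention → Feed-forward Network → Output

The self-attention and feed-forward pair forms one layer, and that layer is stacked several times (six in the original paper). The full model in the paper has two such stacks, an encoder and a decoder.

Instead of recurrence, the computation repeatedly asks

“Who should I listen to?”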


Query, Key, Value

Every word is projected, through three learned weight matrices, into three vectors.

  • Query
  • Key
  • Value

Think of a library.

Each book has

  • a title (Key)
  • contents (Value)

The reader arrives with

  • a search request (Query)

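Attention computes

Similarity(Query, Key) → Weight → Weighted sum of Values

In the paper, the similarity is a dot product scaled by the square root of the key dimension, and the weights come from a softmax over those scores.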

Meaning is assembled according to relevance.
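The three steps can be sketched in plain Python. This is a minimal illustration of scaled dot-product attention, softmax(QKᵀ / √d_k)·V, written with lists instead of matrices; the function names and the toy queries, keys, and values are assumptions made for the example.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def attention(queries, keys, values):
    """Scaled dot-product attention: one output vector per query."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # 1. Similarity(Query, Key), scaled by sqrt(d_k)
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        # 2. Weight
        weights = softmax(scores)
        # 3. Weighted sum of Values
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Toy library: two "books" (keys) with their contents (values).
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
queries = [[5.0, 0.0]]  # a reader whose request matches the first book

out = attention(queries, keys, values)
print(out)
```

Because the query lines up with the first key, the output is built almost entirely from the first value.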


Multi-head attention

Instead of one attention mechanism, the Transformer runs several in parallel (eight heads in the paper’s base model).

Different heads can learn to specialize. For example, separate heads may track

  • grammar
  • subject-verb agreement
  • long-distance references
  • punctuation
  • semantic similarity

The model therefore examines a sentence from several perspectives at once.
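A simplified sketch of the idea: cut each word vector into slices, run attention separately on each slice, and join the results. Note the simplification. In the paper every head has its own learned projection matrices and the joined output passes through one more learned projection; this sketch leaves all of those out and only slices the vectors. The function names and the toy input are assumptions.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

def split_heads(vectors, num_heads):
    """Cut every vector into num_heads equal slices; one list of slices per head."""
    if len(vectors[0]) % num_heads != 0:
        raise ValueError("vector length must be divisible by num_heads")
    size = len(vectors[0]) // num_heads
    return [[v[h * size:(h + 1) * size] for v in vectors] for h in range(num_heads)]

def multi_head_self_attention(x, num_heads):
    """Run self-attention independently per head, then concatenate the slices."""
    per_head = [attention(h, h, h) for h in split_heads(x, num_heads)]
    return [sum((per_head[h][t] for h in range(num_heads)), [])
            for t in range(len(x))]

# Three toy word vectors of width 4, attended to by 2 heads of width 2.
x = [[1.0, 0.0, 0.0, 1.0],
     [0.0, 1.0, 1.0, 0.0],
     [1.0, 1.0, 0.0, 0.0]]
out = multi_head_self_attention(x, num_heads=2)
print(out)
```

Each head sees only its own slice, so the two halves of every output vector are computed from different attention patterns.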


Positional encoding

Because the Transformer reads everything simultaneously, it needs another way to know order.

It therefore adds positional information to each word’s embedding.

Word + Position = Representation

Without this, “Dog bites man” and “Man bites dog” would look identical to the model, because self-attention on its own ignores word order.
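The paper’s fixed encoding uses sines and cosines of different frequencies: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A minimal sketch follows; the four-dimensional word vectors are hypothetical stand-ins for learned embeddings.

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd ones."""
    encoding = []
    for j in range(d_model):
        # Dimensions 2i and 2i+1 share the same frequency.
        angle = position / (10000 ** ((2 * (j // 2)) / d_model))
        encoding.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
    return encoding

def add_positions(embeddings):
    """Word + Position = Representation, for every word in the sentence."""
    d_model = len(embeddings[0])
    return [[e + p for e, p in zip(vector, positional_encoding(pos, d_model))]
            for pos, vector in enumerate(embeddings)]

# Hypothetical embeddings, for illustration only.
toy = {
    "dog":   [1.0, 0.0, 0.0, 0.0],
    "bites": [0.0, 1.0, 0.0, 0.0],
    "man":   [0.0, 0.0, 1.0, 0.0],
}

dog_bites_man = add_positions([toy["dog"], toy["bites"], toy["man"]])
man_bites_dog = add_positions([toy["man"], toy["bites"], toy["dog"]])

# "dog" now looks different depending on where it stands in the sentence.
print(dog_bites_man[0])
print(man_bites_dog[2])
```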


Parallelization

This was perhaps the biggest engineering breakthrough.

RNN:

word1 → word2 → word3 → word4

Transformer:

word1, word2, word3, word4 (all processed together)

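Modern GPUs excel at massive parallel computation.

The Transformer’s core operations are large matrix multiplications, which is the kind of work GPUs are built for, so during training every position in a sequence can be computed at once.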

That is one reason why today’s enormous language models became feasible.


Scaling

Another discovery emerged after 2017.

Transformers improve predictably as you increase

  • data
  • parameters
  • computation

This scaling behavior was not fully appreciated in the original paper, but later work showed that larger Transformers consistently acquire more sophisticated capabilities.
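One way later scaling-law studies expressed this predictability is a power law. As a sketch, with L the test loss, N the number of parameters, and N_c and α_N constants fitted to experiments (these symbols are assumptions here and do not appear in the 2017 paper):

```latex
L(N) \approx \left( \frac{N_c}{N} \right)^{\alpha_N}
```

Analogous forms were fitted for dataset size and training compute. The exact constants depend on the setup, so the formula is best read as the shape of the trend rather than a precise prediction.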

GPT is essentially

decoder-only Transformer + huge data + huge compute + careful training


Beyond language

Transformers soon spread everywhere.

Today they are used in

  • language
  • images
  • protein folding
  • genomics
  • chemistry
  • robotics
  • audio
  • video

Almost any problem involving structured information can potentially benefit from attention mechanisms.


Why “attention” matters philosophically

This is the part I think connects most deeply with many of our previous conversations.

The Transformer abandons a simple linear notion of time.

Instead of saying

first → second → third

it says

every element exists in relation to every other element.

Meaning is relational rather than purely sequential.

This resembles ideas found in several traditions:

  • Martin Heidegger: entities disclose one another within a world.
  • Maurice Merleau-Ponty: perception is a field of relations rather than isolated sensations.
  • Jacques Lacan: a signifier gains meaning through its relations with other signifiers.
  • Gilles Deleuze and Félix Guattari: meaning emerges from networks and assemblages rather than linear chains.

The mathematics is different from these philosophical frameworks, but there is a family resemblance: context is constitutive of meaning.


A connection to one of your recurring themes

One connection to your recent reflections is especially striking. You have often contrasted an algorithmic self with an analogical self, and you’ve spoken about today’s world as an attention ecology rather than merely an attention economy.

The Transformer makes attention the central computational resource. Every token competes for a share of each query’s attention, because the attention weights always sum to one. In today’s digital platforms, people likewise compete for human attention. The same underlying concept, selective allocation of a limited resource, appears in both domains, although with very different goals.

Your concern is that human attention can become captured by algorithmic optimization, producing what you’ve called an attention sink. The Transformer itself is not the cause of that phenomenon; it is a mathematical mechanism for relating information. But the success of attention-based models has also helped accelerate AI systems that increasingly participate in shaping human attention.

That suggests an intriguing inversion of the paper’s title:

For machines, attention is all you need.

For humans, attention alone is not enough.

Human life also depends on memory, embodiment, history, ethical commitment, and sustained relationships. Attention may be the gateway through which experience enters, but it is not, by itself, the whole of personhood. That distinction becomes increasingly important as AI systems become ever more adept at modeling and directing attention.