How AI Learned to Read: The Attention Mechanism Explained

"The trophy didn't fit in the suitcase because it was too big."

You read that and instantly knew: "it" means the trophy. Not the suitcase.

You didn't reason through it. You looked at the whole sentence at once, felt which words were related, and resolved the meaning without thinking about the process.

This is something humans do effortlessly — and it turned out to be one of the hardest problems in AI.

How AI Used to Read: One Word at a Time

Before 2017, the dominant approach to language in AI was sequential.

Models called RNNs (Recurrent Neural Networks) read text like a typewriter produces it — left to right, one word at a time. At each step, the model updated an internal memory with what it had seen so far, then moved on.

The memory was supposed to carry context forward. But it faded.

By the time an RNN reached the end of a long sentence, the beginning had become a faint signal buried under everything that came after.
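A toy version makes the fading visible. The sketch below is not a real RNN: it keeps a single number as memory, and the weights `w_in` and `w_mem` are made-up values rather than anything learned. It reads two sequences that differ only in their first word and measures how much that difference still shows in the final memory.

```python
import math

def rnn_read(inputs, w_in=0.5, w_mem=0.5):
    """Read a sequence one step at a time, carrying a single-number memory.

    w_in and w_mem are hypothetical fixed weights; a real RNN learns them
    and uses vectors and matrices instead of single numbers.
    """
    memory = 0.0
    for x in inputs:
        # Each step depends on the memory left by the step before it.
        memory = math.tanh(w_in * x + w_mem * memory)
    return memory

def first_word_influence(length):
    """How much does changing only the first input change the final memory?"""
    rest = [0.2] * (length - 1)
    memory_a = rnn_read([1.0] + rest)
    memory_b = rnn_read([-1.0] + rest)
    return abs(memory_a - memory_b)

for length in (2, 5, 10, 20):
    print(length, first_word_influence(length))
```

In this toy setup the gap shrinks with every extra step, so by the end of a long sequence the first word has almost no say in what the memory holds.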

LSTMs (Long Short-Term Memory networks) came next — built to fix this. They added gating mechanisms: ways to selectively remember and forget different pieces of information. A real improvement.

But still sequential. Still word by word. Still slow to train, because each step depended on the one before.

For short sentences, fine. For long documents or anything where early context mattered at the end, it broke down.

The Attention Mechanism

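In 2017, a paper called "Attention Is All You Need" took a different approach entirely. Attention had already been bolted onto RNNs for a few years. This paper dropped the recurrence and built the whole model around attention alone.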

Instead of reading word by word, the model looks at the entire sequence at once. Every word, simultaneously.

But that created a new problem: if you're looking at everything at once, how do you know which words actually matter to each other?

That's what attention solves.

For every word in a sentence, attention calculates a score against every other word. How much should "it" attend to "trophy"? To "suitcase"? To "fit"?

These scores, once normalised into attention weights, tell the model where to focus.

In our example:
"it" → trophy: high weight
"it" → suitcase: low weight
"it" → didn't: near zero

The model isn't told which word "it" refers to. It learns the relationship from patterns in the training data. Trained on enough examples, it tends to put a pronoun's heaviest weights on the noun it refers to.
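One detail worth making concrete is how raw scores turn into weights. In the Transformer paper that step is a softmax, which squashes any list of scores into positive numbers that add up to 1. Here is a minimal sketch in plain Python. The three raw scores are invented for illustration, not taken from a real model.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for how strongly "it" matches each word.
words = ["trophy", "suitcase", "didn't"]
raw_scores = [4.0, 1.5, -2.0]

weights = softmax(raw_scores)
for word, weight in zip(words, weights):
    print(f'"it" -> {word}: {weight:.3f}')
```

The highest score takes most of the weight, the middle one gets a small share, and the lowest ends up close to zero, which is the high / low / near-zero pattern from the example.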

This is self-attention — every word attending to every other word in the same sequence, all at once.
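To see every word attending to every other word, here is a stripped-down sketch of self-attention. Two assumptions to flag: the 2-number vectors are hand-picked so that "it" sits close to "trophy", and each vector is used directly as query, key, and value. A real Transformer first passes each word vector through learned projection matrices, which this sketch skips.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product: larger when two vectors point the same way."""
    return sum(x * y for x, y in zip(a, b))

def self_attention(vectors):
    """Simplified self-attention over a list of equal-length vectors.

    Returns one row of weights and one blended output vector per word.
    """
    scale = math.sqrt(len(vectors[0]))  # scaling used in scaled dot-product attention
    weight_rows, outputs = [], []
    for query in vectors:
        # Score this word against every word in the sequence, itself included.
        scores = [dot(query, key) / scale for key in vectors]
        weights = softmax(scores)
        # Weighted sum: blend all vectors according to the weights.
        blended = [
            sum(w * vec[i] for w, vec in zip(weights, vectors))
            for i in range(len(query))
        ]
        weight_rows.append(weights)
        outputs.append(blended)
    return weight_rows, outputs

# Hypothetical hand-picked vectors, not learned embeddings.
tokens = ["trophy", "suitcase", "it"]
vectors = [[1.0, 0.2], [0.1, 1.0], [0.9, 0.3]]

weight_rows, outputs = self_attention(vectors)
for token, row in zip(tokens, weight_rows):
    print(token, [round(w, 3) for w in row])
```

Each printed row is one word's view of the whole sequence. With these vectors, the row for "it" puts more weight on "trophy" than on "suitcase", but only because the numbers were chosen that way. In a trained model that preference comes from the data.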

Why This Was a Breakthrough

Two things changed when attention replaced sequential processing.

Long-range dependencies. Because attention calculates relationships across the entire sequence simultaneously, it doesn't matter how far apart two words are, as long as both fit in the text the model sees at once. A word at the start of a paragraph can directly influence a word at the end. No fading. No forgetting.

Parallelism. Because no word has to wait for the one before it, the attention calculation for a whole sequence runs in parallel on modern hardware such as GPUs. This is why LLMs can be trained at the scale they are. You couldn't realistically have trained GPT-4 on an RNN. The sequential constraint would have made it impossibly slow.
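The independence behind that parallelism is easy to check in miniature. In the sketch below, `attend` is a hypothetical helper that computes the attention output for one word from the input vectors alone, so the words can be processed in a loop or handed to a worker pool and the results match. Python threads won't make this faster. Real systems get their speed from running the same maths as large matrix multiplications on GPUs. The point is only that no word has to wait for another.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def attend(query, vectors):
    """Attention output for one word; depends only on the input vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in vectors]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [
        sum(w * vec[i] for w, vec in zip(weights, vectors))
        for i in range(len(query))
    ]

# Hypothetical hand-picked vectors standing in for four words.
vectors = [[1.0, 0.2], [0.1, 1.0], [0.9, 0.3], [0.4, 0.4]]

# One word after another.
in_order = [attend(query, vectors) for query in vectors]

# All words handed to a worker pool at once; map() keeps the input order.
with ThreadPoolExecutor() as pool:
    pooled = list(pool.map(lambda query: attend(query, vectors), vectors))

print(pooled == in_order)
```

An RNN can't be split up this way, because step 4 needs the memory produced by step 3.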

What "Attention Weight" Actually Means

Imagine you're reading a meeting transcript and someone asks: "What did Sarah say about the deadline?"

You don't re-read from the beginning word by word. Your eyes jump to the parts that mention Sarah, or deadlines, or both. You weigh some sections as highly relevant and skim the rest.

Attention does this computationally.

For every word being processed, the model scores every word in the sequence and turns those scores into weights that add up to 1. The result is a weighted sum: the model takes in everything, but pays different amounts of attention to different parts.

High weight = this word matters for understanding the current word.
Low weight = mostly irrelevant here.

Nobody programs these weights. They emerge from training.

The Takeaway

Attention is not an optimisation. It's a replacement.

RNNs read like a typewriter — one word at a time, losing context as they go.

Attention reads like a human — the whole thing at once, relationships calculated simultaneously, nothing forgotten.

That shift is the reason modern language models exist at the scale they do.

Day 08: the full architecture that wraps attention into a complete system — the Transformer.

Day 07 of 100 — AI Foundations | Change of Basis — Reframe the familAIr. See the invisible.
