Training Data: The Model Is What It Eats
GitHub Copilot writes Python like a senior engineer.
Ask it to write COBOL, a language that still runs a large share of the world's core banking systems, and it sounds like it's guessing.
Same model. Same architecture. Same training process. Completely different quality.
The difference isn't in the neural network. It's in what that network was trained on.
GitHub hosts millions of public Python repositories. By comparison, almost none are in COBOL. COBOL developers mostly aren't pushing to GitHub. They're on mainframes in banks and government systems, on codebases that date back to the 1970s and will likely never see a public repo.
The model learned Python deeply. COBOL, barely. Not because anyone decided that. Because that's what the data looked like.
This is the story of training data, and why it matters more than almost any other factor in AI.
What the Model Actually Learns
In the last post, we established that a neural network doesn't query a database. It has weights — billions of small numbers, each adjusted during training until predictions get better.
But better compared against what?
The training data.
Training data is the collection of examples the model sees during training. For an LLM like GPT-4, that's text: a vast number of documents drawn from sources such as the web, books, academic papers, code repositories, and news archives. For an image model, it's images with labels. For a recommendation system, it's user behaviour logs.
The model reads this data, makes predictions, compares them against the known correct output (for a language model, the word that actually comes next in the text), and adjusts its weights to reduce the error. Repeat this billions of times, and the patterns crystallise inside the weights.
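To make that loop concrete, here is a minimal sketch with a "model" that has exactly one weight. It is a toy under stated assumptions, not how any production LLM is trained: the example pairs, the learning rate, and the step count are all made up for illustration.

```python
# Toy training loop: one weight, learned from (input, correct output) pairs.
# The data, learning rate, and step count are illustrative assumptions.

def train(examples, steps=200, learning_rate=0.05):
    weight = 0.0  # before training, the model "knows" nothing
    for _ in range(steps):
        for x, target in examples:
            prediction = weight * x              # make a prediction
            error = prediction - target          # compare with the known output
            weight -= learning_rate * error * x  # nudge the weight to reduce the error
    return weight

# Training data that follows the pattern y = 3x.
examples = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
weight = train(examples)

# Same code, different data (y = 2x): the weight ends up somewhere else.
other = train([(1.0, 2.0), (2.0, 4.0)])

print(round(weight, 3), round(other, 3))
```

The code never changes between the two runs. Only the examples do, and the final weight follows them.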
When training ends, the weights are frozen. Everything the model knows, at least until someone retrains or fine-tunes it, is locked inside those numbers.
The model is its training data.
Why Volume Matters
The more examples you feed a model, the more patterns it finds.
A language model that sees 100 examples of Python code learns some Python. One that sees 100 million learns syntax, idioms, conventions, how senior engineers write differently from juniors, and how styles differ between domains.
Volume gives the model coverage. Coverage means it handles edge cases, niche inputs, unusual contexts — because it has seen enough variation to generalise.
The COBOL problem is a volume problem. There simply isn't enough training data in that language for the model to learn it deeply.
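A rough way to picture coverage: count how many adjacent token pairs in a new line of code have already appeared in the training corpus. This is only a crude proxy, since real models generalise rather than memorise pairs, and the tiny corpus and the `coverage` helper below are hypothetical.

```python
# Crude coverage proxy: what fraction of a test line's adjacent token pairs
# (bigrams) already appeared somewhere in the training lines?

def bigrams(tokens):
    return set(zip(tokens, tokens[1:]))

def coverage(training_lines, test_line):
    seen = set()
    for line in training_lines:
        seen |= bigrams(line.split())
    test = bigrams(test_line.split())
    return len(test & seen) / len(test)

# Hypothetical, pre-tokenised snippets standing in for a training corpus.
corpus = [
    "for item in items :",
    "if item in cache :",
    "return cache [ item ]",
    "for key in cache :",
]
test_line = "for key in items :"

small = coverage(corpus[:1], test_line)  # trained on one line
large = coverage(corpus, test_line)      # trained on all four lines
print(small, large)
```

With one training line, half of the test line's pairs are unfamiliar. With four, every pair has been seen before, even though the test line itself never appears in the corpus.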
But volume alone isn't enough.
Why Quality Matters More
Here's the part most AI coverage skips.
Training on bad data at massive scale doesn't give you a better model. It gives you a confidently wrong model.
If your training data is full of factual errors, the model learns to produce factual errors fluently. If it's noisy, inconsistent, or poorly labelled, the model learns those patterns too. It doesn't know the data is bad. It finds the patterns that are there.
Garbage in. Garbage out. Except the output is eloquent, well-structured, confident garbage.
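Here is a deliberately crude sketch of that effect. The "model" simply memorises the most common answer it saw for each question. Real models fit statistical patterns rather than lookup tables, and the example data is invented, but the failure mode has the same shape.

```python
from collections import Counter, defaultdict

def fit(examples):
    """Memorise the most common answer seen for each question."""
    answers = defaultdict(Counter)
    for question, answer in examples:
        answers[question][answer] += 1
    return {q: counts.most_common(1)[0][0] for q, counts in answers.items()}

# Invented examples: three correct records, then five wrong ones added on top.
clean_data = [("capital of Australia", "Canberra")] * 3
noisy_data = clean_data + [("capital of Australia", "Sydney")] * 5

clean_answer = fit(clean_data)["capital of Australia"]
noisy_answer = fit(noisy_data)["capital of Australia"]
print(clean_answer, noisy_answer)
```

Nothing in `fit` can tell that the second dataset is worse. It reports the majority pattern either way, with the same confidence.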
This is why teams building frontier models spend enormous effort on data curation, not just collection. They deduplicate documents. They filter for quality — removing spam, low-effort content, and machine-generated noise. They balance datasets so no single domain dominates. They label examples carefully, because a wrong label at scale poisons the model's understanding of an entire category.
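Two of those steps, deduplication and quality filtering, can be sketched in a few lines. This is a simplified illustration, not any lab's actual pipeline: it only catches exact duplicates after normalising case and whitespace (production systems typically also look for near-duplicates), and the `min_words` threshold is an arbitrary assumption.

```python
import hashlib

def normalise(text):
    # Lowercase and collapse whitespace so trivial variants hash the same.
    return " ".join(text.lower().split())

def curate(documents, min_words=5):
    """Drop very short documents and exact duplicates (after normalisation)."""
    seen = set()
    kept = []
    for doc in documents:
        canonical = normalise(doc)
        if len(canonical.split()) < min_words:
            continue  # crude quality filter: too short to be useful
        digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # already kept an equivalent document
        seen.add(digest)
        kept.append(doc)
    return kept

# Hypothetical raw documents.
raw = [
    "Python lists are ordered, mutable sequences of objects.",
    "python lists are   ordered, mutable sequences of objects.",
    "click here!!!",
    "COBOL programs are divided into four divisions.",
]
clean = curate(raw)
print(len(raw), "->", len(clean))
```

The second document is dropped as a duplicate of the first, and the third fails the length filter, so two of the four survive.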
High-quality data at a smaller scale often beats low-quality data at massive scale.
The Pipeline Nobody Talks About
The teams building GPT-4, Claude, and Gemini aren't just running training scripts.
They're ingesting petabytes of raw text. Deduplicating records across formats and sources. Writing quality filters. Flagging inconsistent labels. Handling missing data. Versioning datasets so experiments stay reproducible.
That's a data pipeline.
A very large, very consequential data pipeline — built on the same first principles any data professional has applied a hundred times.
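Dataset versioning is a good example of how ordinary those principles are. One common approach, sketched here with a hypothetical `dataset_fingerprint` helper, is to derive a version ID from the content itself, so any change to the data produces a new ID. Truncating the hash to 12 characters is an arbitrary choice for readability.

```python
import hashlib

def dataset_fingerprint(records):
    """Return a short content-based ID that ignores record order."""
    digest = hashlib.sha256()
    for record in sorted(records):
        encoded = record.encode("utf-8")
        # Length prefix keeps ["ab", "c"] distinct from ["a", "bc"].
        digest.update(len(encoded).to_bytes(8, "big"))
        digest.update(encoded)
    return digest.hexdigest()[:12]

v1 = dataset_fingerprint(["doc a", "doc b"])
same = dataset_fingerprint(["doc b", "doc a"])       # same records, new order
v2 = dataset_fingerprint(["doc a", "doc b", "doc c"])  # one record added

print(v1 == same, v1 == v2)
```

Logging the fingerprint next to each training run makes it possible to tell later whether two experiments actually saw the same data.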
The AI gets the headlines. The training data pipeline is why it works.
The Takeaway
When a model gets something wrong — when Copilot writes broken COBOL, when an LLM makes up a citation, when a translation is awkward — the first question isn't "is the architecture broken?"
The first question is: what did this model train on?
The ceiling is always set by the data. More is better. High quality beats more. The ideal is both — high volume and high quality — which is why building training datasets is one of the most expensive, careful processes in AI.
Day 06 of 100 — AI Foundations | Change of Basis — Reframe the familAIr. See the invisible.
