How Large Language Models Work: Tokens, Transformers, and Why Training Costs $100M

When someone says "ChatGPT has 175 billion parameters" or "Claude was trained on trillions of tokens," those numbers sound impressive and mean almost nothing. What is a parameter? What is a token? Why does training cost $100 million? Why does every answer you get cost someone a fraction of a cent?

This post explains how a large language model actually works — from the moment you type a word to the moment it responds — using analogies, not equations.

Start Here: What Is an LLM Actually Doing?

The simplest honest description of an LLM: it is a next-word predictor, trained on a large share of the text humans have published online and in print.

That's it. You give it text. It predicts what word (actually, what token — more on that shortly) should come next. Then it appends that word and predicts the next one. And the next. Until it stops.
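The predict, append, repeat loop can be sketched in a few lines of Python. This is only an illustration: the lookup table `NEXT`, the `<end>` marker, and the function names are made up here and stand in for a trained model.

```python
# Hypothetical lookup table standing in for a trained model's predictions.
NEXT = {"The": " cat", " cat": " sat", " sat": " down", " down": "<end>"}


def toy_predict(tokens):
    """Pretend model: looks only at the last token to choose the next one."""
    return NEXT.get(tokens[-1], "<end>")


def generate(prompt_tokens, predict_next, max_tokens=20, stop="<end>"):
    """Predict a token, append it, and repeat until the model stops."""
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        next_token = predict_next(tokens)
        if next_token == stop:
            break
        tokens.append(next_token)
    return tokens
```

Calling `generate(["The"], toy_predict)` walks the table one token at a time. A real model replaces `toy_predict` with the full pipeline described in the steps that follow.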

The magic isn't the mechanism — next-word prediction is a decades-old idea. The magic is that when you train a next-word predictor on enough text, something unexpected happens: to predict words well, the model is forced to develop a deep understanding of language, facts, logic, and reasoning. You don't teach it any of that. It emerges from the prediction task.

This is the counterintuitive thing about LLMs: nobody sat down and wrote rules for how Claude or GPT-4 should reason. The model learned to reason because reasoning well makes you better at predicting what comes next.

Step 1: Tokenisation — Breaking Text Into Chunks

Before any processing happens, your text is broken into pieces called tokens.

A token is not the same as a word. It's closer to a syllable or a common chunk of characters. The model has a fixed vocabulary of tokens, typically 50,000 to 100,000 of them (some recent models use 200,000 or more), and any text you write can be expressed as a sequence of these tokens.

"Hello, how are you?"

Tokens: ["Hello", ",", " how", " are", " you", "?"]
          1       2      3      4      5      6

Simple words are usually one token. Rare words get split:

"Unbelievable" → ["Un", "believ", "able"]   (3 tokens)
"cat"          → ["cat"]                     (1 token)
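To make the splitting concrete, here is a minimal sketch that assumes a greedy longest-match rule over a tiny hand-written vocabulary. Real tokenisers learn their vocabulary from data (byte-pair encoding is a common method) and may split these words differently. `TOY_VOCAB` and `tokenize` are illustrative names, not any library's API.

```python
# Toy vocabulary, invented for illustration; real vocabularies hold tens of
# thousands of entries learned from data.
TOY_VOCAB = {"Un", "believ", "able", "cat", "Hello", ",", " how", " are", " you", "?"}


def tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match split of text into vocabulary pieces.

    Falls back to single characters when nothing in the vocabulary matches,
    so every input can still be expressed as a token sequence.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until one is in the vocab.
        for end in range(len(text), i, -1):
            piece = text[i:end]
            if piece in vocab or end == i + 1:
                tokens.append(piece)
                i = end
                break
    return tokens
```

With this vocabulary, `tokenize("Unbelievable")` gives the three pieces shown above, while a word with no matching entries falls back to one token per character.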

Why tokens instead of characters or words?

Characters (a, b, c) are too granular — the model would need to learn to spell before it could learn to mean anything
Whole words create a vocabulary too large — there are millions of words, including names, slang, technical terms, and words in every language
Tokens hit the sweet spot: a vocabulary large enough to cover all text, small enough to be manageable

Why does this matter? Because token count is how LLMs measure everything — training cost, inference cost, context window size. When a model says it has a "128k context window," it means it can hold 128,000 tokens in memory at once — roughly 100,000 words.

Step 2: Embeddings — Turning Tokens Into Meaning

Once your text is tokenised, each token gets converted into an embedding: a list of numbers.

Not arbitrary numbers — numbers that encode meaning. Tokens with similar meanings end up with similar numbers.

Think of it like a coordinate system where meaning is geography:

"king"   →  [0.8, 0.2, 0.9, ...]
"queen"  →  [0.8, 0.1, 0.9, ...]   ← nearby: similar meaning
"car"    →  [0.1, 0.9, 0.2, ...]   ← far away: different meaning

These lists are long: 768 numbers per token in a small model like the original GPT-2, 4,096 in many mid-sized models, and 12,288 in GPT-3. That's the model's internal representation of what each token means in abstract space.

The famous example: in these number-spaces, king - man + woman ≈ queen. The geometry of meaning is encoded in the arithmetic of the numbers.
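The arithmetic can be tried on hand-made vectors. The sketch below assumes three invented axes (royalty, maleness, femaleness) and made-up values. Learned embeddings have no such labelled axes, and the analogy only holds approximately in practice.

```python
import math


def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, near 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Hand-made 3-number embeddings (hypothetical axes: royalty, maleness, femaleness).
EMB = {
    "king": [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "car": [0.1, 0.5, 0.5],
}


def analogy(a, b, c, emb=EMB):
    """Return the word whose vector is closest to emb[a] - emb[b] + emb[c]."""
    target = [x - y + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    return max(emb, key=lambda word: cosine(emb[word], target))
```

Here `analogy("king", "man", "woman")` lands on "queen" because subtracting "man" removes the maleness component and adding "woman" puts femaleness in its place.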

Step 3: Attention — How Words Look at Each Other

Here's the problem: the meaning of a word depends entirely on its context.

"The bank was steep" — bank means a riverbank. "The bank was closed" — bank means a financial institution.

Same word, completely different meaning. A model that looks at each token in isolation can't resolve this. It needs to look at the whole sentence and figure out which tokens are relevant to understanding each other.

This is what attention does.

For every token in your input, the attention mechanism asks: "which other tokens in this sequence are most relevant to understanding me?" It then weights each other token by relevance and incorporates that context into its understanding.

Sentence: "The animal didn't cross the street because it was too tired"

What does "it" refer to?
    "it" looks at all other tokens and assigns weights:
    - "animal"  → high weight (it = animal)
    - "street"  → low weight
    - "tired"   → medium weight (helps confirm it's the animal, not the street)

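Attention is the core innovation that made modern LLMs possible. Before attention, models read text one token at a time, left to right, like a person reading a sentence for the first time, and had to squeeze everything seen so far into a single running summary. Attention lets every token look directly at every other token at once, like rereading the whole paragraph with full context before answering. (When generating text, GPT-style models restrict this to the tokens that came before, since later ones don't exist yet.)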

Multi-head attention means the model runs this process many times in parallel, each time looking for different kinds of relationships — one head might learn to track subject-verb agreement, another might track pronoun references, another might track topic continuity.
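One common way to compute these relevance weights is scaled dot-product attention: score each token by the dot product of a "query" vector with that token's "key" vector, turn the scores into weights with softmax, and blend the tokens' "value" vectors accordingly. The sketch below assumes that formulation for a single head and uses hand-picked two-number vectors. In a real model the queries, keys, and values come from learned matrices.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """One token's attention: weight every value by query-key relevance."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # Blend the value vectors using the attention weights.
    dim = len(values[0])
    mixed = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, mixed


# Hand-picked 2-number vectors, for illustration only.
tokens = ["animal", "street", "tired"]
keys = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.2]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query_it = [4.0, 0.0]  # the query vector for "it"

weights, context = attend(query_it, keys, values)
```

With these made-up vectors, "animal" receives the largest weight, "tired" a middling one, and "street" the smallest, mirroring the example above. Multi-head attention would run `attend` several times with different learned projections and combine the results.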

Step 4: Position Encoding — Teaching the Model About Order

There's a problem with attention: it looks at all tokens simultaneously, with no sense of sequence. To attention, "dog bites man" and "man bites dog" look identical — same tokens, same attention patterns, completely different meaning.

Position encoding fixes this by adding a position signal to each token's embedding: a second list of numbers, the same length as the embedding, that indicates where the token sits in the sequence.

Token 1: "The"    embedding + position signal for slot 1
Token 2: "cat"    embedding + position signal for slot 2
Token 3: "sat"    embedding + position signal for slot 3

The position signal is carefully designed so that the model can use it to infer both absolute position ("this is the 5th word") and relative position ("these two tokens are 3 apart"). Without it, the model would treat every sentence as a bag of words with no order.
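One well-known design is the sinusoidal encoding from the original transformer paper, sketched below. This is an assumption for illustration: the post does not say which scheme it has in mind, and many current models use other approaches (learned or rotary position encodings, for example). The function names are made up.

```python
import math


def positional_encoding(position, dim):
    """Sinusoidal position signal: a vector of `dim` numbers for one slot."""
    signal = []
    for i in range(dim):
        # Each pair of dimensions shares a wavelength; even slots use sine,
        # odd slots use cosine.
        rate = 1.0 / (10000 ** ((i - i % 2) / dim))
        angle = position * rate
        signal.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return signal


def add_position(embedding, position):
    """Add the position signal to a token embedding, element by element."""
    signal = positional_encoding(position, len(embedding))
    return [e + s for e, s in zip(embedding, signal)]
```

Because the signal differs at every slot, the same token embedding produces a different vector at position 1 than at position 2, which is what lets the model tell "dog bites man" from "man bites dog".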

Step 5: The Transformer — Putting It All Together

A transformer is the architecture that combines all of this. One transformer layer looks like:

Input tokens
    │
    ▼
Attention: each token looks at all others, gathers context
    │
    ▼
Feed-forward network: each token processes its updated representation
    │
    ▼
Output: richer, more context-aware token representations

A real LLM doesn't run one layer; it stacks many of them. The smallest GPT-2 (2019) had 12 layers and the largest had 48. GPT-3 had 96. Layer counts for Claude and GPT-4 have not been published. Each layer refines the model's understanding a little more.

By the final layer, the token representations have been through dozens of rounds of contextual refinement. The model then looks at the last token's final representation and outputs a probability distribution over all ~50,000 tokens in its vocabulary: "there's a 40% chance the next token is 'the', 15% chance it's 'a', 5% chance it's 'this'..."

It picks from that distribution (usually the most likely token, sometimes with a bit of randomness for variety), appends it to the sequence, and repeats.
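The final step (scores in, one token out) can be sketched as follows. It assumes the usual softmax with a "temperature" knob, where 0 means always take the most likely token and higher values add more randomness. The function names and example scores are illustrative.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_next(vocab, logits, temperature=1.0, rng=random):
    """Greedy pick when temperature is 0, otherwise sample from the distribution."""
    if temperature == 0:
        return vocab[logits.index(max(logits))]
    probs = softmax(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]
```

For example, `pick_next(["the", "a", "this"], [2.0, 1.0, 0.0], temperature=0)` always returns "the", while `temperature=1.0` will sometimes return one of the other two.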

What Are "1 Billion Parameters"?

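Every connection between neurons, and every entry in the matrices that compute attention and the feed-forward step, is a parameter: a number that the model adjusts during training. (The attention weights from Step 3 are not parameters themselves; they are recomputed for each input from these learned matrices.)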

Parameters are the model's memory of everything it has learned. More parameters = more capacity to remember patterns, nuances, and knowledge from the training data.

To make this concrete:

A single parameter is like one synapse in a brain — a connection strength between two neurons
1 billion parameters is roughly the number of synapses in a small animal's brain
GPT-3 at 175 billion parameters is in the territory of a small mammal
GPT-4's size has not been disclosed; unconfirmed outside estimates put it at over 1 trillion parameters

But raw parameter count isn't everything. A well-trained 7-billion parameter model can outperform a poorly-trained 70-billion parameter model. How you train matters as much as how many parameters you have.
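Where do the billions come from? A common rule of thumb for GPT-style models is about 12 × (width)² parameters per layer, plus one embedding row per vocabulary entry. The sketch below applies that approximation; it ignores biases, layer norms, and position embeddings, so treat the result as a ballpark. The function name is made up here.

```python
def estimate_parameters(n_layers, d_model, vocab_size):
    """Rough parameter count for a GPT-style transformer.

    Per layer: about 4 * d_model^2 weights in attention (query, key, value,
    output projections) and about 8 * d_model^2 in a feed-forward network
    whose hidden size is 4 * d_model. Biases and layer norms are ignored.
    """
    per_layer = 12 * d_model ** 2
    embeddings = vocab_size * d_model
    return n_layers * per_layer + embeddings


# Published GPT-3 configuration: 96 layers, width 12,288, vocabulary 50,257.
gpt3_estimate = estimate_parameters(96, 12288, 50257)
```

Plugging in GPT-3's published shape gives roughly 175 billion, which is where the headline number comes from.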

Training: Where the $100 Million Goes

Training an LLM means finding the right values for all those billions of parameters.

Here's how it works:

1. Feed the model a sentence with the last word removed: "The cat sat on the ___"
2. Let it predict the missing word. First time: random guess, almost certainly wrong.
3. Tell it the correct answer ("mat") and calculate how wrong it was (the "loss").
4. Adjust every parameter slightly in the direction that would have made the prediction more correct. Working out that direction for each parameter is called backpropagation; applying the nudge is gradient descent.
5. Repeat this trillions of times across hundreds of billions of sentences

That's it. The entire training process is: guess → measure wrongness → nudge parameters → repeat.
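The guess, measure, nudge, repeat loop can be shown on a model small enough to read in full. This sketch assumes a three-token vocabulary and a "model" that is just one adjustable score per token for a single fixed prompt. Real training does the same thing across billions of parameters, with backpropagation computing the gradients that are written by hand here.

```python
import math

VOCAB = ["mat", "dog", "moon"]  # toy vocabulary, for illustration


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(target="mat", steps=200, learning_rate=0.5):
    """Learn scores for the blank in "The cat sat on the ___"."""
    # The model's parameters: one score per token, starting with no preference.
    logits = [0.0] * len(VOCAB)
    target_index = VOCAB.index(target)
    losses = []
    for _ in range(steps):
        probs = softmax(logits)  # guess
        losses.append(-math.log(probs[target_index]))  # measure wrongness
        for i in range(len(logits)):  # nudge each parameter
            gradient = probs[i] - (1.0 if i == target_index else 0.0)
            logits[i] -= learning_rate * gradient
    return logits, losses
```

After training, the probability assigned to "mat" is close to 1 and the recorded loss has fallen steadily from its starting value.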

The reason this costs tens to hundreds of millions of dollars:

Data: Training requires 1–10 trillion tokens of text: a large slice of the public internet, plus books, code, scientific papers, and more
Compute: Each training step requires multiplying enormous matrices of numbers together, which requires thousands of specialised chips (GPUs or TPUs) running for weeks or months
Electricity: Thousands of GPUs running for months draw enormous power. A single training run for a frontier model consumes as much electricity as a small town uses in a year
Iteration: You don't get it right on the first try. Training requires many experiments, failed runs, and architectural changes before you get a model that works well

GPT-4's training run reportedly cost over $100 million. Gemini Ultra and Claude 3 Opus are widely assumed to be in a similar range, though neither figure has been officially disclosed.
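A back-of-envelope version of the compute bill uses a common rule of thumb: training takes about 6 floating-point operations per parameter per training token. Every input in the sketch below is an illustrative assumption (model size, token count, chip speed, utilisation, and rental price), not a reported figure for any real model.

```python
def training_cost_usd(n_params, n_tokens, flops_per_gpu_per_sec,
                      utilization, usd_per_gpu_hour):
    """Back-of-envelope training cost from the 6 * N * D compute rule."""
    total_flops = 6 * n_params * n_tokens
    effective_rate = flops_per_gpu_per_sec * utilization
    gpu_hours = total_flops / effective_rate / 3600
    return gpu_hours * usd_per_gpu_hour, gpu_hours


# All inputs below are illustrative assumptions, not reported figures.
cost, gpu_hours = training_cost_usd(
    n_params=1e12,               # 1 trillion parameters
    n_tokens=10e12,              # 10 trillion training tokens
    flops_per_gpu_per_sec=1e15,  # hypothetical accelerator peak speed
    utilization=0.4,             # fraction of peak actually achieved
    usd_per_gpu_hour=2.0,        # hypothetical rental price
)
```

With these assumptions the single run comes to roughly 42 million GPU-hours and about $83 million, before counting the failed runs and experiments mentioned above.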

Fine-Tuning: Making a General Model Useful

A model trained purely on next-word prediction on raw internet text is smart but weird. It will complete sentences, not answer questions. It might complete "How do I make a bomb?" by writing the next likely sentence — because the internet contains that content.

Fine-tuning is the second stage of training that shapes the model's behaviour:

1. Instruction fine-tuning: Show the model thousands of examples of (question, good answer) pairs. Now it learns to respond to questions rather than just complete text.
2. RLHF (Reinforcement Learning from Human Feedback): Human raters score the model's outputs for helpfulness, accuracy, and safety. A second model (a "reward model") learns to predict what humans prefer. The main model is then trained to generate outputs the reward model scores highly.
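One common way to train a reward model is on pairs of answers where a human picked the better one, using a pairwise loss that shrinks as the preferred answer's score pulls ahead. The sketch below assumes that setup; the post does not specify how raters' scores are used, and `preference_loss` is an illustrative name.

```python
import math


def preference_loss(score_chosen, score_rejected):
    """Pairwise loss: small when the preferred answer gets the higher score."""
    margin = score_chosen - score_rejected
    # -log(sigmoid(margin)), written in a numerically stable form
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)
```

The loss is near zero when the reward model already ranks the human-preferred answer well above the rejected one, and grows when it ranks them the wrong way round. Training nudges the reward model's parameters to reduce it.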

This is why ChatGPT answers questions helpfully rather than just predicting the next word. The base model learned language from the internet; fine-tuning shaped it into an assistant.

Fine-tuning is much cheaper than pre-training — sometimes 1,000× cheaper — but it still requires significant compute for frontier models.

Inference: Why Every Answer Costs Money

Training happens once (or a few times per model generation). Inference is what happens every time you send a message.

Each time you ask a question:

1. Your text is tokenised
2. It passes through all transformer layers, potentially a hundred or more of them
3. For each output token, the process repeats on your input plus all previously generated tokens

For a model with hundreds of billions of parameters, generating each token requires multiplying vectors through those billions of parameters — on specialised hardware. The model must hold all those parameters in memory (GPU RAM) for the duration of the request.
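The memory requirement follows from simple arithmetic: parameters × bytes per parameter. The sketch below assumes 16-bit weights (2 bytes each) and a hypothetical 80 GB accelerator. It counts only the weights and ignores the extra memory needed for activations and cached context.

```python
import math


def weights_memory_gb(n_params, bytes_per_param=2):
    """Memory needed just to hold the weights (2 bytes each for 16-bit floats)."""
    return n_params * bytes_per_param / 1e9


def gpus_needed(n_params, gpu_memory_gb, bytes_per_param=2):
    """Minimum GPU count to fit the weights, ignoring activations and caches."""
    total = weights_memory_gb(n_params, bytes_per_param)
    return math.ceil(total / gpu_memory_gb)
```

A 175-billion-parameter model needs about 350 GB for its weights alone under these assumptions, so it has to be spread across at least five 80 GB devices before it can answer a single request.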

At ChatGPT's peak usage (~100 million requests per day), the inference cost is estimated at $700,000 per day — roughly $250 million per year, just to serve existing users. That's before a single dollar of profit.

This is why:

API access to frontier models is priced per token (you pay per word, essentially)
Smaller models (7B, 13B parameters) are far cheaper to run and increasingly competitive
Companies like Anthropic and OpenAI spend heavily on compute infrastructure, not just on research
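Per-token pricing makes the cost of a single request easy to estimate. The sketch below assumes input and output tokens are priced separately per million tokens, which is a common arrangement. The prices used are hypothetical, not any provider's actual rates.

```python
def request_cost_usd(input_tokens, output_tokens,
                     usd_per_million_input, usd_per_million_output):
    """Cost of one API call when input and output tokens are priced separately."""
    input_cost = input_tokens / 1_000_000 * usd_per_million_input
    output_cost = output_tokens / 1_000_000 * usd_per_million_output
    return input_cost + output_cost


# Hypothetical prices, for illustration only; check a provider's price list.
cost = request_cost_usd(
    input_tokens=1_000, output_tokens=500,
    usd_per_million_input=3.0, usd_per_million_output=15.0,
)
```

At these made-up rates, a 1,000-token question with a 500-token answer costs about one cent, which adds up quickly across millions of requests.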

The Whole Picture

You type: "Explain gravity simply"

    │
    ▼ Tokenisation
["Explain", " gravity", " simply"]  →  token IDs [4321, 8820, 7102]

    │
    ▼ Embeddings
Each token ID → list of 4096 numbers

    │
    ▼ + Position encoding
Numbers adjusted to include position in sequence

    │
    ▼ Through 96 transformer layers:
    │   Attention: each token looks at all others
    │   Feed-forward: refine representations
    │   (repeat × 96)

    │
    ▼ Final layer output
Probability over 50,000 tokens → pick "Gravity"

    │
    ▼ Append "Gravity", repeat for next token → "is"
    → "is" → "a" → "force" → "that" → "pulls" → ...

Output: "Gravity is a force that pulls objects toward each other..."

Each round of this loop takes a fraction of a second. Every token you see appear is one round of it, which is why long answers stream in piece by piece.

What It Can't Do (And Why)

Understanding the mechanism explains the limitations:

It has no memory between conversations. Each conversation starts fresh. The model has no persistent memory of you — only what's in the current context window.

It can't look things up. Unless given a tool to do so, it only knows what was in its training data. Its "knowledge" is frozen at the training cutoff.

It can confidently state wrong things. The model generates the most plausible next token — not the most true one. Plausibility and truth often align, but not always. This is why LLMs "hallucinate."

It reasons by pattern, not by logic. It doesn't run a proof. It pattern-matches against training data where problems like yours were solved. For novel problems with no similar training examples, it's more likely to fail.

Go Deeper: Andrej Karpathy's Explanations

Andrej Karpathy — former head of AI at Tesla and founding member of OpenAI — has produced the best plain-English explanations of how these systems work:

Let's build GPT: from scratch, in code, spelled out — builds a working GPT in 2 hours of video. No prior ML knowledge required. If you watch one thing, watch this.
Intro to Large Language Models — a 1-hour lecture covering everything in this post and more, with visuals.
Neural Networks: Zero to Hero playlist — the full series from backpropagation basics to building GPT-2.

He has the rare ability to explain deeply technical ideas without dumbing them down — the same target this post is aiming for.