Token Optimization for LLMs

Every token you send to an LLM and every token it sends back has a price and a time cost. At prototype scale nobody notices. At production scale — thousands of calls a day, or an agent looping dozens of times per task — token spend becomes the line item that decides whether the whole thing is viable.

The good news: most workloads are carrying far more tokens than they need. This post is the practical checklist for cutting cost and latency without cutting answer quality.

Why tokens are the unit that matters

Two facts shape everything below.

Tokens are billed asymmetrically. Input (the prompt you send) is cheap; output (what the model generates) is typically several times more expensive per token. That single asymmetry means the biggest wins often come from shrinking output, not just input.

Output tokens dominate latency. The model generates one token at a time, so a long response is a slow response almost regardless of prompt size. Trimming what the model writes speeds up the user experience more than trimming what it reads.

Keep both in mind and the levers below sort themselves by impact.

The levers

1. Prompt caching — the highest-leverage win

If you send the same large prefix on many requests — a system prompt, a document, a set of few-shot examples — cache it. With caching, that shared prefix is processed once and reused: cache reads cost a small fraction of full input price (roughly a tenth), while a cache write carries a modest premium. Two or three requests over the same prefix and you're already ahead.

The one rule that makes caching work: it's a prefix match. Any byte change anywhere in the prefix invalidates everything after it. So keep stable content first and volatile content (timestamps, per-request IDs, the varying question) last. A datetime.now() interpolated into your system prompt silently defeats the entire cache. Verify it's working by checking the cache-read token count on your responses — if it's zero across identical-prefix requests, something is invalidating the prefix.

2. Trim and prune the context

The cheapest token is the one you never send. Audit what's actually in your prompt:

Cut verbose system prompts down to what changes behaviour.
Drop few-shot examples the model no longer needs once you've moved to a more capable model.
On long conversations, summarise old turns and drop the raw transcript rather than carrying the full history forward.

This is context engineering viewed through a cost lens — less context is usually also better answers, so this lever pays twice.

3. Structured output to cut generation

Because output is the expensive, slow half, shaping it is one of the best levers you have. Ask for exactly the fields you need in a defined schema rather than prose — a JSON object with three keys instead of three paragraphs explaining them. Structured output both guarantees a parseable response and stops the model from padding the answer with narration you'll throw away.

4. Model routing

Don't send every request to your most capable (and most expensive) model. Route by difficulty: a small, fast model handles the easy, high-volume calls — classification, extraction, simple lookups — and you escalate to the frontier model only when the task genuinely needs it. For a workload with a long tail of trivial requests, routing can cut the bill dramatically with no visible quality drop.

5. Batching

If your work isn't latency-sensitive — overnight processing, bulk classification, offline enrichment — send it through a batch API. Batched requests commonly run at a large discount (often around half price) in exchange for asynchronous, best-effort turnaround. For anything you don't need answered right now, this is free money.

6. Control reasoning depth

Modern models let you dial how hard they think. Higher reasoning effort buys quality on genuinely hard problems and burns tokens on easy ones. Match the setting to the task: reserve deep reasoning for the complex cases and keep routine calls light. Overthinking a simple request is pure waste — in both dollars and seconds.

7. Measure with real token counts

You can't optimise what you don't measure, and eyeballing string length is not measuring. Use the provider's token-counting endpoint to get accurate counts for your actual prompts — do not estimate with a generic tokenizer from a different model family, which can be off by 15–30% or more, especially on code and non-English text. Baseline your real prompts, then track the number as you cut.

A practical checklist

Work through these in order — they're roughly sorted by effort-to-payoff:

02
Cache any large, repeated prefix. Verify cache reads are landing.
04
Trim the system prompt and drop unused few-shot examples.
06
Compact long conversation history instead of carrying it whole.
08
Constrain output to a schema; stop paying for narration.
10
Route easy calls to a cheaper model.
12
Batch anything that isn't real-time.
14
Tune reasoning effort to the task.
16
Measure with real token counts, before and after.

None of these trade quality for savings when applied with judgement — most of them (less noise in context, tighter output, right-sized reasoning) make responses better while making them cheaper.

Token efficiency starts before the model call — with clean, minimal input. If your pipeline feeds LLMs data scraped from the web, extracting only the fields you need as structured rows keeps that input lean from the start. CrawlPilot does exactly that; the getting started guide shows the flow. To understand where tokens come from in the first place, see how LLMs work, and for the discipline behind lever #2, context engineering.

Token Optimization: Cut LLM Cost and Latency Without Losing Quality