Blog
LLM
Cost Optimization
AI Engineering
Prompt Caching

Cost control for LLM apps: tokens, caching, and model choice

A team we talked to was paying to resend the same 6,000-token system prompt on every single request - the model re-read their entire policy manual thousands of times a day. Their bill was mostly repetition, not intelligence. LLM cost is not a mystery; it is a small number of habits. Here are the ones that move the bill.

June 27, 2026
·7 min read·Yeda AI Team

A team we talked to was resending the same 6,000-token system prompt on every request. The model re-read their entire policy manual thousands of times a day, and they paid full price for it each time. Their bill was mostly repetition, not intelligence. That is the useful thing about LLM cost: it is rarely a mystery and almost never about being "too popular." It comes down to a handful of habits, and once you can read the bill, each one has an obvious fix.

Read the token bill first

Every LLM charge decomposes into three numbers: input tokens (everything you send - system prompt, context, history, the user's message), output tokens (everything the model generates), and how many times you make the call. Output tokens typically cost several times more per token than input tokens. Multiply those out across a day of real traffic and the shape of your bill appears - and it is almost always lopsided, with one of the three dwarfing the others. You cannot cut a cost you have not located, so start by finding which number is carrying the bill.

Once you can see the split, the fixes stop being guesswork. A bill dominated by input tokens is a context problem - you are sending too much, too often. One dominated by output tokens is a verbosity or format problem. One dominated by call count is an architecture problem - you are making trips you could batch or avoid. Each points at a different lever below.

The levers that actually move the bill

Cache the stable prefix

Your system prompt, instructions, and few-shot examples are usually identical on every request. Prompt caching lets the provider reuse the work of reading that prefix instead of charging full price to re-read it each time. When the fixed part is large and the variable part is small, this is the single biggest lever - often a majority of the bill.

Right-size the model per task

Not every call needs your most capable model. Classification, extraction, and routing often run just as well on a smaller, cheaper model at a fraction of the per-token cost. Reserve the frontier model for the steps that actually need reasoning, and route the rest down. A two-tier setup can cut spend hard with no visible quality loss.

Stop sending tokens you do not use

Retrieval that dumps twenty documents into context when three would answer the question pays for seventeen every time. Trim the context to what the task needs, cap conversation history instead of resending the whole thread, and drop boilerplate the model never reads. Input tokens are tokens - a bloated prompt is a recurring charge.

Bound the output

Output tokens usually cost several times more than input tokens, so a model that rambles is expensive twice over: you pay the premium rate and you wait longer. Ask for the format you need - a value, a short list, structured JSON - and set a max length. "Be concise" is a cost control, not just a style note.

Caching is the one most teams miss

Of the four, prompt caching is the one that most often goes unused, because it is invisible until you look for it. If your prompts start with a large, identical block - and most production prompts do - you are very likely paying to re-read that block on every single call. Providers expose caching precisely for this: mark the stable prefix once, and repeat requests reuse it at a steep discount. The team resending their 6,000-token manual did not need a smaller manual. They needed to stop paying to re-read the same one thousands of times a day.

Spend where it buys quality

Cost control is not about buying the cheapest model and hoping. It is about matching spend to where it changes the answer. Use the capable, expensive model on the reasoning-heavy step that genuinely needs it, and route the mechanical steps - classification, extraction, formatting - to something smaller. The mistake in both directions is uniformity: paying frontier prices for tasks a small model nails, or forcing a weak model onto the one step that actually needed the strong one and paying for it in reruns and bad output.

None of this trades quality for savings, which is why it is worth doing before you negotiate rates or re-architect anything. Caching a prefix, trimming dead context, bounding output, and sending the right-sized model each do their job while making responses faster. Read the bill, find the lopsided number, and pull the lever that matches it - usually the bill drops well before quality does.