There is a particular shape of AI project failure that we have started to recognise on first contact. It is not the spectacular failure — the demo that hallucinates a price, the chatbot that says something embarrassing to a regulator. Those make the news but they are not the common case. The common case is quieter. The pilot ships, the metrics look fine for a quarter, and then the finance team starts asking why the GPU bill is climbing while the value attributed to the system is not. By the time the team has answers, the budget has moved.
This is the cost curve most pilots ignore. It does not show up at the demo because the demo runs on twenty queries against a development model with a tight prompt. It shows up in month four, when the queries are five thousand per hour, the prompts have grown to handle edge cases, the team has switched to a more capable (and more expensive) model to fix a stubborn failure mode, and nobody is watching the bill because the project is "working."
Why the curve gets ignored
Three reasons, in sequence.
First — the demo is the wrong dataset. The team builds a pilot against the best version of the data: the happy paths, the queries the team thought of, the cases the model handles well. The cost on this dataset is artificially low because the queries are short and the model rarely needs the long-context expensive path. Real production traffic is the opposite — long tails, weird formats, prompts that grow.
Second — model selection happens too early. The team picks the model in week one based on capability, demos against it, and never re-runs the selection question. Six months later, smaller models that didn't exist when the project started would do the same work for a tenth of the cost, and the team has structural reasons not to switch.
Third — the eval harness is built late, if at all. The team measures "does it work" qualitatively and never builds the instrumentation that would let them compare candidate configurations on cost and quality together. Without that comparison, the optimisation conversation can't happen.
The cost shape, plotted
Imagine a graph. X-axis: months since launch. Y-axis: total monthly cost.
The naive expectation is a flat line — usage stabilises after a month, cost stabilises with it. The actual shape is rarely flat. It is a slow upward curve, and the slope is steep enough over a year that the original ROI projection becomes a fiction.
The slope has four sources, which we have seen across roughly a dozen post-mortems:
1. Prompt growth. Every edge case the team finds gets added to the system prompt or context window. Within six months the prompt is 4× its launch length. Cost scales with input tokens.
2. Model upgrade. A more expensive model is brought in to handle a specific failure mode (say, low-accuracy classifications on a hard segment). The model is good enough that nobody re-runs the analysis to see if a fine-tuned smaller model would do the same job for less.
3. Retry expansion. The system retries on bad outputs, sometimes against a different (more expensive) model. Retries get tuned upward to fix tail failures. Each retry adds at least one more full call's cost on that fraction of traffic, and more when the retry goes to the pricier model.
4. Usage growth. The feature works, more users use it, total cost rises with usage. This one is natural, but it compounds with the other three.
Together, these produce a 3–5× growth in monthly cost over a year against a fixed value per output. The unit economics quietly invert.
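The four sources multiply rather than add, which is why modest changes in each one produce a steep curve. A rough sketch of that arithmetic follows. Every figure in it (request volume, token counts, prices, retry rates) is hypothetical and chosen only to illustrate the shape, and `MonthlyLoad` and `monthly_cost` are names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class MonthlyLoad:
    """Hypothetical inputs; none of these figures come from a real bill."""
    requests: int            # requests per month
    input_tokens: int        # average prompt tokens per call
    output_tokens: int       # average completion tokens per call
    price_in_per_1k: float   # price per 1,000 input tokens
    price_out_per_1k: float  # price per 1,000 output tokens
    retry_rate: float        # average extra calls per request (0.05 = 5%)


def monthly_cost(load: MonthlyLoad) -> float:
    """Total cost = requests x calls per request x cost per call."""
    per_call = (load.input_tokens / 1000 * load.price_in_per_1k
                + load.output_tokens / 1000 * load.price_out_per_1k)
    calls = load.requests * (1 + load.retry_rate)
    return calls * per_call


launch = MonthlyLoad(100_000, 500, 200, 0.001, 0.002, 0.05)
# A year on (assumed): prompt 2x longer, model 1.5x pricier,
# retries up from 5% to 15%, usage up 1.5x.
year_one = MonthlyLoad(150_000, 1000, 200, 0.0015, 0.003, 0.15)

growth = monthly_cost(year_one) / monthly_cost(launch)
print(f"monthly cost grew {growth:.1f}x")  # about 3.8x with these inputs
```

No single factor here is larger than 2×, yet the product lands inside the 3–5× range. Dividing `monthly_cost` by the number of outputs gives the unit cost to track against a target.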
What changes when cost is a first-class constraint
The architectural moves are not exotic. They are the same moves database engineers have been making for forty years — they're just rarely applied to AI work because most AI teams come from a research mindset rather than an operations one.
Routing
The cheapest call is the one that doesn't go to the expensive model. Most production AI systems should have a router in front: a small fast model (or a deterministic classifier) that decides whether the query can be answered cheaply, expensively, or not at all. For the AI Intelligence Layer we built at Klay Securities, the router catches about 60% of queries before they hit the expensive model. Spend on the expensive model drops by roughly that fraction; the net saving is slightly smaller once the router and the cheap path are paid for.
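A minimal sketch of the deterministic-classifier variant is below. It is not the router described above: the keyword lists, the word-count threshold, and the prices passed to `blended_cost` are all assumptions made for illustration, and a real router would be tuned on logged traffic and checked against a test set.

```python
from enum import Enum


class Route(Enum):
    CHEAP = "cheap"
    EXPENSIVE = "expensive"
    REJECT = "reject"


# Hypothetical keyword lists, for illustration only.
OUT_OF_SCOPE = ("weather", "joke")
NEEDS_STRONG_MODEL = ("regulatory", "draft", "compare")


def route(query: str, max_cheap_words: int = 40) -> Route:
    """Decide whether a query is answered cheaply, expensively, or not at all."""
    text = query.lower()
    if not text.strip() or any(k in text for k in OUT_OF_SCOPE):
        return Route.REJECT
    # Long queries and queries with "hard" keywords go to the strong model.
    if any(k in text for k in NEEDS_STRONG_MODEL) or len(text.split()) > max_cheap_words:
        return Route.EXPENSIVE
    return Route.CHEAP


def blended_cost(share_cheap: float, cheap: float, expensive: float, router: float) -> float:
    """Average cost per query when share_cheap of traffic avoids the expensive model."""
    return router + share_cheap * cheap + (1 - share_cheap) * expensive


print(route("Summarise yesterday's market close"))          # Route.CHEAP
print(route("Draft a regulatory response to this notice"))  # Route.EXPENSIVE
```

With assumed per-query costs of 0.001 (cheap), 0.03 (expensive) and 0.0001 (router), `blended_cost(0.6, 0.001, 0.03, 0.0001)` comes to about 0.0127 against 0.03 unrouted, a saving of a little under 60%.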
Caching
Many enterprise queries are repeats. A summarisation of yesterday's market that has been computed once should not be re-computed for every analyst who asks for it. Caching at the prompt-hash level is a single afternoon's engineering and routinely cuts cost by 20–40%.
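One way to sketch prompt-hash caching, assuming an in-memory store and a caller-supplied `call_model` function (both hypothetical; a production version would likely use a shared store and expire entries, since a summary of "yesterday" goes stale):

```python
import hashlib
from typing import Callable


class PromptCache:
    """In-memory cache keyed by a hash of (model, whitespace-normalised prompt)."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(model: str, prompt: str) -> str:
        # Collapse whitespace so trivially different prompts share an entry.
        normalised = " ".join(prompt.split())
        return hashlib.sha256(f"{model}\n{normalised}".encode("utf-8")).hexdigest()

    def get_or_compute(self, model: str, prompt: str,
                       call_model: Callable[[str, str], str]) -> str:
        k = self.key(model, prompt)
        if k in self._store:
            self.hits += 1
            return self._store[k]
        self.misses += 1
        result = call_model(model, prompt)  # the only place money is spent
        self._store[k] = result
        return result
```

The model name is part of the key so that a switch of model never serves answers produced by the old one. The `hits` and `misses` counters give the hit rate, which is the number that decides whether the cache is earning its keep.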
Tiered models
Use the cheapest model that works for each class of query. A summarisation task probably doesn't need a frontier model. A regulatory drafting task probably does. The cost differential between the two can be 30×. Building the system around a single model is the lazy default and the expensive one.
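As an illustration of the difference, here is a small sketch comparing a single-model design with a tiered one on the same traffic mix. The tier names, prices, query classes and token volumes are all invented for the example; only the 30× price ratio is taken from the paragraph above.

```python
# Hypothetical tiers and prices per 1,000 tokens (a 30x differential).
PRICE_PER_1K_TOKENS = {"small": 0.001, "frontier": 0.03}
MODEL_FOR_CLASS = {
    "summarisation": "small",
    "regulatory_drafting": "frontier",
}
FALLBACK_MODEL = "frontier"  # unknown query classes go to the most capable tier


def pick_model(query_class: str) -> str:
    return MODEL_FOR_CLASS.get(query_class, FALLBACK_MODEL)


def traffic_cost(traffic: dict[str, int], tiered: bool = True) -> float:
    """Cost of a traffic mix given as {query_class: total_tokens}."""
    total = 0.0
    for query_class, tokens in traffic.items():
        model = pick_model(query_class) if tiered else FALLBACK_MODEL
        total += tokens / 1000 * PRICE_PER_1K_TOKENS[model]
    return total


# Assumed mix: 90% of tokens are summarisation, 10% regulatory drafting.
mix = {"summarisation": 9_000_000, "regulatory_drafting": 1_000_000}
single = traffic_cost(mix, tiered=False)  # everything on the frontier tier
tiered = traffic_cost(mix, tiered=True)
print(f"single model: {single:.0f}, tiered: {tiered:.0f}")  # 300 vs 39 here
```

The mapping from query class to tier should come out of measured quality on a test set, not out of intuition; the dictionary here only shows where that decision plugs in.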
Prompt compression
Every token costs money on the input side. Prompts that grow to 4× over a year can be re-compressed by half with little quality loss using a quarterly review process: prune redundant instructions, replace verbose few-shot examples with terser ones, move stable rules into fine-tuning where appropriate.
Async batching
For non-interactive workloads (overnight summarisation, batch reviews, scheduled drafts), several major vendors offer batch APIs at roughly 50% of the synchronous price, in exchange for results that arrive within hours rather than seconds. Any workload that can tolerate that delay should be on the batch path.
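A sketch of how that decision might be made per workload. The 50% price factor is the figure quoted above; the 24-hour turnaround, the workload names and the costs are assumptions for illustration, and the actual terms should be checked against your vendor's documentation.

```python
from dataclasses import dataclass

BATCH_PRICE_FACTOR = 0.5           # the 50% figure from the text; confirm with your vendor
BATCH_TURNAROUND_S = 24 * 60 * 60  # assumed worst-case batch turnaround


@dataclass
class Workload:
    name: str
    max_wait_s: float  # longest acceptable delay before a result is needed
    sync_cost: float   # estimated monthly cost at synchronous prices


def can_batch(w: Workload) -> bool:
    """A workload can be batched if it tolerates the worst-case turnaround."""
    return w.max_wait_s >= BATCH_TURNAROUND_S


def planned_cost(workloads: list[Workload]) -> float:
    return sum(w.sync_cost * (BATCH_PRICE_FACTOR if can_batch(w) else 1.0)
               for w in workloads)


workloads = [
    Workload("analyst chat", max_wait_s=2, sync_cost=1000.0),
    Workload("overnight summarisation", max_wait_s=36 * 3600, sync_cost=800.0),
    Workload("weekly batch review", max_wait_s=7 * 24 * 3600, sync_cost=200.0),
]
print(planned_cost(workloads))  # 1500.0, against 2000.0 fully synchronous
```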
The eval harness, treated seriously
None of the moves above work without an eval harness that can measure quality alongside cost. The harness is the part most teams skip until month nine. The teams that succeed build it in week three.
The minimum viable harness:
- A frozen test set of 200–1000 representative queries, labelled with the right answer or scored by a domain expert.
- An automated runner that executes the test set against any candidate configuration and produces a quality score, a cost estimate, and a latency distribution.
- A regression baseline — the current production configuration is the benchmark; any change has to beat it on quality without crossing a cost ceiling.
- A model selection function that recommends the cheapest model that passes the quality bar for each query class.
This is two engineer-weeks of work. It pays for itself within the first month of operation because it unlocks every other optimisation. Without it, the team has no way to argue for a switch from an expensive model to a cheaper one — every change feels like a risk.
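The four pieces listed above can be sketched in a few dozen lines. This is a toy version under stated assumptions: quality is scored by exact match against a labelled answer (a real harness might use a rubric or expert scores), each candidate configuration is a function that returns its own cost and latency, and all class and function names are invented for the example.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Optional


@dataclass
class Case:
    query: str
    expected: str  # labelled answer from the frozen test set


@dataclass
class CallResult:
    answer: str
    cost: float       # cost of this call, as reported by the candidate
    latency_s: float


@dataclass
class Report:
    name: str
    quality: float         # fraction of exact matches (assumed metric)
    cost_per_query: float
    p95_latency_s: float


Candidate = Callable[[str], CallResult]


def evaluate(name: str, candidate: Candidate, test_set: list[Case]) -> Report:
    """Run one configuration over the frozen test set."""
    if not test_set:
        raise ValueError("test set is empty")
    results = [candidate(c.query) for c in test_set]
    correct = sum(r.answer.strip() == c.expected.strip()
                  for r, c in zip(results, test_set))
    latencies = sorted(r.latency_s for r in results)
    p95 = latencies[min(len(latencies) - 1, int(0.95 * len(latencies)))]
    return Report(name, correct / len(test_set), mean(r.cost for r in results), p95)


def passes_gate(candidate: Report, baseline: Report, cost_ceiling: float) -> bool:
    """Regression gate: match or beat production quality, stay under the ceiling."""
    return (candidate.quality >= baseline.quality
            and candidate.cost_per_query <= cost_ceiling)


def cheapest_passing(reports: list[Report], quality_bar: float) -> Optional[Report]:
    """Model selection: the cheapest configuration that clears the quality bar."""
    passing = [r for r in reports if r.quality >= quality_bar]
    return min(passing, key=lambda r: r.cost_per_query) if passing else None
```

Running `evaluate` once per query class and feeding the reports to `cheapest_passing` gives a per-class recommendation. Whether the gate should require strictly better quality or only equal quality is a policy choice; this sketch accepts equal.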
The conversation with the CFO
Once cost is a first-class concern with instrumentation behind it, the conversation with finance changes. It stops being "we need more GPU budget" and becomes "we've reduced unit cost by 40% this quarter while holding quality constant; the budget increase is to expand the system to a new workflow."
This is the conversation that protects the project. CFOs are not the enemy. They are operators looking for evidence that the investment is improving over time. Give them that evidence and the budget grows. Refuse to look at the cost curve and they will eventually conclude the project is not under control.
What pilots that succeed look like
A pattern from the three biggest AI projects we have shipped in the last 18 months:
By month two, the eval harness was running. By month three, the router was catching the cheap queries. By month six, the team had switched at least one workflow from a frontier model to a fine-tuned smaller one without losing quality. By month nine, the cost-per-output had dropped 50–70% while total volume had grown 2–4×. By month twelve, the finance team was asking for more.
None of this required research breakthroughs. It required treating cost as part of the design.
If you're starting a pilot now
Three things to do before you write the first prompt.
Build the eval harness first. Before the prompt, before the model selection. The harness is the instrument you will use to argue every subsequent decision.
Pick a cost target. What is the unit cost the project needs to hit at scale? Write it down. Optimise against it from day one. A target of "cheap enough" is not a target.
Run the router experiment in week one. Even a dumb keyword classifier in front of an expensive model is usually a 30% cost cut. Put it in before you ship anything.
The pilot is not over when it works on the demo. The pilot is over when the unit cost is sustainable and the eval harness says quality is stable. The teams that understand this build AI that survives. The teams that don't end up explaining a quietly rising GPU bill to a quietly skeptical CFO.
Closing
Most AI projects do not fail at the demo. They fail at the cost curve. The teams that win the next two years will be the ones that treat GPU spend, model selection, and the eval harness as first-class design constraints, not as engineering chores to be handled after the model has been chosen. The infrastructure decisions made in the first month determine whether the project will be defensible in the eighteenth.
Plot the curve. Bend it. Then ship.