Confidential findings · prepared under NDA
A fictional, representative engagement showing the structure and depth of a real deliverable.
Savings here are expressed as unit cost — cost per 1,000 calls on a fixed sample of traffic, before vs after — not the total monthly bill, which moves with usage. Every figure on this page is illustrative.
Four mechanical inefficiencies account for essentially all of the recoverable spend. None require touching product behavior, model quality, or output. In rough order of recoverable dollars:
| Leak | Current / mo | Optimized / mo | Saved |
|---|---|---|---|
| Uncached repeated prefix (system prompt + tool defs, re-sent every call) | $6,400 | $1,400 | $5,000 |
| Frontier model on low-complexity calls (classification, routing) | $2,400 | $1,100 | $1,300 |
| Async jobs at real-time price (no Batch API) | $1,200 | $700 | $500 |
| Unbounded context + oversized max_tokens | $1,000 | $1,000 | — |
| Total | $11,000 | $4,200 | $6,800 |
Per-1,000-call unit cost on the agreed sample falls from $12.20 to $4.66. The total bill may still rise as the product grows — which is exactly why the engagement is measured on unit cost, verifiable against the client's own invoices.
The 3–4k-token system prompt and tool schemas are identical on every call but reprocessed at full price ~30k times a day. Mark them with cache_control and reorder so everything static sits ahead of anything dynamic. Cache reads bill at ~10% of input. No output change.
Classification, intent detection, and routing are wired to the frontier tier out of habit. Route them to a model ~5× cheaper that produces identical labels; keep the top tier only for genuine reasoning. A few lines of conditional logic.
Nightly summarization and enrichment tolerate hours of latency but run on the synchronous endpoint. Move them to the Batch API for a 50% discount on input and output — same model, same result.
Introduce a sliding-window / summarization strategy for long conversations and right-size max_tokens. Mostly preventive — it stops unit cost from creeping back up as sessions get longer.
Figures throughout are illustrative and based on public 2026 API pricing mechanics. Actual results depend on a client's real traffic, context size, and model mix.