Inference Cost Assessment

A fictional, representative engagement showing the structure and depth of a real deliverable.

Client profile

Series-A · support automation

Provider

Anthropic API

Baseline spend

~$11,000 / mo

Workload

~30k calls / day

Sample window

30 days of usage export

Prepared

SAMPLE

Headline result

$11,000

Current — per month

$4,200

Optimized — per month

−62%

Unit-cost reduction

Savings here are expressed as unit cost — cost per 1,000 calls on a fixed sample of traffic, before vs after — not the total monthly bill, which moves with usage. Every figure on this page is illustrative.

Where the spend leaks

Four mechanical inefficiencies account for essentially all of the recoverable spend. None require touching product behavior, model quality, or output. In rough order of recoverable dollars:

Leak	Current / mo	Optimized / mo	Saved
Uncached repeated prefix (system prompt + tool defs, re-sent every call)	$6,400	$1,400	$5,000
Frontier model on low-complexity calls (classification, routing)	$2,400	$1,100	$1,300
Async jobs at real-time price (no Batch API)	$1,200	$700	$500
Unbounded context + oversized max_tokens	$1,000	$1,000	—
Total	$11,000	$4,200	$6,800

Before vs after

Before

$11,000 / mo

100%

After

$4,200 / mo

38%

Per-1,000-call unit cost on the agreed sample falls from $12.20 to $4.66. The total bill may still rise as the product grows — which is exactly why the engagement is measured on unit cost, verifiable against the client's own invoices.

The fix

1 · Prompt caching on the static prefix ≈ $5,000 / mo

The 3–4k-token system prompt and tool schemas are identical on every call but reprocessed at full price ~30k times a day. Mark them with cache_control and reorder so everything static sits ahead of anything dynamic. Cache reads bill at ~10% of input. No output change.

2 · Tier routing for mechanical calls ≈ $1,300 / mo

Classification, intent detection, and routing are wired to the frontier tier out of habit. Route them to a model ~5× cheaper that produces identical labels; keep the top tier only for genuine reasoning. A few lines of conditional logic.

3 · Batch API for async work ≈ $500 / mo

Nightly summarization and enrichment tolerate hours of latency but run on the synchronous endpoint. Move them to the Batch API for a 50% discount on input and output — same model, same result.

4 · Context & output hygiene monitored

Introduce a sliding-window / summarization strategy for long conversations and right-size max_tokens. Mostly preventive — it stops unit cost from creeping back up as sessions get longer.

How savings would be verified

Fix scope agreed up front against a fixed sample of representative traffic.
Unit cost (cost / 1,000 calls) measured before and after on that same sample.
Result cross-checked against the client's own provider invoices — nothing taken on faith.
Any savings-based fee is owed only if the agreed reduction is achieved.

Figures throughout are illustrative and based on public 2026 API pricing mechanics. Actual results depend on a client's real traffic, context size, and model mix.