LLM Cost Audit Findings Report
SAMPLE REPORT — illustrative figures only. Contains no real company or client data.

Confidential findings · prepared under NDA

Inference Cost Assessment

A fictional, representative engagement showing the structure and depth of a real deliverable.

Client profile
Series-A · support automation
Provider
Anthropic API
Baseline spend
~$11,000 / mo
Workload
~30k calls / day
Sample window
30 days of usage export
Prepared
SAMPLE

Headline result

$11,000
Current — per month
$4,200
Optimized — per month
−62%
Unit-cost reduction

Savings here are expressed as unit cost — cost per 1,000 calls on a fixed sample of traffic, before vs after — not the total monthly bill, which moves with usage. Every figure on this page is illustrative.

Where the spend leaks

Four mechanical inefficiencies account for essentially all of the recoverable spend. None require touching product behavior, model quality, or output. In rough order of recoverable dollars:

Leak Current / mo Optimized / mo Saved
Uncached repeated prefix (system prompt + tool defs, re-sent every call) $6,400 $1,400 $5,000
Frontier model on low-complexity calls (classification, routing) $2,400 $1,100 $1,300
Async jobs at real-time price (no Batch API) $1,200 $700 $500
Unbounded context + oversized max_tokens $1,000 $1,000
Total $11,000 $4,200 $6,800

Before vs after

Before
$11,000 / mo
100%
After
$4,200 / mo
38%

Per-1,000-call unit cost on the agreed sample falls from $12.20 to $4.66. The total bill may still rise as the product grows — which is exactly why the engagement is measured on unit cost, verifiable against the client's own invoices.

The fix

1 · Prompt caching on the static prefix  ≈ $5,000 / mo

The 3–4k-token system prompt and tool schemas are identical on every call but reprocessed at full price ~30k times a day. Mark them with cache_control and reorder so everything static sits ahead of anything dynamic. Cache reads bill at ~10% of input. No output change.

2 · Tier routing for mechanical calls  ≈ $1,300 / mo

Classification, intent detection, and routing are wired to the frontier tier out of habit. Route them to a model ~5× cheaper that produces identical labels; keep the top tier only for genuine reasoning. A few lines of conditional logic.

3 · Batch API for async work  ≈ $500 / mo

Nightly summarization and enrichment tolerate hours of latency but run on the synchronous endpoint. Move them to the Batch API for a 50% discount on input and output — same model, same result.

4 · Context & output hygiene  monitored

Introduce a sliding-window / summarization strategy for long conversations and right-size max_tokens. Mostly preventive — it stops unit cost from creeping back up as sessions get longer.

How savings would be verified

Figures throughout are illustrative and based on public 2026 API pricing mechanics. Actual results depend on a client's real traffic, context size, and model mix.