
Claude API Pricing — Tokens, Caching, and Limits

How the Claude API charges for input and output tokens, when prompt caching cuts costs, and what to check before estimating a production budget.

Cost signal

Output tokens cost more per token than input tokens on every Claude model tier. If your workload generates long completions, that asymmetry drives most of your bill. Profile completion length before assuming input size is the main lever.

How token pricing works

The Claude API charges separately for the tokens you send and the tokens the model generates. Input tokens are everything in your prompt — system text, conversation history, any documents you pass in context. Output tokens are what the model writes back. The billing is straightforward: count the tokens on each side, multiply by the per-tier rate, sum them. No minimum charges, no per-request fees in the standard tier.

Token counting is not always intuitive. A thousand tokens is roughly 700–800 English words depending on vocabulary, but code and non-English text can tokenise at quite different densities. The official SDKs provide token counting utilities so you can measure your actual prompts rather than estimating. Getting those measurements early, before you finalise a production architecture, saves budget surprises later. A prompt that seems short in characters can carry an unexpectedly large token count if it contains a lot of code or structured data.
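The billing arithmetic above (tokens on each side, times the per-tier rate, summed) can be sketched directly. The rates and the `request_cost` helper below are illustrative placeholders, not published prices; check the pricing page for current figures.

```python
# Per-request billing sketch: input tokens and output tokens are billed
# separately, each at a per-million-token rate, with no per-request fee.
RATES_PER_MTOK = {              # (input, output) USD per million tokens
    "example-tier": (3.00, 15.00),   # placeholder rates, not real pricing
}

def request_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request in USD: input side plus output side."""
    in_rate, out_rate = RATES_PER_MTOK[tier]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# A 2,000-token prompt with a 500-token completion:
print(f"${request_cost('example-tier', 2000, 500):.4f}")
```

Note that at these placeholder rates the 500 output tokens already cost more than the 2,000 input tokens, which is the asymmetry the next section covers.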

Output vs input pricing asymmetry

Output tokens cost more per token than input tokens, typically three to five times more depending on the model. That ratio matters when you estimate costs for workloads that generate long responses. At the upper end of that range, a request that sends 2,000 tokens of context and receives 500 tokens of output costs more in output than input, even though the input is four times larger in token count. Workloads that generate full documents, long code files, or multi-step reasoning traces with extended thinking enabled will see this asymmetry dominate their cost profile.
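That example can be checked at both ends of the assumed 3–5× range. The $1-per-million input rate below is a placeholder; only the ratio between the two sides matters here.

```python
# Where does output cost overtake input cost for a 2,000-in / 500-out
# request? Depends on the output multiplier.
IN_RATE = 1.0  # USD per million input tokens (illustrative)

def split_cost(input_tokens: int, output_tokens: int, output_multiplier: float):
    """Return (input_cost, output_cost) for one request."""
    in_cost = input_tokens * IN_RATE / 1_000_000
    out_cost = output_tokens * IN_RATE * output_multiplier / 1_000_000
    return in_cost, out_cost

for mult in (3, 5):
    in_c, out_c = split_cost(2000, 500, mult)
    side = "output" if out_c > in_c else "input"
    print(f"{mult}x output rate: input ${in_c:.6f} vs output ${out_c:.6f} -> {side} dominates")
```

At 3× the input side still dominates for this shape of request; at 5× the output side takes over, which is why long-generation workloads feel the asymmetry hardest.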

The practical implication: optimising prompt compression reduces input cost, but for long-generation tasks the bigger lever is reducing output length — using concise instruction framing, requesting structured short answers, or capping max_tokens. Prompt caching, covered below, is the most effective single optimisation for workloads with large, stable input contexts repeated across many requests.

Prompt caching

Prompt caching lets you designate portions of a prompt as cacheable. When a subsequent request reuses that cached prefix, the API charges a reduced rate for the cache-read tokens rather than the full input rate. Writing tokens to the cache on the first request costs slightly more than a standard input call; from the second request onward, cache hits pay roughly 10% of the normal input token rate. For workloads where the same large system prompt or document is prepended to every request, the savings compound quickly.

Caching is most effective when the stable portion of your prompt is large — a several-thousand-token system context, a reference document, or a persistent tool manifest. Short system prompts do not save enough to offset the cache write cost unless the request volume is very high. The cache has a TTL (time-to-live) that varies by tier; once expired, the next request repopulates it at the write rate. Understanding the TTL behaviour matters for workloads that run in scheduled bursts rather than continuously.
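The break-even logic above can be sketched with two assumed multipliers: a cache write at 1.25× the normal input rate (the text only says "slightly more") and cache reads at 0.10×. Both figures, and the `prefix_cost` helper, are illustrative assumptions, not published pricing.

```python
# Cost of billing a stable prompt prefix across N requests, with and
# without caching. Assumes: first request writes the cache at a 1.25x
# premium, every later request reads it at 0.10x of the input rate.
def prefix_cost(prefix_tokens: int, n_requests: int, in_rate_per_mtok: float,
                write_mult: float = 1.25, read_mult: float = 0.10,
                cached: bool = True) -> float:
    per_tok = in_rate_per_mtok / 1_000_000
    if not cached:
        return prefix_tokens * per_tok * n_requests
    # One cache write, then (n - 1) discounted cache reads.
    return prefix_tokens * per_tok * (write_mult + read_mult * (n_requests - 1))

# An 8,000-token system prompt at a placeholder $3/MTok, reused 1,000 times:
uncached = prefix_cost(8000, 1000, 3.00, cached=False)
cached = prefix_cost(8000, 1000, 3.00)
print(f"uncached ${uncached:.2f} vs cached ${cached:.2f}, "
      f"saving {100 * (1 - cached / uncached):.0f}%")
```

The same function also shows the failure mode the paragraph warns about: at `n_requests=1` the cached path costs more than the uncached one, because you pay the write premium and never collect a discounted read.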

Pricing dimensions at a glance

| Dimension | Input | Output | Caching |
| --- | --- | --- | --- |
| Opus | Highest input rate | ~3–5× input rate | ~10% of input on cache hit |
| Sonnet | Mid input rate | ~3–5× input rate | ~10% of input on cache hit |
| Haiku | Lowest input rate | ~3–5× input rate | ~10% of input on cache hit |
| Batch API | ~50% of sync rate | ~50% of sync rate | Caching applies within batch |

Batch API pricing

The batch API processes requests asynchronously and charges approximately half the synchronous rate for both input and output tokens. The trade-off is latency: batch jobs are not guaranteed to complete within a specific window, though in practice they often finish within hours. The batch endpoint is well suited to overnight data-processing jobs, large-scale document summarisation, and any use case where throughput matters more than response time. Rate limits on the batch endpoint are separate from interactive limits, so batch jobs do not consume your synchronous request quota.
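For a throughput-oriented job, the batch discount can be sketched as a flat factor on the synchronous cost. The 0.5 factor is the approximation stated above; the rates and job shape below are illustrative.

```python
# Sync vs batch cost for an overnight job: same token maths as a
# synchronous request, with both sides discounted by roughly half.
BATCH_DISCOUNT = 0.5  # approximate, per the batch pricing described above

def job_cost(n_requests: int, in_tok: int, out_tok: int,
             in_rate: float, out_rate: float, batch: bool = False) -> float:
    per_req = (in_tok * in_rate + out_tok * out_rate) / 1_000_000
    return n_requests * per_req * (BATCH_DISCOUNT if batch else 1.0)

# 50,000 summarisation requests, 4,000 tokens in / 300 out, at
# placeholder rates of $3 / $15 per million tokens:
sync_cost = job_cost(50_000, 4000, 300, 3.00, 15.00)
batch_cost = job_cost(50_000, 4000, 300, 3.00, 15.00, batch=True)
print(f"sync ${sync_cost:,.2f} vs batch ${batch_cost:,.2f}")
```

If the job can wait hours for results, the discount applies to the whole bill, which is why batch is the default choice for scheduled pipelines.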

Budgeting in practice

The most reliable way to estimate monthly cost is to run a representative sample of your actual requests, record the token counts from the API response's usage field, and multiply by current rates. Token counts in the response are exact and vendor-authoritative. For new projects without existing data, sketch three scenarios — light, medium, and heavy usage — and use those as your planning range rather than a single point estimate.
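The three-scenario approach can be reduced to a few lines once you have measured per-request averages from the usage field. All numbers below are illustrative assumptions, not measurements.

```python
# Monthly budget range from measured per-request token averages and
# three assumed traffic scenarios.
AVG_IN, AVG_OUT = 1800, 450          # measured averages per request (example)
IN_RATE, OUT_RATE = 3.00, 15.00      # placeholder USD per million tokens

SCENARIOS = {"light": 20_000, "medium": 100_000, "heavy": 500_000}  # requests/month

for name, n_requests in SCENARIOS.items():
    cost = n_requests * (AVG_IN * IN_RATE + AVG_OUT * OUT_RATE) / 1_000_000
    print(f"{name:>6}: {n_requests:>7,} requests -> ${cost:,.2f}/month")
```

The spread between the light and heavy figures is the planning range; quoting only the midpoint hides how sensitive the bill is to traffic.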

Set a monthly spend cap in the API console before going live; it bounds the damage from a runaway bug or an unexpected traffic spike to a known ceiling. Most teams review actual versus estimated spend in the first two weeks of production and adjust the cap and model choice based on observed token distributions.

"The caching section here saved us significant spend. We were regenerating the same 8k-token system prompt on every request. One afternoon's work to cache it cut our monthly bill by almost 40%."
— Alejandro H. Quesada, CLI Developer · Brasero Logistical · Buenos Aires

Frequently asked questions about Claude API pricing

How does Claude API pricing work?

The API charges separately for input tokens (your prompt) and output tokens (the model's response). Each is billed per million tokens at a rate that varies by model tier — Opus is most expensive, Haiku is cheapest. There are no per-request fees in the standard tier; you pay for tokens consumed.

What is prompt caching and how does it reduce costs?

Prompt caching lets stable portions of a prompt — a large system context, a reference document — be cached server-side. Cache hits pay roughly 10% of the normal input rate. Writing to the cache costs slightly more than a standard input call, but repeated requests against the same cached prefix pay the discounted rate, making caching highly effective for high-volume workloads with shared context.

Are output tokens more expensive than input tokens?

Yes. Output tokens typically cost three to five times more per token than input tokens across all Claude tiers. For workloads that generate long completions — full documents, extended reasoning traces, long code files — output cost dominates the bill. Reducing completion length is often more impactful than compressing the input prompt.

Does the batch API change Claude pricing?

Yes. The batch API offers roughly 50% of the synchronous rate for both input and output tokens, in exchange for asynchronous processing with no guaranteed latency. Batch rate limits are separate from interactive limits, so batch jobs do not consume your synchronous quota. It is well suited to overnight processing and large-scale document tasks.

How should I budget for Claude API costs?

Run a representative sample of real requests and read the token counts from the API response's usage field — those are exact. Sketch light, medium, and heavy usage scenarios as your planning range. Set a monthly spend cap in the console before going live to prevent runaway spend from bugs or unexpected traffic spikes.

Related topics

The Claude models overview provides the full side-by-side comparison of Opus, Sonnet, and Haiku that puts pricing signals in context. The Claude API reference covers authentication, endpoints, and rate limits, the structural layer that pricing sits on top of. If you are evaluating whether the free tier meets your early development needs, the Claude AI free page maps current limits and what the paid upgrade adds.

For the specific trade-offs of the top model, the Claude Opus page explains when Opus pricing is justified by reasoning depth. Teams managing costs across a large squad often find the Claude Code for teams notes helpful: shared configuration and model routing rules can enforce cost policies at the team level rather than relying on individual developers to remember to switch models. The Claude Code skills reference covers how skills can route tasks to cost-appropriate models automatically.

Compare models before you commit

The models overview puts all three Claude tiers in one table — context, pricing signals, and best-fit use cases side by side.

View models overview