ChatGPT cost per query in 2026 ranges from sub-tenth-of-a-cent for
short cached conversational turns to tens of cents for long-context
reasoning queries — a spread of roughly 100× across the realistic
distribution. The dominant cost is GPU-seconds, which scale with
output tokens and model size, not with subscription tier. Per-query
costs have dropped roughly 70–90% since 2023, driven by hardware gains,
serving-layer efficiency, distilled models, and prompt caching. For
OpenAI's P&L, the weighted-average free-tier query cost is the
single most important variable: it determines whether the free tier
can sustain itself on advertising inventory alone.

ChatGPT Cost Per Query 2026 | Thrad
What does it actually cost OpenAI to serve a single ChatGPT query in
2026? The honest answer is: it depends, by roughly two orders of
magnitude. A cached, short, GPT-4o-mini conversational turn costs a
fraction of a cent; a long-context o-series reasoning query costs real
money. Internally the cost is measured in GPU-seconds, not dollars, and
the dollar translation depends heavily on which model answered, whether
the prompt cached, and how long the response ran. The distribution, not
the headline average, is the story: its shape determines whether the
free tier can be advertising-funded or needs to keep pushing users to
paid.
What is ChatGPT's cost per query?
Cost per query is the fully-loaded serving cost of generating one
response to one user prompt. It includes the GPU-seconds spent
processing the input (prefill), generating the output (decode), any
tool calls or retrieval passes, the KV-cache memory footprint while
the response streams, and a proportional slice of the serving
infrastructure's fixed costs (GPU amortization, networking, storage,
orchestration). In 2026 this fully-loaded number sits between roughly
$0.0005 and $0.25 depending on query type.
In 2026, the dominant variable is output length × model size. Every
output token costs a forward pass through the model; bigger models mean
more parameters per pass; longer outputs mean more passes. Input length
matters much less because input tokens are processed in parallel in
the prefill phase, which is cheaper per-token than the serial decode
phase by roughly an order of magnitude.
The cost object is often measured in GPU-seconds per query rather than
dollars per query internally at OpenAI, because the dollar translation
depends on the accelerator generation and the amortization schedule of
the underlying hardware. That's why OpenAI's published API pricing
(which translates internal GPU-seconds into external dollars with a
margin layer) is the clearest external proxy for internal cost, not
the cost itself.
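That translation can be sketched in a few lines. Every rate below (the amortized GPU-hourly cost, the overhead multiplier, and the GPU-seconds per query type) is an illustrative assumption, not an OpenAI figure:

```python
# Sketch: translating GPU-seconds per query into fully-loaded dollars.
# All rates are illustrative assumptions, not OpenAI internals.

def cost_per_query(gpu_seconds: float,
                   gpu_hourly_rate: float = 2.50,     # assumed amortized $/GPU-hour
                   overhead_multiplier: float = 1.3,  # networking, storage, orchestration
                   ) -> float:
    """Fully-loaded dollar cost of one query."""
    compute = gpu_seconds * gpu_hourly_rate / 3600
    return compute * overhead_multiplier

# Assumed GPU-seconds: ~0.5 for a short cached turn, ~200 for heavy reasoning.
cheap = cost_per_query(0.5)    # ~ $0.00045
heavy = cost_per_query(200.0)  # ~ $0.18
```

Both endpoints roughly reproduce the $0.0005–$0.25 range above, which is the point: the dollar figure is derived, and the GPU-hourly rate is where accelerator generation and amortization schedule enter the calculation.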
How does the cost distribution look in 2026?
The cost distribution is heavy-tailed, with the median query costing
under a cent but the top percentile of reasoning-heavy long-context
queries costing tens of cents. Understanding this shape matters because
it explains why OpenAI optimizes aggressively for the cheap end (free
tier) and prices conservatively at the expensive end (o-series API
access).
| Query type | Rough per-query cost (2026) | Why |
|---|---|---|
| Short cached GPT-4o-mini turn | $0.0005–$0.002 | Small model; cache hit on system prompt; few output tokens |
| Standard GPT-4o conversational turn | $0.002–$0.01 | Mid-size model; typical response length |
| Long-context GPT-5 response | $0.01–$0.05 | Bigger model; 2–4k output tokens; large input context |
| o-series reasoning query | $0.05–$0.25 | Reasoning tokens inflate output count; larger model path |
| Multimodal (image or audio in) | $0.01–$0.05 | Vision/audio preprocessing adds a fixed step |
| Agentic multi-turn tool use | $0.05–$0.50+ | Multiple model calls per user action; hidden token volume |
These are directional estimates derived from published API prices plus
reasonable assumptions about the gap between OpenAI's internal COGS
and external pricing. The exact numbers are proprietary; the
distribution shape is widely discussed in industry coverage.
The practical implication of the distribution: a reasonably
conservative blended average across the free tier in 2026 sits around
$0.003–$0.008 per query. That's the number that matters for the
free-tier-versus-ads math. If OpenAI serves (conservatively) 15 billion
free-tier queries per week in 2026, that's $45M–$120M per week in
free-tier compute cost, or $2.3B–$6.2B annualized — the number that
advertising revenue has to cover for the free tier to be
self-sustaining.
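The arithmetic above can be reproduced directly; the volume and cost range are the article's own assumptions:

```python
# Reproducing the free-tier compute math from the text.
queries_per_week = 15e9              # conservative free-tier volume assumption
cost_low, cost_high = 0.003, 0.008   # blended $/query range

weekly_low = queries_per_week * cost_low     # $45M / week
weekly_high = queries_per_week * cost_high   # $120M / week
annual_low = weekly_low * 52                 # ~$2.3B / year
annual_high = weekly_high * 52               # ~$6.2B / year
```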
Why do output tokens dominate cost per query?
Output tokens dominate cost because transformer inference has two
fundamentally different cost regimes: a parallel prefill phase that
processes the whole input at once, and a serial decode phase that
generates one output token per forward pass. Decode is slow, memory-
bandwidth-bound, and scales linearly with response length — so for
any given model, output length is the single biggest lever on per-query
cost.
In more detail: the prefill phase processes the entire input in
parallel in one large matrix multiply against the attention cache. The
decode phase generates one output token at a time, each requiring its
own forward pass through the full model and a read-and-write against a
growing KV cache. Decode is serial, memory-bandwidth-bound, and the
part that scales linearly with output length.
Practical implication: a prompt with 10,000 input tokens and 100 output
tokens costs much less to serve than a prompt with 1,000 input tokens
and 2,000 output tokens. The first scenario has a long prefill and a
short decode; the second has a short prefill and a long decode. The
second is substantially more expensive even though it has fewer total
tokens. This is why reasoning endpoints — which generate long
chain-of-thought output — are disproportionately expensive and why
"let the model decide how long to respond" is the single worst default
for cost management.
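A minimal model of the two regimes makes the comparison concrete. The per-token rates are illustrative assumptions, with decode priced roughly 10× prefill per the text:

```python
# Sketch of the two-regime cost model: parallel prefill is roughly an
# order of magnitude cheaper per token than serial decode.
# Per-token rates are illustrative assumptions.

PREFILL_COST = 0.000001  # assumed $ per input token
DECODE_COST = 0.00001    # assumed $ per output token (~10x prefill)

def query_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * PREFILL_COST + output_tokens * DECODE_COST

# The two scenarios from the text:
long_prompt_short_answer = query_cost(10_000, 100)   # 10,100 tokens total -> $0.011
short_prompt_long_answer = query_cost(1_000, 2_000)  #  3,000 tokens total -> $0.021
```

Under these assumed rates, the decode-heavy query costs nearly twice as much despite having less than a third of the total tokens.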
The unit economics of ChatGPT in 2026 are set by how many output
tokens a query generates, not how long its prompt is. A short prompt
that triggers a long reasoning trace is a much bigger cost event
than a long prompt with a one-line answer. Every optimization that
shortens output — structured output, length limits, smaller
distilled models — shows up directly in the bill.
The implication for free-tier economics is important. Free-tier users
on ChatGPT.com see the conversational product, which has defaults
tuned for short-to-medium answers and aggressive prefix caching on the
system prompt. Those two design choices alone probably cut average
free-tier cost per query by 50%+ relative to "naive" generation
settings. The productization of ChatGPT is, among other things, an
elaborate cost-optimization exercise.
Where did the efficiency gains come from?
Per-query costs for a representative ChatGPT workload have fallen
roughly 70–90% between 2023 and 2026, and those gains came from four
overlapping sources — not one. Hardware generation improvements
delivered the biggest single contribution; serving-layer software
efficiency was second; model distillation and prompt caching filled
in the rest. Each compounds on top of the others rather than
substituting.
The contribution of each driver:
- Hardware generation gains (biggest single driver). Newer accelerators (Nvidia's H200, B200, and Blackwell successors; custom silicon from Google and others) deliver 2–4× the tokens-per-dollar of the hardware available in 2023. Each generation stacks, so the 2023-to-2026 compound improvement is material.
- Continuous batching and speculative decoding. Serving-layer improvements that let GPUs run closer to peak throughput, share KV caches across users, and predict likely next tokens to skip ahead when correct. Speculative decoding alone can deliver 30–50% speedups on real workloads.
- Model distillation. GPT-4o-mini and similar smaller models deliver most of the quality of larger models on common tasks at a fraction of the compute. Routing easy prompts to smaller distilled models is effectively a compute-efficiency multiplier on average cost.
- Prompt caching. Shared prefixes (system prompts, tool schemas, retrieved context) cache and skip recomputation on subsequent calls. For a chatbot with a long static system prompt, cache-hit ratios often exceed 90%, which brings input-side cost nearly to zero on the cached portion.
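Because the four drivers compound multiplicatively rather than additively, even modest per-driver gains reproduce the 70–90% decline. The individual factors below are illustrative assumptions:

```python
# Sketch: the four efficiency drivers compound multiplicatively.
# Each factor is an illustrative assumption for the 2023->2026 window.
hardware     = 2.5  # tokens-per-dollar gain from newer accelerators
serving      = 1.3  # continuous batching + speculative decoding
distillation = 1.8  # average gain from routing easy prompts to small models
caching      = 1.4  # blended input-side savings from prompt caching

compound = hardware * serving * distillation * caching  # ~8.2x efficiency
cost_reduction = 1 - 1 / compound                       # ~88% lower $/query
```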
None of these is finished. The curve continues through 2026 and 2027,
though incrementally slower as easy wins get picked off. The 2027–2028
wave of gains is more likely to come from specialized inference silicon
(from multiple vendors), quantization-aware model designs, and mixture-
of-experts routing at finer granularity.
How does cost per query compare across providers?
Cost per query is hard to compare across providers because each has a
different model portfolio and serving architecture, but directional
comparisons are informative. OpenAI, Anthropic, and Google all run
broadly similar cost structures on flagship models; the gap between
them on equivalent-quality endpoints is usually within 30%. Open-weight
models hosted on hyperscaler hardware can be 2–5× cheaper at the API
layer but push operational complexity and quality variance back onto
the developer.
| Provider (2026 proxy via API pricing) | Flagship $/M output | Distilled $/M output | Notable efficiency feature |
|---|---|---|---|
| OpenAI | ~$15–$25 (GPT-5) | ~$0.6–$1.2 (GPT-4o-mini) | Prompt caching, batch API |
| Anthropic | ~$15–$30 (Claude) | ~$1–$4 (Haiku class) | Extended context pricing |
| Google | ~$10–$20 (Gemini) | ~$0.3–$1 (Flash class) | Tight TPU integration |
| Open-weight (hosted) | ~$0.6–$3 | ~$0.1–$0.5 | Quantization, custom kernels |
These numbers are API-pricing proxies for internal cost; actual internal
COGS is lower across the board. The practical signal is that the
distribution shape — short cached turns are cheap, reasoning-heavy
queries are expensive — is roughly the same at every provider. No
vendor has a magic cost structure.
What does this mean for OpenAI's P&L?
The weighted-average cost per free-tier query is a closely guarded
number, but it can be approximated from public figures. If
free-tier queries average in the $0.003–$0.008 range, and OpenAI serves
somewhere between 10 and 20 billion per week in 2026, the annual compute
cost of the free tier is roughly $1.5B–$8B per year depending on
assumptions about query mix and volume.
This is the number advertising and licensing revenue has to cover for
the free tier to stop being a loss leader. At current levels, it isn't
covered — which is why the advertising line is strategically
inevitable rather than optional. The shape of the solution: aim
advertising at the commercial-intent subset of free-tier queries where
CPMs can be high (easily 10–50× the per-query compute cost on those
queries), let informational queries remain un-monetized directly, and
let conversion to Plus pick up the long tail.
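A rough break-even sketch for that shape of solution, assuming one ad placement per monetized query; every input is an illustrative assumption:

```python
# Sketch: break-even CPM if only commercial-intent queries carry ads
# (one placement per monetized query). All inputs are illustrative.
cost_per_query = 0.005    # blended free-tier $/query
commercial_share = 0.15   # assumed fraction of queries with commercial intent

# Each monetized query must also cover the informational queries around it:
cost_per_monetized_query = cost_per_query / commercial_share  # ~$0.033
break_even_cpm = cost_per_monetized_query * 1000              # ~$33 per thousand
```

A ~$33 break-even CPM on commercial-intent placements is well within the range of high-intent search-adjacent inventory, which is the quantitative version of "yes on commercial queries."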
The weighted-average free-tier query cost is the single most
important number in OpenAI's P&L. It determines whether the free
tier can be sustainably ads-funded at plausible CPMs — and the
answer in 2026 is "yes on commercial queries, not yet on
informational queries, and maybe by 2028 across the full mix if
the compute curve keeps cooperating."
Common misconceptions
- "A ChatGPT query costs the same as an API call for the same prompt." Internal serving cost is lower than external API pricing — OpenAI prices the API to generate margin; internal compute is billed at COGS. External API prices are a ceiling, not the actual cost.
- "Prompt length is what drives cost." Output length drives cost much more. A 500-token prompt with a 2,000-token answer costs more than a 5,000-token prompt with a 200-token answer by a meaningful margin.
- "Cost per query is constant for a given model." It varies widely within a model based on caching, tool calls, and output length. The "average" number hides a wide distribution that matters for strategy.
- "Reasoning models are just more expensive versions of regular models." They're more expensive in two ways simultaneously: more tokens generated per answer, and each token costs more per unit. Those stack multiplicatively, which is why cost per o-series query can be 20–50× higher than the same prompt to GPT-4o.
- "Cost per query will fall to zero." The curve is decelerating. 2027–2028 will probably bring another 30–50% reduction on representative workloads; beyond that, incremental gains get harder as serving efficiency approaches physical limits.
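The multiplicative stacking behind the reasoning-model point can be made concrete; both multipliers are illustrative assumptions:

```python
# Sketch: reasoning-model cost stacks multiplicatively.
# Both factors are illustrative assumptions.
token_multiplier = 5      # reasoning traces generate ~5x the output tokens
per_token_multiplier = 8  # larger model path costs ~8x per token

total_multiplier = token_multiplier * per_token_multiplier  # 40x
# 40x sits inside the 20-50x range quoted for o-series queries.
```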
What comes next
Expect three shifts through 2026–2027. First, routing — ChatGPT will
increasingly decide per-query which model answers, with lightweight
classifiers directing easy prompts to cheaper endpoints. This pushes
average cost per query down faster than raw hardware efficiency would,
and it's already visible in 2026 deployments. Second, pricing surfaces
will blur: some queries will be ad-funded, some subscription-funded,
some enterprise-prepaid, and the "cost per query" framing becomes less
useful than "revenue per query." Third, specialized inference silicon
(from Nvidia, Google, AMD, and several startups) will deliver another
leg of efficiency gains that compound against continued software
improvements.
A fourth, under-appreciated shift: on-device inference. As small
models approach flagship quality while staying small enough to run on
laptop or phone hardware, a fraction of ChatGPT queries will be served
off-device entirely, at zero marginal compute cost to OpenAI. This is
mostly a 2027–2028 story but is already technically feasible for the
cheapest query types.
How to act on this
If you're an engineering leader building on OpenAI APIs, the cost
lever set (caching, tiering, batch execution, output discipline)
delivers 5–10× improvements over baseline. Start with prompt caching;
it's usually the largest single win and the fastest to ship.
If you're a brand, the per-query cost curve is the tailwind behind
every generative surface you're about to advertise in. Cheaper queries
mean more deployment, more embedded AI answers in third-party
products, and more surfaces where a brand can earn or buy a placement.
Thrad helps brands measure, instrument, and activate presence inside
those generative surfaces as the economics keep pulling more of them
online. The cheaper the tokens, the more widespread the surfaces that
carry them, and the more important branded visibility becomes as a
discipline.

Citations:
SemiAnalysis, "LLM Serving Cost Curves 2023–2026," 2026. https://semianalysis.com
OpenAI, "API Pricing — GPT-5, GPT-4o, o-series," 2026. https://openai.com/api/pricing
a16z, "State of LLM Inference Economics," 2025. https://a16z.com
Nvidia, "Inference Efficiency Benchmarks — Blackwell Generation," 2025. https://nvidia.com
Stanford HAI, "AI Index 2026 — Compute Costs," 2026. https://hai.stanford.edu
The Information, "Inside OpenAI's Compute Budget," 2026. https://theinformation.com
Hugging Face, "Serving Efficiency Benchmarks," 2025. https://huggingface.co
