Tags: billing, engineering, ai-gateway

Where streaming AI gateways quietly leak money on every disconnect

ApiLink Team · 9 min read

Most AI gateways advertise “streaming-safe billing” — they refund the pre-deducted estimate when a client disconnects mid-stream. What very few of them say out loud is that the obvious implementation of that refund is exploitable. A specific disconnect pattern turns the refund into free inference. We shipped the same bug. This post is the autopsy, the fix, and a small guide for evaluating other gateways.

Why streaming billing is hard in the first place

When you call POST /v1/chat/completions with stream: true, the gateway has to send the response to your client before it knows the final token count. That is the central awkwardness of streaming billing: the bill is computed from numbers that arrive last, but the bytes have to flow from the moment the upstream first emits.
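To make "the numbers arrive last" concrete, here is a minimal sketch of scanning an OpenAI-style SSE body for a usage object. The frame layout (`data: {json}` lines ending with `data: [DONE]`) follows the OpenAI-compatible convention; the helper name `scanSseForUsage`, the type names, and the sample frames are illustrative assumptions, not our gateway's parser.

```typescript
// Usage object as it appears in OpenAI-compatible streaming chunks.
interface SseUsage {
  prompt_tokens: number;
  completion_tokens: number;
}

interface ScanResult {
  text: string;           // content deltas accumulated so far
  usage: SseUsage | null; // last usage object seen, if any
  sawDone: boolean;       // true once the [DONE] sentinel arrived
}

// Hypothetical helper: walks a buffered SSE body line by line.
// A real gateway would do this incrementally while piping bytes through.
function scanSseForUsage(body: string): ScanResult {
  const result: ScanResult = { text: "", usage: null, sawDone: false };
  for (const line of body.split("\n")) {
    if (!line.startsWith("data:")) continue; // skip blank separator lines
    const payload = line.slice("data:".length).trim();
    if (payload === "[DONE]") {
      result.sawDone = true;
      break;
    }
    const chunk = JSON.parse(payload);
    const delta = chunk.choices?.[0]?.delta?.content;
    if (typeof delta === "string") result.text += delta;
    if (chunk.usage) result.usage = chunk.usage as SseUsage;
  }
  return result;
}

// Made-up frames: two content deltas, a usage frame, then the sentinel.
const cleanBody = [
  'data: {"choices":[{"delta":{"content":"Hel"}}]}',
  "",
  'data: {"choices":[{"delta":{"content":"lo"}}]}',
  "",
  'data: {"choices":[],"usage":{"prompt_tokens":12,"completion_tokens":2}}',
  "",
  "data: [DONE]",
  ""
].join("\n");

const clean = scanSseForUsage(cleanBody);
// A client that hangs up after the first frame leaves the usage unknown.
const truncated = scanSseForUsage(cleanBody.split("\n").slice(0, 2).join("\n"));
console.log(clean.usage, truncated.usage);
```

In the truncated case the content has already started flowing, but there is nothing to bill against yet.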

The standard pattern looks like this:

  1. Estimate first. Before opening the upstream connection, count the prompt tokens, multiply by your per-1k input price, add a generous completion-side buffer, and pre-deduct that amount from the user’s balance.
  2. Stream through. Pipe the upstream SSE chunks to the client byte-for-byte. Each chunk usually contains incremental tokens; the last chunk (or a synthetic one we emit) carries the final usage object.
  3. Settle at the end. When the stream closes cleanly, recompute the real cost from the final usage, then refund the difference between the pre-deduction and the actual cost.
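
The arithmetic behind steps 1 and 3 can be sketched as follows. This is a simplified illustration, not our production code: the prices are placeholders, money is kept in integer micro-dollars to avoid floating-point drift, and the names `preDeduct`, `settleClean`, and `maxCompletionTokens` are assumptions made for the example.

```typescript
interface Pricing {
  inputMicrosPer1k: number;  // placeholder price per 1,000 prompt tokens
  outputMicrosPer1k: number; // placeholder price per 1,000 completion tokens
}

interface FinalUsage {
  promptTokens: number;
  completionTokens: number;
}

// Cost of a usage record in micro-dollars, rounded up.
function costMicros(u: FinalUsage, p: Pricing): number {
  const raw = u.promptTokens * p.inputMicrosPer1k + u.completionTokens * p.outputMicrosPer1k;
  return Math.ceil(raw / 1000);
}

// Step 1: hold the worst case, assuming the reply may use the whole
// completion budget (maxCompletionTokens plays the role of the buffer).
function preDeduct(promptTokens: number, maxCompletionTokens: number, p: Pricing): number {
  return costMicros({ promptTokens, completionTokens: maxCompletionTokens }, p);
}

// Step 3: on a clean close, charge the real cost and refund the rest.
// Simplification: if the estimate was too low, this sketch refunds nothing
// and does not model collecting the shortfall.
function settleClean(heldMicros: number, finalUsage: FinalUsage, p: Pricing) {
  const actual = costMicros(finalUsage, p);
  const refund = Math.max(0, heldMicros - actual);
  return { actual, refund };
}

const pricing: Pricing = { inputMicrosPer1k: 3000, outputMicrosPer1k: 15000 };
const held = preDeduct(2000, 1000, pricing);
const settled = settleClean(held, { promptTokens: 2000, completionTokens: 400 }, pricing);
console.log(held, settled.actual, settled.refund); // 21000 12000 9000
```

With these placeholder prices the user is held for 21,000 micro-dollars, charged 12,000, and refunded 9,000 once the final usage is known.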

That algorithm is correct as long as every stream closes cleanly. It falls apart the moment the client hangs up early — and that is where the money starts leaking.

The three ways a stream can die

From the gateway’s perspective, only three things can end a streaming call:

| End state | What happened | Settlement decision |
| --- | --- | --- |
| clean | Upstream sent the final chunk + [DONE] marker. Client got everything. | Settle against final usage. Refund the buffer. |
| upstream_error | Upstream returned a 4xx/5xx mid-stream, or the TCP connection was reset. | Tricky: depends on whether usage was already emitted. |
| client_disconnect | Client closed the socket (Ctrl-C, browser tab closed, abort signal). | Even trickier: the user might want their money back. |

The instinct on rows 2 and 3 is to be generous: the user probably didn’t consume the full output, so don’t bill them for it. Many gateways implement this generosity as “if we don’t have final usage data, fall back to a character-based estimate of what was actually sent.” The character-estimate path is what creates the exploit.

The exploit, step by step

Here is the pattern. It works on any gateway that does character-based fallback whenever final usage is missing.

  1. Attacker sends a very expensive request — say, a long-context Claude Opus 4 prompt that pre-deducts $0.45.
  2. Upstream opens the stream. The very first chunk from most providers carries the prompt token count (because they tokenize before generating). The gateway sees prompt_tokens: 12000 in a usage sub-object — so it correctly flips an internal flag “usage was emitted, this stream is real.”
  3. Attacker reads that first chunk, then immediately closes the connection.
  4. At settlement time, the gateway sees: usage was emitted (so the “upstream never replied” safety net doesn’t trigger), but the final tally is completion_tokens: 0. It falls back to the character estimate. The accumulated content is roughly nothing (the client never read it).
  5. The character estimate computes a near-zero actual cost. The gateway issues a near-full refund. The attacker walks away with effectively free inference, repeatable.

This bug ships in production at more than one well-known gateway as of mid-2026. We are not naming names, but if you operate one, please check before you read further.

The economic damage scales with how greedy the attacker is. A single shot costs the gateway a few cents. Scripted against an enterprise plan whose rate limits apply per key, the loss can reach six figures a month.
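
To see the refund arithmetic from the steps above in one place, here is a hedged sketch of a vulnerable settlement routine. It is a model of the pattern, not any specific gateway's code: `naiveSettle`, `StreamState`, the output price, and the four-characters-per-token heuristic are all assumptions, and we assume the fallback prices only the characters that were delivered. The $0.45 hold from step 1 is expressed as 450,000 micro-dollars.

```typescript
// Placeholder numbers, not any provider's real pricing.
const OUTPUT_MICROS_PER_1K = 75000;
const CHARS_PER_TOKEN = 4; // rough heuristic used by the character fallback

interface StreamState {
  heldMicros: number;             // amount pre-deducted before the call
  usageEverEmitted: boolean;      // upstream sent at least one usage frame
  finalCostMicros: number | null; // null models "no completion count at settlement"
  deliveredChars: number;         // characters actually piped to the client
}

function naiveSettle(s: StreamState): { charged: number; refund: number } {
  let charged: number;
  if (!s.usageEverEmitted) {
    // Safety net: upstream never replied with usage, keep the full hold.
    charged = s.heldMicros;
  } else if (s.finalCostMicros !== null) {
    // Normal path: settle against the final usage.
    charged = Math.min(s.heldMicros, s.finalCostMicros);
  } else {
    // Vulnerable path: usage was seen once, but the final tally is missing,
    // so the cost is guessed from the characters delivered.
    const estTokens = Math.ceil(s.deliveredChars / CHARS_PER_TOKEN);
    charged = Math.ceil((estTokens * OUTPUT_MICROS_PER_1K) / 1000);
  }
  return { charged, refund: s.heldMicros - charged };
}

// The attacker reads the first chunk (8 characters here) and hangs up.
const attack = naiveSettle({
  heldMicros: 450000,
  usageEverEmitted: true,
  finalCostMicros: null,
  deliveredChars: 8
});
console.log(attack.charged, attack.refund); // 150 449850
```

Under these assumptions the attacker is charged 150 micro-dollars against a 450,000 hold, and nothing in the routine stops the same request from being repeated.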

The fix we shipped

The wrong fix is “always charge full estimated cost on disconnect” — that turns every legitimate client-side abort (closed laptop lid, mobile network blip) into a full charge. Users notice and rage on Reddit.

The right fix has two parts, and they have to be applied together:

  1. Trust upstream usage when it is final. If the stream closed cleanly and the usage frame had both prompt and completion counts, use those numbers. This is the normal path and should produce a clean refund.
  2. Never trust partial counts after a disconnect. If the settlement is triggered by client_disconnect or stream_error, do not fall back to a character estimate. Bill the full pre-deducted amount. The reasoning: upstream did compute these tokens — you owe the upstream provider for them whether or not your client saw them. Eating the cost yourself just to look generous makes you a free subsidy.

In our gateway the diff was small but surgical:

```typescript
// Before: any "usage was emitted" stream that finished with 0 final
// tokens fell into the character-estimate refund branch. Disconnects
// after the first chunk hit that branch, refunding almost everything.

// After: if the settlement came from a disconnect or stream error,
// charge the full pre-deduction. Skip the refund path entirely.

const isAbortedOrErrored =
  statusOverride === "client_disconnect" ||
  statusOverride === "stream_error";

if (!tokens.usageEverEmitted || isAbortedOrErrored) {
  billingStatus = statusOverride ?? "estimated_no_usage";
  const actualCost = estimatedCost; // floor at pre-deduction, no refund
  // ... log + bill upstream + skip settle()
  return;
}
```

The result: clean streams still refund correctly. Disconnects bill the full pre-deduction, which is approximately what we owe upstream anyway. The exploit surface is gone, and a legitimate abort is never charged more than the amount that was already pre-deducted.
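
For readers who want to run the branch logic in isolation, here is a self-contained sketch built around the same condition. The names `statusOverride`, `usageEverEmitted`, `estimatedCost`, and the `"estimated_no_usage"` status come from the excerpt above; `settleStream`, the `Settlement` shape, the `"settled"` status, and passing the final cost in as a number are assumptions made to keep the example runnable.

```typescript
type StatusOverride = "client_disconnect" | "stream_error";

interface Settlement {
  billingStatus: string;
  actualCost: number; // what the user is charged
  refund: number;     // what goes back to the user's balance
}

// Amounts are integer micro-dollars. finalCost is the cost computed from
// the final usage frame and is only meaningful on a clean close.
function settleStream(
  usageEverEmitted: boolean,
  estimatedCost: number,
  finalCost: number,
  statusOverride?: StatusOverride
): Settlement {
  const isAbortedOrErrored =
    statusOverride === "client_disconnect" ||
    statusOverride === "stream_error";

  if (!usageEverEmitted || isAbortedOrErrored) {
    // Part 2 of the fix: no character estimate, no refund.
    return {
      billingStatus: statusOverride ?? "estimated_no_usage",
      actualCost: estimatedCost,
      refund: 0
    };
  }

  // Part 1 of the fix: trust final usage on a clean close.
  // Simplification: the charge is capped at the pre-deduction.
  const actualCost = Math.min(finalCost, estimatedCost);
  return { billingStatus: "settled", actualCost, refund: estimatedCost - actualCost };
}

const cleanClose = settleStream(true, 450000, 120000);
const hungUp = settleStream(true, 450000, 0, "client_disconnect");
console.log(cleanClose.refund, hungUp.refund); // 330000 0
```

The disconnect case returns before any cost is derived from partial data, which is the property the fix depends on.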

What about legitimate aborts?

A reasonable counter-argument: “What if my user actually closed the tab because the answer was already what they needed at chunk 3 of 50? Now you’re billing them for all 50.”

We argue that this is the correct behavior, for two reasons.

First, the upstream provider already generated the tokens. OpenAI, Anthropic, Google — none of them give you a refund just because your end-user closed their browser. They generated the compute, they bill for it. Any “refund on disconnect” you hand out comes straight out of your gateway margin.

Second, the user already paid the pre-deduction up front. From their perspective, nothing got “extra-charged” — the amount that left their balance is the amount that was budgeted. Refunding the buffer on a disconnect is a nice-to-have, not a guaranteed contract. Saying so honestly in your docs is better than silently absorbing losses to look generous.

How to evaluate other gateways

If you are picking a gateway and want to know if they have this problem, here are three concrete checks you can run:

  1. The README check. Search their docs for “disconnect”, “abort”, or “partial usage.” A gateway that has thought about this will have a paragraph. A gateway that hasn’t will be silent or vague.
  2. The disconnect test. Send a long streaming request. After the first SSE chunk arrives, kill the connection (in fetch land, call abort() on the AbortController whose signal you passed to fetch). Wait 30 seconds, then query your balance. If the deduction is close to zero while upstream definitely generated tokens, that is the bug.
  3. The Anthropic-cache check. Same test, but with a prompt that triggers Anthropic prompt caching. Cache-write tokens carry a 25% premium upstream. A gateway that “forgets” to bill for those when streams die has even more to lose per call. We wrote a separate post about that one.
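
The disconnect test in check 2 could be scripted roughly like this. It is a sketch under assumptions: it relies on the global `fetch` and `AbortController` available in Node 18+ and browsers, it assumes the gateway exposes the OpenAI-compatible `/v1/chat/completions` path, and the function names and the prompt are made up. How you read your balance is gateway-specific, so that step is left to you and only the comparison is shown.

```typescript
// Opens a streaming chat completion, reads the first chunk, then aborts.
// Returns the number of bytes received before hanging up.
async function abortAfterFirstChunk(baseUrl: string, apiKey: string, model: string): Promise<number> {
  const controller = new AbortController();
  const res = await fetch(`${baseUrl}/v1/chat/completions`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`
    },
    body: JSON.stringify({
      model,
      stream: true,
      messages: [{ role: "user", content: "Write a 2000-word essay about rivers." }]
    }),
    signal: controller.signal
  });
  if (!res.ok || res.body === null) {
    throw new Error(`request failed with status ${res.status}`);
  }
  const reader = res.body.getReader();
  const first = await reader.read(); // first SSE bytes from the gateway
  controller.abort();                // hang up, like a closed browser tab
  return first.value?.byteLength ?? 0;
}

// Interprets the balance change after the abort. minExpectedCharge is your
// own lower bound for what the call should have cost (for example, the
// prompt-side cost), in the same unit as the balances.
function looksLikeRefundLeak(balanceBefore: number, balanceAfter: number, minExpectedCharge: number): boolean {
  const charged = balanceBefore - balanceAfter;
  return charged < minExpectedCharge;
}
```

Record your balance, call `abortAfterFirstChunk` with your own base URL, key, and model, wait the 30 seconds, read the balance again, and pass both values to `looksLikeRefundLeak`. Run it against an account you own and keep the request count small.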

Closing

Streaming billing is one of those areas where every gateway looks identical from the README, and dramatically different on the wire. Test the disconnect path. If your provider can survive five successive aborted streams without leaking margin, they have done their homework. If they refund you the whole thing, you have either found a free inference glitch — which they will eventually plug — or a company that is silently eating losses they cannot afford. Neither is a good long-term bet.

We shipped the fix described above in our gateway last week. If you run a gateway and want to compare notes, our email is in the footer.
