The Inference Substrate, or Why Every AI App Pays a Tax It Cannot See

This is a different kind of post from the build logs. The sovereign coding agent I wrote about recently, and the little YouTube analyser before it, were concrete systems with concrete code. This one is a thesis. It is the worldview that sits underneath the infrastructure work I keep finding myself drawn to, and it is the thing I am most convinced is true about the next few years of AI.

Here it is in one sentence: every application built on a closed AI model is paying a permanent, growing tax, and there is no infrastructure layer collecting the refund.

Table of contents

The tax nobody is itemizing
The empty middle
What goes in the middle
The moat is the traffic, not the trick
Why the model companies cannot build this
The objection that kills most middlemen, and the answer
Aligned pricing: get paid only when the customer saves
Why this is the moment
Why I think it is my problem to work on

The tax nobody is itemizing

If you build on GPT, Claude, or Gemini, you pay three costs that do not show up as line items but are very real.

Cost. Frontier model APIs are expensive, and at scale a single application can burn well over half a million dollars a month in tokens. The uncomfortable part is the direction: total spend does not go down as models get smarter. It goes up. The most capable models tend to carry premium per-token prices, and they tempt you into using them for more things.

Latency. A round trip to a frontier API takes anywhere from 200ms to 2 seconds or more. For anything real-time (voice agents, customer service, a coding assistant that is supposed to feel instant), that round trip is the entire difference between magical and unusable.

Generality. This is the subtle one. Frontier models are trained for everyone and therefore optimized for no one. For any specific application, the overwhelming majority of the model’s capability is dead weight. You are renting a model that can write sonnets, debate philosophy, and speak forty languages, in order to classify support tickets into one of five buckets. You pay for all of it and use almost none of it.

These three are not bugs that a future release fixes. They are structural, and they get worse precisely as models get more powerful. And they apply to every AI application on earth. That is a strange situation: a universal, growing, structural cost, with no infrastructure category dedicated to reducing it.

The empty middle

Draw the stack. At the top is the application layer, where companies build products. At the bottom is the model layer, where OpenAI, Anthropic, and Google operate. In between is nothing.

There should be a layer there. A layer that sits in the path of the traffic, watches how AI is actually being consumed, learns from those patterns, and systematically removes the cost, latency, and generality penalties. Automatically. Continuously. Without demanding that the customer hire an ML team.

That empty middle is the whole opportunity. I have been calling the thing that fills it the Inference Substrate.

What goes in the middle

The substrate does three things, and each one borrows an idea from somewhere infrastructure people already trust.

1. Continuous automated distillation. The system watches an application’s real query traffic and automatically trains a small, task-specific model, on the order of 1 to 3 billion parameters, that handles the bulk of the queries at a fraction of the cost and latency. The frontier model stops being a permanent dependency and becomes a teacher: it generates the training signal, then steps aside for the queries the small model has learned to handle. Done right, 80 to 90 percent of queries get served by distilled models at 10 to 100 times lower cost and 5 to 10 times lower latency. The crucial constraint is that this pipeline has to be fully automated, because the customer does not have, and should not need, an ML team.

2. Semantic speculative execution. This is lifted straight from CPU design. A processor does not wait to be told what to do next; it predicts the likely branch and computes ahead. The substrate does the same with queries: a lightweight local model watches the session context, predicts the most likely next few queries, and pre-computes their answers before the user asks. When the prediction lands, perceived latency is close to zero. The hit rate improves the more traffic a domain sees, which is the first hint of the flywheel.

3. Outcome-learned routing. Every query-response pair throws off a signal: the user accepted it, corrected it, or abandoned it. The substrate learns, per application, which model actually wins for which kind of query, and routes dynamically across providers. The important word is learned. This is not benchmark-based routing, which is mostly marketing. It is routing learned from real production outcomes, model-agnostic, and continuously re-learned as providers ship updates and usage shifts.
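To make the first two mechanisms concrete, here is a minimal sketch of a single request path, not a definitive implementation. It assumes a distilled model that returns an answer plus a confidence score, a frontier model used as the fallback teacher, and a predictor that guesses follow-up queries from session history. Every name in it (`Substrate`, `serve`, `confidence_threshold`, the toy models) is hypothetical, and it matches prefetched answers by exact string where a real system would need semantic matching.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical interfaces: plain callables stand in for real models.
SmallModel = Callable[[str], Tuple[str, float]]   # (answer, confidence in [0, 1])
FrontierModel = Callable[[str], str]
NextQueryPredictor = Callable[[List[str]], List[str]]


@dataclass
class Substrate:
    small: SmallModel
    frontier: FrontierModel
    predict_next: NextQueryPredictor
    confidence_threshold: float = 0.9  # assumed cutoff, would be tuned per app
    prefetched: Dict[str, str] = field(default_factory=dict)
    training_pairs: List[Tuple[str, str]] = field(default_factory=list)
    history: List[str] = field(default_factory=list)

    def _answer(self, query: str) -> Tuple[str, str]:
        """Try the distilled model first; fall back to the frontier model."""
        answer, confidence = self.small(query)
        if confidence >= self.confidence_threshold:
            return answer, "distilled"
        # The frontier model acts as teacher: its answer is served and also
        # logged as a training pair for the next distillation run.
        teacher_answer = self.frontier(query)
        self.training_pairs.append((query, teacher_answer))
        return teacher_answer, "frontier"

    def serve(self, query: str) -> Tuple[str, str]:
        """Serve one query, then precompute answers to predicted follow-ups."""
        if query in self.prefetched:
            result = (self.prefetched.pop(query), "prefetched")
        else:
            result = self._answer(query)
        self.history.append(query)
        for guess in self.predict_next(self.history):
            if guess not in self.prefetched:
                self.prefetched[guess] = self._answer(guess)[0]
        return result


# Toy stand-ins so the sketch runs end to end.
def toy_small(query: str) -> Tuple[str, float]:
    known = {"reset password": "Use the reset link on the login page."}
    return (known[query], 0.97) if query in known else ("", 0.1)


def toy_frontier(query: str) -> str:
    return f"[frontier] answer to: {query}"


def toy_predictor(history: List[str]) -> List[str]:
    return ["reset password"] if history[-1] == "cannot log in" else []


substrate = Substrate(toy_small, toy_frontier, toy_predictor)
first = substrate.serve("cannot log in")    # served by the frontier model
second = substrate.serve("reset password")  # served from the prefetched answer
```

In this toy session the first query misses the distilled model and goes to the frontier, which also adds one pair to `training_pairs`. The predictor then guesses the follow-up, so the second query is answered from `prefetched` without calling any model at request time.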

The moat is the traffic, not the trick

None of those three mechanisms is an unrepeatable secret. The moat is what they produce together: a compounding data flywheel.

More queries through the substrate produce better distillation recipes, higher prediction accuracy, and smarter routing. That lowers cost and latency. That attracts more customers. That generates more queries. Run that loop at scale for eighteen months and the accumulated intelligence is not something a competitor can clone from a standing start, because they do not have the traffic that produced it.
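The "smarter routing" leg of that loop can be sketched as a small bandit that keeps acceptance counts per query class and per model. This is an illustrative epsilon-greedy router under assumed names (`OutcomeRouter`, `record`, `choose`, and the model labels are all hypothetical), not a description of any production system. The point it shows is narrow: the router's choices are only as good as the outcome data it has seen.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


class OutcomeRouter:
    """Epsilon-greedy router: per query class, prefer the model users accept most."""

    def __init__(self, models: List[str], epsilon: float = 0.1, seed: int = 0):
        self.models = list(models)
        self.epsilon = epsilon            # fraction of traffic spent exploring
        self.rng = random.Random(seed)
        # (query_class, model) -> [accepted_count, total_count]
        self.stats: Dict[Tuple[str, str], List[int]] = defaultdict(lambda: [0, 0])

    def acceptance_rate(self, query_class: str, model: str) -> float:
        accepted, total = self.stats[(query_class, model)]
        # Optimistic default so an untried model is not ruled out by missing data.
        return accepted / total if total else 1.0

    def best(self, query_class: str) -> str:
        return max(self.models, key=lambda m: self.acceptance_rate(query_class, m))

    def choose(self, query_class: str) -> str:
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.models)   # explore
        return self.best(query_class)             # exploit what was learned

    def record(self, query_class: str, model: str, accepted: bool) -> None:
        entry = self.stats[(query_class, model)]
        entry[0] += int(accepted)
        entry[1] += 1


# Replay some made-up production outcomes: (accepted, total) per class and model.
router = OutcomeRouter(["model_a", "model_b", "distilled_2b"], epsilon=0.0)
observed = {
    ("classify", "model_a"): (6, 10),
    ("classify", "model_b"): (7, 10),
    ("classify", "distilled_2b"): (9, 10),
    ("refactor", "model_a"): (9, 10),
    ("refactor", "model_b"): (8, 10),
    ("refactor", "distilled_2b"): (2, 10),
}
for (query_class, model), (accepted, total) in observed.items():
    for i in range(total):
        router.record(query_class, model, accepted=i < accepted)

classify_choice = router.choose("classify")  # "distilled_2b"
refactor_choice = router.choose("refactor")  # "model_a"
```

With exploration switched off (`epsilon=0.0`) the router sends classification to the small model and refactoring to the larger one, purely because of the recorded outcomes. A competitor starting with empty `stats` has nothing to exploit, which is the flywheel argument in miniature.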

This is exactly the Cloudflare shape. Cloudflare’s moat was never the initial reverse proxy. It was the intelligence accumulated from sitting at the center of an enormous fraction of web traffic. The product got you in the door; the position made you indispensable. The substrate is the same bet, applied to inference instead of HTTP.

Why the model companies cannot build this

This is the part I find most clarifying. The substrate only works if it is genuinely model-agnostic. It has to route to Claude when Claude is better, and to GPT when GPT is better, and to a 2B distilled model when neither is needed.

OpenAI cannot build a layer whose job is to route traffic away from OpenAI. Anthropic cannot ship something that tells a customer to use GPT for a certain class of query. Their structural incentive is lock-in. The substrate’s structural incentive is neutrality. That position, above all models and loyal to none, is permanently unavailable to anyone who makes a model. It is only available to a neutral party in the middle.

That is not a temporary advantage that a bigger lab erases with a bigger model. It is a structural one.

The objection that kills most middlemen, and the answer

The first thing any serious enterprise says is: “We are not routing our AI traffic through a third party that can read our queries.” They are right to say it, and for most would-be middlemen that is the end of the conversation.

The answer has to be architectural, not a promise. You solve it at the protocol level with zero-knowledge proofs and related privacy-preserving computation, so the substrate learns aggregate patterns and routing signals without ever seeing plaintext. The customer gets the full optimization and the substrate never holds their data in the clear. This is not a privacy feature bolted onto a product. It is a foundational decision that flips regulatory compliance from a blocker into the reason a regulated enterprise chooses you.
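The post does not specify a protocol, so the following is only an illustration of what "learning aggregates without seeing plaintext" can look like. It uses pairwise-masked secure aggregation, which is a different tool from a zero-knowledge proof and is swapped in here because it fits in a few lines. All names (`pairwise_masks`, `mask_update`, `aggregate`, the customer labels, and the counts) are hypothetical, and the sketch ignores key exchange, client dropouts, and malicious participants.

```python
import secrets
from typing import Dict, List

MODULUS = 2**32  # all arithmetic is modulo this value; true sums must stay below it


def pairwise_masks(client_ids: List[str], length: int) -> Dict[str, List[int]]:
    """Build one mask per client such that all masks cancel when summed.

    In a real protocol each pair of clients would derive its shared mask from
    a key exchange. Generating them centrally here is for illustration only.
    """
    masks = {cid: [0] * length for cid in client_ids}
    for i, a in enumerate(client_ids):
        for b in client_ids[i + 1:]:
            shared = [secrets.randbelow(MODULUS) for _ in range(length)]
            for k in range(length):
                masks[a][k] = (masks[a][k] + shared[k]) % MODULUS  # a adds it
                masks[b][k] = (masks[b][k] - shared[k]) % MODULUS  # b subtracts it
    return masks


def mask_update(values: List[int], mask: List[int]) -> List[int]:
    """What a client sends: its private values hidden under its mask."""
    return [(v + m) % MODULUS for v, m in zip(values, mask)]


def aggregate(masked_updates: List[List[int]]) -> List[int]:
    """What the aggregator computes: the masks cancel, leaving only the totals."""
    return [sum(column) % MODULUS for column in zip(*masked_updates)]


# Made-up per-customer counts of accepted responses for three models.
private_counts = {
    "customer_a": [12, 3, 40],
    "customer_b": [7, 9, 21],
    "customer_c": [0, 5, 18],
}
masks = pairwise_masks(list(private_counts), length=3)
masked = [mask_update(private_counts[cid], masks[cid]) for cid in private_counts]
totals = aggregate(masked)  # [19, 17, 79]
```

The aggregator only ever receives the `masked` vectors, each of which looks uniformly random on its own, yet the column sums come out to the true totals. Whether a scheme like this, a trusted execution environment, or an actual zero-knowledge construction is the right fit depends on what exactly the substrate has to learn.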

Aligned pricing: get paid only when the customer saves

The business model should make the incentive impossible to misread. Charge a percentage of the cost reduction delivered, around 20 percent of the savings versus direct API consumption.

If a customer’s monthly spend drops from $500K to $80K through distillation, routing, and caching, the saving is $420K and you charge 20 percent of it: $84K. They keep the other $336K they would otherwise have burned. You only ever earn when they save, and your revenue per customer grows on its own as the system gets better, with no repricing and no upsell. The interests are welded together.
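The arithmetic behind that example is simple enough to write down. This is a small sketch with hypothetical names (`Invoice`, `savings_share_invoice`, `fee_percent`), working in whole dollars and assuming the fee is zero whenever there is no saving.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Invoice:
    savings: int         # baseline spend minus optimized spend, in dollars
    fee: int             # what the substrate charges
    customer_keeps: int  # savings left with the customer after the fee


def savings_share_invoice(baseline: int, optimized: int, fee_percent: int = 20) -> Invoice:
    """Charge a percentage of the saving versus direct API consumption."""
    savings = max(baseline - optimized, 0)  # no saving means no fee
    fee = savings * fee_percent // 100
    return Invoice(savings=savings, fee=fee, customer_keeps=savings - fee)


invoice = savings_share_invoice(baseline=500_000, optimized=80_000)
# Invoice(savings=420000, fee=84000, customer_keeps=336000)
```

The customer's total outlay in this example is $80K of model spend plus the $84K fee, so $164K against the original $500K.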

Why this is the moment

Three things make this acute now rather than someday.

Regulatory tailwind. The EU AI Act and US executive orders are pushing enterprises toward auditable, controllable AI consumption. A layer that gives you observability and control over model usage is sliding from nice-to-have toward compliance requirement.

A multi-model world. The one-model-fits-all era is ending. Every large enterprise is now evaluating three to five providers at once, which makes intelligent routing across them essential rather than optional.

Cost pressure at scale. Early adopters are hitting seven-figure annual API bills and actively shopping for cost-optimization infrastructure. This market did not exist eighteen months ago. It is real now.

Why I think it is my problem to work on

Here is the reframe that made me want to build it. The Inference Substrate is not, at its core, an AI research problem. It is an infrastructure and cryptographic systems problem that happens to be pointed at AI consumption: protocol design, incentive alignment, privacy-preserving computation, and coordination across parties that do not trust each other.

That is the same toolkit I keep reaching for. The sovereign coding agent was, at heart, about running computation inside a boundary you cannot cross and proving what happened inside it. Privacy-preserving learning, verifiable behavior, coordination across parties that do not trust each other: those are crypto-systems problems, and they are exactly the problems the empty middle of the AI stack is full of. I will be writing about more systems built on that toolkit soon, from agents that hold their own money to agents you can constrain at the cryptographic level. The substrate is the largest version of the idea.

The model is rarely the hard part. The layer around it almost always is. That has been the quiet thesis under everything I have written, and the Inference Substrate is just the biggest version of it: an order of magnitude cheaper, faster, and smarter AI consumption, built not by training a better model, but by building the intelligent layer that no model company is allowed to.