Fears Around Inference Costs Are Misguided

At a recent DPP industry event, inference cost was cited as a primary concern around AI adoption. That fear is understandable but misguided. Two pieces this week frame why: Arena's analysis tied to the release of GLM 5.2, and a fantastic (albeit lengthy) post from Ricky Ho.

The Arena price-performance data tells the story visually: per-token costs have fallen by orders of magnitude in under three years, and they continue to drop. For anyone treating current inference pricing as a fixed constraint, the trajectory is unmistakable. What costs a dollar today will cost pennies tomorrow, and what is expensive now becomes a rounding error.

Ho's piece this week tees up the case elegantly.

The pattern is familiar from every prior wave of technology infrastructure. Cheaper storage didn't mean we stored the same data for less. We stored exponentially more. Cheaper bandwidth didn't shrink internet bills. It gave us streaming video. Cheaper cloud infrastructure didn't reduce server purchases. It spawned entirely new categories of software.

This is the Jevons paradox: efficiency doesn't reduce demand. It accelerates it, because falling costs unlock applications that weren't economically viable before.

Ho makes a useful observation about enterprise adoption: "No rational organization staffs every position with Nobel Prize winners." You match capability to the task, and the same will be true for AI models. Most enterprise workloads will run on capable, cost-effective models, with frontier reasoning reserved for the tasks that genuinely require it. The assumption that organizations will converge on a single model for everything may feel logical today, but enterprise technology almost never consolidates that way.

Enterprises operating across multiple territories, languages, rights regimes, and distribution partners will need different model routing depending on cost, speed, workload complexity, data residency, security posture, and local regulation. Selecting the right model for each task becomes a complex optimization problem that goes well beyond benchmark scores.
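To make the routing idea concrete, here is a minimal sketch in Python. It is an illustration, not anyone's actual implementation: the model names, prices, quality tiers, latencies, and regions are all hypothetical, and a real router would weigh many more factors (security posture, local regulation, observed quality per workload). It filters a catalog down to the models that satisfy a task's hard constraints, then picks the cheapest one that remains.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    cost_per_mtok: float      # USD per million tokens (hypothetical figure)
    quality: int              # 1 = basic, 5 = frontier (hypothetical scale)
    latency_ms: int           # typical response latency (hypothetical)
    regions: frozenset[str]   # where the model may process data


@dataclass(frozen=True)
class Task:
    min_quality: int          # lowest capability tier that can do the job
    max_latency_ms: int       # latency budget for the workload
    region: str               # data-residency requirement


def route(task: Task, catalog: list[Model]) -> Model | None:
    """Return the cheapest model meeting every hard constraint, or None."""
    eligible = [
        m for m in catalog
        if m.quality >= task.min_quality
        and m.latency_ms <= task.max_latency_ms
        and task.region in m.regions
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda m: m.cost_per_mtok)


# Hypothetical catalog; none of these names or prices refer to real products.
CATALOG = [
    Model("small-fast", 0.20, 2, 300, frozenset({"us", "eu"})),
    Model("mid-general", 1.50, 3, 800, frozenset({"us", "eu", "apac"})),
    Model("frontier-reasoning", 15.00, 5, 4000, frozenset({"us"})),
]

if __name__ == "__main__":
    subtitle_qc = Task(min_quality=2, max_latency_ms=1000, region="eu")
    rights_analysis = Task(min_quality=4, max_latency_ms=5000, region="us")
    for task in (subtitle_qc, rights_analysis):
        chosen = route(task, CATALOG)
        print(task, "->", chosen.name if chosen else "no eligible model")
```

Even in this toy version, the frontier model is chosen only when a task demands it, and a residency requirement can rule a model out regardless of its benchmark score. When prices fall, only the catalog changes; the routing logic stays the same.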

The same logic will almost certainly extend to generative AI models for creating and processing images, audio, and video. As those models mature and their costs follow the same downward curve, the question won't be whether media companies can afford to use them. It will be whether their operations are designed to take advantage of them at scale.

One implication Ho draws is worth digesting. Orchestration (the systems that manage model selection, context routing, observability, and workflow management across all of this) is positioned to become a consequential layer in the enterprise AI stack.

For media operations specifically, inference cost is not the constraint the industry thinks it is. The real constraint is whether your operations are architected to ride the cost curve as it continues to fall. The companies that build for orchestration, for multi-model flexibility, for agentic workflows that can absorb cheaper inference into more ambitious operations, will find that falling costs don't just reduce bills. They expand what's possible.

The ones waiting for inference to get cheap enough before they start building will find that the companies that started earlier have already compounded their advantage.
