Inference Economics Will Determine AI Winners

Why AI inference costs matter more than model quality. Learn how Microsoft, Adobe, and NVIDIA optimize cost per token, and why startups that ignore unit economics will lose to efficient competitors.

Written by Catherine Neuman

Abstract

Generative AI is fundamentally altering the cost structure of software. Unlike traditional SaaS, where marginal costs approach zero, AI systems incur measurable, usage-linked costs driven by inference, retrieval, orchestration, and infrastructure utilization. These costs scale with engagement, complexity, and model size.

Public disclosures from Microsoft, Adobe, NVIDIA, and DocuSign demonstrate that leading firms are now treating inference efficiency, throughput, and cost per request as strategic performance variables.[1][2][3] This paper argues that cost observability and optimization are becoming as central to competitive advantage as model quality and user experience. Firms that fail to operationalize cost discipline early will face structural disadvantages when competing against scaled incumbents.

1. The Return of Variable Cost in Software

For three decades, software economics were defined by fixed development costs and high marginal profitability. Once a product was built, incremental users generated revenue with minimal incremental cost.

Generative AI breaks this paradigm.

Each AI interaction requires compute, memory, storage, and orchestration. These costs vary by:

  • Model size and architecture
  • Prompt length and retrieval depth
  • Frequency of user engagement
  • System level optimizations such as caching and batching

As a result, revenue and cost are once again directly coupled. Highly engaged users and complex workflows increase both value and expense.

For example, Microsoft has reported processing more than one hundred trillion AI tokens per quarter, approximately a fivefold increase year over year.[4] This volume implies not only extraordinary demand but also rapid growth in variable infrastructure consumption.

Figure: Microsoft quarterly processed tokens.

The implication is clear. In AI driven software, scaling usage without proportional efficiency gains leads to margin compression.
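The coupling of usage and cost can be made concrete with a toy model. In the sketch below, all prices and token volumes are illustrative assumptions, not figures from any company: when subscription revenue is flat while inference cost scales with tokens consumed, gross margin compresses as engagement rises.

```python
def gross_margin(price_per_user, tokens_per_user, cost_per_million_tokens):
    """Toy model: flat subscription revenue vs usage-linked inference cost.

    All inputs are illustrative assumptions, not real company figures.
    """
    cost = tokens_per_user / 1_000_000 * cost_per_million_tokens
    return (price_per_user - cost) / price_per_user

# A light user on a $20 plan leaves a healthy margin...
light = gross_margin(price_per_user=20.0, tokens_per_user=2_000_000,
                     cost_per_million_tokens=2.0)

# ...while a heavy user on the same plan compresses it sharply.
heavy = gross_margin(price_per_user=20.0, tokens_per_user=8_000_000,
                     cost_per_million_tokens=2.0)
```

The point of the sketch is structural, not numerical: the same flat price yields very different margins depending on engagement, which is exactly the dynamic traditional SaaS did not have.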

2. Empirical Evidence from Public Companies

Recent earnings disclosures provide direct evidence that leading firms are prioritizing inference efficiency and cost control.

Adobe

Adobe leadership has stated that it actively monitors cost per inference and manages GPU procurement between reserved and on demand capacity to optimize margins.[5] This indicates that inference economics have become a first class financial control, similar to cloud infrastructure management in the prior SaaS era.

Adobe's continued revenue growth alongside margin management demonstrates that AI cost efficiency is a prerequisite for profitable scaling.[6]

"Adobe's leadership emphasized balancing investments in AI innovation with maintaining strong operating margins, illustrating the need to control economics as AI scales."
Figure: Adobe revenue versus gross margin.

DocuSign

DocuSign has positioned its AI platform as delivering high quality model performance at a low cost per inference.[7] This reflects a shift from feature centric to economics centric product differentiation.

By framing cost per inference as a competitive attribute, DocuSign acknowledges that sustainable AI adoption requires predictable and controlled unit economics.

"We also introduced Docusign Iris, our AI engine purpose-built for agreement management that delivers leading LLM performance at a low cost per inference."[7]

NVIDIA

NVIDIA now markets hardware architectures in terms of cost per inference token and throughput per dollar.[8] This represents a fundamental shift in how AI infrastructure is evaluated.

Rather than raw compute, the key metric is how much usable AI output can be delivered per unit of capital and energy.

"As models evolve and generate more demand and create more tokens, enterprises need to scale their accelerated computing resources to deliver the next generation of AI reasoning tools or risk rising costs and energy consumption… Inference costs have been trending down… thanks to major leaps in optimization."[8]

Microsoft

Microsoft has reported significant gains in AI throughput per unit of infrastructure, enabling materially higher token volumes on the same hardware footprint.[9] This demonstrates that software and systems level optimization can directly reduce effective cost per request.

Taken together, these disclosures show that AI cost efficiency is no longer implicit. It is being measured, reported, engineered, and prioritized.

"Through software optimization alone, we are delivering 90% more tokens for the same GPU compared to a year ago."[9]
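That 90% figure maps directly onto cost per token. A back-of-envelope calculation makes the effect visible; the GPU hourly rate and baseline throughput below are illustrative assumptions, not Microsoft disclosures:

```python
def cost_per_million_tokens(gpu_hour_cost, tokens_per_gpu_hour):
    """Effective inference cost implied by GPU pricing and throughput."""
    return gpu_hour_cost / tokens_per_gpu_hour * 1_000_000

# Assumed baseline: $2.00/GPU-hour, 1M tokens per GPU hour (illustrative).
baseline = cost_per_million_tokens(gpu_hour_cost=2.0,
                                   tokens_per_gpu_hour=1_000_000)

# 90% more tokens on the same hardware, per the disclosure above.
optimized = cost_per_million_tokens(gpu_hour_cost=2.0,
                                    tokens_per_gpu_hour=1_900_000)

reduction = 1 - optimized / baseline  # roughly 47% lower cost per token
```

Whatever the absolute dollar figures, a 1.9x throughput gain cuts effective cost per token by almost half without touching hardware spend.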

3. Competitive Implications

A common assumption among emerging AI companies is that large incumbents enjoy an insurmountable cost advantage due to capital scale and organizational resources. While capital still matters, this assumption is incomplete.

With advances in modern tooling, the level of cost visibility once reserved for teams at large companies is being democratized. New competitors can now run systems that generate detailed telemetry on:

  • Token volume
  • Latency
  • Model utilization
  • Cache performance
  • GPU consumption
  • Cost per request

With contemporary observability and financial tooling, these signals can be transformed into real time cost intelligence. What previously required large finance and operations teams can now be acted on with the right technology stack.
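As a sketch of what such real time cost intelligence can look like, per-request telemetry can be rolled up into cost per request and checked against a budget. The field names, token prices, and threshold below are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class RequestTelemetry:
    # Hypothetical per-request signals; names are illustrative.
    input_tokens: int
    output_tokens: int
    cache_hit: bool

def cost_per_request(t, input_price, output_price):
    """Prices are per million tokens; a cache hit skips the input cost."""
    input_cost = 0.0 if t.cache_hit else t.input_tokens / 1e6 * input_price
    return input_cost + t.output_tokens / 1e6 * output_price

def flag_over_budget(requests, input_price, output_price, budget):
    """Return the requests whose unit cost exceeded the budget."""
    return [t for t in requests
            if cost_per_request(t, input_price, output_price) > budget]
```

In practice these rollups feed dashboards and alerts, but the core transformation is exactly this small: telemetry in, unit economics out.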

This creates a new competitive dynamic. The advantage is no longer organizational size but the degree to which leadership prioritizes economic visibility and optimization.

4. Product Quality Without Economic Scalability Is Not Sustainable

High performing AI models and strong user experience are necessary but insufficient.

If a competing firm can deliver similar value at materially lower cost, it can:

  • Offer more attractive pricing
  • Subsidize free or low cost usage tiers
  • Invest more aggressively in distribution
  • Absorb volatility in infrastructure or model pricing

This transforms cost efficiency into a strategic lever for market expansion and defense.

5. Mechanisms for AI Cost Optimization

Leading firms apply a consistent set of controls.

Model orchestration

Tasks are dynamically routed to models with appropriate cost and performance characteristics.
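A minimal version of such routing can be sketched as follows. The model tiers, prices, and the length-based complexity proxy are all illustrative assumptions; production routers typically use classifiers or task labels instead:

```python
# Hypothetical model tiers; names and per-million-token prices are placeholders.
MODELS = {
    "small": {"price_per_million": 0.5},
    "large": {"price_per_million": 5.0},
}

def route(prompt, complexity_threshold=200):
    """Send only demanding requests to the expensive model.

    Complexity is proxied here by prompt length, purely for illustration.
    """
    return "large" if len(prompt) > complexity_threshold else "small"
```

Even this crude router captures the economics: if most traffic is simple, most traffic runs at a tenth of the cost.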

Inference efficiency

Quantization, batching, and decoding optimizations increase effective throughput and reduce cost per output unit.
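Batching is the simplest of these levers to illustrate: amortizing fixed per-forward-pass overhead across many requests raises effective throughput. A toy latency model, with assumed overhead numbers:

```python
def throughput(batch_size, fixed_overhead_ms=50.0, per_request_ms=10.0):
    """Requests per second for a batched forward pass.

    Assumes a fixed launch overhead amortized across the batch plus a
    per-request compute cost; both values are illustrative, and real
    GPU latency profiles are considerably more complex.
    """
    batch_latency_ms = fixed_overhead_ms + per_request_ms * batch_size
    return batch_size / (batch_latency_ms / 1000.0)

solo = throughput(batch_size=1)     # overhead dominates
batched = throughput(batch_size=16) # overhead amortized across 16 requests
```

The same amortization logic underlies quantization (more requests per unit of memory bandwidth) and decoding optimizations (more tokens per forward pass).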

Retrieval and caching

High cache hit rates and efficient retrieval pipelines prevent redundant computation.
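A sketch of a prompt-level cache with hit-rate tracking is below. A real deployment would use semantic or prefix caching, expiry policies, and shared storage; this version only illustrates the accounting:

```python
import hashlib

class PromptCache:
    """Exact-match prompt cache that tracks its own hit rate."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, prompt, compute):
        """Return a cached result, or call `compute` and cache it."""
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = compute(prompt)
        self._store[key] = result
        return result

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

The hit rate is the metric that matters economically: every hit is an inference the business did not pay for twice.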

Infrastructure strategy

Reserved capacity, hardware generation selection, and workload placement are optimized for cost efficiency.
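The reserved-versus-on-demand tradeoff reduces to a utilization calculation. With assumed hourly rates, purely for illustration, the blended cost of partially reserved capacity can be compared against all on-demand spend:

```python
def blended_cost(hours_needed, reserved_hours, reserved_rate, on_demand_rate):
    """Total GPU cost when reserved capacity covers part of demand.

    Rates are illustrative assumptions. Reserved capacity is paid for
    whether or not it is used; overflow demand spills to on-demand pricing.
    """
    overflow = max(0, hours_needed - reserved_hours)
    return reserved_hours * reserved_rate + overflow * on_demand_rate

# Assumed rates: $1.20/hr reserved vs $2.00/hr on-demand.
partially_reserved = blended_cost(1000, reserved_hours=800,
                                  reserved_rate=1.2, on_demand_rate=2.0)
all_on_demand = blended_cost(1000, reserved_hours=0,
                             reserved_rate=1.2, on_demand_rate=2.0)
```

The catch is the other direction: reserve more hours than the workload uses, and the discount turns into stranded spend, which is why firms like Adobe manage the mix actively.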

Product level governance

Usage limits, credit systems, and feature level metering align demand with cost.
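Feature level metering can be sketched as a credit ledger. The feature names and credit prices below are hypothetical:

```python
# Hypothetical credit prices per feature call.
CREDIT_COSTS = {"summarize": 1, "generate_image": 10, "deep_research": 25}

class CreditLedger:
    """Per-user credit balance that gates feature usage."""

    def __init__(self, balance):
        self.balance = balance

    def charge(self, feature):
        """Deduct credits for a feature call; refuse when over the limit."""
        cost = CREDIT_COSTS[feature]
        if cost > self.balance:
            return False  # demand exceeds plan: deny, throttle, or upsell
        self.balance -= cost
        return True
```

The design choice that matters is pricing credits in rough proportion to underlying inference cost, so that user demand and infrastructure spend stay aligned by construction.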

6. Metrics Required for Competitive Parity

Organizations seeking to compete at scale should track:

  • Cost per thousand tokens or per inference (time series)
  • Tokens per GPU hour
  • GPU utilization
  • Cache hit rate
  • Cost per feature
  • Cost per product
  • Cost per model
  • AI driven gross margin
  • Cost per user by cohort
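Several of these metrics fall out of the same underlying telemetry. A sketch of the core unit-economics rollup, with all inputs illustrative:

```python
def cost_per_thousand_tokens(total_cost, total_tokens):
    """Unit cost normalized to a thousand tokens."""
    return total_cost / total_tokens * 1000

def tokens_per_gpu_hour(total_tokens, gpu_hours):
    """Throughput per unit of infrastructure."""
    return total_tokens / gpu_hours

def ai_gross_margin(revenue, inference_cost):
    """Margin after variable AI costs, as a fraction of revenue."""
    return (revenue - inference_cost) / revenue
```

Tracked as time series and sliced by feature, product, model, and cohort, these three ratios cover most of the list above.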

7. Strategic Risk of Delayed Optimization

When AI systems are built without economic instrumentation, product decisions embed inefficiencies into prompts, retrieval systems, and user behavior. Once deployed at scale, these inefficiencies become structurally difficult to reverse.

Meanwhile, competitors that optimize continuously accumulate cost advantages that compound over time, allowing them to undercut legacy incumbents.

Figure: AI cost as a percentage of revenue.

Conclusion

Generative AI has reintroduced variable cost into software. Cost observability and optimization are now core determinants of competitive performance.

The most successful AI driven companies treat inference economics as part of product design, infrastructure planning, and executive decision making. Organizations that do not adopt this discipline will find that even the best product cannot overcome structurally inferior unit economics.

References

[1] Microsoft Corporation. Azure AI and earnings commentary. Microsoft Investor Relations. https://www.microsoft.com/en-us/investor

[2] Adobe Inc. Q3 FY2025 earnings call transcript. Adobe Investor Relations. https://www.adobe.com/investor-relations.html

[3] NVIDIA Corporation. Data center and AI platform disclosures. NVIDIA Investor Relations. https://investor.nvidia.com

[4] Microsoft earnings coverage. The Next Platform. https://www.nextplatform.com

[5] Adobe earnings transcript. Adobe Investor Relations. https://www.adobe.com/cc-shared/assets/investor-relations.html

[6] Adobe Form 10-Q and quarterly financials. Adobe Investor Relations. https://www.adobe.com/investor-relations/financial-documents.html

[7] DocuSign earnings call transcript. DocuSign Investor Relations. https://investor.docusign.com

[8] NVIDIA earnings call and platform architecture disclosures. NVIDIA Investor Relations. https://investor.nvidia.com

[9] Microsoft earnings transcript and AI infrastructure commentary. Microsoft Investor Relations. https://www.microsoft.com/en-us/Investor/earnings.aspx
