Companies Deploy Model Routing, Caching to Cut AI Token Costs

As generative AI bills climb into the hundreds of millions of dollars, enterprises are adopting infrastructure strategies to reduce token consumption without sacrificing productivity.

Omega Editorial· June 18, 2026· 3 min read

Generative AI adoption has created an unexpected financial challenge for enterprises: token costs are spiraling out of control. At least one company has already faced a $500 million AI bill, according to industry reports, prompting IT leaders to urgently seek methods to reduce token consumption while maintaining productivity gains.

Tokens—the fundamental units that large language models use to process text—have become the primary metric for measuring and pricing AI services. Google alone processes approximately 3.2 quadrillion tokens monthly, according to CEO Sundar Pichai. As these volumes grow, so do the bills.

Why it matters

Uncontrolled AI spending threatens to undermine the business case for generative AI adoption. Organizations that fail to implement token management strategies risk budget overruns that could stall or reverse AI initiatives. The emerging solutions—from architectural changes to hardware investments—will shape how enterprises structure their AI infrastructure for years to come.

Model routing lowers per-query costs

One immediate cost-reduction strategy involves routing queries to less expensive models when frontier-level performance isn't required. Pichai noted that Google's Gemini 3.5 Flash delivers comparable capabilities at less than half the price of top-tier models. By mixing Flash with premium models based on task complexity, companies can achieve significant savings.
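The routing idea above can be sketched in a few lines. This is only an illustration under stated assumptions: the tier names, the prices, and the keyword heuristic are all hypothetical and are not taken from Google or any vendor. Production routers typically use a classifier or a confidence score rather than keywords.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str
    usd_per_million_tokens: float  # hypothetical price, for illustration only


# Hypothetical tiers: the cheap one costs less than half the premium one.
CHEAP = ModelTier("small-fast-model", 1.0)
PREMIUM = ModelTier("frontier-model", 2.5)

# Hypothetical signals that a request needs frontier-level reasoning.
COMPLEX_HINTS = ("prove", "refactor", "multi-step", "legal", "architecture")


def route(prompt: str, max_cheap_words: int = 200) -> ModelTier:
    """Return the cheap tier unless the prompt is long or looks complex."""
    text = prompt.lower()
    too_long = len(text.split()) > max_cheap_words
    looks_complex = any(hint in text for hint in COMPLEX_HINTS)
    return PREMIUM if (too_long or looks_complex) else CHEAP


def estimated_cost(tier: ModelTier, tokens: int) -> float:
    """Estimated spend in USD for a given token count on a tier."""
    return tier.usd_per_million_tokens * tokens / 1_000_000
```

With these assumed prices, a short summarization request would go to the cheap tier, while a request mentioning a refactor would go to the premium one. The realized savings depend entirely on how much traffic the cheaper model can safely absorb.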

Deepak Seth, senior director analyst at Gartner, pointed out that many use cases don't require models trained on extensive literary works. "There is sometimes overkill with the [LLMs]," Seth said, as reported by Computerworld. "I don't always need a large language model which has been trained on the works of Charles Dickens and Shakespeare and Harry Potter."

Caching layers reduce redundant processing

Dheeraj Pandey, CEO of DevRev, compared the current token crisis to earlier challenges with cloud computing and virtualization. His company is building memory layers between AI agents and primary data sources like Salesforce or ERP systems. These layers maintain knowledge graphs with answers to common queries and run on cheaper CPUs rather than expensive GPUs.

"Sending agents straight at systems like ServiceNow and Salesforce will burn a lot more tokens," Pandey told Computerworld. "It's also not precise. And finally, it's not safe enough where I can roll it back in case an agent has committed a mistake."
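A heavily simplified sketch of such an intermediate layer follows. It is not DevRev's design: the article describes knowledge graphs, while this example substitutes a plain in-memory dictionary keyed by a normalized query. The class name `AnswerCache` and the `backend` callable are hypothetical stand-ins for whatever expensive agent-to-system call the layer is protecting.

```python
import hashlib
from typing import Callable, Dict


class AnswerCache:
    """In-memory cache placed between an agent and an expensive backend call."""

    def __init__(self, backend: Callable[[str], str]) -> None:
        self._backend = backend          # hypothetical costly call (LLM + source system)
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(query: str) -> str:
        # Normalize case and whitespace so trivially different phrasings share a key.
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def ask(self, query: str) -> str:
        key = self._key(query)
        if key in self._store:
            self.hits += 1
            return self._store[key]      # served without touching the backend
        self.misses += 1
        answer = self._backend(query)
        self._store[key] = answer
        return answer
```

A real deployment would also need invalidation when the underlying records change. This sketch omits that step, although it matters for the precision and rollback concerns Pandey raises.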

Network automation firm NetBrain takes a similar approach: it maps network layouts with conventional computing, then passes only the essential information to AI models for the planning and reasoning tasks where they add the most value.
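One way to picture this division of labor is to run an ordinary graph search in code and send the model only the result. The sketch below is an assumption-laden illustration, not the vendor's implementation: the link list, the device names, and the `build_prompt` helper are all hypothetical.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple


def shortest_path(
    links: List[Tuple[str, str]], src: str, dst: str
) -> Optional[List[str]]:
    """Breadth-first search over an undirected link list; no AI involved."""
    graph: Dict[str, List[str]] = {}
    for a, b in links:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, []).append(a)

    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == dst:
            return path
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


def build_prompt(
    links: List[Tuple[str, str]], src: str, dst: str, question: str
) -> str:
    """Send the model only the computed path, not the whole topology."""
    path = shortest_path(links, src, dst)
    summary = " -> ".join(path) if path else "no path found"
    return f"Path from {src} to {dst}: {summary}\nQuestion: {question}"
```

The prompt then carries a single line of topology instead of the full link table, so the token count stays roughly constant as the network grows.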

Prompt optimization delivers measurable gains

Staffing firm ManpowerGroup reduced token consumption by making prompts more efficient. Users of its internal labor-market tool initially needed 10 follow-up questions to complete a query. After a year of optimization, that number dropped to four, according to Max Leaming, head of data science and AI solutions at ManpowerGroup. Because each follow-up is another model call, fewer follow-ups mean fewer tokens processed.
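A rough back-of-the-envelope model shows why fewer follow-ups can matter more than the raw count suggests. It rests on two assumptions that do not come from ManpowerGroup: that every turn resends the full conversation history as input, and that each turn adds a fixed, hypothetical 500 tokens.

```python
def conversation_input_tokens(follow_ups: int, tokens_per_turn: int) -> int:
    """Total input tokens when each turn resends all earlier turns.

    Turn k sends k * tokens_per_turn tokens, so the total grows
    quadratically with the number of turns (1 initial + follow_ups).
    """
    turns = 1 + follow_ups
    return sum(k * tokens_per_turn for k in range(1, turns + 1))


# Hypothetical figure: 500 tokens added per turn.
before = conversation_input_tokens(follow_ups=10, tokens_per_turn=500)
after = conversation_input_tokens(follow_ups=4, tokens_per_turn=500)
print(before, after)  # 33000 7500
```

Under these assumptions, cutting follow-ups from 10 to four reduces input tokens by far more than 60 percent, because later turns are the most expensive. Actual savings depend on how a given tool manages conversation history.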

On-premises hardware offers unmetered alternatives

Nvidia and Microsoft recently unveiled RTX Spark, an agentic AI desktop PC capable of running 120-billion-parameter models locally on Windows. The goal, according to Microsoft CEO Satya Nadella, is "to deliver unmetered intelligence to every home and every desk with Windows."

Some enterprises are also installing their own AI hardware in data centers through vendors like HPE and Dell, driven by both cost concerns and geopolitical considerations.

Outcome-based pricing on the horizon

Gartner's Seth predicted that token-based pricing will eventually shift toward outcome-based models, where value is measured by results rather than word fragments. "When people start realizing the real cost of tokens, then companies will start looking at token efficiency," Seth said.

These details were first reported by Computerworld.

This is an original analysis by the Omega editorial team. Source reporting: AI Watch.
