Why Token Counting Became AI's First Vanity Metric
Meta's internal leaderboards rewarded AI usage volume over business results, repeating a pattern economists identified decades ago.

The problem with measuring AI by the token
When Meta began tracking which employees consumed the most AI tokens, the company created an internal leaderboard that rewarded volume. Engineers ran agents in circles. Workers generated documentation no one would read. Some asked frontier models what to have for lunch, anything to climb the rankings. The behavior looked like AI adoption. The leaderboard measured something else entirely.
This pattern has a name. In 1975, economist Charles Goodhart observed that "any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes." Anthropologist Marilyn Strathern later simplified it: "When a measure becomes a target, it ceases to be a good measure." Token consumption at Meta stopped being a signal of AI adoption and became the objective itself.
Why it matters
Companies are spending $30 to $40 billion on generative AI, yet MIT research shows 95% of pilot programs fail to deliver measurable business impact. Understanding why organizations repeatedly optimize for the wrong metrics—and how to avoid it—determines whether AI investments produce value or just activity.
The mechanism behind metric gaming
Researchers call this failure mode "surrogation"—when managers treat a metric as the goal rather than a proxy for it. The conditions are predictable: the strategic objective is abstract (AI transformation), the metric is concrete (token counts), and employees accept the substitution. Economists Bengt Holmström and Paul Milgrom formalized why this happens in their 1991 multitasking model: when only some tasks are easy to measure, incentive structures push effort toward those tasks and away from harder-to-measure work that actually matters.
Palantir CEO Alex Karp's critique of "tokenmaxxing" landed on this structural problem. Employees "are just sitting there all day" consuming AI output without producing business results, he argued. COO Shyam Sankar was more direct: "More tokens means more slop."
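The multitasking argument can be illustrated with a toy calculation. This is a simplified sketch, not the Holmström and Milgrom model itself: the fixed effort budget, the square-root value function, and every name in the code are assumptions chosen for illustration. An employee splits one unit of effort between a measured task (token consumption) and an unmeasured one (work that produces results), and is paid only for the measured task.

```python
import math

def agent_pay(share_measured: float, bonus: float) -> float:
    """Pay depends only on the measured task (for example, tokens consumed)."""
    return bonus * share_measured

def firm_value(share_measured: float) -> float:
    """Hypothetical value to the firm: both tasks matter, with diminishing returns."""
    return math.sqrt(share_measured) + math.sqrt(1.0 - share_measured)

def best_share(objective, steps: int = 100) -> float:
    """Grid-search the share of effort on the measured task that maximizes `objective`."""
    candidates = [i / steps for i in range(steps + 1)]
    return max(candidates, key=objective)

# The employee maximizes pay; the firm would prefer the split that maximizes value.
agent_share = best_share(lambda s: agent_pay(s, bonus=1.0))
firm_share = best_share(firm_value)

print(f"Effort share the employee puts on the measured task: {agent_share:.2f}")
print(f"Effort share the firm would prefer: {firm_share:.2f}")
print(f"Firm value under the incentive: {firm_value(agent_share):.2f}")
print(f"Firm value under the preferred split: {firm_value(firm_share):.2f}")
```

Under these assumptions the employee puts all effort into the measured task, while the firm's value peaks at an even split. The specific numbers mean nothing outside the toy setup. The direction of the distortion is the point.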
Every technology wave finds its vanity metric
This failure is not unique to AI. Software engineering spent decades measuring productivity by lines of code, despite Bill Gates noting it's "like measuring progress on an airplane by how much it weighs." A recent NBER working paper found coding agents led to a 741% increase in lines of code but only a 20% rise in actual software releases.
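A quick back-of-the-envelope calculation shows how far the two figures diverge. The sketch below assumes both percentages are measured against the same baseline and period, which the summary above does not confirm; the function names are illustrative.

```python
def growth_factor(percent_increase: float) -> float:
    """Convert a percentage increase into a multiplier (a 20% rise becomes 1.2)."""
    return 1.0 + percent_increase / 100.0

def lines_per_release_multiplier(loc_increase_pct: float,
                                 release_increase_pct: float) -> float:
    """How much the lines of code behind each release grew, given both increases."""
    return growth_factor(loc_increase_pct) / growth_factor(release_increase_pct)

# Figures cited from the NBER working paper: +741% lines of code, +20% releases.
multiplier = lines_per_release_multiplier(741, 20)
print(f"Lines of code per release grew roughly {multiplier:.1f}x")
```

On that assumption, each release carries about seven times as much code as before, which is the gap between activity and output that a line count alone hides.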
Communication tools followed the same trajectory. Research published in Harvard Business Review found collaborative activity consuming 80% or more of knowledge workers' time. Microsoft's 2025 Work Trend Index showed the average worker receives 117 emails and 153 Teams messages per weekday. Volume rose. Whether output rose with it was a separate question, one that volume metrics couldn't answer.
What actually works
Research on surrogation suggests three interventions: involve implementers in strategy formulation, loosen the link between metrics and incentives, and use multiple metrics rather than one. Nicole Forsgren, GitHub's vice president of research and strategy, argues that developer experience can't be reduced to a single dimension and that activity metrics "should never be used in isolation either to reward or to penalize developers."
McKinsey's 2025 state of AI survey found that while 88% of organizations use AI in at least one business function, only 39% reported any enterprise-level earnings impact. Adoption metrics tell one story. Outcome metrics tell another. The gap between them explains why token leaderboards measure everything except what matters.
These details were first reported by AI Watch in Quartz.
This is an original analysis by the Omega editorial team. Source reporting: AI Watch.