AI Math Models Score C+ on First Rigorous Benchmark Test

The First Proof project reveals that large language models can solve advanced mathematical problems, but only with heavy compute costs and expert quality control.

Omega Editorial · June 10, 2026 · 3 min read

Key takeaways

AI models solved 6-7 out of 10 research-level math problems in the First Proof benchmark, with the best system combining multiple LLMs including ChatGPT, Claude, and Gemini.
Success required expensive "scaffolding" where multiple AI models check each other's work and force persistence, with some failed attempts costing nearly $1,000 in compute charges.
The models frequently failed to cite sources properly—behavior that would constitute plagiarism for human researchers—and generated substantial incorrect output requiring expert human review.
Only OpenAI and three academic groups participated after organizers restricted testing to publicly available models rather than proprietary internal systems.
Expert mathematicians spent two days reviewing AI proofs at Harvard, compressing a process that normally takes six months for human-written work.

AI tackles real mathematical research—with mixed results

Large language models have achieved their best performance yet on a new benchmark designed by leading mathematicians to test whether AI can actually assist with research-level mathematics. In the first official round of the First Proof project, AI systems correctly solved six to seven out of ten advanced problems—a C+ grade that reveals both promise and significant limitations.

The benchmark represents a departure from AI companies' internal testing. Organized by mathematicians including Lauren Williams at Harvard University and Mohammed Abouzaid at Stanford University, First Proof focuses on problems that professional mathematicians genuinely care about, not metrics optimized for marketing purposes.

Only OpenAI and three academic groups participated in this round, submitting publicly available models for evaluation. The restriction to public models was deliberate: the team wanted to provide a genuine service to the research community rather than validate proprietary systems.

Why it matters

As AI companies increasingly tout mathematical capabilities as proof of general intelligence, the research community needs independent benchmarks that measure real-world utility. The First Proof results show that while AI can now contribute to mathematical research, the technology remains far from autonomous problem-solving. The economic implications are particularly stark: some failed attempts cost nearly $1,000 in computing charges, raising questions about whether grant funding will increasingly flow to cloud providers rather than researchers.

Scaffolding and stamina drive success

The highest-performing system, IMProofBench from ETH Zurich and Aarhus University, succeeded not through a single model but through elaborate "scaffolding." The system combines ChatGPT with a "council" of other LLMs including Anthropic's Claude and Google's Gemini. When the base model gets stuck or tries to evade a difficult problem, other models check its work and push it to persist.
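The check-and-persist loop described above can be illustrated with a minimal Python sketch. This is not the IMProofBench implementation, whose internals are not described here; every name below (council_solve, Verdict, the stub solver and reviewers) is hypothetical, and the model calls are replaced by plain functions so the example runs without any external service.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Verdict:
    """A reviewer's judgment on one proof attempt (hypothetical type)."""
    accepted: bool
    feedback: str = ""


# Hypothetical signatures: a solver takes (problem, feedback) and returns an
# attempt; a reviewer takes (problem, attempt) and returns a Verdict.
Solver = Callable[[str, str], str]
Reviewer = Callable[[str, str], Verdict]


def council_solve(problem: str, solver: Solver, council: List[Reviewer],
                  max_rounds: int = 5) -> Optional[str]:
    """Ask the solver for attempts until every reviewer accepts one.

    Returns the accepted attempt, or None if max_rounds is exhausted.
    """
    feedback = ""
    for _ in range(max_rounds):
        attempt = solver(problem, feedback)
        verdicts = [review(problem, attempt) for review in council]
        if all(v.accepted for v in verdicts):
            return attempt
        # Collect objections and feed them back so the solver must try again.
        feedback = "\n".join(v.feedback for v in verdicts if not v.accepted)
    return None


# Stand-ins for real models, for illustration only.
def stub_solver(problem: str, feedback: str) -> str:
    # Gives up on the first try, then produces a proof once pushed.
    return "full proof" if feedback else "I cannot solve this"


def strict_reviewer(problem: str, attempt: str) -> Verdict:
    if "cannot" in attempt:
        return Verdict(False, "Do not give up; attempt a complete argument.")
    return Verdict(True)


def lenient_reviewer(problem: str, attempt: str) -> Verdict:
    return Verdict(True)


result = council_solve("problem statement", stub_solver,
                       [strict_reviewer, lenient_reviewer])
```

In this sketch the cap on rounds is the cost control: each extra round means another set of model queries, which is how a failed attempt can run up charges without producing a correct proof.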

OpenAI's ChatGPT-5.5 Pro, which solved four to five problems correctly, employs a similar multi-model architecture behind its unified interface. The AI's advantage lies in computational stamina: in one case, a model successfully executed a strategy that human mathematicians had identified but found too tedious to pursue manually.

The systems also demonstrated genuine research capabilities, surfacing obscure references from mathematical literature and identifying novel applications of established techniques.

Garbage output and ethical concerns

The models generated substantial amounts of incorrect or nonsensical output alongside their successes. Expert graders spent two intensive days at Harvard's Center of Mathematical Sciences and Applications reviewing AI-generated proofs—a process that typically takes six months for human-written work.

Williams noted widespread citation failures in the models' output. "If it was a human, one might call it plagiarism," she said, expressing hope that the mathematical community can pressure AI companies toward better scientific ethics.

Abouzaid highlighted the economic concern: some problems accumulated nearly $1,000 in query charges just to produce wrong answers. "I truly believe this is an economic question—about research funding and research productivity," he said.

The team plans to release additional problems for public testing over the coming weeks, with the next official round scheduled for the fall. Funding came from philanthropic foundations and unrestricted donations from major AI companies, including Google and Anthropic, neither of which submitted models for evaluation.

These details were first reported by Scientific American.

This is an original analysis by the Omega editorial team. Source reporting: AI Watch.
