AI Math Models Score C+ on First Rigorous Benchmark Test
The First Proof project reveals that large language models can solve advanced mathematical problems, but only at heavy compute cost and with extensive human quality control.

AI tackles real mathematical research—with mixed results
Large language models have achieved their best performance yet on a new benchmark designed by leading mathematicians to test whether AI can actually assist with research-level mathematics. In the first official round of the First Proof project, AI systems correctly solved six to seven out of ten advanced problems—a C+ grade that reveals both promise and significant limitations.
The benchmark represents a departure from AI companies' internal testing. Organized by mathematicians including Lauren Williams at Harvard University and Mohammed Abouzaid at Stanford University, First Proof focuses on problems that professional mathematicians genuinely care about, not metrics optimized for marketing purposes.
Only OpenAI and three academic groups participated in this round, submitting publicly available models for evaluation. The restriction to public models was deliberate: the team wanted to provide a genuine service to the research community rather than validate proprietary systems.
Why it matters
As AI companies increasingly tout mathematical capabilities as proof of general intelligence, the research community needs independent benchmarks that measure real-world utility. The First Proof results show that while AI can now contribute to mathematical research, the technology remains far from autonomous problem-solving. The economic implications are particularly stark: some failed attempts cost nearly $1,000 in computing charges, raising questions about whether grant funding will increasingly flow to cloud providers rather than researchers.
Scaffolding and stamina drive success
The highest-performing system, IMProofBench from ETH Zurich and Aarhus University, succeeded not through a single model but through elaborate "scaffolding." The system combines ChatGPT with a "council" of other LLMs including Anthropic's Claude and Google's Gemini. When the base model gets stuck or tries to evade a difficult problem, other models check its work and push it to persist.
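The council idea can be pictured as a simple review loop. The Python sketch below is purely illustrative and is not the team's actual code. The function name solve_with_council, the "OK" verdict convention, and the toy stand-in models are all assumptions made for this example; a real system would call LLM APIs where the stubs are.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in: a "model" is any function from prompt text to reply text.
Model = Callable[[str], str]


def solve_with_council(
    problem: str, base: Model, council: List[Model], max_rounds: int = 3
) -> Tuple[str, bool]:
    """Return (attempt, accepted). Reviewers object until every one replies "OK"."""
    attempt = base(problem)
    for _ in range(max_rounds):
        objections = []
        for reviewer in council:
            verdict = reviewer(f"Problem: {problem}\nAttempt: {attempt}")
            if verdict.strip().upper() != "OK":
                objections.append(verdict)
        if not objections:
            return attempt, True  # every reviewer accepted this attempt
        # Feed the objections back so the base model has to try again.
        feedback = "\n".join(objections)
        attempt = base(f"{problem}\nReviewer feedback:\n{feedback}\nPlease revise.")
    return attempt, False  # rounds exhausted; the last revision was not reviewed


def toy_base(prompt: str) -> str:
    # Toy behavior: evades at first, commits to an answer once it sees feedback.
    return "full proof" if "Reviewer feedback" in prompt else "this is too hard"


def toy_reviewer(prompt: str) -> str:
    # Toy behavior: accepts only a committed attempt.
    if "Attempt: full proof" in prompt:
        return "OK"
    return "Do not give up; write out every step."


if __name__ == "__main__":
    print(solve_with_council("Show X", toy_base, [toy_reviewer]))
```

In this toy run the base model dodges the problem, the reviewer objects, and the revised attempt is accepted on the second round. The loop also hints at why such systems are expensive: each round multiplies the number of model queries.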
OpenAI's ChatGPT-5.5 Pro, which solved four to five problems correctly, employs a similar multi-model architecture behind its unified interface. The AI's advantage lies in computational stamina: in one case, a model successfully executed a strategy that human mathematicians had identified but found too tedious to pursue manually.
The systems also demonstrated genuine research capabilities, surfacing obscure references from mathematical literature and identifying novel applications of established techniques.
Garbage output and ethical concerns
The models generated substantial amounts of incorrect or nonsensical output alongside their successes. Expert graders spent two intensive days at Harvard's Center of Mathematical Sciences and Applications reviewing AI-generated proofs—a process that typically takes six months for human-written work.
Williams noted that the models frequently failed to cite the sources they drew on. "If it was a human, one might call it plagiarism," she said, expressing hope that the mathematical community can pressure AI companies toward better scientific ethics.
Abouzaid highlighted the economic concern: some problems accumulated nearly $1,000 in query charges just to produce wrong answers. "I truly believe this is an economic question—about research funding and research productivity," he said.
The team plans to release additional problems for public testing over the coming weeks, with the next official round scheduled for the fall. Funding came from philanthropic foundations and unrestricted donations from major AI companies, including Google and Anthropic, neither of which submitted models for evaluation.
These details were first reported by Scientific American.
This is an original analysis by the Omega editorial team. Source reporting: AI Watch.