AI Math Models Score C+ on First Rigorous Benchmark Test

The First Proof project reveals that large language models can solve advanced mathematical problems, but only at high compute cost and with heavy human quality control.

Omega Editorial · June 10, 2026 · 3 min read

AI tackles real mathematical research—with mixed results

Large language models have achieved their best performance yet on a new benchmark designed by leading mathematicians to test whether AI can actually assist with research-level mathematics. In the first official round of the First Proof project, AI systems correctly solved six to seven out of ten advanced problems—a C+ grade that reveals both promise and significant limitations.

The benchmark represents a departure from AI companies' internal testing. Organized by mathematicians including Lauren Williams at Harvard University and Mohammed Abouzaid at Stanford University, First Proof focuses on problems that professional mathematicians genuinely care about, not metrics optimized for marketing purposes.

Only OpenAI and three academic groups participated in this round, submitting publicly available models for evaluation. The restriction to public models was deliberate: the team wanted to provide a genuine service to the research community rather than validate proprietary systems.

Why it matters

As AI companies increasingly tout mathematical capabilities as proof of general intelligence, the research community needs independent benchmarks that measure real-world utility. The First Proof results show that while AI can now contribute to mathematical research, the technology remains far from autonomous problem-solving. The economic implications are particularly stark: some failed attempts cost nearly $1,000 in computing charges, raising questions about whether grant funding will increasingly flow to cloud providers rather than researchers.

Scaffolding and stamina drive success

The highest-performing system, IMProofBench from ETH Zurich and Aarhus University, succeeded not through a single model but through elaborate "scaffolding." The system combines ChatGPT with a "council" of other LLMs including Anthropic's Claude and Google's Gemini. When the base model gets stuck or tries to evade a difficult problem, other models check its work and push it to persist.
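The check-and-persist loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ETH Zurich and Aarhus implementation: `run_council`, `toy_prover`, `evasion_check`, and `length_check` are hypothetical names, and the "models" are plain functions standing in for real LLM calls.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces (not from the article):
# a prover maps (problem, feedback so far) to a draft proof;
# a reviewer maps (problem, draft) to an objection, or None if it accepts.
Prover = Callable[[str, List[str]], str]
Reviewer = Callable[[str, str], Optional[str]]


def run_council(
    problem: str,
    prover: Prover,
    reviewers: List[Reviewer],
    max_rounds: int = 5,
) -> Tuple[Optional[str], int]:
    """Ask the prover for drafts until every reviewer accepts or rounds run out."""
    feedback: List[str] = []
    for round_no in range(1, max_rounds + 1):
        draft = prover(problem, feedback)
        # Collect every objection raised by the council for this draft.
        objections = [o for r in reviewers if (o := r(problem, draft)) is not None]
        if not objections:
            return draft, round_no
        # Feed objections back so the prover is pushed to try again.
        feedback.extend(objections)
    return None, max_rounds


# Toy stand-ins for real models, for demonstration only.
def toy_prover(problem: str, feedback: List[str]) -> str:
    if not feedback:
        # First attempt: the base model tries to evade the problem.
        return "This problem is likely open; no proof attempted."
    return f"Proof of {problem}: full argument after {len(feedback)} objection(s)."


def evasion_check(problem: str, draft: str) -> Optional[str]:
    if "no proof attempted" in draft:
        return "Do not evade; attempt a full proof."
    return None


def length_check(problem: str, draft: str) -> Optional[str]:
    return None if len(draft) > 20 else "Draft is too short to be a proof."


if __name__ == "__main__":
    draft, rounds = run_council("P", toy_prover, [evasion_check, length_check])
    print(rounds, draft)
```

In this toy run the first draft is rejected for evasion and the second is accepted. A real system would replace the stub functions with calls to actual models and would still need expert review of whatever the council accepts.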

OpenAI's ChatGPT-5.5 Pro, which solved four to five problems correctly, employs a similar multi-model architecture behind its unified interface. The AI's advantage lies in computational stamina: in one case, a model successfully executed a strategy that human mathematicians had identified but found too tedious to pursue by hand.

The systems also demonstrated genuine research capabilities, surfacing obscure references from mathematical literature and identifying novel applications of established techniques.

Garbage output and ethical concerns

The models generated substantial amounts of incorrect or nonsensical output alongside their successes. Expert graders spent two intensive days at Harvard's Center of Mathematical Sciences and Applications reviewing AI-generated proofs—a process that typically takes six months for human-written work.

Williams noted widespread citation failures in the AI-generated proofs. "If it was a human, one might call it plagiarism," she said, expressing hope that the mathematical community can pressure AI companies toward better scientific ethics.

Abouzaid highlighted the economic concern: some problems accumulated nearly $1,000 in query charges just to produce wrong answers. "I truly believe this is an economic question—about research funding and research productivity," he said.

The team plans to release additional problems for public testing over the coming weeks, with the next official round scheduled for fall. Funding came from philanthropic foundations and unrestricted donations from major AI companies, including Google and Anthropic, neither of which submitted models for evaluation.

These details were first reported by Scientific American.

This is an original analysis by the Omega editorial team. Source reporting: AI Watch.
