Blackwell Ultra NVL72 Runs 20x More AI Agents Per Megawatt
New AgentPerf benchmark reveals dramatic efficiency gains for agentic workloads that chain dozens of LLM calls together.

NVIDIA's GB300 NVL72 platform delivers 20 times more AI agents per megawatt than its previous-generation Hopper architecture, according to results from AgentPerf, the first industry benchmark designed specifically for agentic AI workloads.
The benchmark, developed by Artificial Analysis, measures performance on tasks that differ fundamentally from those in traditional AI inference tests. Where conversational AI involves a single large language model call and response, agentic AI breaks complex goals into dozens or hundreds of chained LLM calls, each passing an expanding context to the next while executing tool calls for code compilation, database searches, and web browsing.
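To make the effect of expanding context concrete, here is a minimal Python sketch. It is not AgentPerf's methodology: the function name and the token counts are illustrative assumptions, and it assumes each call re-reads the entire accumulated context with no prefix caching.

```python
def chained_prompt_tokens(initial_context, tokens_added_per_step, num_calls):
    """Return the prompt size seen by each call in a chain.

    Assumption: every call appends a fixed number of tokens (model output
    plus tool results) to a shared context that the next call must read.
    """
    sizes = []
    context = initial_context
    for _ in range(num_calls):
        sizes.append(context)              # tokens this call has to process
        context += tokens_added_per_step   # context grows before the next call
    return sizes


# Illustrative numbers only, not benchmark data.
sizes = chained_prompt_tokens(initial_context=2_000,
                              tokens_added_per_step=1_000,
                              num_calls=50)
total_prompt_tokens = sum(sizes)

print(f"first call reads {sizes[0]:,} tokens")
print(f"last call reads {sizes[-1]:,} tokens")
print(f"whole chain reads {total_prompt_tokens:,} tokens")
```

Under these assumptions, the total input processed grows roughly with the square of the number of calls, which is why a 50-step agent is far more than 50 times the work of a single chat turn.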
A New Class of Benchmark
Existing AI inference benchmarks measure how quickly a system responds to individual requests and how many simultaneous requests it can handle. Those metrics fail to capture the multiplicative complexity of agentic workloads, where chained calls, tool delays, and growing context windows stress computing infrastructure in entirely different ways.
AgentPerf addresses this gap by measuring real coding agent trajectories drawn from public repositories across more than 12 programming languages. Agents receive tasks, read files, write and edit code, execute commands, and iterate based on results. Throughout, the benchmark tracks how many concurrent agentic tasks a platform can support while still meeting defined thresholds for responsiveness and output token rate.
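The idea of "concurrent tasks supported within thresholds" can be sketched as follows. This is a hypothetical illustration, not AgentPerf's actual scoring code: the `LoadPoint` fields, the threshold values, and the sample measurements are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class LoadPoint:
    """One hypothetical measurement taken at a given concurrency level."""
    concurrent_agents: int
    response_latency_s: float    # stand-in for the responsiveness metric
    output_tokens_per_s: float   # per-agent output token rate


def max_supported_agents(points, max_latency_s, min_tokens_per_s):
    """Return the highest concurrency that satisfies both thresholds."""
    passing = [
        p.concurrent_agents
        for p in points
        if p.response_latency_s <= max_latency_s
        and p.output_tokens_per_s >= min_tokens_per_s
    ]
    return max(passing, default=0)


# Made-up measurements: latency rises and per-agent rate falls under load.
measurements = [
    LoadPoint(64, 0.8, 90.0),
    LoadPoint(128, 1.4, 70.0),
    LoadPoint(256, 2.6, 45.0),
    LoadPoint(512, 5.9, 20.0),
]

supported = max_supported_agents(measurements,
                                 max_latency_s=3.0,
                                 min_tokens_per_s=40.0)
print(supported)
```

The point of a metric like this is that raw throughput alone does not count: a platform only gets credit for agents it can serve at an acceptable speed.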
In initial results testing DeepSeek V4 Pro, a large mixture-of-experts model representing frontier-class capabilities, the GB300 NVL72 achieved the highest performance scores in the benchmark.
Full-Stack Architecture Advantage
The performance gains stem from coordinated design across NVIDIA's entire stack. The GB300 NVL72 connects 72 GPUs into a single rack-scale system, enabling large mixture-of-experts models to distribute execution efficiently at scale. CUDA kernels overlap communication and compute operations, absorbing coordination costs rather than adding latency. NVIDIA TensorRT LLM separates input processing from output generation so that each can be optimized independently, which maintains efficiency as the number of concurrent agent sessions grows.
Why It Matters
For enterprises deploying AI agents at production scale, these metrics translate directly into infrastructure economics. The number of concurrent agentic tasks per accelerator and per megawatt determines how much productive work a given capital and power investment can deliver. As agents move from experimental deployments to production systems handling customer service, code generation, and business process automation, understanding true operational efficiency becomes critical for budgeting and capacity planning.
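A back-of-the-envelope sketch shows how a per-megawatt figure feeds capacity planning. The baseline agent count, rack power, and site budget below are placeholder assumptions, not measured values; only the 20x ratio comes from the reported results.

```python
def agents_per_megawatt(concurrent_agents, power_kw):
    """Concurrent agents supported per megawatt of power drawn."""
    return concurrent_agents * 1000 / power_kw


def agents_for_budget(agents_per_mw, power_budget_mw):
    """Concurrent agents a fixed power budget could support."""
    return agents_per_mw * power_budget_mw


# Placeholder baseline: 100 concurrent agents on a system drawing 100 kW.
baseline_per_mw = agents_per_megawatt(concurrent_agents=100, power_kw=100)

# Apply the reported 20x per-megawatt improvement to that placeholder.
improved_per_mw = baseline_per_mw * 20

# Hypothetical 5 MW site budget.
baseline_capacity = agents_for_budget(baseline_per_mw, power_budget_mw=5)
improved_capacity = agents_for_budget(improved_per_mw, power_budget_mw=5)

print(f"baseline: {baseline_capacity:,.0f} agents in 5 MW")
print(f"improved: {improved_capacity:,.0f} agents in 5 MW")
```

Because power is often the fixed constraint in a data center, the same ratio can also be read the other way: a given agent workload would need one-twentieth of the power.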
Production Deployments Underway
Inference providers including Baseten, DeepInfra, and Together AI are already serving agentic workloads on Blackwell infrastructure. Together AI powers real-time inference for Cursor, an AI-powered coding platform where agents debug issues and generate features while developers work. DeepInfra runs Pam.ai, an AI workforce platform for car dealerships that deploys agents to book service appointments and handle sales campaigns.
NVIDIA's Vera Rubin architecture is now in full production, bringing additional infrastructure capacity for agentic AI demands.
These details were first reported by NVIDIA.
This is an original analysis by the Omega editorial team. Source reporting: AI Watch.
