Math Benchmark Tests - Search News

AI’s math problem: FrontierMath benchmark shows how far technology still has to go

On benchmarks like GSM-8K and MATH, leading AI systems now score over 90%, but those tests are starting to approach saturation. One major issue is data contamination—AI models are often trained ...

Hosted on MSN · 6mon

A new math benchmark just dropped and leading AI models can solve 'less than 2%' of its problems... oh dear

AI models have traditionally not been great at extended reasoning in general, let alone for super-advanced math ... mathematical benchmark capable of really putting them to the test—2% isn ...

Ars Technica · 6mon

New secret math benchmark stumps AI models and PhDs alike

The benchmark tests AI language models (such as GPT-4o, which powers ChatGPT) against original mathematics problems that typically require hours or days for specialist mathematicians to complete.

Phys.org · 6mon

Testing AI systems on hard math problems shows they still perform very poorly

As developers of AI systems work to improve the math skills of their models, they have developed benchmarks to serve as a means to test their progress. Two of the most popular are MATH and GSM8K.

10d

Gemini 2.5 Pro just got Deep Think, a new hypothesis mode

On Tuesday at Google I/O, the company's annual developer conference, Google announced Deep Think, an "enhanced" reasoning ...

The Hechinger Report · 5mon

6 observations from a devastating international math test

Another way of understanding the shrinking middle is to see how few American children met basic math benchmarks. The test found that 13 percent of fourth graders could not add and subtract numbers ...

TechRepublic · 1mon

OpenAI’s o3: AI Benchmark Discrepancy Reveals Gaps in Performance Claims

The FrontierMath benchmark from Epoch AI tests generative models on difficult math problems.

TechCrunch · 1mon

OpenAI’s o3 AI model scores lower on a benchmark than the company initially implied

“We’re seeing [internally], with o3 in aggressive test-time compute settings ... We evaluated the new models on our suite of math and science benchmarks. Results in thread!
