Benchmark Production Graph

Some experts have questioned AIME’s validity as an AI benchmark. Nevertheless, AIME 2025 and older versions of the test are commonly used to probe a model’s math ability. xAI’s graph showed ...
