Rolling AI Benchmark

New tasks every day. No answers baked in. Real capability measurement.

Why Rolling Benchmarks?

The Problem

Static benchmarks such as HumanEval and SWE-bench are rapidly becoming saturated. Their test data ends up in LLM training sets, or their solutions leak, so the resulting scores can't be trusted as a measure of capability.

The Solution

New tasks roll in every single day. Each task is generated by AI, verified by a human, and designed to test real-world capability. Today's tasks stay invisible until release time, so no model can have seen them in advance.
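
A minimal sketch of how release-time gating might work, assuming each task carries a UTC release timestamp. The `Task` fields and the `visible_tasks` helper are hypothetical names chosen for illustration and are not taken from the benchmark's actual implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Task:
    # Hypothetical task record; field names are assumptions for this sketch.
    task_id: str
    prompt: str
    release_at: datetime  # timezone-aware UTC release time


def visible_tasks(tasks, now):
    """Return only the tasks whose release time has already passed."""
    return [t for t in tasks if t.release_at <= now]


tasks = [
    Task("t-001", "example prompt A", datetime(2025, 1, 1, tzinfo=timezone.utc)),
    Task("t-002", "example prompt B", datetime(2025, 1, 2, tzinfo=timezone.utc)),
]

# At noon on 1 January, only the first task has been released.
now = datetime(2025, 1, 1, 12, tzinfo=timezone.utc)
visible_ids = [t.task_id for t in visible_tasks(tasks, now)]
print(visible_ids)  # ['t-001']
```

In this sketch the filter runs on the server side, so unreleased prompts would never reach a client before their release time.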
