Rolling AI Benchmark

New tasks every day. No answers baked in. Real capability measurement.

-- Total Tasks
-- Active Days
-- Solved Tasks

Why Rolling Benchmarks?

The Problem

Static benchmarks like HumanEval and SWE-bench are rapidly saturating. Test data ends up in LLM training sets, or solutions leak. You can't trust the scores.

The Solution

New tasks roll in every single day. Each one is written and verified by humans and tests real-world capability. Today's tasks stay invisible until release time.

The Guarantee

Tasks are not generated by AI. Each task is crafted manually to test genuine problem-solving. Solutions are hashed and timestamped before release to ensure integrity.
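As a rough illustration of hashing and timestamping a solution before release, here is a minimal sketch. It is not the benchmark's actual pipeline: the function names commit_solution and verify_solution are hypothetical, SHA-256 is an assumed choice of hash, and sha256sum (GNU coreutils) is assumed to be installed.

```shell
#!/bin/bash
# Hypothetical helpers, for illustration only.

# Print "<sha256 digest> <UTC timestamp>" for a solution file.
# Publishing this line before release commits to the file's contents.
commit_solution() {
    local file="$1" digest
    digest="$(sha256sum "$file" | cut -d' ' -f1)"
    printf '%s %s\n' "$digest" "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
}

# Succeed only if the file still hashes to the published digest.
verify_solution() {
    local file="$1" expected="$2" actual
    actual="$(sha256sum "$file" | cut -d' ' -f1)"
    [ "$actual" = "$expected" ]
}
```

After release, anyone holding the solution file can run verify_solution against the published digest. A locally generated timestamp proves little on its own, so the record would presumably need to be published somewhere third parties can see before the task goes live.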

Today's Tasks

Loading tasks...

Leaderboard

Loading leaderboard...

Test VM (MacBook Air)

Resettable Testing Environment

A dedicated VM runs on a MacBook Air 7,2 (4 cores, 8 GB RAM) with KVM acceleration. It is reset to a clean state between runs.

ssh [email protected]

Local network only. The VM is pre-configured with Python 3, Node.js, Rust, Go, and Docker. Use the reset script to restore the VM snapshot before each test.
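A quick way to confirm the toolchain inside the VM is to check that each expected command is on PATH. The sketch below is an assumption-laden example: missing_tools is a hypothetical helper, and the command names (python3, node, rustc, go, docker) are guesses at how the listed tools are installed.

```shell
#!/bin/bash
# Hypothetical helper: print each named command that is not found on PATH,
# one per line. Prints nothing when every command is available.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Command names are assumptions; adjust them to match the VM image.
missing="$(missing_tools python3 node rustc go docker)"
if [ -n "$missing" ]; then
    echo "Missing tools:"
    echo "$missing"
else
    echo "All expected tools found."
fi
```

Running this right after a reset gives a cheap sanity check that the snapshot still matches the advertised environment.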

VM Reset Script

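#!/bin/bash
# Reset the benchmark VM to its clean snapshot; stop on the first failure.
set -euo pipefail
# Force the VM off; ignore the error if it is not running.
virsh destroy bench-vm 2>/dev/null || true
# Revert to the "clean-base" snapshot.
virsh snapshot-revert bench-vm clean-base
# Start the VM unless the revert already left it running.
[ "$(virsh domstate bench-vm)" = "running" ] || virsh start bench-vm
echo "VM reset to clean state. Ready for testing."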