New tasks every day. No answers baked in. Real capability measurement.
Static benchmarks such as HumanEval and SWE-bench are rapidly being saturated. Test data ends up in LLM training sets, or solutions leak. You can't trust the scores.
New tasks roll in every single day. Written and verified by humans, each task tests real-world capability. Today's tasks are invisible until release time.
Tasks are not generated by AI. Each task is crafted manually to test genuine problem-solving. Solutions are hashed and timestamped before release to ensure integrity.
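The hash-and-timestamp step can be sketched roughly as follows. This is only an illustration: the source does not say which hash function or record format is used, so SHA-256, the record layout, and the function names `commit_solution` and `verify_solution` are all assumptions.

```shell
#!/bin/bash
# Hypothetical sketch: commit to a solution before release by recording
# its SHA-256 digest together with a UTC timestamp.
# Assumes SHA-256 and a "<timestamp> <digest>" record; neither is confirmed by the source.

# Print "<UTC timestamp> <sha256 digest>" for the given solution file.
commit_solution() {
    local solution_file="$1"
    local digest
    # Reading via stdin makes the substitution fail if the file is missing.
    digest=$(sha256sum < "$solution_file") || return 1
    digest=${digest%% *}    # keep only the hex digest
    printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$digest"
}

# Succeed only if the file still hashes to the previously recorded digest.
verify_solution() {
    local solution_file="$1" expected="$2"
    local digest
    digest=$(sha256sum < "$solution_file") || return 1
    [ "${digest%% *}" = "$expected" ]
}
```

Publishing the record before release and re-running `verify_solution` afterwards would show that the solution was not changed in between, provided the record itself is stored somewhere it cannot be quietly edited.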
The test environment is a dedicated VM hosted on a MacBook Air 7,2 (4 cores, 8 GB RAM) with KVM acceleration. It is reset to a clean state between runs.
ssh [email protected]
Local network only. The VM is pre-configured with Python 3, Node.js, Rust, Go, and Docker. Use the reset script to restore the VM snapshot before each test.
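#!/bin/bash
# Reset the benchmark VM to its clean snapshot.
# Force the VM off; ignore the error if it is not running.
virsh destroy bench-vm 2>/dev/null || true
# Revert to the clean-base snapshot and stop if the revert fails.
virsh snapshot-revert bench-vm clean-base || { echo "Snapshot revert failed." >&2; exit 1; }
# Start the VM unless the snapshot already restored it in a running state.
[ "$(virsh domstate bench-vm)" = "running" ] || virsh start bench-vm || exit 1
echo "VM reset to clean state. Ready for testing."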
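After a reset, the VM may need some time to boot before it accepts SSH connections. A minimal sketch of waiting for it is shown here. The helper names `wait_until` and `port_open`, the script name `reset-vm.sh`, and the `BENCH_HOST` variable are hypothetical placeholders, not part of the original setup.

```shell
#!/bin/bash
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: wait_until <attempts> <delay-seconds> <command> [args...]
wait_until() {
    local attempts="$1" delay="$2"
    shift 2
    local i
    for ((i = 1; i <= attempts; i++)); do
        "$@" && return 0
        sleep "$delay"
    done
    return 1
}

# Succeeds if a TCP connection to host:port can be opened within 2 seconds.
# Relies on bash's /dev/tcp redirection and the coreutils `timeout` command.
port_open() {
    timeout 2 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Example (BENCH_HOST stands in for the VM's address on the local network):
#   ./reset-vm.sh && wait_until 30 2 port_open "$BENCH_HOST" 22
```

Passing the probe in as a command keeps the retry loop independent of how readiness is checked, so the port probe could be swapped for an actual `ssh` invocation.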