Solution: A Benchmark of "Novel Games"

We need benchmarks with "genuine, deep, and continually refreshing novelty." The agent must infer latent dynamics through active, limited interaction. Here are some examples:

Virtual Tools Game:

Uses intuitive physics. You have to pick a tool to get the red ball into the basket. You have never seen this exact puzzle, but you can simulate it in your head.

The Virtual Tools Game

ARC-AGI-3:

An interactive version of ARC. You have to poke and prod the environment to figure out the rules of the game before you can even try to win.

Abstract Reasoning Corpus - AGI - 3

AutumnBench:

Tests defect detection. After exploring, can you spot the one frame where the game's rules are suddenly broken? Spotting it is evidence that you learned the rules.

AutumnBench: World Model Learning in Humans and AI
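
The explore-then-detect idea can be sketched in a few lines. This is a toy illustration only, not the AutumnBench environment or its API: `HiddenRuleGame`, `explore`, and `find_defect` are hypothetical names, and the "game" is assumed to be a counter whose actions each shift it by a hidden amount. The agent spends a small interaction budget inferring those shifts, then scans a replay for the first frame that contradicts what it learned.

```python
import random


class HiddenRuleGame:
    """Hypothetical toy game: each action shifts a counter by a hidden amount."""

    def __init__(self, seed):
        rng = random.Random(seed)
        self.actions = ["a", "b", "c"]
        # Latent dynamics the agent cannot read directly.
        self._delta = {act: rng.choice([-2, -1, 1, 2]) for act in self.actions}
        self.state = 0

    def step(self, action):
        self.state += self._delta[action]
        return self.state


def explore(game, budget):
    """Spend a limited interaction budget to infer each action's effect."""
    learned = {}
    for i in range(budget):
        action = game.actions[i % len(game.actions)]
        before = game.state
        after = game.step(action)
        learned[action] = after - before
    return learned


def find_defect(learned, frames):
    """Return the index of the first frame that violates the learned rules."""
    for i, (before, action, after) in enumerate(frames):
        if action in learned and after - before != learned[action]:
            return i
    return None


game = HiddenRuleGame(seed=0)
learned = explore(game, budget=6)

# Build a replay that follows the true rules, then corrupt one frame.
frames = []
state = game.state
for action in ["a", "b", "c", "a", "b"]:
    nxt = game.step(action)
    frames.append((state, action, nxt))
    state = nxt
before, action, after = frames[3]
frames[3] = (before, action, after + 5)  # injected rule violation

print(find_defect(learned, frames))  # index of the corrupted frame: 3
```

In this sketch the detector only works because exploration covered every action within the budget; with a smaller budget some actions stay unknown and a defect involving them would go unnoticed, which is the sense in which detection reflects what was actually learned.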
