We need benchmarks with "genuine, deep, and continually refreshing novelty." The agent must infer latent dynamics through active, limited interaction. Here are some examples:
Virtual Tools Game:
Tests intuitive physics. You have to pick a tool to get the red ball into the basket. You have never seen this exact puzzle, but you can mentally simulate what each tool would do before you act.
ARC-AGI-3:
An interactive version of ARC. You have to poke and prod the environment to figure out the rules of the game before you can even try to win.
AutumnBench:
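Tests for defect detection. After exploring, can you spot the one frame where the game's rules are suddenly broken? Spotting it is strong evidence that you learned the rules, because you can only notice a violation of rules you have actually modeled.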
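
To make the explore-then-detect idea concrete, here is a minimal toy sketch. It is not AutumnBench's actual environment or API: `HiddenRuleWorld`, `explore`, `make_trajectory`, and `find_defect` are hypothetical names invented for illustration. The agent gets a small interaction budget to infer a hidden rule, then has to point at the one frame that contradicts it.

```python
import random


class HiddenRuleWorld:
    """Toy environment (hypothetical): a token on a ring; each action moves it by a hidden offset."""

    def __init__(self, size=7, seed=0):
        rng = random.Random(seed)
        self.size = size
        # Hidden rule: action name -> offset. The agent never reads this directly.
        self._offsets = {"a": rng.randrange(1, size), "b": rng.randrange(1, size)}
        self.state = 0

    def step(self, action):
        self.state = (self.state + self._offsets[action]) % self.size
        return self.state


def explore(world, budget):
    """Spend a limited interaction budget and record the observed offset per action."""
    learned = {}
    actions = ["a", "b"]
    for i in range(budget):
        action = actions[i % len(actions)]
        before = world.state
        after = world.step(action)
        learned[action] = (after - before) % world.size
    return learned


def make_trajectory(world, actions, defect_at=None):
    """Roll out the true rule, corrupting only the transition into frame `defect_at`."""
    frames = [world.state]
    for t, action in enumerate(actions, start=1):
        nxt = world.step(action)
        if t == defect_at:
            nxt = (nxt + 1) % world.size  # the rule is broken for exactly one transition
            world.state = nxt
        frames.append(nxt)
    return frames


def find_defect(learned, size, frames, actions):
    """Return the index of the first frame that contradicts the learned rule, else None."""
    for t, action in enumerate(actions):
        expected = (frames[t] + learned[action]) % size
        if frames[t + 1] != expected:
            return t + 1
    return None


world = HiddenRuleWorld(size=7, seed=0)
learned = explore(world, budget=4)
actions = ["a", "b", "b", "a", "b"]
frames = make_trajectory(world, actions, defect_at=3)
print(find_defect(learned, world.size, frames, actions))
```

The detector never sees the hidden offsets. It can only flag the broken frame because exploration gave it a model that predicts what each frame should have been.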