A benchmark for the capability existing game benchmarks don't measure: discovering degenerate strategies in a novel rule system, and repairing them with minimal intervention.
executable rules oracle · counter-prior mechanics · 106 documented model confusions · deterministic replay
As a game's component count grows, candidate interactions grow combinatorially while design attention grows linearly. Past a threshold, a game-collapsing degenerate strategy exists by statistical default; the question is only who finds it first. Long-lived complex games stay open because they are continuously audited and patched: every ban list is the empirical signature. Depth is complexity that survived optimization.
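The growth claim can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that a "candidate interaction" is any unordered set of 2 to `max_k` components, and that design attention is a fixed number of reviews per component (`reviews_per_component`, a hypothetical budget). Neither definition comes from the benchmark itself.

```python
from math import comb


def candidate_interactions(n: int, max_k: int = 3) -> int:
    """Count unordered sets of 2..max_k components chosen from n (assumed definition)."""
    return sum(comb(n, k) for k in range(2, max_k + 1))


def review_budget(n: int, reviews_per_component: int = 10) -> int:
    """Hypothetical design-attention budget that grows linearly with n."""
    return n * reviews_per_component


def first_uncovered(max_k: int = 3, reviews_per_component: int = 10) -> int:
    """Smallest component count at which interactions exceed the review budget."""
    n = 2
    while candidate_interactions(n, max_k) <= review_budget(n, reviews_per_component):
        n += 1
    return n


for n in (5, 8, 50, 287):
    print(n, candidate_interactions(n), review_budget(n))
```

Under these assumptions the budget is overtaken at 8 components, and at 287 components there are 3,939,936 candidate sets against a budget of 2,870 reviews. The exact numbers depend entirely on the assumed definitions; the shape of the gap is the point.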
A degenerate strategy is reward hacking with a rulebook. An AI that can run the audit — find the collapse, propose the minimal repair — is doing specification-robustness work: the capability class behind reward-hacking detection, mechanism design, and protocol security.
An executable oracle. A server-authoritative engine implements all 287 abilities with deterministic seeded replay; every benchmark answer is machine-checkable.
Counter-prior by design. CP looks like victory points but is a loss condition. "Start of Turn" actions fire on the opponent's turn. Both players draw from one shared deck. Statuses follow their caster. Genre intuition is systematically wrong here.
A measured confusion corpus. 106 documented rule misreadings by a frontier model during the engine's own development — a labeled dataset of where language-level rule comprehension fails, each paired with the correct rule and a regression test.
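What "deterministic seeded replay" buys can be shown with a minimal sketch. The `replay` function below is a hypothetical toy, not the benchmark's engine: it assumes a match is fully determined by a seed plus an action list, and it borrows only the shared-deck rule from the description above. Re-running the same inputs must reproduce the same event log, which is what makes an answer machine-checkable.

```python
import hashlib
import random


def replay(seed: int, actions: list[str]) -> str:
    """Run a toy match and return a SHA-256 digest of its event log.

    Hypothetical stand-in for an engine: both players draw from one
    shared deck, shuffled by an RNG seeded per match.
    """
    rng = random.Random(seed)   # local RNG, so no global state leaks in
    deck = list(range(20))      # toy shared deck of card ids
    rng.shuffle(deck)
    log = []
    for turn, action in enumerate(actions):
        player = turn % 2
        if action == "draw" and deck:
            log.append(f"{turn}:p{player}:draw:{deck.pop()}")
        else:
            log.append(f"{turn}:p{player}:{action}")
    return hashlib.sha256("\n".join(log).encode()).hexdigest()


actions = ["draw", "pass", "draw", "draw"]
first = replay(42, actions)
second = replay(42, actions)
print(first == second)  # True: same seed and actions give the same log
```

Comparing digests rather than full logs keeps the check cheap. A real harness would presumably also pin the engine version, since a rules change alters the log even under the same seed.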
| tier | task | ground truth |
|---|---|---|
| T1 | Rules comprehension — predict legality and outcomes from card text | engine |
| T2 | Play strength vs. the reference bot ladder | engine |
| T3 | Collapse discovery — find the strategy that voids the design | engine + Collapse Index |
| T4 | Collapse repair — the minimal patch, verified by re-running T3 | engine + design history |
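
The T3/T4 relationship in the table, where a repair counts only if re-running discovery no longer finds the collapse, can be sketched as a small loop. Everything here is a toy assumption: the game is a tiny payoff matrix, "collapse" is approximated as one strategy strictly dominating all others, and a patch is a single payoff change. The benchmark's actual Collapse Index and patch format are not described in this text.

```python
from typing import Optional

Matrix = list[list[float]]  # payoff[i][j]: strategy i played against strategy j


def find_collapse(payoff: Matrix) -> Optional[int]:
    """Toy T3: return a strategy that strictly dominates every other, if any."""
    n = len(payoff)
    for i in range(n):
        if all(
            all(payoff[i][j] > payoff[k][j] for j in range(n))
            for k in range(n)
            if k != i
        ):
            return i
    return None


def apply_patch(payoff: Matrix, cell: tuple[int, int], value: float) -> Matrix:
    """Toy T4 patch: change one payoff entry, leaving the original intact."""
    patched = [row[:] for row in payoff]
    patched[cell[0]][cell[1]] = value
    return patched


def repair_holds(payoff: Matrix, cell: tuple[int, int], value: float) -> bool:
    """Accept the patch only if re-running discovery finds no collapse."""
    return find_collapse(apply_patch(payoff, cell, value)) is None


game = [
    [1.0, 3.0, 3.0],  # strategy 0 beats everything: degenerate
    [0.0, 1.0, 2.0],
    [0.0, 0.0, 1.0],
]
print(find_collapse(game))              # 0
print(repair_holds(game, (0, 2), 0.5))  # True
```

A single changed entry stands in for "minimal intervention" here; measuring minimality in a real rule system would need its own definition.
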

| condition | correct | followed genre prior | abstained |
|---|---|---|---|
| forced choice, no rules | 11/20 | 9/20 | 0/20 |
| abstention allowed | 2/20 | 1/20 | 17/20 |
| rule excerpt provided | 20/20 | 0/20 | 0/20 |
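
The counts in the table convert to rates with simple arithmetic. The snippet below is a convenience sketch that uses only the numbers shown; `results` and `rates` are illustrative names, not part of the benchmark harness.

```python
# Counts copied from the table above; each condition has 20 items.
results = {
    "forced choice, no rules": {"correct": 11, "followed prior": 9, "abstained": 0},
    "abstention allowed": {"correct": 2, "followed prior": 1, "abstained": 17},
    "rule excerpt provided": {"correct": 20, "followed prior": 0, "abstained": 0},
}


def rates(row: dict[str, int]) -> dict[str, float]:
    """Convert per-outcome counts into fractions of the row total."""
    total = sum(row.values())
    return {outcome: count / total for outcome, count in row.items()}


for condition, row in results.items():
    assert sum(row.values()) == 20  # every row accounts for all 20 items
    summary = ", ".join(f"{k}={v:.0%}" for k, v in rates(row).items())
    print(f"{condition}: {summary}")
```

Read as rates: without the rules, 55% of forced-choice answers are correct and 45% follow the genre prior; when abstention is allowed, 85% of items are abstained on; with the rule excerpt provided, all 20 answers are correct.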
Game knowledge base and the confusion corpus:
wiki.burgergames.us
Whitepaper draft, raw pilot data, and the benchmark harness are
available on request while the suite is in research preview.