Burger Games · research preview

Evokers as an AI Benchmark

A benchmark for the capability existing game benchmarks don't measure: discovering degenerate strategies in a novel rule system, and repairing them with minimal intervention.

executable rules oracle counter-prior mechanics 106 documented model confusions deterministic replay

Expected complexity collapse

As a game's component count grows, candidate interactions grow combinatorially while design attention grows linearly. Past a threshold, a game-collapsing degenerate strategy exists by statistical default — the question is only who finds it first. Long-lived complex games stay open because they are continuously audited and patched: every ban list is the empirical signature. Depth is complexity that survived optimization.

A degenerate strategy is reward hacking with a rulebook. An AI that can run the audit — find the collapse, propose the minimal repair — is doing specification-robustness work: the capability class behind reward-hacking detection, mechanism design, and protocol security.

Why Evokers

An executable oracle. A server-authoritative engine implements all 287 abilities with deterministic seeded replay; every benchmark answer is machine-checkable.

Counter-prior by design. CP looks like victory points but is a loss condition. "Start of Turn" actions fire on the opponent's turn. Both players draw from one shared deck. Statuses follow their caster. Genre intuition is systematically wrong here.

A measured confusion corpus. 106 documented rule misreadings by a frontier model during the engine's own development — a labeled dataset of where language-level rule comprehension fails, each paired with the correct rule and a regression test.

Benchmark tiers

tiertaskground truth
T1Rules comprehension — predict legality and outcomes from card textengine
T2Play strength vs. the reference bot ladderengine
T3Collapse discovery — find the strategy that voids the designengine + Collapse Index
T4Collapse repair — the minimal patch, verified by re-running T3engine + design history

Pilot results (June 2026)

Genre-prior trap battery. 20 two-option rules questions where the wrong answer is the MTG/Hearthstone-intuitive one; context-free claude-haiku subjects.
conditioncorrectfollowed genre priorabstained
forced choice, no rules11/209/200
abstention allowed2/201/2017
rule excerpt provided20/2000
Given the text, the model reads perfectly; forced to guess, it coin-flips toward genre priors. The documented real-world failure mode is an agent acting on priors instead of reading rules it had access to — retrieval discipline, not comprehension.
Pool-size dose-response. Fixed bot policy, 24 games per condition: ability-use concentration rises monotonically with pool size (Gini 0.68 at 60 cards → 0.73 at 90 → 0.87 at 120) — directionally consistent with expected collapse, before any optimization pressure.
Planted-collapse detection. One-line degeneracies (free attacks; free fusion-stacking) detected by game-length compression (−49% and −25%) against same-seed baselines — the metrics harness works.
Repair case studies. Presented with evidence of a historical degeneracy, a context-free model reproduced the designer's actual shipped fix exactly in one of three cases ("X-cost actions executed Without Cost resolve at X = 0").

Materials

Game knowledge base and the confusion corpus: wiki.burgergames.us
Whitepaper draft, raw pilot data, and the benchmark harness are available on request while the suite is in research preview.

Evokers is a tabletop card game by Burger Games. This page describes a research preview; the benchmark suite is under active development. Contact: qqtranqq@gmail.com