Burger Games · research preview

Evokers as an AI Benchmark

A benchmark for the capability existing game benchmarks don't measure: discovering degenerate strategies in a novel rule system, and repairing them with minimal intervention.

executable rules oracle counter-prior mechanics 106 documented model confusions deterministic replay

Expected complexity collapse

As a game's component count grows, candidate interactions grow combinatorially while design attention grows linearly. Past a threshold, a game-collapsing degenerate strategy exists by statistical default — the question is only who finds it first. Long-lived complex games stay open because they are continuously audited and patched: every ban list is the empirical signature. Depth is complexity that survived optimization.

A degenerate strategy is reward hacking with a rulebook. An AI that can run the audit — find the collapse, propose the minimal repair — is doing specification-robustness work: the capability class behind reward-hacking detection, mechanism design, and protocol security.

Why Evokers

An executable oracle. A server-authoritative engine implements all 287 abilities with deterministic seeded replay; every benchmark answer is machine-checkable.

Counter-prior by design. CP looks like victory points but is a loss condition. "Start of Turn" actions fire on the opponent's turn. Both players draw from one shared deck. Statuses follow their caster. Genre intuition is systematically wrong here.

A measured confusion corpus. 106 documented rule misreadings by a frontier model during the engine's own development — a labeled dataset of where language-level rule comprehension fails, each paired with the correct rule and a regression test.

Benchmark tiers

tier	task	ground truth
T1	Rules comprehension — predict legality and outcomes from card text	engine
T2	Play strength vs. the reference bot ladder	engine
T3	Collapse discovery — find the strategy that voids the design	engine + Collapse Index
T4	Collapse repair — the minimal patch, verified by re-running T3	engine + design history

Pilot results (June 2026)

Genre-prior trap battery. 20 two-option rules questions where the wrong answer is the MTG/Hearthstone-intuitive one; context-free claude-haiku subjects.

condition	correct	followed genre prior	abstained
forced choice, no rules	11/20	9/20	0
abstention allowed	2/20	1/20	17
rule excerpt provided	20/20	0	0

Given the text, the model reads perfectly; forced to guess, it coin-flips toward genre priors. The documented real-world failure mode is an agent acting on priors instead of reading rules it had access to — retrieval discipline, not comprehension.

Pool-size dose-response. Fixed bot policy, 24 games per condition: ability-use concentration rises monotonically with pool size (Gini 0.68 at 60 cards → 0.73 at 90 → 0.87 at 120) — directionally consistent with expected collapse, before any optimization pressure.

Planted-collapse detection. One-line degeneracies (free attacks; free fusion-stacking) detected by game-length compression (−49% and −25%) against same-seed baselines — the metrics harness works.

Repair case studies. Presented with evidence of a historical degeneracy, a context-free model reproduced the designer's actual shipped fix exactly in one of three cases ("X-cost actions executed Without Cost resolve at X = 0").

Materials

Game knowledge base and the confusion corpus: wiki.burgergames.us
Whitepaper draft, raw pilot data, and the benchmark harness are available on request while the suite is in research preview.

Evokers is a tabletop card game by Burger Games. This page describes a research preview; the benchmark suite is under active development. Contact: qqtranqq@gmail.com