Why an engine makes a good answer key
A world model predicts what happens next. To grade one, you need to know what actually would happen next: exactly, as a distribution, not a sample. Euca can provide that because of the properties on the rest of these pages: the world is deterministic, its logic is rules-as-data with explicit outcome distributions, and it is forkable, so the engine can compute the exact post-action next step without disturbing the run. That exact next-step distribution is the answer key. GTA-Bench is the benchmark built on this: host a world, hide its ground truth behind a server, let an external model predict the next step, and score the prediction against the truth.
Prequential regret, in nats
A contestant is scored by prequential log-loss: at each step it assigns a probability to the realized outcome and accumulates −ln P̂(realized). The engine accumulates the same quantity for its own true distribution, −ln P*(realized). The score is the difference, summed over the steps t of the episode:
regret = Σ_t [−ln P̂(realized_t)] − Σ_t [−ln P*(realized_t)] = Σ_t ln(P*(realized_t) / P̂(realized_t))
Regret is reported in nats (natural-log units). Zero means the model assigned every realized outcome the same probability as the ground truth; positive means the model spent more nats than the ground truth required.
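The regret definition above can be checked by hand on a short episode. The following is a minimal sketch, not part of euca-bench: the function name `prequential_regret` and the example probabilities are made up for illustration.

```python
import math

def prequential_regret(predicted, true):
    """Cumulative regret in nats over one episode.

    predicted[t] and true[t] are the probabilities that the contestant and
    the engine, respectively, assigned to the outcome realized at step t.
    """
    if len(predicted) != len(true):
        raise ValueError("need one probability pair per step")
    model_loss = sum(-math.log(p) for p in predicted)    # sum of -ln P̂(realized)
    reference_loss = sum(-math.log(p) for p in true)     # sum of -ln P*(realized)
    return model_loss - reference_loss

# Three steps; the model is under-confident at the first and last step
# and matches the truth at the second.
p_hat = [0.5, 0.9, 0.25]
p_star = [0.8, 0.9, 0.5]
regret = prequential_regret(p_hat, p_star)
print(f"regret = {regret:.4f} nats")
```

The step where the two probabilities agree contributes nothing, so the total equals ln(0.8/0.5) + ln(0.5/0.25) = ln 3.2.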
Action-conditioned, by construction
A world model must answer “what happens next given this action”, not just “what happens next.” So the contestant is shown the pre-action observation and the action, and is scored against the engine’s exact post-action next-step distribution (the engine forks, applies the action, and reads the truth). A real action genuinely moves the odds, so a model that ignores the action scores strictly worse than one that uses it.
The protocol
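What gets scored can be illustrated without a server. The toy sketch below compares a predictor that conditions on the action with one that ignores it, using expected per-step regret (the KL divergence from the true distribution to the prediction). The two-action world, the outcome names, and the helper names are hypothetical and are not an Euca rule set or API.

```python
import math

# Toy ground truth (hypothetical): the next-step outcome distribution
# depends on which action is taken.
TRUTH = {
    "push": {"moved": 0.9, "stuck": 0.1},
    "wait": {"moved": 0.1, "stuck": 0.9},
}

def expected_regret(truth, predict):
    """Expected per-step regret in nats, averaged over equally likely actions.

    For each action this is sum_x P*(x) * ln(P*(x) / P̂(x)).
    """
    total = 0.0
    for action, p_star in truth.items():
        p_hat = predict(action)
        total += sum(p * math.log(p / p_hat[x]) for x, p in p_star.items())
    return total / len(truth)

def action_aware(action):
    # Uses the action, and here happens to match the truth exactly.
    return TRUTH[action]

def action_blind(action):
    # Ignores the action; 50/50 is the best action-independent guess
    # when both actions are equally likely in this toy world.
    return {"moved": 0.5, "stuck": 0.5}

aware = expected_regret(TRUTH, action_aware)
blind = expected_regret(TRUTH, action_blind)
print(f"action-aware: {aware:.4f} nats/step, action-blind: {blind:.4f} nats/step")
```

In this toy world the action-blind predictor pays roughly 0.37 nats per step even though it is the best guess available without the action.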
Evaluation runs over a truth-hiding HTTP boundary. Euca never runs the model; the model connects from outside and only ever sees observations and a prediction schema, never the ground-truth distribution. The boundary is auditable by construction: there is deliberately no route that returns the truth during a scored run.

| Method + path | Returns |
|---|---|
| POST /bench/sessions | Open a session; returns the session id and step budget |
| GET /bench/sessions/{id}/observe | The current observation + the prediction schema (variable names and cardinalities only) |
| POST /bench/sessions/{id}/predict | Submit a probability distribution; the truth is scored server-side and never returned during a scored run |
| GET /bench/sessions/{id}/result | Cumulative regret and the irreducible reference loss for the episode |
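Because the schema exposes only variable names and cardinalities, a uniform prediction is a natural first baseline to submit. The sketch below assumes the schema can be read as a mapping from variable name to cardinality and that a prediction is one probability list per variable. Both layouts are assumptions, as are the variable names; check the actual observe response before relying on them.

```python
import math

def uniform_prediction(schema):
    """Return a uniform distribution for every variable in the schema.

    `schema` is assumed to map variable name -> cardinality; the real
    response layout may differ.
    """
    prediction = {}
    for name, cardinality in schema.items():
        if cardinality < 1:
            raise ValueError(f"{name}: cardinality must be positive")
        prediction[name] = [1.0 / cardinality] * cardinality
    return prediction

def is_valid_distribution(probs, tol=1e-9):
    """Check that the probabilities are non-negative and sum to 1."""
    return all(p >= 0.0 for p in probs) and math.isclose(sum(probs), 1.0, abs_tol=tol)

schema = {"door_state": 2, "weather": 4}   # hypothetical variables
prediction = uniform_prediction(schema)
print(prediction)
```

A uniform guess over a variable with k values costs ln k nats per step, which gives a simple ceiling to compare a real model against.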
Connect your model
Run a /bench server locally for development, or point your client at the hosted benchmark endpoint.
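One way to drive the four routes over raw HTTP is sketched below with only the Python standard library. The routes and the "practice" flag come from this page; the base URL, the JSON field names ("id", "step_budget"), and the idea that the whole observe response is handed to the predictor are assumptions for illustration, so adjust them to what your server actually returns.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"   # assumption: replace with your server's address

def build_request(method, path, body=None):
    """Build a JSON request for one of the /bench routes."""
    data = None if body is None else json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )

def call(method, path, body=None):
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(build_request(method, path, body)) as resp:
        return json.load(resp)

def run_episode(predict_fn, practice=False):
    """Open a session, predict once per step, and fetch the result.

    predict_fn takes the decoded observe response and returns the
    prediction payload. Field names "id" and "step_budget" are guesses.
    """
    session = call("POST", "/bench/sessions", {"practice": practice})
    sid = session["id"]
    for _ in range(session["step_budget"]):
        observation = call("GET", f"/bench/sessions/{sid}/observe")
        call("POST", f"/bench/sessions/{sid}/predict", predict_fn(observation))
    return call("GET", f"/bench/sessions/{sid}/result")
```

Nothing runs against the network until `run_episode` is called, for example `run_episode(my_model.predict, practice=True)` with a hypothetical `my_model`.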
A Rust BenchClient ships with euca-bench. A packaged Python client (euca_bench) that wraps these calls is release-pending; until it is published, call the raw HTTP routes listed above (any language works). Pass "practice": true to the session to get the realized outcome and per-step regret back while developing; a scored session reveals neither.