This track is one application built on the engine, not the engine itself. If you are here to build or run worlds, start with the Quickstart; the engine comes first. This page is for research labs evaluating world models against Euca’s ground truth.

Why an engine makes a good answer key

A world model predicts what happens next. To grade one, you need to know what actually would happen next — exactly, as a distribution, not a sample. Euca can provide that because of the properties on the rest of these pages: the world is deterministic, its logic is rules-as-data with explicit outcome distributions, and it is forkable so the engine can compute the exact post-action next step without disturbing the run. That exact next-step distribution is the answer key. GTA-Bench is the benchmark built on this: host a world, hide its ground truth behind a server, let an external model predict the next step, and score the prediction against the truth.

Prequential regret, in nats

A contestant is scored by prequential log-loss: at each step it assigns a probability to the realized outcome, and accumulates −ln P̂(realized). The engine accumulates the same quantity for its own true distribution, −ln P*(realized). The score is the difference:
regret = Σ −ln P̂(realized)   −   Σ −ln P*(realized)
         └ contestant ─────┘       └ irreducible floor ┘
A contestant whose prediction is the truth scores zero cumulative regret; per step, the expected gap is the KL divergence from the truth to the prediction, which is never negative. The metric is modality-neutral — it only needs the probability the model assigned to what actually happened — so the same score works for categorical events, structured states, and beyond.
Regret is reported in nats (natural-log units). Zero means a perfectly calibrated world model; positive means the model spent more nats than the ground truth required.
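As a minimal sketch of the bookkeeping described above, the following accumulates both log-losses for a toy two-outcome world and takes their difference. The function name, the outcome labels, and the probabilities are illustrative assumptions, not part of the Euca API.

```python
import math

def cumulative_regret(steps):
    """Each step is (predicted, true, realized): two dicts mapping
    outcome -> probability, plus the outcome that actually occurred.
    Returns (regret, reference_loss), both in nats."""
    contestant_loss = 0.0
    reference_loss = 0.0
    for predicted, true, realized in steps:
        contestant_loss += -math.log(predicted[realized])  # -ln P^(realized)
        reference_loss += -math.log(true[realized])        # -ln P*(realized)
    return contestant_loss - reference_loss, reference_loss

# Hypothetical world: the truth favors "hit", the contestant guesses 50/50.
truth = {"hit": 0.8, "miss": 0.2}
guess = {"hit": 0.5, "miss": 0.5}
steps = [(guess, truth, "hit"), (guess, truth, "hit"), (guess, truth, "miss")]
regret, floor = cumulative_regret(steps)

# Expected per-step gap: KL divergence from the truth to the prediction.
kl = sum(p * math.log(p / guess[o]) for o, p in truth.items())
print(f"regret={regret:.4f} nats, floor={floor:.4f} nats, KL per step={kl:.4f}")
```

On this short run the realized regret is about 0.024 nats, well below three times the expected per-step gap of about 0.193 nats. A single realized step can even contribute negative regret (the "miss" step does here); it is the expectation that is never negative.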

Action-conditioned, by construction

A world model must answer “what happens next given this action”, not just “what happens next.” So the contestant is shown the pre-action observation and the action, and is scored against the engine’s exact post-action next-step distribution (the engine forks, applies the action, and reads the truth). A real action genuinely moves the odds, so a model that ignores the action scores strictly worse in expectation than one that uses it.
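To make the cost of ignoring the action concrete, here is a small sketch with two hypothetical actions whose true next-step distributions differ. The action names, outcomes, and probabilities are invented for illustration, and the two actions are assumed to be taken equally often.

```python
import math

# Hypothetical true next-step distributions, one per action.
TRUTH = {
    "push": {"moved": 0.9, "still": 0.1},
    "wait": {"moved": 0.1, "still": 0.9},
}

def expected_regret(truth, prediction):
    """KL(truth || prediction) in nats: the expected per-step regret."""
    return sum(p * math.log(p / prediction[o]) for o, p in truth.items() if p > 0)

# An action-blind model must give one answer for both actions; with equally
# likely actions, the best single answer is the marginal, here 50/50.
blind = {"moved": 0.5, "still": 0.5}
blind_regret = sum(expected_regret(TRUTH[a], blind) for a in TRUTH) / len(TRUTH)

# An action-aware model that matches the truth for each action.
aware_regret = sum(expected_regret(TRUTH[a], TRUTH[a]) for a in TRUTH) / len(TRUTH)

print(f"blind: {blind_regret:.4f} nats/step, aware: {aware_regret:.4f} nats/step")
```

Under these assumptions even the best action-blind prediction pays roughly 0.37 nats per step, while the action-aware one pays none.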

The protocol

Evaluation runs over a truth-hiding HTTP boundary. Euca never runs the model; the model connects from outside and only ever sees observations and a prediction schema — never the ground-truth distribution. The boundary is auditable by construction: there is deliberately no route that returns the truth during a scored run.
| Method + path | Returns |
| --- | --- |
| POST /bench/sessions | Open a session; returns the session id and step budget |
| GET /bench/sessions/{id}/observe | The current observation + the prediction schema (variable names and cardinalities only) |
| POST /bench/sessions/{id}/predict | Submit a probability distribution; the truth is scored server-side and never returned during a scored run |
| GET /bench/sessions/{id}/result | Cumulative regret and the irreducible reference loss for the episode |

Connect your model

Run a /bench server — locally for development, or point at the hosted benchmark endpoint:
cargo run -p euca-bench --bin bench-server -- localhost:8088
The four endpoints are plain JSON over HTTP, so a contestant in any language is a thin wrapper. Here is one in Python using only the standard library — it submits a predicted distribution for each variable in the schema and never sees the truth:
import json, os, urllib.request

BASE = os.environ.get("EUCA_URL", "http://localhost:8088")  # the /bench server

def _req(method, path, body=None):
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(f"{BASE}{path}", data=data, method=method,
                                 headers={"Content-Type": "application/json"} if data else {})
    with urllib.request.urlopen(req) as r:
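        return json.loads(r.read())

def predict_probabilities(obs, name, k):
    # Placeholder so the script runs end to end: a uniform guess over k outcomes.
    # Replace this with a call into your own model.
    return [1.0 / k] * k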

session = _req("POST", "/bench/sessions", {"env": "block_to_bowl", "seed": 7, "max_steps": 400})
sid = session["session_id"]

while True:
    obs = _req("GET", f"/bench/sessions/{sid}/observe")
    if obs["done"]:
        break
    # obs["observation"] is the world state; obs["action"] is the action being applied.
    per_variable = []
    for name, k in obs["prediction_schema"]:               # (variable, num_outcomes)
        probs = predict_probabilities(obs, name, k)         # your model -> k probs summing to 1
        per_variable.append([name, {"Categorical": {
            "outcomes": [f"o{i}" for i in range(k)], "probs": probs}}])
    _req("POST", f"/bench/sessions/{sid}/predict", {"prediction": {"per_variable": per_variable}})

result = _req("GET", f"/bench/sessions/{sid}/result")
print(result["cumulative_regret_nats"], "vs floor", result["reference_loss_nats"])
A Rust BenchClient ships with euca-bench. A packaged Python client (euca_bench) that wraps these calls is pending release; until it is published, use the raw-HTTP form above (any language works). Pass "practice": true when opening a session to get the realized outcome and per-step regret back while developing; a scored session reveals neither.
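Because the score is a log-loss, a prediction that puts probability zero on the outcome that actually occurs costs an unbounded number of nats. One defensive option, sketched below, is to floor and renormalize a model's raw output before submitting it. The helper name and the `eps` value are assumptions for illustration; this page does not say how the server treats zeros or unnormalized vectors.

```python
def sanitize_probs(raw, eps=1e-6):
    """Floor each entry at eps, then renormalize so the entries sum to 1.

    Keeps every probability strictly positive, so the log-loss on any
    realized outcome stays finite. eps is an arbitrary illustrative choice.
    """
    if not raw:
        raise ValueError("need at least one outcome")
    floored = [max(float(p), eps) for p in raw]
    total = sum(floored)
    return [p / total for p in floored]

# Example: an overconfident model that put all its mass on one outcome.
probs = sanitize_probs([1.0, 0.0, 0.0])
print(probs)
```

The trade-off is a small, bounded cost on steps where the model was right in exchange for avoiding an unbounded cost on a step where it was confidently wrong.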

What a good score looks like

A contestant is scored only on the probability it assigns to what actually happens, so the floor is the world’s own entropy: a perfectly calibrated world model scores zero cumulative regret, and any gap is the nats it spent beyond what the ground truth required. The harness ships with both ends to calibrate against: an oracle contestant that predicts the engine’s own true distribution and scores ≈ 0 regret (a unit test asserts this), and a uniform baseline that spreads probability evenly across outcomes and so pays ln k nats on every k-outcome variable, whatever the truth is. A real world model lands between them; the smaller the regret, the more calibrated it is.
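The two reference points can be worked out by hand for a single variable. The sketch below uses a made-up three-outcome distribution (not taken from any Euca environment) and computes the expected per-step regret of an oracle and of a uniform guess.

```python
import math

def entropy(p):
    """Shannon entropy in nats: the irreducible expected loss per step."""
    return -sum(x * math.log(x) for x in p if x > 0)

def expected_loss(truth, prediction):
    """Expected log-loss (cross-entropy) in nats per step."""
    return -sum(t * math.log(q) for t, q in zip(truth, prediction) if t > 0)

truth = [0.7, 0.2, 0.1]                        # hypothetical true distribution
uniform = [1.0 / len(truth)] * len(truth)

floor = entropy(truth)
oracle_regret = expected_loss(truth, truth) - floor     # predicts the truth itself
uniform_regret = expected_loss(truth, uniform) - floor  # equals ln(3) - entropy

print(f"floor={floor:.4f}, oracle={oracle_regret:.4f}, uniform={uniform_regret:.4f}")
```

For this distribution the uniform guess gives up roughly 0.30 nats per step in expectation. A contestant's per-step regret on a comparable variable can be read against that interval, with the caveat that realized regret on a short run will fluctuate around its expectation.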
A public leaderboard and side-by-side replay video are planned, not yet shipped.