World-model evaluation

This track is an application built on the engine, not the engine itself. If you are here to build or run worlds, start with the Quickstart; the engine comes first. This page is for research labs evaluating world models against Euca’s ground truth.

Why an engine makes a good answer key

A world model predicts what happens next. To grade one, you need to know what actually would happen next: exactly, as a distribution, not a sample. Euca can provide that because of the properties on the rest of these pages: the world is deterministic, its logic is rules-as-data with explicit outcome distributions, and it is forkable so the engine can compute the exact post-action next step without disturbing the run. That exact next-step distribution is the answer key. The benchmark built on it works as follows: host a world, hide its ground truth behind a server, let an external model predict the next step, and score the prediction against the truth.

Prequential regret, in nats

A contestant is scored by prequential log-loss: at each step it assigns a probability to the realized outcome, and accumulates −ln P̂(realized). The engine accumulates the same quantity for its own true distribution, −ln P*(realized). The score is the difference:

regret = Σ −ln P̂(realized)   −   Σ −ln P*(realized)
         └ contestant ─────┘       └ irreducible floor ┘

A contestant whose prediction is the truth scores zero cumulative regret; per step, the expected gap is the KL divergence from the truth to the prediction, which is never negative. The metric is modality-neutral — it only needs the probability the model assigned to what actually happened — so the same score works for categorical events, structured states, and beyond.

Regret is reported in nats (natural-log units; 1 nat = 1/ln 2 ≈ 1.44 bits). Zero means a perfectly calibrated world model; positive means the model spent more nats than the ground truth required.
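As a worked illustration of the formula above, the following sketch computes per-step regret and its expectation for a single two-outcome step. The distributions and outcome names are invented for the example, and the helper functions are not part of any Euca API. It shows one detail worth keeping in mind: the regret of an individual step can be negative when the realized outcome happens to be one the contestant over-weighted, but the expectation under the truth is the KL divergence and cannot go below zero.

```python
import math

def step_regret(p_hat, p_star, realized):
    """Per-step regret in nats: -ln P_hat(realized) minus -ln P_star(realized)."""
    return -math.log(p_hat[realized]) + math.log(p_star[realized])

def expected_step_regret(p_hat, p_star):
    """Expected per-step regret under the truth, i.e. KL(P_star || P_hat)."""
    return sum(p * math.log(p / p_hat[o]) for o, p in p_star.items() if p > 0)

# Hypothetical step: the truth and a contestant's prediction (made-up numbers).
p_star = {"lands": 0.8, "misses": 0.2}
p_hat = {"lands": 0.6, "misses": 0.4}

r_lands = step_regret(p_hat, p_star, "lands")    # contestant under-weighted this outcome
r_misses = step_regret(p_hat, p_star, "misses")  # contestant over-weighted this outcome
kl = expected_step_regret(p_hat, p_star)

print(f"regret if 'lands' is realized:  {r_lands:+.4f} nats")
print(f"regret if 'misses' is realized: {r_misses:+.4f} nats")
print(f"expected regret (KL):           {kl:+.4f} nats")
```

Weighting the two per-outcome regrets by the true probabilities reproduces the KL value, which is why cumulative regret trends upward for any prediction that differs from the truth.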

Action-conditioned, by construction

A world model must answer “what happens next given this action,” not just “what happens next.” So the contestant is shown the pre-action observation and the action, and is scored against the engine’s exact post-action next-step distribution (the engine forks, applies the action, and reads the truth). A real action genuinely moves the odds, so a model that ignores the action scores worse in expectation than one that uses it.
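To make the cost of ignoring the action concrete, here is a small sketch with an invented two-action world. The action names, outcome names, probabilities, and the 50/50 action mix are all assumptions for illustration. The best an action-blind model can do under log-loss is predict the action-averaged marginal, and its expected regret is then the average KL divergence from each action's true distribution to that marginal.

```python
import math

def kl(p, q):
    """KL(p || q) in nats for two distributions over the same outcomes."""
    return sum(pi * math.log(pi / q[o]) for o, pi in p.items() if pi > 0)

# Hypothetical truth: the next-step distribution depends on the action taken.
p_star = {
    "push": {"moved": 0.9, "still": 0.1},
    "wait": {"moved": 0.1, "still": 0.9},
}
action_freq = {"push": 0.5, "wait": 0.5}  # assumed mix of actions in the episode

# Best action-blind prediction: the marginal over actions.
marginal = {
    o: sum(action_freq[a] * p_star[a][o] for a in p_star)
    for o in ("moved", "still")
}

# Expected per-step regret of each strategy, averaged over actions.
blind_regret = sum(action_freq[a] * kl(p_star[a], marginal) for a in p_star)
aware_regret = sum(action_freq[a] * kl(p_star[a], p_star[a]) for a in p_star)

print(f"action-blind expected regret: {blind_regret:.4f} nats/step")
print(f"action-aware expected regret: {aware_regret:.4f} nats/step")
```

The action-blind gap only vanishes when the action has no effect on the outcome distribution, in which case the marginal and the conditionals coincide.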

How the harness ships today

The scoring machinery is an in-process Rust library (euca-online), not a hosted server — it works directly against a running world:

The engine reads its own exact next-step distribution with peek (which equals step — see Query verbs).
To score an action-conditioned prediction, it forks the world, applies the action, and reads the true post-action distribution.
A RegretAccumulator accumulates prequential regret in nats against the realized outcome, step by step.

The oracle ≈ 0 property is asserted by a unit test — a contestant that predicts the engine’s own distribution scores essentially zero cumulative regret. A euca-online-py PyO3 binding exposes peek / step / regret to Python, so an in-process contestant can be written in Python with no network hop.
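The accumulation described above can be sketched in a few lines of plain Python. This is a stand-in for illustration only: `ToyRegretAccumulator`, `run_oracle`, and the fixed three-outcome truth are hypothetical and do not reflect the actual euca-online or euca-online-py API. The sketch samples realized outcomes from a known distribution and checks the oracle property: a contestant that submits the truth itself ends with zero cumulative regret.

```python
import math
import random

class ToyRegretAccumulator:
    """Illustrative stand-in; not the euca-online RegretAccumulator."""

    def __init__(self):
        self.contestant_loss = 0.0  # running sum of -ln P_hat(realized)
        self.reference_loss = 0.0   # running sum of -ln P_star(realized)

    def record(self, p_hat, p_star, realized):
        """Add one step's log-losses for the realized outcome."""
        self.contestant_loss += -math.log(p_hat[realized])
        self.reference_loss += -math.log(p_star[realized])

    @property
    def regret(self):
        """Cumulative prequential regret in nats."""
        return self.contestant_loss - self.reference_loss

def run_oracle(steps=1000, seed=7):
    """Score a contestant that predicts the true distribution at every step."""
    rng = random.Random(seed)
    p_star = {"a": 0.7, "b": 0.2, "c": 0.1}  # hypothetical fixed truth
    outcomes = list(p_star)
    weights = [p_star[o] for o in outcomes]
    acc = ToyRegretAccumulator()
    for _ in range(steps):
        realized = rng.choices(outcomes, weights=weights)[0]
        acc.record(dict(p_star), p_star, realized)  # oracle: prediction == truth
    return acc

acc = run_oracle()
print(f"oracle regret: {acc.regret:.6f} nats, floor: {acc.reference_loss:.2f} nats")
```

Keeping the contestant loss and the reference loss as separate running sums mirrors the two terms of the regret formula and lets the floor be reported alongside the regret.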

Planned: a truth-hiding HTTP boundary

The natural way to grade an external model is over an HTTP boundary that never returns the truth during a scored run: Euca hosts the world, the model connects from outside and only ever sees observations and a prediction schema, and there is deliberately no route that returns the truth during scoring.

This HTTP benchmark server (bench-server) and its client SDKs are a planned delivery surface — not yet on main. The table and code below are the intended shape, not a shipped API. Today, grade against the in-process euca-online library described above.

Method + path	Returns
`POST /bench/sessions`	Open a session; returns the session id and step budget
`GET /bench/sessions/{id}/observe`	The current observation + the prediction schema (variable names and cardinalities only)
`POST /bench/sessions/{id}/predict`	Submit a probability distribution; the truth is scored server-side and never returned during a scored run
`GET /bench/sessions/{id}/result`	Cumulative regret and the irreducible reference loss for the episode

The intended client shape

Once shipped, a /bench server would launch locally for development or point at a hosted endpoint:

# planned — not yet on main
cargo run -p euca-bench --bin bench-server -- 127.0.0.1:8088

The four endpoints would be plain JSON over HTTP, so a contestant in any language is a thin wrapper. The intended Python shape (standard library only):

import json, urllib.request

BASE = "http://localhost:8088"  # your /bench server; point this at the host you're scored against

def _req(method, path, body=None):
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(f"{BASE}{path}", data=data, method=method,
                                 headers={"Content-Type": "application/json"} if data else {})
    with urllib.request.urlopen(req) as r:
        return json.loads(r.read())

session = _req("POST", "/bench/sessions", {"env": "block_to_bowl", "seed": 7, "max_steps": 400})
sid = session["session_id"]

while True:
    obs = _req("GET", f"/bench/sessions/{sid}/observe")
    if obs["done"]:
        break
    # obs["observation"] is the world state; obs["action"] is the action being applied.
    per_variable = []
    for name, k in obs["prediction_schema"]:               # (variable, num_outcomes)
        probs = predict_probabilities(obs, name, k)         # your model -> k probs summing to 1
        per_variable.append([name, {"Categorical": {
            "outcomes": [f"o{i}" for i in range(k)], "probs": probs}}])
    _req("POST", f"/bench/sessions/{sid}/predict", {"prediction": {"per_variable": per_variable}})

result = _req("GET", f"/bench/sessions/{sid}/result")
print(result["cumulative_regret_nats"], "vs floor", result["reference_loss_nats"])

A Rust BenchClient and a packaged Python client (euca_bench) are planned alongside the server above. Until they ship, grade in-process against the euca-online library, or use the raw-HTTP form once the server lands. A "practice": true session is intended to return the realized outcome and per-step regret while developing; a scored session would reveal neither.

What a good score looks like

A contestant is scored only on the probability it assigns to what actually happens, so the floor is the world’s own entropy: a perfectly calibrated world model scores zero cumulative regret, and any gap is the nats it spent beyond what the ground truth required. The harness ships with both ends to calibrate against: an oracle contestant that predicts the engine’s own true distribution scores ≈ 0 regret (a unit test asserts this), and a uniform baseline that pays the maximum-entropy cost, ln k nats per step for a k-outcome variable. A real world model lands between them; the smaller the regret, the closer its predictions are to the truth.
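For a feel of the scale between the two reference points, the sketch below computes expected per-step regret in closed form for an oracle, a uniform baseline, and a rough model. The three-outcome truth and the rough model's probabilities are made up for the example. The expected loss of any prediction is its cross-entropy against the truth, and the floor is the truth's own entropy.

```python
import math

def entropy(p):
    """Entropy of p in nats: the irreducible expected loss per step."""
    return -sum(pi * math.log(pi) for pi in p.values() if pi > 0)

def cross_entropy(p_star, p_hat):
    """Expected log-loss in nats of predicting p_hat when the truth is p_star."""
    return -sum(pi * math.log(p_hat[o]) for o, pi in p_star.items() if pi > 0)

p_star = {"a": 0.7, "b": 0.2, "c": 0.1}          # hypothetical truth
uniform = {o: 1 / len(p_star) for o in p_star}   # uniform baseline over k = 3 outcomes
rough = {"a": 0.6, "b": 0.25, "c": 0.15}         # hypothetical imperfect model

floor = entropy(p_star)
regret_oracle = cross_entropy(p_star, p_star) - floor
regret_rough = cross_entropy(p_star, rough) - floor
regret_uniform = cross_entropy(p_star, uniform) - floor  # equals ln k - H(P_star)

print(f"floor (entropy): {floor:.4f} nats/step")
print(f"oracle regret:   {regret_oracle:.4f}")
print(f"rough regret:    {regret_rough:.4f}")
print(f"uniform regret:  {regret_uniform:.4f}")
```

In this toy case the rough model closes most of the distance from the uniform baseline to the oracle, which is the kind of comparison the two shipped reference contestants are meant to support.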

A public leaderboard and side-by-side replay video are planned, not yet shipped.
