Open falsification benchmark

A pre-registered, reproducible science demo: not operational advice, and not a consciousness or AGI claim. "UNI" is a brand; active inference is the science. "Autopoiesis" here means one thing only: viable-set maintenance (keeping the cell inside its operating band), not life and not sentience. The agent never sees the hidden state; it acts on observations alone. Built to be proven wrong. Framing after Mikkilineni (2022), Information 13(1):24.

Cell Lab

A self-managing digital "service cell" under hidden disturbances: a UNI active-inference controller vs rule-based and random baselines, designed so it can be proven wrong.

A hidden 216-state world (health × demand × resource × dependency × config) is hit by disturbance families it never announces. Controllers see only four noisy telemetry modalities and must keep the cell viable. The UNI controller infers a belief q(s) and plans by expected free energy; every number below is labeled (VFE / EFE / surprisal / ambiguity / risk / RecoveryScore). This is a benchmark, and a UNI loss is reported as plainly as a win.
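The inference step described above can be sketched as a single Bayesian update. This is a minimal illustration, not the lab's code: it assumes the four telemetry modalities are conditionally independent given the hidden state, skips the transition model, and uses a toy two-state world. The function name `update_belief` and all numbers are hypothetical.

```python
def update_belief(prior, likelihoods, observation):
    """One Bayes step: q(s) is proportional to prior(s) * product over m of p(o_m | s).

    prior[s]            -- belief before the observation
    likelihoods[m][s][o] -- p(o | s) for telemetry modality m (assumed independent)
    observation[m]      -- index of the value observed on modality m
    """
    posterior = []
    for s, p in enumerate(prior):
        for m, o in enumerate(observation):
            p *= likelihoods[m][s][o]
        posterior.append(p)
    total = sum(posterior)
    if total == 0:
        raise ValueError("observation has zero probability under the model")
    return [p / total for p in posterior]

# Toy world: 2 hidden states, 4 modalities sharing one noisy likelihood (hypothetical).
NOISY = [[0.8, 0.2],   # state 0 mostly emits reading 0
         [0.3, 0.7]]   # state 1 mostly emits reading 1
LIKELIHOODS = [NOISY, NOISY, NOISY, NOISY]

belief = update_belief([0.5, 0.5], LIKELIHOODS, observation=(1, 1, 1, 0))
```

Three of the four readings point to state 1, so the belief shifts toward it without ever reading the hidden state directly.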

216 hidden states · 10 actions · 7 disturbance families · observation-only agent · pre-registered & seeded
Service cell — live topology
Inferred belief q(s) — factor marginals

Marginal probability the agent assigns to each level of each hidden factor, from its 216-state belief. The agent never observes these directly — it infers them from telemetry.
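As a sketch of how factor marginals fall out of a joint belief: sum the joint probability over every other factor. The factor names and the 216 total come from this page; the per-factor sizes below are an assumption chosen only because they multiply to 216, and the flat row-major state ordering is likewise assumed.

```python
from itertools import product

# ASSUMPTION: per-factor level counts are hypothetical (3 * 3 * 3 * 4 * 2 = 216).
FACTORS = {"health": 3, "demand": 3, "resource": 3, "dependency": 4, "config": 2}

def factor_marginals(belief):
    """Marginalize a flat joint belief over 216 states into one distribution per factor."""
    marginals = {name: [0.0] * n for name, n in FACTORS.items()}
    joint_states = product(*(range(n) for n in FACTORS.values()))
    # product() walks the joint states in row-major order, matching the flat index.
    for prob, levels in zip(belief, joint_states):
        for name, level in zip(FACTORS, levels):
            marginals[name][level] += prob
    return marginals

uniform_belief = [1.0 / 216] * 216
marginals = factor_marginals(uniform_belief)
```

A uniform joint belief yields uniform marginals; a belief concentrated on a few joint states yields sharply peaked ones.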

Telemetry (what the agent sees) + decision
Chosen action
Expected free energy of the choice (labeled)
EFE (G): —   ambiguity: —   risk: —
Top-5 policies — ranked by expected free energy G (lower is better)
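The labeled quantities can be illustrated with the standard active-inference decomposition G = risk + ambiguity, where risk is the KL divergence between predicted and preferred observations and ambiguity is the expected entropy of the observation likelihood. This is a sketch under that textbook form; the page does not give the lab's exact formulas, and the policy names, likelihood, and preferences below are hypothetical toy values.

```python
import math

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)

def kl(p, q):
    """KL[p || q] in nats; q must be positive wherever p is."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)

def expected_free_energy(q_s, likelihood, preferred_obs):
    """Return (G, risk, ambiguity) for a predicted state belief q_s under one policy."""
    n_states, n_obs = len(q_s), len(preferred_obs)
    # Predicted observation distribution: q(o) = sum over s of q(s) * p(o | s)
    q_o = [sum(q_s[s] * likelihood[s][o] for s in range(n_states)) for o in range(n_obs)]
    risk = kl(q_o, preferred_obs)
    ambiguity = sum(q_s[s] * entropy(likelihood[s]) for s in range(n_states))
    return risk + ambiguity, risk, ambiguity

def rank_policies(predicted_states, likelihood, preferred_obs, top=5):
    """Sort policies by G ascending (lower is better) and keep the best `top`."""
    scored = []
    for name, q_s in predicted_states.items():
        g, risk, ambiguity = expected_free_energy(q_s, likelihood, preferred_obs)
        scored.append((g, name, risk, ambiguity))
    return sorted(scored)[:top]

# Hypothetical toy model: 2 states, 2 observations, 2 candidate policies.
LIKELIHOOD = [[0.9, 0.1], [0.2, 0.8]]
PREFERRED = [0.95, 0.05]
PREDICTED = {"restart": [0.9, 0.1], "wait": [0.3, 0.7]}
ranking = rank_policies(PREDICTED, LIKELIHOOD, PREFERRED)
```

In this toy case "restart" ranks first because it predicts observations closer to the preferred ones and leaves less expected ambiguity.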
Run
RecoveryScore
% viable
0
tick
Controller & disturbance

Disturbances are hidden from the agent — it must infer trouble from telemetry alone.

Depth-2 planning evaluates 100 policies (every two-step sequence of the 10 actions; used live). Depth-3 evaluates 1000 (heavier; the offline benchmark uses depth-3).
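The policy counts are consistent with exhaustive enumeration of action sequences over 10 actions. A minimal sketch, assuming each policy is a fixed-length open-loop sequence; action indices stand in for the action names, which this page does not list.

```python
from itertools import product

N_ACTIONS = 10  # from the page; the actions themselves are represented by index only

def enumerate_policies(depth):
    """All action sequences of the given length (assumed open-loop policies)."""
    return list(product(range(N_ACTIONS), repeat=depth))

depth2 = enumerate_policies(2)
depth3 = enumerate_policies(3)
```

The count grows as 10 to the power of the depth, which is why depth-3 is ten times heavier than depth-2.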

Mulberry32 seed — any run reproduces exactly from its seed.
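For readers reproducing runs outside the browser, here is a Python port of the widely circulated Mulberry32 generator. It is a sketch: the lab's own implementation is not shown on this page and may differ in detail, so treat bit-for-bit agreement with the demo as unverified.

```python
MASK32 = 0xFFFFFFFF

def mulberry32(seed):
    """Return a function producing floats in [0, 1) from a 32-bit seed."""
    state = seed & MASK32

    def next_float():
        nonlocal state
        state = (state + 0x6D2B79F5) & MASK32
        t = state
        # Multiplications are truncated to 32 bits, mirroring Math.imul in JavaScript.
        t = ((t ^ (t >> 15)) * (t | 1)) & MASK32
        t = ((t + (((t ^ (t >> 7)) * (t | 61)) & MASK32)) & MASK32) ^ t
        return ((t ^ (t >> 14)) & MASK32) / 4294967296

    return next_float

# Two generators built from the same seed yield the same stream.
rng_a = mulberry32(12345)
rng_b = mulberry32(12345)
stream_a = [rng_a() for _ in range(5)]
stream_b = [rng_b() for _ in range(5)]
```

Because the only state is one 32-bit integer, storing the seed is enough to replay an entire run.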

Baseline compare (seeded, 80 ticks)
Controller | RecoveryScore
Run a comparison to populate.

The downloaded run carries the seed + per-tick log + scores so anyone can reproduce or refute it.

How this could be wrong
  • If a baseline (rule-based or random) matches or beats UNI on RecoveryScore across seeds, the "active inference helps here" claim is falsified for this world.
  • If the UNI controller's belief is uninformative (high entropy throughout), its planning signal (EFE) is noise.
  • If results don't reproduce from the seed, the benchmark is broken.
  • If a Rao-style structural controller wins, that is a real result — shown, not hidden.
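The "uninformative belief" condition above can be checked with a normalized entropy: 1.0 means the belief is uniform over all states, 0.0 means it is certain. This is a sketch only; the page does not state a threshold, so the cutoff below is a hypothetical placeholder.

```python
import math

def normalized_entropy(belief):
    """Shannon entropy of a belief divided by its maximum, log(number of states)."""
    h = -sum(p * math.log(p) for p in belief if p > 0)
    return h / math.log(len(belief))

def is_uninformative(belief_history, threshold=0.95):
    """True if every tick's belief stays near uniform (threshold is an assumption)."""
    return all(normalized_entropy(b) >= threshold for b in belief_history)

uniform = [1.0 / 216] * 216
certain = [1.0] + [0.0] * 215
flat_run = is_uninformative([uniform, uniform, uniform])
mixed_run = is_uninformative([uniform, certain, uniform])
```

A run flagged by this check would mean the EFE ranking was computed from a belief that never narrowed down the hidden state.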
Rao challenge mode — Prove UNI Wrong

Upload a knowledge structure (knowledge_nodes + relations) — the operational conditions a self-managing cell recognizes from telemetry, and the remediations between them — then run it head-to-head against UNI on shared seeds. The Rao-native controller acts directly from that graph; it is our own structural controller (Class C — code-indicated), framed after Mikkilineni (2022), Information 13(1):24, and it does not reproduce Rao's verified method. The same structure is also compiled into UNI priors (the UNI-translated controller). Results are reported honestly: a Rao-native win is shown as plainly as a UNI win — and so is a rule-based or random win.

Paste a structure and validate it. Malformed, empty, or wrong-dimensioned uploads are rejected with a specific, human-readable reason.
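A validator of this kind might look like the sketch below. Only the top-level keys `knowledge_nodes` and `relations` come from this page; the per-node `id` field and the per-relation `from` and `to` fields are hypothetical, and dimension checks are omitted because the expected dimensions are not stated here.

```python
def validate_structure(doc):
    """Return (ok, reason). Field names inside nodes and relations are assumptions."""
    if not isinstance(doc, dict):
        return False, "top level must be an object"
    nodes = doc.get("knowledge_nodes")
    relations = doc.get("relations")
    if not isinstance(nodes, list) or not nodes:
        return False, "knowledge_nodes must be a non-empty list"
    if not isinstance(relations, list):
        return False, "relations must be a list"

    ids = set()
    for i, node in enumerate(nodes):
        if not isinstance(node, dict) or not isinstance(node.get("id"), str):
            return False, f"knowledge_nodes[{i}] needs a string 'id'"
        if node["id"] in ids:
            return False, f"knowledge_nodes[{i}] repeats id {node['id']!r}"
        ids.add(node["id"])

    for i, rel in enumerate(relations):
        if not isinstance(rel, dict):
            return False, f"relations[{i}] must be an object"
        for end in ("from", "to"):
            target = rel.get(end)
            if not isinstance(target, str) or target not in ids:
                return False, f"relations[{i}].{end} must name an existing node"
    return True, "ok"

good = {
    "knowledge_nodes": [{"id": "degraded"}, {"id": "healthy"}],
    "relations": [{"from": "degraded", "to": "healthy"}],
}
ok_result = validate_structure(good)
empty_result = validate_structure({"knowledge_nodes": [], "relations": []})
dangling_result = validate_structure(
    {"knowledge_nodes": [{"id": "healthy"}], "relations": [{"from": "healthy", "to": "missing"}]}
)
```

Returning the first specific reason, rather than a bare failure flag, matches the page's promise of a human-readable rejection.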

database_flaky is the committed case where a heuristic already beats UNI — a fair place to try to prove UNI wrong.

Controller | median RecoveryScore (6 seeds) | outcome
Load the example (or upload your own structure), then run the head-to-head. Contenders: Rao-native, UNI-translated, UNI (no overlay), rule-based, random.

Higher RecoveryScore is better. A difference counts as significant only when the bootstrap 95% CI on the median paired difference excludes 0; a single seed never decides it. The whole run is reproducible from its seeds.
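The significance rule can be sketched as a percentile bootstrap over per-seed paired differences. This is one common way to build such an interval, not necessarily the lab's exact procedure; the resample count, function names, and the six scores per controller are illustrative values, not results from the benchmark.

```python
import random
import statistics

def bootstrap_median_diff_ci(scores_a, scores_b, n_boot=5000, seed=0):
    """Percentile-bootstrap 95% CI for the median of paired differences (a - b)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)  # seeded so the interval itself is reproducible
    medians = sorted(
        statistics.median(rng.choices(diffs, k=len(diffs))) for _ in range(n_boot)
    )
    lo = medians[int(0.025 * n_boot)]
    hi = medians[int(0.975 * n_boot) - 1]
    return statistics.median(diffs), lo, hi

def is_significant(lo, hi):
    """Significant only if the interval excludes 0 on either side."""
    return lo > 0 or hi < 0

# Illustrative scores for 6 shared seeds (made up for this sketch).
controller_a = [0.80, 0.82, 0.79, 0.85, 0.81, 0.83]
controller_b = [0.70, 0.71, 0.69, 0.72, 0.70, 0.73]
median_diff, ci_lo, ci_hi = bootstrap_median_diff_ci(controller_a, controller_b)
tie_median, tie_lo, tie_hi = bootstrap_median_diff_ci(controller_a, controller_a)
```

Pairing by seed removes seed-to-seed variation from the comparison, so the interval reflects the controllers rather than the luck of a particular disturbance sequence.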

Leaderboard — offline benchmark (committed cache, seeded)


Disturbance family | Ranked by RecoveryScore (controller: score) | UNI vs neural (median Δ, 95% CI)

Ranked from the committed cell-bench-cache.json (no in-browser recompute; seeds shown above). "sig" = the bootstrap 95% CI for the median paired difference excludes 0. A UNI loss (neural or rule-based ahead) is shown as plainly as a win.