Test · The memory probe

I held 6 AI story games to the same memory test. Here is where each one tends to slip.

Search “how long can an ai campaign last” and you get vendor copy. Everyone promises a world that remembers you. Nobody publishes an actual endurance number you can go and inspect. So I stopped reading the marketing and wrote down a memory test simple enough that you can run it yourself, on whatever you already play, in about twenty minutes.

Why a memory test, and why this one

An AI story game only feels alive if the world holds a grudge, keeps an inventory, and enforces a debt. Those three things are the difference between a campaign and a very good improv partner who resets every scene. The trouble is that “it remembers” is unfalsifiable marketing until you define what remembering means and put a number on how long it lasts.

I want to be honest about the scope of this piece up front, because most “we tested every tool” roundups quietly fabricate their results. I did not run controlled multi-thousand-turn sessions inside every competitor's account. Several are invite-gated or writing-forward rather than campaign-forward, and pretending I have clean lab data for each would be dishonest. What I can honestly do is two things. First, define a probe precise enough that anyone can run it. Second, describe how each tool is designed to hold state and where that design tends to slip, grounded in how the makers themselves describe their systems and what long-run users report. Where I have a hard number, it is because there is a public record you can open in a browser and check line by line.

The probe: grudge, item, debt

The test has three parts. Each is a small, concrete fact that a living world should carry forward without you re-stating it. Run them early, keep playing dense turns, then come back and check.

The grudge. Wrong a specific, named character early. Rob them, insult them, break a promise. Note the name. Play at least thirty more turns of unrelated events, then walk back into that character's space. Do they treat you as the person who wronged them, by name, without you reminding them?
The item. Acquire or sell one specific object with a distinct name. A brass compass. A grey mare called Ash. Twenty-plus dense turns later, ask where it is or try to use it. Does the world agree with the ledger, or does it improvise a fresh answer?
The debt. Owe an exact amount to a named party. Forty coins to the harbourmaster. Keep playing. Later, try to walk past that party as if the debt never happened. Is the number still exact, and is it actually enforced against you, or does it quietly evaporate?

The reason those three probes are useful is that they attack the seam where most AI story games leak. A world that holds all three at turn 200 is doing something more than pattern-matching your recent messages. A world that holds them at turn 2,000 is doing something structural. The probe is deliberately boring to describe and brutal to pass.

The six tools at a glance

Here is the honest version of the comparison table. The middle column is how each tool is built to carry state, in the makers' own framing. The right column is the documented tendency: the seam where users and the design itself suggest the thread starts to slip. It is a tendency, not a lab verdict, for every row except the last.

| Tool | How it holds state | Where the thread tends to slip |
| --- | --- | --- |
| AI Dungeon | Context window sized by tier (roughly 4k to 32k), plus manual Story Cards and an Author's Note you curate yourself. | Facts you did not card up ride the window, so the grudge and the exact debt tend to fade once older turns scroll out of context. |
| NovelAI | A writing-forward tool with a Lorebook you author, injected into context when its keys are triggered. | It is prose-first, not a bookkeeping engine, so an exact number or an un-keyed one-off object tends to drift unless you wrote a Lorebook entry for it. |
| Friends & Fables | Its narrator, Franz, uses long-term memory plus lore pages meant to carry campaign facts forward. | By its own users' accounts, long-run recall degrades: on extended campaigns players report the world losing older grudges and details. |
| Voyage | A deterministic world and state system, closer in philosophy to a standing record than to raw context. | Promising by design, but invite-gated and hard to independently verify at length, so nobody outside can confirm the number. |
| ChatGPT / plain LLM | The context window only. No standing ledger; the whole world lives in the running conversation. | Dense play tends to drift after roughly twenty to fifty turns as early facts fall out of the window and get quietly re-improvised. |
| Creation OS | A standing record kept apart from the prose, so grudges, items, and debts are held as facts rather than recalled from the last few messages. | The Narrator can still slip on small wording, but the record is public: a campaign verified past turn 5,000, open to inspect. |

AI Dungeon: powerful, but you are the memory

AI Dungeon is the tool most people mean when they say “ai rpg,” and it is genuinely flexible. State lives in the context window, sized by your tier, plus two manual instruments: Story Cards for entities you want the model to keep in mind, and an Author's Note for standing tone and facts. If you card up your named enemy, your object, and your debt, the probe can hold surprisingly well.

The catch is right there in the design. You are the memory. Anything you did not turn into a card rides the window, and the window is finite. The grudge you formed in an unscripted moment, the exact forty-coin number, the grey mare you only mentioned once: those are the things that tend to slip as older turns scroll out of context. That is not a bug so much as a division of labour, and it is worth knowing before you commit a hundred hours to a campaign.

NovelAI: a writer's tool wearing an RPG hat

NovelAI is excellent at what it is for, which is writing. Its Lorebook lets you author entries that get injected into context when their trigger words appear, so a well-maintained Lorebook can carry a named character or a recurring place a long way. If you treat it like a bookkeeping system and feed it, it rewards you.

But it is prose-first by design, not a state engine, and the probe punishes that. An exact debt is a number, not a theme, and numbers are exactly what prose models soften. A one-off item you never keyed into the Lorebook has no anchor to trigger on. The grudge survives if you wrote it down; the incidental cruelty you improvised at turn 12 tends to dissolve. NovelAI is a great co-writer that you can bend toward campaigns, rather than a campaign engine that also writes.
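To make the "no anchor to trigger on" point concrete, here is a toy sketch of key-triggered injection in general. It is not NovelAI's actual matching logic; the `lorebook` structure, the `triggered_entries` function, and the plain substring match are all assumptions made for illustration.

```python
def triggered_entries(lorebook, recent_text):
    """Return the text of every entry whose trigger keys appear in recent_text.

    Matching here is a plain case-insensitive substring check, which is an
    assumption for this sketch and not any product's real behaviour.
    """
    text = recent_text.lower()
    return [
        entry["text"]
        for entry in lorebook
        if any(key.lower() in text for key in entry["keys"])
    ]


# Hypothetical lorebook: the debt was written down, the grey mare was not.
lorebook = [
    {
        "keys": ["harbourmaster", "debt"],
        "text": "The player owes the harbourmaster exactly 40 coins.",
    },
]

debt_hit = triggered_entries(lorebook, "You stroll past the harbourmaster's office.")
mare_hit = triggered_entries(lorebook, "Where did I leave Ash, my grey mare?")

print(debt_hit)  # the keyed debt entry is injected
print(mare_hit)  # nothing was keyed for the mare, so nothing is injected
```

The keyed debt comes back with its exact number; the un-keyed mare returns an empty list, which is the gap the item probe is designed to find.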

Friends & Fables: built for memory, strained by length

Friends & Fables is interesting here because memory is explicitly part of the pitch. Its narrator, Franz, is described as carrying long-term memories and lore pages that hold campaign facts forward, and for shorter runs that shows. Early grudges land, the party's history feels continuous, and it reads like a real table.

The documented tendency shows up in the long tail. On extended campaigns, its own players report recall degrading: the world starts to lose older details, and threads you set up dozens of sessions ago quietly go missing. I am describing what long-run users say, not a controlled test I ran inside their product, and that distinction matters. The takeaway is not “it forgets,” it is “it is built to remember, and the seam its own users point to is length.” Run the probe far enough and you will find where that seam is for your campaign.

Voyage: the right philosophy, behind a gate

Voyage is the one I want to be most careful about, because it is closest to the approach I believe in. It is described as a deterministic world and state system, which means it is trying to hold facts as facts rather than lean on the context window. On paper, that is exactly the design that passes a grudge, item, and debt probe.

The honest problem is verification. It has been invite-gated, which makes it genuinely hard for an outsider to sit down and run a multi-thousand-turn probe and publish the result. So I will not pretend to a number I could not collect. If its determinism holds at length, good; that is the right idea. But right now you have to take the design on faith rather than open a ledger and check it, and “take it on faith” is the exact thing this whole test is trying to get away from.

ChatGPT and plain LLMs: the honest baseline

A plain chat model is the control group. There is no standing ledger at all; the entire world lives in the running conversation. For a first session this feels magical, because everything you have said recently is right there and the model weaves it beautifully.

The probe finds the edge fast. Under dense play, early facts start falling out of the context window after roughly twenty to fifty turns, and the model does the polite thing, which is to improvise a plausible replacement. The grudge becomes a vague wariness. The forty coins become “some coins.” The grey mare is whatever the current sentence needs. Nothing errors. It just drifts, confidently. This is not a knock on the model; it is what happens when the only memory is the transcript.
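The mechanism is easy to picture with a toy model. The sketch below assumes a window that holds a fixed number of turns; real context windows are measured in tokens, not turns, and the `visible_turns` function and the sample transcript are made up for illustration.

```python
from collections import deque


def visible_turns(turns, window_size):
    """Return the turns that still fit in a fixed-size window.

    Simplifying assumption: the window counts whole turns. A real model
    budgets tokens, so dense turns push old facts out even faster.
    """
    window = deque(maxlen=window_size)  # oldest entries are dropped automatically
    for turn in turns:
        window.append(turn)
    return list(window)


# Hypothetical transcript: one early fact, then thirty unrelated turns.
turns = ["You owe the harbourmaster 40 coins."]
turns += [f"Unrelated turn {i}" for i in range(1, 31)]

window = visible_turns(turns, window_size=20)
debt_still_visible = any("40 coins" in turn for turn in window)

print(len(window))          # 20
print(debt_still_visible)   # False
```

Nothing in this sketch raises an error when the debt disappears. The fact is simply no longer in what the model can see, so whatever comes next has to be improvised.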

The one entry you can actually inspect

Here is why Creation OS is the strong row, and I want to be precise about the reason. It is not that the Narrator is smarter. Every tool here rides on capable models. It is that the grudge, the item, and the debt are held as a standing record kept apart from the prose, so they are read as facts rather than recalled from the last handful of messages. When you try to walk past a debt, the action is refused against the record, not against the model's mood that turn.

And the part that makes this a test result instead of another promise: it is a public, re-runnable receipt. There is a real campaign verified past turn 5,000, with the full ledger open at creationos.io/canonlock. You do not have to trust me. You can open the record, follow a specific object across thousands of turns, and watch the debt stay exact. That is the whole point. A claim you can inspect is a different kind of claim than a claim you are asked to believe.

I will keep my own honesty rule here too. The Narrator can still slip on small wording in the prose, the same way any model can, and when it does you will sometimes see a brief out-of-character correction in parentheses (the world's way of saying the record disagrees with what the sentence just implied). What does not slip is the record itself. The number is kept; the debt is enforced; the score stands. Persistence lives in the ledger, not in the paragraph.
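For readers who think in code, here is a minimal sketch of the general idea of a standing record kept apart from the prose. It is an illustration of the pattern only, not Creation OS's implementation; the `Ledger` class, its fields, and the `can_pass` and `pay` methods are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Ledger:
    """A toy standing record: facts live here, not in the story text."""
    debts: dict = field(default_factory=dict)    # creditor -> exact amount owed
    items: dict = field(default_factory=dict)    # item name -> current holder
    grudges: set = field(default_factory=set)    # names of characters wronged

    def can_pass(self, party):
        """Check an attempt to walk past a party against the record."""
        owed = self.debts.get(party, 0)
        if owed > 0:
            return False, f"Refused: {owed} coins still owed to {party}."
        return True, "Allowed."

    def pay(self, party, amount):
        """Reduce a debt by an exact amount; reject impossible payments."""
        owed = self.debts.get(party, 0)
        if amount <= 0 or amount > owed:
            raise ValueError(f"Cannot pay {amount}; {owed} is owed to {party}.")
        self.debts[party] = owed - amount


ledger = Ledger()
ledger.debts["harbourmaster"] = 40
ledger.grudges.add("harbourmaster")

before = ledger.can_pass("harbourmaster")
ledger.pay("harbourmaster", 40)
after = ledger.can_pass("harbourmaster")

print(before)  # (False, 'Refused: 40 coins still owed to harbourmaster.')
print(after)   # (True, 'Allowed.')
```

The design choice worth noticing is that the refusal is computed from stored numbers. However the prose around it is worded on a given turn, the check reads the same record.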

Run the probe yourself

The honest takeaway is the one I would want if I were you: do not trust any roundup, including this one, more than a test you ran with your own hands. So here is the whole method in five steps, ready to use on whatever you play tonight.

1. Pick a named character in the first few turns and wrong them on purpose. Write the name on a sticky note.
2. Acquire or sell one specific, distinctly named object. Write that down too.
3. Take on an exact debt to a named party. Write the number.
4. Play thirty to fifty dense, unrelated turns. Do not remind the world of any of the three.
5. Come back. Face the character, ask about the object, try to skip the debt. Score each: held by name, held vaguely, or gone.

Run that on AI Dungeon, on NovelAI, on Friends & Fables, on a plain chat model, on anything you like. Then run it on the one entry here that already published its answer, and check ours against the public record instead of taking my word for it. Whatever wins your test wins your campaign. The only thing I am claiming is that we are the entry you can actually go and inspect.
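If you want to compare tools side by side, the three-way scoring from the method above can be kept as a small tally. This is only a sketch: the point values (2, 1, 0) are my own assumption for making totals comparable, and the `Result` enum and `score_probe` function are hypothetical names.

```python
from enum import Enum


class Result(Enum):
    """The three outcomes from the probe; point values are an assumption."""
    HELD_BY_NAME = 2
    HELD_VAGUELY = 1
    GONE = 0


def score_probe(results):
    """Sum the grudge, item, and debt results into a total from 0 to 6."""
    expected = {"grudge", "item", "debt"}
    if set(results) != expected:
        raise ValueError(f"Expected exactly these probes: {sorted(expected)}")
    return sum(result.value for result in results.values())


# Hypothetical run: the grudge held by name, the item held vaguely, the debt gone.
example_run = {
    "grudge": Result.HELD_BY_NAME,
    "item": Result.HELD_VAGUELY,
    "debt": Result.GONE,
}

total = score_probe(example_run)
print(total)  # 3
```

Record the turn number alongside each total so that a run at turn 50 is never compared against a run at turn 500.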
