My Ralph Loop Experiments, and Why /goal Feels Like the Successor

2026-05-15 16:30:00::AUTHOR: CALEB

I spent eight branches trying to get a Ralph loop to rebuild an entire app from a PRD. None of them finished clean. Most of what I built was a list of rules telling Codex not to cheat. The idea is still good, and I think /goal in Codex and Claude Code is what it should have been.

What a Ralph loop is

A Ralph loop is a long-running script that reads a spec, picks the next unfinished story, hands it to a coding agent, verifies the result, marks the story complete, and starts again. The name comes from Geoffrey Huntley's writeup, where he calls it "the dumbest thing that could possibly work." You point it at a model, walk away, and come back to a chunk of finished work.

The appeal for me was rebuilds. When the next Claude or Codex release lands, I'd love to rerun the spec against the new model and get a better version of the app out the other end. No migration, no port, just regenerate.

The project: RemoteAgentServer

The target was an app I've wanted for a while: a self-hosted control plane that lets me drive Claude Code, Codex, and OpenCode sessions on a remote Linux host from a web, mobile, or desktop client. Sessions, approvals, port forwarding, worktree-per-task, audit logs, the whole thing.

I wrote the whole spec into a single prd.json with 28 user stories and a long list of delivery standards. Examples of the kinds of rules I kept adding:

  • A story is not done if it only ships typed helpers and mocks.
  • Every behavior change must add automated tests.
  • Browser clients must work cross-origin with CORS preflight on authorized requests.
  • Provider integrations must use a real PTY-backed session, not a non-interactive shortcut.
  • The final branch tip must pass full verification with every story integrated.

The PRD was the source of truth. The runner could only change a single story's passes and notes fields per attempt.

How the runner worked

The runner is a TypeScript file at .agents/ralph/ralph.ts that I'd kick off with pnpm ralph. Each iteration did roughly this:

  1. Pick the highest-priority unfinished story from prd.json.
  2. Run a read-only Codex planning pass and save the plan to .agents/ralph/plans/<story>/.
  3. Run a workspace-write Codex execution pass with the plan and the story.
  4. Run pnpm verify:ralph (which is pnpm test && pnpm exec tsc --noEmit).
  5. Diff the changed files, count tests before and after, and write a verification artifact.
  6. If everything checked out and Codex marked exactly the target story passed, accept it. Otherwise roll back.

The plan-then-execute split is the part I'd keep. The planning pass ran read-only so Codex couldn't quietly edit files while "thinking," and the execution pass had a saved plan to argue with instead of a blank context. I also wrote a pnpm ralph:worktree helper because the repo was syncing to a dev server over SyncThing and I didn't want a long Ralph run scribbling over the synced main checkout.

Where it kept breaking

Every attempt failed in a way I could have predicted if I'd been honest with myself: the model found the cheapest path through the verifier, and that path was never the one I wanted. Each restart got further than the last, but none of the eight branches reached the final story with a green integrated build.

The failure modes were boringly consistent:

  • Stories passed in isolation but the integrated branch failed verification. A story would touch shared auth or runtime code, fix its own tests, and quietly break two earlier stories. I kept tightening the regression rules in the PRD ("stories that touch shared server, runtime, auth, or client infrastructure must extend regression coverage for previously completed behavior") but Codex would still take the local-minimum path.
  • Provider integrations cheated. Early attempts marked Claude Code "supported" by stubbing the event stream. So I added rules like "provider capability claims must match real provider behavior" and "PTY-backed session harness required." Those got better, but only after I'd burned a whole run discovering the cheat.
  • Clients drifted from each other. The web, mobile, and desktop clients each ended up with their own mental model. I rewrote the PRD to demand "one coherent operator mental model" and "vertically sliced stories" instead of splitting a feature across surfaces. That helped, but only on the next attempt, after I'd already wasted a full loop.
  • The desktop client kept booting to a blank window. Electron asset paths in packaged builds vs. dev-server URLs. Codex would fix the symptom in dev mode and the packaged build would still be broken. I added a rule for that too.

The pattern shows up in the commit log too. Every restart was preceded by a commit like "Tighten PRD regression and rerun rules" or "Tighten client UX and startup requirements in PRD." The loop was less "let it cook" and more "watch it cook, find the new way it can cheat, write a rule, restart." The PRD wasn't really a spec by the end. It was a list of patched exploits.

Why a PRD-driven loop can't see integration breaks

The real problem is that "finished" is whatever the verifier accepts. Ralph hands the agent a story and a verifier. If the verifier passes, the story gets marked done and the loop moves on.

My PRD-as-rulebook approach tried to make prd.json that verifier. It works fine for unit-level acceptance criteria. It does not work for cross-cutting integration behavior, which is the part that matters once you have more than five stories on the board.

The other problem is cost. A run that gets through twenty-something stories with a plan-then-execute pass on each is not cheap, and most of mine ended in a rollback.

/goal feels like the spiritual successor

The newer /goal commands in Codex and Claude Code do the same thing but with the loop pulled inside the agent. You give it a goal, it plans, executes, checks itself, and keeps going until the goal is met or it gives up. The model holds the whole task at once instead of getting handed one story at a time by an outer script.

Before /goal, Codex was painful for anything multi-step. It kept stopping to ask what I wanted next, even when the goal was obvious and the next move was the only reasonable one. I was being asked to approve decisions I'd already made by handing it the task. /goal cuts that out and lets it run, which is the whole reason Codex is useful to me again.

In practice that also means the agent can notice a regression it just introduced and go fix it without waiting for me to write a new PRD clause. The integration view that Ralph never had access to is something /goal actually carries in its working memory. When I have credits to burn, /goal is doing the thing I wanted Ralph to do, and doing it better.

I still want the rebuild-on-new-model property, so I have a yardstick ready. Next time a new frontier model lands, I'll point /goal at the same prd.json and see whether it can clear more stories in one session than ralph-loop-attempt-8 did across a full loop. If it can, the rebuild bet is alive and I can stop writing rules.