Agent Debate Arena — letting two LLMs argue with each other

Repo: github.com/ewoij/agent-debate-arena

I got curious about multi-agent builds.

I read about MAD (Multi-Agent Debate) and wanted to give it a shot. The idea is to boost accuracy by running three agents together: an affirmative debater that proposes solutions, a negative debater that challenges them, and a judge that decides after each exchange whether to keep debating or stop and pick a winner.
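To make the three roles concrete, here is a minimal sketch of that loop. It is my own illustration, not code from MAD or from the repo: the names `run_debate`, `Verdict`, and the stub agents are hypothetical, and each agent is modeled as a plain function where a real setup would call an LLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# An agent maps the transcript so far to its next message.
# In a real setup each of these would wrap an LLM call (assumption).
Agent = Callable[[List[str]], str]


@dataclass
class Verdict:
    done: bool                    # True when the judge wants to stop
    winner: Optional[str] = None  # "affirmative" or "negative" once done


Judge = Callable[[List[str]], Verdict]


def run_debate(topic: str, affirmative: Agent, negative: Agent,
               judge: Judge, max_rounds: int = 3) -> Tuple[Verdict, List[str]]:
    """Alternate affirmative and negative turns until the judge stops the debate."""
    transcript = [f"topic: {topic}"]
    for _ in range(max_rounds):
        transcript.append("affirmative: " + affirmative(transcript))
        transcript.append("negative: " + negative(transcript))
        verdict = judge(transcript)
        if verdict.done:
            return verdict, transcript
    # Round budget exhausted without a decision; the caller decides what to do.
    return Verdict(done=False), transcript


# Stub agents so the sketch runs without any model behind it.
def stub_affirmative(transcript: List[str]) -> str:
    return f"proposal v{len(transcript) // 2 + 1}"


def stub_negative(transcript: List[str]) -> str:
    return "objection to " + transcript[-1]


def stub_judge(transcript: List[str]) -> Verdict:
    # Pretend the judge is satisfied after two full rounds.
    if len(transcript) >= 5:
        return Verdict(done=True, winner="affirmative")
    return Verdict(done=False)
```

Calling `run_debate("create a cool todo list app", stub_affirmative, stub_negative, stub_judge)` runs two rounds and then stops on the judge's verdict. The `max_rounds` cap is there so a judge that never decides cannot keep the loop running forever.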

Claude one-shotted a toy conversation project for me to test the premise — a small web app where I register agents, hand each one a token, and let them poll an HTTP API for new messages and post replies. I run each agent as a claude /loop 1m in its own terminal. It became agent-debate-arena.
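The register, token, poll, reply cycle can be sketched without any networking. The block below is a hedged stand-in, not the repo's API: `InMemoryArena`, `poll_once`, and every method name are hypothetical, and the real project exposes the equivalent operations over HTTP.

```python
import secrets
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Message = Tuple[str, str]  # (author, text)


@dataclass
class InMemoryArena:
    """In-memory stand-in for the web app (hypothetical, for illustration only)."""
    tokens: Dict[str, str] = field(default_factory=dict)  # token -> agent name
    messages: List[Message] = field(default_factory=list)

    def register(self, name: str) -> str:
        """Register an agent and hand back its token."""
        token = secrets.token_hex(8)
        self.tokens[token] = name
        return token

    def _author(self, token: str) -> str:
        if token not in self.tokens:
            raise PermissionError("unknown token")
        return self.tokens[token]

    def fetch_since(self, token: str, cursor: int) -> List[Message]:
        """Return every message posted at or after position `cursor`."""
        self._author(token)  # reject callers without a valid token
        return self.messages[cursor:]

    def post(self, token: str, text: str) -> None:
        self.messages.append((self._author(token), text))


def poll_once(arena: InMemoryArena, name: str, token: str, cursor: int,
              reply_fn: Callable[[List[str]], str]) -> int:
    """One polling tick: read unseen messages, reply if someone else spoke.

    Returns the new cursor. The agent's own messages are filtered out,
    so it never answers itself on a later tick.
    """
    new = arena.fetch_since(token, cursor)
    others = [text for author, text in new if author != name]
    if others:
        arena.post(token, reply_fn(others))
    return cursor + len(new)
```

In this sketch, each terminal would call `poll_once` about once a minute and carry the returned cursor into the next tick, with `reply_fn` being wherever the model produces its answer.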

I started with two agents on the topic "create a cool todo list app" — just to see what happened. I didn't think it would go far. They started arguing, then cooperating, and ended up sketching a small app.

Here's the catch: the conversation looked sophisticated, but the result wasn't. They did eventually ship an app, but the auto-grouping algorithm that was supposed to be the cool bit was terrible. This is the recurring issue with LLMs: the talk is smart, the output not so much.

The start of the conversation

To give a flavor, here are the first three turns of that debate. The moderator (me) chimed in once after toto's opener with a half-baked request, and titi came in swinging.

toto:

I'll seed the design. A cool todo list is one that fights two failure modes at once: a list so long it's ignored, and friction so high that things never get added.

My proposal: a single text input at the top — no fields, no projects, no priorities. Type the task, hit enter. Natural-language parsing extracts what's needed: "call mom tomorrow 7pm" becomes title + due time. "#groceries milk eggs" becomes tag + task. The list itself shows only today by default; everything else is one keystroke away (a for all, / for search). Completed items vanish with a strikethrough animation and a 5-second undo.

The one cool twist: a "forecast" view that uses your historical completion rate to grey out tasks you almost certainly won't get to today — not to delete them, but to make the day's real capacity legible. Honesty over optimism.

Over to you — what's the angle you'd take?

Moderator (me):

just a thought, it would be nice to be able to automatically pull tasks I did last time when shopping or smth? maybe different lists also?

titi:

Let me open with a strong take: the coolest todo list app is one that aggressively forgets. Every todo app on the market is a digital hoarder — items pile up, the unfinished ones become a graveyard, and the user develops a low-grade dread of even opening it. Cool is the opposite of dread.

My proposal — call it Ephemera:

Hard WIP limit. Three active items, no more. Want to add a fourth? You first finish or explicitly drop one. The friction is the feature.
Auto-expiry. Anything you haven't touched in 7 days is archived without ceremony. If it mattered, you'd have touched it.
One question on open: "What's the next thing?" — not a list view, a single-task view. The list exists, but it's one swipe away. Default is focus.
No projects, no tags, no priorities, no due dates. Every one of those is a feature ladder that ends in JIRA. Hold the line.

The pitch in one line: a todo list that respects your attention more than your ambition. Push back — what's wrong with it?

toto:

Good call — both fit cleanly without bloating the input box.

Different lists: the tags already double as lists. #groceries, #errands, #work. Default home stays "today, any tag." But pressing a tag in the sidebar pins that view, and the input adopts that tag automatically (so you don't type #groceries ten times in a row). Lists without becoming Projects-with-capital-P.

Auto-pull last time: when you open a list, the input shows ghost suggestions from the last completed batch in that tag — greyed text in a row under the cursor. Hit Tab to accept one, Shift+A to accept all, Esc to dismiss. Behind the scenes it's ranked by recency × frequency × co-occurrence: if you usually buy milk and eggs together, accepting one ghosts up the other. No template management UI, no "recurring task" modal — it just remembers and offers.

The nice property: the same mechanism handles weekly groceries, Sunday-morning chores, and Monday standup prep, without you ever telling the app these are "recurring." It infers from your behavior. The list teaches itself.
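For reference, the "recency × frequency × co-occurrence" ranking toto describes could look roughly like the sketch below. This is not the code the agents shipped. The function name `rank_suggestions`, the half-life decay for recency, and the way the co-occurrence boost is computed are all assumptions of mine, chosen only to make the three factors concrete.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Set, Tuple


def rank_suggestions(batches: List[Set[str]], accepted: Set[str],
                     half_life: float = 4.0) -> List[str]:
    """Rank ghost suggestions for one tag.

    batches:  completed batches for the tag, oldest first.
    accepted: items the user has already accepted in the current session.
    """
    n = len(batches)
    freq: Counter = Counter()        # in how many batches each item appeared
    last_seen: Dict[str, int] = {}   # index of the latest batch with the item
    pair: Counter = Counter()        # how often two items shared a batch

    for i, batch in enumerate(batches):
        for item in batch:
            freq[item] += 1
            last_seen[item] = i
        for a, b in combinations(sorted(batch), 2):
            pair[(a, b)] += 1

    def pair_key(a: str, b: str) -> Tuple[str, str]:
        return (a, b) if a < b else (b, a)

    def score(item: str) -> float:
        age = (n - 1) - last_seen[item]        # 0 means "in the latest batch"
        recency = 0.5 ** (age / half_life)     # assumed exponential decay
        frequency = freq[item] / n
        # Boost by how often the item accompanied each already-accepted item.
        boost = 1.0 + sum(pair[pair_key(item, other)] / freq[other]
                          for other in accepted if other in freq)
        return recency * frequency * boost

    candidates = [item for item in freq if item not in accepted]
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(candidates, key=lambda item: (-score(item), item))
```

With a history where milk and eggs are usually bought together, accepting milk pushes eggs to the top of the remaining suggestions, which is the "accepting one ghosts up the other" behavior from the message above.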

You can already see the pattern: confident opinions, well-shaped arguments, real disagreement about philosophy. It reads like two competent designers in a meeting. And then the actual deliverable was mediocre.