The Deep-Decide Skill: Multi-Model AI Debate for Hard Decisions

A portable Claude skill that pressure-tests decisions with four adversarial AI perspectives across Claude, GPT-5 and Gemini — it caught a $56K/yr error.

Four seats · one verdict

Most “AI for decisions” tools give you more information. That is rarely the problem. deep-decide stages a fair fight between your options instead — and it just caught a $56,000-a-year mistake that a single top model, at maximum effort, had already waved through.

TL;DR

deep-decide is a portable Claude skill: four adversarial AI perspectives argue your decision in parallel, spread across Claude, GPT-5 and Gemini.
Every perspective must state the assumption its case rests on, and the final verdict names the objection it overrules — no fake consensus.
It shows you the exact API bill before spending anything. One plain Python script, no hosted service, your own keys.
Real result: it caught a $56,000-per-year error in a document that a single top model, reviewing at maximum effort, had already passed.

The problem

When a decision is genuinely hard, you usually already have the facts. What is missing is a fair fight between the options — and an honest look at the objection you are about to overrule.

Asking a single model to “pressure-test” its own recommendation does not fix this. It brings the same blind spots to the critique that it brought to the draft.

One model interrogating itself is theater, not counsel.

Self-review

One model grading its own work. Same blind spots in, same blind spots out.

Structured debate

Four independent seats forced to disagree, then synthesized into one verdict.

Monologue vs. counsel — why the skill exists

What it is

deep-decide is a small, portable Claude skill that runs the fight for you. You bring a decision, the real options, and your criteria, then trigger it with “deep-decide this,” “pressure-test this decision,” or “rank these options.” It spins up four independent perspectives and makes each one argue the strongest honest case from its seat. Each seat must also state the assumption its whole case rests on. A final synthesis pass then ranks your options and writes down, explicitly, which objection you would be overruling if you go with the winner.

It is deliberately boring under the hood: one bundled Python script (scripts/deep_decide.py), standard library only, no frameworks, no hosted service. Keys come from your environment; if a key isn’t set, that provider simply isn’t used.
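To make the "no key, no provider" rule concrete, here is a minimal sketch of how it could work. The variable names and default models are the ones listed in this article's Configuration section. The function name and its return shape are my own assumptions, not the bundled script's actual code.

```python
import os

# Key variables and default models as listed in this article's Configuration
# section. The structure and the function below are hypothetical.
PROVIDER_KEYS = {
    "anthropic": ("ANTHROPIC_API_KEY", "claude-opus-4-8"),
    "openai": ("OPENAI_API_KEY", "gpt-5"),
    "google": ("GEMINI_API_KEY", "gemini-2.5-pro"),
}


def available_providers(env=None):
    """Return {provider: default_model} for every provider whose key is set."""
    env = os.environ if env is None else env
    return {
        name: model
        for name, (key_var, model) in PROVIDER_KEYS.items()
        if env.get(key_var, "").strip()  # unset or blank key: provider is skipped
    }
```

Passing a plain dict instead of the real environment, for example `available_providers({"OPENAI_API_KEY": "sk-test"})`, makes the rule easy to check without touching your actual keys.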

The shape of a run

Your decision: options + criteria, restated back to you first

Runs in parallel · every seat must disclose its key assumption

Operator

Execution reality. Dependencies, sequencing, the friction nobody budgeted for.

claude-opus-4-8 · must dissent

Skeptic

Red team. Hidden assumptions, failure modes, the way this goes wrong.

gpt-5 · must dissent

Finance

ROI, cash and time cost, opportunity cost, where the leverage actually is.

gemini-2.5-pro · must dissent

Stakeholder

Adoption. Incentives, trust, and the communication load of each option.

claude-opus-4-8 · must dissent

Synthesis: ranks options · flags risk · never averages away disagreement

Verdict: plus the dissent you are overriding, in writing

One run of deep-decide, seats rotated across providers

Why one model is weak counsel

Good counsel comes from a diversity of intelligence. Different models are trained differently, aligned differently, and — this is the useful part — fail in different places. When Claude, GPT and Gemini each argue a seat and still land on the same answer, that agreement means something. When they disagree, the disagreement itself is the finding: it shows you exactly where your decision is fragile.

This is not a hunch. Research consistently shows that independent AI reasoners, made to disagree, produce more truthful and more reliable answers than a single model checking its own work — and that combining different model families beats any one family alone. Here is the reading list behind the design:

Papers by year: 2022, 1 paper · 2023, 7 papers · 2024, 3 papers · 2025, 1 paper

The reading list behind the design: 12 peer-reviewed papers, 2022–2025

deep-decide turns that research into two hard rules:

Forced dissent. Every perspective must state the key assumption behind its case and the strongest argument against itself. No seat is allowed to be only a cheerleader.
Preserved disagreement. The synthesis ranks the options and picks a winner, but it never hides a 3–1 split behind fake consensus. The objection being overruled is written into the verdict.
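The two rules can be sketched as a data shape plus a tally. This is an illustration under my own assumptions: the `SeatCase` fields and the `synthesize` function are hypothetical names, and the real synthesis is a model pass, not a vote count. The sketch shows only the invariants: a case without an assumption and a counter-argument is rejected, and the losing seats travel with the result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeatCase:
    seat: str               # e.g. "skeptic"
    preferred_option: str   # the option this seat argues for
    key_assumption: str     # the assumption the whole case rests on
    counter_argument: str   # the strongest argument against its own case

    def __post_init__(self):
        # Forced dissent: a case that is only a cheerleader is rejected.
        if not self.key_assumption.strip() or not self.counter_argument.strip():
            raise ValueError(f"{self.seat}: assumption and counter-argument are required")


def synthesize(cases):
    """Pick the option with the most seats and keep the overruled seats visible."""
    votes = {}
    for case in cases:
        votes.setdefault(case.preferred_option, []).append(case.seat)
    # Stable sort: on a tie, the option that was argued first stays first.
    ranked = sorted(votes.items(), key=lambda item: len(item[1]), reverse=True)
    winner = ranked[0][0]
    return {
        "winner": winner,
        "split": {option: len(seats) for option, seats in ranked},
        # Preserved disagreement: the dissenting cases ship with the verdict.
        "overridden": [c for c in cases if c.preferred_option != winner],
    }
```

With three seats on one option and one on another, `split` reports the 3–1 count as it is, and `overridden` holds the dissenting case in full, not a summary of it.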

How a run works

1. You state the decision (free). Say “deep-decide this.” The skill restates the decision, options, and criteria back to you. If your options aren’t real alternatives, this is where that becomes obvious.
2. It shows you the bill first (free). A no-cost plan: which providers, which seat runs on which model, exactly how many API calls. Nothing is spent yet.
3. You approve. Only then does --execute run.
4. Four seats argue in parallel. Each on its assigned model, each filing its key assumption and counter-argument.
5. Synthesis. Options ranked, risk flagged, dissent preserved and never averaged away.
6. Everything lands on disk. Per-run bundle, so you can audit any seat’s reasoning later.
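The steps above can be sketched as a plan that costs nothing, an approval gate, and a parallel fan-out. This is a rough illustration, not the bundled script: `build_plan`, `run`, and the `call_model` callback are hypothetical names, and `call_model` stands in for whatever actually talks to a provider.

```python
from concurrent.futures import ThreadPoolExecutor


def build_plan(assignment, synthesis_model):
    """Describe the run without spending anything: one call per seat plus synthesis."""
    return {
        "seats": dict(assignment),           # seat -> model
        "synthesis": synthesis_model,
        "api_calls": len(assignment) + 1,
    }


def run(plan, call_model, approved=False):
    """Execute the plan only after approval; the seats argue in parallel."""
    if not approved:
        raise PermissionError("plan not approved; nothing was spent")
    with ThreadPoolExecutor(max_workers=len(plan["seats"])) as pool:
        futures = {
            seat: pool.submit(call_model, model, seat)
            for seat, model in plan["seats"].items()
        }
        # Collect every seat's case before the synthesis pass starts.
        cases = {seat: future.result() for seat, future in futures.items()}
    verdict = call_model(plan["synthesis"], "synthesis", cases)
    return cases, verdict
```

Threads are enough here because each seat spends its time waiting on a network call. For the four-seat setup the plan reports five API calls, which matches the sample plan shown later in this article.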

deep-decide · terminal

# see which providers are configured (reads keys from env)
$ python3 scripts/deep_decide.py --providers

# run it for real, after the plan is approved
$ python3 scripts/deep_decide.py --execute

# the output bundle
deep-decide-runs/<slug>/
├── verdict.md        # the ranked call + overridden dissent
├── combined.md       # all four cases, side by side
├── manifest.json     # models, calls, effort — the audit trail
└── perspectives/     # operator.md · skeptic.md · finance.md · stakeholder.md
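For readers who want to see how a bundle like the one listed above could be produced, here is a minimal sketch. Only the file and directory names come from the listing. The `write_bundle` function, the layout inside `combined.md`, and whatever fields you pass as `manifest` are assumptions of mine, not the script's real schema.

```python
import json
from pathlib import Path


def write_bundle(root, slug, verdict, cases, manifest):
    """Write one run's artifacts using the file names from the listing above."""
    run_dir = Path(root) / slug
    seat_dir = run_dir / "perspectives"
    seat_dir.mkdir(parents=True, exist_ok=True)

    (run_dir / "verdict.md").write_text(verdict, encoding="utf-8")

    # One file per seat, e.g. perspectives/skeptic.md
    for seat, text in cases.items():
        (seat_dir / f"{seat}.md").write_text(text, encoding="utf-8")

    # combined.md: every case in one file (the real layout is not specified).
    combined = "\n\n".join(f"## {seat}\n\n{text}" for seat, text in cases.items())
    (run_dir / "combined.md").write_text(combined, encoding="utf-8")

    # manifest.json: the fields are whatever the caller passes in (hypothetical).
    (run_dir / "manifest.json").write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return run_dir
```

Keeping each seat in its own file is what makes the later audit cheap: you can open one perspective without rereading the whole run.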

Watch a run

Here is the whole loop played out on a sample decision. The seat responses below are scripted for illustration, but the sequence (plan, approve, argue, synthesize) is exactly what the skill does.

Renegotiate the vendor contract, switch vendors, or do nothing? (Simulated demo)

plan · 5 API calls · $0 spent so far
operator → claude-opus-4-8 · skeptic → gpt-5 · finance → gemini-2.5-pro · stakeholder → claude-opus-4-8 · synthesis → claude-opus-4-8

Operator · claude-opus-4-8

Renegotiation is a two-week path with zero migration risk. Switching burns a quarter of the ops team’s year.

assumes: the incumbent will actually come to the table

Skeptic · gpt-5

The challenger’s reference customers are all smaller than us. That discount is bait until proven at our volume.

assumes: challenger pricing does not hold at scale

Finance · gemini-2.5-pro

Switching saves more on paper; renegotiation captures most of it with none of the transition cost or write-offs.

assumes: quoted incumbent concessions are real

Stakeholder · claude-opus-4-8

The ops team trusts the incumbent, and a mid-year switch lands on their busiest quarter. Adoption risk is real.

assumes: team capacity is as reported

1. Renegotiate current vendor — 3 seats for, 1 against

Overridden dissent — Stakeholder: renegotiation resets the relationship clock; book the savings as two-year, not recurring.

The full loop: plan → approve → four seats argue → synthesis → verdict with dissent

Configuration

These are the defaults; every one of them can be overridden with an environment variable. A single API key is enough to run. The moment a second key is present, the seats spread across providers automatically — and that is where the cross-model dissent comes from.

Provider	Requirement	Default model	Key variable
Anthropic	any one is enough	claude-opus-4-8	ANTHROPIC_API_KEY
OpenAI	optional	gpt-5	OPENAI_API_KEY
Google	optional	gemini-2.5-pro	GEMINI_API_KEY

Reasoning effort (DEEP_DECIDE_EFFORT): low · medium · high (default) · max

Your dial between cost and depth.
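One plausible way to get the automatic spreading described above is plain round-robin over whichever models are available. That is an assumption on my part: the article only shows the resulting mapping when all three keys are set, and `assign_seats` and `reasoning_effort` are hypothetical names. The effort values and the `high` default are the documented ones.

```python
import os
from itertools import cycle

SEATS = ["operator", "skeptic", "finance", "stakeholder"]
EFFORT_LEVELS = ("low", "medium", "high", "max")


def assign_seats(models):
    """Spread the four seats across the available models, round-robin."""
    if not models:
        raise ValueError("at least one provider key is required")
    rotation = cycle(models)
    return {seat: next(rotation) for seat in SEATS}


def reasoning_effort(env=None):
    """Read DEEP_DECIDE_EFFORT, falling back to the documented default 'high'."""
    env = os.environ if env is None else env
    effort = env.get("DEEP_DECIDE_EFFORT", "high").lower()
    if effort not in EFFORT_LEVELS:
        raise ValueError(f"DEEP_DECIDE_EFFORT must be one of {EFFORT_LEVELS}")
    return effort
```

With all three default models in the order Anthropic, OpenAI, Google, this reproduces the assignment from the sample plan: operator and stakeholder on claude-opus-4-8, skeptic on gpt-5, finance on gemini-2.5-pro. With a single model, every seat shares it.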

The run that paid for itself

The reason I’m writing this up instead of leaving it in the repo: I pointed deep-decide at a final deal document. That document had already been reviewed by Claude Opus at maximum effort, with extended thinking on — and the review found nothing wrong.


An error was hiding in that “final” document, worth about $56,000 in extra cost every year. deep-decide’s adversarial pass caught it before the document was signed.

Cost of the error, per year: $56,000

Cost of the deep-decide run: a few dollars of API calls

Drawn to scale — the run barely registers. That is the point. (Run cost illustrative; varies with effort and providers.)

The skeptic seat, arguing against the document instead of summarizing it, flagged a figure the confident single-pass review had glided over. That is the entire thesis in one anecdote.

The model didn’t need to be smarter. It needed an opponent.

Inside the verdict

Here is what a verdict.md looks like. This is a made-up example — not the real deal document — but the format is exact:

deep-decide-runs/vendor-switch/verdict.md (illustrative output)

Rank	Option	Operator	Skeptic	Finance	Stakeholder
1	Renegotiate current vendor: lowest execution risk, captures most of the savings	✓ for	✓ for	✓ for	✕ against
2	Switch to challenger vendor: best unit price, highest migration risk	✕ against	✕ against	✓ for	✓ for
3	Do nothing this cycle: only defensible if switching costs suddenly rise	✓ for	✕ against	✕ against	✕ against

Overridden dissent

Stakeholder: the vendor will likely push prices back up at the next renewal. If you proceed, count these savings for two years — not forever.

The 3–1 split stays visible — the objection ships with the verdict

What it’s not

It never invents a fact, number, or source. If the inputs are thin, the verdict says so instead of pretending.
It doesn’t browse the web mid-run, and it doesn’t act on the decision. It evaluates; you decide.
It doesn’t dress up uncertainty as confidence. A close call reads like a close call.

When to skip it

This is a tool for decisions that are expensive or slow to reverse. If the choice is cheap and easy to undo, just decide. If you haven’t done the underlying research yet, do that first — deep-decide weighs the options you give it; it doesn’t gather the facts for you. And if you only have one real option, you don’t have a decision — you have a task.

A quick pressure-test checklist

Would this decision be costly or slow to reverse? If not, skip the ceremony and just decide.
Do you already have the facts, so the hard part is weighing them? If not, do the research first.
Do you have at least two options you could genuinely defend? If not, it’s a task, not a decision.
Would you pay a few dollars to hear the strongest case against the option you’re leaning toward? If yes, run it.


Research basis

None of the design is guesswork — every mechanism traces back to published research.

Debate beats monologue — Du et al. 2023 (arXiv:2305.14325); Liang et al. 2023 (arXiv:2305.19118); Khan et al. 2024 (arXiv:2402.06782); Chan et al. 2023, ChatEval (arXiv:2308.07201).
Diversity of models beats one model — Wang et al. 2024, Mixture-of-Agents (arXiv:2406.04692); Wang et al. 2022, Self-Consistency (arXiv:2203.11171); Madaan et al. 2023, Self-Refine (arXiv:2303.17651).
Synthesis with preserved dissent — Zheng et al. 2023, LLM-as-a-Judge (arXiv:2306.05685); Gu et al. 2024, survey (arXiv:2411.15594); Guerdan et al. 2025, on why judges shouldn’t fake certainty (arXiv:2503.05965).
Multi-agent architecture — Wu et al. 2023, AutoGen (arXiv:2308.08155); Liu et al. 2023, AgentBench (arXiv:2308.03688).

FAQ

Do I need API keys for all three providers?

No. Any one key is enough to run — the four seats will share that provider. Cross-model dissent switches on automatically the moment a second key is present, and that is where the uncorrelated judgment comes from, so two or more is recommended for decisions that matter.

What does a run cost?

A handful of API calls — the plan step shows you the exact provider breakdown and call count before anything is spent, and nothing runs until you approve. The DEEP_DECIDE_EFFORT dial (low → max) trades cost against depth; the default is high.

Is my decision data sent anywhere besides the model providers?

No. There is no hosted service and no framework — one Python script, standard library only, reading keys from your environment. Every artifact of the run (verdict, per-seat analyses, manifest) lands on your own disk under deep-decide-runs/.

When should I not use it?

When the decision is cheap and reversible (just decide), when you haven’t done the underlying research (deep-decide arbitrates between options; it doesn’t generate facts), or when there is only one real option — that’s a task, not a decision.

The skill, the script, and the full reading list are on GitHub: bogdanbaciu21/skills → deep-decide. Clone it, set one key, and pressure-test the next decision you were about to make alone.