Geodocs.dev

Agent Multi-Step Reasoning Specification: ReAct, Plan-and-Execute, and Reflection

Multi-step reasoning patterns turn a one-shot LLM into an agent that decomposes problems, calls tools, and self-corrects. This specification compares ReAct, Plan-and-Execute, Reflexion, Tree of Thoughts, and Self-Consistency, and defines the max-step caps, loop detection, and evaluation methodology you need to ship them safely.

TL;DR

Use ReAct as the default for tool-using agents. Switch to Plan-and-Execute when the workflow is long and parallelizable. Add Reflexion when the agent can validate its own output. Reach for Tree of Thoughts only when correctness matters more than cost. Self-Consistency is a sampling-time wrapper, not a loop. Always cap steps, detect loops, and evaluate on a frozen task set.

Why a reasoning-pattern spec exists

Agent loops differ in cost, latency, and failure modes. A team that runs ReAct everywhere overpays on simple tasks; a team that runs Tree of Thoughts everywhere bankrupts itself on token spend. Picking the wrong pattern also creates subtle correctness failures — a planner that cannot reflect, or a reflector that does not verify against a tool. This specification gives a structured way to choose.

Pattern catalogue

ReAct (Reason + Act)

  • Loop: Thought → Action → Observation → Thought → … → Final.
  • Strength: Tightly couples reasoning with tool calls; great for retrieval-grounded answers.
  • Weakness: Can spiral on ambiguous tasks; benefits from a step cap.
  • Source: Yao et al., 2022 (ReAct paper).

Plan-and-Execute

  • Loop: Plan → Execute step 1 → … → Execute step N → (optional) Replan.
  • Strength: Cheap planning model + cheap executor; steps can run in parallel.
  • Weakness: Stale plans when the world changes mid-execution; needs replan triggers.
  • Source: Wang et al., 2023 (Plan-and-Solve paper) and follow-ups.

Reflexion

  • Loop: Try → Observe outcome → Reflect → Retry with lessons.
  • Strength: Improves on tasks with a verifiable signal (test pass/fail, schema validation).
  • Weakness: Wastes tokens when the verifier is unreliable; never use without one.
  • Source: Shinn et al., 2023 (Reflexion paper).
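
The Try → Observe → Reflect → Retry loop above can be sketched as follows. `generate` and `verify` are hypothetical callables standing in for the LLM and the verifier; the key property is that the verifier's feedback becomes a lesson for the next attempt, and retries are capped:

```python
def reflexion_loop(task, generate, verify, max_retries=3):
    """Try -> Observe -> Reflect -> Retry, with a hard retry cap.
    generate(task, lessons) -> candidate and verify(candidate) -> (ok, feedback)
    are hypothetical callables; verify must return a real pass/fail signal."""
    lessons = []
    for _attempt in range(1 + max_retries):
        candidate = generate(task, lessons)
        ok, feedback = verify(candidate)
        if ok:
            return candidate
        lessons.append(feedback)  # carry the verifier's signal into the retry
    raise RuntimeError(f"verifier never passed after {max_retries} retries")
```

Note that without a trustworthy `verify`, the loop degenerates into expensive retries, which is exactly the weakness called out above.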

Tree of Thoughts (ToT)

  • Loop: Branch into k candidate thoughts → Score → Expand best → … → Pick path.
  • Strength: Best-known search-style reasoning for combinatorial problems.
  • Weakness: Cost can multiply input tokens by branching factor; rarely worth it for routine workflows.
  • Source: Yao et al., 2023 (Tree of Thoughts paper).
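
A minimal sketch of the branch → score → expand loop above, as a breadth-limited (beam-style) search. `expand` and `score` are hypothetical LLM-backed callables; the depth and branching caps bound cost, per the weakness noted above:

```python
def tot_search(root, expand, score, k=3, depth=2):
    """Breadth-limited thought search: branch each frontier node, keep the
    k best-scoring candidates, stop at the depth cap, return the best path end.
    expand(thought) -> list of candidate thoughts and score(thought) -> float
    are hypothetical callables."""
    frontier = [root]
    for _ in range(depth):
        candidates = [c for t in frontier for c in expand(t)]
        if not candidates:
            break
        # keep only the k best branches -- bounds cost at roughly k expansions per level
        frontier = sorted(candidates, key=score, reverse=True)[:k]
    return max(frontier, key=score)
```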

Self-Consistency

  • Loop: Sample N independent reasoning chains → Majority-vote the answer.
  • Strength: Cheap accuracy boost on tasks with a clear answer surface (math, classification).
  • Weakness: Useless when answers are open-ended (essays, creative output).
  • Source: Wang et al., 2022 (Self-Consistency paper).
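
Because Self-Consistency is a sampling-time wrapper rather than a loop, the sketch is small. `sample_answer` is a hypothetical callable that runs one full reasoning chain and returns its final answer:

```python
from collections import Counter

def self_consistency(sample_answer, n=7):
    """Draw n independent chains and majority-vote the final answer.
    Returns the winning answer plus its agreement ratio, which is a
    useful cheap confidence signal."""
    answers = [sample_answer() for _ in range(n)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / n
```

The agreement ratio only makes sense when answers collide on a clear surface (numbers, labels); on open-ended output every chain is unique and the vote is meaningless.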

Pattern selection

Workload signal → Pattern

  • Tool-using, single answer → ReAct
  • Long task, many parallelizable steps → Plan-and-Execute
  • Verifiable outcome (tests, schema) → Reflexion
  • Combinatorial search, cost is fine → Tree of Thoughts
  • Numeric/classification answer → Self-Consistency wrapper around any pattern
  • Simple, single-shot Q&A → No agent loop — just call the LLM

Default to the simplest pattern that passes your eval. Upgrade only when metrics force you.

Max-step caps

Every loop has a hard ceiling on iterations. Without one, a confused agent spins forever and burns tokens.

  • ReAct: 6-12 steps for typical tool-using tasks; raise only with a trace audit.
  • Plan-and-Execute: cap on plan size at planning time; cap retries per step.
  • Reflexion: cap on retry count (often 2-3) so failures surface quickly.
  • Tree of Thoughts: cap on tree depth and branching factor.
  • Self-Consistency: cap on sample count (commonly 5-15 chains).

When the cap is hit, stop and return the best partial result with an explicit step_cap_reached flag in the trace.
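
The cap-and-flag behavior can be sketched as a thin wrapper; `step_fn` is a hypothetical callable that runs one iteration and returns a final answer or `None`:

```python
def run_with_cap(step_fn, max_steps, trace):
    """Hard iteration ceiling: run step_fn until it returns a final answer
    or the cap fires, then record the flag in the trace.
    step_fn(i) -> answer-or-None is a hypothetical single-iteration callable."""
    for i in range(max_steps):
        answer = step_fn(i)
        if answer is not None:
            return answer
    trace["step_cap_reached"] = True  # explicit flag, per the rule above
    return None  # caller falls back to the best partial result
```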

Loop detection

Loops are the most common pathological mode. Detect them before the cap fires:

  1. Hash each (thought, tool, args) triple. If a hash repeats with no progress, abort.
  2. Track tool-call sequences; the same tool with the same args twice in a row is a loop.
  3. Watch for monotonic context growth without state changes — the agent is repeating itself.
  4. Surface a loop_detected event in the trace and either replan or fail with a clear message.
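
Step 1 above can be sketched with a stable hash over each triple. The set of seen hashes is maintained by the caller across steps; all names here are illustrative:

```python
import hashlib
import json

def action_hash(thought, tool, args):
    """Stable digest of a (thought, tool, args) triple for loop detection."""
    payload = json.dumps([thought, tool, args], sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def seen_before(seen_hashes, thought, tool, args):
    """True when the exact triple has already occurred this run -- the
    no-progress signal that should trigger a loop_detected trace event."""
    h = action_hash(thought, tool, args)
    if h in seen_hashes:
        return True
    seen_hashes.add(h)
    return False
```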

Intermediate-result validation

Validate every tool output before the next reasoning step:

  • Schema-validate JSON outputs.
  • Type-check numeric outputs.
  • Sanity-check that retrieved chunks actually came from the requested tenant and ACL.
  • For Reflexion, the validator decides whether to retry; without a validator, Reflexion is just expensive ReAct.
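
The first two checks can be sketched as a minimal structural validator. This is a stand-in for full JSON Schema validation (which a library such as jsonschema would provide); the schema shape here — a dict of field name to expected Python type — is an assumption for illustration:

```python
def validate_tool_output(output, schema):
    """Minimal structural check before the next reasoning step: every
    required field must exist with the expected Python type."""
    for key, expected_type in schema.items():
        if key not in output:
            return False, f"missing field: {key}"
        if not isinstance(output[key], expected_type):
            return False, f"wrong type for {key}: {type(output[key]).__name__}"
    return True, "ok"
```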

Cost amplification awareness

Multi-step reasoning multiplies cost in two ways:

  • Step count — N reasoning steps make roughly N LLM calls (more if there are sub-prompts).
  • Branching — Tree of Thoughts and Self-Consistency multiply by the branching factor or sample count.

Budget tokens before launch. Track actual cost in the trace using gen_ai.cost.* attributes (see the trace instrumentation spec). Alert when a single run exceeds the budget by 2x.
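
The budget alert can be sketched as a sum over per-step span attributes. The attribute name `gen_ai.cost.total` is an assumption mirroring the `gen_ai.cost.*` convention referenced above; `alert` is any hypothetical notification callable:

```python
def check_budget(trace_spans, budget_usd, alert):
    """Sum per-step cost attributes and alert when a single run exceeds
    the budget by 2x, per the rule above."""
    total = sum(span.get("gen_ai.cost.total", 0.0) for span in trace_spans)
    if total > 2 * budget_usd:
        alert(f"run cost ${total:.2f} exceeded 2x budget ${budget_usd:.2f}")
    return total
```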

Evaluation methodology

  1. Build a frozen task set with ground truth (or a reliable verifier).
  2. Evaluate the same agent under each candidate pattern.
  3. Report accuracy, p95 latency, and cost per task.
  4. Reject any pattern that wins on accuracy alone if the cost gap is large.
  5. Rerun on every model upgrade; pattern winners often shift.

Keep evaluation prompts and tools versioned. A pattern that passed yesterday's eval can fail tomorrow's tool change.
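
Steps 1-3 above can be sketched as a small harness. `run_agent(task) -> (answer, cost_usd)` is a hypothetical adapter around the agent under one candidate pattern, and the task-dict shape is an assumption for illustration:

```python
import statistics
import time

def evaluate_pattern(run_agent, tasks):
    """Run one candidate pattern over a frozen task set and report
    accuracy, p95 latency, and mean cost per task."""
    correct, latencies, costs = 0, [], []
    for task in tasks:
        start = time.perf_counter()
        answer, cost = run_agent(task["input"])
        latencies.append(time.perf_counter() - start)
        costs.append(cost)
        correct += int(answer == task["expected"])
    # 95th percentile; statistics.quantiles needs at least two data points
    p95 = statistics.quantiles(latencies, n=20)[-1] if len(latencies) > 1 else latencies[0]
    return {
        "accuracy": correct / len(tasks),
        "p95_latency_s": p95,
        "cost_per_task": sum(costs) / len(costs),
    }
```

Running this once per candidate pattern over the same frozen tasks yields the comparable accuracy/latency/cost triples that step 4 judges.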

Reference skeleton (ReAct)

def react_loop(task, tools, max_steps=10):
    history = []  # (thought, action, observation) triples so far
    for step in range(max_steps):
        thought, action = llm.plan(task, history)
        if action.kind == "final":
            return action.answer  # model decided it is done
        observation = tools[action.name](**action.args)
        if loop_detected(history, action, observation):
            return fallback(history)  # repeated state: abort instead of spinning
        history.append((thought, action, observation))
    return best_partial(history)  # cap hit: also flag step_cap_reached in the trace

Reference skeleton (Plan-and-Execute)

def plan_and_execute(task, tools):
    plan = list(llm.plan(task))
    results = []
    while plan:  # consume a queue so a replan can replace the remaining steps
        step = plan.pop(0)
        results.append(tools[step.tool](**step.args))
        if needs_replan(results):
            plan = list(llm.replan(task, results))
    return llm.synthesize(task, results)

Validation checklist

  • [ ] One reasoning pattern per agent run, recorded in the trace.
  • [ ] Step cap enforced.
  • [ ] Loop detection enabled.
  • [ ] Intermediate outputs validated.
  • [ ] Reflexion only used with a real verifier.
  • [ ] Cost budget enforced and alerted on.
  • [ ] Eval set is frozen and re-run on model upgrades.

FAQ

Q: Which pattern should I start with?

ReAct. It is the most general and the easiest to debug. Move to other patterns only when ReAct fails on your eval.

Q: Can I mix patterns in one agent?

Yes. A common composition is Plan-and-Execute with each step running ReAct internally. Document the composition explicitly so traces stay legible.

Q: When is Tree of Thoughts worth the cost?

When you have a small set of high-stakes tasks (e.g., synthesis from many candidate plans) and a clear way to score branches. For routine tool use, ToT is overkill.

Q: How do I prevent a Reflexion loop from running forever?

Cap retry count and require the verifier to return a binary success signal. If your verifier is fuzzy, Reflexion can retry indefinitely without progress.

Q: How does Self-Consistency interact with cost?

It linearly multiplies LLM cost by the sample count. Use it on cheap models or restricted task surfaces; do not apply it blindly to long-form generation.

Related Articles

Agent Knowledge Base Specification: Structure, Refresh, and Versioning

Production specification for AI agent knowledge bases: document model, chunking strategies, metadata enrichment, refresh cadence, version pinning, and rollback.

Agent Memory Pattern Specification: Short-Term, Long-Term, and Episodic

Specification for AI agent memory: working, episodic, semantic, and procedural tiers with consolidation, eviction, and PII handling.

Agent Permission Model Specification: RBAC, Scopes, and Tool-Level Auth

Production specification for AI agent permissions: RBAC, OAuth scope mapping, tool-level auth, consent prompts, time-bound grants, and MCP propagation.
