Multi-Agent Architectures: Orchestration Patterns That Work in Production
Supervisor, router, chain, and consensus patterns for multi-agent systems. Failure modes, recovery strategies, and production code.
TL;DR
- Four orchestration patterns dominate production multi-agent systems: supervisor (central coordinator), router (classifier-based dispatch), chain (sequential handoff), and consensus (parallel execution with aggregation)
- Each pattern trades off between coordination overhead, failure isolation, latency, and complexity — there is no universal best pattern
- Failure recovery is the difference between a demo and a production system: timeouts, circuit breakers, fallback agents, and state checkpointing are non-negotiable
Multi-agent systems decompose a complex task into subtasks handled by specialized agents. The promise: agents that are experts in their domain produce better results than a single generalist agent handling everything. The reality: the orchestration layer that coordinates these agents is where production systems succeed or fail.
This guide covers the four orchestration patterns that work in production, their failure modes, and the recovery mechanisms that make them reliable.
Pattern 1: Supervisor
The supervisor pattern uses a central coordinator agent that receives the user request, decomposes it into subtasks, delegates each subtask to a specialist agent, and synthesizes the results into a final response.
```python
import asyncio

class SupervisorAgent:
    """Central coordinator that plans, delegates, and synthesizes."""

    def __init__(self, planner_llm, specialists: dict[str, Agent]):
        self.planner = planner_llm
        self.specialists = specialists

    async def run(self, request: str) -> SupervisorResult:
        # Step 1: Plan — decompose into subtasks. The supervisor decides
        # what needs to happen and in what order.
        plan = await self.planner.generate(
            prompt=f'Decompose this request into subtasks: {request}',
            schema=TaskPlan,
        )

        # Step 2: Delegate — send each subtask to the right specialist.
        results = {}
        for task in plan.tasks:
            agent = self.specialists[task.agent_type]
            try:
                # Each delegation has a timeout — no unbounded waits.
                result = await asyncio.wait_for(
                    agent.execute(task),
                    timeout=task.timeout_seconds,
                )
                results[task.id] = result
            except asyncio.TimeoutError:
                results[task.id] = TaskResult(status='timeout', output=None)

        # Step 3: Synthesize — combine results into the final response.
        # Synthesis handles partial failures — missing results are noted, not fatal.
        return await self.synthesize(request, plan, results)
```
When to use: The supervisor pattern works well when tasks have clear dependencies (task B depends on the output of task A), when the decomposition logic is complex enough to benefit from LLM planning, and when you need a single point of coordination for monitoring and debugging.
Failure modes:
- Supervisor bottleneck: Every request passes through the supervisor. If the planning LLM is slow or the synthesis step is expensive, the supervisor becomes the throughput bottleneck.
- Planning errors: The supervisor might decompose the task incorrectly — assigning a subtask to the wrong specialist, missing a required subtask, or creating unnecessary subtasks. Planning errors cascade into wasted computation and incorrect results.
- Single point of failure: If the supervisor fails, the entire pipeline fails. Unlike the router pattern, there is no fallback path that bypasses the coordinator.
Mitigation: Add a planning validation step where the supervisor checks its own plan against a schema before executing. Cache plans for recurring request types to avoid repeated planning overhead. Implement supervisor-level circuit breakers that switch to a simplified pipeline when the supervisor is degraded.
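To make the validation step concrete, here is a minimal sketch of schema-level plan checking, assuming the `TaskPlan` and specialist registry from the `SupervisorAgent` example above. The `depends_on` field and the helper name are illustrative, not part of the original code:

```python
def validate_plan(plan: TaskPlan, specialists: dict[str, Agent]) -> list[str]:
    """Return a list of problems; an empty list means the plan is executable."""
    problems = []
    known_ids = {task.id for task in plan.tasks}
    for task in plan.tasks:
        # Every subtask must map to a registered specialist.
        if task.agent_type not in specialists:
            problems.append(f'{task.id}: unknown specialist {task.agent_type!r}')
        # Dependencies must reference tasks that exist in the plan (illustrative field).
        for dep in getattr(task, 'depends_on', []):
            if dep not in known_ids:
                problems.append(f'{task.id}: missing dependency {dep!r}')
    return problems
```

If validation returns problems, re-prompt the planner with the problem list appended rather than executing a broken plan.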
Pattern 2: Router
The router pattern uses a classifier (LLM-based or trained) to dispatch each request to a single specialist agent based on the request type. There is no central planning or synthesis — each request goes to exactly one agent.
```python
import asyncio

class RouterAgent:
    """Classifier-based dispatch — each request goes to one specialist."""

    def __init__(self, classifier, specialists: dict[str, Agent], fallback: Agent):
        self.classifier = classifier
        self.specialists = specialists
        self.fallback = fallback

    async def run(self, request: str) -> RouterResult:
        # Classify the request. Classification must be fast — this runs on every request.
        classification = await self.classifier.classify(request)

        # High confidence: route to the matching specialist.
        if classification.confidence > 0.8:
            agent = self.specialists.get(classification.category)
            if agent:
                return await self.execute_with_fallback(agent, request)

        # Low confidence or unknown category: the fallback handles
        # anything the router cannot classify.
        return await self.fallback.execute(request)

    async def execute_with_fallback(self, agent: Agent, request: str) -> RouterResult:
        try:
            result = await asyncio.wait_for(agent.execute(request), timeout=15)
            if result.quality_score < 0.5:
                return await self.fallback.execute(request)  # Quality gate
            return result
        except (asyncio.TimeoutError, AgentError):
            return await self.fallback.execute(request)
```
When to use: The router pattern works well when requests fall into distinct categories, each category maps cleanly to a specialist, and you do not need to combine multiple specialists’ outputs. Customer support triage (billing questions, technical support, account management) is the canonical router use case.
Failure modes:
- Misclassification: The router sends a request to the wrong specialist. The specialist produces a confident but incorrect response because it is operating outside its domain. This is harder to detect than a failure — the output looks plausible but is wrong.
- Category gaps: A request that does not fit any category gets routed to the fallback agent. If the fallback is a generalist model, quality drops. If there is no fallback, the request fails.
- Boundary ambiguity: Requests that span multiple categories (a billing question that requires technical context) get routed to one specialist that only has part of the required knowledge.
Mitigation: Monitor misclassification rates by comparing the router’s classification with post-hoc evaluation of the specialist’s response quality. Add quality gates after specialist execution — if the output quality is below threshold, fall back to a generalist or re-route to a different specialist. Track category distribution and boundary cases to identify when new specialist categories are needed.
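As a sketch of that monitoring loop, the class below records post-hoc quality scores per routed category and flags categories whose average quality suggests misrouting. The evaluator that produces the scores and all thresholds here are assumptions:

```python
from collections import defaultdict

class RoutingMonitor:
    """Track post-hoc quality per routed category to surface misclassification."""

    def __init__(self, alert_threshold: float = 0.6):
        self.scores = defaultdict(list)  # category -> post-hoc quality scores
        self.alert_threshold = alert_threshold

    def record(self, category: str, quality_score: float) -> None:
        self.scores[category].append(quality_score)

    def degraded_categories(self, min_samples: int = 50) -> list[str]:
        """Categories whose average quality is low enough to suggest misrouting."""
        return [
            cat for cat, vals in self.scores.items()
            if len(vals) >= min_samples
            and sum(vals) / len(vals) < self.alert_threshold
        ]
```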
Pattern 3: Chain
The chain pattern connects agents in a fixed sequence. Each agent processes the request (or the previous agent’s output) and passes its result to the next agent. Think of it as a pipeline where each stage adds a transformation.
```python
import asyncio

class ChainOrchestrator:
    """Sequential pipeline — each agent transforms and passes to the next."""

    def __init__(self, agents: list[ChainAgent]):
        self.agents = agents

    async def run(self, request: str) -> ChainResult:
        context = ChainContext(original_request=request, current_input=request)
        checkpoints = []

        # Each agent sees the previous agent's output as input.
        for i, agent in enumerate(self.agents):
            try:
                # Checkpoint state before each step
                checkpoints.append(context.snapshot())

                result = await asyncio.wait_for(
                    agent.process(context),
                    timeout=agent.timeout_seconds,
                )
                context.current_input = result.output
                context.chain_history.append(StepResult(agent=agent.name, output=result))

            except asyncio.TimeoutError:
                # On failure: either skip the step or roll back to the last checkpoint.
                if agent.required:
                    return ChainResult(status='failed', step=i, context=context)
                # Optional step: skip and continue
                context.chain_history.append(StepResult(agent=agent.name, skipped=True))

        return ChainResult(status='complete', output=context.current_input, context=context)
```
When to use: The chain pattern works well when the task has a natural sequential structure: extract entities, then classify them, then generate a summary based on the classified entities. Each agent in the chain has a clear, narrow responsibility. The chain pattern is also the easiest to debug because the execution path is fixed and each intermediate output can be inspected.
Failure modes:
- Error propagation: If agent 2 in a 5-agent chain produces a bad output, agents 3-5 operate on corrupted input. The final output may look plausible but is based on an early-stage error that is hard to trace back.
- Latency accumulation: Total latency is the sum of all agents’ latencies. A 5-agent chain where each agent takes 3 seconds has a 15-second minimum latency.
- Rigidity: The fixed sequence cannot adapt to inputs that would benefit from a different processing order or skipping unnecessary steps.
Mitigation: Add quality checkpoints between agents — verify the output of each stage before passing it to the next. Mark agents as required or optional so the chain can skip non-essential steps on failure. Implement state checkpointing so the chain can resume from the last successful step rather than restarting.
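A minimal sketch of those inter-stage checkpoints, assuming each chain agent optionally carries a `validator` callable (an illustrative addition to the `ChainAgent` interface above):

```python
async def run_with_stage_checks(agents: list[ChainAgent], context: ChainContext) -> ChainResult:
    for i, agent in enumerate(agents):
        result = await agent.process(context)
        # Verify the stage output before the next stage consumes it.
        validator = getattr(agent, 'validator', None)
        if validator is not None and not validator(result.output):
            if agent.required:
                # Fail fast instead of letting a bad output propagate downstream.
                return ChainResult(status='failed', step=i, context=context)
            continue  # optional stage: drop its output, keep the prior input
        context.current_input = result.output
    return ChainResult(status='complete', output=context.current_input, context=context)
```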
Pattern 4: Consensus
The consensus pattern runs multiple agents in parallel on the same task and aggregates their outputs. This is useful when reliability matters more than speed — multiple independent opinions reduce the probability of a single agent’s error reaching the user.
```python
import asyncio

class ConsensusOrchestrator:
    """Run multiple agents in parallel, aggregate for reliability."""

    def __init__(self, agents: list[Agent], aggregator, min_agreement: float = 0.6):
        self.agents = agents
        self.aggregator = aggregator
        self.min_agreement = min_agreement

    async def run(self, request: str) -> ConsensusResult:
        # Run all agents in parallel — total latency is max(agent latencies), not sum.
        tasks = [agent.execute(request) for agent in self.agents]
        results = await asyncio.gather(*tasks, return_exceptions=True)

        # Filter out failures
        valid_results = [r for r in results if not isinstance(r, Exception)]

        if len(valid_results) < 2:
            return ConsensusResult(status='insufficient_responses', confidence=0)

        # Aggregate — measure agreement between agents to assess confidence.
        consensus = self.aggregator.aggregate(valid_results)

        if consensus.agreement_score < self.min_agreement:
            return ConsensusResult(
                status='low_agreement',
                confidence=consensus.agreement_score,
                outputs=valid_results,  # Return all for human review
            )

        return ConsensusResult(
            status='consensus_reached',
            output=consensus.merged_output,
            confidence=consensus.agreement_score,
        )
```
When to use: High-stakes decisions where a single agent’s error is costly — medical triage, financial analysis, legal review. Also useful when you want to detect uncertainty: low agreement between agents signals that the task is ambiguous or that the agents lack sufficient information.
Failure modes:
- Correlated errors: If all agents use the same underlying model, they tend to make the same mistakes. Running GPT-4 three times gives you three copies of the same bias, not three independent opinions. Use different model families or different prompting strategies to get genuine diversity.
- Aggregation difficulty: For open-ended generation (as opposed to classification), merging multiple responses is non-trivial. The aggregator itself may introduce errors or lose nuance from individual responses.
- Cost multiplication: Running N agents in parallel multiplies cost by N. For tasks where a single agent is usually correct, the consensus pattern wastes compute.
Mitigation: Use diverse agent configurations — different models, different prompting strategies, different context. For classification tasks, use majority voting. For generation tasks, use an LLM aggregator that identifies the consensus elements across responses. Set a cost budget and only apply consensus to high-stakes requests identified by a lightweight classifier.
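For the classification case, majority voting is straightforward. A minimal aggregator sketch that plugs into the `ConsensusOrchestrator` above (the `Consensus` dataclass is illustrative):

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Consensus:
    merged_output: str
    agreement_score: float

class MajorityVoteAggregator:
    """Majority vote for classification; agreement is the winning label's vote share."""

    def aggregate(self, results: list) -> Consensus:
        labels = [r.output for r in results]
        label, votes = Counter(labels).most_common(1)[0]
        return Consensus(merged_output=label, agreement_score=votes / len(labels))
```

With three agents voting ['billing', 'billing', 'technical'], this returns 'billing' with an agreement score of about 0.67, which clears the orchestrator's default 0.6 threshold.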
Choosing a Pattern
Pattern Selection Criteria
- Supervisor: tasks have dependencies, need central coordination
- Router: requests fall into distinct non-overlapping categories
- Chain: task has a natural sequential processing order
- Consensus: reliability matters more than latency or cost
What Each Pattern Optimizes For
- Supervisor: flexibility and complex task decomposition
- Router: latency and throughput (single agent per request)
- Chain: debuggability and narrow agent responsibilities
- Consensus: reliability and uncertainty detection
In practice, production systems combine patterns. A router dispatches requests to different chains. A supervisor uses consensus for high-stakes subtasks. A chain includes a router step that selects the next agent based on intermediate results.
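As a composition sketch, assuming the orchestrators above are wrapped to expose the same `execute()` interface as a plain agent, a router can dispatch to chains directly. Every agent instance named here is a placeholder:

```python
# Illustrative wiring: a router whose "specialists" are chains.
support_chain = ChainOrchestrator([extract_agent, diagnose_agent, respond_agent])
billing_chain = ChainOrchestrator([extract_agent, billing_agent, respond_agent])

router = RouterAgent(
    classifier=classifier,
    specialists={'technical': support_chain, 'billing': billing_chain},
    fallback=generalist_agent,
)
```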
Failure Recovery: The Production Requirement
Every orchestration pattern needs four failure recovery mechanisms to be production-ready.
1. Timeouts
Every agent call must have a timeout. Without timeouts, a single slow agent can block the entire pipeline indefinitely. Set timeouts based on the agent’s observed latency distribution — typically p99 latency plus a buffer.
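A minimal sketch of deriving that timeout from observed latencies, assuming you already collect per-agent latency samples; the buffer factor is a tunable assumption:

```python
import statistics

def timeout_from_latencies(latencies_s: list[float], buffer_factor: float = 1.25) -> float:
    """Timeout = observed p99 latency times a safety buffer, in seconds."""
    p99 = statistics.quantiles(latencies_s, n=100)[98]  # 99th percentile cut point
    return p99 * buffer_factor
```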
2. Circuit Breakers
If an agent fails repeatedly (3-5 consecutive failures), stop sending it requests for a cooldown period. This prevents cascading failures where a degraded agent consumes resources while producing bad outputs. After the cooldown, send a single probe request to check if the agent has recovered.
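A minimal circuit-breaker sketch following those thresholds; the failure count and cooldown are tunable assumptions:

```python
import time

class CircuitBreaker:
    """Open after N consecutive failures; allow one probe after the cooldown."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 60.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.consecutive_failures = 0
        self.opened_at: float | None = None
        self.probe_in_flight = False

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True  # closed: traffic flows normally
        cooled_down = time.monotonic() - self.opened_at >= self.cooldown_s
        if cooled_down and not self.probe_in_flight:
            self.probe_in_flight = True  # half-open: let exactly one probe through
            return True
        return False  # open: shed load to the fallback path

    def record_success(self) -> None:
        self.consecutive_failures = 0
        self.opened_at = None
        self.probe_in_flight = False

    def record_failure(self) -> None:
        self.consecutive_failures += 1
        self.probe_in_flight = False
        if self.consecutive_failures >= self.failure_threshold:
            self.opened_at = time.monotonic()  # re-open and restart the cooldown
```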
3. Fallback Agents
Every specialist agent needs a fallback path. This might be a generalist agent that handles the task at lower quality, a cached response from a similar previous request, or a graceful degradation message that tells the user what the system cannot do right now.
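A sketch of that ordered fallback path: specialist, then generalist, then a cached similar response, then graceful degradation. `AgentError`, the cache interface, and `DegradedResponse` are illustrative names:

```python
import asyncio

async def execute_with_fallbacks(specialist, generalist, cache, request: str):
    """Try specialist, then generalist, then cache, then degrade gracefully."""
    for agent in (specialist, generalist):
        try:
            return await asyncio.wait_for(agent.execute(request), timeout=15)
        except (asyncio.TimeoutError, AgentError):
            continue  # fall through to the next option
    cached = cache.lookup_similar(request)  # illustrative cache interface
    if cached is not None:
        return cached
    # Last resort: tell the user what the system cannot do right now.
    return DegradedResponse('This request cannot be completed right now.')
```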
4. State Checkpointing
For long-running multi-agent pipelines, checkpoint intermediate state so the pipeline can resume from the last successful step rather than restarting from scratch. This matters for chains and supervisor patterns where early steps may complete successfully before a later step fails.
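A minimal resume-from-checkpoint sketch for a chain, assuming a persistent store keyed by run id; the store interface is illustrative:

```python
async def run_resumable(agents: list[ChainAgent], run_id: str, store, context: ChainContext):
    """Resume from the last successful step instead of restarting from scratch."""
    checkpoint = store.load_latest(run_id)  # illustrative store interface
    last_done = -1
    if checkpoint is not None:
        context, last_done = checkpoint.context, checkpoint.step
    for i, agent in enumerate(agents):
        if i <= last_done:
            continue  # this step already succeeded in a previous attempt
        result = await agent.process(context)
        context.current_input = result.output
        # Persist state after each successful step so a later failure can resume here.
        store.save_checkpoint(run_id, step=i, context=context)
    return context.current_input
```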
Observability
Multi-agent systems require observability beyond what single-agent systems need. Track the following; a minimal instrumentation sketch appears after the list:
- Per-agent latency: Identify bottleneck agents and set informed timeouts.
- Per-agent error rate: Detect degraded agents before they affect users.
- Orchestration overhead: Time spent in routing, planning, and synthesis versus time in agent execution.
- Token consumption per agent: Identify agents that consume disproportionate tokens relative to their contribution.
- End-to-end traces: Full request traces that show the path through the agent system, including retries and fallbacks.
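A minimal per-agent instrumentation sketch that wraps every agent call to record latency, errors, and token usage. The metrics and tracing APIs here are assumptions standing in for whatever backend you use:

```python
import time

async def traced_execute(agent, request, metrics, trace):
    """Wrap an agent call with a trace span and per-agent metrics."""
    span = trace.start_span(f'agent.{agent.name}')  # illustrative tracing API
    start = time.monotonic()
    try:
        result = await agent.execute(request)
        metrics.observe(f'agent.{agent.name}.latency_s', time.monotonic() - start)
        metrics.incr(f'agent.{agent.name}.tokens', result.tokens_used)
        return result
    except Exception:
        metrics.incr(f'agent.{agent.name}.errors', 1)
        raise  # let the orchestrator's fallback logic handle it
    finally:
        span.end()
```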
Where Clarity Fits
Clarity’s self-model API adds user context to multi-agent orchestration. A supervisor that understands the user can make better delegation decisions — routing technical users to detailed analysis agents and non-technical users to summary agents. A router that knows the user’s history can disambiguate requests that would otherwise be ambiguous. The self-model is the shared context that makes agent coordination user-aware.
Key Takeaways
- Four patterns cover most production multi-agent architectures: supervisor, router, chain, and consensus — each trades off between coordination, latency, reliability, and cost
- The orchestration layer is where production systems succeed or fail — individual agent quality matters less than how agents are coordinated
- Failure recovery (timeouts, circuit breakers, fallbacks, checkpointing) is the difference between a demo and a production system
- Observe per-agent metrics, not just end-to-end metrics — bottleneck agents and cascading failures are invisible in aggregate dashboards
- Combine patterns: production systems use routers that dispatch to chains, supervisors that use consensus for high-stakes subtasks, and chains with routing steps