Orchestration Patterns · LangGraph vs CrewAI · MCP + A2A · Production Observability
Your team shipped a single-agent demo that works in Cursor—then production asks for parallel research, tool isolation, and human approval gates under a shared token budget. One monolithic agent hits context limits, jack-of-all-trades drift, zero concurrency, and a single point of failure. This guide is for AI engineers and tech leads moving to Multi-Agent Systems (MAS): six orchestration patterns, a LangGraph vs CrewAI vs AutoGen decision matrix, the MCP + A2A protocol stack, a six-step production runbook (PostgresSaver, HITL interrupts, circuit breakers), MAST observability data from 1,642 traces, pitfalls to avoid, and a 2026 trend map.
A lone LLM agent can demo well: one system prompt, one tool list, one conversation thread. Under real load it becomes the bottleneck. Google's internal Agent Bake-Off benchmark showed multi-agent teams completing complex workflows in 10 minutes versus 60 minutes for a single agent—a 6x speedup. Separately, the AdaptOrch study found that orchestration topology explained 12–23% more variance in task success than swapping the underlying model—architecture beats model shopping.
Before picking frameworks, map the structural limits that force a MAS split.
Context window saturation: Research, code, logs, and tool outputs accumulate in one thread. Retrieval quality drops; the agent forgets constraints set ten turns ago.
Jack-of-all-trades prompting: One persona cannot simultaneously excel at SQL tuning, legal review, and UI copy. Instruction interference raises hallucination rates.
No true concurrency: Sequential tool calls block each other. Independent subtasks (scrape three sites, run three test suites) waste wall-clock time.
Single point of failure: One bad tool result or one runaway loop kills the entire session. No isolation domain for retries or rollbacks.
Opaque cost attribution: Finance cannot answer which step burned tokens. Without per-agent budgets, one verbose researcher agent drains the monthly cap.
Topology beats model. AdaptOrch showed orchestration structure drives 12–23% more outcome variance than model choice—design the graph before upgrading GPT tiers.
A Multi-Agent System (MAS) is a coordinated set of LLM-powered agents that share state, delegate subtasks, and expose specialized capabilities. Each agent is not just a prompt variant—it is a bounded runtime with its own tools, memory scope, and termination policy.
| Trait | Meaning in LLM agents | Production signal |
|---|---|---|
| Autonomy | Chooses next action without per-step human input | Requires guardrails: max iterations, budget caps |
| Reactivity | Responds to tool results and peer messages | Needs structured message schema, not free text only |
| Proactivity | Initiates subtasks when goals are incomplete | Can cause runaway loops without supervisor checks |
| Social ability | Delegates to and negotiates with other agents | Depends on A2A discovery and clear handoff contracts |
| Topology | Control flow | Best for | Risk |
|---|---|---|---|
| Centralized | One orchestrator routes all messages | Predictable audit trails, strict policy enforcement | Orchestrator context bloat; SPOF at router |
| Decentralized | Peers message directly; no single boss | Resilient swarms, emergent collaboration | Hard to debug; termination not guaranteed |
| Hierarchical | Supervisor delegates to workers; workers report up | Enterprise workflows with approval tiers | Supervisor prompt complexity; latency stacking |
Most 2026 production stacks default to hierarchical with a thin centralized router for auth and budget enforcement—a hybrid of the first and third rows.
Patterns are composable. A customer-support stack might use a supervisor that fans out to parallel researchers, then pipelines synthesis into a writer. Pick the minimum pattern set that matches dependency structure.
Stages run in fixed order: ingest → analyze → draft → review. State passes through a shared graph node. Ideal when each step depends on the prior output (ETL, report generation). LangGraph models this as a linear StateGraph with typed state reducers.
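The stage ordering above can be sketched without any framework. This is a minimal illustration, not LangGraph code: the `ReportState` keys, the stage bodies, and `run_pipeline` are all assumptions chosen for the example, and each stage stands in for an agent call.

```python
from typing import Callable, TypedDict


class ReportState(TypedDict, total=False):
    raw: str
    facts: list[str]
    draft: str
    approved: bool


Stage = Callable[[ReportState], ReportState]


def ingest(state: ReportState) -> ReportState:
    # Normalize the raw input; later stages only see the cleaned text.
    return {"raw": state["raw"].strip()}


def analyze(state: ReportState) -> ReportState:
    # Stand-in for an LLM call: split the input into non-empty "facts".
    return {"facts": [s.strip() for s in state["raw"].split(".") if s.strip()]}


def draft(state: ReportState) -> ReportState:
    return {"draft": "; ".join(state["facts"])}


def review(state: ReportState) -> ReportState:
    # A trivial check standing in for a reviewer agent.
    return {"approved": bool(state["draft"])}


def run_pipeline(state: ReportState, stages: list[Stage]) -> ReportState:
    # Each stage returns a partial update that is merged into shared state,
    # similar to a graph node returning only the keys it changed.
    for stage in stages:
        state = {**state, **stage(state)}
    return state


result = run_pipeline(
    {"raw": " Latency rose. Errors fell. "},
    [ingest, analyze, draft, review],
)
```

Because every stage reads only what earlier stages wrote, a failure can be traced to a single stage, which is where the pattern's high debuggability comes from.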
The orchestrator spawns N independent branches, then aggregates results. LangGraph's Send API dispatches dynamic worker nodes from a map step; a reducer node merges outputs. Use for multi-source research, ensemble voting, or shard-level code review.
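```python
from langgraph.types import Send

def fan_out(state):
    # Typically wired as a conditional edge: one Send per query.
    return [Send("research_worker", {"query": q}) for q in state["queries"]]

def fan_in(state):
    # synthesize() is the application's own aggregation helper.
    return {"report": synthesize(state["worker_results"])}
```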
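The same fan-out / fan-in shape can be shown with plain `asyncio`, which makes the concurrency and the aggregation step explicit. This is a framework-free sketch under assumptions: `research_worker` is a placeholder for a real agent call, and the `failed_queries` key is an addition for illustration.

```python
import asyncio


async def research_worker(query: str) -> dict:
    # Stand-in for an agent call; a real worker would invoke an LLM or tool.
    await asyncio.sleep(0.01)
    return {"query": query, "summary": f"notes on {query}"}


async def fan_out_fan_in(queries: list[str]) -> dict:
    # Fan-out: create one worker coroutine per query so they run concurrently.
    tasks = [research_worker(q) for q in queries]
    # return_exceptions=True returns failures as values, so one failed
    # branch does not discard the results of the others.
    results = await asyncio.gather(*tasks, return_exceptions=True)
    # Fan-in: keep successful results, record failed queries separately.
    ok = [r for r in results if not isinstance(r, Exception)]
    failed = [q for q, r in zip(queries, results) if isinstance(r, Exception)]
    return {"worker_results": ok, "failed_queries": failed}


report = asyncio.run(fan_out_fan_in(["pricing", "latency", "security"]))
```

`asyncio.gather` returns results in the order the coroutines were passed, so the aggregator can pair each result with its query without extra bookkeeping.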
A supervisor classifies intent and routes to specialists (coder, DBA, reviewer). Add a keyword fast path: regex or embedding match on high-confidence intents skips the LLM routing call, saving latency and tokens on FAQ-style queries.
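A keyword fast path might look like the sketch below. The patterns, specialist names, and the `llm_classify` callable are hypothetical; the point is only that a regex hit returns before any routing call to the model is made.

```python
import re
from typing import Callable

# Hypothetical high-confidence intents; patterns and names are illustrative.
FAST_PATHS: list[tuple[re.Pattern, str]] = [
    (re.compile(r"\b(slow query|explain plan|index)\b", re.I), "dba"),
    (re.compile(r"\b(stack trace|unit test|refactor)\b", re.I), "coder"),
]


def route(message: str, llm_classify: Callable[[str], str]) -> tuple[str, str]:
    """Return (specialist, how_it_was_routed)."""
    for pattern, specialist in FAST_PATHS:
        if pattern.search(message):
            return specialist, "fast_path"  # no LLM call spent on routing
    # Fall back to the more expensive LLM router for everything else.
    return llm_classify(message), "llm"


llm_calls = []


def fake_llm(message: str) -> str:
    # Test double that records how often the LLM router is invoked.
    llm_calls.append(message)
    return "reviewer"


fast = route("Why is this slow query ignoring the index?", fake_llm)
slow = route("Can someone look over my pull request?", fake_llm)
```

Logging the second tuple element per request gives a cheap measure of how much routing traffic the fast path actually absorbs.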
Agents hand off conversation control via handoff tools. Microsoft AutoGen excels here: good for open-ended brainstorming where the next speaker is emergent. Harder to audit than fixed graphs.
Agents read/write a shared artifact store (blackboard) rather than direct messaging. A planner posts goals; specialists append sections. Fits collaborative document editing and shared knowledge bases with conflict resolution at the store layer.
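One way to get conflict resolution at the store layer is optimistic versioning: a write succeeds only if the writer saw the latest version. The sketch below is an in-memory illustration; `Blackboard`, `VersionConflict`, and the section names are assumptions, and a production store would persist versions in a database.

```python
from dataclasses import dataclass, field


class VersionConflict(Exception):
    """Raised when a writer's view of a section is stale."""


@dataclass
class Blackboard:
    # section name -> (version, content)
    _sections: dict[str, tuple[int, str]] = field(default_factory=dict)

    def read(self, section: str) -> tuple[int, str]:
        return self._sections.get(section, (0, ""))

    def write(self, section: str, content: str, expected_version: int) -> int:
        current_version, _ = self.read(section)
        # Optimistic concurrency: reject the write if someone else wrote first.
        if current_version != expected_version:
            raise VersionConflict(
                f"{section}: expected v{expected_version}, found v{current_version}"
            )
        self._sections[section] = (current_version + 1, content)
        return current_version + 1


board = Blackboard()
board.write("goals", "Summarize Q3 incidents", expected_version=0)  # planner posts a goal
version, _ = board.read("findings")
board.write("findings", "Three outages, one root cause", expected_version=version)

try:
    # A second specialist still holding the old version loses the race.
    board.write("findings", "Conflicting edit", expected_version=version)
    conflict = False
except VersionConflict:
    conflict = True
```

The losing agent can re-read the section and retry, which keeps merge logic out of the agents' prompts.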
Real systems combine patterns: hierarchical supervisor → parallel fan-out for research → sequential pipeline for final packaging. Explicitly draw which segments are sync vs async before writing code.
| Pattern | Concurrency | Debuggability | Typical framework |
|---|---|---|---|
| Sequential Pipeline | Low | High | LangGraph, CrewAI sequential |
| Fan-out / Fan-in | High | Medium | LangGraph Send |
| Supervisor-Worker | Medium | High | LangGraph, CrewAI hierarchical |
| Swarm | Medium | Low | AutoGen, Swarm SDK |
| Blackboard | Medium | Medium | Custom + shared store |
| Hybrid | Variable | Medium | LangGraph (most common) |
All three ship production users in 2026, but they optimize for different control styles. Match framework to topology, not brand affinity.
| Dimension | LangGraph | CrewAI | AutoGen |
|---|---|---|---|
| Mental model | Stateful directed graph | Role-based crew with tasks | Conversable agents + handoffs |
| State persistence | First-class checkpoints (PostgresSaver) | Memory backends, less graph-native | Chat history per agent |
| Human-in-the-loop | Native interrupt() nodes | Task-level human input hooks | UserProxyAgent pattern |
| Parallelism | Send API, subgraphs | Async task execution | Group chat parallelism |
| Best fit | Complex branching, prod checkpoints | Rapid crew prototypes, role clarity | Exploratory multi-agent chat |
| Watch out | Steeper graph DSL learning curve | Less fine-grained control at scale | Non-deterministic handoff chains |
Need durable checkpoints + HITL approval gates? → LangGraph.
Need a demo crew in an afternoon with readable role YAML? → CrewAI.
Need open-ended agent-to-agent negotiation? → AutoGen (or Swarm).
Need both graph control and chat handoffs? → LangGraph orchestrator wrapping AutoGen workers.
Tool integration and agent collaboration are different problems. 2026 stacks treat them as a two-layer protocol cake: vertical tool access below, horizontal agent delegation above.
| Layer | Protocol | Connects | Analogy |
|---|---|---|---|
| Vertical | MCP (Model Context Protocol) | Agent ↔ tools, data, prompts | USB-C for tool discovery |
| Horizontal | A2A (Agent-to-Agent) | Agent ↔ agent delegation | HTTP for service mesh |
Each agent publishes an Agent Card—a JSON document describing capabilities, input schemas, and endpoint URLs. Peers call discover_and_delegate to route subtasks without hard-coded agent lists.
```json
{
  "name": "sql-analyst-agent",
  "description": "Read-only Postgres analysis and explain plans",
  "url": "https://agents.internal/a2a/sql-analyst",
  "capabilities": ["query", "explain", "schema-introspect"],
  "input_schema": {
    "type": "object",
    "properties": { "question": { "type": "string" } }
  }
}
```
```python
# AgentRegistry, NoAgentError, and a2a_client are application-defined.
async def discover_and_delegate(task: str, registry: AgentRegistry):
    card = await registry.find_best_match(task)
    if not card:
        raise NoAgentError(task)
    payload = {"task": task, "caller": "supervisor-01"}
    return await a2a_client.send(card.url, payload)
```
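To make the lookup step concrete, here is a minimal in-memory registry that matches a task against card capabilities. It is a sketch under assumptions: `InMemoryRegistry`, the trimmed `AgentCard` fields, the second card, and the word-overlap scoring are all hypothetical, and a real registry would more likely use embeddings or schema matching.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AgentCard:
    name: str
    url: str
    capabilities: tuple[str, ...]


class InMemoryRegistry:
    """Hypothetical stand-in for an agent registry."""

    def __init__(self, cards: list[AgentCard]):
        self._cards = cards

    async def find_best_match(self, task: str) -> Optional[AgentCard]:
        words = set(task.lower().split())
        # Score each card by how many of its capabilities appear in the task.
        scored = [(len(words & set(c.capabilities)), c) for c in self._cards]
        best_score, best_card = max(
            scored, key=lambda pair: pair[0], default=(0, None)
        )
        # No overlap at all means no agent claims this task.
        return best_card if best_score > 0 else None


registry = InMemoryRegistry([
    AgentCard("sql-analyst-agent", "https://agents.internal/a2a/sql-analyst",
              ("query", "explain", "schema-introspect")),
    AgentCard("copy-writer-agent", "https://agents.internal/a2a/copy-writer",
              ("draft", "rewrite")),
])

match = asyncio.run(registry.find_best_match("explain this query plan"))
miss = asyncio.run(registry.find_best_match("translate the press release"))
```

Returning `None` rather than a weak match lets the caller raise an explicit error instead of delegating to the wrong agent.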
MCP handles tools/list inside each agent; A2A handles which agent owns the task. See our MCP protocol guide for the vertical layer in depth.
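Demos use in-memory state. Production needs crash recovery, human approval on high-risk actions, and cost ceilings. Four primitives cover most teams before custom infrastructure is needed: durable checkpoints (PostgresSaver), human-in-the-loop interrupts, a token budget manager, and a circuit breaker.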
```python
MAX_ITERATIONS = 25

class ProductionGuardrails:
    def __init__(self, budget: TokenBudgetManager, breaker: CircuitBreaker):
        self.budget = budget
        self.breaker = breaker
        self.iterations = 0

    def before_step(self, agent_id: str, est_tokens: int):
        # Call before every agent step: enforce the loop cap first,
        # then the token budget, then the circuit breaker.
        self.iterations += 1
        if self.iterations > MAX_ITERATIONS:
            raise RunawayLoopError()
        self.budget.charge(agent_id, est_tokens)
        self.breaker.check()
```
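The guardrails class relies on a budget manager with `charge()` and a breaker with `check()`. Neither is defined in this guide, so the following is one possible shape for them, not a reference implementation: the error classes, the in-memory counters, the `alert_ratio` default, and the cooldown logic are all assumptions.

```python
import time


class BudgetExceededError(RuntimeError):
    pass


class CircuitOpenError(RuntimeError):
    pass


class TokenBudgetManager:
    def __init__(self, caps: dict[str, int], alert_ratio: float = 0.8):
        self.caps = caps
        self.alert_ratio = alert_ratio
        self.spent: dict[str, int] = {}
        self.alerts: list[str] = []

    def charge(self, agent_id: str, tokens: int) -> None:
        cap = self.caps[agent_id]
        total = self.spent.get(agent_id, 0) + tokens
        if total > cap:
            # Refuse the step before any spend is recorded.
            raise BudgetExceededError(f"{agent_id} would exceed its cap of {cap}")
        self.spent[agent_id] = total
        # Record one alert per agent once it crosses the alert threshold.
        if total >= cap * self.alert_ratio and agent_id not in self.alerts:
            self.alerts.append(agent_id)


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, cooldown_s: float = 30.0):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None  # monotonic timestamp while the circuit is open

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()

    def check(self) -> None:
        if self.opened_at is None:
            return
        if time.monotonic() - self.opened_at < self.cooldown_s:
            raise CircuitOpenError("too many recent failures; refusing new steps")
        # Cooldown elapsed: close the circuit and allow a trial step.
        self.opened_at = None
        self.failures = 0


budget = TokenBudgetManager({"researcher": 1000})
budget.charge("researcher", 500)
budget.charge("researcher", 350)  # 850 of 1000 crosses the 80% alert line
try:
    budget.charge("researcher", 200)
    over_budget = False
except BudgetExceededError:
    over_budget = True

breaker = CircuitBreaker(max_failures=2, cooldown_s=60.0)
breaker.record_failure()
breaker.record_failure()
try:
    breaker.check()
    circuit_open = False
except CircuitOpenError:
    circuit_open = True
```

In a multi-process deployment the counters would need to live in shared storage; the in-memory dictionaries here only show the control flow.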
1. Draw the graph on paper first: Mark sync edges, parallel branches, and HITL interrupt points before writing LangGraph nodes.
2. Wire PostgresSaver: Point checkpoints at a managed Postgres instance; verify resume after process kill.
3. Register MCP tools per agent: Scope each agent to least-privilege tool subsets; never share one mega tool list.
4. Add interrupt nodes: Gate deploy, delete, payment, and PII-export tools behind human approval.
5. Enable TokenBudgetManager + CircuitBreaker: Set per-agent daily caps; alert at 80% burn rate.
6. Ship observability before features: OpenTelemetry spans per agent step; dashboard CORE_METRICS before adding agent #7.
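The approval gate from the runbook can be illustrated independently of any framework. In this sketch a high-risk tool call raises instead of executing, and runs only when re-invoked with explicit approval. `PendingApproval`, `call_tool`, and the tool names are hypothetical; in LangGraph the pause would be an interrupt rather than an exception.

```python
from typing import Callable

# Tool names that require sign-off; the set mirrors the runbook's examples.
HIGH_RISK_TOOLS = {"deploy", "delete", "payment", "pii_export"}


class PendingApproval(Exception):
    def __init__(self, tool: str, tool_args: dict):
        super().__init__(f"{tool} needs human approval")
        self.tool = tool
        self.tool_args = tool_args


def call_tool(
    tool: str,
    tool_args: dict,
    tools: dict[str, Callable[..., str]],
    approved: bool = False,
) -> str:
    # Pause before any high-risk tool unless a human has already approved it.
    if tool in HIGH_RISK_TOOLS and not approved:
        raise PendingApproval(tool, tool_args)
    return tools[tool](**tool_args)


tools = {
    "search": lambda q: f"results for {q}",
    "deploy": lambda service: f"deployed {service}",
}

# Low-risk tools run immediately.
search_result = call_tool("search", {"q": "runbook"}, tools)

# High-risk tools stop and surface what needs review.
try:
    call_tool("deploy", {"service": "billing"}, tools)
    pending = None
except PendingApproval as exc:
    pending = exc

# After a reviewer signs off, the same call is replayed with approval.
deploy_result = call_tool(pending.tool, pending.tool_args, tools, approved=True)
```

Keeping the risk list in one place, outside agent prompts, means a model cannot talk its way past the gate.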
Tip: Run a chaos drill: kill the worker mid-graph, restart, and confirm PostgresSaver resumes from the last checkpoint without duplicate side effects.
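Avoiding duplicate side effects on resume usually comes down to idempotency keys. The sketch below derives a key from thread, step, and payload, and skips the effect when that key was already recorded. `SideEffectLedger` is a hypothetical helper, and its in-memory dictionary stands in for a durable table.

```python
import hashlib
import json


class SideEffectLedger:
    """Records completed side effects; a real one would be durable, not in memory."""

    def __init__(self):
        self._done: dict[str, str] = {}

    @staticmethod
    def key(thread_id: str, step: str, payload: dict) -> str:
        # Same thread + step + payload always produces the same key.
        raw = json.dumps([thread_id, step, payload], sort_keys=True)
        return hashlib.sha256(raw.encode()).hexdigest()

    def run_once(self, thread_id: str, step: str, payload: dict, effect):
        k = self.key(thread_id, step, payload)
        if k in self._done:
            return self._done[k]  # replayed after a restart: skip the effect
        result = effect(payload)
        self._done[k] = result
        return result


sent = []


def send_email(payload: dict) -> str:
    sent.append(payload["to"])
    return f"sent to {payload['to']}"


ledger = SideEffectLedger()
first = ledger.run_once("thread-1", "notify", {"to": "ops@example.com"}, send_email)
replay = ledger.run_once("thread-1", "notify", {"to": "ops@example.com"}, send_email)
```

A crash between running the effect and recording it can still cause one duplicate, so where the downstream API accepts an idempotency key, passing the same key there closes that gap.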
You cannot fix what you cannot attribute. The MAST study analyzed 1,642 multi-agent execution traces and found failure modes cluster predictably—most are design issues, not model IQ gaps.
Teams invest heavily in models but under-invest in telemetry: MAST respondents spent 57% of engineering time on production hardening versus only 8% on observability—an imbalance that repeats the same failures in production.
Wrap every agent invocation in OpenTelemetry spans: agent_id, parent_span, tool_name, token_in/out, latency_ms. Export to your existing APM. Define CORE_METRICS as the minimum dashboard:
| Metric | Why it matters |
|---|---|
| task_success_rate | End-to-end goal completion, not per-step accuracy |
| tokens_per_success | Cost efficiency; spikes reveal runaway loops |
| p95_agent_latency | Pinpoints slow specialist or tool |
| handoff_error_rate | A2A schema mismatches and dropped messages |
| hitl_queue_depth | Approval bottlenecks blocking graph progress |
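The table's metrics can be derived from per-step span records. The sketch below assumes a simplified `StepSpan` record and a nearest-rank p95; the field names and the sample numbers are invented for illustration, and a real setup would read these from the tracing backend.

```python
import math
from dataclasses import dataclass


@dataclass
class StepSpan:
    task_id: str
    agent_id: str
    tokens: int
    latency_ms: float
    handoff_failed: bool = False


def p95(values: list[float]) -> float:
    # Nearest-rank percentile: smallest value with at least 95% of samples at or below it.
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def core_metrics(spans: list[StepSpan], succeeded: set[str], hitl_pending: int) -> dict:
    tasks = {s.task_id for s in spans}
    successes = len(succeeded & tasks)
    total_tokens = sum(s.tokens for s in spans)
    return {
        "task_success_rate": successes / len(tasks),
        # All tokens divided by successes, so failed runs still count as cost.
        "tokens_per_success": total_tokens / max(successes, 1),
        "p95_agent_latency": p95([s.latency_ms for s in spans]),
        "handoff_error_rate": sum(s.handoff_failed for s in spans) / len(spans),
        "hitl_queue_depth": hitl_pending,
    }


spans = [
    StepSpan("t1", "supervisor", 200, 120.0),
    StepSpan("t1", "researcher", 800, 900.0),
    StepSpan("t2", "supervisor", 150, 110.0),
    StepSpan("t2", "researcher", 850, 2400.0, handoff_failed=True),
]
metrics = core_metrics(spans, succeeded={"t1"}, hitl_pending=2)
```

Dividing total tokens by successful tasks, rather than by all tasks, is a deliberate choice here: it makes failed runs show up as rising cost per success.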
Add LLM-as-Judge on a sample of traces: a separate evaluator agent scores goal alignment and factual consistency. Use it offline for regression tests, not inline on every request (cost).
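Sampling for the judge works best when it is deterministic, so that a given trace is always in or out of the evaluated set across reruns. A minimal sketch, assuming traces are dictionaries with an `id` key and that `judge` is any callable returning a score (both are assumptions for the example):

```python
import hashlib


def in_judge_sample(trace_id: str, sample_rate: float) -> bool:
    # Hash the trace ID into [0, 1) so the decision is stable across runs.
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


def judge_sample(traces: list[dict], judge, sample_rate: float = 0.1) -> dict:
    # judge stands in for the evaluator agent; it is only called on sampled traces.
    return {
        t["id"]: judge(t)
        for t in traces
        if in_judge_sample(t["id"], sample_rate)
    }


traces = [{"id": f"trace-{i}", "output": "..."} for i in range(200)]
scores = judge_sample(traces, judge=lambda trace: 1.0, sample_rate=0.1)
```

Because the bucket depends only on the trace ID, raising the sample rate later keeps every previously judged trace in the sample, which helps when comparing regression runs.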
Context pollution: Workers return full raw HTML dumps upstream. Truncate, summarize, or store in blackboard; pass handles not payloads.
Runaway loops: Agents re-delegate indefinitely. Enforce MAX_ITERATIONS, per-edge visit counts, and supervisor stop tokens.
Over-engineering: Fifteen agents for a three-step workflow. Stay in the 3–8 agent sweet spot unless domains are truly isolated.
Demo-to-prod gap: In-memory state and no budgets. Wrap graphs with ProductionGuardrails before exposing to customers.
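Parallel branch sync: Fan-in runs before all branches finish. In LangGraph, defer is an argument to add_node rather than add_edge: register the fan-in node with defer=True so it waits until all pending Send workers have finished.

```python
graph.add_node("fan_in", fan_in, defer=True)
```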
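For the runaway-loop pitfall in the list above, a per-edge visit counter is one simple guard. The sketch below is framework-agnostic; `EdgeVisitGuard` and its limit are hypothetical, and the loop simulates a supervisor and researcher that keep re-delegating to each other.

```python
from collections import Counter


class EdgeLimitExceeded(RuntimeError):
    pass


class EdgeVisitGuard:
    def __init__(self, max_visits_per_edge: int = 3):
        self.max_visits = max_visits_per_edge
        self.visits: Counter = Counter()

    def traverse(self, source: str, target: str) -> None:
        # Count every hop; fail once a single edge is taken too often.
        edge = (source, target)
        self.visits[edge] += 1
        if self.visits[edge] > self.max_visits:
            raise EdgeLimitExceeded(
                f"{source} -> {target} taken {self.visits[edge]} times"
            )


guard = EdgeVisitGuard(max_visits_per_edge=2)
round_trips = 0
stopped_on = None
try:
    while True:
        guard.traverse("supervisor", "researcher")
        guard.traverse("researcher", "supervisor")
        round_trips += 1
except EdgeLimitExceeded as exc:
    stopped_on = str(exc)
```

Unlike a global iteration cap, a per-edge limit names the specific handoff that is looping, which shortens debugging.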
Warning: The most expensive mistake is adding agents to fix prompt issues. Tune specialist prompts and handoff schemas before spawning another node.
Are subtasks independent? Yes → Parallel fan-out. No → continue.
Is order strict? Yes → Sequential pipeline. No → continue.
Need emergent dialogue? Yes → Swarm / AutoGen. No → Supervisor-worker.
Need crash-safe resume? Yes → LangGraph + PostgresSaver. No → CrewAI rapid path.
Cross-team agent discovery? Yes → Publish Agent Cards + A2A. Tools only → MCP per agent.
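The checklist above reads naturally as two small functions, which can be handy in design reviews or architecture tests. The function and parameter names are invented for this sketch; the branch order follows the questions as listed.

```python
def choose_pattern(independent: bool, strict_order: bool, emergent_dialogue: bool) -> str:
    # Questions are asked in the same order as the checklist.
    if independent:
        return "parallel fan-out"
    if strict_order:
        return "sequential pipeline"
    if emergent_dialogue:
        return "swarm"
    return "supervisor-worker"


def choose_stack(crash_safe_resume: bool, cross_team_discovery: bool) -> dict:
    return {
        "framework": "LangGraph + PostgresSaver" if crash_safe_resume else "CrewAI",
        # MCP is used per agent either way; A2A is added for discovery across teams.
        "protocols": ["MCP", "A2A"] if cross_team_discovery else ["MCP"],
    }


research_job = choose_pattern(independent=True, strict_order=False, emergent_dialogue=False)
support_desk = choose_pattern(independent=False, strict_order=False, emergent_dialogue=False)
stack = choose_stack(crash_safe_resume=True, cross_team_discovery=False)
```

Real workloads often answer "yes" to more than one question for different segments of the graph, which is where the hybrid pattern comes in.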
Laptop-hosted agents sleep when the lid closes, lack reliable process supervision for long LangGraph checkpoints, and struggle with macOS-native toolchains (Xcode, Keychain, Apple-notarized CI). Pure Linux VPS handles stateless API workers but not iOS build farms. For teams running multi-agent graphs 24/7 alongside iOS CI/CD pipelines and MCP tool servers, VpsMesh Mac Mini M4 cloud rental bundles uptime, remote KVM, and predictable monthly OpEx into one host. Compare plans on the Mac Mini M4 rental pricing page, browse runbooks in the help center, or order online to validate a one-month pilot before committing your orchestration stack.
Most production systems land between 3 and 8 specialized agents. Fewer than three rarely justifies orchestration overhead; more than eight usually signals over-engineering unless you have clear domain boundaries and per-agent observability. Start with a supervisor plus two workers, measure tokens_per_success, then split only when one agent's context consistently overflows.
MCP is the vertical layer: each agent connects to tools and data via tools/list and JSON Schema descriptors. A2A is the horizontal layer: agents discover peers through Agent Cards and delegate subtasks. Use MCP inside every agent; use A2A between agents. See our MCP guide for the tool layer and this article's Section 05 for delegation patterns.
Not always. Stateless LangGraph workers and remote MCP over HTTP+SSE can run on Linux cloud VMs. When agents depend on macOS toolchains, Xcode builds, or Keychain secrets, or when you need uninterrupted checkpoint sessions, a rented Mac Mini M4 is lower friction than fighting laptop sleep cycles. Start with a one-month trial to measure checkpoint latency and token burn.