Architecture
zenflow is a declarative multi-agent workflow engine for Go. A workflow is a YAML file, the engine is a DAG executor, and an LLM-driven coordinator routes events between running steps through hub-and-spoke mailboxes. This page walks through the layers from top to bottom: the three-layer stack, the executor, the coordinator, the messaging substrate, and the lifecycle that ties them together.
The three-layer stack
zenflow does not implement the LLM tool loop. That is goai's job. zenflow contributes:
- Workflow shape - what to run, in what order, with what dependencies.
- Coordination - which agent gets which message at which moment.
- Delivery guarantees - every message is either delivered to a mailbox or dropped with a typed reason.
If goai supports a provider, zenflow runs on it. Verified providers include Google gemini-3-pro-preview, AWS Bedrock (anthropic.claude-sonnet-4-6, minimax.minimax-m2.5), and Azure (DeepSeek-V3.2, claude-sonnet-4-6, gpt-5, gpt-5.3-codex).
Workflow as a DAG
A workflow is Workflow{ Name, Agents, Includes, Steps, Options }. Each Step has an ID, an agent reference, instructions, an optional dependsOn edge list, an optional CEL condition, an optional loop block, and per-step model/tool overrides.
The executor builds a DAG in two passes:
- Validation. Every dependsOn ID exists. Every agent reference resolves. Loops, conditions, and includes parse against the schema. CEL expressions in condition blocks compile against the available context.
- Topological scheduling. Steps with all dependencies satisfied are ready. The scheduler picks ready steps up to MaxConcurrency (default 5) and runs each in its own goroutine. As a step finishes, its dependents become ready and are scheduled in turn.
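The ready-set algorithm can be sketched in a few lines of plain Go. This is an illustration of the scheduling idea, not zenflow's executor - step, runDAG, and the bookkeeping maps are invented for the sketch:

```go
package main

import (
	"fmt"
	"sync"
)

// step is a node in the DAG: an ID plus its dependsOn edges.
type step struct {
	id        string
	dependsOn []string
}

// runDAG executes steps in topological order, running up to maxConcurrency
// ready steps at a time. It returns the completion order.
func runDAG(steps []step, maxConcurrency int) []string {
	remaining := map[string]int{}       // step ID -> unmet dependency count
	dependents := map[string][]string{} // step ID -> steps waiting on it
	for _, s := range steps {
		remaining[s.id] = len(s.dependsOn)
		for _, d := range s.dependsOn {
			dependents[d] = append(dependents[d], s.id)
		}
	}

	var mu sync.Mutex
	var order []string
	sem := make(chan struct{}, maxConcurrency) // concurrency cap
	var wg sync.WaitGroup

	var schedule func(id string)
	schedule = func(id string) {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sem <- struct{}{} // acquire a slot
			// ... run the step's agent here ...
			mu.Lock()
			order = append(order, id)
			var ready []string
			for _, dep := range dependents[id] {
				remaining[dep]--
				if remaining[dep] == 0 { // last dependency just finished
					ready = append(ready, dep)
				}
			}
			mu.Unlock()
			<-sem // release the slot before scheduling dependents
			for _, r := range ready {
				schedule(r)
			}
		}()
	}
	for _, s := range steps {
		if remaining[s.id] == 0 { // no dependencies: ready at t=0
			schedule(s.id)
		}
	}
	wg.Wait()
	return order
}

func main() {
	order := runDAG([]step{
		{id: "design"},
		{id: "api-server", dependsOn: []string{"design"}},
		{id: "database", dependsOn: []string{"design"}},
		{id: "integrate", dependsOn: []string{"api-server", "database"}},
	}, 5)
	fmt.Println(order[0], order[len(order)-1]) // design first, integrate last
}
```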
The graph is the workflow author's contract with the runtime. The executor never invents a dependency edge, never re-orders steps, and never silently drops a step. A step that cannot run because of a condition evaluating false is marked skipped with the reason recorded; downstream dependsOn is satisfied as if the step had succeeded.
Parallel fan-out
When two or more steps share the same set of completed dependencies, they are started in parallel:
steps:
  - id: design
    agent: architect
    instructions: "..."
  - id: api-server
    dependsOn: [design]
  - id: database
    dependsOn: [design]
  - id: ui-components
    dependsOn: [design]
  - id: integrate
    dependsOn: [api-server, database, ui-components]

api-server, database, and ui-components start at the same moment. integrate waits for all three. There is no fan-out keyword - parallel execution is implied by the graph shape.
Loops
Loops come in three flavours. All three are inner DAGs the executor expands at run time.
- forEach - iterate an array drawn from a previous step's structured output. Each iteration is its own sub-DAG. Iterations can run in parallel.
- repeat-until - run a sub-DAG until an untilAgent returns done: true or an until CEL expression evaluates true (max bounded by maxIterations). outputMode: cumulative vs last controls whether downstream steps see every iteration's output or only the final one.
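The repeat-until shape reduces to a bounded loop. This is a generic sketch, not zenflow's loop expander - repeatUntil, body, and done are invented stand-ins for the sub-DAG, the untilAgent/until check, and the maxIterations bound:

```go
package main

import "fmt"

type outputMode int

const (
	cumulative outputMode = iota // downstream sees every iteration's output
	last                         // downstream sees only the final output
)

// repeatUntil runs body until done reports true or maxIterations is hit.
// The bound always wins: a done check that never fires cannot loop forever.
func repeatUntil(body func(i int) string, done func(out string) bool,
	maxIterations int, mode outputMode) []string {
	var outputs []string
	for i := 0; i < maxIterations; i++ {
		out := body(i) // stand-in for running the iteration's sub-DAG
		outputs = append(outputs, out)
		if done(out) { // stand-in for untilAgent / until CEL
			break
		}
	}
	if mode == last && len(outputs) > 0 {
		return outputs[len(outputs)-1:]
	}
	return outputs
}

func main() {
	outs := repeatUntil(
		func(i int) string { return fmt.Sprintf("draft-%d", i) },
		func(out string) bool { return out == "draft-2" },
		10, cumulative)
	fmt.Println(outs) // cumulative mode keeps all three drafts
}
```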
Each iteration's steps get namespaced IDs (loop-stages.0.worker, loop-stages[1].worker, deploy[0].deploy_step) so the coordinator can address them unambiguously. The same applies to includes: a sub-workflow's step IDs are prefixed by the include's parent step ID.
Conditions
A step's condition is a CEL expression evaluated against the context produced by upstream steps. When false, the step is skipped (not failed). The CEL surface is intentionally small: previous step outputs, shared memory entries, and a few built-in helpers.
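The skip semantics look roughly like this in miniature. A plain Go predicate stands in for the compiled CEL program, and evalStep / dependencySatisfied are invented names for the sketch:

```go
package main

import "fmt"

type status string

const (
	succeeded status = "succeeded"
	skipped   status = "skipped"
)

// workflowContext holds upstream step outputs keyed by step ID.
type workflowContext map[string]any

// evalStep marks a step skipped (not failed) when its condition is false.
// A nil condition means the step always runs.
func evalStep(cond func(workflowContext) bool, ctx workflowContext) status {
	if cond != nil && !cond(ctx) {
		return skipped
	}
	// ... run the step's agent here ...
	return succeeded
}

// dependencySatisfied is how downstream dependsOn edges read the result:
// a skipped step satisfies its dependents as if it had succeeded.
func dependencySatisfied(s status) bool {
	return s == succeeded || s == skipped
}

func main() {
	ctx := workflowContext{"design": map[string]any{"needsDB": false}}
	cond := func(c workflowContext) bool {
		return c["design"].(map[string]any)["needsDB"].(bool)
	}
	st := evalStep(cond, ctx)
	fmt.Println(st, dependencySatisfied(st)) // skipped true
}
```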
The coordinator
The coordinator is itself an AgentRunner - the same primitive that drives every workflow step. What makes it special is the toolset and the wake loop.
Default tools
NewDefaultCoordRunner(llm) returns a runner pre-wired with three tools:
- forward_to_agent(target_step_id, text, kind?) - route a message into a running step's mailbox. kind="context_update" injects context, kind="cancel" asks a step to stop, kind="info" (or omitted) is a general note.
- narrate(text) - emit a user-facing narration event. Does not route to any step.
- finalize(summary?) - signal that coordination is complete. The Run loop exits after this returns; the coordinator will not process more events.
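The dispatch behaviour of these three tools can be sketched as a plain loop. toolCall and dispatch are invented for illustration and are not zenflow's coordinator:

```go
package main

import "fmt"

// toolCall is one tool invocation returned by the LLM.
type toolCall struct {
	name string
	args map[string]string
}

// dispatch executes coordinator tool calls in order and reports whether
// finalize was seen; after finalize no further calls are processed.
func dispatch(calls []toolCall, forward func(target, text, kind string),
	narrate func(text string)) (finalized bool) {
	for _, c := range calls {
		switch c.name {
		case "forward_to_agent":
			kind := c.args["kind"]
			if kind == "" {
				kind = "info" // an omitted kind is a general note
			}
			forward(c.args["target_step_id"], c.args["text"], kind)
		case "narrate":
			narrate(c.args["text"]) // user-facing only, never routed to a step
		case "finalize":
			return true // coordination complete; stop here
		}
	}
	return false
}

func main() {
	var log []string
	done := dispatch(
		[]toolCall{
			{name: "forward_to_agent", args: map[string]string{
				"target_step_id": "api-server", "text": "schema ready"}},
			{name: "narrate", args: map[string]string{"text": "design finished"}},
			{name: "finalize"},
		},
		func(target, text, kind string) { log = append(log, "->"+target+":"+kind) },
		func(text string) { log = append(log, "narrate") },
	)
	fmt.Println(done, log)
}
```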
SynthesizeOnly() drops narrate for the --summary-only CLI mode. WithCoordTools(...) appends caller-supplied tools (an SOP lookup, a human approval gate, a custom logger). WithCoordSystemPromptSuffix(extra) appends extra guidance to the tested baseline DefaultCoordSystemPrompt without forking it.
Wake cycles
Step lifecycle events (start, end, error) and inter-step send_message traffic land in the coordinator's mailbox. After each push, the executor pings the coordinator's Wake channel; the coordinator's Run loop wakes, reads everything in its mailbox, calls the LLM with the accumulated events as context, executes any tool calls (forward, narrate, or finalize), then exits the inner LLM loop and waits for the next wake.
The default wake-cycle cap is 100 (vs 10 for step runners). Coord runners are long-lived across the whole workflow and absorb every step lifecycle event plus every bridged send_message, so a higher cap is necessary. Override via WithCoordMaxWakeCycles(n).
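The wake-cycle mechanic can be sketched as a channel-driven drain loop. runWakeLoop is an invented, single-threaded simplification of the pattern described above (real delivery is concurrent and lock-protected):

```go
package main

import "fmt"

// runWakeLoop drains the mailbox once per wake ping, up to maxWakeCycles.
// leftover counts messages that would be dropped with reason max-wake-cycles.
func runWakeLoop(wake <-chan struct{}, mailbox *[]string,
	maxWakeCycles int, handle func(batch []string)) (cycles, leftover int) {
	for range wake { // each ping is one wake cycle
		if cycles == maxWakeCycles {
			leftover = len(*mailbox) // the remainder is drained as drops
			return
		}
		batch := *mailbox // read everything accumulated since the last wake
		*mailbox = nil
		handle(batch) // stand-in for calling the LLM with the batch as context
		cycles++
	}
	return
}

func main() {
	wake := make(chan struct{}, 2)
	mailbox := []string{
		"step design started",
		"step design ended",
		"send_message from api-server",
	}
	wake <- struct{}{} // the executor pings after each push; here both pings
	wake <- struct{}{} // are queued before the loop starts
	close(wake)
	var seen int
	cycles, leftover := runWakeLoop(wake, &mailbox, 100,
		func(batch []string) { seen += len(batch) })
	fmt.Println(cycles, leftover, seen) // the first cycle drains all three events
}
```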
When a forward_to_agent call drops (unknown step ID, mailbox full), the tool result tells the coordinator what went wrong and lists currently available step IDs. The system also preserves the dropped content as fallback narration so the user never loses LLM-generated text.
MessageRouter, Mailbox, and the delivery engine
zenflow's messaging substrate is three layers:
- MessageRouter (public alias of router.Router in internal/router/router.go, re-exported via router_facade.go) - hub-and-spoke addressing. Maps step IDs (and namespaced loop/include IDs) to mailbox slots. Handles RegisterStep, RegisterInbox, Send, Close. Inner-DAG namespacing (loop-stages.0.worker) is resolved by MessageRouter delegation - the root MessageRouter holds a delegation entry pointing at the active iteration's nested router.
- Mailbox (internal/router/mailbox.go) - per-agent inbox queue. The default InMemoryMailboxStore is bounded by WithMaxMailboxSize(n) (zero = unbounded). Custom backends (sqlite, redis) plug in via WithMailboxStore(factory).
- Delivery engine (internal; internal/router/delivery_engine.go) - race-safe coupling of Send and Wake. When a message arrives, the engine appends it to the recipient's mailbox AND signals the recipient's wake channel in a single atomic step. There is no interleaving where a message lands but the recipient never wakes, or the recipient wakes but the mailbox is empty.
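The atomic Send + Wake coupling reduces to a small pattern: append and signal under one lock, with a capacity-1 wake channel so pings coalesce. inbox and its names are invented for the sketch; this illustrates the invariant, not the real delivery engine:

```go
package main

import (
	"fmt"
	"sync"
)

type dropReason string

const (
	dropNone        dropReason = ""
	dropMailboxFull dropReason = "mailbox-full"
	dropClosed      dropReason = "target-terminal"
)

// inbox couples a bounded mailbox with a wake channel. send either appends
// AND signals wake under one lock, or drops with a typed reason - there is
// no state where a message lands without a wake pending.
type inbox struct {
	mu     sync.Mutex
	msgs   []string
	max    int // 0 = unbounded
	closed bool
	wake   chan struct{} // capacity 1: a pending wake coalesces pings
}

func newInbox(max int) *inbox {
	return &inbox{max: max, wake: make(chan struct{}, 1)}
}

func (b *inbox) send(msg string) dropReason {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.closed {
		return dropClosed
	}
	if b.max > 0 && len(b.msgs) == b.max {
		return dropMailboxFull // newest message rejected (oldest-wins)
	}
	b.msgs = append(b.msgs, msg)
	select { // non-blocking: an already-pending wake covers this message too
	case b.wake <- struct{}{}:
	default:
	}
	return dropNone
}

func main() {
	b := newInbox(1)
	fmt.Println(b.send("first") == dropNone, b.send("second")) // second hits the cap
}
```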
Drop reasons
Every drop is typed. The complete list lives in internal/router/router.go as DropReason constants; the user-facing summary is:
| DropReason | When | Where to look |
|---|---|---|
unknown-step | Target step ID was never registered and has no pending senders. | Coordinator mistakenly addressed a non-existent step. |
mailbox-full | Bounded in-memory mailbox at the WithMaxMailboxSize cap; the newest message is rejected (oldest-wins fairness). | Lower send rate, or raise WithMaxMailboxSize. |
mailbox-closed-by-finalize | Mailbox raced with a concurrent close; the closed flag won. | Sender lost the race against finalize; treat as terminal. |
target-terminal | Send to a step whose mailbox was closed because the target reached a terminal lifecycle state. | Coordinator addressed a step that already returned. |
workflow-cancelled | Workflow context was cancelled or abort fired before the message reached the target's LLM context. | Inspect the cancel cause; the run is shutting down. |
max-wake-cycles | Wake loop hit the maxWakeCycles cap with messages still pending; remainder drained as drops. | Bump WithMaxWakeCycles or WithCoordMaxWakeCycles. |
hold-timeout | Executor's hold-timeout fired before the 3-invariant termination rule could converge; buffered messages are flushed and the step is force-terminated. | Sender is stuck; raise WithHoldTimeout or fix the sender. |
no-transcript | Resume target's mailbox is closed AND the TranscriptStore has no saved transcript for the step. | Step ran before transcripts were persisted, or the transcript was deleted. |
transcript-too-large | Saved transcript exceeds WithMaxTranscriptMessages / WithMaxTranscriptBytes; resume would exceed the size bound. | Inspect the persisted transcript. |
resume-shutdown | Workflow context was cancelled mid-resume; the in-flight resume goroutine exited early. | Resume aborted by shutdown; retry on the next run. |
resolver-error | Configured ModelResolver returned an error when resolving a saved-transcript model identifier. | Resolver infrastructure failure; check provider/model wiring. |
Every drop also fires EventMessageDropped on the progress sink, and (when set) WithDropCallback for metrics-only consumers.
Hub-and-spoke, no peer-to-peer
A step agent has exactly one channel out: send_message(text). This sends to the coordinator. The coordinator decides what to do with the message - usually forward_to_agent to another step, sometimes narrate for the user. Step agents never address each other directly. This is enforced by tool surface: there is no peer-to-peer send tool, and forward_to_agent is only registered on the coordinator runner.
The reason for the hub topology is auditability. Every inter-step message passes through the coordinator's LLM, which means you get a single place to log, reason about, intercept, or replay the conversation. The cost is an extra LLM call per forward; the benefit is that no agent can quietly poison another agent's context.
Lifecycle
A typical run looks like this:
Orchestrator.Close() is idempotent and required for long-lived embedders. It cancels in-flight RunAgentAsync handles and rejects new RunAgent invocations with ErrOrchestratorClosed. Per-Run mailbox cleanup happens during Executor.Run unwind, not at Orchestrator.Close(). Examples in zenflow/examples/ always defer orch.Close().
Event flow, end-to-end
The MessageRouter is the only piece that touches both step mailboxes and the coordinator mailbox. Steps never see each other.
Why this design
A handful of design choices distinguish zenflow from peer multi-agent frameworks. They are deliberate tradeoffs.
Hub-and-spoke topology. Peer-to-peer messaging is the natural shape for free-form group chats; zenflow chose hub-and-spoke. The hub pays one extra LLM call per forward and in return you get a single audit point, a single place to enforce policy, and a single source of conversational truth. The tradeoff is less directness in the message graph.
Mailbox + Wake delivery. Inter-agent delivery uses an atomic Send + Wake pair: the message is appended to the recipient's mailbox AND the wake fires, or neither happens. Drops are typed via DropReason. The pair handles the corner cases that show up in concurrent message routing (racing senders, recipients mid-shutdown, panicking handlers) because the drop reason is part of the contract instead of an inferred state.
Declarative YAML. A YAML workflow is reviewable in a PR, diffable across versions, and runnable from any language that can shell out to a binary. Programmatic control flow gives you flexibility; the tradeoff is that every topology is bespoke and every change is a code change. Pick the shape that matches your team's review and deployment workflow.
Provider abstraction via goai instead of bespoke clients. zenflow does not know about Bedrock cross-region fallback, Azure deployment routing, Gemini multimodal parts, or OpenAI Responses-API streaming. goai does. zenflow's job is workflow orchestration; the LLM details belong one layer down.
Single Orchestrator lifecycle. Long-running embedders hold one *Orchestrator for the lifetime of the process and call Close() in shutdown. The orchestrator owns the handle registry, the factory cache, and any persistent stores. There is no global state.
Capabilities present in code, not yet promoted
A few capabilities exist in the public API but are not part of the documented happy path. They are mentioned here so a code reader does not mistake them for missing pieces.
- Resume from transcript. Orchestrator.ResumeFlow(ctx, runID, wf) and WithTranscriptStore(factory) exist for cross-run resume. The default InMemoryTranscriptStore supports intra-run resume only; persistent stores plug in via the factory.
- Per-step timeout and retry. Fields exist in the Step struct. The execution semantics are stable enough to use but not yet covered by the canonical tutorials.
- Multimodal input via contextFiles. Image and PDF attachments work for multimodal models. See spec/v1/examples/context-files.yaml for the wire shape.
These will pick up dedicated documentation as they stabilise. The shipped surface in this release is the engine, the coordinator, the messaging substrate, and the YAML spec.
Where the code lives
zenflow/
doc.go Package zenflow doc
interfaces.go Storage/Tracer/StepIsolation/ApprovalHandler aliases (root facade)
workflow.go Workflow / Step / Run / StepResult / AgentConfig aliases (root facade)
duration.go Duration alias + FormatDuration / ParseDurationStrict re-exports
router_facade.go Re-exports for internal/router/ public API
transcript_facade.go Re-exports for internal/resume/ public API
coord_facade.go Re-exports for internal/coord/ tool factories
agent_facade.go Re-exports for internal/exec/ AgentRunner ecosystem
orchestrator_facade.go Re-exports for internal/exec/ Orchestrator + 49 With* + Executor + Storage backends + parsers + coord factory + JSON coordinator + ~60 utility symbols
e2e_enabled_default.go build !e2e
e2e_enabled_e2e.go build e2e
internal/
types/ Event, EventType, MessageKind, Output, ProgressSink, PermissionHandler/Request
spec/ Workflow / Step / Run / StepResult / AgentConfig / Duration types + Storage / Tracer / StepIsolation / ApprovalHandler / ModelResolver interface contracts
router/ MessageRouter, MailboxStore, deliveryEngine
resume/ TranscriptStore, InMemoryTranscriptStore
coord/ RunnerHandle interface + 4 coord goai.Tool factories
exec/ AgentRunner + Executor + Orchestrator + JSON coordinator + RunFlow/RunGoal/RunAgent + ResumeFlow + 49 With* options + Storage backends + SharedMemory + parsers + validators + scheduler + CEL evaluator + portability lints + isolation default + lifecycle + prompt assembly
observability/otel/ OpenTelemetry tracing exporters
cmd/zenflow/ CLI entrypoint
sink/ Progress sinks (stdout, NDJSON)
examples/ 18 //go:build example mains
spec/v1/
schema.json Workflow JSON Schema
spec.md Authoritative YAML specification
examples/ 19 reference workflows

There is no adapter/goai/ package - the goai SDK is consumed directly by internal/exec/executor.go and internal/exec/agent_runner.go via github.com/zendev-sh/goai imports. The Orchestrator API is stable. The internal package layout under internal/exec/ and the sibling internal packages is an implementation detail and can change without notice.