
Architecture

zenflow is a declarative multi-agent workflow engine for Go. A workflow is a YAML file, the engine is a DAG executor, and an LLM-driven coordinator routes events between running steps through hub-and-spoke mailboxes. This page walks through the layers from top to bottom: the three-layer stack, the executor, the coordinator, the messaging substrate, and the lifecycle that ties them together.

The three-layer stack

From top to bottom, the stack is:

  • zenflow engine - Orchestrator + DAG executor + coordinator + messaging. Its pieces: the DAG executor (internal/exec/), the coordinator runner (internal/exec/coord_factory.go), MessageRouter + Mailbox and the delivery engine (internal/router/), TranscriptStore (internal/resume/), and the coord tool factories (internal/coord/).
  • goai SDK - GenerateText · StreamText · the tool loop · 9 lifecycle hooks, all behind the provider.LanguageModel interface (Google · Bedrock · Azure · ...).
  • LLM provider (HTTP) - Gemini · Bedrock · Azure · OpenAI-compatible · ...

zenflow does not implement the LLM tool loop. That is goai's job. zenflow contributes:

  • Workflow shape - what to run, in what order, with what dependencies.
  • Coordination - which agent gets which message at which moment.
  • Delivery guarantees - every message is either delivered to a mailbox or dropped with a typed reason.

If goai supports a provider, zenflow runs on it. Verified providers include Google gemini-3-pro-preview, AWS Bedrock (anthropic.claude-sonnet-4-6, minimax.minimax-m2.5), and Azure (DeepSeek-V3.2, claude-sonnet-4-6, gpt-5, gpt-5.3-codex).

Workflow as a DAG

A workflow is Workflow{ Name, Agents, Includes, Steps, Options }. Each Step has an ID, an agent reference, instructions, an optional dependsOn edge list, optional CEL condition, an optional loop block, and per-step model/tool overrides.

The executor builds a DAG in two passes:

  1. Validation. Every dependsOn ID exists. Every agent reference resolves. Loops, conditions, and includes parse against the schema. CEL expressions in condition blocks compile against the available context.
  2. Topological scheduling. Steps with all dependencies satisfied are ready. The scheduler picks ready steps up to MaxConcurrency (default 5) and runs each in its own goroutine. As a step finishes, its dependents become ready and are scheduled in turn.

The graph is the workflow author's contract with the runtime. The executor never invents a dependency edge, never re-orders steps, and never silently drops a step. A step that cannot run because of a condition evaluating false is marked skipped with the reason recorded; downstream dependsOn is satisfied as if the step had succeeded.

Parallel fan-out

When two or more steps share the same set of completed dependencies, they are started in parallel:

```yaml
steps:
  - id: design
    agent: architect
    instructions: "..."

  - id: api-server
    dependsOn: [design]
  - id: database
    dependsOn: [design]
  - id: ui-components
    dependsOn: [design]

  - id: integrate
    dependsOn: [api-server, database, ui-components]
```

api-server, database, and ui-components start at the same moment. integrate waits for all three. There is no fan-out keyword - parallel execution is implied by the graph shape.

Loops

Loops come in two flavours, both of which the executor expands into inner DAGs at run time, plus a switch that controls loop output:

  • forEach - iterate an array drawn from a previous step's structured output. Each iteration is its own sub-DAG. Iterations can run in parallel.
  • repeat-until - run a sub-DAG until an untilAgent returns done: true or until a CEL expression evaluates true (bounded by maxIterations).
  • outputMode - cumulative vs last controls whether downstream steps see every iteration's output or only the final one.

Each iteration's steps get namespaced IDs (loop-stages.0.worker, loop-stages[1].worker, deploy[0].deploy_step) so the coordinator can address them unambiguously. The same applies to includes: a sub-workflow's step IDs are prefixed by the include's parent step ID.

Conditions

A step's condition is a CEL expression evaluated against the context produced by upstream steps. When false, the step is skipped (not failed). The CEL surface is intentionally small: previous step outputs, shared memory entries, and a few built-in helpers.

The coordinator

The coordinator is itself an AgentRunner - the same primitive that drives every workflow step. What makes it special is the toolset and the wake loop.

Default tools

NewDefaultCoordRunner(llm) returns a runner pre-wired with three tools:

  • forward_to_agent(target_step_id, text, kind?) - route a message into a running step's mailbox. kind="context_update" injects context, kind="cancel" asks a step to stop, kind="info" (or omitted) is a general note.
  • narrate(text) - emit a user-facing narration event. Does not route to any step.
  • finalize(summary?) - signal that coordination is complete. The Run loop exits after this returns; the coordinator will not process more events.

SynthesizeOnly() drops narrate for the --summary-only CLI mode. WithCoordTools(...) appends caller-supplied tools (an SOP lookup, a human approval gate, a custom logger). WithCoordSystemPromptSuffix(extra) appends extra guidance to the tested baseline DefaultCoordSystemPrompt without forking it.

Wake cycles

Step lifecycle events (start, end, error) and inter-step send_message traffic land in the coordinator's mailbox. After each push, the executor pings the coordinator's Wake channel; the coordinator's Run loop wakes, reads everything in its mailbox, calls the LLM with the accumulated events as context, executes any tool calls (forward, narrate, or finalize), then exits the inner LLM loop and waits for the next wake.

The default wake-cycle cap is 100 (vs 10 for step runners). Coord runners are long-lived across the whole workflow and absorb every step lifecycle event plus every bridged send_message, so a higher cap is necessary. Override via WithCoordMaxWakeCycles(n).

When a forward_to_agent call drops (unknown step ID, mailbox full), the tool result tells the coordinator what went wrong and lists currently available step IDs. The system also preserves the dropped content as fallback narration so the user never loses LLM-generated text.

MessageRouter, Mailbox, and the delivery engine

zenflow's messaging substrate is three layers:

  • MessageRouter (public alias of router.Router in internal/router/router.go, re-exported via router_facade.go) - hub-and-spoke addressing. Maps step IDs (and namespaced loop/include IDs) to mailbox slots. Handles RegisterStep, RegisterInbox, Send, Close. Inner-DAG namespacing (loop-stages.0.worker) is resolved by MessageRouter delegation - the root MessageRouter holds a delegation entry pointing at the active iteration's nested router.
  • Mailbox (internal/router/mailbox.go) - per-agent inbox queue. The default InMemoryMailboxStore is bounded by WithMaxMailboxSize(n) (zero = unbounded). Custom backends (sqlite, redis) plug in via WithMailboxStore(factory).
  • Delivery engine (internal/router/delivery_engine.go) - race-safe coupling of Send and Wake. When a message arrives, the engine appends it to the recipient's mailbox AND signals the recipient's wake channel in a single atomic step. There is no possible interleaving where a message lands but the recipient never wakes, or the recipient wakes but the mailbox is empty.

Drop reasons

Every drop is typed. The complete list lives in internal/router/router.go as DropReason constants; the user-facing summary is:

| DropReason | When | Where to look |
|---|---|---|
| unknown-step | Target step ID was never registered and has no pending senders. | Coordinator mistakenly addressed a non-existent step. |
| mailbox-full | Bounded in-memory mailbox is at the WithMaxMailboxSize cap; the newest message is rejected (oldest-wins fairness). | Lower the send rate, or raise WithMaxMailboxSize. |
| mailbox-closed-by-finalize | Mailbox raced with a concurrent close; the closed flag won. | Sender lost the race against finalize; treat as terminal. |
| target-terminal | Send to a step whose mailbox was closed because the target reached a terminal lifecycle state. | Coordinator addressed a step that already returned. |
| workflow-cancelled | Workflow context was cancelled or abort fired before the message reached the target's LLM context. | Inspect the cancel cause; the run is shutting down. |
| max-wake-cycles | Wake loop hit the maxWakeCycles cap with messages still pending; the remainder drained as drops. | Bump WithMaxWakeCycles or WithCoordMaxWakeCycles. |
| hold-timeout | Executor's hold-timeout fired before the 3-invariant termination rule could converge; buffered messages are flushed and the step is force-terminated. | Sender is stuck; raise WithHoldTimeout or fix the sender. |
| no-transcript | Resume target's mailbox is closed AND the TranscriptStore has no saved transcript for the step. | Step ran before transcripts were persisted, or the transcript was deleted. |
| transcript-too-large | Saved transcript exceeds WithMaxTranscriptMessages / WithMaxTranscriptBytes; resume would exceed the size bound. | Inspect the persisted transcript. |
| resume-shutdown | Workflow context was cancelled mid-resume; the in-flight resume goroutine exited early. | Resume aborted by shutdown; retry on the next run. |
| resolver-error | Configured ModelResolver returned an error when resolving a saved-transcript model identifier. | Resolver infrastructure failure; check provider/model wiring. |

Every drop also fires EventMessageDropped on the progress sink, and (when set) WithDropCallback for metrics-only consumers.

Hub-and-spoke, no peer-to-peer

A step agent has exactly one channel out: send_message(text). This sends to the coordinator. The coordinator decides what to do with the message - usually forward_to_agent to another step, sometimes narrate for the user. Step agents never address each other directly. This is enforced by tool surface: there is no peer-to-peer send tool, and forward_to_agent is only registered on the coordinator runner.

The reason for the hub topology is auditability. Every inter-step message passes through the coordinator's LLM, which means you get a single place to log, reason about, intercept, or replay the conversation. The cost is an extra LLM call per forward; the benefit is that no agent can quietly poison another agent's context.

Lifecycle

A typical run looks like this:

  1. Orchestrator.New(opts...) - configure with WithModel · WithCoordinator · WithStorage · ...
  2. RunFlow(ctx, wf) - one workflow run; returns a WorkflowResult.
  3. Executor.Run - topological schedule, fanning out into goroutines.
  4. Step goroutines - parallel up to MaxConcurrency. Each step runs AgentRunner.Run → goai.GenerateText → tool loop → result. Side effects: send_message and EventStepEnd both land in the coord mailbox.
  5. Coordinator goroutine - one, long-lived. On each wake it drains its mailbox, calls the LLM, and picks one tool per turn: forward_to_agent (→ step mailbox), narrate (→ ProgressSink), or finalize (→ set the terminal flag). It then drains remaining events and exits.
  6. Return WorkflowResult{ Status, Summary, Steps[], Tokens, ... }.
  7. Close() - cancels in-flight RunAgentAsync handles.

Two goroutine families are at work: step goroutines run in parallel, each driving one agent through goai's tool loop; the coordinator goroutine is long-lived and single, waking on every mailbox push and deciding whether to narrate, forward, or finalize on its next LLM turn.

Orchestrator.Close() is idempotent and required for long-lived embedders. It cancels in-flight RunAgentAsync handles and rejects new RunAgent invocations with ErrOrchestratorClosed. Per-Run mailbox cleanup happens during Executor.Run unwind, not at Orchestrator.Close. Examples in zenflow/examples/ always defer orch.Close().

Event flow, end-to-end

  1. Step A (user-defined) calls send_message.
  2. The MessageRouter (hub addressing) performs Send + Wake into the coord mailbox (FIFO, typed drops).
  3. The coordinator wakes; the coord LLM runs its tool loop, one tool per turn: forward_to_agent, narrate, or finalize.
  4. forward_to_agent - the MessageRouter (delegation map) performs Send + Wake into step B's per-step mailbox; step B's LLM picks the message up on its next wake.
  5. narrate - an EventNarration goes to the ProgressSink (stdout / json).
  6. finalize - the Run loop exits and returns the WorkflowResult.
The MessageRouter is the only piece that touches both the step mailboxes and the coordinator mailbox. Steps never see each other directly; every inter-step message must transit the coordinator, by design.

Why this design

A handful of design choices distinguish zenflow from peer multi-agent frameworks. They are deliberate tradeoffs.

Hub-and-spoke topology. Peer-to-peer messaging is the natural shape for free-form group chats; zenflow chose hub-and-spoke. The hub pays one extra LLM call per forward and in return you get a single audit point, a single place to enforce policy, and a single source of conversational truth. The tradeoff is less directness in the message graph.

Mailbox + Wake delivery. Inter-agent delivery uses an atomic Send + Wake pair: the message is appended to the recipient's mailbox AND the wake fires, or neither happens. Drops are typed via DropReason. The pair handles the corner cases that show up in concurrent message routing (racing senders, recipients mid-shutdown, panicking handlers) because the drop reason is part of the contract instead of an inferred state.

Declarative YAML. A YAML workflow is reviewable in a PR, diffable across versions, and runnable from any language that can shell out to a binary. Programmatic control flow gives you flexibility; the tradeoff is that every topology is bespoke and every change is a code change. Pick the shape that matches your team's review and deployment workflow.

Provider abstraction via goai instead of bespoke clients. zenflow does not know about Bedrock cross-region fallback, Azure deployment routing, Gemini multimodal parts, or OpenAI Responses-API streaming. goai does. zenflow's job is workflow orchestration; the LLM details belong one layer down.

Single Orchestrator lifecycle. Long-running embedders hold one *Orchestrator for the lifetime of the process and call Close() in shutdown. The orchestrator owns the handle registry, the factory cache, and any persistent stores. There is no global state.

Capabilities present in code, not yet promoted

A few capabilities exist in the public API but are not part of the documented happy path. They are mentioned here so a code reader does not mistake them for missing pieces.

  • Resume from transcript. Orchestrator.ResumeFlow(ctx, runID, wf) and WithTranscriptStore(factory) exist for cross-run resume. The default InMemoryTranscriptStore supports intra-run resume only; persistent stores plug in via the factory.
  • Per-step timeout and retry. Fields exist in the Step struct. The execution semantics are stable enough to use but not yet covered by the canonical tutorials.
  • Multimodal input via contextFiles. Image and PDF attachments work for multimodal models. See spec/v1/examples/context-files.yaml for the wire shape.

These will pick up dedicated documentation as they stabilise. The shipped surface in this release is the engine, the coordinator, the messaging substrate, and the YAML spec.

Where the code lives

```
zenflow/
  doc.go                  Package zenflow doc
  interfaces.go           Storage/Tracer/StepIsolation/ApprovalHandler aliases (root facade)
  workflow.go             Workflow / Step / Run / StepResult / AgentConfig aliases (root facade)
  duration.go             Duration alias + FormatDuration / ParseDurationStrict re-exports
  router_facade.go        Re-exports for internal/router/ public API
  transcript_facade.go    Re-exports for internal/resume/ public API
  coord_facade.go         Re-exports for internal/coord/ tool factories
  agent_facade.go         Re-exports for internal/exec/ AgentRunner ecosystem
  orchestrator_facade.go  Re-exports for internal/exec/ Orchestrator + 49 With* + Executor + Storage backends + parsers + coord factory + JSON coordinator + ~60 utility symbols
  e2e_enabled_default.go  build !e2e
  e2e_enabled_e2e.go      build e2e
  internal/
    types/                Event, EventType, MessageKind, Output, ProgressSink, PermissionHandler/Request
    spec/                 Workflow / Step / Run / StepResult / AgentConfig / Duration types + Storage / Tracer / StepIsolation / ApprovalHandler / ModelResolver interface contracts
    router/               MessageRouter, MailboxStore, deliveryEngine
    resume/               TranscriptStore, InMemoryTranscriptStore
    coord/                RunnerHandle interface + 4 coord goai.Tool factories
    exec/                 AgentRunner + Executor + Orchestrator + JSON coordinator + RunFlow/RunGoal/RunAgent + ResumeFlow + 49 With* options + Storage backends + SharedMemory + parsers + validators + scheduler + CEL evaluator + portability lints + isolation default + lifecycle + prompt assembly
  observability/otel/     OpenTelemetry tracing exporters
  cmd/zenflow/            CLI entrypoint
  sink/                   Progress sinks (stdout, NDJSON)
  examples/               18 //go:build example mains
  spec/v1/
    schema.json           Workflow JSON Schema
    spec.md               Authoritative YAML specification
    examples/             19 reference workflows
```
There is no adapter/goai/ package - the goai SDK is consumed directly by internal/exec/executor.go and internal/exec/agent_runner.go via github.com/zendev-sh/goai imports. The Orchestrator API is stable. The internal package layout under internal/exec/ and the sibling internal packages is implementation detail and can change without notice.

Released under the Apache 2.0 License.