# Observability
zenflow emits OpenTelemetry spans through its `Tracer` interface. The default in-tree implementation lives in the `zenflow/observability/otel` sub-module; once wired, spans flow through goai's tracing options for LLM-call instrumentation.
## Where the spans live

zenflow produces spans at several levels. The `Tracer` interface lives in `interfaces.go`; the span names below are emitted by `internal/exec/`:
| Span name | When | Notable attributes |
|---|---|---|
| `zenflow.flow` | Top of `RunFlow` / `ResumeFlow` | `zenflow.run_id`, `zenflow.workflow.name`, `zenflow.resume` (on resume) |
| `zenflow.goal` | Top of `RunGoal` | `zenflow.run_id`, `zenflow.goal.text` (truncated to 200 chars) |
| `zenflow.agent` | Top of `RunAgent` | `zenflow.run_id`, `zenflow.agent.prompt` (truncated) |
| `zenflow.step` | Per step inside a workflow DAG | `zenflow.run_id`, `zenflow.step.id`, `zenflow.step.agent` |
| `zenflow.coordinator` | Each coordinator activation (per step event, plus the final synthesis) | `zenflow.run_id`, `zenflow.coordinator.phase` |
| `zenflow.loop` | Per loop block in a workflow | `zenflow.run_id`, `zenflow.step.id`, `zenflow.loop.type` |
| `zenflow.loop.iteration` | Per iteration inside a loop block | `zenflow.run_id`, `zenflow.step.id`, `zenflow.loop.iteration` |
| `zenflow.include` | When a workflow includes another workflow | `zenflow.run_id`, `zenflow.step.id`, `zenflow.include.ref` |
LLM-call spans nest underneath the relevant zenflow span and come from goai. Names are `goai.generate`, `goai.stream`, etc., with provider, model, and token attributes attached. Tool-call spans (`goai.tool`) nest under those.
The result for one workflow run is a tree shaped like:

```
zenflow.flow
├── zenflow.step
│   ├── goai.generate
│   │   └── goai.tool
│   └── goai.generate
└── zenflow.step
    └── goai.generate
```

A `zenflow.flow` root span contains the `zenflow.step` children. Each step span wraps the `goai.generate` calls it issues; tool calls (`goai.tool`) nest under the generate that invoked them.

## Wiring OTel in Go
The bridge is two lines: install zenflow's tracer via `zenotel.WithTracing()`, and enable goai LLM-call spans via `WithGoAIOptions(zenotel.GoAIOption())`. Both come from the `zenflow/observability/otel` sub-module, so the core library has no OTel dependency.
```go
import (
	"context"
	"log"

	"github.com/zendev-sh/zenflow"
	zenotel "github.com/zendev-sh/zenflow/observability/otel"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func setupTracing(ctx context.Context) (*sdktrace.TracerProvider, error) {
	exp, err := otlptracehttp.New(ctx) // reads OTEL_EXPORTER_OTLP_ENDPOINT etc.
	if err != nil {
		return nil, err
	}
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exp),
		sdktrace.WithResource( /* your resource attrs */ ),
	)
	otel.SetTracerProvider(tp)
	return tp, nil
}

func main() {
	ctx := context.Background()
	tp, err := setupTracing(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer tp.Shutdown(ctx)

	orch := zenflow.New(
		zenflow.WithModel(model),
		zenotel.WithTracing(),                         // zenflow spans
		zenflow.WithGoAIOptions(zenotel.GoAIOption()), // goai LLM-call spans
	)
	defer orch.Close()

	result, err := orch.RunFlow(ctx, wf)
	_ = result
	_ = err
}
```

Two pieces matter:
- `zenotel.WithTracing()` returns a `zenflow.Option` that installs zenflow's span-producing layer. Without it, the `zenflow.flow` / `zenflow.step` / `zenflow.agent` / `zenflow.coordinator` / `zenflow.loop` / `zenflow.include` spans are not produced. The implementation lives in the `zenflow/observability/otel` sub-module so the core library has no OTel dependency.
- `WithGoAIOptions(zenotel.GoAIOption())` enables the LLM-call spans. zenflow forwards the goai options into the runner, where they wire up `goai.generate` and `goai.tool` spans that nest under whatever parent context the runner received from zenflow.
The OTel SDK setup itself (exporter, resource, propagators) is the same as for any Go service; zenflow just produces spans into whatever provider you've globally registered.
## Routing to specific backends

OTel's `OTEL_EXPORTER_OTLP_ENDPOINT` env var controls where the spans end up. Common destinations:
### Langfuse
Langfuse Cloud and self-hosted both expose an OTLP-compatible endpoint:
```bash
export OTEL_EXPORTER_OTLP_ENDPOINT=https://cloud.langfuse.com/api/public/otel
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic $(echo -n "${LANGFUSE_PUBLIC_KEY}:${LANGFUSE_SECRET_KEY}" | base64)"
```

Langfuse renders the LLM calls (prompts, completions, tokens, costs) inside the wrapping `zenflow.step` span, so each step shows up as a "trace" in Langfuse with the LLM rounds expanded underneath.
### Jaeger

Run Jaeger's all-in-one container locally:
```bash
docker run --rm -d \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  jaegertracing/all-in-one:latest

# 4318 is the OTLP/HTTP port, which matches the otlptracehttp exporter
# used in the Go example (4317 is OTLP/gRPC).
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
```

Then open http://localhost:16686 and pick "zenflow" from the service dropdown. The flow tree renders as nested spans with timing on the right.
### Datadog

Datadog accepts OTLP via the Datadog Agent's OTLP receiver:
```yaml
# datadog-agent.yaml (Helm values or daemonset config)
otlp_config:
  receiver:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
```

Then point your app at the agent (the `http` receiver on 4318 matches the OTLP/HTTP exporter used in the Go example):

```bash
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
```

Datadog's APM UI groups the spans into a service map and highlights long-running steps, error rates, and token usage when the goai spans set those attributes.
## Span attributes worth filtering on

The attributes zenflow attaches at each level make for useful searches:
- `zenflow.run_id`: ties every span in one workflow run together. The same ID flows into the NDJSON event stream's `runId` field, so you can cross-reference a CI artifact with a trace.
- `zenflow.workflow.name`: filter by workflow YAML.
- `zenflow.step.id`: drill into one step across many runs.
- `zenflow.step.agent`: group by the agent persona that ran the step (useful when one agent shows up in many workflows).
- `zenflow.resume = "true"`: filter for runs that came from `ResumeFlow` rather than a fresh start.
The goai layer adds provider/model attributes (`gen_ai.system`, `gen_ai.request.model`, `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`) following the OTel GenAI semantic conventions. Cost-attribution dashboards typically aggregate over `gen_ai.system` and `gen_ai.request.model`.
## Sampling

For production deployments where every workflow run is interesting (which covers most zenflow use cases, since workflows are explicit runs rather than continuous request handling), use `AlwaysSample`:
```go
sdktrace.NewTracerProvider(
	sdktrace.WithSampler(sdktrace.AlwaysSample()),
	sdktrace.WithBatcher(exp),
)
```

If you have high workflow throughput and need to drop some, `ParentBased(TraceIDRatioBased(0.1))` keeps 10 percent of the trees; importantly, it samples at the root, so you never lose a single step from a sampled trace.
## Without OTel

If you don't want to wire OTel at all, the NDJSON event stream from `--json` carries equivalent information for most diagnostic needs: per-step start and end events with status, duration, and token counts. It is plain text, easy to grep, and works in CI without any extra infrastructure.
OTel becomes worth the setup once you have multiple long-running services, want flame-graph timing across step parallelism, or need to tie zenflow runs into an existing observability stack.