# Observability
zenflow emits OpenTelemetry spans through its `Tracer` interface. The default in-tree implementation lives in the `zenflow/observability/otel` sub-module; once wired, spans flow through goai's tracing options for LLM-call instrumentation.
## Where the spans live

zenflow produces spans at several levels. The `Tracer` interface lives in `interfaces.go`; the span names below are emitted by `internal/exec/`:
| Span name | When | Notable attributes |
|---|---|---|
| `zenflow.flow` | Top of `RunFlow` / `ResumeFlow` | `zenflow.run_id`, `zenflow.workflow.name`, `zenflow.resume` (on resume) |
| `zenflow.goal` | Top of `RunGoal` | `zenflow.run_id`, `zenflow.goal.text` (truncated to 200 chars) |
| `zenflow.agent` | Top of `RunAgent` | `zenflow.run_id`, `zenflow.agent.prompt` (truncated) |
| `zenflow.step` | Per step inside a workflow DAG | `zenflow.run_id`, `zenflow.step.id`, `zenflow.step.agent` |
| `zenflow.coordinator` | Each coordinator activation (per step event, plus the final synthesis) | `zenflow.run_id`, `zenflow.coordinator.phase` |
| `zenflow.loop` | Per loop block in a workflow | `zenflow.run_id`, `zenflow.step.id`, `zenflow.loop.type` |
| `zenflow.loop.iteration` | Per iteration inside a loop block | `zenflow.run_id`, `zenflow.step.id`, `zenflow.loop.iteration` |
| `zenflow.include` | When a workflow includes another workflow | `zenflow.run_id`, `zenflow.step.id`, `zenflow.include.ref` |
LLM-call spans nest underneath the relevant zenflow span and come from goai. Names are `goai.generate`, `goai.stream`, etc., with provider, model, and token attributes attached. Tool-call spans (`goai.tool`) nest under those.
The result for one workflow run is a tree shaped like:

```
zenflow.flow
├── zenflow.step
│   ├── goai.generate
│   │   └── goai.tool
│   └── goai.generate
└── zenflow.step
    └── goai.generate
```

A `zenflow.flow` root span contains the `zenflow.step` children. Each step span wraps the `goai.generate` calls it issues; tool calls (`goai.tool`) nest under the generate that invoked them.

## Wiring OTel in Go
The bridge is two lines: install zenflow's tracer via `zenotel.WithTracing()`, and enable goai LLM-call spans via `WithGoAIOptions(zenotel.GoAIOption())`. Both come from the `zenflow/observability/otel` sub-module, so the core library has no OTel dependency.
```go
import (
	"context"
	"log"

	"github.com/zendev-sh/zenflow"
	zenotel "github.com/zendev-sh/zenflow/observability/otel"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func setupTracing(ctx context.Context) (*sdktrace.TracerProvider, error) {
	exp, err := otlptracehttp.New(ctx) // reads OTEL_EXPORTER_OTLP_ENDPOINT etc.
	if err != nil {
		return nil, err
	}
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exp),
		sdktrace.WithResource( /* your resource attrs */ ),
	)
	otel.SetTracerProvider(tp)
	return tp, nil
}

func main() {
	ctx := context.Background()
	tp, err := setupTracing(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer tp.Shutdown(ctx)

	orch := zenflow.New(
		zenflow.WithModel(model),
		zenotel.WithTracing(),                         // zenflow spans
		zenflow.WithGoAIOptions(zenotel.GoAIOption()), // goai LLM-call spans
	)
	defer orch.Close()

	result, err := orch.RunFlow(ctx, wf)
	_ = result
	_ = err
}
```

Two pieces matter:
- `zenotel.WithTracing()` returns a `zenflow.Option` that installs zenflow's span-producing layer. Without it, the `zenflow.flow` / `zenflow.step` / `zenflow.agent` / `zenflow.coordinator` / `zenflow.loop` / `zenflow.include` spans are not produced. The implementation lives in the `zenflow/observability/otel` sub-module so the core library has no OTel dependency.
- `WithGoAIOptions(zenotel.GoAIOption())` enables the LLM-call spans. zenflow forwards the goai options into the runner, where they wire up `goai.generate` and `goai.tool` spans that nest under whatever parent context the runner received from zenflow.
The OTel SDK setup itself (exporter, resource, propagators) is the same as for any Go service; zenflow just produces spans into whatever provider you've globally registered.
## Routing to specific backends

OTel's `OTEL_EXPORTER_OTLP_ENDPOINT` env var controls where the spans end up. Common destinations:
### Langfuse
Langfuse Cloud and self-hosted both expose an OTLP-compatible endpoint:
```bash
export OTEL_EXPORTER_OTLP_ENDPOINT=https://cloud.langfuse.com/api/public/otel
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic $(echo -n "${LANGFUSE_PUBLIC_KEY}:${LANGFUSE_SECRET_KEY}" | base64)"
```

Langfuse renders the LLM calls (prompts, completions, tokens, costs) inside the wrapping `zenflow.step` span, so each step shows up as a "trace" in Langfuse with the LLM rounds expanded underneath.
### Jaeger

Run Jaeger's all-in-one container locally:
```bash
docker run --rm -d \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  jaegertracing/all-in-one:latest

# 4318 is the OTLP/HTTP port, which matches the otlptracehttp exporter
# used in the Go example (4317 is OTLP/gRPC).
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
```

Then open http://localhost:16686 and pick "zenflow" from the service dropdown. The flow tree renders as nested spans with timing on the right.
### Datadog

Datadog accepts OTLP via the Datadog Agent's OTLP receiver:
```yaml
# datadog-agent.yaml (Helm values or daemonset config)
otlp_config:
  receiver:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
```

Then point your app at the agent (the `http` receiver on 4318 matches the OTLP/HTTP exporter used in the Go example):

```bash
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
```

Datadog's APM UI groups the spans into a service map and highlights long-running steps, error rates, and token usage when the goai spans set those attributes.
## Span attributes worth filtering on

The attributes zenflow attaches at each level make for useful searches:
- `zenflow.run_id`: ties every span in one workflow run together. The same ID flows into the NDJSON event stream's `runId` field, so you can cross-reference a CI artifact with a trace.
- `zenflow.workflow.name`: filter by workflow YAML.
- `zenflow.step.id`: drill into one step across many runs.
- `zenflow.step.agent`: group by the agent persona that ran the step (useful when one agent shows up in many workflows).
- `zenflow.resume = "true"`: filter for runs that came from `ResumeFlow` rather than a fresh start.
The goai layer adds provider/model attributes (`gen_ai.system`, `gen_ai.request.model`, `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`) following the OTel GenAI semantic conventions. Cost-attribution dashboards typically aggregate over `gen_ai.system` and `gen_ai.request.model`.
## Sampling

For production deployments where every workflow run is interesting (which covers most zenflow use cases, since workflows are explicit runs rather than continuous request handling), use `AlwaysSample`:
```go
sdktrace.NewTracerProvider(
	sdktrace.WithSampler(sdktrace.AlwaysSample()),
	sdktrace.WithBatcher(exp),
)
```

If you have high workflow throughput and need to drop some, `ParentBased(TraceIDRatioBased(0.1))` keeps 10 percent of the trees; importantly, it samples at the root, so you never lose a single step from a sampled trace.
## Without OTel

If you don't want to wire OTel at all, the NDJSON event stream from `--json` carries equivalent information for most diagnostic needs: per-step start and end events with status, duration, and token counts. It is plain text, easy to grep, and works in CI without any extra infrastructure.
OTel becomes worth the setup once you have multiple long-running services, want flame-graph timing across step parallelism, or need to tie zenflow runs into an existing observability stack.