OTel Arrow Phase 2: Rust Dataflow Engine Kills the Serialization Tax

OTel Arrow Phase 2: The OTAP Dataflow Engine keeps telemetry in Apache Arrow format throughout the pipeline

OpenTelemetry Arrow Phase 1 solved bandwidth. Swap otlp for otelarrow in your Collector config and network bandwidth drops 50–70%. Most teams still haven’t done it. While that adoption lag plays out, Phase 2 is attacking a harder problem: the serialization tax hiding inside the pipeline itself.

The new OTAP Dataflow Engine, a Rust runtime built by F5 and Microsoft, keeps telemetry in Apache Arrow columnar format from ingestion through egress. No more deserializing Arrow batches into OTLP structs, filtering row by row, then re-serializing on the way out. The question is whether this architectural rethink holds up at production scale. Here’s what’s real and what’s still experimental.

What Phase 1 Fixed (and What It Didn’t)

Phase 1 of OTel Arrow focused entirely on the wire. Instead of transmitting telemetry as row-oriented OTLP protobufs, it encodes batches in Apache Arrow’s columnar format before sending. Columnar data compresses dramatically better than row-based formats because similar values sit adjacent in memory. The result: 50–70% bandwidth reduction for metrics and logs versus OTLP/gRPC with Zstd compression.
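To see why adjacency helps a compressor, here is a minimal sketch that uses run-length encoding as a simple stand-in. It is not Zstd and not Arrow's actual encoding, and the rows and field values are made up for illustration. The same eight records produce far fewer runs when each field is stored as its own column.

```python
from itertools import groupby

# Hypothetical telemetry rows: (service, status). Values are illustrative only.
rows = [("api", 200)] * 4 + [("db", 500)] * 4


def run_lengths(values):
    """Run-length encode a sequence as (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Row-oriented layout: the fields of each record are interleaved.
row_stream = [field for row in rows for field in row]

# Column-oriented layout: all values of one field are stored together.
service_column = [service for service, _ in rows]
status_column = [status for _, status in rows]

row_runs = run_lengths(row_stream)
column_runs = run_lengths(service_column) + run_lengths(status_column)

print(len(row_runs), "runs in the row layout")
print(len(column_runs), "runs in the column layout")
```

In the row layout no two neighbouring values are equal, so every value is its own run. In the column layout each field collapses to two runs. Real compressors are more sophisticated, but they benefit from the same locality.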

Phase 1 is stable and has shipped in OpenTelemetry Collector-Contrib since v0.104.0 (July 2024). It has been tested at 500,000–600,000 spans per second. There is no good reason to still be on plain OTLP if you’re running a high-volume pipeline.

What Phase 1 didn’t change: everything inside the Collector. Data arrives as Arrow, gets deserialized into OTLP structs for processing, passes through filter and routing processors row by row, then gets serialized again before export. That round-trip is where CPU time goes. At 50,000 requests per second, serialization overhead alone runs to roughly 35% of CPU utilization. Add regex-heavy filters and complex OTTL transforms and that number climbs.

Phase 2: The Pipeline Stays Arrow

The OTAP Dataflow Engine is the Phase 2 answer. It’s a Rust runtime designed around one principle: telemetry stays in Arrow columnar format inside the pipeline, not just on the wire. Instead of converting Arrow → OTLP → Arrow on every hop, the engine operates directly on Arrow record batches throughout.

Why Rust? The thread-per-core, NUMA-aware, shared-nothing architecture requires tight control over memory layout that Go’s garbage collector complicates. Each CPU core gets one worker thread. Data stays local to that thread. No cross-thread data movement on the hot path. Bounded channels prevent memory blowout during backpressure. Zero-copy data types eliminate unnecessary allocations.
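The shape of that design can be sketched in a few lines. This is a structural illustration only, written in Python rather than Rust: Python threads are not pinned to cores and share an interpreter lock, so it shows the wiring and not the performance. The function names, worker count, and channel capacity are all hypothetical.

```python
import queue
import threading


def worker(inbox, results, worker_id):
    """Drain this worker's private inbox; the running count is local to the thread."""
    local_count = 0
    while True:
        batch = inbox.get()
        if batch is None:  # sentinel: no more work
            break
        local_count += len(batch)
    # Each worker writes only its own slot, so no locking is needed here.
    results[worker_id] = local_count


def run_pipeline(batches, num_workers=2, channel_capacity=4):
    # One bounded queue per worker. put() blocks when the queue is full,
    # which is how backpressure reaches the producer instead of growing memory.
    inboxes = [queue.Queue(maxsize=channel_capacity) for _ in range(num_workers)]
    results = [0] * num_workers
    threads = [
        threading.Thread(target=worker, args=(inboxes[i], results, i))
        for i in range(num_workers)
    ]
    for thread in threads:
        thread.start()
    for index, batch in enumerate(batches):
        inboxes[index % num_workers].put(batch)  # each batch has exactly one owner
    for inbox in inboxes:
        inbox.put(None)
    for thread in threads:
        thread.join()
    return results


counts = run_pipeline([[1, 2], [3], [4, 5, 6], [7]], num_workers=2)
print(counts)
```

Each batch is handed to exactly one worker and never moves again, which is the property the shared-nothing design relies on.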

The practical payoff: filtering and routing become column-oriented operations. A column predicate evaluates an entire batch in one pass rather than iterating row-by-row through OTTL expressions. For teams with complex filter chains, that’s a significant change to the CPU profile.
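The difference between the two styles can be sketched as follows. Plain Python lists stand in for Arrow arrays here, and the field names and threshold are hypothetical. A real columnar engine would run a vectorized kernel over the column, but the control flow has the same shape: build one mask from a single column, then apply it to every column.

```python
# The same hypothetical batch of log records in two layouts.
row_batch = [
    {"severity": 9, "service": "api"},
    {"severity": 17, "service": "db"},
    {"severity": 21, "service": "api"},
    {"severity": 5, "service": "cache"},
]
column_batch = {
    "severity": [9, 17, 21, 5],
    "service": ["api", "db", "api", "cache"],
}


def filter_rows(rows, min_severity):
    """Row-oriented: inspect each record object in turn."""
    return [row for row in rows if row["severity"] >= min_severity]


def filter_columns(columns, min_severity):
    """Column-oriented: one pass over one column builds a boolean mask,
    then the mask selects the surviving positions in every column."""
    mask = [value >= min_severity for value in columns["severity"]]
    return {
        name: [value for value, keep in zip(values, mask) if keep]
        for name, values in columns.items()
    }


print(filter_rows(row_batch, 17))
print(filter_columns(column_batch, 17))
```

Both functions keep the same two records. The columnar version touches only the severity column to decide, which is what makes it friendly to tight loops and SIMD in a native implementation.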

What Ships With the Engine

The Dataflow Engine ships with a processor set covering common pipeline operations:

filter_processor — column-oriented predicate filtering
signal_type_router — routes logs, metrics, and traces to separate downstream paths
content_router — attribute-based routing
durable_buffer_processor — disk-backed storage using Arrow IPC format for reliable delivery
fanout_processor — fan data to multiple sinks
retry_processor — retry with backoff
attributes_processor — enrich and rename attributes
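As a rough mental model of the two routers in that list, both can be seen as grouping batches by a key. The sketch below is not the engine's API: the batch shape, the route function, and the tenant attribute are hypothetical names chosen for illustration.

```python
from collections import defaultdict


def route(batches, key):
    """Group batches by the value returned from key(batch)."""
    routed = defaultdict(list)
    for batch in batches:
        routed[key(batch)].append(batch)
    return dict(routed)


# Hypothetical batches; a real engine would carry Arrow record batches instead.
batches = [
    {"signal": "logs", "attributes": {"tenant": "a"}},
    {"signal": "traces", "attributes": {}},
    {"signal": "logs", "attributes": {"tenant": "b"}},
]

# Signal-type routing: one downstream path per signal.
by_signal = route(batches, lambda batch: batch["signal"])

# Content routing: pick the path from an attribute, with a fallback path.
by_tenant = route(batches, lambda batch: batch["attributes"].get("tenant", "default"))

print(sorted(by_signal))
print(sorted(by_tenant))
```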

There’s also an HTTP admin API that exposes live pipeline state, configuration, debug logs, and Prometheus metrics. You can reconfigure the pipeline without restarting it — a meaningful operational upgrade over the current Collector reload behavior.

Who Is Building This

F5 Networks built the original Dataflow Engine and transferred the repository to the OpenTelemetry organization. Microsoft joined as co-developer, contributing Rust expertise and Azure-scale observability infrastructure requirements. GreptimeDB is contributing Rust support. The backing matters: F5 and Microsoft run telemetry at the scale where this architecture is necessary, not theoretical.

What’s Production-Ready and What Isn’t

This distinction matters more than the headline. Phase 1 transport is stable — use it today. The Phase 2 Dataflow Engine is explicitly incubation-stage. The OpenTelemetry team is direct about it: no backward-compatibility guarantees on configuration formats, APIs, or component interfaces. Production deployment is not recommended at this stage.

That is the right call. The project is in active design iteration. Running experimental pipeline infrastructure under production SLOs is how you manufacture incidents.

The recommended path: evaluate the engine in a non-production environment. Run benchmarks on your actual telemetry shape. File issues. The conversation is happening at open-telemetry/otel-arrow on GitHub.

What To Do Right Now

If you’re running OTLP between Collectors and haven’t enabled Phase 1, that’s the first action. The Phase 1 exporter is a drop-in swap:

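receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # accept OTLP/gRPC from instrumented applications
exporters:
  otelarrow:
    endpoint: collector.example.com:4317   # the downstream Collector
    tls:
      insecure: true   # plaintext; suitable for a test setup only
processors:
  batch:
    timeout: 10s
    send_batch_size: 1024
service:
  pipelines:
    # The same swap applies to every signal: only the exporter name changes.
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otelarrow]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otelarrow]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otelarrow]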

Default stream count is max(1, NumCPU/2). Compression levels run from zstdarrow1 to zstdarrow10.

For Phase 2: follow the repository, run the engine in a sandbox, and provide feedback. The roadmap includes Wasmtime integration for user-defined Wasm processors and Go–Rust pipeline interop so the existing Collector can orchestrate OTAP pipelines.
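For capacity planning, the default stream count quoted above is easy to reproduce. This small sketch assumes the division is integer division, as it would be in Go, and default_stream_count is a hypothetical helper name, not part of the Collector.

```python
import os


def default_stream_count(num_cpu=None):
    """Return max(1, NumCPU/2) using integer division."""
    if num_cpu is None:
        num_cpu = os.cpu_count() or 1  # os.cpu_count() can return None
    return max(1, num_cpu // 2)


for cores in (1, 2, 3, 8, 64):
    print(cores, "cores ->", default_stream_count(cores), "streams")
```

A single-core host still gets one stream, and an 8-core host gets four.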

The serialization tax is real. Phase 2 is the right architecture to eliminate it. The honest timing: Phase 1 should already be running; Phase 2 is ready for evaluation, not production. Before committing to a Phase 2 migration timeline, read the OTel Arrow production case study for what Phase 1 delivers at real scale.

ByteBot
