NATS Data Loss Exposed: Jepsen Finds 49.7% Message Loss

Jepsen’s analysis of NATS 2.12.1 uncovered something every backend developer dreads: acknowledged messages that aren’t actually durable. Independent distributed systems testing found NATS JetStream loses acknowledged writes under real-world failure scenarios—single crashes, power failures, even minority node corruption. The culprit? A default fsync policy that flushes data every two minutes while acknowledging writes immediately. If you’re running NATS in production, check your configuration now.

The Lazy fsync Betrayal

NATS JetStream acknowledges publishes immediately but only flushes writes to disk every two minutes by default. This violates a fundamental principle of the Raft consensus algorithm: sync before ACK.
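The sync-before-ACK rule can be illustrated with a minimal Python sketch. This is not NATS code: `append_durable` and `append_lazy` are hypothetical helpers that append to a local file, and the only difference between them is whether `os.fsync` runs before the function returns its acknowledgment.

```python
import os
import tempfile


def append_durable(path: str, record: bytes) -> bool:
    """Append a record and fsync before acknowledging (returning True)."""
    with open(path, "ab") as f:
        f.write(record + b"\n")
        f.flush()             # move data from Python's buffer to the OS page cache
        os.fsync(f.fileno())  # ask the OS to persist the page cache to disk
    return True               # ACK only after the data has been synced


def append_lazy(path: str, record: bytes) -> bool:
    """Append a record and acknowledge without fsync."""
    with open(path, "ab") as f:
        f.write(record + b"\n")
        f.flush()             # data reaches the page cache only
    return True               # ACK while the data could still vanish on power loss


# Hypothetical log file in a temporary directory, for illustration only.
log_path = os.path.join(tempfile.mkdtemp(), "wal.log")
append_durable(log_path, b"msg-1")
append_lazy(log_path, b"msg-2")
```

Both calls return the same acknowledgment to the caller, which is the point: from the outside, a lazy ACK and a durable ACK look identical until the machine loses power.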

Jepsen’s testing revealed the consequences. In a coordinated power failure scenario, NATS lost 131,418 of 930,005 messages, or 14.1% of acknowledged writes. Typical failure scenarios produced 30 seconds of write loss. The Raft thesis explicitly states that nodes must “flush new log entries to their disks” before acknowledging. MongoDB, etcd, and TiDB all follow this standard. NATS, by default, doesn’t.

The fix exists: setting sync_interval: always forces disk synchronization before acknowledgment. However, users must discover this manually. By default, NATS optimizes for throughput at the expense of durability.

Minority Corruption, Majority Data Loss

Consensus protocols should protect against minority node failures. In a five-node cluster, one corrupted node shouldn’t compromise the quorum. Jepsen proved NATS doesn’t uphold this guarantee.
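The majority arithmetic behind that expectation is simple enough to sketch. The helper names below are hypothetical, and the calculation covers only how many nodes a majority quorum can lose; it says nothing about how any particular implementation handles corrupted data.

```python
def quorum_size(nodes: int) -> int:
    """Smallest number of nodes that forms a majority."""
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """Nodes that can be lost while a majority still remains."""
    return nodes - quorum_size(nodes)


print(quorum_size(5), tolerated_failures(5))  # prints "3 2"
```

With five nodes, three form a quorum, so a single faulty node sits well inside the margin a majority-based protocol is expected to absorb.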

A single-bit error in one of five nodes caused catastrophic results: 679,153 of 1,367,069 acknowledged writes lost—49.7% vanished. Different replicas served different message histories, creating persistent split-brain conditions. Corrupted nodes could become cluster leaders and unilaterally delete committed data.
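The percentages follow directly from the reported counts. A small check, using a hypothetical `loss_percent` helper and the figures quoted in this article:

```python
def loss_percent(lost: int, acknowledged: int) -> float:
    """Share of acknowledged writes that were lost, rounded to one decimal."""
    return round(100 * lost / acknowledged, 1)


print(loss_percent(679_153, 1_367_069))  # single-bit corruption test: 49.7
print(loss_percent(131_418, 930_005))    # coordinated power failure test: 14.1
```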

When snapshot metadata corruption occurred, affected nodes triggered orphaned stream detection and deleted all stream data files. If that corrupted node became leader, it wiped committed data across the cluster. The stream became permanently unavailable.

The Dangerous Defaults Pattern

NATS isn’t alone. The industry has a dangerous defaults problem.

SurrealDB ships with fsync disabled by default to improve benchmark performance, risking data corruption on power failures. NATS acknowledges writes before flushing, optimizing throughput over safety. Both systems document these trade-offs. Neither changes the defaults.

PostgreSQL represents the better approach: safe by default, performance opt-in. Users who need speed can disable synchronous_commit. The default protects production systems from silent data loss.

Here’s the fundamental question: should distributed systems optimize for the 1% running benchmarks or the 99% running production workloads? When “acknowledged” doesn’t mean “durable,” we’ve broken a core contract with users.

Split-Brain from Single Crashes

Jepsen found that a single OS crash combined with process pauses or network partitions caused persistent replica divergence that survived cluster recovery. Different nodes lost acknowledged messages from different time windows. The split-brain persisted even after complete cluster restart and network healing.

Loss windows approximated the sync_interval setting (10 seconds in testing). Any node could lose committed writes: the failed node, nodes that were running before the failure, or nodes that came up after. Because acknowledged entries can sit unsynced in memory on the nodes that formed the majority, a node that restarts without synced state can drop writes the cluster had already committed.
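If the loss window tracks sync_interval, a back-of-the-envelope upper bound on exposed messages is the publish rate multiplied by the interval. The sketch below assumes a steady, hypothetical rate of 1,000 messages per second and models `always` as a zero-second window; real workloads and real failures will differ.

```python
def messages_at_risk(publish_rate_per_s: float, sync_interval_s: float) -> float:
    """Upper bound on acknowledged-but-unsynced messages when a crash hits."""
    return publish_rate_per_s * sync_interval_s


RATE = 1_000  # hypothetical steady publish rate, messages per second

for label, interval_s in [
    ("2 minutes (default)", 120),
    ("10 seconds (testing)", 10),
    ("always (modeled as 0 s)", 0),
]:
    print(f"{label}: up to {messages_at_risk(RATE, interval_s):,.0f} messages")
```

At that assumed rate, the two-minute default leaves up to 120,000 acknowledged messages unsynced at any moment.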

What Developers Must Do

If you’re running NATS JetStream in production, verify your sync_interval configuration immediately. The default setting puts acknowledged data at risk during crashes and power failures, especially when several nodes fail together.

Set sync_interval: always for critical data. Yes, this impacts performance. That’s the trade-off between throughput and durability. Systems handling financial transactions, user data, or state that can’t be reconstructed need durability before acknowledgment.
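A quick way to audit a server configuration is to look for the setting in the config text. The sketch below is a rough check, not a parser for the NATS configuration format: `find_sync_interval` is a hypothetical helper, `SAMPLE_CONF` is an illustrative fragment with an assumed `store_dir` path, and it assumes the setting lives in the `jetstream` block as described in the NATS server documentation.

```python
import re
from typing import Optional

# Illustrative config fragment; the store_dir path is an assumption.
SAMPLE_CONF = """
jetstream {
  store_dir: /var/lib/nats
  sync_interval: always
}
"""


def find_sync_interval(conf_text: str) -> Optional[str]:
    """Return the first sync_interval value found, or None if it is not set."""
    # Skip whole-line comments so a commented-out setting is not counted.
    lines = [
        line for line in conf_text.splitlines()
        if not line.strip().startswith(("#", "//"))
    ]
    # Accept "key: value", "key = value", and "key value" forms.
    match = re.search(
        r"\bsync_interval\b\s*[:=]?\s*[\"']?([\w.]+)", "\n".join(lines)
    )
    return match.group(1) if match else None


def syncs_before_ack(conf_text: str) -> bool:
    """True only when sync_interval is explicitly set to 'always'."""
    return find_sync_interval(conf_text) == "always"


print(find_sync_interval(SAMPLE_CONF))  # always
print(syncs_before_ack("jetstream { store_dir: /data }"))  # False
```

A missing setting is reported as not durable because the default applies whenever nothing is configured.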

Beyond NATS, question every “reliable” system’s defaults. Documentation claims and actual guarantees often diverge. SurrealDB prioritizes benchmark performance. NATS prioritizes throughput. What other systems ship with safety opt-in rather than safety by default?

Test failure scenarios. Don’t trust marketing claims. Jepsen tests are open source—the NATS test suite runs on standard infrastructure. If you’re building critical systems, verify the guarantees yourself.
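The core of such a test is a comparison between what the system acknowledged and what can be read back after the faults heal. The sketch below shows only that final check, with a hypothetical `check_history` helper and made-up message IDs; it is not how the Jepsen suite itself is implemented, and a real test also needs fault injection and a client that records every acknowledgment.

```python
from typing import Dict, Iterable


def check_history(acknowledged: Iterable[int], observed: Iterable[int]) -> Dict:
    """Compare acknowledged message IDs with the IDs read back after recovery."""
    acked, seen = set(acknowledged), set(observed)
    lost = acked - seen   # acknowledged but missing: a durability violation
    extra = seen - acked  # present without a recorded ACK: allowed, the ACK may have been dropped
    return {
        "lost": sorted(lost),
        "unacknowledged_but_present": sorted(extra),
        "loss_fraction": len(lost) / len(acked) if acked else 0.0,
    }


# Made-up history: ten acknowledged publishes, three missing after recovery.
result = check_history(range(1, 11), [1, 2, 3, 4, 5, 6, 7, 11])
print(result["lost"])           # [8, 9, 10]
print(result["loss_fraction"])  # 0.3
```

Any non-empty `lost` list means the system acknowledged a write it did not keep.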

The Bigger Picture

The NATS findings expose an industry-wide pattern: distributed systems that claim reliability but ship with unreliable defaults. When vendors optimize for benchmarks while production systems burn, we’ve got our priorities backwards.

Safe defaults should be mandatory for consensus systems. If your system loses acknowledged messages, you haven’t built a distributed database—you’ve built a distributed gamble. Users shouldn’t need to discover durability settings through data loss incidents.

NATS responded to Jepsen’s findings by updating documentation and marking issues “under investigation.” The dangerous defaults remain unchanged. Documentation doesn’t prevent production incidents. Safe defaults do.

Check your NATS configuration. Question your assumptions about “reliable” systems. Test everything. Trust nothing.

ByteBot

