Cloudflare experienced another global outage this morning, taking down LinkedIn, Zoom, Shopify, banks, and thousands of other websites for 25 minutes. This marks the second major incident in less than three weeks—and both were caused by internal configuration errors, not external attacks. When one company controls roughly 20% of the internet’s traffic, even a routine code change can break the web.
The December 5 Incident
The outage began at 8:47 UTC and lasted 25 minutes, affecting approximately 28% of Cloudflare’s HTTP traffic. All affected requests returned HTTP 500 errors. The cause? A Lua nil reference error triggered during a Web Application Firewall configuration change:
[lua] Failed to run module rulesets callback late_routing:
/usr/local/nginx-fl/lua/modules/init.lua:314:
attempt to index field 'execute' (a nil value)
Here’s the kicker: Cloudflare was trying to protect customers from CVE-2025-55182, a critical React Server Components vulnerability with a CVSS score of 10.0. They increased the WAF buffer size from 128KB to 1MB to catch larger malicious payloads. Then, when engineers disabled an internal testing tool, code that still referenced it hit the nil value in the error above, and the affected requests started failing with 500s.
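Cloudflare’s proxy logic runs as Lua inside nginx, but the failure class is language-agnostic: a change removed a component that other code still assumed was present, and the first dereference of the missing piece became a hard error on the request path. Here is a minimal sketch of that pattern in Python, with hypothetical module names, next to the kind of guard that degrades one feature instead of the whole request:

```python
# Hypothetical registry of WAF modules. "testing_tool" mirrors the internal
# tool that was disabled but is still referenced by a callback.
ACTIVE_MODULES = {
    "rulesets": {"execute": lambda request: {"action": "allow"}},
    "testing_tool": None,  # disabled; the entry is now a None/nil value
}

def run_callback_unsafe(name, request):
    # The outage pattern: dereferencing the missing module raises, and the
    # request dies with a 500 instead of being served.
    module = ACTIVE_MODULES[name]
    return module["execute"](request)  # TypeError: 'NoneType' object is not subscriptable

def run_callback_fail_open(name, request):
    # Fail-open variant: if the module or its entry point is missing,
    # skip that rule and serve the request rather than erroring.
    module = ACTIVE_MODULES.get(name)
    execute = module.get("execute") if module else None
    if execute is None:
        return {"action": "allow", "note": f"module {name!r} unavailable, failing open"}
    return execute(request)

request = {"path": "/", "body": ""}
print(run_callback_fail_open("testing_tool", request))  # skips the missing module
print(run_callback_fail_open("rulesets", request))      # runs normally
```

Whether a WAF should fail open or fall back to the last known-good ruleset is a genuine trade-off, but either option beats turning a missing module into a 500 for every request it touches.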
The irony is thick: Cloudflare caused a 25-minute global outage while trying to protect customers from a security vulnerability. Those 25 minutes of downtime arguably cost businesses more than the React vulnerability itself ever would have. Security patches need the same rigorous testing as feature releases; good intentions don’t excuse sloppy execution.
The Pattern That Matters
This wasn’t a one-off. Less than three weeks ago, on November 18, Cloudflare experienced a 3-6 hour outage when a database permissions change doubled the size of their Bot Management feature file, crashing proxy servers globally. That incident took down ChatGPT, X (Twitter), Spotify, and League of Legends.
Two major outages in under three weeks. Two completely different root causes—database configuration versus WAF configuration. Same outcome: global infrastructure failure from internal errors.
This isn’t bad luck or external threats. It’s a process problem. Two unrelated internal errors in less than a month suggest inadequate testing, deployment safeguards, or engineering practices for a company this critical to internet infrastructure. When you’re Cloudflare’s size, these errors are unacceptable.
The Centralization Problem
According to W3Techs, 80.7% of websites that use CDNs rely on Cloudflare. That’s roughly 21.8% of ALL websites on the internet. When Cloudflare sneezes, the internet catches a cold.
One nil reference error in a Lua script took down LinkedIn, Zoom, banks, e-commerce sites, and thousands of other services simultaneously. We’ve created a massive single point of failure, and it keeps failing. This is an architectural problem for the internet itself.
Cloudflare’s market dominance isn’t surprising—they offer a generous free tier, excellent DDoS protection, and 330+ global data centers. But “too big to fail” infrastructure that keeps failing is a systemic risk. Multi-CDN strategies are no longer optional for serious businesses.
What Cloudflare Promises
After the December 5 incident, Cloudflare committed to three fixes:
- Enhanced rollouts with health validation for configuration changes
- Streamlined emergency procedures with improved rollback capabilities
- “Fail-open” error handling that defaults to known-good states
They also halted all network changes pending implementation of these safeguards.
Here’s the problem: these should have already been standard practice. Gradual rollouts, canary deployments, health checks, circuit breakers—these aren’t cutting-edge DevOps practices. They’re table stakes for any system at scale, let alone one that serves 20% of the internet.
The fact that Cloudflare is promising “better testing” after two major outages suggests these safeguards weren’t adequately in place. Promises are good. Trust is earned through sustained reliability.
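None of this requires exotic tooling. The skeleton of a staged rollout with health validation and automatic rollback fits in a few dozen lines; the deploy, rollback, and error-rate functions below are hypothetical stand-ins for whatever config-push and telemetry APIs a real fleet exposes:

```python
import random
import time

CANARY_STAGES = [0.01, 0.05, 0.25, 1.0]  # fraction of the fleet at each stage
ERROR_BUDGET = 0.02                      # abort if the 5xx rate exceeds 2%
BAKE_SECONDS = 1                         # shortened for the sketch

def deploy_to(fraction, version):
    # Stand-in for pushing a configuration version to a slice of the fleet.
    print(f"deploying {version} to {fraction:.0%} of hosts")

def rollback(version):
    # Stand-in for restoring the last known-good configuration everywhere.
    print(f"rolling back to {version}")

def observed_error_rate():
    # Stand-in for querying real telemetry (5xx rate on the canary slice).
    return random.uniform(0.0, 0.03)

def rollout(new_version, known_good):
    for fraction in CANARY_STAGES:
        deploy_to(fraction, new_version)
        time.sleep(BAKE_SECONDS)         # let health metrics accumulate
        rate = observed_error_rate()
        if rate > ERROR_BUDGET:
            print(f"error rate {rate:.1%} over budget at {fraction:.0%}; aborting")
            rollback(known_good)
            return False
    print(f"{new_version} is fully deployed")
    return True

rollout("waf-buffer-1mb", known_good="waf-buffer-128kb")
```

The hard part at Cloudflare’s scale is the telemetry and the blast-radius carving, not the control loop itself; that is exactly why the absence of these gates is so hard to excuse.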
What Developers Should Do
Cloudflare’s recovery time has improved: 25 minutes versus 3-6 hours is progress. They’re transparent about failures and publish detailed incident reports. But fast recovery doesn’t excuse recurring failures.
If your business depends on Cloudflare for mission-critical infrastructure, it’s time to consider multi-CDN strategies. Route traffic across multiple providers. Build fallback mechanisms. Don’t trust any single provider with your entire online presence, no matter how dominant they are.
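At the application level, the simplest version of that advice is: try the next provider when the first one fails. Here is a minimal sketch using only Python’s standard library, with hypothetical hostnames standing in for the same asset published through two different CDNs:

```python
import urllib.error
import urllib.request

# Hypothetical endpoints: the same asset published through two different CDNs.
ENDPOINTS = [
    "https://cdn-primary.example.com/app.js",
    "https://cdn-secondary.example.net/app.js",
]

def fetch_with_fallback(urls, timeout=3):
    last_error = None
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()       # success on the first healthy provider
        except urllib.error.HTTPError as exc:
            if exc.code < 500:
                raise                    # a 4xx means our request is wrong; another CDN won't fix it
            last_error = exc             # a 5xx means the provider is unhealthy; try the next one
        except (urllib.error.URLError, TimeoutError) as exc:
            last_error = exc             # DNS, connection, or timeout failure; try the next one
    raise RuntimeError("all CDN endpoints failed") from last_error

# content = fetch_with_fallback(ENDPOINTS)
```

Most production multi-CDN setups push this a layer down, steering traffic with health-checked DNS or a load balancer sitting in front of both providers, but the failure-handling logic is the same shape.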
The internet’s infrastructure shouldn’t depend on whether a Cloudflare engineer properly tested a Lua configuration change. We can do better than this.