The Day the Bots Won (Kinda): Analyzing the Cloudflare Nov 18 Outage
Jack Beaman

The "Immune System" of the Internet
Before we dissect the digital corpse of yesterday's outage, let's pour one out for the sheer scale of Cloudflare. Founded in 2009 by Matthew Prince, Lee Holloway, and Michelle Zatlyn, Cloudflare started with a simple mission: to help build a better internet. Initially a "Project Honey Pot" offshoot to track spam, it evolved into a global behemoth providing security, performance, and reliability for everything from your local bakery's landing page to massive enterprises like OpenAI and Shopify.
Today, Cloudflare sits in front of approximately 20% of the entire web. They are the bouncers of the internet club, checking IDs (SSL certificates), tossing out drunks (DDoS attacks), and making sure the VIPs (verified bots) get through the velvet rope quickly.
But as we learned on November 18, 2025, when the bouncer trips, nobody gets into the club.
The Outage: November 18, 2025
At 11:20 UTC, the internet held its collective breath. Major platforms like X (formerly Twitter), ChatGPT, and Canva started throwing HTTP 500 errors. For IT teams globally, it was the "Is it me, or is it them?" moment. Spoiler: It was them.
Cloudflare confirmed the incident was their "worst outage since 2019," disrupting the majority of core traffic flowing through their network.
"We are sorry for the impact to our customers and to the Internet in general... We know we let you down today." — Cloudflare Engineering Team
The Culprit: A Bot Management "Feature File"
For once, it wasn’t DNS. The irony is palpable. The outage wasn't caused by a malicious cyber-attack, but by the very system designed to stop them: Bot Management.
Here is the technical sequence of events, or "How to Break the Internet in Three Steps":
- The Database Change: Engineers made a change to permissions on a ClickHouse database cluster to improve security. Because the query that builds the "feature file" for the Bot Management machine learning model didn't filter by database name, it began seeing newly visible tables and returned duplicate rows.
- The Bloat: This feature file, which lists traits used to identify bots, doubled in size.
- The Panic: The software running on Cloudflare's core proxy (specifically the new FL2 engine written in Rust) had a hard-coded limit on the size of this file. When the file exceeded the limit, the code hit a Result::unwrap() on an error state. In Rust terms, the thread said, "I can't handle this," and panicked, crashing the entire process (see the sketch below).
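To make the failure mode concrete, here is a minimal Rust sketch of that pattern, under stated assumptions: the constant name, limit value, file format, and function names are invented for illustration and are not Cloudflare's actual FL2 code. The point is the shape of the bug: a hard-coded capacity check that returns a Result, and an unwrap() that converts the Err into a process-killing panic.

```rust
// Illustrative sketch only; names, limit, and file format are hypothetical.
const MAX_FEATURES: usize = 200; // hard-coded capacity for the ML feature set

struct FeatureFile {
    features: Vec<String>,
}

fn parse_feature_file(raw: &str) -> Result<FeatureFile, String> {
    // One feature name per line; duplicates are not filtered out here,
    // so a file with doubled rows blows straight past the limit.
    let features: Vec<String> = raw.lines().map(|s| s.to_string()).collect();
    if features.len() > MAX_FEATURES {
        return Err(format!(
            "feature count {} exceeds limit {}",
            features.len(),
            MAX_FEATURES
        ));
    }
    Ok(FeatureFile { features })
}

fn main() {
    // Simulate the bloated file: roughly 60 real features duplicated several times over.
    let raw = "some_feature\n".repeat(240);

    // The fatal line: unwrap() on an Err panics and takes the whole process
    // (and, in a proxy's case, the traffic it serves) down with it.
    let file = parse_feature_file(&raw).unwrap();
    println!("loaded {} features", file.features.len());
}
```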
Why "Unwrap" is a Four-Letter Word
For the developers in the room, this is a classic lesson in defensive coding. The specific failure happened because the code essentially said: "Trust me, this value will always be valid." When the duplicate database rows pushed the feature count over the limit (from ~60 to over 200), the validity check failed.
Because the code used an unwrap() on an error rather than gracefully handling it, the crash was immediate and catastrophic. It’s the software equivalent of fainting because your sandwich has too many pickles.
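Continuing that sketch (again hypothetical, not Cloudflare's actual fix), the defensive version handles the Err at the call site instead of unwrapping: reject the bad file, keep the last known-good configuration, and surface the error for humans to look at.

```rust
// Hypothetical defensive handling of the same load path (illustrative only):
// on a bad feature file, log the problem and keep the previous configuration
// instead of panicking.
fn reload_features(raw: &str, current: FeatureFile) -> FeatureFile {
    match parse_feature_file(raw) {
        Ok(new_file) => new_file,
        Err(e) => {
            // In a real proxy this would be a metric and an alert, not stderr.
            eprintln!("rejecting bad feature file, keeping previous config: {}", e);
            current
        }
    }
}
```

The same effect can be had with combinators like unwrap_or_else; either way, the intent is that a bad config update should degrade one subsystem, not crash the process.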
The Reliance on Bot Management Systems
This incident highlights a critical fragility in modern web architecture. We rely heavily on automated Bot Management Systems (BMS) to filter traffic. These systems are complex, relying on real-time data, heuristics, and machine learning models.
When these systems are not properly isolated, a configuration error doesn't just stop bot protection; it stops all traffic. A minor oversight becomes a global outage. At BeamanDevelopment, we always emphasize that your WAF and bot configuration should fail open (allow traffic) rather than fail closed (block everyone) during internal system errors, whenever possible.
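To illustrate what "fail open" can look like in code, here is a rough Rust sketch under the same caveat: the types and functions are hypothetical stand-ins for whatever your proxy or WAF layer actually does. If the bot-scoring path errors out internally, the request is passed through unscored instead of being answered with a 5xx.

```rust
// Hypothetical fail-open policy for a bot-management check (illustrative only).
enum Verdict {
    Allow,
    Block,
}

fn score_request(_path: &str) -> Result<Verdict, String> {
    // Stand-in for the real scoring path, which can fail for purely
    // internal reasons (bad config, missing model, oversized feature file).
    Err("bot model unavailable: feature file failed validation".to_string())
}

fn handle_request(path: &str) -> Verdict {
    match score_request(path) {
        Ok(verdict) => verdict,
        Err(e) => {
            // Fail open: an internal bot-management error should degrade
            // protection, not availability. Log it and let the traffic through.
            eprintln!("bot scoring failed, failing open: {}", e);
            Verdict::Allow
        }
    }
}

fn main() {
    match handle_request("/login") {
        Verdict::Allow => println!("request forwarded to origin"),
        Verdict::Block => println!("request blocked with a challenge"),
    }
}
```

Whether failing open is actually the right policy depends on the endpoint; for a login route under active credential stuffing, failing closed may be the lesser evil, which is why the "whenever possible" caveat matters.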
And on the root-cause side: put a damned WHERE database = ‘default’ in the SQL query, so the metadata lookup can't pick up duplicate rows from newly visible databases.
Key Statistics:
- Duration: ~3 hours of major impact (11:20 to 14:30 UTC), with full resolution taking roughly 6 hours
- Impact: "Significant failures to deliver core network traffic" globally.
- Root Cause: Internal ClickHouse database query returning duplicates, .unwrap() called on Err, proxy crashed, interwebs smoked.
Conclusion
Cloudflare had the right intentions: the change was a genuine security improvement, making database permissions more explicit. It’s a shame that such a minor oversight had such far-reaching consequences. We applaud the CEO for the quick and decisive mea culpa.
The internet is resilient, but it is also incredibly centralized. When a provider like Cloudflare sneezes (or in this case, panics over a config file), the world catches a cold.
For our clients, this serves as a reminder: Monitor your dependencies. While you can't control Cloudflare's deployment schedule, you can ensure your own error handling doesn't compound the issue. And maybe, just maybe, double-check your database queries before running them in production.