Cloudflare glitch causes global internet outage — here’s the non-techie explanation



Cloudflare has revealed the cause of a major global outage on November 18 that disrupted access to thousands of websites and online services.

The company said the outage was not the result of a cyberattack and instead was triggered by an internal technical mistake.

According to Cloudflare, the problem started just after 11:20 UTC when a routine change to one of its database systems accidentally created a faulty configuration file. That file is used by Cloudflare’s bot-detection system, which helps the company identify and block malicious automated traffic.

Because of the error, the file suddenly doubled in size. When it was automatically sent across Cloudflare’s global network, the software on many of the company’s servers couldn’t read the oversized file and began to fail. This caused websites using Cloudflare to show HTTP 500 errors, a generic code meaning the server hit an internal error and couldn’t process the request.
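For readers who want to see the idea in code, here is a minimal Python sketch of how a size check on a configuration file can behave. It is an illustration only, not Cloudflare's actual software: the limit of 200 entries, the names `load_config` and `load_or_keep_previous`, and the sample data are all hypothetical. It also shows one common safeguard, which is to keep using the last working file when a new one fails the check.

```python
MAX_ENTRIES = 200  # hypothetical limit, chosen only for illustration


class ConfigTooLargeError(Exception):
    """Raised when a configuration file has more entries than the loader allows."""


def load_config(entries, max_entries=MAX_ENTRIES):
    """Return a copy of the entries if they fit within the limit, otherwise raise."""
    if len(entries) > max_entries:
        raise ConfigTooLargeError(
            f"{len(entries)} entries exceeds the limit of {max_entries}"
        )
    return list(entries)


def load_or_keep_previous(new_entries, previous_config):
    """Try the new file; if it is rejected, keep the last working one."""
    try:
        return load_config(new_entries)
    except ConfigTooLargeError:
        return previous_config


good_file = [f"rule_{i}" for i in range(120)]
bad_file = good_file * 2  # models the file doubling in size (240 entries)

active = load_or_keep_previous(good_file, previous_config=[])
active = load_or_keep_previous(bad_file, previous_config=active)
print(len(active))  # 120: the oversized file was rejected, the old one kept
```

Without the fallback in `load_or_keep_previous`, the error raised by `load_config` would go unhandled and the program would stop, which is the kind of failure the article describes.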

Initially, engineers suspected a massive DDoS attack because the network kept failing and then recovering every few minutes. This happened because the system alternated between receiving “good” and “bad” versions of the configuration file as different database servers updated at different times.
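The on-and-off pattern can be pictured with a small Python sketch. It assumes, purely for illustration, that the file is regenerated by one database server at a time in a fixed rotation, and that a server produces the faulty file only once it has received the change. The function names and the four-server setup are hypothetical and are not taken from Cloudflare's systems.

```python
def generate_file(node_is_updated):
    """Return which version of the file a database server would produce."""
    return "bad" if node_is_updated else "good"


def simulate(nodes_updated, runs):
    """Cycle through the servers, one per run, and record each file version."""
    results = []
    for run in range(runs):
        node = run % len(nodes_updated)  # assumed fixed rotation
        results.append(generate_file(nodes_updated[node]))
    return results


# Partway through the rollout: two of four hypothetical servers are updated.
print(simulate([True, False, True, False], runs=6))
# ['bad', 'good', 'bad', 'good', 'bad', 'good']

# Once every server is updated, every run yields the faulty file.
print(simulate([True, True, True, True], runs=6))
# ['bad', 'bad', 'bad', 'bad', 'bad', 'bad']
```

From the outside, the first pattern looks like a network that keeps failing and recovering, which is easy to mistake for an attack arriving in waves.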

By early afternoon, every database server was producing the faulty file, so the network stopped recovering between failures and the outage became widespread.

Cloudflare engineers eventually traced the problem to the bot-detection file, stopped it from updating, and rolled back to an earlier, working version. Traffic began returning to normal around 14:30 UTC, and all services were fully restored by 17:06 UTC.

Several Cloudflare products were affected during the incident, including its core content delivery network, security tools, login system, and Workers KV storage service. Many users were unable to log in to Cloudflare’s dashboard during the outage.

Cloudflare CEO Matthew Prince apologised for the disruption, calling it the company’s most serious outage since 2019. He said Cloudflare is already taking steps to prevent a similar event in the future, including tightening internal safety checks and improving how its systems handle errors.

IOL


