1. Cloudflare is in the business of being a lightning rod for large and targeted DoS attacks. In most cases, an incident really is an attack.
2. Attacks that make it through the usual defences push servers past their breaking point, causing all kinds of novel and unexpected errors.
Additionally, attackers try to hit endpoints/features that amplify the severity of their attack by being computationally expensive, holding a lock, or triggering an error path that restarts a service, as happened here.
This happened in the middle of scheduled maintenance, with all requests failing at a single point: a .unwrap().
There should be internal visibility into the fact that a large number of requests are failing at the same line of code, and attention should be focused there immediately, imo.
Or at the very least, it shouldn't take 4 hours for anyone to even consider that it wasn't an attack.
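As a rough illustration of that visibility point, here is a minimal Rust sketch, assuming a hypothetical setup where a process-wide panic hook records the file and line of every panic so a sudden spike of panics at one location (say, a failing .unwrap()) stands out in telemetry. None of this is Cloudflare's actual tooling; the helper name and the eprintln-based reporting are purely illustrative.

```rust
use std::panic;

// Hypothetical helper: install a panic hook that records where each panic
// happened. A real system would increment a metric tagged with file:line
// instead of printing, so an alert can fire when one location dominates.
fn install_panic_telemetry() {
    let default_hook = panic::take_hook();
    panic::set_hook(Box::new(move |info| {
        if let Some(loc) = info.location() {
            eprintln!("panic at {}:{}:{}", loc.file(), loc.line(), loc.column());
        }
        // Preserve the default behaviour (message, backtrace, etc.).
        default_hook(info);
    }));
}

fn main() {
    install_panic_telemetry();

    // A value that unexpectedly turns out to be absent, mirroring the kind of
    // .unwrap() failure described above: every request hitting this path would
    // report the exact same file and line.
    let feature_config: Option<&str> = None;
    let _ = feature_config.unwrap();
}
```

The specific mechanism doesn't matter; the point is that when one line of code accounts for nearly all failures, that signal should reach whoever is triaging the incident within minutes, not hours.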
In situations like this, where your entire infra is fucked, you should have multiple crisis teams working in parallel under different assumptions.
If even one additional team had been working under the assumption that it was an infra issue rather than an attack, this could have been resolved hours earlier.
For a product as vital to the internet as Cloudflare, not having this kind of crisis management is unacceptable.