That is also true at Cloudflare, for what it’s worth. However, the company is so big, and there are so many different products all shipping at the same time, that it can be hard to correlate a problem with your release, especially since there’s a 5 min lag (if I recall correctly) in the monitoring dashboards while they collect telemetry from thousands of servers worldwide.
Comparing the difficulty of running the world’s internet traffic with hundreds of customer products with your fintech experience is like saying “I can lift 10 pounds. I don’t know why these guys are struggling to lift 500 pounds”.
The fintech company I worked at does handle millions of QPS and has thousands of servers. It is on the same order of magnitude, or at least 0.1x the scale, not to mention the complexity of business logic involving monetary transactions.
If there’s indeed a 5 min lag in the monitoring dashboards at Cloudflare, I honestly think that's a pretty big concern.
For example, a simple curl script that hits your top 100 customers' homepages every 30 seconds would have raised a warning and sent notifications within a minute. If you stagger deployments at 5 minute intervals, you could have identified the issue and initiated the rollback within 2 minutes and completed it within 3 minutes.
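To make that concrete, here is a rough Python stand-in for the curl script. It is only a sketch under my own assumptions: the function names, the 20% failure threshold, and the "two bad rounds in a row" rule are all made up for illustration and are not anything Cloudflare or my former employer actually runs.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Sequence


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers successfully within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except Exception:
        # 4xx/5xx responses, timeouts, DNS and connection errors all count as failures.
        return False


def failure_ratio(urls: Iterable[str],
                  check: Callable[[str], bool] = http_ok) -> float:
    """Probe every URL concurrently and return the fraction that failed."""
    url_list = list(urls)
    if not url_list:
        return 0.0
    with ThreadPoolExecutor(max_workers=20) as pool:
        results = list(pool.map(check, url_list))
    return results.count(False) / len(url_list)


def should_alert(recent_ratios: Sequence[float],
                 threshold: float = 0.2,      # hypothetical: 20% of homepages failing
                 consecutive: int = 2) -> bool:
    """Alert only when the last `consecutive` rounds all reach the threshold."""
    tail = list(recent_ratios)[-consecutive:]
    return len(tail) == consecutive and all(r >= threshold for r in tail)
```

You would call `failure_ratio` on the homepage list every 30 seconds and append the result to a history list. With two consecutive bad rounds required, `should_alert` fires roughly a minute after the breakage starts, which is where the "within a minute" figure comes from.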
> However, the company is so big, and there are so many different products all shipping at the same time, that it can be hard to correlate a problem with your release
This kind of thing would be more understandable for a company without hundreds of billions of dollars, and for one that hasn't centralized so much of the internet. If a company has grown too large and complex to be well managed and effective, and it's starting to look like a liability for large numbers of people, there are obvious solutions for that.
Given how well-established Cloudflare is, I would've figured they'd be profitable by now.
That raises the question: why does so much of the web rely on a company which does not have the means to sustain itself?
That was admittedly hyperbole, but since we're talking about a company with assets and revenue in the billions I'm not sure it matters. The fact remains that a lack of money/resources is not their problem.
They don't have unlimited resources. They have ~5000 employees. That's not small but it's not huge either. For sake of comparison, Google hit that headcount level literally 20 years ago.
They have enough money to buy anything they need. The CEO alone has billions. He could pay for as many employees as he wants out of his own pocket and not notice. In fact he's good at buying people, even senators.
That doesn't make sense. It would be like saying Twitter, SpaceX, and Tesla all should be incapable of engineering mistakes because their owner is rich. The world doesn't work that way.
Genuinely curious: how would you actually implement detection systems for large-scale global infrastructure that work within a < 1 minute SLO, assuming cost is no constraint?
Right now I'd say maybe don't push changes to your entire global infra all at once, and certainly not without testing your change first to make sure it doesn't break anything. But it's really not about a specific failure/fix as much as it is about a single company getting too big to do the job well, or just plain doing more than it should in the first place.
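For what "not all at once" could look like, here is a minimal Python sketch of a health-gated staged rollout. Everything in it is hypothetical: the stage names, the `deploy`, `is_healthy`, and `rollback` callbacks, and the 5 minute soak time are placeholders I picked, not a description of anyone's real deployment pipeline.

```python
import time
from typing import Callable, List, Sequence


def staged_rollout(stages: Sequence[str],
                   deploy: Callable[[str], None],
                   is_healthy: Callable[[str], bool],
                   rollback: Callable[[str], None],
                   soak_seconds: float = 300.0,   # hypothetical 5 minute soak per stage
                   sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Deploy one stage at a time and stop at the first unhealthy stage.

    Returns the stages left deployed: all of them on success, or an empty
    list if a health check failed and every touched stage was rolled back.
    """
    deployed: List[str] = []
    for stage in stages:
        deploy(stage)
        deployed.append(stage)
        sleep(soak_seconds)             # let monitoring catch up before judging
        if not is_healthy(stage):
            # Undo in reverse order so the most recent stage is reverted first.
            for touched in reversed(deployed):
                rollback(touched)
            return []
    return deployed
```

Something like `staged_rollout(["canary", "one-region", "global"], ...)` means a bad change only ever reaches the stages before the failing health check, and the later stages never see it.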
Honestly we shouldn't have created a system where any single company's failure is able to impact such a huge percentage of the network. The internet was designed for resilience and we abandoned that ideal to put our trust in a single company that maybe isn't up for the job. Maybe no one company ever could do it well enough, but I suspect that no single company should carry that responsibility in the first place.
But then would a customer have to use 10 different vendors to get the same things that Cloudflare currently provides? E.g. protection against various threats online?
Can you name a major cloud provider that doesn’t have major outages?
If this were purely a money problem it would have been solved ages ago. It’s a difficult problem to solve. Also, they’re the youngest of the major cloud providers and have a fraction of the resources that Google, Amazon, and Microsoft have.
> Can you name a major cloud provider that doesn’t have major outages?
The fact that no major cloud provider is actually good is not an argument that Cloudflare isn't bad, or even that they couldn't/shouldn't do better than they are. They have fewer resources than Google or Microsoft, but they're also in a unique position that makes us differently vulnerable when they fuck up. It's not all their fault, since it was a mistake to centralize the internet to the extent that we have in the first place, but now that they are responsible for so much they have to expect that people will be upset when they fail.
Every major cloud provider (including Cloudflare) is orders of magnitude better at keeping 9s of availability worldwide for thousands of customers than those customers are individually. The very best of those customers might be better and only rely on cloud providers for the scaling or huge amounts of infrastructure they don’t otherwise want to own, but the vast majority are actually less capable at accomplishing whatever uptime the providers already get.
Could Cloudflare do better? Sure, that’s a truism for everyone. Did they make mistakes and continue to make mistakes? Also a truism.
Trust me, they are acutely aware of people getting upset when they fail. Why do you think their CEO and CTO are writing these blog posts?
With all due respect, engineers in finance can’t allow for outages like this because then you are losing massive amounts of money and potentially going out of business.