It would be a good thing, if it would cause anything to change. It obviously won't. As if a single person reading this post wasn't aware that the Internet is centralized, and couldn't name a few specific sources of centralization (Cloudflare, AWS, Gmail, GitHub). As if this were the first time it's happened. As if after the last time AWS failed (or the one before that, or the one before that…) anybody stopped using AWS. As if anybody could viably stop using them.
If anything, centralisation shields companies using a hyperscaler from criticism. You’ll see downtime no matter where you host. If you self host and go down for a few hours, customers blame you. If you host on AWS and “the internet goes down”, customers treat it as an act of God, a natural disaster that affects everyone.
It’s not great being down for hours, but that will happen regardless. Most companies prefer the option that helps them avoid the ire of their customers.
Where it’s a bigger problem is when a critical industry, like retail banking in a country, all chooses AWS. When AWS goes down, all citizens lose access to their money. They can’t pay for groceries or transport; they’re stranded and starving, and life grinds to a halt. But even then, this is not the bank’s problem, because they’re not doing worse than their competitors. It’s something for the banking regulator and government to worry about. I’m not saying the bank shouldn’t worry about it; I’m saying in practice they don’t worry about it unless the regulator makes them.
I completely empathise with people frustrated with this status quo. It’s not great that we’ve normalised a few large outages a year. But for most companies, this is the rational thing to do. And barring a few critical industries like banking, it’s also rational for governments to not intervene.
> If anything, centralisation shields companies using a hyperscaler from criticism. You’ll see downtime no matter where you host. If you self host and go down for a few hours, customers blame you.
Not just customers. Your management take the same view. Using hyperscalers is great CYA. The same goes for any replacement of internally provided services with external ones from big names.
Exactly. No one got fired for using AWS. Advocating for self-hosting or a smaller provider means you get blamed when the inevitable downtime comes around.
If you cannot give a patient life-saving dialysis because you don't have a backup generator, then you are likely facing some liability. If you cannot give a patient life-saving dialysis because your scheduling software is down due to a major outage at a third party and you have no local redundancy, then you are in a similar situation. Obviously this depends on your jurisdiction, and we are probably in different ones, but I feel confident that you want to live somewhere a hospital is held reasonably responsible for such foreseeable disasters.
Yeah, I mentioned banking because it's what I was familiar with, but the medical industry is going to be similar.
But they do differ - it’s never ok for a hospital to be unable to dispense care. But it is somewhat ok for one bank to be down. We just assume that people have at least two bank accounts. The problem the banking regulator faces is that when AWS goes down, all banks go down simultaneously. Not terrible for any individual bank, but catastrophic for the country.
And now you see what a juicy target an AWS DC is for an adversary. They go down on their own as it is, but surely Russia and others are looking at this and thinking “damn, one missile at the right data center and life in this country grinds to a halt”.
> If anything, centralisation shields companies using a hyperscaler from criticism. You’ll see downtime no matter where you host. If you self host and go down for a few hours, customers blame you.
What if you host on AWS and only you go down? How does hosting on AWS shield you from criticism?
This discussion is assuming that the outage is entirely out of your control because the underlying datacenter you relied on went down.
Outages because of bad code do happen and the criticism is fully on the company. They can be mitigated by better testing and quick rollbacks, which is good. But outages at the datacenter level - nothing you can do about that. You just wait until the datacenter is fixed.
This discussion started because companies are actually fine with this state of affairs. They are risking major outages, but so are all their competitors, so in practice it’s fine. The juice isn’t worth the squeeze for them, unless an external entity like the banking regulator makes them care.
I’m pretty Cloudflare-centric. I didn’t start that way. I had services spread out for redundancy. It was a huge pain. Then bots got even more aggressive than usual. I asked why I kept doing this to myself and finally decided my time was worth recapturing.
Did everything become inaccessible during the last outage? Yep. Weighed against the time it saves me throughout the year, I call it a wash. No plans to move.
I'm of a similar mindset... yeah, it's inconvenient when "everything" goes down... but realistically so many things go down now and then, it just happens.
Could just as easily be my home's internet connection, or a service I need from/at work, etc. It's always going to be something, it's just more noticeable when it affects so many other things.
To be honest, it's MUCH easier to have one source to blame when things go down. If a small-to-medium vendor's website goes down on a normal day, some poor IT guy is going to be fielding calls all day.
If that same vendor goes down because Cloudflare went down, oh well. Most people already know and won't bother to ask when your site will be back up.
> It would be a good thing, if it would cause anything to change. It obviously won't.
I agree wholeheartedly. The only change is internal to these organizations (e.g. Cloudflare, AWS). Improvements will be made to the relevant systems, and some teams will also audit for similar behavior, add tests, and fix some bugs.
However, nothing external will change. The cycle of pretending you are going to implement multi-region fades after a week, and each company goes on leveraging all these services to the nth degree, waiting for the next outage.
I'm not advocating that organizations should or could do much; it's all pros and cons. But the collective blast radius is still impressive.
The root cause is customers refusing to punish this downtime.
Check out how hard customers punish blackouts from the grid, both via wallet and via voting/gov't. It's why grids are now more reliable.
So unless the backbone infrastructure gets the same flak, nothing is going to change. After all, any change is expensive, and the cost of that change needs to be worth it.
I think you’re viewing the issue from an office worker’s perspective. For us, downtime might just mean heading to the coffee machine and taking a break.
But if a restaurant loses access to its POS system (which has happened), or you’re unable to purchase a train ticket, the consequences are very real. Outages like these have tangible impacts on everyday life. That’s why there’s definitely room for competitors who can offer reliable backup strategies to keep services running.
I'm talking more about some unrelated function taking down the whole system, not advocating for "offline" credit card transactions (is that even a thing these days?). For example: if the transaction needs to be logged somewhere, it can be built to sync whenever possible rather than blocking all transactions when the central service is down.
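A minimal sketch of that store-and-forward idea, purely to show the shape of it (the queue file, endpoint URL, and function names here are all made up):

```python
# Store-and-forward logging: record each transaction locally first,
# then opportunistically sync to the central service.
# The queue file and endpoint below are illustrative, not real.
import json
import pathlib
import urllib.error
import urllib.request

QUEUE = pathlib.Path("pending_transactions.jsonl")  # local append-only queue
CENTRAL = "https://example.invalid/transactions"    # hypothetical central endpoint

def record_transaction(tx: dict) -> None:
    """Always succeeds locally; never blocks on the central service."""
    with QUEUE.open("a") as f:
        f.write(json.dumps(tx) + "\n")

def sync_pending(timeout: float = 5.0) -> None:
    """Best-effort flush; whatever fails stays queued for the next attempt."""
    if not QUEUE.exists():
        return
    pending = [json.loads(line) for line in QUEUE.read_text().splitlines() if line]
    still_pending = []
    for tx in pending:
        req = urllib.request.Request(
            CENTRAL,
            data=json.dumps(tx).encode(),
            headers={"Content-Type": "application/json"},
        )
        try:
            urllib.request.urlopen(req, timeout=timeout)
        except (urllib.error.URLError, OSError):
            still_pending.append(tx)  # central service unreachable: keep it local
    QUEUE.write_text("".join(json.dumps(tx) + "\n" for tx in still_pending))

record_transaction({"id": 42, "amount_cents": 1250})
sync_pending()  # run periodically; harmless when the central service is down
```

The point is just that the sale goes through and gets recorded either way; only the reporting lags while the central service is out.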
Payment processor being down is payment processor being down.
Do any of those competitors actually have meaningfully better uptime?
At a societal level, having everything shut down at once is an issue. But if you only have one POS system targeting only one backend URL (and that backend has to be online for the POS to work), then Cloudflare seems like one of the best choices.
If the uptime provided by Cloudflare isn't enough, then the solution isn't a Cloudflare competitor; it's the ability to operate offline (which many POS systems have, including for card purchases), or at least multiple backends with different DNS, CDN, server locations, etc.
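For the multiple-backends idea, the client-side part can be as simple as trying a list of independently hosted base URLs in order. The URLs and names below are invented for illustration (a real POS client would also want retries, backoff, and idempotency keys):

```python
# Client-side failover across independently hosted backends.
# Each base URL would sit behind different DNS/CDN/hosting, so one
# provider outage doesn't take out every path at once.
# The URLs are placeholders, not real services.
import urllib.error
import urllib.request

BACKENDS = [
    "https://api-primary.example.invalid",
    "https://api-backup-eu.example.invalid",
    "https://api-backup-us.example.invalid",
]

def fetch_with_failover(path: str, timeout: float = 3.0) -> bytes:
    """Try each backend in order; raise only if every one of them fails."""
    last_error: Exception | None = None
    for base in BACKENDS:
        try:
            with urllib.request.urlopen(base + path, timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.URLError, OSError) as exc:
            last_error = exc  # this backend is unreachable; try the next one
    raise RuntimeError("all backends unreachable") from last_error
```

The loop is the easy part; the hard part is keeping the backends genuinely independent (different DNS, different CDN, different hosting) so they don't all share the same failure.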
If it’s that easy to get the exact same service/product from another vendor, then maybe your competitive advantage isn’t so high. If Amazon were down I’d just wait a few hours, as I don’t want to sign up on another site.
I agree. These days it seems like everything is a micro-optimization to squeeze out a little extra revenue. Eventually most companies lose sight of the need to offer a compelling product that people would be willing to wait for.
I remember a Google Cloud outage years ago that happened to coincide with one of our customers' massively expensive TV ads. All the people who normally would've gone straight to their website instead got a 502. Probably a 1M+ loss for them, all things considered.
You need to punish the services you "paid" to use but that had downtime. So did you terminate any of those services over the downtime, or impose any sort of penalty on them as a result?
> Check out how hard customers punish blackouts from the grid, both via wallet and via voting/gov't.
What? Since when has anyone ever been free to just up and stop paying for power from the grid? Are you going to pay $10,000 - $100,000 to have another power company install lines? Do you even have another power company in the area? State? Country? Do you even have permission for that to happen near your building? Any building?
The same is true for internet service, although personally I'd gladly pay $10,000 - $100,000 to have literally anything else at my location; but there are no other proper wired providers, and I'll die before I ever install any sort of cellular router. Also, this is a rented apartment, so I'm fucked even if there were competition, although I plan to buy a house in a year or two.
Downtimes happen one way or another. The upside of using Cloudflare is that bringing things back online is their problem and not mine, unlike when I self-host. :]
Their infrastructure went down for a pretty good reason (let the one who has never caused that kind of error cast the first stone) and was brought back within a reasonable time.
Same idea with the CrowdStrike bug: it seems like it didn't have much of an effect on their customers, certainly not on my company at least, and the stock quickly recovered, in fact doing very well. For me, it looks like nothing changed, no lessons learned.
That's true of a lot of "Enterprise" software. Microsoft enjoys success from abusing its enterprise customers on what seems like a daily basis at this point.
For bigger firms, the reality is that it would probably cost more to switch EDR vendors than the outage itself cost them, and up to that point, CrowdStrike was the industry standard and enjoyed a really good track record and reputation.
Depending on the business, there are long-term contracts and early-termination fees; there's the need to run your new solution alongside the old one during migration; and there are probably years of telemetry and incident data that you need to keep on the old platform, so even if you switch, you're still paying for CrowdStrike for the retention period. It was one (major) issue over 10+ years.
Just like with Cloudflare, the switching costs are higher than the outage cost, unless there were major outages of that scale multiple times per year.
That IS the lesson! There are a million questions I can ask myself about those incidents. What dictates that they can't ever screw up? Sure, it was a big screw-up, but understanding the tolerances for screw-ups is important to understanding how fast and loose you can play it. AWS has at least one big outage a year; what's the breaking point? Risk and reward, etc.
I've worked at places where every little thing was yak-shaved, and at places where no one was even sure whether the servers were up during working hours. Both jobs paid well, and both had enough happy customers.
Not that I doubt examples exist (I've yet to be at a large place with zero failures in responding to such issues over the years), but if you're going to bother commenting about it, it would be nice to share the specific examples you have in mind. That helps people understand how much of this is a systemic problem worth taking an interest in, versus a comment that more easily falls into many other buckets. I'd try to build trust from your user profile as well, but it proclaims you're shadowbanned for two different reasons, despite me seeing your comment.
To be fair, AWS (and GCP and Azure) at least is easy to replace with something else, and pretty much all the alternatives are cheaper, less messy, etc. There are very few situations where you cannot viably do so.
We live in a world where you can get things like dedicated servers within similar time spans as creating a "compute engine" node on a big cloud provider.
The fact that cloud services imposed serious constraints on what applications were able to do (things like state management, passing configuration in more unified ways, etc.) means that running your own infrastructure is easier than ever, since your devs won't end up whining at you for something super custom just to make some project a bit easier. But if you really want to, you can.
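To make the "configuration in more unified ways" point concrete: an app that takes everything from environment variables (a habit the cloud platforms pushed) runs the same on a managed node, a dedicated server, or a laptop. A rough sketch, with made-up variable names and defaults:

```python
# Twelve-factor-style configuration: everything comes from the environment,
# so the same build runs anywhere without code changes.
# The variable names and defaults below are illustrative only.
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    database_url: str
    cache_url: str
    listen_port: int

def load_config() -> Config:
    return Config(
        database_url=os.environ.get("DATABASE_URL", "postgres://localhost/app"),
        cache_url=os.environ.get("CACHE_URL", "redis://localhost:6379/0"),
        listen_port=int(os.environ.get("PORT", "8080")),
    )

print(load_config())  # moving hosts means setting different env vars, not rewriting code
```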
GitHub has also become easy to get away from, and indeed many individuals and companies have done so.
CDNs are the bigger thing, but A) there are a lot of other CDNs, and B) having an image, or let's say an Ansible config, lets you quickly deploy something that might be close enough for your use case. Just take any hosting company, or even a dozen around the world.
Of course, if you allowed yourself to end up in complete vendor lock-in, things might be different, but if you think it's a good idea to be completely dependent on the whims of some other company, maybe you deserve that state. In other words, don't run a business without having any kind of fallback for the decisions you make. Yes, profit from the big benefit something might give you, but don't lock the door behind you.
Sure, you might get lucky, and maybe you're fine riding on luck while it lasts. Just don't be surprised when it all shatters.
It is as easy to not use them as it ever was. There has been no actual centralisation. Everything is done using open protocols. I don't know what more you could want.
Compare it to Windows, where there is deep volume discounting and salespeople schmoozing CTOs and getting in with schools, healthcare providers, etc. That's actual lock-in.
These outages are too few and far between. It would force some changes if it were a monthly event: if businesses started losing connectivity for 8 hours every month, maybe the bigger ones would run for self-hosting, or at least some capacity for self-hosting.
Here's where we separate the men from the boys, the women from the girls, the Enbys from the enbetts, and the SREs from the DevOps. If you went down when Cloudflare went down, do you go multicloud so that can't happen again, or do you shrug your shoulders and say "well, everyone else is down"? Have some pride in your work, do better, be better, and strive for greatness. Have backup plans for your backup plans, and get out of the pit of mediocrity.
Or not; shit's expensive, Kubernetes is too complicated, and "no one" needs that.
Same with the big CrowdStrike fail of 2024, especially when everyone kept repeating the laughable claim that these guys have their shit in order, so it couldn't possibly be a simple fuckup on their end. Guess what: they don't, and it was. And nobody has realized the importance of diversity for resilience, so all the major stuff is still running on Windows and using CrowdStrike.
I wrote https://johannes.truschnigg.info/writing/2024-07-impending_g... in response to the CrowdStrike fallout, and was tempted to repost it for the recent CloudFlare whoopsie. It's just too bad that publishing rants won't change the darned status quo! :')
People will not do anything until something really disastrous happens. Even afterwards, memories fade. CrowdStrike has not lost many customers.
Covid is a good parallel. A pandemic was always possible; there is always a reasonable chance of one over the course of a few decades. However, people did not take it seriously until it actually happened.
A lot of Asian countries are much better prepared for a tsunami than they were before 2004.
The UK was supposed to have emergency plans for a pandemic, but they were for a flu variant, and I suspect even those plans were under-resourced and not fit for purpose. We are supposed to have plans for a solar storm, but when another Carrington event occurs, I very much doubt we will deal with it smoothly.