> Most systems handle this defensively with locks and runtime validation.
So i work at an org with 1000s of terraform repos, we use the enterprise version which locks workspaces during runs etc.
everywhere else i’ve worked, we either just use some lock mechanism or only do applies from a specific branch and CI enforces they run one at a time.
My question is: who is this aimed at and what problem is it actually solving? Running terraform isn’t difficult - thousands of orgs handle it no problem - the issues I have with it have never been around lock contention and race conditions.
Hello, CTO of Terrateam here, the creators of Stategraph.
As you said, the common practice is to use locks on state to guarantee that operations don't step on each other. This works, however the cost is that if it takes 5 minutes to perform an operation, only one person can be doing an operation at a time, so if 5 devs are modifying infrastructure, the last one has to wait 25 minutes just to get back the plan, even if those 5 people are not changing overlapping resources in the state.
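To make that arithmetic concrete, here is a minimal sketch of when each dev gets their plan back if every run holds the same lock. The function name is made up for illustration, and it assumes every run takes the same amount of time and that runs queue in order.

```python
def serialized_finish_times(n_devs: int, minutes_per_run: int) -> list[int]:
    """Minutes until each dev gets a result when runs execute one at a time.

    Assumption: all runs take the same time and queue in arrival order.
    """
    return [minutes_per_run * (position + 1) for position in range(n_devs)]


if __name__ == "__main__":
    # 5 devs, 5 minutes per operation, one state lock shared by everyone.
    print(serialized_finish_times(5, 5))
```

The last entry is the wait for the fifth dev: 20 minutes queued behind the other four, plus 5 minutes for their own run.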
The way that most people deal with this is to take their infrastructure and break it up across multiple root modules, and then when those root modules get too big, break them up again, etc.
Stategraph is solving the problem of getting all of the performance benefits of breaking up your root modules without breaking up your root modules. It dynamically determines which resources each of those 5 devs is operating on and, if the resources do not overlap, can run them in parallel.
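As a rough illustration of the idea (not Stategraph's actual implementation, which isn't described here), you can think of each dev's change as a set of resource addresses. Changes whose sets overlap have to be serialized with each other, and everything else can run side by side. The function name, dev names and resource addresses below are all hypothetical, and the sketch assumes each set already includes whatever dependencies the change touches.

```python
def conflict_groups(changes: dict[str, set[str]]) -> list[list[str]]:
    """Group changes so that any two changes sharing a resource end up together.

    Changes in the same group must run one after another; separate groups
    touch disjoint resources and could run in parallel.
    """
    # Each entry is (resources touched by the group, devs in the group).
    # Invariant: the resource sets of existing groups are pairwise disjoint.
    groups: list[tuple[set[str], list[str]]] = []
    for dev, resources in changes.items():
        merged_resources = set(resources)
        merged_devs = [dev]
        untouched = []
        for group_resources, group_devs in groups:
            if group_resources & merged_resources:
                # Overlap: fold the existing group into the new one.
                merged_resources |= group_resources
                merged_devs = group_devs + merged_devs
            else:
                untouched.append((group_resources, group_devs))
        groups = untouched + [(merged_resources, merged_devs)]
    return [devs for _, devs in groups]


if __name__ == "__main__":
    # Hypothetical pending changes, keyed by dev.
    pending = {
        "alice": {"aws_instance.web"},
        "bob": {"aws_s3_bucket.logs"},
        "carol": {"aws_instance.web", "aws_security_group.web"},
    }
    for group in conflict_groups(pending):
        print(group)
```

Here alice and carol both touch `aws_instance.web`, so they land in one group and still have to take turns, while bob's change is independent of both.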
That means Stategraph is manipulating state in a bit more sophisticated way than standard Terraform/Tofu, and we need to be careful we don't get it wrong.
I'm not sure I would want this even if I could have it TBH. Engineering org size is about ~200 with infra/sre/ops around ~25.
Different teams want to move at different cadences. At a certain scale splitting up things feels a little more natural (maybe I am stockholmed by prior limitations with TF though or just used to this way of operating now).
But even then, we're moving to k8s operators to orchestrate a bunch of things and moving off terraform apart from the stuff that doesn't change much (which will eventually get retired as well). Something like https://www.youtube.com/watch?v=q_-wnp9wRX0
Terraform variable management is our larger problem (now/nearterm) when we have to deploy numerous cells of infra that use the same project/TF files with different variables. Given the number of projects/layers of TF getting cell specific variables injected is meh.
Those variables are instance size, volume size, addresses, IAM policy, keys etc.
This is in the b2b saas world with over a million MAU. We've got islands of infra for data sovereignty, some global cells where each cell can communicate back / host some shared services (internal data analytics, orchestration tooling, internal management tooling and the like).
The way I look at it is that TF has a limitation on state size. And when you hit that limit, you have to either slow down a ton or do a (big) refactoring.
As a comparison, if a programming language forced you to split your software into multiple executables when you got to a certain number of functions, I think, almost universally, we would say that it's not a production language. That is a stupid limitation and forcing development work on users because of stupid limitations is disqualifying.
But for TF, even when we are refactoring only because the tool forces us to, we tell ourselves that it's a good idea anyway because of good software practices. But splitting infrastructure over multiple root modules is, in my analogy, the same as being forced to do it over multiple executables. It comes with a lot of unnecessary limitations.
With Stategraph, you can choose to split your infrastructure over multiple root modules, if that is what you want to do, not because you don't have a choice.
V1 of Stategraph is a drop-in TF/Tofu replacement, but once it's there, you can see a path to something more like k8s operators, without having to do any migration of infrastructure.
Back in the day, before git, we had RCS. Developers would just lock files when they worked on them, and then unlock when they were done. Or you'd copy a folder manually ("branch") to work on things concurrently and then punch a colleague in the shoulder when they forgot to unlock master so that you could lock it and check in. It worked absolutely fine, there were loads of workarounds!
> The concurrent applies isn’t that big of a deal?
That depends. There are many organizations (we talk to them) whose plans and applies take anywhere from 5 minutes to tens of minutes, some even close to an hour. That's a problem. We talked to one customer where a dev can make a change in the morning and, depending on the week, might have to wait until the next day to get their plan, and then another day to apply it, assuming there are no issues.
If you're in that position you have two options:
1. Just accept it and wait.
2. Refactor your root module to independent root modules.
(2) is what a lot of people do, but it's not cheap, that's a whole project. It's also a workflow change.
Stategraph is trying to offer a third option: if your changes don't overlap, each dev can run independently with no contention.
Even if one doesn't think contention over state is a big deal, I hope that one can agree that a solution that just removes that contention at very little cost is worth considering.
> There are many organizations (we talk to them) which have plans and applies that take 5 - 10s of minutes, some even close to an hour. That's a problem. We talked to one customer that a dev can make a change in the morning and depending on the week might have to wait until the next day to get their plan, and then another day to apply it
That's us. Especially because our teams are distributed across NA/Eastern Europe/Japan. So getting a lock is a problem because you have to wait for someone else to finish, then getting the required reviews is a problem because you have to wait for people from other timezones to come on, then by the time you're ready to re-plan after the reviews someone else has taken the lock, then you have to wait for them,...