The system I used didn't have any notion of repair, just retry-forever. What did...

throwaway894345 · 2025-11-22T07:03:13 1763794993

Repair is just continuous retrying some reconciliation operation, where “reconciliation” means taking the desired state and the current state and diffing the two to figure out what actions need to be performed. In my case I needed to look up what the definition of a “workspace” was (from a database or similar) in terms of what infrastructure should exist and then query the cloud provider APIs to figure out what infrastructure did exist and then create any missing infrastructure, delete any infrastructure that ought not exist, and update any infrastructure whose state is not how it ought to be.

> I've written service tree management tools that do that sort of thing on a single host but not any kind of distributed system.

That’s essentially what Kubernetes is—a distributed process manager (assuming process management is what you are describing by “service tree”).