In our case, we're designing around INSERT-only tables with a composite primary ...

zozbot234 · 2025-07-16T15:16:16 1752678976

> with a composite primary key that includes the site id

It doesn't look like you'd need multi-master replication in that case? You could simply partition tables by site and rely on logical replication.
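
To make that concrete, here is a rough Python sketch of why insert-only rows keyed by site id merge cleanly. The `(site_id, local_seq)` key shape, the `merge_insert_only` helper, and the sample rows are all made up for illustration; this is not how any particular database implements logical replication.

```python
from typing import Dict, Tuple

# Hypothetical composite primary key: (site_id, local_seq).
Key = Tuple[str, int]

def merge_insert_only(*replicas: Dict[Key, str]) -> Dict[Key, str]:
    """Union insert-only replicas; raise if one key carries two different values."""
    merged: Dict[Key, str] = {}
    for replica in replicas:
        for key, value in replica.items():
            if key in merged and merged[key] != value:
                raise ValueError(f"conflicting rows for key {key}")
            merged[key] = value
    return merged

# Each site only ever inserts rows under its own site id,
# so the key spaces are disjoint and the union cannot collide.
site_a = {("A", 1): "order 17", ("A", 2): "order 18"}
site_b = {("B", 1): "order 90"}
merged = merge_insert_only(site_a, site_b)
```

As long as every site writes only under its own prefix, the merge is a plain union and the order in which replicas arrive doesn't matter.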

ForHackernews · 2025-07-16T15:36:36 1752680196

I think that's absolutely true in the happy scenario when the internet is up.

There's a requirement that during outages each site continues operating independently, and it might need to write to data "outside" its normal partition. With active-active replication, the hope is that the whole thing recovers "automatically" (famous last words) to a consistent state once the network comes back.

teraflop · 2025-07-16T16:18:54 1752682734

But if you drop the assumption that each site only writes rows prefixed with its site ID, then you're right back to the original situation where writes can be silently overwritten.

Do you consider that acceptable, or don't you?

ForHackernews · 2025-07-17T09:31:19 1752744679

Not silently overwritten: the collision is visible to the application layer once connectivity is restored and you can prompt humans to reconcile it if need be.

LudwigNagasena · 2025-07-16T16:38:40 1752683920

Sounds like a recipe for a split brain that requires manual recovery and reconciliation.

ForHackernews · 2025-07-16T21:36:49 1752701809

That's correct: when the network comes back up we'll present users with a diff view and they can reconcile manually or decide to drop the revision they don't care about.

We're expecting this to be a rare occurrence (during partition, user at site A needs to modify data sourced from B). It doesn't have to be trivially easy for us to recover from, only possible.
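
Roughly, the detection step after reconnect could look like the Python sketch below. The `Revision` record, its fields, and `find_conflicts` are hypothetical names for illustration, not our actual schema; it assumes every write is kept as its own insert-only revision tagged with the site that made it.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple

RowKey = Tuple[str, int]  # hypothetical (site_id, local_seq) key of the row being modified

@dataclass(frozen=True)
class Revision:
    row_key: RowKey   # which row this revision modifies
    written_by: str   # site that made the write during the partition
    payload: str

def find_conflicts(revisions: List[Revision]) -> Dict[RowKey, List[Revision]]:
    """Return rows that were revised by more than one site."""
    by_row: Dict[RowKey, List[Revision]] = defaultdict(list)
    for rev in revisions:
        by_row[rev.row_key].append(rev)
    return {
        key: revs
        for key, revs in by_row.items()
        if len({r.written_by for r in revs}) > 1
    }

# During the partition, site A edits a row sourced from B while B edits it too.
revisions = [
    Revision(("B", 7), "A", "qty=5"),
    Revision(("B", 7), "B", "qty=3"),
    Revision(("A", 2), "A", "qty=1"),
]
conflicts = find_conflicts(revisions)
```

Anything that ends up in `conflicts` is what would feed the diff view; rows touched by only one site pass through untouched.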

zozbot234 · 2025-07-16T17:47:27 1752688047

You could implement a CRDT and partially automate that "recovery and reconciliation" workflow.
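
As one possible shape for this, here is a minimal Python sketch of a multi-value register built on version vectors. The choice of CRDT, the `Clock` type, and the `dominates` and `merge` helpers are my own assumptions for illustration, not something specified in this thread. Edits that causally follow another edit replace it automatically, and only truly concurrent edits are left for a human.

```python
from typing import Dict, List, Tuple

Clock = Dict[str, int]        # version vector: site id -> counter
Version = Tuple[Clock, str]   # (clock, value)

def dominates(a: Clock, b: Clock) -> bool:
    """True if a has seen everything b has (a >= b component-wise)."""
    return all(a.get(site, 0) >= n for site, n in b.items())

def merge(x: List[Version], y: List[Version]) -> List[Version]:
    """Keep every version that no other version strictly supersedes."""
    candidates = x + y
    kept: List[Version] = []
    for clock, value in candidates:
        superseded = any(
            dominates(other, clock) and other != clock
            for other, _ in candidates
        )
        if not superseded and (clock, value) not in kept:
            kept.append((clock, value))
    return kept

base = [({"B": 1}, "v1")]
site_a = [({"A": 1, "B": 1}, "edited at A")]  # A edits on top of base
site_b = [({"B": 2}, "edited at B")]          # B edits concurrently

resolved = merge(base, site_a)     # A's edit supersedes base: one value left
concurrent = merge(site_a, site_b) # neither clock dominates: both values kept
```

When the merge leaves more than one value, that is the case that still needs the manual diff view; everything else resolves without a prompt.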