I remember at Oracle they built systems to shut down the previous presumed leade...

tanelpoder · on Aug 30, 2024

Yep, the "STONITH" technique [1]. But programmatically resetting a node over a network/RPC call might not work if that node's internode network comms are down while it can still access shared storage via other networks... Oracle's HA fencing doc mentions other methods too, like IPMI LAN fencing and SCSI persistent reservations [2].

[1] https://en.wikipedia.org/wiki/STONITH

[2] https://docs.oracle.com/en/operating-systems/oracle-linux/8/...

setheron · on Aug 30, 2024

They had access to the ILOM, which gave them a much more durable way to STONITH. Of course every link can "technically" fail, but this pushed reliability to so many 9s that the remaining failure cases didn't feel worth considering.

tanelpoder · on Aug 30, 2024

Yep, and ILOM access probably happens over the management network and can hardware-reset the machine, so dataplane internode network issues and any OS-level brownouts won't get in the way.
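
The fallback idea in this thread (try a reset over the internode network, then fall back to a management-network path) can be sketched roughly as below. This is only an illustration under assumptions: `FenceMethod`, `fence_node`, `safe_to_promote`, and the two simulated callbacks are hypothetical names, not Oracle's or any real fencing agent's API, and the callbacks fake their outcomes instead of talking to hardware.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class FenceMethod:
    name: str
    # Hypothetical callback: returns True only if the fence is confirmed.
    attempt: Callable[[str], bool]


def fence_node(node: str, methods: List[FenceMethod]) -> Tuple[bool, List[str]]:
    """Try each fencing method in order; stop at the first confirmed success."""
    tried = []
    for method in methods:
        tried.append(method.name)
        try:
            if method.attempt(node):
                return True, tried
        except OSError:
            # An unreachable path is a failed attempt, never an implied success.
            continue
    return False, tried


def safe_to_promote(node: str, methods: List[FenceMethod]) -> bool:
    """Promote a new leader only once the old one is confirmed fenced."""
    fenced, _ = fence_node(node, methods)
    return fenced


# Simulated callbacks standing in for real fencing paths (assumptions).
def rpc_reset(node: str) -> bool:
    # Internode network is down, so the RPC never reaches the node.
    raise ConnectionError("internode network down")


def mgmt_power_reset(node: str) -> bool:
    # Management network is separate, so the hardware reset goes through.
    return True


methods = [
    FenceMethod("rpc_reset", rpc_reset),
    FenceMethod("mgmt_power_reset", mgmt_power_reset),
]
print(fence_node("node-a", methods))
# prints (True, ['rpc_reset', 'mgmt_power_reset'])
```

The point of the ordering is that a failed or unreachable fencing path never counts as success: if only `rpc_reset` were available here, `safe_to_promote` would return `False` and no new leader would be promoted.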