
I remember that at Oracle they built systems to shut down the presumed previous leader, so they knew definitively that it wasn't ghosting.


Yep, the "STONITH" technique [1]. But programmatically resetting a node over a network/RPC call might not work if that node's internode network comms are down while it can still access shared storage via other networks. Oracle's HA fencing doc mentions other methods too, like IPMI LAN fencing and SCSI persistent reservations [2].

[1] https://en.wikipedia.org/wiki/STONITH

[2] https://docs.oracle.com/en/operating-systems/oracle-linux/8/...
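
To make the failure mode above concrete, here is a minimal sketch of "fence before takeover" with fallback paths. It is not Oracle's implementation. The function names (`fence_node`, `rpc_reset`, `ipmi_power_off`, `scsi_reservation`) are hypothetical stand-ins that only simulate outcomes; real fencing agents would go where the stubs are.

```python
from typing import Callable, Iterable, Tuple

# A fencing method returns True only when it has positively confirmed that the
# old leader is powered off or cut off from shared storage.
FenceMethod = Callable[[str], bool]


def fence_node(node: str, methods: Iterable[Tuple[str, FenceMethod]]) -> str:
    """Try each fencing method in order; return the name of the first that confirms.

    Raises RuntimeError if no method confirms, in which case the caller must
    NOT promote a new leader.
    """
    for name, method in methods:
        try:
            if method(node):
                return name
        except Exception:
            # A failed or unreachable fencing path counts as "not fenced".
            continue
    raise RuntimeError(f"could not fence {node}; refusing to take over")


# Hypothetical stubs (assumptions, not real agents): they simulate a dead
# internode network with a working management network and storage path.
def rpc_reset(node: str) -> bool:
    raise ConnectionError("internode network is down")


def ipmi_power_off(node: str) -> bool:
    return True  # pretend the management network reached the BMC


def scsi_reservation(node: str) -> bool:
    return True  # pretend the old leader's storage access was revoked


if __name__ == "__main__":
    used = fence_node(
        "node-a",
        [("rpc", rpc_reset), ("ipmi", ipmi_power_off), ("scsi", scsi_reservation)],
    )
    print(f"fenced via {used}; safe to promote a new leader")
```

The point of the ordering is that an unreachable path is never read as success: the new leader is promoted only after some method confirms the old one is out.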


They had access to the ILOM, which gave them a much more durable way to STONITH. Of course every link can "technically" fail, but this pushed reliability to so many 9s that it felt unwarranted to consider anything further.


Yep, and ILOM access probably happens over the management network and can hardware-reset the machine, so dataplane internode network issues and any OS-level brownouts won't get in the way.



