Adventures in highish availability

  • Peter Chubb



I manage a small farm of servers and network for continuous integration and development, supporting around 50 users. We recently retired about a dozen servers, and now use containers and virtual machines on a pair of really big servers instead. Given some excess capacity in the new machines, I decided to try to set up replication and failover, so that I can bring one machine down for maintenance and people won't notice (much). Although there are off-the-shelf tools (like Pacemaker), they didn't seem applicable, so we rolled our own. In hindsight this may have been a mistake. In this talk, I'll cover all the things that went wrong.