[29/06/2019] XSLTEL Downtime Explanation
It all started when my monitoring server (Zabbix) started complaining that the SSH service was down on the main node. I logged in to the iLO console and found a short message informing me of filesystem corruption and telling me to unmount the filesystem and run xfs_repair.
The first thing that came to mind was to reboot the server into rescue mode, unmount the filesystem, and run xfs_repair. I thought that wouldn't take more than 10 minutes to complete.
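For readers who have not done this before, the rescue-mode plan can be sketched roughly as below. This is only an illustration, not the exact commands I typed: the device name /dev/sdXN is a placeholder (substitute the real partition), and the DRY_RUN switch and the run/repair_xfs helpers are names made up for this sketch.

```shell
#!/bin/sh
# Rough sketch of an XFS repair from a rescue environment.
# ASSUMPTION: /dev/sdXN is a placeholder, not the real device from this incident.
DEVICE="${DEVICE:-/dev/sdXN}"
# ASSUMPTION: DRY_RUN is a hypothetical safety switch for this sketch.
# With DRY_RUN=1 (the default) the commands are only printed, never executed.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

repair_xfs() {
    # xfs_repair refuses to work on a mounted filesystem, so unmount first.
    # In a rescue environment the filesystem may already be unmounted.
    run umount "$DEVICE" || echo "umount failed (maybe not mounted)" >&2
    # -n is no-modify mode: report problems without changing anything.
    run xfs_repair -n "$DEVICE"
    # The actual repair.
    run xfs_repair "$DEVICE"
}

repair_xfs
```

Run as-is, it only prints the three commands. Setting DRY_RUN=0 and DEVICE to the real partition would execute them, which should only be done from rescue mode with the filesystem unmounted.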
In rescue mode everything went as planned, and xfs_repair finished with only minor complaints. However, after booting the server I was greeted with this message again and again:
systemd-journald [XXX]: Received request to flush runtime journal from PID 1
Then the OS asked me for the root password for maintenance, as the system couldn't boot from the /boot mount. From troubleshooting the log files, it seems the server had a nasty upgrade of the cloudinit package that destroyed the installed kernel.
For 3 hours I couldn't figure out exactly what went wrong. So, to minimize the downtime, I decided to reinstall the server and restore the Virtual Servers' configs from the backup I have; the VS images were intact on the RAID array.