[29/06/2019] XSLTEL Downtime Explanation

It all started when my monitoring server (Zabbix) began complaining that the SSH service was down on the main node. I logged in to the iLO console and found a short message informing me of filesystem corruption and telling me to unmount the filesystem and run xfs_repair.

The first thing that came to mind was to reboot the server into rescue mode, unmount the filesystem, and run xfs_repair. I thought it wouldn't take more than 10 minutes to complete.
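For anyone who wants to script that rescue-mode plan, here is a minimal sketch of the sequence, not the exact commands I typed. The device name `/dev/sdXN` is a placeholder, and the helper names `xfs_repair_plan` and `run_plan` are made up for illustration. It adds a read-only `xfs_repair -n` pass before the real repair, which is my own assumption about a safer order. By default it only prints the commands.

```python
import subprocess

def xfs_repair_plan(device):
    """Return (command, must_succeed) steps for repairing an XFS device.

    `device` is a placeholder such as /dev/sdXN; substitute the real one.
    """
    return [
        # In rescue mode the filesystem may already be unmounted,
        # so a failing umount is not treated as fatal.
        (["umount", device], False),
        # -n is xfs_repair's no-modify mode: it reports problems only.
        # It exits non-zero when it finds corruption, so it is not fatal.
        (["xfs_repair", "-n", device], False),
        # The actual repair; a failure here should stop the script.
        (["xfs_repair", device], True),
    ]

def run_plan(plan, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd, must_succeed in plan:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=must_succeed)

# Dry run: prints the three commands without touching any disk.
run_plan(xfs_repair_plan("/dev/sdXN"))
```

Calling `run_plan(plan, dry_run=False)` as root from the rescue environment would actually execute the steps, so double-check the device name first.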

In rescue mode everything went as planned, and xfs_repair completed with only minor complaints. But after booting the server normally, I was greeted by this message again and again:

systemd-journald [XXX]: Received request to flush runtime journal from PID 1

Then the OS asked me for the root password for maintenance, as the system couldn't boot from the /boot mount. Going through the log files, it seems the server had a nasty upgrade of the cloudinit package that destroyed the installed kernel.

For 3 hours I couldn't figure out what exactly had gone wrong. So to minimize…