[29/06/2019] XSLTEL Downtime Explanation


It all started when my monitoring server (Zabbix) began complaining that the SSH service was down on the main node. I logged in to the iLO console and found a short message reporting filesystem corruption and telling me to unmount the filesystem and run xfs_repair.

The first thing that came to mind was to reboot the server into rescue mode, unmount the filesystem, and run xfs_repair. I thought that would take no more than 10 minutes.

In rescue mode everything went as planned, and xfs_repair finished with only minor complaints. But after rebooting the server I was greeted with this message again and again:

systemd-journald [XXX]: Received request to flush runtime journal from PID 1

Then the OS asked me for the root password for maintenance, as the system couldn't boot from the /boot mount. From troubleshooting the log files, it seems the server had a nasty upgrade of the cloudinit package that destroyed the installed kernel.

For 3 hours I couldn't figure out exactly what went wrong. So, to minimize the downtime, I decided to reinstall the server and restore the Virtual Servers' configs from the backup I had. The VS images were intact on the RAID array.
