[RFO] VPS3 Reason for Outage - Sunday 13 July 2014

  • Sunday, 13th July, 2014
  • 21:47

VPS3 - RFO (Reason for Outage)
Sunday 13 July 2014

On Sunday 13 July 2014 the VPS3 node went down, and upon reboot it forced a file system check (fsck). This file system check failed at 15%.

Our technicians rebooted the server into single-user mode and started a manual file system check. This check took 5 hours to run to completion.

This server has a RAID card with a BBU (Battery Backup Unit), and we have write caching enabled. This means that data written to the array first goes to fast cache memory and is then written out to the array, which helps ensure the fastest possible server performance. The battery ensures that if the server crashes or power is lost, the cache memory remains intact until power returns, so in theory no data being written to the array will be lost.
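As a rough illustration of the write caching behaviour described above, the short Python sketch below models a cache with and without a working battery. It is a simplified toy model only, not the firmware logic of our RAID card; the class, method, and function names are hypothetical and chosen purely for this example.

```python
class WriteBackCache:
    """Toy model of a RAID controller's write-back cache (illustrative only)."""

    def __init__(self, battery_ok=True):
        self.battery_ok = battery_ok
        self.cache = {}  # block number -> data acknowledged but not yet on the array
        self.disk = {}   # block number -> data safely written to the array

    def write(self, block, data):
        # The write is acknowledged as soon as it lands in cache memory.
        self.cache[block] = data

    def flush(self):
        # Pending writes are copied out to the array, then cleared from the cache.
        self.disk.update(self.cache)
        self.cache.clear()

    def power_loss(self):
        # Without a working battery, the cache contents do not survive.
        if not self.battery_ok:
            self.cache.clear()

    def power_up(self):
        # Whatever survived in the cache is written out to the array.
        self.flush()


def simulate(battery_ok):
    """Write two blocks, lose power before the second is flushed, then power up."""
    controller = WriteBackCache(battery_ok)
    controller.write(1, "a")
    controller.flush()
    controller.write(2, "b")  # still only in cache when the power fails
    controller.power_loss()
    controller.power_up()
    return controller.disk


print(simulate(True))
print(simulate(False))
```

In this model, the second block reaches the array only when the battery holds the cache through the power loss. With a failed battery, the write was acknowledged but never makes it to the array.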

Once the check completed, our senior server admin found a small amount of data corruption. Extensive investigation of the log files shows that around the time of the crash the battery in the BBU was marked as degraded, yet upon reboot it was marked as having no issues. We have therefore concluded that the data corruption and the lengthy array rebuild time were caused by the BBU not functioning correctly.

We are still looking into this issue. As a temporary measure, and to prevent this from happening again, we are turning off write caching on the RAID card. This will result in slightly slower writes to disk, but it should not be enough to cause concern. We are also putting the battery through a cycle test.

We apologise again for today's downtime on this server. Unfortunately, hardware is not flawless, and the downtime today was regrettable. We thank you for your extended patience.
