Server 29 RAID Array Rebuild

  • Saturday, 4th January, 2014
  • 4.44pm
** The vast majority of accounts are restored. A couple remain - mainly huge accounts (one is 96GB) - so these will be restored overnight. Any client who notices any issues with their site should open a helpdesk ticket and we will gladly check this for you. We thank you once more for your extreme patience during this hardware failure. **

=================================================

1.49pm
We are at 84%. The final accounts should all be online within 2-3 hours at most. At this point, if anyone notices any issues with their hosting, please open a ticket and we will gladly check for you.

=================================================

11.00am

A client has been on chat asking for an update. We are at 68% now. For those wondering, cPanel is programmed to always restore accounts in alphabetical order of username. We are now on the letter Q.
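Because the restore queue works through usernames alphabetically, the current letter gives a rough sense of how far along the process is. A minimal sketch of that idea - the account names below are made up for illustration, and cPanel's actual queue internals are not exposed here:

```python
# Rough progress estimate based on an alphabetical restore queue.
# The account names are hypothetical, purely for illustration.
accounts = sorted(["alpha", "bravo", "quark", "zulu", "mike", "delta"])

def restore_progress(current_user, all_accounts):
    """Return the fraction of accounts already restored, assuming the
    queue processes usernames in plain alphabetical order."""
    done = sum(1 for name in all_accounts if name < current_user)
    return done / len(all_accounts)

# If the restore is currently on "quark", every name sorting before it is done.
print(f"{restore_progress('quark', accounts):.0%}")  # → 67%
```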

=================================================

10.00am

We are at 59% now. Thanks for your patience. As each account is restored, its website starts working, so fewer people are affected as time goes on. Next update at 12 noon.

=================================================
8.48am
Just a quick update. Any client who is back online and logs into cPanel will see that the server load is elevated. There is nothing we can do about this; it is a direct result of the restore process that is happening. Sites may lag a little for the next few hours.

Presently 50% of accounts are restored. The next update will be at 10am.

=================================================

8.00am
The accounts are coming online one at a time, and presently 40% of accounts are working fine. There are some really large accounts that take significant time to restore.

=================================================

3.00am
We have been trading for ten years and have had shared servers with RAID arrays for ten years. Unfortunately this is the first time we have had a shared server RAID array fail in that time. Neither the top technicians at the data centre nor we were able to get the array to rebuild. We believe the issue was caused by a random reboot happening during the RAID rebuild, but as the array was unrecoverable we cannot get any logs. RAID rebuilds and drive replacements are a standard procedure, and arrays are designed to allow drives to fail and be replaced without data loss. In this case a reboot, for whatever reason, has left the array unrecoverable.

The server is now back online and we are restoring accounts from the daily backups, which were taken at 1am on Saturday. All usernames beginning with the letter A are now restored and tested as working. We are continuing with the restore and it should proceed quickly. As sites are restored they will start working again, so as time passes fewer clients will remain affected. We will provide an update at 7am.

We sincerely apologise for this; any clients with questions should open a helpdesk ticket for the attention of management.

=================================================

11.40pm
The array is unfortunately having trouble rebuilding. As a precaution we are installing a fresh copy of CentOS and cPanel on a spare server and preparing for a worst-case scenario of a full restore from backup.

Data centre technicians are still working to bring the failed RAID array back online. We appreciate your extended patience.

=================================================

9.00pm
As you know, the server is offline, so data centre remote hands are working on this; we have no direct access to the server just now. The latest update is that the RAID array is now rebuilding, but we do not know the percentage complete at this time.

The moment we have more information it will be posted here.
=================================================

7.48pm
As you are already aware, the server is offline at this time while the rebuild completes and the file checks are done. We apologise for the downtime. It appears the server rebooted while rebuilding the RAID array, and this forced the array offline as it was badly out of sync.

We have put the array back online and it shows as degraded, which is expected. However, on reboot the server wants to run an fsck file system check, which fails because the drives are out of sync - a chicken-and-egg scenario.

As such, the server is temporarily offline while the rebuild completes and an fsck file system check is run. We aim to update you on progress at 10pm, as we do not expect this server to be back online before then.

It has been 222 days since the last reboot on this server. Whilst downtime is regrettable, we ask those clients who are understandably becoming frustrated to remember that this server has had virtually 100% uptime for a long time.
=================================================

7.12pm

We have managed to get the server to boot and are presently checking all drives (one at a time) with an fsck file system check. The first drive is at 11%.

Once all drives are checked we can reboot the server with the array online. At that point the array will be degraded and will still need to be rebuilt. We apologise for this; we cannot speed the process up at this point.

Next update 8.30pm
=================================================
6.45pm

The server is having trouble booting up and a data centre technician is currently still troubleshooting.
=================================================
5.45pm

We are fully aware this rebuild is causing some server connectivity issues. We are investigating now and will have more information once the technician on the data centre floor reports back to us.

=================================================
A short time ago our monitoring system detected that a drive on this server had failed. We have replaced the drive and the RAID array is currently rebuilding. There will be higher server load and increased latency while the array rebuilds. Once the rebuild completes and the array becomes optimal again, everything will be fine.


Thank you for your patience while we work to ensure all drives on this server are optimal.