We had a staff meeting yesterday to discuss our 'Disaster Recovery' procedures for shared servers.
We thought we would publish our notes from the meeting as some of you may be interested in reading them.
==================================================================
What would happen if a BWF Shared Server totally failed?
If we received notification that one of our shared servers was having issues, in the first instance we would have technicians in the data centre despatched immediately to check the situation. They would start work on repairing the issue straight away and we would keep clients informed. We would also start posting to our announcements page, and we would tweet and post to Facebook.
If after a few hours it became apparent there was likely to be a more serious issue, we would tell these technicians to keep working, but our disaster recovery plan, summarised below, would swing into action. (Please note this recovery is based on accounts being restored to a new server with new IPs. We feel clients would prefer the inconvenience of an IP change to a lengthy delay.) Remember the technicians would still keep working on the failed server, as this is the quickest way to recover, and the bullet points below are in addition to any work on the failed server:
Even with our current Disaster Recovery Plan in place, and with the weaknesses we identified, if we had a complete server failure we would have everything back up and running within 24-36 hours at most.
How can we speed this up / Improve?
Moving forward we will reduce the wait before we start packaging up the backups to 2 hours. Currently we wait about 4-5 hours before doing that. The thinking is that we would rather package up backups we do not need and delete them once we discover the server is fine.
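To make that packaging step concrete, here is a minimal sketch of what 'packaging up the backups' could look like on one of our cPanel servers. The account list location and destination directory are illustrative assumptions rather than our exact tooling:

    #!/usr/bin/env python3
    # Rough sketch: package every cPanel account into a cpmove archive so it
    # can be restored on a replacement server. /scripts/pkgacct is cPanel's
    # own packaging script; the paths below are assumptions for illustration.
    import os
    import subprocess

    USERS_DIR = "/var/cpanel/users"   # one file per cPanel account
    DEST = "/backup/cpmove"           # staging area for the archives

    os.makedirs(DEST, exist_ok=True)
    for user in sorted(os.listdir(USERS_DIR)):
        # pkgacct writes cpmove-<user>.tar.gz into DEST
        subprocess.run(["/scripts/pkgacct", user, DEST], check=True)

The same cpmove archives can then be restored on a replacement server with cPanel's /scripts/restorepkg, which is what makes the restore-to-new-IPs approach described above workable.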
What if the BWF website and Billing System went offline?
As a strict rule, we host our main website and billing system on a separate server from client accounts.
We currently host our main website on a 4-year-old server. Whilst it is rock solid and we are loath to move away from something that works and has never had an outage, moving forward it seems prudent to move it to a totally separate new server, and we plan to complete this within 6 weeks.
Our current BWF Contingency Plan is shown below, and then we will look at how we can improve it moving forward and set a time frame:
We would start packaging up the backup the instant we had a server failure. Last month, for example, the server hosting our website and billing system needed a drive replaced in the RAID array and the array needed to be rebuilt. There was no downtime during this as the drive was hot swappable. While the rebuild completed, and purely as a precaution, our techs were already packaging up the backup from the daily backup and making the up-to-date database available from our hourly database backup (there is a sketch of that hourly dump after these steps). It clearly was not needed, and a RAID rebuild is a common thing, but we wanted to have a backup plan.
We would deploy a new server, install CentOS, then install, harden, configure and license cPanel.
We would make some internal changes, such as updating the WHMCS license IP, the ENOM API IP and so on, to make sure the website was functional.
We would update the DNS and bring our website and billing system back online.
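For anyone curious, the hourly database backup mentioned above boils down to something like the sketch below. The database name, output path and credential handling are assumptions to illustrate the idea, not our exact script:

    #!/usr/bin/env python3
    # Illustrative hourly dump of the billing database, compressed and
    # timestamped so the most recent copy can be restored quickly.
    # Credentials are assumed to come from /root/.my.cnf.
    import datetime
    import subprocess

    stamp = datetime.datetime.now().strftime("%Y%m%d-%H00")
    outfile = f"/backup/db/billing-{stamp}.sql.gz"

    # --single-transaction gives a consistent dump of InnoDB tables
    # without locking the live billing system.
    dump = subprocess.Popen(
        ["mysqldump", "--single-transaction", "billing_db"],
        stdout=subprocess.PIPE,
    )
    with open(outfile, "wb") as fh:
        subprocess.run(["gzip", "-c"], stdin=dump.stdout, stdout=fh, check=True)
    dump.stdout.close()
    if dump.wait() != 0:
        raise SystemExit("mysqldump failed")

Run from cron every hour, something along these lines keeps an up-to-date copy of the database ready to move across to a replacement server.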
We are very confident we could have our main website and billing system back online within 6 hours.
How can we speed this up / Improve?
We are still discussing how best to proceed to give really fast recovery in the event of a failure. Suggestions from our staff meeting notes are below:
By the 10th of February we will be deploying a spare empty server that will always be online, ready to take accounts. Presently we would wait until a server had failed before deploying a new one. This will speed up the restore process by at least 4-6 hours (the time it takes to install and configure a new server). This server will sit empty, ready to receive accounts from any failed server.
We are looking at possibly deploying a VPS and keeping a ‘spare’ copy of our website and client area there in a suspended state. We would perhaps write a script to rsync the database a few times per day to this ‘spare’ server (a rough sketch of such a sync follows these suggestions), or we will look at implementing a custom MySQL replication setup. Either would allow for almost immediate recovery in the event of a failure.
A staff member has suggested keeping the site in an SVN repo for quick deployment.
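To give a rough idea of the rsync option above, the sketch below pushes the newest database dump to the ‘spare’ VPS over SSH. The standby hostname, paths and schedule are placeholders invented for illustration; a proper MySQL replication setup would replace this approach entirely:

    #!/usr/bin/env python3
    # Sketch: copy the newest hourly database dump to the standby VPS so the
    # suspended copy of the site can be brought up with near-current data.
    # "standby.example.com" and the paths are placeholders, not real hosts.
    import glob
    import os
    import subprocess

    DUMPS = "/backup/db"
    STANDBY = "root@standby.example.com:/backup/db/"

    latest = max(glob.glob(f"{DUMPS}/billing-*.sql.gz"), key=os.path.getmtime)
    # -a preserves permissions and timestamps, -z compresses over the wire,
    # --partial lets an interrupted transfer resume.
    subprocess.run(["rsync", "-az", "--partial", latest, STANDBY], check=True)

Run from cron a few times per day, this would keep the standby copy at most a few hours behind; full MySQL replication would narrow that gap further.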
Staff Resources to handle a Large Outage
All our staff are 100% committed to working extra shifts if needed. This is not a requirement; during the last outage one tech came back into work after dinner, without being asked, and started working again to help our clients. Outages are rare, and the voluntary ‘flexible’ approach to working means staff can get extended time off with their families during quiet periods in return for working during outages.
We also have access to many techs from our remote support company in India to handle busy times and the management there can deploy technicians for us at short notice.
Giles, our Senior Support Admin, is USA based, so during an outage either Stephen K or Giles W can be on hand throughout the night. With Giles working from the USA, the time difference allows Stephen to get the sleep he needs before taking over the next morning. Russell, a local staff member, is taking on more of a role with us now and is also available.
Finally, it should be noted that nearly all our servers are provided by our long-term partners Hostdime USA and Hostdime UK. They are a global hosting company with the strength of 300+ staff members to call on in the event of an outage. All our servers are on a management contract with them for hardware replacement, and they are very professional. Knowing this should give you real peace of mind.