=====================================
14 August - 7.07pm
We ask any client who still has issues to contact us via the helpdesk as we make final preparations to move everyone off this platform. Some users are still offline; over the next 12-24 hours we will have new servers deployed, salvage as much data as we can for you, and assist in whatever way we can to help you restore access.
Please note that most clients with backup solutions were back online the same day the SAN failed, so it is always worth considering backups. Most of the time you will most likely never need them, but this is that time, I'm afraid. Perhaps with hindsight we should have charged a little more and included backups for everyone, as opposed to keeping prices lower and 'upselling' backups to clients. Unfortunately we cannot turn back the clock and change that situation.
We deeply apologise for this hardware failure and we are truly sorry it happened, but unfortunately it was totally out of our control.
We will be making no further announcement on this issue except to say we will deal directly with any client on the helpdesk who continues to have issues.
=====================================
14 August - 3.28pm
The SAN came up, but as soon as we tried to restart some machines (just one at a time) it quickly failed again. We are waiting on the UK NOC staff to swap some more hardware so we can try again.
=====================================
14 August - 12.20pm
Sorry for the lack of updates; we wanted to have something concrete to tell you.
We managed to get the SAN online, however there is currently no way we can guarantee whether, or for how long, it will stay up. We have just rebooted a hypervisor and verified that it was able to log into the iSCSI targets.
[root@hyp2 ~]# vgdisplay | grep 'VG Name'
VG Name onapp-ktye8kgmiegw1e
VG Name onapp-uci9csaxfkew3i
VG Name onapp-dn9qg7vnkvqqjj
All of the LVs (LV = logical volume, like a virtual drive) are now available, so it should be possible for us to begin booting individual VMs, but a large number of VMs are not booting at this time. One VM is reporting as online but it will not ping and we cannot console into it to see what is happening. We are able to mount LVs, which should mean that even if the machines do not boot we can pull data off them. We are having the DIMEnoc senior admin look into this once more as he is physically there.
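For the technically minded, the data-recovery route works roughly like this (a rough sketch only; the LV name and mount point below are illustrative, not a specific client's volume):
# confirm the hypervisor is logged into the iSCSI targets presented by the SAN
iscsiadm -m session
# activate an individual logical volume and mount it read-only so data can be copied off
lvchange -ay /dev/onapp-ktye8kgmiegw1e/example-lv
mkdir -p /mnt/recovery
mount -o ro /dev/onapp-ktye8kgmiegw1e/example-lv /mnt/recovery
# if the LV holds a whole-disk image rather than a bare filesystem, kpartx -av can expose its partitions first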
Just so you know, any client who had backups added to their server has now been restored and is running again. We have 5 clients currently affected. One client has local backups that are very recent, so he has provided those and we are working right now to build him a new server so we can restore his accounts. Another client with a non-cPanel server has backups on Amazon, so if he wishes we can deploy him a machine later. That leaves 3 clients. We will be emailing all three shortly (or updating their tickets if they have one open) to let them know the current situation.
=====================================
14 August - 9.06am
We had a short message from the techs on the ground. They are attempting to bring the SAN online now. We will update here shortly.
=====================================
14 August - 8.33am
We will be turning on the hypervisors shortly and attempting to see if we can get any machines to start. The SAN is in a very fragile state at the moment. We had asked the engineers if it was possible to take a complete backup of the SAN before putting it online, but they have replied with the following:
"To be honest - attempting to backup your existing data is going to be very painful. This is partly due to the fact that we don't know which LV's are active, which are cancelled; there's some that are running windows (which means we have to do full dumps) and for those that are running Linux - we will be trying to associate the data with whichever VM it's supposed to be for. Lastly, if we were to try to do the backup - I think it would take a minimum of 12-24 hours to complete. So, with that said - I think the best bet is to cross our fingers and put it online."
=====================================
14 August - 7.15am
We have an update from the engineer on the ground from the past hour. Please note we have contracted DIMEnoc to supply and manage the SAN, which is why the update is from a Hostdime engineer (they are the same company).
Hello Stephen,
We are going to be performing consistency checks shortly on the array, and we will then be able to see if we can recover backups of your data off of the array.
Philip Jameson
HostDime.com, Inc.
Systems Integration Engineer
=====================================
14 August - 12.44am
Two more client servers are being restored right now from their backup servers, which means many clients have service restored at this time. There are still some clients whose service is not restored; we are talking to them right now via helpdesk tickets to see if they have their own local backups, as they do not have an off-cluster backup option from us.
Dan S from DIMEnoc is working overnight and Philip J is taking over from him in a few hours. We will have an update in the morning and no further updates will be available for a few hours now. They are working towards no data loss and we need to be patient for a little while longer.
=====================================
13 August - 9.08pm
I have a bit more information. As it stands, the DIMEnoc integrations team (we pay them to supply and manage our SANs) is and has been working on maintaining the integrity of all the data on the SAN itself, which is what is taking so long. As for the data, it could be 8-12 hours until the SAN is fully functional again, as they will need to make sure the data is good and that the array is rebuilt before allowing any servers to connect to the SAN.
We are actively working with clients to restore from backups; 7 VPS client servers have been restored from backups and are functioning at this time. We are working on other servers as we speak.
Please do open a helpdesk ticket (support@bigwetfish.co.uk) with any questions; we have additional staff working overnight.
=====================================
13 August - 8.19pm
We have spoken to management at DIMEnoc, the company we rent the SAN from and who manage it. We will have some detailed information for all clients within the hour.
=====================================
13 August - 6.10pm
Unfortunately we are still waiting for an update from the NOC. We are pushing for an ETA as that is what clients need at this time.
=====================================
13 August - 3.14pm
It is with regret that we report that, at this time, the SAN appears to have gone offline. We are yet again waiting for information from the data centre technicians. You can also check http://www.hostdime.co.uk, who we share this SAN with; their site is throwing an error as well.
We managed to move a few smaller VPS servers this morning, and those clients are fine. We do not want to create a situation where we move clients and cause issues for other clients by overloading nodes, so we have some new hardware being deployed tomorrow that will allow us to move all other clients. Business shared server clients were moved to a new server yesterday.
We will be sending an email to all clients later once we have some information.
=====================================
12 August - 5.09pm
I am updating you as to the status of the UK SAN. Fortunately, the rebuild on the first parity drive completed earlier this morning. This should result in better I/O since the RAID controller is no longer having to compute error correction on the fly.
Unfortunately, the RAID array is still sub-optimal because two drives had failed: while we once again have full redundancy, the second parity drive has only just begun its rebuild. You will not see peak performance until that rebuild has also completed.
The current read on that rebuild is as follows:
root@uksan [~]# arcconf getstatus 1
Controllers found: 1
Logical device Task:
Logical device : 0
Task ID : 102
Current operation : Rebuild
Status : In Progress
Priority : Low
Percentage complete : 12
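For anyone wondering how we are tracking those figures, the controller can simply be polled at intervals; something along these lines (the interval here is arbitrary):
# re-run the status query every 10 minutes to watch the rebuild percentage climb
watch -n 600 arcconf getstatus 1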
=====================================
12 August - 9.50am
We are still seeing some elevated loads today, but we expect this to subside by the afternoon, when we expect the RAID rebuild to complete.
=====================================
11 August - 7.31pm
We have checked all sites hosted on the business1 server and we are not seeing any errors on the main page of each site. Let us know if you see any errors and we will be happy to take a look. We will be contacting all Business1 clients next week to make sure you are compensated as per our terms of service compensation arrangements for this 24-hour outage.
All issues related to this SAN failure should now be resolved. The SAN RAID array is still rebuilding, so you will continue to see elevated load across all servers, but all servers are responding well at this time.
=====================================
11 August - 6.48pm
The fsck has completed and the server is up now. Load is still high, but we are working on that now and it is our only priority. If you see any issues, let us know via the helpdesk.
=====================================
11 August - 6.04pm
The Business1 Server fsck is at 85%
=====================================
11 August - 4.32pm
The Business1 server file system check is at 63.3%. Some clients are asking about a time frame; we will post again in an hour and will then make a best-effort guess.
=====================================
11 August - 3.41pm
Many clients are asking about the status of the SAN rebuild. The current output from the rebuild is as follows:
root@uksan [~]# arcconf getstatus 1
Controllers found: 1
Logical device Task:
Logical device : 0
Task ID : 101
Current operation : Rebuild
Status : In Progress
Priority : Low
Percentage complete : 40
Please note this is not to be confused with the fsck (file system check) that needs to be performed on the business server; that check is at 56%.
We anticipate this rebuild may take 15 more hours to complete, as a best guess. Once it finishes, loads will stabilise. We believe thirteen servers are up and running at this time; one is being rebooted and one is performing an fsck. If you have a server that is still down please let us know.
The elevated load is a direct result of this rebuild. It is unavoidable if we are to keep the servers online while simultaneously rebuilding the array, so until the rebuild is completed I am sorry to say there isn't anything we can do about the load.
=====================================
11 August - 7.16am
Unfortunately the Business 1 shared server was working last night but then stopped responding. It rebooted and an automatic fsck (file system check) failed. The server has been rebooted in recovery mode and a manual fsck is currently running; it is at 12%.
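For reference, the manual check we are describing is along these lines (a sketch only; the device path is illustrative and this assumes an ext filesystem):
# forced check, answer yes to repairs, and print the progress percentage we are quoting
e2fsck -f -y -C0 /dev/vg_business1/lv_root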
=====================================
11 August - 6.31am
We are still seeing elevated load on a number of servers and this is causing some services on some servers to fail. Unfortunately it is going to take a little time for things to calm down.
=====================================
11 August - 1.06am
All servers except one are back up at this time, but some have elevated load. This is caused by three factors: (1) the RAID array is still being verified; (2) some fsck file system checks are ongoing; and (3) DIMEnoc, who we share this SAN with, have one OpenVZ node hosted on the same SAN and those virtual containers are calculating quotas. As these processes all complete in the next few hours things will stabilise.
One VPS server unfortunately has a corrupt file system. After an fsck we determined that there were too many lost+found files and the server was not recoverable. We need to rebuild this server and restore from backup. The client has been informed and we have started the process of restoring the two sites hosted by the client to a shared server temporarily; tomorrow we can build a new VM and perform a final restore to that new virtual machine.
=====================================
10 August - 10.18pm
We are booting the servers now, one at a time. We will update you when all servers are back up.
=====================================
10 August - 9.30pm
Just to let you know, the array is still verifying but we are getting closer. We are still advised not to turn on any more VMs until this process completes. A client just came on live chat to ask whether it was 'days' or 'hours'. Please know this will be hours, not days.
You may also know that www.hostdime.co.uk (we host with DIMEnoc, and their sister company Hostdime have their corporate website on the same SAN) is still down, so they are in the same position. One client was pretty annoyed earlier as he said we had compromised his data integrity by placing data on a SAN we knew to be 'ropey'. We need to say that this SAN has been flawless since the initial issues in July 2011 and has had no problems since that time. We absolutely would not place live servers on a SAN we knew to be faulty, and we are pretty sure Hostdime would not place the main corporate website for Hostdime UK on a dodgy SAN either - that would make no business sense. The majority of previous outages in the past 12 months were related to power outages in the Maidenhead NOC, as well as one switch reboot following a failed firmware update. We have had zero SAN issues in that period and we would never place the integrity of our clients' data at risk by knowingly hosting it on a failing SAN.
=====================================
10 August - 8.37pm
We have some more information for you from a tech on the ground: 'There was an issue with the raid card where it just stopped recognizing the array. We're going to be shooting diagnostic information to the manufacturer to see what exactly happened, but a fair part of the downtime was determining where exactly the problem was, as the data that the raid card was giving didn't make too much sense. That was coupled with the fact that we had to be incredibly careful about doing anything so that we wouldn't lose any data. '
=====================================
10 August - 8.28pm
We have turned on one more random server and are monitoring the SAN closely. The array is still verifying, so we do not want a lot of seek activity on the array that would slow the verification down or cause the SAN to fail again. There are still 13 servers turned off at this time.
=====================================
10 August - 7.49pm
We were advised we could turn one server on for testing purposes, so we picked a VPS with only 2 hosted domains as a test. That VPS is now working fine. The senior admin does not want any more VMs turned on until they finish their work, as they are still running commands and the array is still verifying, which puts pressure on the SAN.
=====================================
10 August - 7.31pm
The SAN is back up but the array is verifying and there is a lot happening on it at the moment. We have been advised by the senior NOC admin not to start any machines until the verification process completes. To quote the senior admin: 'we don't want to swamp the SAN and have something odd happen that drops it again'. We will update in 30 minutes.
=====================================
10 August - 6.28pm
We just had an update that our LVs can now be seen on the SAN, which is good news. The techs on the ground have some more commands to run before we can start the VMs (VPS servers).
=====================================
10 August - 5.55pm
The same tech we mentioned below just gave us a further update: he said he should have the array up and operational and would have an update in 45 minutes.
=====================================
10 August - 5.51pm
We just spoke to Philip J, a senior member of the NOC staff, and he asked us to be patient for a while longer. This was an online chat and he then went off to continue working. I am sorry that is all we have at this time.
=====================================
10 August - 4.52pm
Unfortunately we still await this update. We have just sent an email to all 15 affected VPS clients and are also emailing the 25 clients on the affected business1 shared server. If your email is linked to your server and you need to provide us with an alternative contact address, please email support@bigwetfish.co.uk. We are also on live chat on our website and have been all day.
=====================================
10 August - 3.50pm
We just spoke to the technical staff on the ground and they tell us they are still checking a number of things to see what is causing the failure. They will have an update within the next hour. We deeply apologise for this outage; we know it is not acceptable and not what we would want. The moment we have some fresh information we will post it here.
=====================================
10 August - 3.00pm
The technicians report they are working on a few tweaks now to bring everything back online. Please be assured we will contact all clients as soon as we have some information.
=====================================
10 August - 2.00pm
We have nothing new to report yet; technicians are still on site working. Please be assured the moment we have information we will post it here.
=====================================
10 August - 1.30pm
We can confirm that 40 clients are affected by this outage (15 VPS clients and 25 clients on the business1 server). We want to assure you there is a technician working in the data centre right now to determine the cause of the outage. A few clients from server 20 'may' be affected - server 20 is up, but we are aware a few clients have DNS redirects set up as they have yet to update DNS following a recent server upgrade. As many of you know, we have been working to move 'shared' servers to a new platform; this latest outage may force our hand with the rest of the clients on this particular setup.
=====================================
10 August - 1.00pm
We are aware of some issues affecting 15 VPS servers on our network as well as 1 shared server (business1.bigwetfish.co.uk). We believe this outage is SAN related and techs are working on the issue now in the data centre. As many of you know, we have server contracts with DIMEnoc, and Hostdime are their parent company. You will also see that http://www.hostdime.co.uk is down at this time, so the outage is not just affecting us. We share this particular SAN with Hostdime, who also have a VPS node located on it together with their UK corporate website.
We will have an update as soon as possible. We know uptime is critical for all clients; please be assured we are working towards a resolution as soon as practically possible.
=====================================