**Please note this is a Major Outage affecting the entire data centre. Other webhosts such as @tsohost, @vidahost and @xilo are also tweeting about similar outages. It will just take time to bring all services back. We are posting updates below; please refresh regularly.**
==============================================
31 May 6.01pm
This will serve as a final update on this issue. The data centre confirmed just after midnight last night that they had reconnected the UPSs and everything is back to how it should be.
Unfortunately we have lost a little data from one shared hosting account as a direct result of this outage. As you can see from this thread, the sudden power loss caused one server to fail, and after the file system check many thousands of files were missing or corrupt. We only had a weekly backup for this account (the daily backup was corrupt); it has been fully restored for the client concerned and their site is as it was six days ago.
This should serve as a reminder that although we take backups, our terms of service state that we never guarantee them. Clients should keep local backups as well to serve as the ultimate disaster recovery tool. Please ask us if you need help generating backups; we can show you how easy it is in cPanel.
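For anyone who would like to automate a local copy in the meantime, here is a rough sketch of the idea. This is not our official tooling, and the paths, database name and credentials are placeholders you would change for your own account; cPanel's Backup Wizard does the same job from the control panel.

```python
#!/usr/bin/env python3
"""Rough sketch of a local backup for a shared hosting account.

The paths, database name and credentials below are placeholders only,
not our official backup tooling.
"""
import datetime
import subprocess
import tarfile
from pathlib import Path

HOME = Path.home()
STAMP = datetime.date.today().isoformat()
DEST = HOME / "local-backups"
DEST.mkdir(exist_ok=True)

# 1. Archive the site files (public_html) into a dated tarball.
site_archive = DEST / f"site-{STAMP}.tar.gz"
with tarfile.open(str(site_archive), "w:gz") as tar:
    tar.add(str(HOME / "public_html"), arcname="public_html")

# 2. Dump the database with mysqldump (placeholder credentials and name).
db_dump = DEST / f"db-{STAMP}.sql"
with open(db_dump, "w") as out:
    subprocess.run(
        ["mysqldump", "--user=dbuser", "--password=dbpass", "example_db"],
        stdout=out,
        check=True,
    )

print(f"Backup written to {DEST} - remember to download it off the server.")
```

Run something like this from a cron job in your account, then pull the resulting files down to your own machine so a copy always exists outside the data centre.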
==============================================
30 May 9.52pm
Just a brief update. We now have only 2 clients still directly affected by these issues and we are working with them to get a resolution. 100% of the data for the other 6 clients has been recovered after this unfortunate issue.
With regards to the outage you may want to read this article for more information: http://www.theregister.co.uk/2012/05/30/pulsant_power_outage/
==============================================
30 May 5.28pm
Unfortunately we still have 8 queued tickets for account issues on the server that needed to be restored. Many of these are complex issues with corrupt and missing data caused directly by the server failure after the power outage. We are sorry if you are one of these 8 people, and we want to assure you we are working through each ticket diligently in the order it was received to ensure a full restoration of service.
==============================================
30 May 2.40pm
We are working through tickets as we get them. Apologies for the extended delay in response times today. The backup server we need in order to help the remaining 3 clients is at 60% on a file system check, so it may unfortunately be a while longer. As soon as it comes back we can restore service for the few remaining clients with corrupt databases/files.
At this time all services are back on line and working. As stated earlier, if you notice any issues on your account please open a support ticket.
==============================================
30 May 1.15pm
The data centre responded and they are bringing the offline backup server we talked about at 10.20am online now. It is file system checking at this time and will be online soon. This will allow us to deal with lingering issues with a few missing or corrupt databases.
==============================================
30 May 11.46am
We are pleased to say that finally all the virtual containers on the VPS2 node are up. The issue was that after such a brutal shutdown (loss of power) all containers needed to recalculate their quotas. We could have skipped this, but then every user would have had unlimited space and no way to track disk usage on the server, and it would only have meant a scheduled reboot at another time to re-enable quotas.
This is a slight disadvantage of OpenVZ virtualization compared to other types. Notice our Cloud was back relatively quickly, whereas the OpenVZ containers took longer. That disadvantage has to be set against the cheaper cost of such a virtualization model.
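For those curious what "starting a couple at a time" looks like in practice, here is a rough sketch of the idea using the standard vzctl tool. The container IDs and batch size below are purely illustrative, not the actual IDs on the node:

```python
#!/usr/bin/env python3
"""Illustrative sketch only: start OpenVZ containers in small batches so the
node is not overloaded while each container recalculates its disk quotas
after an unclean shutdown. The container IDs below are made up."""
import subprocess

CONTAINER_IDS = [101, 102, 103, 104, 105, 106]  # placeholder CTIDs
BATCH_SIZE = 2  # quota recalculation is I/O heavy, so keep batches small

for i in range(0, len(CONTAINER_IDS), BATCH_SIZE):
    batch = CONTAINER_IDS[i:i + BATCH_SIZE]
    procs = [subprocess.Popen(["vzctl", "start", str(ctid)]) for ctid in batch]
    # Wait for the whole batch (including its quota recalculation) to finish
    # before moving on to the next pair.
    for proc in procs:
        proc.wait()
    print(f"Batch {batch} started")
```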
For those interested the Data Centre are posting updates here:
https://my.pulsant.co.uk/customerarea/network_status/network_status.html?id=2266&op=view
==============================================
30 May 10.20am
We have just been chatting to our data centre partners regarding the backup server that is still off line. This is hindering us in resolving the few lingering issues we have. For example, one VPS client has a corrupt database and we need access to this backup server to restore it. The corrupt database was caused by the sudden loss of power to the server.
We have asked them to try to expedite this issue asap, as having a backup server off line means the uptime of a handful of clients is being affected. Unfortunately our backup servers are cheaper Atom boxes with no ability to console into the server, so there is nothing we can do remotely.
==============================================
30 May 9.45am
We are getting reports that some data on server 21 is out of date. This should not be the case. Whilst I did not do the work personally, we had a top tech from the Integrations Team at DIMEnoc assisting us, and before he clocked off he confirmed that valid data had been restored.
We need to look into this. If your account is on server 21 and has old data, please open a ticket and we will work to resolve it.
The issue with FTP on server 21 has also now been resolved. This was due to the server rebuild: we had neglected to open the passive ports in the firewall. This has now been corrected - thanks to the 2 clients who opened tickets about FTP not working.
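For the technically minded, "opening the passive ports" simply means allowing the port range the FTP server hands out for data connections through the server firewall. A rough sketch of the idea follows; the 30000-35000 range is only an example and must match whatever passive range the FTP server is actually configured with:

```python
#!/usr/bin/env python3
"""Sketch of opening a passive FTP port range in iptables. The range below
is an example; it must match the passive range configured in the FTP server
itself. This runs as root on the server, not on your own machine."""
import subprocess

PASSIVE_RANGE = "30000:35000"  # example range only

# Allow inbound TCP connections on the passive data ports.
subprocess.run(
    ["iptables", "-I", "INPUT", "-p", "tcp",
     "--dport", PASSIVE_RANGE, "-j", "ACCEPT"],
    check=True,
)
print(f"Passive FTP range {PASSIVE_RANGE} opened - remember to save the rules.")
```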
==============================================
30 May 9.11am
Containers on the VPS2 node are starting up. Unfortunately, because the server stopped suddenly when the power was lost, these containers need to recalculate their quotas. This is a major disadvantage of OpenVZ virtualization. We cannot start too many at a time otherwise the node will overload, so we are starting 2 at a time at random. Containers are coming on line slowly.
==============================================
30 May 8.21am
Virtually all clients have service restored. Here is what we know about at the moment and are working on:
- 9 accounts need to be restored on server 21 manually and these are being worked on
- One VPS client's operating system appears to have been corrupted by the sudden power loss. We are still working on a resolution
- Some containers on VPS2 node were stopped and are now starting
If there are any other issues please open a support ticket and we will handle it on a case by case basis.
==============================================
30 May 7.44am
All shared and reseller servers are now back on line. Server 21 is back on line following the required restore (no data was lost, as we used the live data; only operating system files were corrupted in the power outage). There are a number of missing accounts (but no missing data); we just need to manually re-create a few accounts. Unfortunately for the very small number of users affected this is going to take some time.
We have one report from another VPS client that her MySQL service will not start on her server. It appears to be related to this outage with possible data corruption on her disks caused by the sudden power loss. We are looking into this for this one client. Unfortunately our Backup3 server will not boot up following the outage. We did have a ticket in with low priority but have changed that to High Priority now that we need access to it. The data centre are swamped with requests so we anticipate a slight delay on that one.
Any other issues that come in via support tickets we will handle on a case by case basis now.
==============================================
30 May 7.11am
Server 21 has been rebuilt and data has been copied to it from the old /home folder. A chmod command is finishing off and most sites there should be back on line. There are a few sites missing that will need some manual intervention to get working. We will have a list shortly and our techs can start work.
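For anyone wondering what that finishing-off step involves: after copying account data onto a rebuilt server, ownership and permissions have to be reset for each account so the web server and FTP behave correctly. Below is a rough, generic illustration of the idea; the modes and layout are typical examples, not the exact commands being run on server 21:

```python
#!/usr/bin/env python3
"""Illustrative sketch: after copying account data into a rebuilt server's
/home, reset ownership and a sensible mode per account. The layout and modes
below are generic examples, not the exact commands used on server 21."""
import pwd
import subprocess
from pathlib import Path

for home in Path("/home").iterdir():
    if not home.is_dir():
        continue
    user = home.name
    try:
        pwd.getpwnam(user)  # skip directories that do not match a real account
    except KeyError:
        continue
    # Hand everything back to the account owner...
    subprocess.run(["chown", "-R", f"{user}:{user}", str(home)], check=True)
    # ...and give the home directory itself a typical shared-hosting mode.
    subprocess.run(["chmod", "711", str(home)], check=True)
    print(f"Reset ownership for {user}")
```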
==============================================
30 May 3.48am
Server 21 unfortunately lost a few vital config files in the sudden power outage. A senior technician is working now to get a new server built and the data restored. We are estimating 8am for this server to be back on line.
We see another VPS server that appears to have lost its /etc folder. The technician is looking at that as well.
If you see any more issues just let us know.
==============================================
30 May - 3.17am
The VPS3 node file system check took nearly 4 hours to complete. Unfortunately it looks as if a number of critical files were lost when the power suddenly went. At this time 4 VPS servers are not on line. We are working to see what we can recover.
==============================================
30 May - 2.36am
The VPS3 Node has completed its file system check and it took 3.5 hours. Containers are starting now, but unfortunately due to the nature of OpenVZ all containers need to recalculate their quotas. We are starting 2 virtual machines at a time, as having any more recalculating quotas at once is not wise.
The servers will now come back on line one by one. We could reboot the node without quotas to get all servers back on line now, but that would give all users unlimited space with no way to track usage, and we would still need to take the server down again later to do what we are doing now. As it is 2.38am we have taken the decision to proceed with this. Had it been peak hours we would have taken a different decision.
==============================================
30 May - 1.31am
All shared and reseller servers should be back on line. If you notice any issues let us know and we can investigate right away. We will be on Twitter until 3am and on the helpdesk after that.
If you have a VPS or Cloud VPS server you should be back on line too. Again if you notice any issues let us know. We checked a random sample of these servers and they were up.
VPS3 node is still causing issues. The File System Check is ongoing. We will post an update as soon as possible.
We also want to thank Phillip J of @hostdime (DIMEnoc is our partner data centre) for assisting us with getting things back up as quickly as we did. He continues to work on VPS3 for us at this time. It is good to know that at times like this we have the power of a large international hosting partner to call on for extra support.
==============================================
30 May - 1.15am
We are waiting for an update on VPS3 Node. Our data centre techs are working on that for us at this time and they are consoled into it. They are running a file system check on it. They tell me there are a lot of multiply claimed blocks so progress is slow. We will have an update soon.
==============================================
30 May - 1.06am
Server 24 file system check has completed. The server is booting now. Please allow 10-20 minutes for services to all start and for websites to start resolving.
==============================================
30 May - 12.52am
Server 20's file system check has completed and it is now back on line. We still have the server 24 console open and it is at 80%. Others, we assume, are progressing in a similar fashion. Some servers may be sluggish for a while, as the SAN will still be under considerable pressure with multiple file system checks ongoing. As more complete and more servers come back on line the situation will shortly improve.
==============================================
30 May - 12.45am
Server 24 file system check is at 74%. Other servers are progressing as well. The situation is getting better. As more servers complete file system checks and come on line the I/O wait time on the SANs will improve significantly.
==============================================
30 May - 12.22am
An update. Server 20 had gone 191 days without a reboot, so the file system check was forced; it is currently at 83%. It should not be long now. We will check the other servers shortly, but needless to say the FSCKs are ongoing and we will post when they are back on line.
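Briefly, on why a long uptime forces this: ext3 file systems keep a mount count and a check interval, and once either limit is exceeded the next boot runs a full fsck. For those curious, here is a rough sketch of how you could inspect those counters yourself with tune2fs; the device path is just an example and it needs root:

```python
#!/usr/bin/env python3
"""Sketch: show the fsck-related counters an ext file system keeps. The
device path is an example - substitute the real one. Needs root."""
import subprocess

DEVICE = "/dev/sda1"  # example device

output = subprocess.run(
    ["tune2fs", "-l", DEVICE], capture_output=True, text=True, check=True
).stdout

for line in output.splitlines():
    # These fields control when a boot-time fsck is forced.
    if line.startswith(("Mount count", "Maximum mount count",
                        "Last checked", "Check interval")):
        print(line)
```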
==============================================
29 May - 11.56pm
Just an update. The File System Check of VPS3 Node is at 33%. We will update you once we know more.
==============================================
29 May - 11.49pm
A number of servers are doing file system checks at this time. Server 23 is at 29% and server 24 is at 18%. This is either because a server had been up for a long time and was due a check anyway, or because the servers suddenly lost power in the outage. There is nothing we can do to speed these up.
If your server is not on line and you have Twitter, please tweet @bigwetfish and we can tell you how far through the FSCK it is. We are watching all servers and staff are staying on duty until all servers are back on line.
Please also note it is likely we will be late into work tomorrow, so sales and accounting tickets will be queued until the afternoon.
==============================================
29 May - 11.29pm
All cluster servers on the cloud are on line. Some are file system checking and others are just starting services. As many servers were started at once, the load on the hypervisors is likely to be heavier, so it may take a little time for things to calm down. Rest assured we are monitoring this.
On server 18, named and Apache were failing; we have fixed that and the server is now back on line.
Unfortunately the /vz partition of the VPS3 Node needs a file system check, and that is happening right now.
==============================================
29 May - 10.59pm
The Control Panel Server for our OnApp Cluster is now back up and the SAN and Hypervisors are on line following loss of power. We are starting the VMs now. Some may need to fsck and we will update the list here if necessary.
Server 14 and Server 18 are back on line at this time.
==============================================
29 May - 10.47pm
Finally we have some concrete information. We have one cabinet without power at the moment in Maidenhead. Other cabinets have power again and services are coming back on line. Any server that has had a long uptime will unfortunately need to FSCK, and we will post this as we know it.
We are also aware from Twitter that TSOHOST and VIDAHOST have the same issues with loss of power in Maidenhead. Any updates will be posted here as soon as we have the information.
==============================================
29 May - 10.41pm
We are still waiting on some information from our people, but we are getting this from @tsohost, who just tweeted:
Quote from @tsohost: "We have 3 of our racks in Maidenhead currently without power - we're working with the datacentre to restore this now." Unfortunately we do not seem to be getting that information ourselves just yet. As I type this I am holding on the phone.
==============================================
29 May - 10.27pm
It appears this issue has happened again in the data centre. We are currently in communication with our partners to determine the exact cause of this. Updates will appear here.
==============================================
29 May - 6.53pm
We have received confirmation that the issue earlier was a routing issue affecting all servers located in that data centre segment. Backup systems were quickly brought on line to rectify the issue, and engineers on site made sure that systems were restored in as timely a manner as possible. Unfortunately this is all we have in terms of information. Please be assured that if this happens again we will push for a detailed explanation at that time.
Thanks for your patience this morning during the very brief outage.
==============================================
29 May 2012 - 11.12am
We are still awaiting an update on what caused the 15-20 minutes of loss of connectivity. We do have emergency AIM live chat contacts for our data centre partners but as the issue is resolved we felt it appropriate just to open a support ticket for this information. Once we receive that information we will update this announcement. Please check back later and refresh.
==============================================
29 May 2012 - 10.11am
Connectivity has been restored. We are awaiting a report on what has happened from our data centre partners.
==============================================
29 May 2012 - 9.51am
We are aware that the entire UK data centre has lost connectivity to the internet. We are working with our partners there (DIMEnoc) and they have informed us that every server they have located there is having connection issues, so this is a much wider issue than just our servers. They have assured us this is getting top priority at this time.
This is a most unfortunate situation. We take uptime seriously and we will give a full report once we know more. Please do not open a support ticket at this time; we will update this page as soon as we know more, so please refresh periodically.
==============================================