Approximately 3 hours ago, Hypervisor 2 failed and went offline. The OnApp cloud is designed with this eventuality in mind: all Virtual Machines housed on the failing hypervisor should automatically fail over to a spare, empty hypervisor. This worked for all but 4 Virtual Machines, and downtime for the VMs that did fail over was only a matter of seconds.
We then investigated the 4 VMs that did not migrate. We contacted OnApp, and their technicians told us there was an 'unpatched bug' in the system that prevented failover from happening for these 4 VMs. The patch has now been applied. This is most frustrating for us, as only last week we had them upgrade us to the latest OnApp version, so we would have assumed all the relevant patches had already been applied.
We have manually failed over the 4 affected Virtual Machines to the spare hypervisor, but two of them are showing inconsistencies in their file systems and will not boot without a Linux fsck (file system check). This is being run directly at the SAN level to speed up the process.
We apologise for these issues. The cloud is designed to prevent this from happening, and we have spare hardware in place for exactly this eventuality, but in this case, according to OnApp, an unpatched bug prevented 4 Virtual Machines from migrating across.
Although only 2 servers on our cloud are still affected, we do apologise if you are housed on either of them.
============================================
8.21pm Update: It turns out three VMs failed to restart and needed fsck file system checks. One server's check has completed and it is now back online. The other 2 checks are completing now. We do not anticipate the outage for these 2 VMs being much longer.
============================================
9.04pm Update: Server 24's fsck completed 10 minutes ago and its services have just started. We are working on the remaining issues now.
============================================
9.06pm Update: In the middle of all this, cPanel on server 23 decided to kick off an automatic update, elevating the load and preventing named from starting in a timely fashion. We felt that stopping a cPanel update midway through could cause more issues, so we had to let it run. This explains the extra 45 minutes it took for server 23 to start up. Sites are resolving now, but the cPanel update is still running, so there is elevated load on this server at this time.
============================================
9.20pm Update: All affected VMs are now back up and all sites we tested are resolving. If you notice any lingering issues, let us know. Thanks to Dan.S, a staff member from our datacentre partner, for his assistance this afternoon, as well as Giles (Orlando-based BWF staff) for assisting on his day off.