VPS 8 Node Unresponsive (OpenVZ)

  • Monday, 16th September, 2013
  • 08:37am

===================================
9.22am
The file system check is complete.  Containers now start one at a time, and quotas need to be recalculated for each one, so expect another 10 minutes on average per container (unfortunately, this is how OpenVZ works).  This is the disadvantage of OpenVZ compared with the slightly more expensive Xen virtualization.  When a node needs a reboot and has had more than a certain number of days' uptime, it will force an fsck on the entire array and the quotas need to be recalculated.

FYI, initial investigations show the main node failed after a sudden load spike that happened so quickly our monitoring system did not have time to alert us, and even if it had, we would not have been able to react in time.  Load rose from 2.7 to 617 in a few minutes.  We also know which container caused the load:

 server1.*******.co.uk (hostname removed for security)

07:48:27 AM 0 112 46.21 30.65 14.37
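
For anyone wanting to read the sample line above, here is a minimal Python sketch. It assumes the columns follow the layout of `sar -q` (run queue size, process list size, then the 1, 5 and 15 minute load averages); the notice itself does not label the columns, so that reading is an assumption, and the `LoadSample` and `parse_load_sample` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LoadSample:
    time: str
    runq_sz: int      # assumed: run queue size
    plist_sz: int     # assumed: number of tasks in the process list
    ldavg_1: float    # assumed: 1 minute load average
    ldavg_5: float    # assumed: 5 minute load average
    ldavg_15: float   # assumed: 15 minute load average


def parse_load_sample(line: str) -> LoadSample:
    """Split one whitespace-separated sample line into named fields."""
    parts = line.split()
    if len(parts) != 7:
        raise ValueError(f"expected 7 fields, got {len(parts)}")
    clock, meridiem, runq, plist, l1, l5, l15 = parts
    return LoadSample(f"{clock} {meridiem}", int(runq), int(plist),
                      float(l1), float(l5), float(l15))


sample = parse_load_sample("07:48:27 AM 0 112 46.21 30.65 14.37")

# A 1 minute average above the 5 and 15 minute averages means load is climbing.
rising = sample.ldavg_1 > sample.ldavg_5 > sample.ldavg_15
print(sample.time, sample.ldavg_1, rising)
```

Under that assumed layout, the 1 minute average (46.21) is well above the 15 minute average (14.37), which is consistent with a load that was climbing quickly at 07:48.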

We are still looking for the cause and will take whatever corrective action we can with this particular server.  Again, due to how OpenVZ works, the load spike affected every container on the node and not just one VPS, as would have happened on a Xen server.  OpenVZ works well when there are no sudden load spikes on single servers, but as you can see, when the load on one container spikes it can affect every container.
===================================
9.13am
The file system check is now 63% complete.
===================================
8.58am
Our KVM access to this node is not working, so this is not a live update.  We managed to speak to a data centre tech who was able to check: as we suspected, the server is undergoing a file system check, which was 12.5% complete as of 7 minutes ago.  We need to let this complete.
===================================
8.37am
VPS8 Node became unresponsive and our logs show this is due to server load issues.  This is an OpenVZ node where all containers can burst into all CPU cores, so it is likely one container has caused this server to crash.  We are rebooting the main node now.

Our KVM connection to this main node is not working.  We have opened a ticket with our data centre.  The server is not back online following an IPMI power cycle, so it is likely the node is undergoing an fsck (file system check).  This will slightly delay the node coming back online.
===================================