VMware – Understanding High Availability


With VMware ESXi hosts, multiple server workloads are consolidated onto one physical server, so when a host breaks down or fails, it is important to keep the services running. VMware HA helps the administrator keep downtime for these servers to a minimum. HA does not provide 100 percent availability of VMs; instead, it provides higher availability by rapidly recovering VMs from failed hosts. HA monitors all ESXi hosts in a cluster, and if a failure is detected, it automatically restarts the affected VMs on a different host. This functionality requires shared storage.

Once a particular ESXi host has crashed, HA restarts its virtual machines on the remaining ESXi hosts in the failover cluster. HA can also monitor virtual machines for guest operating system failures: if the guest operating system of a VM fails, HA can reset that VM. This feature is called vSphere HA VM Monitoring and is configured at the cluster level. Once enabled, VMs are monitored through heartbeats exchanged with VMware Tools. A virtual machine is reset if both of the following conditions hold:

  • VMware Tools has stopped sending heartbeat signals for a specified time frame
  • No network traffic or storage I/O was generated in the last 120 seconds
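
The reset decision can be sketched in a few lines of Python. This is only an illustrative model, not VMware code: the names `VmStatus` and `should_reset` are hypothetical, and it assumes that both conditions must be true before a reset is triggered.

```python
from dataclasses import dataclass

# From the text: the VM must have produced no network or storage I/O
# during the last 120 seconds.
IO_QUIET_WINDOW_S = 120


@dataclass
class VmStatus:
    """Hypothetical snapshot of what VM Monitoring knows about one VM."""
    seconds_since_heartbeat: float  # time since the last VMware Tools heartbeat
    seconds_since_io: float         # time since the last network or storage I/O


def should_reset(vm: VmStatus, failure_interval_s: float) -> bool:
    """Return True only if heartbeats stopped AND the VM has been I/O-quiet."""
    heartbeat_lost = vm.seconds_since_heartbeat > failure_interval_s
    io_quiet = vm.seconds_since_io >= IO_QUIET_WINDOW_S
    return heartbeat_lost and io_quiet
```

With a failure interval of 30 seconds, a VM that has been silent for 45 seconds but wrote to disk 10 seconds ago would not be reset in this model, because the I/O check suggests the guest is still working.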

When a virtual machine is reset, a screenshot of its console is stored. After an operating system crash, such as a Blue Screen of Death (BSOD) or a kernel panic, the administrator can use it to trace the cause of the issue. Various options control the behavior of this mechanism, including:

  • Failure interval: Time in seconds without a heartbeat after which a particular VM is reset
  • Minimum uptime: Time vSphere HA waits after a VM starts before it begins monitoring that VM
  • Maximum per-VM resets: Maximum number of resets per VM within the reset time window
  • Maximum resets time window: Time window to which the maximum number of resets applies, for example, 12 hours
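
To see how these four options interact, here is a minimal Python sketch. The classes `MonitoringPolicy` and `VmRecord` and the function `may_reset` are hypothetical names chosen for illustration, and the logic is an assumption about how such thresholds combine, not the actual vSphere HA implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MonitoringPolicy:
    failure_interval_s: float  # heartbeat silence tolerated before a reset
    minimum_uptime_s: float    # grace period after power-on before monitoring
    max_per_vm_resets: int     # cap on resets inside the time window
    reset_window_s: float      # length of the window used to count resets


@dataclass
class VmRecord:
    powered_on_at: float                 # timestamps are plain seconds
    last_heartbeat_at: float
    reset_times: List[float] = field(default_factory=list)


def may_reset(vm: VmRecord, policy: MonitoringPolicy, now: float) -> bool:
    """Decide whether a reset is allowed at time `now`."""
    # The VM is not monitored until it has been up for the minimum uptime.
    if now - vm.powered_on_at < policy.minimum_uptime_s:
        return False
    # Heartbeats are still arriving within the failure interval.
    if now - vm.last_heartbeat_at <= policy.failure_interval_s:
        return False
    # Count only resets that fall inside the sliding time window.
    recent = [t for t in vm.reset_times if now - t < policy.reset_window_s]
    return len(recent) < policy.max_per_vm_resets
```

The cap on resets is what keeps a VM with a persistent fault from being reset in an endless loop; once the window has passed, resets are allowed again.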

These options can be set per priority level of particular virtual machines. Configuring vSphere HA VM Monitoring is a complex topic that needs to be planned and tested carefully. The time and reset thresholds in particular need to be chosen wisely.

To configure vSphere HA VM Monitoring, perform the following steps:

  1. Log in to vCenter Server using the vSphere Web Client.
  2. Select the cluster and click Manage, then Settings, then vSphere HA under Services.
  3. Configure the behavior parameters that best suit your environment.

HA ensures that the services running in the virtual environment are not hampered and are recovered as soon as possible. However, whenever a virtual machine fails over from one host to another host, the guest OS has to be rebooted.

HA uses heartbeating to detect host failures and restarts VMs accordingly. It places an agent on each host to maintain a heartbeat with the other hosts in the cluster; a loss of heartbeat automatically triggers a restart of all affected VMs on the other hosts.
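
As a rough illustration of heartbeat-based detection, the following Python sketch flags hosts whose last heartbeat is older than a timeout. The function name `hosts_missing_heartbeat`, the host names, and the timeout value are all hypothetical; the real HA agent protocol is more involved.

```python
from typing import Dict, List


def hosts_missing_heartbeat(
    last_seen: Dict[str, float], now: float, timeout_s: float
) -> List[str]:
    """Return the hosts whose last heartbeat is older than `timeout_s`."""
    return sorted(
        host for host, seen_at in last_seen.items() if now - seen_at > timeout_s
    )


# Example: timestamps in seconds; esx2 has been silent for 13 seconds.
last_seen = {"esx1": 100.0, "esx2": 88.0, "esx3": 99.5}
suspects = hosts_missing_heartbeat(last_seen, now=101.0, timeout_s=10.0)
```

Here `suspects` contains only `esx2`, the one host that exceeded the assumed 10-second timeout.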

HA also keeps track of the available resources of the hosts in the cluster at all times, so that it knows on which hosts the VMs should be restarted in case of a host failure. HA depends on the categorization of hosts as primary hosts and secondary hosts; the first five hosts powered on in an HA cluster are considered primary and all remaining hosts are secondary. (This primary/secondary model applies to HA in vSphere 4.x and earlier; from vSphere 5.0 on, the cluster instead elects a single master, and all other hosts act as slaves.) Primary hosts are responsible for replicating and maintaining the state of the cluster, and also for initiating any failover actions. Every host that joins the cluster needs to communicate with the primary hosts to complete its configuration and setup. If a primary host fails, a new primary host is chosen at random from the pool of secondary hosts. HA needs at least one primary host to be available in order to perform its operations.
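
The primary and secondary roles described above can be modeled with a small Python class. This is a simplified sketch under the rules stated in the text (first five hosts are primary, random promotion on failure); the class `HaCluster` and its methods are hypothetical and do not correspond to any VMware API.

```python
import random
from typing import List, Optional

MAX_PRIMARIES = 5  # from the text: the first five hosts become primary


class HaCluster:
    """Toy model of primary/secondary role assignment."""

    def __init__(self, rng: Optional[random.Random] = None) -> None:
        self.primaries: List[str] = []
        self.secondaries: List[str] = []
        self._rng = rng or random.Random()

    def join(self, host: str) -> str:
        """Add a host and return the role it received."""
        if len(self.primaries) < MAX_PRIMARIES:
            self.primaries.append(host)
            return "primary"
        self.secondaries.append(host)
        return "secondary"

    def fail(self, host: str) -> None:
        """Remove a failed host; promote a random secondary if needed."""
        if host in self.secondaries:
            self.secondaries.remove(host)
            return
        self.primaries.remove(host)
        if self.secondaries:
            promoted = self._rng.choice(self.secondaries)
            self.secondaries.remove(promoted)
            self.primaries.append(promoted)

    def can_operate(self) -> bool:
        """HA needs at least one primary host to perform its operations."""
        return len(self.primaries) >= 1
```

In a seven-host cluster, losing one primary leaves five primaries and one secondary in this model, because one of the two secondaries is promoted.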

HA detects a host failure when the HA agent on a host stops sending heartbeats to the other hosts in the cluster. A host usually stops sending heartbeats if it becomes isolated from the network, crashes, or goes down completely because of a hardware failure. Once any of these situations is detected, the other hosts in the cluster consider the affected host failed, while the affected host, if it is still running, declares itself isolated from the network. All VMs on the failed host are then restarted on other hosts in the cluster. A restart priority can be set per VM, ensuring that critical VMs are taken care of first. This restart policy addresses the priority of workloads; there are four settings available:

  • Disabled
  • Low
  • Medium
  • High

If HA needs to restart a set of particular virtual machines, the order would be:

  1. Agent virtual machines – These are special virtual machines that extend the functionality of ESXi hosts, for example, virus protection or network filtering
  2. High priority VMs
  3. Medium priority VMs
  4. Low priority VMs

To sum it up, if failover resources are not sufficient, HA ensures that more important workloads are recovered first.
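
The restart order can be expressed as a simple sort. The following Python sketch is illustrative only: `Vm`, `RESTART_ORDER`, and `restart_sequence` are hypothetical names, and it assumes that VMs whose restart priority is set to Disabled are not restarted at all.

```python
from typing import List, NamedTuple

# Lower number means earlier restart, following the order listed above.
RESTART_ORDER = {"agent": 0, "high": 1, "medium": 2, "low": 3}


class Vm(NamedTuple):
    name: str
    priority: str  # "agent", "high", "medium", "low", or "disabled"


def restart_sequence(vms: List[Vm]) -> List[str]:
    """Return VM names in restart order, skipping disabled VMs."""
    eligible = [vm for vm in vms if vm.priority != "disabled"]
    ordered = sorted(eligible, key=lambda vm: RESTART_ORDER[vm.priority])
    return [vm.name for vm in ordered]


vms = [
    Vm("web", "low"),
    Vm("db", "high"),
    Vm("av", "agent"),
    Vm("test", "disabled"),
    Vm("app", "medium"),
]
order = restart_sequence(vms)
```

Because Python's sort is stable, VMs that share a priority keep their original relative order in this sketch.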

A critical component of vSphere HA is host isolation. Host isolation rules control how particular ESXi hosts react once they are unable to communicate with each other. If the HA agent on an ESXi host loses this connection, it tries to ping an isolation address. This isolation address can be configured manually and should be a dedicated, independent, highly available host that acts as a kind of quorum witness. By default, host isolation is declared on HA masters after 5 seconds, and on slaves after 30 seconds. The administrator can configure the isolation behavior on a per-cluster basis; there are three major response types that control isolation:

  • Leave powered on: The state of the virtual machines is not changed.
  • Shut down: All virtual machines are shut down using VMware Tools. There is a 5-minute timeout after which VMs are powered off.
  • Power off: All virtual machines are powered off immediately.

Deciding which response type to choose needs deep knowledge of your infrastructure, applications, and the particular behaviors in case of failures. Make sure to plan and test this scenario in detail.
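
The isolation check and the three response types can be sketched as follows. This Python model is only an illustration: the names `IsolationResponse`, `is_isolated`, and `plan_actions` are hypothetical, and the isolation test is simplified to two boolean inputs.

```python
from enum import Enum
from typing import List, Tuple

SHUTDOWN_TIMEOUT_S = 300  # the 5-minute guest shutdown timeout from the text


class IsolationResponse(Enum):
    LEAVE_POWERED_ON = "leave powered on"
    SHUT_DOWN = "shut down"
    POWER_OFF = "power off"


def is_isolated(peers_reachable: bool, isolation_address_reachable: bool) -> bool:
    """A host is isolated if it reaches neither its peers nor the isolation address."""
    return not peers_reachable and not isolation_address_reachable


def plan_actions(
    response: IsolationResponse, vm_names: List[str]
) -> List[Tuple[str, str]]:
    """Return (vm, action) pairs for an isolated host."""
    if response is IsolationResponse.LEAVE_POWERED_ON:
        return []  # VM state is not changed
    if response is IsolationResponse.POWER_OFF:
        return [(name, "power off") for name in vm_names]
    # SHUT_DOWN: try a clean guest shutdown first, then fall back to power off.
    action = f"guest shutdown, power off after {SHUTDOWN_TIMEOUT_S}s"
    return [(name, action) for name in vm_names]
```

A host that still reaches the isolation address is not treated as isolated here, which is why the choice of that address matters so much.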

Datastores also play a very important role in vSphere HA setups. Datastore heartbeating, which is enabled by default, is used to check whether the other ESXi cluster nodes can still be reached when connectivity can’t be verified over the network. This helps prevent the cluster from splitting in case of a network issue. It is recommended to use two to five datastores for heartbeats. These datastores are selected automatically by default, but the administrator can also configure dedicated datastores.
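
The idea behind datastore heartbeats can be summarized in a small decision function. This Python sketch is an assumption-laden simplification: `classify_host` and its return labels are hypothetical and only illustrate why a second heartbeat channel is useful.

```python
def classify_host(network_heartbeat: bool, datastore_heartbeat: bool) -> str:
    """Classify a host from its two heartbeat channels."""
    if network_heartbeat:
        return "alive"
    if datastore_heartbeat:
        # The host still writes to shared storage, so it is running
        # but cannot be reached over the network.
        return "isolated or partitioned"
    return "failed"
```

Without the datastore channel, the last two cases would look identical, and VMs on a host that is merely cut off from the network could be treated as if the host had failed.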

Configuring HA

HA can be configured while creating a new cluster or later by modifying the cluster settings. To configure HA, follow these steps:

  1. Select the desired cluster on which HA needs to be enabled.
  2. Go to the Manage tab and Settings.
  3. Under Services, click vSphere HA.
  4. Click on the Edit button.
  5. Select Turn On vSphere HA.


The Host Monitoring Status section is used to enable the exchange of heartbeats among hosts in the cluster. In order for HA to work, Host Monitoring must be enabled. However, it can be disabled temporarily while performing maintenance tasks.


