Thursday, December 24, 2009

VMware’s Fault Tolerance or “FT”

What is FT?
Fault Tolerance is two VMs running in tandem on different ESX hosts. VMware calls this “vLockstep”. FT uses a special type of VMotion to create the second VM on another host. The Secondary VM is a shadow of the Primary VM.

FT is available only in the following editions of vSphere 4: Advanced, Enterprise and Enterprise Plus.

What are the FT requirements, and what else do you need to know?
Note: There may be more requirements than those listed below. This is a brief list of some important items and some gotchas; it is not a complete list.

At least two ESX hosts are needed for FT in a test environment. Three or more hosts are recommended for maximum redundancy in a production environment.

Hardware Virtualization Technology (“VT”) must be enabled in the BIOS of the ESX hosts

You must have specific processors that support VMware vLockstep (“FT”) technology

The Primary and Secondary ESX hosts require identical processors
Their CPU clock speeds must be within 400 MHz of each other

VMs cannot use more than one vCPU

FT currently does not work with multiple vCPUs, VM snapshots, thin-provisioned disks, RDMs or Storage VMotion

The Primary and Secondary ESX hosts must be running the same version and build of ESX

VMware HA must be enabled on the ESX Cluster for FT

DRS is not required for FT; however, you can continue to run DRS as part of your ESX cluster for the other VMs without any issues. FT can therefore run with or without DRS on the same ESX cluster.

No more than 8 protected VMs per ESX host are recommended. However, I would not run that many; I would run up to 4.

Make sure that you have enough resources, such as processor and memory, available for a second VM. The two VMs share the same disk space.

The FT VM cannot be thin-provisioned; it must be thick. If the VM is thin-provisioned, it must be converted to thick.

A dedicated Gigabit NIC for FT logging is required on each ESX host

Shared storage

VM storage cannot be migrated (no Storage VMotion)

VMware Consolidated Backup (VCB) is not supported at this time

SSL host certificates must be enabled and verified for each ESX host

Dedicated Gigabit network

Dedicated NIC for the FT logging VMkernel port. However, in a test environment it can be shared with VMotion and the Service Console.
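
Several of the host-level items above can be written down as a pre-flight check before you try to enable FT. The Python below is only a sketch under stated assumptions: the EsxHost record, its field names and the check_ft_pair function are hypothetical, and you would fill the values in by hand from the vSphere Client, SiteSurvey or the CPU info utility. It encodes the rules listed in this post (identical processors, clock speeds within 400 MHz, same ESX version and build, VT, SSL certificates, FT logging NIC, HA, and a per-host limit using the author's figure of 4); it is not VMware's own validation logic.

```python
from dataclasses import dataclass

# Hypothetical host record; values are entered by hand, not queried from vCenter.
@dataclass
class EsxHost:
    name: str
    cpu_family: str          # "identical processors" is compared as a plain string
    cpu_mhz: int
    esx_version: str
    esx_build: int
    vt_enabled: bool         # VT enabled in the BIOS
    ssl_cert_checking: bool  # SSL host certificates enabled and verified
    ft_logging_nic: bool     # dedicated Gigabit NIC for FT logging
    ft_vm_count: int = 0     # FT VMs already running on this host

MAX_CLOCK_DELTA_MHZ = 400
MAX_FT_VMS_PER_HOST = 4  # the author's preference; the post cites 8 as the upper bound

def check_ft_pair(primary: EsxHost, secondary: EsxHost, ha_enabled: bool) -> list[str]:
    """Return a list of problems; an empty list means the pair passes these checks."""
    problems = []
    if not ha_enabled:
        problems.append("VMware HA is not enabled on the cluster")
    if primary.cpu_family != secondary.cpu_family:
        problems.append("processors are not identical")
    if abs(primary.cpu_mhz - secondary.cpu_mhz) > MAX_CLOCK_DELTA_MHZ:
        problems.append("CPU clock speeds differ by more than 400 MHz")
    if (primary.esx_version, primary.esx_build) != (secondary.esx_version, secondary.esx_build):
        problems.append("ESX version or build differs")
    for host in (primary, secondary):
        if not host.vt_enabled:
            problems.append(f"{host.name}: VT is not enabled in the BIOS")
        if not host.ssl_cert_checking:
            problems.append(f"{host.name}: SSL host certificates are not enabled/verified")
        if not host.ft_logging_nic:
            problems.append(f"{host.name}: no dedicated FT logging NIC")
        # Adding one more FT VM would push the host past the chosen limit.
        if host.ft_vm_count >= MAX_FT_VMS_PER_HOST:
            problems.append(f"{host.name}: already runs {host.ft_vm_count} FT VMs")
    return problems
```

For example, two hosts with the same (hypothetical) "Xeon 5500" family at 2530 MHz and 2930 MHz pass the clock rule because they are exactly 400 MHz apart; a 3000 MHz host paired with the 2530 MHz one would not.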

Note: Once the Primary VM goes down, vSphere fails over to the Secondary VM and automatically creates (“spawns”) a new Secondary VM on another ESX host, if another host is available.
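
The failover behaviour in the note above can be modelled as a small state change on the pair. This is a hypothetical Python sketch: FtPair, handle_primary_failure and handle_host_recovered are illustrative names, and "pick the first other running host" is a simplification, not how vSphere actually chooses where to place the new Secondary.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical model of an FT pair: which host runs each copy of the VM.
@dataclass
class FtPair:
    primary_host: str
    secondary_host: Optional[str]  # None while the pair is unprotected

def handle_primary_failure(pair: FtPair, hosts_up: list[str]) -> FtPair:
    """Promote the Secondary, then spawn a new Secondary on another running host if any."""
    if pair.secondary_host is None:
        raise RuntimeError("no Secondary VM to fail over to")
    new_primary = pair.secondary_host
    # Exclude the new Primary's host and the failed host; take the first remaining one.
    candidates = [h for h in hosts_up if h not in (new_primary, pair.primary_host)]
    new_secondary = candidates[0] if candidates else None
    return FtPair(primary_host=new_primary, secondary_host=new_secondary)

def handle_host_recovered(pair: FtPair, host: str) -> FtPair:
    """When a host comes back and the pair has no Secondary, start one there."""
    if pair.secondary_host is None and host != pair.primary_host:
        return FtPair(pair.primary_host, host)
    return pair
```

With three hosts the pair is protected again immediately; with only two hosts it stays unprotected until the failed host is powered back up, which is why three or more hosts are recommended for production.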

How to implement FT?
OK, so before you go implementing FT, read VMware's FT requirements. It is also a good idea to run the VMware SiteSurvey utility against your ESX hosts and to run the CPU info utility, to make sure that your hosts can be used for FT.
http://www.vmware.com/download/shared_utilities.html

After you have thoroughly reviewed the requirements and evaluated whether you can run FT in your environment, it is quite simple to implement.

First, test that VMotion is working and that you can VMotion VMs between the ESX hosts in your cluster

You must have HA enabled on your ESX cluster

Right-click the VM and select Fault Tolerance > Turn On Fault Tolerance. Note: The VM can be either powered on or powered off during this process, but I have had more success with the VM powered off.
FT will then build the Secondary VM on one of the other ESX hosts in the cluster. Depending on the size of the VM, this can take a while. Be patient!

The Primary VM has a different icon, which signifies that it is running with a paired vLockstep VM

The Secondary VM is created; however, you cannot see it in the list of VMs unless you list the virtual machines on the ESX host that houses the Secondary VM

Some Potential Errors when enabling FT on a VM

SSL Host Certificates are not configured on the ESX Hosts
No FT logging NICs are configured on the hosts
The VM has snapshots, has multiple vCPUs, or has a CD-ROM connected
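
The VM-level causes in the list above, plus the thick-disk requirement from earlier in this post, can be checked before you right-click the VM. This Python sketch is hypothetical: VmConfig and ft_enable_errors are illustrative names, the values would be read by hand from the VM's settings, and the messages are paraphrases, not the exact text vSphere displays.

```python
from dataclasses import dataclass

# Hypothetical summary of the VM settings that commonly block FT.
@dataclass
class VmConfig:
    name: str
    num_vcpus: int
    snapshot_count: int
    cdrom_connected: bool
    thin_provisioned: bool

def ft_enable_errors(vm: VmConfig) -> list[str]:
    """Return the blocking issues in the order they are checked."""
    errors = []
    if vm.snapshot_count > 0:
        errors.append("VM has snapshots; delete them first")
    if vm.num_vcpus > 1:
        errors.append("VM has multiple vCPUs; reduce to 1")
    if vm.cdrom_connected:
        errors.append("CD-ROM is connected; disconnect it")
    if vm.thin_provisioned:
        errors.append("disk is thin-provisioned; convert it to thick")
    return errors
```

A VM with one vCPU, no snapshots, no connected CD-ROM and thick disks returns an empty list, so only the host-side causes (SSL certificates, FT logging NICs) would remain to investigate.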

How to Test FT?

Open the console on both the Primary and the Secondary VM and verify that the Secondary shows the same information as the Primary

Shut down the host where the Primary VM resides and make sure that you can still access the Secondary VM while the Primary is shutting down

Ping the Primary VM while its host is being shut down. Note: You should lose no more than one ping to the server at the moment of failover to the Secondary VM.

When the ESX host you shut down is powered back up, the Secondary VM will power up again and the pair will be running in vLockstep tandem again
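
To judge the "no more than one lost ping" rule from the test above, record each ping as reply or timeout and measure the longest run of consecutive timeouts. A hedged Python sketch: longest_outage and failover_ok are hypothetical helpers, and turning your actual ping output into a list of True/False values is left to you, since that output differs between operating systems.

```python
def longest_outage(replies: list[bool]) -> int:
    """Length of the longest run of consecutive lost pings (False = no reply)."""
    longest = current = 0
    for ok in replies:
        current = 0 if ok else current + 1
        longest = max(longest, current)
    return longest

def failover_ok(replies: list[bool], max_lost: int = 1) -> bool:
    """True if no more than max_lost consecutive pings were lost."""
    return longest_outage(replies) <= max_lost
```

Counting consecutive losses, rather than total losses, isolates the failover gap from unrelated one-off drops elsewhere in a long ping run.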


How to Test Failover for FT?

Right-click the Primary VM and select Test Failover from the Fault Tolerance menu
This test fails the Primary VM over to the Secondary VM, which becomes the new Primary; FT then starts a new Secondary VM so the pair is protected again
Note: It is normal to lose one ping during the failover process

You can also test restarting the Secondary VM
Right-click the Primary VM and select Test Restart Secondary from the Fault Tolerance menu