VMware: Troubleshooting VM Performance

VMware: Troubleshooting VM Performance in ESX/vSphere

We're going to be using esxtop, which can be run from the local console or over an SSH session on ESXi, alongside the performance data available in vCenter.

General guidelines:

  1. Physical : vCPU ratio should be around 1:5
  2. Avoid memory oversubscription in production environments
  3. Reserve memory for virtualised SQL servers using Lock Pages in Memory
  4. Reserve memory for virtualised RDS/Citrix servers
  5. Always create datastores using the vSphere client. This ensures that the VMFS partition is aligned, which will reduce IO and potentially increase performance.
  6. Align guest data disks (for Windows, any version earlier than 2008 must be manually aligned) – this will reduce IO and potentially increase performance.
  7. Look to keep the number of VMs per datastore/LUN to around 10-15; this will help to reduce SCSI reservation contention.
  8. vCPUs: less is more. From my own testing I've found that Citrix servers perform better with 2 vCPUs than with 4 vCPUs. This is not only better for my users but also for the ESXi hosts, as there is less co-scheduling.

Let's begin…

1) Investigate CPU contention/exhaustion using esxtop (press ‘c’ from esxtop, and shift-v for per-VM stats only):

  1. Check host PCPU usage using esxtop
  2. Look at %RDY. If this is equal to or greater than 10% there is a performance issue – this can indicate CPU contention.
  3. Look at %MLMTD. If this is high, it indicates that a CPU limit is being imposed on the VM. %RDY minus %MLMTD gives a true indication of CPU contention.
  4. If %RDY is truly above 10%, the first step is to lower the number of active vCPUs configured on the ESX/vSphere server; after that you're looking at reducing the number of VMs on the server.
  5. Investigate co-scheduling/SMP related issues – are VMs using all presented vCPUs? From esxtop press ‘c’ then ‘e’, then take a look at %CSTP. If these values are high this could indicate an issue, as %CSTP represents the overhead of co-scheduling vCPUs from a co-stopped to a co-started state.
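The %RDY and %MLMTD check from steps 2 and 3 can be sketched in a few lines of Python. This is only an illustration: the function names and the sample figures are made up for the example, and the values would have to be read off esxtop by hand or from an export of your own.

```python
# Illustrative only: function names and sample values are hypothetical.

def effective_ready(rdy_pct: float, mlmtd_pct: float) -> float:
    """Ready time that is NOT explained by a CPU limit (%RDY - %MLMTD)."""
    return max(rdy_pct - mlmtd_pct, 0.0)


def cpu_contention(rdy_pct: float, mlmtd_pct: float, threshold: float = 10.0) -> bool:
    """True when the limit-adjusted ready time reaches the 10% guideline."""
    return effective_ready(rdy_pct, mlmtd_pct) >= threshold


# Hypothetical readings: VM name -> (%RDY, %MLMTD)
samples = {
    "citrix01": (14.0, 0.0),   # high ready time, no limit: real contention
    "sql01": (12.5, 11.0),     # high ready time, but mostly caused by a CPU limit
    "web01": (3.2, 0.0),       # healthy
}

for name, (rdy, mlmtd) in samples.items():
    verdict = "contention" if cpu_contention(rdy, mlmtd) else "ok"
    print(f"{name}: effective ready {effective_ready(rdy, mlmtd):.1f}% -> {verdict}")
```

In the sample data, sql01 looks bad on %RDY alone, but once the limit is subtracted the remaining ready time is small, so the fix there is the CPU limit rather than the host.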

As an example of the 1:5 guideline: if you have 16 physical cores, the total number of vCPUs defined across all active VMs should not exceed 80.
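The arithmetic behind that figure, as a minimal sketch. The helper names and the sample VM list are hypothetical; the 1:5 ratio is the guideline from the top of this post.

```python
# Illustrative only: helper names and the VM inventory below are hypothetical.

def max_vcpus(physical_cores: int, ratio: int = 5) -> int:
    """Upper bound on vCPUs across active VMs for a 1:ratio core-to-vCPU guideline."""
    return physical_cores * ratio


def vcpu_headroom(physical_cores: int, vcpus_per_vm: list[int], ratio: int = 5) -> int:
    """vCPUs still available under the guideline (negative means over-allocated)."""
    return max_vcpus(physical_cores, ratio) - sum(vcpus_per_vm)


# 16 cores at 1:5 gives a budget of 80 vCPUs.
print(max_vcpus(16))

# Hypothetical host: thirty 2-vCPU VMs and six 4-vCPU VMs is 84 vCPUs in total.
vms = [2] * 30 + [4] * 6
print(vcpu_headroom(16, vms))
```

A negative headroom does not prove there is a problem, but it is a good prompt to go and look at %RDY on that host.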

2) Investigate memory usage via esxtop and vCenter (press ‘m’ from esxtop, and ‘shift-v’ for per-VM stats only):

  1. Check host memory utilisation using esxtop
  2. For full-fat ESX only – the service console may be low on RAM; you can adjust this by following these instructions: http://www.vmware.com/pdf/esx_performance_tips_tricks.pdf
  3. Watch out for memory ballooning, as this can have a significant impact on VM performance. You can track memory ballooning in vCenter and esxtop: MCTLTGT is the VMkernel's desired memory balloon size, MCTLSZ is the actual size. If the target is greater than the size, the balloon is inflated; if it is smaller, the balloon is deflated. VM memory limits can also trigger ballooning.
  4. Transparent Page Sharing (TPS) allows VMs on the same host to share identical memory pages – only used when memory resources are low/overcommitted.
  5. Check esxtop for SWCUR (swap currently in use) and SWTGT (the swap usage the host is aiming for). If SWTGT is greater than SWCUR, more swapping will take place. Swapping is slow, so it should be avoided at all costs. If swapping is unavoidable use SSDs; there's a -12% degradation with local SSD versus -69% for Fibre Channel and -83% for local SATA storage. (more information here)
  6. SWPWT represents the amount of time a virtual machine spends waiting for memory to be swapped in, and should always be below 5%.
  7. SWR/s and SWW/s represent swap reads (disk to memory) and swap writes (memory to disk).
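The ballooning and swap checks above can be written down as a small triage helper. This is a rough sketch rather than a tool: the function names are invented for the example, the inputs are values you would read from the esxtop memory screen yourself, and the only threshold used is the 5% SWPWT figure from the list.

```python
# Illustrative only: names are hypothetical; inputs are read from esxtop by hand.

def balloon_state(mctltgt_mb: float, mctlsz_mb: float) -> str:
    """Compare the balloon target (MCTLTGT) with its current size (MCTLSZ)."""
    if mctltgt_mb > mctlsz_mb:
        return "inflating"
    if mctltgt_mb < mctlsz_mb:
        return "deflating"
    return "steady"


def memory_warnings(mctltgt_mb: float, mctlsz_mb: float,
                    swcur_mb: float, swpwt_pct: float,
                    swpwt_limit: float = 5.0) -> list[str]:
    """Collect the memory symptoms worth following up for one VM."""
    warnings = []
    if mctlsz_mb > 0 or mctltgt_mb > 0:
        state = balloon_state(mctltgt_mb, mctlsz_mb)
        warnings.append(
            f"balloon is {state} (size {mctlsz_mb} MB, target {mctltgt_mb} MB)")
    if swcur_mb > 0:
        warnings.append(f"{swcur_mb} MB currently swapped by the host")
    if swpwt_pct >= swpwt_limit:
        warnings.append(f"SWPWT {swpwt_pct}% is at or above {swpwt_limit}%")
    return warnings


# Hypothetical VM: balloon still growing, some swap in use, waiting on swap-in.
for line in memory_warnings(mctltgt_mb=512, mctlsz_mb=256, swcur_mb=128, swpwt_pct=7.5):
    print(line)
```

A VM that returns an empty list here is not necessarily healthy, it just isn't showing any of the ballooning or swap symptoms covered in this section.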

3) Using esxtop, investigate storage (press ‘u’ for per-device/LUN stats or ‘d’ for per-HBA stats):

  1. Investigate DAVG – represents the round-trip time between the HBA and storage, should ideally be less than 30ms
  2. Investigate KAVG – represents the latency added by the VMkernel
  3. Investigate GAVG – represents the round trip for IO requests as seen by the guest (roughly DAVG plus KAVG); again lower is better, ideally less than 30ms.
  4. Check CONS/s – this indicates SCSI reservation conflicts generated by metadata updates on the same LUN at a given time.
  5. vscsiStats (more info here) will report statistics per VMDK/RDM
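To show how the three latency counters fit together, here is a minimal sketch. It assumes the usual relationship that GAVG is roughly DAVG plus KAVG, and it uses the 30ms figure from the list above; the function name and the sample numbers are hypothetical.

```python
# Illustrative only: the function name and sample numbers are hypothetical.
# Assumes GAVG is approximately DAVG + KAVG.

LATENCY_LIMIT_MS = 30.0  # the "ideally less than 30ms" figure from the list above


def assess_latency(davg_ms: float, kavg_ms: float,
                   limit_ms: float = LATENCY_LIMIT_MS) -> tuple[float, list[str]]:
    """Return the estimated GAVG and a list of findings for one device."""
    gavg_ms = davg_ms + kavg_ms
    findings = []
    if davg_ms >= limit_ms:
        # Time is being lost between the HBA and the storage itself.
        findings.append("DAVG high: look at the array, fabric or HBA")
    elif gavg_ms >= limit_ms:
        # The device is fine on its own; the extra latency comes from the VMkernel.
        findings.append("GAVG high but DAVG ok: latency is being added in the VMkernel (KAVG)")
    return gavg_ms, findings


# Hypothetical devices: name -> (DAVG ms, KAVG ms)
devices = {"lun01": (35.0, 0.5), "lun02": (20.0, 15.0), "lun03": (5.0, 0.25)}
for name, (davg, kavg) in devices.items():
    gavg, findings = assess_latency(davg, kavg)
    print(name, f"GAVG~{gavg:.1f}ms", findings or "ok")
```

Splitting the total this way tells you where to look first: a high DAVG points outside the host, while a large KAVG share points at queuing inside it.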

4) Finally, consider the network subsystem:

  1. Check bandwidth availability
  2. Using esxtop, check %DRPTX and %DRPRX. If the latter is high, consider increasing the Rx buffer from Device Manager on the VM (yes, that is Windows only; Linux guests will need their own equivalent configuration)

If all else fails, check the advisories for your hardware platform. I've run into issues in the past that were specific to device firmware, so don't rule out the simplest of things.

UPDATE 22/02/2010 : Check out the new esxtop article here for further performance troubleshooting tips.