VMware VCB Troubleshooting ‘non-zero return code’

I recently came across the following error when running a VCB backup via HP Data Protector 6.0:


“Creating a quiesced snapshot failed because the (user-supplied) custom pre-freeze script in the virtual machine exited with a non-zero return code”

The following steps resolved the issue:
  > Perform a ‘repair’ of the VMware Tools installation on the virtual machine
  > Restart the virtual machine

The error is generated because VCB cannot execute the pre-freeze script inside the VM, which in this case was caused by a problem with the VMware Tools installation on the guest.

VMware : Capturing Performance Statistics

The following process will allow you to capture Windows Performance Monitor compatible CSV files from any ESX server using the ‘esxtop’ utility, which is an integral part of VMware ESX.

First we must create a couple of script files, the first being ‘ftp.sh’. I have created the scripts on a datastore which houses NO virtual servers. Be careful where you place this data, as filling up a datastore that holds VMs will stop those VMs working. You will need to modify the paths in the script to ensure it works in your environment; they are simply the path where the script files are located and the path where the CSV files will be generated.

This script will take the CSV file and ‘trim’ it down to the stats we require; by default esxtop generates an insanely large CSV file. Once trimmed, it will upload the CSV file to an FTP server of your choice and finally gzip and archive the file for future reference.

#!/bin/bash
# Every 24 hours, FTP the previous day's stats
date +%R

# HOST, USER and PASS must be set to your FTP server, login and password before the FTP step below.

# Perform streamlining of CSV file
dm1=$(date --date='1 day ago' +%Y-%m-%d)
# Braces are required: $HOSTNAME_ on its own would be read as a variable named HOSTNAME_
# Column 1 (the timestamp) is always kept; the backtick command works out the numbers of the other columns to keep.
cat /vmfs/volumes/LOCAL_ATTACHED/esxtop/${HOSTNAME}_$dm1.csv | cut -d "," -f 1,`head -1 /vmfs/volumes/LOCAL_ATTACHED/esxtop/${HOSTNAME}_$dm1.csv | tr "," "\12" | egrep -n "\\\\\Memory\\\\\Free MBytes|Physical Disk\(vmhba1\)\\\\\Reads/sec|Physical Disk\(vmhba1\)\\\\\Writes/sec|Physical Disk\(vmhba1\)\\\\\MBytes Written|Physical Disk\(vmhba1\)\\\\\MBytes Read|\\\\\Physical Disk\(vmhba2\)\\\\\Reads/sec|Physical Disk\(vmhba2\)\\\\\Writes/sec|Physical Disk\(vmhba2\)\\\\\MBytes Written|Physical Disk\(vmhba2\)\\\\\MBytes Read|Physical Disk\(vmhba2\)\\\\\Commands/sec|Physical Disk\(vmhba1\)\\\\\Commands/sec|Physical Cpu\(_Total\)" | cut -d ":" -f 1 | tr "\12" ","` > /vmfs/volumes/LOCAL_ATTACHED/esxtop/trim_${HOSTNAME}_$dm1.csv

# Delete line 2 of the trimmed file, then replace the original CSV with the trimmed copy
sed -i".bak" "2d" /vmfs/volumes/LOCAL_ATTACHED/esxtop/trim_${HOSTNAME}_$dm1.csv
rm -f /vmfs/volumes/LOCAL_ATTACHED/esxtop/trim_${HOSTNAME}_$dm1.csv.bak
rm -f /vmfs/volumes/LOCAL_ATTACHED/esxtop/${HOSTNAME}_$dm1.csv
mv /vmfs/volumes/LOCAL_ATTACHED/esxtop/trim_${HOSTNAME}_$dm1.csv /vmfs/volumes/LOCAL_ATTACHED/esxtop/${HOSTNAME}_$dm1.csv

# Connect to FTP Server
ftp -inv $HOST << EOF
user $USER $PASS
lcd /vmfs/volumes/LOCAL_ATTACHED/esxtop
put ${HOSTNAME}_$dm1.csv
bye
EOF

# GZIP and archive stats
gzip /vmfs/volumes/LOCAL_ATTACHED/esxtop/${HOSTNAME}_$dm1.csv
mv /vmfs/volumes/LOCAL_ATTACHED/esxtop/${HOSTNAME}_$dm1.csv.gz /vmfs/volumes/LOCAL_ATTACHED/esxtop/archive/
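The one-line column selection in ftp.sh is hard to read, so here is a minimal sketch of the same idea on a tiny sample file. The sample header and values are made up for illustration (they only mimic esxtop's counter naming), and `paste` is used to join the column numbers in place of the second `tr` call, purely to keep the sketch short.

```shell
# Hypothetical sample data; real esxtop files have thousands of columns.
sample=$(mktemp)
cat > "$sample" <<'EOF'
"Time","\\host\Memory\Free MBytes","\\host\Memory\Kernel MBytes","\\host\Physical Cpu(_Total)\% Util Time"
"00:00","1024","300","12"
"00:01","1000","310","15"
EOF

# Put each header field on its own line, number the lines, and keep the
# numbers of the fields whose name matches one of the wanted counters.
cols=$(head -1 "$sample" | tr ',' '\n' \
  | grep -nE 'Free MBytes|Physical Cpu\(_Total\)' \
  | cut -d: -f1 | paste -sd, -)

# Keep the timestamp column (1) plus every matched column.
trimmed=$(cut -d, -f "1,$cols" "$sample")
echo "$trimmed"

rm -f "$sample"
```

This only works because the header fields themselves contain no commas; a counter name with an embedded comma would throw the column numbers off.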

Secondly, create the ‘capturestats.sh’ script, which will launch esxtop and capture the statistics you require. Again, modify the path to suit your environment. This script will capture stats every 60 seconds, 1439 times: there are 1440 minutes in a day and we want the script to start again at midnight, so this script will run from 00:00 to 23:59.

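#!/bin/bash
# capturestats.sh
today=$(date +%Y-%m-%d)
# There are 1440 minutes in a day; we want to capture 00:00 to 23:59, so we specify 1439 captures at 60 second intervals.
# -b runs esxtop in batch mode, which is what produces CSV output. The EUVMTST1 prefix must match $HOSTNAME as used by ftp.sh.
esxtop -b -d 60 -n 1439 -c /root/.esxhoststats >> /vmfs/volumes/LOCAL_ATTACHED/esxtop/EUVMTST1_$today.csv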
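If you want a different sampling interval, the sample count has to change with it. The following is a small sketch of that arithmetic; the variable names are my own, and the run time it prints is only approximate because it ignores the time esxtop takes to start up and write each sample.

```shell
interval=60                                   # seconds between samples (esxtop -d)
seconds_per_day=$(( 24 * 60 * 60 ))
samples=$(( seconds_per_day / interval - 1 )) # stop one sample short of a full day (esxtop -n)
run_seconds=$(( samples * interval ))

summary=$(printf 'samples=%d duration=%02d:%02d' \
  "$samples" $(( run_seconds / 3600 )) $(( run_seconds % 3600 / 60 )))
echo "$summary"
```

Stopping one sample short leaves a gap before midnight so that the previous run has finished by the time cron starts the next one.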

Next, create the esxtop config file at /root/.esxhoststats. This will ensure that we capture only what we need: CPU stats, memory usage and disk I/O stats. You can modify your own config file to meet your own requirements.


Finally, under the root account context (accessed via sudo su -) execute the ‘crontab -e’ command. Add the following lines to the file:

00 00 * * * /vmfs/volumes/LOCAL_ATTACHED/esxtop/capturestats.sh >/dev/null
00 01 * * * /vmfs/volumes/LOCAL_ATTACHED/esxtop/ftp.sh >/dev/null

This will cause the capturestats.sh script to run at midnight every day and the ftp.sh script to run at 01:00 every day.
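The two cron jobs only work together if the file name written by capturestats.sh on one day is the same name ftp.sh looks for the next day at 01:00. Here is a minimal sketch of that hand-off, assuming GNU date; the fixed date and the `host` variable are hypothetical values chosen for the demonstration.

```shell
host='EUVMTST1'
run_day='2010-02-22'    # hypothetical day on which capturestats.sh ran

# Name written by capturestats.sh. The braces matter: without them the
# shell would look for a variable called "host_" and expand it to nothing.
written="${host}_${run_day}.csv"

# Name ftp.sh computes when cron starts it on the following day.
next_day=$(date --date="$run_day 1 day" +%Y-%m-%d)
wanted="${host}_$(date --date="$next_day 1 day ago" +%Y-%m-%d).csv"

echo "written=$written wanted=$wanted"
```

If the two names ever differ (for example because the prefix in capturestats.sh does not match the host name), ftp.sh will find nothing to trim or upload.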

VMware : VCB Cleanup Script

VCB is a powerful and useful technology for backing up virtual machines running on ESX/vSphere. More than once it has rescued an entire virtual machine where a standalone server would have had to be rebuilt.

My only fault with the ‘vanilla’ installation is the cleanup process. When a VCB backup fails, or even at times when it succeeds, the snapshot is not cleaned up on the VCB proxy server or the ESX host. This can result in wasted storage on both systems.

VCB includes a cleanup script which will clean directories in the VCB root folder as specified in the following file: C:\Program Files\VMware\VMware Consolidated Backup Framework\config\config.js

This script works well at cleaning up folders that VCB creates dynamically in the root folder. However, if you have created subdirectories for individual VCB tasks, the script will not automatically clean up the VCB snapshots in them. Why do I create a sub-folder per VCB job? Because when using HP Data Protector to perform VCB backups I have to specify an existing folder to back up to tape. If I specified the VCB root folder and a VCB job failed without its data being cleaned up, the next backup would be significantly larger, as it would contain the data for both the failed VCB snapshot and the new one.

I have around 50 VMs, so manually editing the VCB config.js file to reflect the root VCB directory for each backup was a no-no.

You’ll find a link below to a modified version of the vcb-cleanup.wsf file included with the VCB proxy package. Copy this file to the ‘C:\Program Files\VMware\VMware Consolidated Backup Framework\generic’ directory.

Modify the C:\Program Files\VMware\VMware Consolidated Backup Framework\config\config.js file to reflect the root container for all of your VCB folders.

Follow the steps outlined below to perform VCB cleanup on all subfolders:

  1. Ensure no VCB backups are in progress
  2. Open a command prompt and cd to “C:\Program Files\VMware\VMware Consolidated Backup Framework\generic”
  3. Execute the following command: cscript.exe vcb-cleanup-mod.wsf "C:\Program Files\VMware\VMware Consolidated Backup Framework" -y

Download the script from here. Note: you will need to rename the file from ‘.txt’ to ‘.wsf’.

In order for VCB to work under Windows 2008 R2 x64 you must configure vcbMounter.exe to “Run as Administrator”.

VMware : Troubleshooting VM Performance

VMware Troubleshooting VM Performance in ESX/vSphere

Firstly, we’re going to be using esxtop, which can be executed from the local console or an SSH session on ESXi, together with vCenter.

General guidelines:

  1. Physical : vCPU ratio should be around 1:5
  2. Avoid memory oversubscription in production environments
  3. Reserve memory for virtualised SQL servers using Lock Pages in Memory
  4. Reserve memory for virtualised RDS/Citrix Servers
  5. Always create datastores using the vSphere client; this ensures that the VMFS partition is aligned, which reduces IO and potentially increases performance.
  6. Align guest data disks (for Windows, any version earlier than 2008 must be manually aligned); this reduces IO and potentially increases performance.
  7. Look to keep the number of VMs per datastore/LUN to around 10-15; this will help to reduce SCSI reservation contention.
  8. vCPUs: less is more. From my own testing I’ve found that Citrix servers perform better with 2 vCPUs than with 4. This is better not only for my users but also for the ESXi hosts, as there is less co-scheduling.

Let’s begin…

1) Investigate CPU contention/exhaustion using esxtop (press ‘c’ from esxtop, and shift-v for per-VM stats only):

  1. Check host PCPU usage using esxtop
  2. Look at %RDY; if this is equal to or greater than 10% there is a performance issue, which can indicate CPU contention.
  3. Look at %MLMTD; if this is high it indicates that a CPU limit is being imposed on the VM. %RDY minus %MLMTD gives a true indication of CPU contention.
  4. If %RDY is truly above 10%, the first step is to lower the number of active vCPUs configured on the ESX/vSphere server; after that, look at reducing the number of VMs on the server.
  5. Investigate co-scheduling/SMP related issues: are VMs using all presented vCPUs? From esxtop press ‘c’ then ‘e’, then take a look at %CSTP. High values could indicate an issue, as this represents the overhead in co-scheduling CPUs from a co-stopped to a co-started state.

For example, with 16 physical cores and the 1:5 guideline above, the total number of vCPUs defined across all active VMs should not exceed 80.
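Both CPU rules of thumb are simple arithmetic, sketched below. The %RDY and %MLMTD readings are hypothetical numbers rather than measurements, and awk is used only because the shell cannot do decimal arithmetic on its own.

```shell
# vCPU budget from the 1:5 physical-to-virtual guideline
cores=16
ratio=5
max_vcpus=$(( cores * ratio ))

# Hypothetical esxtop readings for one VM, in percent
rdy=14.2
mlmtd=6.0

# Ready time that is not explained by a CPU limit on the VM
contention=$(awk -v r="$rdy" -v m="$mlmtd" 'BEGIN { printf "%.1f", r - m }')

# Apply the 10% threshold to the adjusted figure
verdict=$(awk -v c="$contention" 'BEGIN { if (c >= 10) print "contention"; else print "ok" }')

echo "max_vcpus=$max_vcpus contention=${contention}% verdict=$verdict"
```

In this made-up case the raw %RDY of 14.2 looks alarming, but most of it comes from the limit, so the adjusted figure stays under the threshold.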

2) Investigate memory usage via esxtop and vCenter (press ‘m’ from esxtop, and ‘shift-v’ for per-VM stats only):

  1. Check host memory utilisation using esxtop
  2. For full-fat ESX only – the service console may be low on RAM, you can adjust this by following these instructions: http://www.vmware.com/pdf/esx_performance_tips_tricks.pdf
  3. Watch out for memory ballooning, which can have a significant impact on VM performance. You can track ballooning in vCenter and esxtop: MCTLTGT is the VMkernel’s desired balloon size, MCTLSZ is the actual size. If the target is greater than the size the balloon is inflated; if it is smaller the balloon is deflated. VM memory limits can also trigger ballooning.
  4. Transparent Page Sharing (TPS) allows the host to share identical memory pages between VMs on the host; it is only used when memory resources are low/overcommitted.
  5. Check esxtop for SWCUR (swap currently in use) and SWTGT (the swap size the VMkernel is aiming for). If SWTGT is greater than SWCUR, more swapping will take place. Swapping is slow so should be avoided at all costs. If swapping is unavoidable use SSDs: there’s a -12% degradation with local SSD versus -69% for Fibre Channel and -83% for local SATA storage (more information here).
  6. SWPWT represents the amount of time a virtual machine is waiting for memory to be swapped in and should always be below 5%
  7. SWR/SWW represent swap reads/writes from disk to memory and vice versa.
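The balloon and swap-wait checks above boil down to two comparisons. The sketch below uses hypothetical readings, and the `direction` helper is my own name, not something esxtop provides.

```shell
# direction TARGET CURRENT -> inflate | deflate | steady
direction() {
  awk -v t="$1" -v c="$2" 'BEGIN {
    if (t > c)      print "inflate"
    else if (t < c) print "deflate"
    else            print "steady"
  }'
}

# Hypothetical readings in MB: MCTLTGT (target) and MCTLSZ (actual size)
balloon=$(direction 512.00 256.00)

# Hypothetical SWPWT reading in percent, checked against the 5% guideline
swpwt=3.5
swpwt_ok=$(awk -v w="$swpwt" 'BEGIN { print ((w < 5) ? "yes" : "no") }')

echo "balloon=$balloon swpwt_ok=$swpwt_ok"
```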

3) Using esxtop investigate storage (press ‘u’ for per-datastore or ‘d’ for per-HBA stats):

  1. Investigate DAVG: this represents the round-trip time between the HBA and storage, and should ideally be less than 30ms
  2. Investigate KAVG: this represents the latency introduced by the VMkernel
  3. Investigate GAVG: this represents the round trip for IO requests sent from the host to storage; again lower is better, ideally less than 30ms
  4. Check CONS/s: this indicates SCSI reservation conflicts generated by metadata updates on the same LUN at a given time
  5. vscsiStats (more info here) will report per-VMDK/RDM statistics
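As a rough sketch of how the three latency figures relate: GAVG is approximately DAVG plus KAVG, so a high GAVG with a low DAVG points at the VMkernel rather than the array. That relationship is my own addition rather than something stated above, the readings are hypothetical, and `check` is just a helper name for this example.

```shell
# check NAME VALUE_MS -> "NAME VALUE ms ok|high", using the 30ms guideline
check() {
  awk -v n="$1" -v v="$2" 'BEGIN { printf "%s %.1f ms %s\n", n, v, (v < 30 ? "ok" : "high") }'
}

# Hypothetical esxtop readings in milliseconds
davg=22.4
kavg=1.3

# Approximate end-to-end latency as device latency plus kernel latency
gavg=$(awk -v d="$davg" -v k="$kavg" 'BEGIN { printf "%.1f", d + k }')

davg_line=$(check DAVG "$davg")
gavg_line=$(check GAVG "$gavg")
echo "$davg_line"
echo "$gavg_line"
```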

4) Finally, consider the network subsystem:

  1. Check bandwidth availability
  2. Using esxtop check %DRPTX and %DRPRX; if the latter is high consider increasing the Rx buffer from Device Manager on the VM (this applies to Windows guests only; Linux guests are configured differently)

If all else fails, check advisories for your hardware platform. I’ve run into issues in the past that were specific to device firmware, so don’t rule out the simplest of things.

UPDATE 22/02/2010 : Check out the new esxtop article here for further performance troubleshooting tips.

VMware VCB : Improving Performance of VCB

VCB Backup Essentials

Having recently introduced VCB backups into Data Protector 6.0, I thought I would share a few useful tips for ensuring that backup speeds are as fast as possible.

1) Ensure that all VMs have a scheduled task to zero out free space prior to VCB running. When you delete a file, Windows does not zero out the disk space (populate the data blocks with zeros). So if you had a 20GB drive containing 15GB of data and then deleted 10GB of it, the backup would still be 15GB unless you zero out the freed space.

I use the free ‘SDELETE’ tool from Sysinternals (now Microsoft) to do this, and simply execute it as a scheduled task before the backup is due to run. SDELETE can be found here: http://technet.microsoft.com/en-us/sysinternals/bb897443.aspx

2) When running VCB, check the disk queue performance counter on the VCB Proxy server; the storage to which the VCB snapshot is taken can be a serious bottleneck for VCB performance. Initially I was running VCB over fibre to a fibre-attached SAN disk. I found that after 1.97GB the backup would grind to a halt: 200Kb/sec! By changing the VCB snapshot drive to local RAID0 storage this increased to over 2.2GB/min, or 37.5MB/sec. Your hardware may be capable of significantly faster speeds.

3) Disable additional disk paths on the VCB Proxy Server: VCB does not like MPIO/multiple paths to LUNs. This step is probably the biggest potential speed gain you’ll get. Disable the additional disk objects in Windows Device Manager and test your backups once complete; if they don’t work, re-enable the path you disabled and disable a different one. This can yield speed improvements of 100MB/sec.

4) Run multiple VCB snapshots at the same time. Your SAN containing the VMs will, more than likely, support more than 35MB/sec. Just ensure you change the snapshot directory for each job, otherwise your backup application may back up multiple snapshots at once!
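To compare your own numbers with the figures in tip 2, the conversion from GB/min to MB/sec is sketched below, along with how long the 15GB example from tip 1 would take at that rate. It assumes 1GB = 1024MB, and the variable names are my own.

```shell
gb_per_min=2.2    # throughput measured with the local RAID0 snapshot drive
backup_gb=15      # size of the example backup from tip 1

# GB/min -> MB/sec
mb_per_sec=$(awk -v g="$gb_per_min" 'BEGIN { printf "%.1f", g * 1024 / 60 }')

# Minutes needed to move the example backup at that rate
minutes=$(awk -v s="$backup_gb" -v g="$gb_per_min" 'BEGIN { printf "%.1f", s / g }')

echo "${mb_per_sec}MB/sec, ${minutes} minutes for ${backup_gb}GB"
```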