HP BL465c G7 : ESXi 4.1 Issues

Having just built a new ESXi 4.1 cluster made up of BL465c G7s, we’ve run into some interesting issues that I thought I would share.

Six BL465c G7 servers have been installed in 2 new C7000 G2 enclosures, each with 2x Flex-10 Virtual Connect Modules and 2x 20-port 8Gb Fibre Channel Virtual Connect Modules. Servers are split 3 per chassis.

In brief the issues are as follows:

  1. PSOD on reboot and random crashes with PSOD
  2. Corrected Memory Error threshold exceeded
  3. Hang on reboot
  4. ILO Duplicate SSL Certificate Serial Number

 

 Hardware 

  BL465c G7
  2x AMD Opteron 6172 12-core CPU
  96GB RAM (12x 8GB 1333MHz)
  2 x 60GB SSD
  ESXi 4.1 (HP Part Number 583772-007)

Firmware Version

All BL465c G7 ESXi blades have:
•    BIOS vA19 Sept 30 2010 (previously Jul 23 2010)
•    ILO3 v1.15 Oct 22 2010 (previously v1.10)
•    CNA v2.102.517.7 (previously 2.102.453.0, then 2.102.517.6)
•    P410i v3.50

Chassis firmware:
•    Onboard Administrator 3.21 (previously 3.20)
•    Flex10 Modules – 3.15 (previously 3.10)
•    FC Modules – 1.41 (previously 1.40)

CNA Driver Version

The following versions have been tried:

•    2.102.404, 2.102.440, 2.102.486, 2.102.518.0, 2.102.518.0 patch1 (unsigned), currently 2.102.554.0

ESXi Versions

The HP OEM build is in use, as the vanilla VMware build fails to load the installer. Build ID is 260247.

Issues

#1 – PSOD on reboot and random crashes with PSOD – RESOLVED

The PSOD error was always a #PF Exception 14 and mentions benet_vlan_rem_vid. From this you can easily ascertain that it is an issue with the CNA driver. To confirm the driver version you are currently using, enable either the local support console or SSH and enter the following command: vsish -e get /net/pNics/vmnic0/properties | grep "Driver Version"

The version built into the HP OEM version of ESXi 4.1 is 2.102.404. The 2.102.440 version available on the VMware site simply made the PSOD issue worse.

A new Emulex CNA driver (for the onboard Converged Network Adapter) is available directly from VMware which resolves the PSOD issue. At the time of writing this is version 2.102.554.0 (previously 2.102.486, then 2.102.518.0). This is a signed driver (the earlier 2.102.486 release was unsigned) and must be installed using the following process:

  1. Download and install the VMware Remote CLI: http://www.vmware.com/support/developer/vcli/
  2. Obtain the new CNA driver from HP support.
  3. Launch the CLI from the start menu
  4. Execute the command for the driver version you are installing (modifying the path to the CNA driver and the server name):
    • 2.102.486.0 (unsigned): vihostupdate.pl --server server01 --install --bundle "C:\offline-bundle.zip" --bulletin SVE-be2net-2.102.486.0 --nosigcheck
    • 2.102.518.0: vihostupdate.pl --server server01 --install --bundle "C:\offline-bundle.zip" --bulletin SVE-be2net-2.102.518.0
    • 2.102.554.0 (current): vihostupdate.pl --server server01 --install --bundle "C:\SVE-be2net-2.102.554.0-offline_bundle-347594.zip" --bulletin SVE-be2net-2.102.554.0

Check the version that you are running using the following command:

  • vsish -e get /net/pNics/vmnic0/properties | grep "Driver Version"
  • vsish -e get /net/pNics/vmnic0/properties | grep "Driver Firmware Version"
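
When checking several hosts, it can help to compare the reported version against the fixed release rather than eyeballing it. The following is a minimal sketch only: the sample line fed to extract_driver_version is an assumed output format (the real vsish output may look different), and both function names are made up for illustration. It only handles purely numeric dotted versions, so a string such as ‘518.0.elx.patch1-1’ would need different handling.

```shell
#!/bin/sh
# Pull the dotted version number out of a "Driver Version" line read from stdin.
# The exact layout of the vsish output is an assumption; the pattern just takes
# the first run of digits and dots that follows the words "Driver Version".
extract_driver_version() {
    sed -n 's/.*Driver Version[^0-9]*\([0-9][0-9.]*\).*/\1/p'
}

# Return 0 when version $1 is greater than or equal to version $2.
# Fields are compared numerically, left to right; missing fields count as 0.
version_ge() {
    vg_a=$1
    vg_b=$2
    while [ -n "$vg_a" ] || [ -n "$vg_b" ]; do
        vg_fa=${vg_a%%.*}
        vg_fb=${vg_b%%.*}
        # Drop the field just read; clear the string after the last field.
        if [ "$vg_a" = "$vg_fa" ]; then vg_a=""; else vg_a=${vg_a#*.}; fi
        if [ "$vg_b" = "$vg_fb" ]; then vg_b=""; else vg_b=${vg_b#*.}; fi
        vg_fa=${vg_fa:-0}
        vg_fb=${vg_fb:-0}
        if [ "$vg_fa" -gt "$vg_fb" ]; then return 0; fi
        if [ "$vg_fa" -lt "$vg_fb" ]; then return 1; fi
    done
    return 0
}

# Example with a hypothetical captured line instead of a live host.
installed=$(printf 'Driver Version:2.102.518.0\n' | extract_driver_version)
if version_ge "$installed" 2.102.554.0; then
    echo "driver $installed is at or above 2.102.554.0"
else
    echo "driver $installed is older than 2.102.554.0"
fi
```

On a host, the sample line would be replaced by the output of the vsish command above piped into extract_driver_version.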

#2 – Corrected Memory Error threshold exceeded (Processor 2, Memory Module 2) – RESOLVED

There are also multiple alarm entries within vCenter for ‘Host memory status’.

We went through changing DIMMs, the system board and the CPU as part of the troubleshooting process. None of these seemed to help the problem. I then came across the following community post that got me thinking: http://communities.vmware.com/thread/221222

I implemented the steps below, and HP has since confirmed them as a workaround for this issue, which is caused by the 8GB DIMMs and the default power management options. Specifically, this affects 8GB DDR3 DIMMs only.

  1. Reboot the server and enter the ROM-Based Setup Utility (RBSU)
  2. Select Power Management Options
  3. Select HP Power Profile
  4. Select Maximum Performance
  5. Verify that HP Power Regulator is now set to HP Static High Performance Mode

I have also been informed that the following will resolve this issue (thanks to Mads Kirkegaard):

  1. Select Power Management Options
  2. Select Advanced Power Management Options
  3. Select Minimum Processor Idle Power State
  4. Select No C-states (Default is C1E State (AMD C1 Clock Ramping))

If this resolves your issue, contact your HP support specialist for further details and ensure you advise them of the workaround used. To be clear, this is a hardware issue; the above is a workaround that disables the trigger for the hardware issue. Note that the above settings are also best practice for an ESX/ESXi host.

#3 – Hang on reboot – RESOLVED

The shutdown/reboot process for the BL465c G7 blades always fails with the server hanging. Without the new CNA driver (version 2.102.486) you will get a PSOD (#PF Exception 14) approximately 75% of the time; with it installed you will simply get a hang/crash.

On further analysis, looking at the console output from the ILO in tech support mode, you can see the reboot process gets stuck at ‘Requesting system reboot’. Comparing this to a BL460c G6 blade, I can see this is the last output before the server resets.

We have tested disabling USB support and the serial port in the BIOS; this made no difference. Setting the HP Power Profile to OS Control Mode did not resolve this either.

Use the following commands to check driver/firmware versions:

  • vsish -e get /net/pNics/vmnic0/properties | grep "Driver Version"
  • vsish -e get /net/pNics/vmnic0/properties | grep "Driver Firmware Version"

Update 23/11/2010: I have identified that when the Virtual Connect profile is not attached to the server it will reboot/shutdown without issue. Whilst the server is useless in this state, this will hopefully help identify the root cause of the problem.

Update 07/12/2010: New firmware and BIOS out. CNA firmware version 2.102.517.6 and ESXi driver version 2.102.518.0. Issue still remains. One thing I have found is that if the following commands are executed in order the server will shut down:

  1. /sbin/services.sh stop
  2. /sbin/esxcfg-module -u -f be2net
  3. reboot
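
The three commands above could be wrapped in a small script so they are always run in the same order. This is a minimal sketch, assuming it is run from the tech support console; the DRY_RUN variable and the function names are made up for illustration, and by default it only prints what it would do. It stops at the first failing step, on the assumption that a failed module unload is better investigated than followed by a reboot that may hang.

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) prints each command instead of executing it.
DRY_RUN=${DRY_RUN:-1}

# Run a command, or just print it when in dry-run mode.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Stop services, force-unload the be2net driver, then reboot.
# Each step only runs if the previous one succeeded.
clean_reboot() {
    run /sbin/services.sh stop &&
    run /sbin/esxcfg-module -u -f be2net &&
    run reboot
}

clean_reboot
```

Setting DRY_RUN=0 before running the script would execute the commands for real.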

I have had reported cases from New Zealand and Turkey so this is not a unique issue.

Update 23/12/2010: New firmware out (2.102.517.7); it does not resolve the issue. Also tested with 32GB RAM instead of 96GB RAM as requested by VMware and HP, with no difference. Had an interesting discussion with VMware/HP today which points at the Emulex CNA driver being the cause: an issue with interrupts not being disabled on shutdown. A debug driver has been tested on full ESX, and when the driver is instructed to disable interrupts on shutdown the server reboots. This may also point to more issues with 12-core CPUs than 8-core ones, simply due to the increased number of interrupts.

We’re hoping to get a pre-release version in the next few days, designed specifically to deal with this issue.

Update 24/12/2010: Tests of the pre-release driver version ‘518.0.elx.patch1-1’ have proven to be successful. This driver should be available from your support partner for testing. I’ll update this post when the official driver has been released.

Update 21/01/2011 (previously 07/01/2011): The most recent update I have is that the official release is still on schedule for the first week of February. Note that running the unsigned driver is not a supported solution. I’ve also received similar reports from Australia and Germany.

Update 07/02/2011: The new driver has been released, version 2.102.554.0, and is publicly available from the following location: http://downloads.vmware.com/d/details/esxi4x_emulex_blade_10gb_dt/ZHcqYnRkdCVidGR3

To install:

  1. Download the driver and extract the SVE-be2net-2.102.554.0-offline_bundle-347594.zip file from the ‘offline-bundle’ folder. Copy this to C:\.
  2. Place the host you wish to update in maintenance mode
  3. Confirm the bulletin ID by listing available packages, changing the server name as appropriate:
    • vihostupdate.pl --server server01 --list --bundle "C:\SVE-be2net-2.102.554.0-offline_bundle-347594.zip"
  4. Use the rCLI to install the driver using the following command, changing the server name as appropriate:
    • vihostupdate.pl --server server01 --install --bundle "C:\SVE-be2net-2.102.554.0-offline_bundle-347594.zip" --bulletin SVE-be2net-2.102.554.0
  5. Reboot the server – it will still hang as the driver is not in use at this stage.
  6. Perform a test reboot – it should work as the driver is now in use!
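
With six blades to update, the list and install commands from the steps above can be generated per host rather than retyped. This is only a sketch: it prints the commands instead of running them, the host names other than server01 are hypothetical, and the function name is made up for illustration. Maintenance mode and the two reboots still have to be handled per host as described.

```shell
#!/bin/sh
# Bundle path and bulletin ID as used in the install steps.
BUNDLE='C:\SVE-be2net-2.102.554.0-offline_bundle-347594.zip'
BULLETIN='SVE-be2net-2.102.554.0'

# Print the --list and --install commands for every host given as an argument.
# Nothing is executed; the output is meant to be reviewed and pasted into the rCLI.
print_install_commands() {
    for host in "$@"; do
        printf 'vihostupdate.pl --server %s --list --bundle "%s"\n' "$host" "$BUNDLE"
        printf 'vihostupdate.pl --server %s --install --bundle "%s" --bulletin %s\n' "$host" "$BUNDLE" "$BULLETIN"
    done
}

# server02 and server03 are placeholder names.
print_install_commands server01 server02 server03
```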

Final fix now available (this supersedes the earlier pre-release fix).

#4 – ILO Duplicate SSL Certificate Serial Number – IN PROGRESS

No fix as of yet, we’re troubleshooting as you read this.

#5 – Network Connectivity Failure – IN PROGRESS

We had an issue last week where the management/vMotion NICs decided that network connectivity was no longer required. This caused VMs to fail over via VMware HA and therefore interrupted live applications. We’re testing the new driver released last week (05/02/2011) to see if it resolves this issue.

No fix as of yet, we’re troubleshooting as you read this.