ESXi Hosts Not Responding APD – PowerPath PPVE 5.7 SP3 Build 173 and SRM 5


I recently uncovered an issue with SRM 5 and the latest released version of PowerPath for ESXi, PPVE 5.7 SP3 Build 173, where PowerPath does not handle detached devices properly after an SRM failover.

This is a known issue with SRM and PowerPath, documented in VMware KB2016510: ‘SRM TestFailover cleanup operation times out and fails with the error: Cannot detach SCSI LUN. Operation timed out: 900 seconds.’

That wasn’t the exact operation we had been performing, though. In the week preceding the incident we had been running Planned Migrations rather than Test Failovers, and SRM reported no errors. In this post I want to document our symptoms, so if you have a vBlock and SRM and you notice hosts becoming disconnected in vCenter, don’t panic… read on!

We had been running SRM 5 for a few months, but it seems we reached a tipping point after a period of extensive testing of SRM planned migrations, test failovers and cleanups. While we didn’t see any errors in our cleanup operations like those described in the VMware KB article above, out of the blue our ESXi hosts started to drop out of vCenter.

As we performed Planned Migrations from the Recovery Site to the Protected Site and back again, SRM was unmounting and detaching LUNs, and PowerPath was not handling the detached devices correctly. Over time this caused the ESXi hosts to stop responding within vCenter as they went into an APD (all paths down) state. First it was one host, and the following week it was five. Thankfully the VMs were not affected, but the hosts were completely unresponsive through the DCUI, and the only fix we found was to gracefully shut down the virtual machines via the guest OS and reboot the ESXi hosts. It was a real pain. Troubleshooting was made harder because lockdown mode was enabled and both SSH and the ESXi Shell were disabled.

The good news is that this is a known issue with PowerPath/VE that EMC are aware of. It is detailed in EMC knowledge base article emc284091, “Esxcfg-rescan hangs after unmapping LUN from ESX with Powerpath/VE”. The root cause as per emc284091: ‘This is a known issue with Powerpath/VE 5.7 where it is not handling detached devices properly. Detaching a device results to setting the device state to OFF and Powerpath/VE is not properly handling this state.’

We were advised by VMware not to perform any more SRM failovers until we had installed PowerPath/VE 5.7 P01. Thankfully EMC will supply you with an early beta to resolve the issue, as P01 is only due out in Q3 2012. We were supplied with PowerPath/VE 5.7 P01 b002 and this appears to have solved the problem.

If you want to try to identify the fault yourself, look out for the following error message: ‘PowerPath: EmcpEsxLogEvent:1260:Error:emcp:MpxEsxVolProbe: Device naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx probe already in progress’

There is also a SCSI sense code that you would normally find in vmkernel.log, but in our case we did not see it because I had to reboot the host to gather logs:

WARNING: NMP: nmpDeviceAttemptFailover:599:Retry world failover device "xxxxxxxxxxxxxxx" - issuing command 0x4125001e09c0

The above sense code is the sense byte series. The PPVE hot fix will now recognise it and respond accordingly.
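If you have a vmkernel.log from a support bundle, a short script can save some scrolling. The following is only a rough sketch of how you might count the two signatures quoted above. The function name and the regular expressions are my own assumptions based on the messages as they appear in this post, so adjust them if your log lines are laid out differently.

```python
import re
from collections import Counter

# Patterns are assumptions derived from the two messages quoted in this post.
PROBE_RE = re.compile(
    r"MpxEsxVolProbe: Device (naa\.\S+) probe already in progress"
)
FAILOVER_RE = re.compile(
    r"nmpDeviceAttemptFailover:\d+:\s*Retry world failover device"
)


def scan_vmkernel_log(lines):
    """Return (per-device count of stuck-probe messages, NMP failover retry count)."""
    probes = Counter()
    failovers = 0
    for line in lines:
        match = PROBE_RE.search(line)
        if match:
            # Group 1 is the naa.* device identifier from the PowerPath message.
            probes[match.group(1)] += 1
        elif FAILOVER_RE.search(line):
            failovers += 1
    return probes, failovers
```

To use it, open the log file and pass the file object straight in, for example `probes, failovers = scan_vmkernel_log(open(path, errors="replace"))`, where `path` is wherever you extracted the log. A device that shows up with a large "probe already in progress" count is a likely candidate for the detached LUN that PowerPath is stuck on.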
