ESXi Hosts Not Responding APD – PowerPath PPVE 5.7 SP3 Build 173 and SRM 5


I’ve recently uncovered an issue with SRM 5 and the latest released version of PowerPath for ESXi – PPVE 5.7 SP3 Build 173 – where PowerPath is not handling detached devices properly after an SRM failover.

This is a known issue with SRM and PowerPath documented in VMware KB2016510 – ‘SRM TestFailover cleanup operation times out and fails with the error: Cannot detach SCSI LUN. Operation timed out: 900 seconds.’

This wasn’t the exact operation we had been performing. We had been undertaking Planned Migrations in the week preceding the incident rather than Test Failovers, and there were no errors reported in SRM. In this post I wanted to document our symptoms, so if you have a vBlock and SRM and you notice hosts becoming disconnected in vCenter, don’t panic… read on!

We had been running SRM 5 for a few months, but it seems we recently reached a tipping point after a period of extensive testing of SRM planned migrations, test failovers and clean-ups. While we didn’t have any errors with our cleanup operations as per the above VMware KB article, out of the blue our ESXi hosts started to drop out of vCenter.

As we performed Planned Migrations from the Recovery Site to the Protected Site and back again, SRM was unmounting and detaching LUNs and PowerPath was not handling the detached devices correctly. Over time this caused the ESXi hosts to stop responding in vCenter as they went into an APD (all paths down) state. First it was one host, then the following week it was five. Thankfully the VMs were not affected, but the hosts were completely unresponsive through the DCUI, and we found the only fix was to gracefully shut down the virtual machines via the OS and reboot the ESXi hosts. It was a real pain. Troubleshooting the issue was made harder because lockdown mode was enabled and SSH/ESXi Shell were disabled.

The good news — this is a known issue with PowerPath/VE that EMC are aware of. It is detailed in EMC knowledgebase article emc284091, ‘Esxcfg-rescan hangs after unmapping LUN from ESX with Powerpath/VE’. The root cause as per emc284091: ‘This is a known issue with Powerpath/VE 5.7 where it is not handling detached devices properly. Detaching a device results to setting the device state to OFF and Powerpath/VE is not properly handling this state.’

We were advised by VMware not to perform any more SRM failovers until we had installed PowerPath 5.7 P01. Thankfully EMC will supply you with an early beta to resolve the issue, as P01 is not out until Q3 2012. We were supplied with PowerPath VE 5.7 P01 b002 and this appears to have solved the problem.

If you want to try and identify the fault yourself, look out for the following error message – ‘PowerPath: EmcpEsxLogEvent:1260:Error:emcp:MpxEsxVolProbe: Device naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx probe already in progress’

There is also a SCSI sense code that you will normally find in vmkernel.log, but in our case we did not see it because I had to reboot the host to gather logs:

WARNING: NMP: nmpDeviceAttemptFailover:599:Retry world failover device “xxxxxxxxxxxxxxx” – issuing command 0x4125001e09c0

The above sense code is the sense byte series. The PPVE hotfix will now recognise it and respond accordingly.
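If you have to trawl an exported log bundle yourself, a small script saves some squinting. Here is a minimal Python sketch (my own, not an EMC or VMware tool) that scans a copied vmkernel.log for the two messages above; the file path and regex patterns are assumptions based on the excerpts in this post.

#!/usr/bin/env python3
# Minimal sketch: scan an exported vmkernel.log for the PowerPath/NMP
# messages quoted above. Patterns are assumptions based on this post.
import re
import sys

PATTERNS = [
    re.compile(r"MpxEsxVolProbe: Device naa\.\w+ probe already in progress"),
    re.compile(r"nmpDeviceAttemptFailover:.*Retry world failover device"),
]

def scan(path):
    with open(path, errors="replace") as log:
        for lineno, line in enumerate(log, 1):
            if any(p.search(line) for p in PATTERNS):
                yield lineno, line.rstrip()

if __name__ == "__main__":
    logfile = sys.argv[1] if len(sys.argv) > 1 else "vmkernel.log"
    for lineno, line in scan(logfile):
        print("%d: %s" % (lineno, line))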


Trend Deep Security Warning Message ‘Machine was unprotected during move from one esx host to another’


I wanted to post some more information on this Trend DS error message – ‘Machine was unprotected during move from one esx host to another’ as it seems to come up regularly.

The description of the error message is, ‘a virtual machine was moved to an ESX that does not have an activated Deep Security Virtual Appliance.’

In essence, this warning message is saying that the ESXi host you vMotioned your VM to is not currently protecting the virtual machine.

This can be because there is no virtual appliance on the target ESXi host, or because the Trend virtual appliance is not offering anti-malware protection, is not activated, or is offline.

This error message will not show for unactivated virtual machines — A virtual machine has to be activated to generate this error message.

There is a known bug with this error message too – even though your VM is being protected by the appliance, the error message is always reported as an Agent error. Apparently Trend are working on this.

Back to the error message: When you receive this error message, what is the next step?

Trend is a complicated beast – an appliance can have issues for a number of reasons, and whether there is a fault with the appliance or with one of its dependencies is what you need to figure out. It could be something as basic as the appliance dropping off the network, losing connectivity back to the DSM or to the vShield Endpoint VMkernel port, or possibly it’s no longer activated (not registered as a security appliance in vShield Manager).

If you get this warning message, open the virtual appliance on the host the VM is currently residing on and first ‘Clear Warnings/Errors’ so you remove any old status/error messages, then run ‘Check Status’ to see if there are any new issues. If there are errors reported on the appliance, try and resolve them by following the patented ‘Trend DS Virtual Appliance Health Check’ below.

My main bugbear with Trend is that it is too complicated and it does not report its current state accurately and concisely. When I run a Check Status I want to know exactly what is going on. It would be most useful to have a health check screen on the appliance where the health check tests I mention below are run sequentially in full view for the benefit of the administrator. Issues could be highlighted immediately and it would give us confidence that the appliance and its dependencies are all configured correctly, rather than having to check all the different components individually.

For example, if you check the status of your appliance and it reports back that it is Managed and Online, you would expect it to be managed, online and offering anti-malware protection. In my testing, after I changed the vShield VMkernel IP address on my ESXi host from 169.254.1.1 to 169.254.1.2 so the appliance could not offer anti-malware protection, I ran a Check Status and the virtual appliance still reported that it was managed, online and offering anti-malware protection.

On the plus side, when I migrated a VM to the ESXi host with the misconfigured VMkernel port, the warning message was still generated that the VM was unprotected. What this shows is that this error message is symptomatic of an underlying issue with your virtual appliance or ESXi host. While the issue may not be immediately noticeable because the DSM reports that all is well, you should dig deeper by following the ‘Trend DS Virtual Appliance Health Check’ below.

Bottom line — you cannot fully trust the DSM when you notice this error message. The only way to verify for sure whether the appliance is actually working is to drop the EICAR test file on the VM and confirm whether anti-malware protection is working.
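For what it’s worth, here is a rough Python sketch of that EICAR check, to be run inside the VM you want to verify. The 60-second wait and the file location are my own assumptions rather than Trend-documented values, and an on-access scanner may well flag the file the moment it is written.

# Minimal sketch: drop the standard EICAR test string on the guest and see
# whether the anti-malware layer removes or quarantines it.
import os
import time

# Standard EICAR test string, split in two so this script itself is less
# likely to be flagged while sitting on disk.
EICAR = (r"X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-"
         r"TEST-FILE!$H+H*")

path = os.path.join(os.getcwd(), "eicar_test.txt")
with open(path, "w") as fh:
    fh.write(EICAR)

time.sleep(60)  # give the appliance time to react (assumed interval)

if os.path.exists(path):
    print("EICAR file still present - anti-malware protection looks inactive")
else:
    print("EICAR file removed - anti-malware protection appears to be working")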

‘Trend DS Virtual Appliance Health Check’:

  1. Synchronise your Virtual Center(s) in Trend DSM
  2. Confirm your credentials for VVC and vShield are up to date
  3. Confirm filter driver is installed on ESXi host via Trend DSM
  4. Confirm vShield driver is installed on ESXi host via vShield Manager
  5. Confirm Trend Appliance is registered as Security VM with vShield Manager
  6. Confirm the appliance is in the correct VLAN
  7. Confirm the appliance network configuration is correct
  8. Confirm you can ping the Appliance from the DSM (a quick connectivity sketch follows this list).
  9. Confirm the VMkernel IP address for vShield Endpoint is correct on ESXi host – 169.254.1.1

and if nothing works follow my last resort:

10. Deactivate and reactivate the appliance

And if that fails… follow the blocksandbytes ‘Triple D’ process:

11. Deactivate, Delete and Deploy the appliance.
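To take some of the manual slog out of step 8, here is a minimal Python sketch I would run from the DSM to check basic reachability of each appliance. The appliance names are placeholders and the TCP heartbeat port is an assumption – substitute whatever your own DSM is actually configured to use.

# Minimal sketch for step 8: check basic reachability of each Deep Security
# virtual appliance from the DSM. Hostnames are placeholders and the TCP
# heartbeat port is an assumption - check your own DSM configuration.
import os
import socket
import subprocess

APPLIANCES = ["dsva-esx01.example.local", "dsva-esx02.example.local"]  # placeholders
HEARTBEAT_PORT = 4118  # assumption

def icmp_ping(host):
    flag = "-n" if os.name == "nt" else "-c"  # Windows vs everything else
    return subprocess.call(["ping", flag, "1", host],
                           stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL) == 0

def tcp_check(host, port, timeout=3):
    try:
        with socket.create_connection((host, port), timeout):
            return True
    except OSError:
        return False

for appliance in APPLIANCES:
    print(appliance,
          "ping:", "ok" if icmp_ping(appliance) else "FAILED",
          "tcp/%d:" % HEARTBEAT_PORT,
          "ok" if tcp_check(appliance, HEARTBEAT_PORT) else "FAILED")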

When I’m being lazy and I know the config hasn’t changed, I will deactivate and reactivate the appliance immediately. What I find with Trend is that as long as your environment is static, Trend will continue to stay green, but if your environment is fairly dynamic (hosts being rebooted, VMs being built and vMotioned, SRM failovers and failbacks being performed, etc.) it struggles to keep up with the changes.

Every week I have to try and figure out why virtual machines are unhappy and do not have anti-malware protection. Hopefully this will help others stay on top of Trend DS 8.

vBlock Tip: vSphere 5, SIOC, EMC FAST VP and Storage DRS


If you’ve got a vBlock it’s most likely you’ve got an EMC array with EMC FAST VP, and hopefully by now you’ve upgraded to vBlock Matrix 2.5.0 and you’re using vSphere 5.

If not, what are you waiting for? Oh yeah, there are still a few outstanding issues. (My advice: wait for the Storage vMotion issues to be resolved; they’re a real pain.)

I wanted to post some best practices and recommended settings for leveraging VMware’s Storage IO Control with EMC Fast VP and Storage DRS.

First a quick recap:

  • FAST VP is EMC’s sub-LUN auto-tiering mechanism.
  • SIOC is VMware’s attempt to bring the idea of DRS (distributed resource prioritisation) to the storage layer. SIOC provides I/O performance monitoring and isolation of virtual machines in vSphere 5.
  • Storage DRS is a new feature in vSphere 5 which allows datastores to be pooled together as a single resource.

The bottom line: EMC FAST VP and SIOC are not only compatible but can work together harmoniously because they serve different purposes.

EMC FAST monitors data usage over an hourly period and only moves data once every 24 hours. Unlike SIOC, EMC FAST redistributes data based on the 1GB slice usage and lowers the response time of the busiest slices.

Compared to EMC FAST, SIOC uses a relatively short sampling window and is designed to quickly deal with short term IO contention crises. It can act quickly to throttle IO to limit guest latency during times of IO contention.

SIOC and EMC FAST perform complementary roles to monitor and improve storage performance, therefore they should both be leveraged in your environment.

And lastly, Storage DRS – should it be used? Yes, but in what capacity?

My recommendation is to leverage Storage DRS in Automatic mode for initial placement to balance VMs evenly across datastores. I would also enable SDRS to monitor free capacity and make VM relocation recommendations if datastores approach capacity. The default threshold is 90%, which should be adequate.

What should be disabled, though, is IO metrics — it is EMC’s recommendation that Storage DRS IO metrics be disabled when using FAST VP. This is because they would perform competing roles, potentially identifying similar relocations and causing inefficient use of storage system resources.

So there you have it. The best way to leverage these components in your vBlock.

Sources:

There is a great EMC document here which lists best practices for EMC VNX storage and vSphere, and an old but still relevant article from Virtual Geek on SIOC and auto-tiering.

vBlock Tip: Set VMFS3.MaxHeapSizeMB to 256MB on all ESXi hosts


This is the sort of issue I came across this week that you would expect VCE to make a de facto standard in the vBlock.

Why? Because the performance hit is negligible (a slight increase in additional kernel memory of 64MB), vBlock customers are likely to hit this ceiling, and it’s another setting that we then don’t have to worry about.

I started running into issues vMotioning two VMs. It turns out this is a known issue as per KB1004424.

I was told by VMware: ‘ESXi 5.0 Host (the source on which you are trying to power on the VM) already has 18 virtual disks (.vmdk) greater than 256GB in size open and you are trying to power on a virtual machine with another virtual disk of greater than 256GB in size.’

The heap size effectively determines the amount of open VMDK storage that can be addressed, across all virtual machines, on a given host.

The default heap size is 80MB. To calculate the amount of open VMDK storage available on the host, multiply 80 x 256 x 1024MB — an 80MB heap value extends to 20TB of open VMDK storage on a host.

Increasing the size to 256 results in 256 x 256 x 1024MB — a 256MB heap value extends to 64TB of open VMDK storage on a host, which should be plenty.
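A quick back-of-envelope check of those figures in Python, using the 256 x 1024 multiplier from the text:

# Back-of-envelope check of the heap figures above.
MB_PER_TB = 1024 * 1024

def open_vmdk_tb(heap_mb):
    # heap size in MB x 256 x 1024 gives the open VMDK capacity in MB
    return heap_mb * 256 * 1024 / MB_PER_TB

print(open_vmdk_tb(80))    # default 80MB heap -> 20.0 TB
print(open_vmdk_tb(256))   # raised 256MB heap -> 64.0 TB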

I was provided the following instructions, which I repeated on each host to fix the issue (a scripted alternative is sketched after the list):

  1. Login to the vCenter Server or the ESXi host using the vSphere Client.
  2. Click on the configuration tab of the ESXi Host
  3. Click on Advanced Settings under Software
  4. Select VMFS3
  5. Change the value of VMFS3.MaxHeapSizeMB to 256
  6. Click on OK
  7. Reboot host
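If you have more than a handful of hosts, the same change can be pushed out with a bit of pyVmomi. This is my own automation sketch rather than the procedure VMware gave me; the vCenter name and credentials are placeholders, and each host still needs a reboot afterwards.

# Minimal pyVmomi sketch: set VMFS3.MaxHeapSizeMB to 256 on every host in
# vCenter. Connection details are placeholders; hosts still need a reboot.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.example.local", user="administrator",
                  pwd="password",
                  sslContext=ssl._create_unverified_context())
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(content.rootFolder,
                                                   [vim.HostSystem], True)
    for host in view.view:
        opt_mgr = host.configManager.advancedOption
        # Some pyVmomi versions may need the value passed as a long
        opt_mgr.UpdateOptions(changedValue=[
            vim.option.OptionValue(key="VMFS3.MaxHeapSizeMB", value=256)])
        print("Set VMFS3.MaxHeapSizeMB=256 on", host.name)
finally:
    Disconnect(si)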

After rebooting each host the problem was solved. That was an easy one, for once!

SIOC is fully supported by SRM 5…er…except for planned migrations!


I was having some issues with SIOC and SRM 5 Planned Migrations. I noticed that my planned migrations were failing when SIOC was enabled.

This got me wondering whether SIOC is even supported with SRM 5, but I couldn’t find any documentation online stating whether it was supported. It looks like any mention of it has been omitted from the official documentation.

So after a bit of digging here is what I’ve found from VMware:

1) SIOC is supported for use with SRM – you can use SRM to protect SIOC enabled datastores

2) to execute a “planned migration” with SRM – you will need to disable SIOC first (on the datastores). You cannot do a “planned migration” with SIOC enabled.

Let’s start with the good news — SIOC is supported by SRM 5, so you can leave it enabled on all your replicated datastores.

This leads us to Point 2 – there are a few caveats:

As per KB2004605, you cannot unmount a datastore with SIOC enabled. If you are going to initiate a Planned Migration, you need to disable SIOC first on your Protected Site (active) LUNs. This is because SRM needs to unmount the active LUNs before it breaks the mirror, sets the read-only LUNs in your Recovery Site to read-write and mounts them on all ESXi hosts.

If you attempt a Planned Migration without disabling SIOC, the unmounting of the LUNs, and therefore the Planned Migration, will fail.
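Before kicking off a Planned Migration it is worth confirming nothing still has SIOC switched on. Here is a minimal pyVmomi sketch that lists any datastores with SIOC enabled; the connection details are placeholders, and reading the iormConfiguration property is my assumption of how to check the SIOC state, not anything SRM provides.

# Minimal pyVmomi sketch: list datastores that still have SIOC enabled so you
# know what to disable before a Planned Migration. Placeholder credentials.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter-protected.example.local", user="administrator",
                  pwd="password",
                  sslContext=ssl._create_unverified_context())
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(content.rootFolder,
                                                   [vim.Datastore], True)
    for ds in view.view:
        iorm = ds.iormConfiguration
        if iorm is not None and iorm.enabled:
            print("SIOC still enabled on:", ds.name)
finally:
    Disconnect(si)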

There are other instances where a mounted datastore would need to be unmounted. Consider the following scenario. I haven’t had a chance to test this, but this is what I think will happen:

  1. For whatever reason your protected site (DC1) goes offline.
  2. Log in to SRM at your Recovery Site (DC2) and initiate your Disaster Recovery plan.
  3. The Protected Site (DC1) array is unavailable so SRM is unable to synchronise changes, but it continues the recovery.
  4. SRM instructs RecoverPoint/SRDF to break the mirror and convert the read-only Recovery Site (DC2) LUNs to read-write, and SRM mounts them in vCenter.
  5. SRM powers on your VMs. Job done!
  6. But wait, the old Protected Site (DC1) eventually comes back online.
  7. You log back in to SRM and hit Reprotect to start replicating back the other way.
  8. SRM tries to unmount the LUNs in vCenter in DC1 before it begins replication back the other way, but cannot because SIOC is enabled.
  9. The reprotection fails.

It seems clumsy to me that SRM isn’t aware of SIOC – it doesn’t matter whether it’s during a planned migration or a reprotect; if you have to keep disabling and re-enabling it, it’s a pain in the arse.

Clearly this isn’t going to happen a lot once you go live, and it’s an annoyance at best, but this is the sort of minor issue that a polished product like SRM 5 shouldn’t have. Maybe I’m being so critical because it is such a good product now – they’ve raised my expectations!

I’ve raised a feature request with VMware to have this automated in a future release and I’ve been told the documentation will be updated to ‘state the obvious’.

Maybe I am blissfully ignorant of the complexity involved but as an enterprise end user it looks like a gap to me that needs fixing.

Manual steps introduce uncertainty and risk and this looks like an issue that should be solved.

HA / Distributed vSwitch problems after Storage vMotion – scripts available for KB2013639


Anyone with a vBlock who has upgraded to vSphere 5 should have noticed by now that some virtual machines do not get restarted by HA. You can find out why here.

We first raised this problem with VMware a few months ago and have been waiting patiently for a root cause analysis. Initially we thought it may have been affecting virtual machines that were upgraded from VM v7 to VM v8, but as it turns out the issue is caused by SvMotion, and since we SvMotioned all our VMs from VMFS-3 to VMFS-5 datastores it affected all of our VMs.

You’ve gotta love VMware’s short-term workaround – Do Not SvMotion. Not quite what I was expecting when I upgraded to vSphere 5. There have been what feel like a lot of schoolboy cockups in this release. Silly things like virtual machine folders and files no longer getting renamed when you Storage vMotion. It just feels plain clumsy.

Anyway, one of the big issues with this problem was that you’re not really sure which virtual machines (if any) are affected. Before KB2013639 (if that link doesn’t work try this one) was released, we followed these steps to manually fix the problem on all our virtual machines:

  1. Connect the VM to another port group on the vDS
  2. Connect the VM back to the old port group on the vDS

Thankfully, there is now a script out to detect and fix virtual machines that are affected by the HA/DVS/Storage vMotion issue. You can find William Lam’s copy here and Alan Renouf’s copy here. I’ve tested Alan’s script and it worked great, without any VM downtime.

This doesn’t stop a virtual machine from being affected the next time you Storage vMotion it; it only identifies (and fixes) the virtual machines that would not restart correctly if an HA event were triggered.

For a fix you have to wait for vCenter 5 Update 2, which I believe is out in June, but if you have a vBlock I don’t have word yet on which compatibility matrix this fix will be released with.

As usual we vBlock boys have to wait till last… and they’ll probably slip in some patches for Cisco and EMC in there too.

Wonderful!

Deep Security 8 SP1 Upgrade


As you guys and girls may be aware, Trend DS 8 SP1 has been out since the 30th April.

DS 8 SP1 promises support for wildcard exclusions and also adds Linux support via an agent for on-demand scanning (no real-time scanning yet).

There is also the added benefit of fixing the HEAP_MAX_SIZE PSOD issue, but I am still waiting for confirmation on this.

We’ve been having a few ongoing issues with our Trend environment, mainly due to a lack of care and attention since I installed 7.5 SP1 and upgraded to DS 8. Also, Trend is not the easiest beast to get up and running correctly. A lot of this is down to the documentation. The install guide (Getting Started?) is too simplistic and the Best Practice documentation is confidential (go figure!), so I would definitely recommend professional services if you are thinking about buying Trend DS. And on the plus side, you get someone to blame if anything goes wrong!

I thought the release of 8 SP1 would be a good opportunity to get the Trend boys onsite to blow away the existing DSM and database and install DS 8.0 SP1 from scratch.

Bear in mind this was a live cluster, so we effectively split the cluster in half and kept one half on DS 8 (with all the live VMs) and the other half was upgraded to DS 8 SP1.

We deployed a new VM, installed DSM 8 SP1 on a new database, prepared the ESXi hosts and deployed the new virtual appliances. Once the infrastructure was configured, the existing virtual machines were vmotioned onto the DS 8 SP1 hosts that were managed with the new 8 SP1 DSM.

This was a little tricky as you effectively had two DSMs in operation on a single cluster – not recommended for long! The key to managing the VMs was to change the view to sort by host; then you could easily ignore all the unmanaged VMs on the half of the hosts that were not prepared.

Once the VMs were vMotioned across, we waited five minutes for their config to update (to ensure they no longer thought they were being protected by a DS 8 appliance) and then activated them on the new DS 8 SP1 virtual appliances on the new DSM.

After all the VMs were activated we could upgrade the remaining ESXi hosts and re-enable DRS to spread the VMs back across the cluster.

All in all it was a painless upgrade with no downtime and on the plus side Trend is looking much better.

If you have been through a few iterations of Trend DS and you’re having issues with high maintenance, VMs being unprotected, appliances going offline, etc., I recommend this approach to clear out your infrastructure and database and start off fresh.

Yes, you have to reconfigure your alerting and security profiles, but it’s a small price to pay for a healthy, stable environment.

DS 8 SP1 — well recommended!

— UPDATE 11/06/2012 —

I have had confirmation from Trend that the HEAP_MAX_SIZE issue has been resolved in DS 8 SP1, but for now I’ve left the HEAP_MAX_SIZE variable set on all my ESXi hosts as it is still unclear in my mind whether this setting is no longer needed.