vCloud Director 5.1.1, vCloud Networking and Security 5.1.1, ESXi 5.1.0a, vCenter 5.1.0a Released!


Looks like my proof of concept environment is out of date already…

VMware released a couple of updates on Thursday 25/10/2012:

  • VMware ESXi 5.1.0a Build 838463 –  Download

Looks like there are a couple of new features in vCloud Director like Elastic vDCs which will be worth looking into, but otherwise it's all bug fixes.

I haven’t been having any issues per se, so I'm not sure how much value I will get out of these updates, but I will hopefully get them installed next week to ensure I am up to date with the latest patches and can try out the new vCloud Director features.

 

Shame on you VMware! Shame on you!


VMware, I am not impressed.

Guess the release date of SQL Server 2008 R2 SP1 for me, will ya?

I’ll give you a hint… Since it has only just been approved for use with vSphere 5.1, you’d hazard a guess recently, right?

Wrong!

SQL 2008 R2 SP1 was released on 11th July 2011. Date approved by VMware: 10th September 2012 (vSphere 5.1 release date).

Come on VMware… seriously? 14 months to approve a SQL service pack? That’s a joke.

I recently found out our administrators had applied SP1 to our SQL 2008 R2 servers earlier this year, when I tried to raise a support call and it was pointed out we were actually outside the VMware matrix.

I had to uninstall SP1 (thank you Microsoft for including this feature in SQL 2008 R2!) to get us back in line with the VMware compatibility matrix. The uninstall went quite smoothly (thank you again Microsoft) but that’s not really the point, is it…

I’m running vSphere 5 Update 1 but I cannot apply SQL 2008 R2 SP1 or even SP2 because VMware are being slack!

Someone needs to up their game or loosen the compatibility matrix.

VMware Product Compatibility Matrix


This is the best website since sliced bread and I felt an irresistible urge to share it with y’all.

I am talking of course about the VMware Product Interoperability Matrix.

A bit of a mouthful but with this little beauty you can work out exactly what dependencies there are between VMware products.

Q. Planning to upgrade your ESXi hypervisor and worried about the impact this will have on your other VMware products like vCenter, vShield, VUM, SRM?

A. No problem, just check the product interoperability matrix!

Q. Want to install vCenter but not sure which versions of SQL are supported?

A. No problem, just check the product interoperability matrix!

You get the idea…

If you have never had a look before, I recommend you perform a quick review of your environment. You may be surprised to see you are out of the matrix (like I was!)

vShield Endpoint Driver BSOD issue


The vShield Endpoint driver is back in the bad books this week.

Looks like it is now causing our virtual machines to blue screen. grrrrr

If it's not an issue with Trend Micro Deep Security, it's an issue with vShield Endpoint!

This affected our Citrix XenApp Provisioned Services Servers quite severely: they were blue screening every day. It has affected only one of our standard virtual machines – a file server crashed during the day the other week.

This will affect anyone using the latest officially released vShield driver 5.0.0.1 build-652273 and older versions.

VMware has confirmed that this issue is fixed in a new version of the vShield Endpoint driver, 5.0.0.2 build-813867. As it has not been officially released yet, that is another reason to contact VMware to get your hands on this driver.

 

New vShield Endpoint Driver available to improve Deep Security 8 performance


Thanks to http://www.joulupukki.nl/wordpress/?p=523 for alerting me to this issue.

VMware made a pre-release of the new vShield Endpoint Driver (5.0.0.2 build-813867) available last week to customers who are experiencing issues with their current vShield Driver. This will be released in Q4, but if you are using an anti-malware product in your virtual environment that relies on the vShield Endpoint Driver I would contact VMware to get the patch.

This hotfix needs to be applied on top of vShield Endpoint Driver build 652273 which is available with the VMware tools included with ESXi 5 Express Patch 3 (build 702118).

In the words of VMware this fixes two main issues: performance issues with network files and sharing violation issues.

1. Sharing violations – It was discovered that, while you had the thin agent installed and real-time AV scanning running, if you opened a file on a network share a few times in quick succession, the 3rd or 4th attempt could result in the file being locked. This was due to the lack of caching for network files, which is the recommended AV practice, but caused this locking.

2. Performance issues – This was due to the general overhead when our thin agent called some MS filter methods.

I also found that this version fixes a BSOD issue with vsepflt.sys. More about that in my next post.

vBlock Tip: Increase Cisco 1000V Max Ports from default of 32


Another post in the vBlock tip series…

VCE use static binding on the Cisco 1000V and this combined with the default of 32 ports per VLAN means most people will soon run out of ports on their DV port groups.

Who knows why 32 is the default. It seems a bit conservative to me. Maybe there is a global port limit but I haven’t been able to confirm this.

Either way, 32 doesn’t seem nearly enough ports in most network designs. The good news is the maximum is 1024, so it makes sense to me to increase it substantially depending on the number of VLANs you have.

As soon as your vBlock lands I would definitely review each DV Port Group and increase the max ports assigned.

Static binding is a pain in the arse – it means that any VM, whether it's a template or powered off, will use up a port if it is assigned to the DV Port Group. You may only have 5x running VMs on the VLAN but you won’t be able to add and power on a 6th VM if you have 27x VMs\templates powered off and assigned to that same DV Port Group.

For that reason alone I am not sure why VCE don’t just use ephemeral binding. Anyway I am going off topic.

VMware KB1035819 has instructions on how to increase your max ports for each VLAN (port-profile).

These are the commands I use:

  1. show port-profile – to find the correct port profile name
  2. conf t – enter configuration
  3. port-profile <DV-port-group-name> – change configuration context to the correct port-profile
  4. vmware max-ports 64 – change max ports to 64
  5. copy run start – copy running config to startup config
  6. exit
  7. exit
  8. exit
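If you have a lot of DV Port Groups to update, a small script can print the command sequence for each one so you can paste it into the VSM session. This is only a rough sketch: it mirrors steps 2 to 5 above, the port-profile names are made up for illustration, and the 1024 ceiling is the maximum quoted earlier in this post.

```python
from typing import List

MAX_PORTS_LIMIT = 1024  # per port-profile maximum, as quoted in this post


def max_ports_commands(port_profile: str, max_ports: int) -> List[str]:
    """Return the CLI lines (steps 2 to 5 above) for one port-profile."""
    if not 1 <= max_ports <= MAX_PORTS_LIMIT:
        raise ValueError(f"max ports must be between 1 and {MAX_PORTS_LIMIT}")
    return [
        "conf t",                           # enter configuration mode
        f"port-profile {port_profile}",     # select the port-profile
        f"vmware max-ports {max_ports}",    # raise the port limit
        "copy run start",                   # save running config to startup
    ]


# Hypothetical port-profile names, purely for illustration.
for profile in ["VLAN100-DATA", "VLAN200-MGMT"]:
    for command in max_ports_commands(profile, 64):
        print(command)
    print()
```

Run `show port-profile` first to get the real names, and check the output by eye before pasting anything into a production switch.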

 

ESXi Hosts Not Responding APD – PowerPath PPVE 5.7 SP3 Build 173 and SRM 5


I’ve recently uncovered an issue with SRM 5 and the latest released version of PowerPath for ESXi – PPVE 5.7 SP3 Build 173 – where PowerPath is not handling detached devices properly after an SRM failover.

This is a known issue with SRM and Powerpath documented in VMware KB2016510  – ‘SRM TestFailover cleanup operation times out and fails with the error: Cannot detach SCSI LUN. Operation timed out: 900 seconds.’

This wasn’t the exact operation we had been performing. We had been undertaking Planned Migrations in the week preceding the incident rather than Test Failovers. Also, there were no errors reported in SRM. In this post I wanted to document our symptoms, so if you have a vBlock and SRM and you notice hosts becoming disconnected in vCenter, don’t panic… read on!

We had been running SRM 5 for a few months, but it seems we recently reached a tipping point after a period of extensive testing of  SRM planned migrations, test failovers and clean-ups. While we didn’t have any errors with our cleanup operations as per the above VMware KB article, out of the blue our ESXi hosts started to drop out of vCenter.

As we performed Planned Migrations from the Recovery to the Protected Site and back again, SRM was unmounting and detaching LUNs and PowerPath was incorrectly handling the detached devices. Over time this caused the ESXi hosts to stop responding within vCenter as they went into an APD (all paths down) state. First it was one host and then the following week it was five. Thankfully the VMs were not affected, but the hosts were completely unresponsive through the DCUI and we found the only fix was to gracefully shut down the virtual machines via the OS and reboot the ESXi hosts. It was a real pain. Troubleshooting was made harder because lockdown mode was enabled and SSH\ESXi Shell was disabled.

The good news: this is a known issue with PowerPath/VE that EMC are aware of. It is detailed in the EMC knowledge base article emc284091, “Esxcfg-rescan hangs after unmapping LUN from ESX with Powerpath/VE”. The root cause as per emc284091: ‘This is a known issue with Powerpath/VE 5.7 where it is not handling detached devices properly. Detaching a device results to setting the device state to OFF and Powerpath/VE is not properly handling this state.’

We were advised by VMware not to perform any more SRM failovers until we had installed PowerPath 5.7 P01. Thankfully EMC will supply you with an early beta to resolve the issue, as P01 is only due out in Q3 2012. We were supplied with PowerPath VE 5.7 P01 b002 and this appears to have solved the problem.

If you want to try and identify the fault yourself, look out for the following error message – ‘PowerPath: EmcpEsxLogEvent:1260:Error:emcp:MpxEsxVolProbe: Device naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx probe already in progress’

There is also a SCSI sense code that you will normally find in the vmkernel.log but in our case we did not see it because I had to reboot the host to gather logs:

WARNING: NMP: nmpDeviceAttemptFailover:599:Retry world failover device “xxxxxxxxxxxxxxx” – issuing command 0x4125001e09c0

The above sense code is the sense byte series. The PPVE hotfix will now recognise it and respond accordingly.
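If you can still get at the logs, a quick way to look for both messages is to scan vmkernel.log for them. The snippet below is a minimal sketch rather than a supported tool: the two patterns are built only from the messages quoted above, the sample lines are shortened stand-ins for real log entries, and the log path in the usage note is an assumption about your host.

```python
import re
from typing import Iterable, List, Tuple

# Patterns derived from the two messages quoted in this post.
PATTERNS = {
    "powerpath_probe": re.compile(
        r"MpxEsxVolProbe: Device (\S+) probe already in progress"
    ),
    "nmp_failover": re.compile(
        r"nmpDeviceAttemptFailover:\d+:\s*Retry world failover device"
    ),
}


def find_apd_hints(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line number, pattern name) for every matching log line."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, name))
    return hits


# Shortened sample lines for illustration only.
sample = [
    "PowerPath: EmcpEsxLogEvent:1260:Error:emcp:MpxEsxVolProbe: "
    "Device naa.xxxx probe already in progress",
    "an unrelated log line",
    "WARNING: NMP: nmpDeviceAttemptFailover:599:Retry world failover "
    'device "xxxx" - issuing command 0x4125001e09c0',
]
print(find_apd_hints(sample))
```

To run it against a real log, pass an open file instead of the sample list, for example `find_apd_hints(open("/var/log/vmkernel.log"))`, adjusting the path to wherever your log bundle was extracted.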

Trend Deep Security Warning Message ‘Machine was unprotected during move from one esx host to another’


I wanted to post some more information on this Trend DS error message – ‘Machine was unprotected during move from one esx host to another’ as it seems to come up regularly.

The description of the error message is, ‘a virtual machine was moved to an ESX that does not have an activated Deep Security Virtual Appliance.’

In essence this warning message is saying that the ESXi host you vMotioned your VM to is not currently protecting the virtual machine.

This can be because there is no virtual appliance on the target ESXi host, or because the Trend virtual appliance on that host is not offering anti-malware protection, is not activated or is offline.

This error message will not show for unactivated virtual machines — A virtual machine has to be activated to generate this error message.

There is a known bug with this error message too – even though your VM is being protected by the appliance, the error message is always reported as an Agent error. Apparently Trend are working on this.

Back to the error message: When you receive this error message, what is the next step?

Trend is a complicated beast – an appliance can have issues for a number of reasons, and what you need to figure out is whether the fault is with the appliance or with one of its dependencies. It could be something as basic as the appliance dropping off the network, losing connectivity back to the DSM or to the vShield Endpoint VMkernel port, or possibly it's no longer activated (not registered as a security appliance in vShield Manager).

If you get this warning message, open the virtual appliance on the host where the VM is currently residing and first ‘Clear Warnings/Errors’ so you remove any old status\error messages, and then run ‘Check Status’ to see if there are any new issues. If there are errors reported on the appliance, try to resolve them by following the patented ‘Trend DS Virtual Appliance Health Check’ below.

My main bugbear with Trend is that it is too complicated and it does not report its current state accurately and concisely. When I run a Check Status I want to know exactly what is going on. It would be most useful to have a health check screen on the appliance where the health check tests I mention below in the article are run sequentially in full view for the benefit of the administrator. Issues could be highlighted immediately and it would give us confidence that the appliance and its dependencies are all configured correctly, rather than having to check all the different components individually.

For example if you check the status of your appliance and it reports back that it is Managed and Online you would expect it to be managed, online and offering anti malware protection. In my testing after I changed the vShield VMkernel IP address on my ESXi host from 169.254.1.1 to 169.254.1.2, so the appliance could not offer anti malware protection, I ran a Check Status and the virtual appliance would still report that it was managed, online and offering anti malware protection.

On the plus side, when I migrated a VM to the ESXi host with the misconfigured VMkernel port, the warning message was still generated that the VM is unprotected. What this shows is that this error message is symptomatic of an underlying issue with your virtual appliance or ESXi host. While the issue may not be immediately noticeable because the DSM reports that all is well, you should dig deeper following the ‘Trend DS Virtual Appliance Health Check’ below.

Bottom line: you cannot fully trust the DSM when you notice this error message. The only way to verify for sure whether the appliance is actually working is to drop the EICAR test file on the VM and confirm that anti-malware protection catches it.

‘Trend DS Virtual Appliance Health Check’:

  1. Synchronise your Virtual Center(s) in Trend DSM
  2. Confirm your credentials for vCenter and vShield are up to date
  3. Confirm filter driver is installed on ESXi host via Trend DSM
  4. Confirm vShield driver is installed on ESXi host via vShield Manager
  5. Confirm Trend Appliance is registered as Security VM with vShield Manager
  6. Confirm the appliance is in the correct VLAN
  7. Confirm the appliance network configuration is correct
  8. Confirm you can ping the Appliance from the DSM.
  9. Confirm the VMkernel IP address for vShield Endpoint is correct on ESXi host – 169.254.1.1

and if nothing works follow my last resort:

10. Deactivate and reactivate the appliance

And if that fails…. Follow the blocksandbytes ‘Triple D’ process:

11. Deactivate, Delete and Deploy the appliance.
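Until Trend gives us that health check screen, the checklist above can at least be driven from a script that runs each test in order and prints the result in full view. What follows is only a sketch of the idea: the check functions are hypothetical placeholders (simple lambdas here), not calls into any Trend or vShield API, and you would need to replace them with whatever you use to test each item.

```python
from typing import Callable, List, Tuple

# Each check is a description plus a function returning True when healthy.
Check = Tuple[str, Callable[[], bool]]


def run_health_check(checks: List[Check]) -> List[str]:
    """Run every check in order, print the result, return the failures."""
    failed = []
    for number, (description, check) in enumerate(checks, start=1):
        ok = check()
        print(f"{number:2d}. {description}: {'OK' if ok else 'FAILED'}")
        if not ok:
            failed.append(description)
    return failed


# Placeholder values for illustration; the misconfigured VMkernel IP
# mirrors the test described earlier in this post.
vmkernel_ip = "169.254.1.2"

checks: List[Check] = [
    ("Appliance registered as Security VM", lambda: True),
    ("Appliance reachable from the DSM", lambda: True),
    ("vShield Endpoint VMkernel IP is 169.254.1.1",
     lambda: vmkernel_ip == "169.254.1.1"),
]

failed = run_health_check(checks)
print("Failures:", failed)
```

Every check runs even if an earlier one fails, so you see the whole picture in one pass instead of fixing one item and starting again.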

When I’m being lazy and I know the config hasn’t changed I will Deactivate and reactivate the appliance immediately. What I find with Trend is that as long as your environment is static, Trend will continue to stay Green, but if your environment is fairly dynamic and hosts are being rebooted, VMs are being built and vMotioned, you are performing SRM fail overs and fail backs, etc. it struggles to keep up with environment changes.

Every week I have to try and figure out why virtual machines are unhappy and do not have anti-malware protection. Hopefully this will help others stay on top of Trend DS 8.

vBlock Tip: vSphere 5, SIOC, EMC FAST VP and Storage DRS


If you’ve got a vBlock it's most likely you’ve got an EMC array with EMC FAST VP, and hopefully by now you’ve upgraded to vBlock Matrix 2.5.0 and you’re using vSphere 5.

If not, what are you waiting for? Oh yeah, there are still a few outstanding issues. (My advice: wait for the Storage vMotion issues to be resolved, it's a real pain.)

I wanted to post some best practices and recommended settings for leveraging VMware’s Storage IO Control with EMC Fast VP and Storage DRS.

First a quick recap:

  • FAST VP is EMC’s sub LUN auto-tiering mechanism.
  • SIOC is VMware’s attempt to extend the idea of DRS (distributed resource prioritisation) into the storage layer. SIOC provides I/O performance monitoring and isolation of virtual machines in vSphere 5.
  • Storage DRS is a new feature in vSphere 5 which allows datastores to be pooled together as a single resource.

The bottom line: EMC FAST VP and SIOC are not only compatible but can work together harmoniously because they serve different purposes.

EMC FAST monitors data usage over an hourly period and only moves data once every 24 hours. Unlike SIOC, EMC FAST redistributes data based on the 1GB slice usage and lowers the response time of the busiest slices.

Compared to EMC FAST, SIOC uses a relatively short sampling window and is designed to quickly deal with short term IO contention crises. It can act quickly to throttle IO to limit guest latency during times of IO contention.

SIOC and EMC FAST perform complementary roles to monitor and improve storage performance, therefore they should both be leveraged in your environment.

And lastly Storage DRS – should it be used? Yes, but in what capacity?

My recommendation is to leverage Storage DRS in Automatic mode for initial placement to balance VMs evenly across datastores. I would also enable SDRS to monitor free capacity to make VM relocation recommendations if datastores approach capacity. The default setting is 90% which should be adequate.

What should be disabled though is IO Metrics: EMC recommends that Storage DRS IO metrics be disabled when using FAST VP. This is because the two will perform competing roles, potentially identifying similar relocations and causing inefficient use of storage system resources.
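To make the recommendation concrete, here is a small sketch that writes the settings down in one place and shows the capacity check in action. The setting names are my own labels rather than vSphere API properties, the datastore figures are invented, and the 90% threshold is simply the figure quoted in this post, so confirm what your own datastore cluster is set to.

```python
from typing import Dict, List, Tuple

# Settings recommended in this post; the key names are illustrative only.
SDRS_SETTINGS = {
    "automation_level": "automatic",  # initial placement and balancing
    "space_threshold_pct": 90,        # figure quoted in this post
    "io_metrics_enabled": False,      # disabled alongside FAST VP
}


def datastores_over_threshold(
    usage: Dict[str, Tuple[int, int]], threshold_pct: int
) -> List[str]:
    """Return datastores whose used space meets or exceeds the threshold.

    usage maps a datastore name to (used_gb, capacity_gb).
    """
    return [
        name
        for name, (used_gb, capacity_gb) in usage.items()
        if used_gb * 100 >= threshold_pct * capacity_gb
    ]


# Invented datastore figures for illustration.
usage = {"ds01": (460, 500), "ds02": (200, 500)}
over = datastores_over_threshold(usage, SDRS_SETTINGS["space_threshold_pct"])
print(over)
```

These are the datastores you would expect Storage DRS to start making relocation recommendations for, based on free capacity alone.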

So there you have it. The best way to leverage these components in your vBlock.

Sources:

There is a great EMC document here which lists best practice with EMC VNX Storage and vSphere, and an old but relevant article from Virtual Geek on SIOC and auto-tiering.