Improving Citrix PVS 6.1 write cache performance on ESXi 5 with WcHDNoIntermediateBuffering


I’ve been doing a lot of Citrix XenApp 6.5 and PVS 6.1 performance tuning in our ESXi 5 environment recently. This post is about an interesting Citrix PVS registry setting that is no longer enabled by default in PVS 6.1. Credit to Citrix guru Alex Crawford for alerting me to this.

The setting is called WcHDNoIntermediateBuffering – there is an article on the Citrix website, CTX126042, but it only applies to PVS 5.x and is out of date.

What I noticed in our ESXi 5 environment was that if you compared an IOmeter test on the write cache volume with the same test on the PVS read-only C:, you would see a huge IO penalty incurred when writes are redirected by PVS to the .vdiskcache file. In my testing with IOmeter, I would regularly achieve ~27,000 IOPS (shown below) with a VDI test on the persistent disk.

Persistent Disk IO without PVS

When the same test was run against the read-only C:, where the PVS driver has to intercept every write and redirect it to the .vdiskcache file, IOPS would drop to around 1,000 – roughly a 27x penalty, which is pretty massive.

WcHDNoIntermediateBuffering Disabled

Clearly this bottleneck affects write cache performance and latency, and it directly hits write-intensive operations such as user logon and launching applications, which degrades the user experience.

WcHDNoIntermediateBuffering enables or disables intermediate buffering, which aims to improve system performance. In PVS 5.x, if no registry value was set (the default), PVS used an algorithm based on the free space available on the write cache volume to determine whether the setting should be enabled.

This is no longer the case: in PVS 6.x, WcHDNoIntermediateBuffering is disabled unless you explicitly enable it. I have confirmed this with Citrix Technical Support. Why was it disabled? Not sure – probably too onerous for Citrix to support. Here are two current articles relating to issues with the setting: CTX131112 and CTX128038.

With PVS 6.1 the behaviour of the “HKLM\SYSTEM\CurrentControlSet\Services\BNIStack\Parameters\WcHDNoIntermediateBuffering” value is as follows:

  • No value present – Disabled
  • REG_DWORD=0 – Disabled
  • REG_DWORD=1 – Disabled
  • REG_DWORD=2 – Enabled

As you can see the default behaviour is now disabled and the only way to enable WcHDNoIntermediateBuffering is to set the value to 2.
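
To enable it, something along these lines run inside the vDisk image (for example while the image is in private/maintenance mode) should do the trick – the key path is the one listed above and only the value data of 2 matters; you will most likely need to reboot the target device for the PVS driver to pick it up:

reg add "HKLM\SYSTEM\CurrentControlSet\Services\BNIStack\Parameters" /v WcHDNoIntermediateBuffering /t REG_DWORD /d 2 /f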

In testing in our ESXi 5 environment – XenApp VMs on virtual hardware version 8, with an eager zeroed persistent disk on a SAS storage pool behind the paravirtual SCSI adapter – I saw a 20x increase in IO with WcHDNoIntermediateBuffering enabled. Throughput with WcHDNoIntermediateBuffering enabled is around 76% of the true IO of the disk, which is a much more manageable penalty.

WcHDNoIntermediateBuffering Enabled

Enabling WcHDNoIntermediateBuffering increased IOPS in our IOmeter VDI tests from 1,000 IOPS to over 20,000 IOPS – a pretty massive 20x increase.

Bottom Line: While CPU will be the bottleneck in most XenApp environments, if you are looking for an easy win, enabling this setting will bring write cache IO performance closer to the true IO of your disk, eliminating a write cache bottleneck and improving the user experience on your PVS clients. We’ve rolled this into production without any issues and I recommend you do too.

Update 15/08/2013: Since upgrading to PVS 6.1 HF 16 I have not seen any deterioration in IOmeter tests between our persistent disk and the read-only C:\. This may be due to improvements in HF16 or to changes in our XenApp image, but it is good news nonetheless, as there is now no IO penalty on the system drive with WcHDNoIntermediateBuffering enabled.

Recreating the test in your environment:

I used a simple VDI test to produce these results: 80% writes / 20% reads, 100% random IO, 4 KB blocks, run for 15 minutes.

Follow these instructions to run the same test:

  1. Download the attachment and rename it to iometer.icf.
  2. Spin up your XenApp image in standard mode
  3. Install IOmeter
  4. Launch IOmeter
  5. Open iometer.icf
  6. Select the computer name
  7. Select your Disk Target (C:, D:, etc)
  8. Click Go
  9. Save Results
  10. Monitor the Results Display to see Total I/O per second

VMware Product Compatibility Matrix


This is the best website since sliced bread and I felt an irresistible urge to share it with y’all.

I am talking of course about the VMware Product Interoperability Matrix.

A bit of a mouthful but with this little beauty you can work out exactly what dependencies there are between VMware products.

Q. Planning to upgrade your ESXi hypervisor and worried about the impact this will have on your other VMware products like vCenter, vShield, VUM, SRM?

A. No problem, just check the product interoperability matrix!

Q. Want to install vCenter but not sure which versions of SQL are supported?

A. No problem, just check the product interoperability matrix!

You get the idea…

If you have never had a look before, I recommend you perform a quick review of your environment. You may be surprised to find you are out of the matrix (as I was!)

vShield Endpoint Driver BSOD issue


The vShield Endpoint driver is back in the bad books this week.

Looks like it is now causing our virtual machines to blue screen. grrrrr

If it’s not an issue with Trend Micro Deep Security, it’s an issue with vShield Endpoint!

This affected our Citrix XenApp Provisioning Services servers quite severely – they were blue screening every day. It has only affected one of our standard virtual machines: a file server crashed during the day the other week.

This affects anyone using the latest officially released vShield Endpoint driver, 5.0.0.1 build-652273, as well as older versions.

VMware has confirmed this issue is fixed in a new version of the vShield Endpoint driver, 5.0.0.2 build-813867 — another reason to contact VMware to get your hands on this driver, as it has not been officially released yet.

 

vBlock Tip: Increase Cisco 1000V Max Ports from default of 32


Another post in the vBlock tip series…

VCE uses static binding on the Cisco 1000V, and this, combined with the default of 32 ports per VLAN (port-profile), means most people will soon run out of ports on their DV port groups.

Who knows why 32 is the default. It seems a bit conservative to me. Maybe there is a global port limit but I haven’t been able to confirm this.

Either way, 32 doesn’t seem like nearly enough ports for most network designs. The good news is the maximum is 1024, so it makes sense to me to increase it substantially depending on the number of VLANs you have.

As soon as your vBlock lands I would definitely review each DV Port Group and increase the max ports assigned.

Static binding is a pain in the arse – it means that any VM, whether it is a template or powered off, will use up a port if it is assigned to the DV Port Group. You may only have 5 running VMs on the VLAN, but you won’t be able to add and power on a 6th VM if you have 27 VMs/templates powered off and assigned to that same DV Port Group.

For that reason alone I am not sure why VCE don’t just use ephemeral binding. Anyway I am going off topic.

VMware KB1035819 has instructions on how to increase your max ports for each VLAN (port-profile).

These are the commands I use:

  1. show port-profile – to find the correct port profile name
  2. conf t – enter configuration
  3. port-profile <DV-port-group-name> – change configuration context to the correct port-profile
  4. vmware max-ports 64 – change max ports to 64
  5. copy run start – copy running config to startup config
  6. exit
  7. exit
  8. exit
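
To confirm the change has taken effect, running show port-profile name <DV-port-group-name> again should now report the new max ports value of 64 (this assumes standard NX-OS syntax on the 1000V VSM).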

 

Trend Deep Security Warning Message ‘Machine was unprotected during move from one esx host to another’


I wanted to post some more information on this Trend DS error message – ‘Machine was unprotected during move from one esx host to another’ as it seems to come up regularly.

The description of the error message is, ‘a virtual machine was moved to an ESX that does not have an activated Deep Security Virtual Appliance.’

In essence this warning message is saying that the ESXi host you vMotioned your VM to is not currently protecting the virtual machine.

This can be because there is no virtual appliance on the target ESXi host, or because the Trend virtual appliance is not offering anti-malware protection, is not activated, or is offline.

This error message will not show for unactivated virtual machines — a virtual machine has to be activated to generate it.

There is also a known bug with this error message – even though your VM is being protected by the appliance, the error is always reported as an Agent error. Apparently Trend are working on this.

Back to the error message: When you receive this error message, what is the next step?

Trend is a complicated beast – an appliance can have issues for a number of reasons, and whether there is a fault with the appliance or with one of its dependencies is what you need to figure out. It could be something as basic as the appliance dropping off the network, losing connectivity back to the DSM or to the vShield Endpoint VMkernel port, or possibly it is no longer activated (not registered as a security appliance in vShield Manager).

If you get this warning message, open the virtual appliance for the host the VM is currently residing on, first ‘Clear Warnings/Errors’ so you remove any old status/error messages, and then run ‘Check Status’ to see if there are any new issues. If there are errors reported on the appliance, try to resolve them by following the patented ‘Trend DS Virtual Appliance Health Check’ below.

My main bugbear with Trend is that it is too complicated and it does not report its current state accurately and concisely. When I run a Check Status I want to know exactly what is going on. It would be most useful to have a health check screen on the appliance where the health check tests I mention below are run sequentially in full view for the benefit of the administrator. Issues could be highlighted immediately, and it would give us confidence that the appliance and its dependencies are all configured correctly, rather than having to check all the different components individually.

For example, if you check the status of your appliance and it reports back that it is Managed and Online, you would expect it to be managed, online and offering anti-malware protection. In my testing, after I changed the vShield VMkernel IP address on my ESXi host from 169.254.1.1 to 169.254.1.2 so the appliance could not offer anti-malware protection, I ran a Check Status and the virtual appliance still reported that it was managed, online and offering anti-malware protection.

On the plus side, when I migrated a VM to the ESXi host with the misconfigured VMkernel port, the warning message that the VM is unprotected was still generated. What this shows is that this error message is symptomatic of an underlying issue with your virtual appliance or ESXi host. While the issue may not be immediately noticeable because the DSM reports that all is well, you should dig deeper by following the ‘Trend DS Virtual Appliance Health Check’ below.

Bottom line — you cannot fully trust the DSM when you notice this error message. The only way to verify for sure whether the appliance is actually working is to drop the EICAR test file on the VM and confirm anti-malware protection kicks in.
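
For reference, the standard EICAR test string (published by EICAR specifically for safely testing anti-virus products) is:

X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*

Save it on its own in a plain .txt file on the VM; if anti-malware protection really is active, the real-time scan should flag or quarantine the file almost immediately.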

‘Trend DS Virtual Appliance Health Check’:

  1. Synchronise your Virtual Center(s) in Trend DSM
  2. Confirm your credentials for VVC and vShield are up to date
  3. Confirm filter driver is installed on ESXi host via Trend DSM
  4. Confirm vShield driver is installed on ESXi host via vShield Manager
  5. Confirm Trend Appliance is registered as Security VM with vShield Manager
  6. Confirm the appliance is in the correct VLAN
  7. Confirm the appliance network configuration is correct
  8. Confirm you can ping the Appliance from the DSM.
  9. Confirm the VMkernel IP address for vShield Endpoint is correct on the ESXi host – 169.254.1.1 (a quick command-line check for this is shown after the list)

and if nothing works follow my last resort:

10. Deactivate and reactivate the appliance

And if that fails… follow the blocksandbytes ‘Triple D’ process:

11. Deactivate, Delete and Deploy the appliance.
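
For step 9 of the health check, a quick way to verify the VMkernel addresses from the host itself (assuming ESXi 5.x with shell or SSH access) is:

esxcli network ip interface ipv4 get

This lists each vmk interface with its IPv4 address, so you can confirm the vShield Endpoint VMkernel port really is sitting on 169.254.1.1.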

When I’m being lazy and I know the config hasn’t changed, I will deactivate and reactivate the appliance immediately. What I find with Trend is that as long as your environment is static, Trend will continue to stay green, but if your environment is fairly dynamic – hosts are being rebooted, VMs are being built and vMotioned, you are performing SRM failovers and failbacks, etc. – it struggles to keep up with the changes.

Every week I have to try and figure out why virtual machines are unhappy and do not have anti-malware protection. Hopefully this will help others stay on top of Trend DS 8.

vBlock Tip: Set VMFS3.MaxHeapSizeMB to 256MB on all ESXi hosts


This is the sort of issue I came across this week where you would expect VCE to have made the fix a de facto standard in the vBlock.

Why? Because the performance hit is negligible (a slight increase in additional kernel memory of 64 MB), vBlock customers are likely to hit this ceiling, and it’s another setting that we then don’t have to worry about.

I started running into issues vMotioning two VMs. It turns out this is a known issue as per KB1004424.

I was told by VMware: ‘ESXi 5.0 Host (Source on which you are trying to power on the VM) already has 18 virtual disks (.vmdk) greater than 256GB in size open and you are trying to power on a virtual machine with another virtual disk of greater than 256GB in size.’

The heap size effectively determines the amount of open VMDK storage that can be addressed, across all virtual machines, on a given host.

The default heap size is 80 MB. To calculate the amount of open VMDK storage available on the host, multiply the heap size by 256 GB: an 80 MB heap value extends to 80 × 256 GB = 20 TB of open VMDK storage on a host.

Increasing the value to 256 gives 256 × 256 GB = 64 TB of open VMDK storage on a host, which should be plenty.

I was provided the following instructions which I repeated on each host to fix:

  1. Login to the vCenter Server or the ESXi host using the vSphere Client.
  2. Click on the configuration tab of the ESXi Host
  3. Click on Advanced Settings under Software
  4. Select VMFS3
  5. Change the value of VMFS3.MaxHeapSizeMB to 256
  6. Click on OK
  7. Reboot host

After rebooting each host the problem was solved. That was an easy one, for once!
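
If you prefer the command line to the vSphere Client, the equivalent on an ESXi 5.x host (assuming shell or SSH access) should be something like:

esxcli system settings advanced set -o /VMFS3/MaxHeapSizeMB -i 256

followed by the same reboot. You can check the current value first with esxcli system settings advanced list -o /VMFS3/MaxHeapSizeMB.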