Citrix XenApp Showdown: AMD Opteron 6278 vs Intel Xeon E5-2670

I recently had the chance to compare similar generation AMD and Intel blade offerings from a single vendor with LoginVSI to see if there was any great difference between the two available chipsets when deploying server based computing environments leveraging Citrix XenApp.

This produced some really interesting results… Allow me to introduce ‘The Citrix XenApp Showdown: AMD Opteron 6278 vs Intel Xeon E5-2670’

These processors were chosen because they power two similarly specified HPC blades from the same vendor. However, a July 2012 AMD document titled HPC Processor Comparison makes it clear that the Opteron 6278 has a lower price and a lower SPECint_rate2006 benchmark score than the Intel Xeon E5-2670.

AMD Positioning Guidance

AMD HPC Processor Comparison

This test was conducted on identical infrastructure: the same storage, the same 10Gb network, the same Citrix XenApp 6.5 infrastructure and the same PVS 6.1 image. It uses the well-known benchmarking tool LoginVSI to produce a VSImax score, which indicates how many concurrent users each blade can safely handle before the user experience deteriorates.

The environment used for the test includes:

  • ESXi 5.1
  • Citrix XenApp 6.5
  • Citrix PVS 6.1
  • LoginVSI 4.0.4
  • Virtual Machines running Windows Server 2008 R2 SP1

In the left corner we have the AMD blade based on the Bulldozer 6200 Opteron Processor:

  • AMD Processor – Opteron 6278 Dual 16 core 2.4 GHz with 256GB memory

In the right corner we have the Intel blade based on the Sandy Bridge E5-2600 Xeon processor:

  • Intel Processor – Xeon E5-2670 Dual 8 core 2.6 GHz with 256GB memory

At a high level the blades are pretty similar: dual sockets, similar generation processors released within 3 months of each other in 2012, similar clock speeds and the same number of logical processors. The architectures of the AMD and Intel blades are, however, quite different.

The AMD blade has a big advantage with a total of 32 physical cores, but it has no Hyper-Threading equivalent, so it presents 32 logical CPUs to ESXi. The Intel blade has 16 physical cores, but with Hyper-Threading enabled it also presents 32 logical CPUs to ESXi.

Because the AMD blade offers twice as many cores as the Intel blade, ESXi reports almost twice the total GHz available on the AMD blade (R) as on the Intel blade (L), as shown in the picture below.

AMD vs Intel Total GHz

Intel (L) 41.5 GHz vs AMD (R) 76.8GHz total available

ESXi reports 16*2.599 = 41.5GHz for the Intel blade and 32*2.4 = 76.8GHz for the AMD blade, so at face value you would expect the AMD blade to offer almost double the performance.
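The arithmetic behind those totals can be sketched in a few lines of Python. This is only an illustration: it assumes the total is simply physical cores multiplied by per-core clock speed (Hyper-Threading does not add to it), and the function name `total_ghz` is made up for this example. Note that ESXi displays the Intel total as 41.5 GHz, while the straight multiplication comes out slightly higher.

```python
def total_ghz(sockets: int, cores_per_socket: int, clock_ghz: float) -> float:
    """Aggregate clock capacity: physical cores times per-core clock speed."""
    return sockets * cores_per_socket * clock_ghz


# Figures taken from the blade specs above.
intel = total_ghz(sockets=2, cores_per_socket=8, clock_ghz=2.599)
amd = total_ghz(sockets=2, cores_per_socket=16, clock_ghz=2.4)

print(f"Intel total: {intel:.2f} GHz")
print(f"AMD total:   {amd:.2f} GHz")
print(f"AMD / Intel: {amd / intel:.2f}x")
```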

I’m not going to go into too much detail here as this is well documented elsewhere, but essentially AMD and Intel have come up with two different approaches to solving the same problem: CPU under-utilisation. Intel relies on a single complex core and uses Hyper-Threading to feed it two threads concurrently, which increases CPU utilisation.

AMD chose to split the core in two: rather than one complex core, it opted for two simpler cores with shared components, each with its own execution thread. This is how AMD is able to offer 16-core processors against Intel’s 8-core processors, with almost twice the available GHz of the Intel.

Each approach clearly has its own benefits and trade-offs. A side-by-side comparison of the two processors is also available. To summarise, the Intel Xeon processor is more expensive, has a higher clock speed and more L3 cache (20MB vs 15.3MB), while the AMD processor is cheaper, has 8 times the L2 cache (16MB vs 2MB) and double the cores.

But which is more suited to Citrix XenApp?

Medium Workload Test

The first test run was the default LoginVSI Medium Workload Test.

Each blade was configured according to Citrix best practices: 8x 2008 R2 SP1 VMs with 4 vCPUs each, so that the number of vCPUs equals the number of logical CPUs. Each VM was configured with 16GB of memory, for a total of 128GB per blade.
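As a quick sanity check of that sizing, here is a small Python sketch. The variable names are mine, and the figures are the ones quoted in this post; it simply confirms that the VMs do not overcommit CPU and fit within the blade's memory.

```python
LOGICAL_CPUS = 32       # both blades present 32 logical CPUs to ESXi
BLADE_MEMORY_GB = 256   # memory installed in each blade

vm_count = 8
vcpus_per_vm = 4
memory_per_vm_gb = 16

total_vcpus = vm_count * vcpus_per_vm
total_memory_gb = vm_count * memory_per_vm_gb

# No CPU overcommit: total vCPUs equal the logical CPU count.
assert total_vcpus == LOGICAL_CPUS
# The memory given to the VMs fits within what the blade has installed.
assert total_memory_gb <= BLADE_MEMORY_GB

print(f"{total_vcpus} vCPUs on {LOGICAL_CPUS} logical CPUs")
print(f"{total_memory_gb}GB of {BLADE_MEMORY_GB}GB allocated to VMs")
```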

Shown below is the AMD blade with a Medium Workload VSImax score of 83. Note the high VSIbase score (4633), which indicates the performance of the system with no load on the environment. The lower the score, the better the performance; it is also used to determine the performance threshold.

There are a high number of maximum response times (in red). The user experience starts to suffer almost immediately: the maximum response times start to spike and exceed 6000ms after only 24 users have logged on (3 users per VM). The VSImax score indicates that you would be hard pressed to run more than 10 users per VM, which is pretty poor.

AMD Medium Workload

AMD Opteron 6278 Medium Workload

Shown below is the Intel blade with a Medium Workload VSImax score of 134. No official VSImax score was reached, although there is a blue X indicating VSImax at 150 users; subtracting the 16 stuck sessions gives a corrected VSImax score of 134. For anyone with doubts, this is an accurate figure based on other medium workload tests that we ran.

In comparison to the AMD Opteron 6278, note the much lower VSIbase score for the Intel Xeon E5-2670 (2217), indicating better system performance, and the complete lack of high maximum response times, indicating a more reliable user experience. Maximum response times only start to exceed 6000ms around the 90 user mark, indicating the blade is able to process user logons and run applications in the background consistently. 134 users equals a much more respectable 16 users per VM for the Intel blade.

Intel Medium Workload

Intel Xeon E5-2670 Medium Workload

Conclusion: There is a pretty impressive 51 user increase in user density between the AMD and Intel blades on a Medium Workload (134 vs 83). In other words, if you replace your AMD blades with comparable Intel blades you are looking at a 61% gain in user density for medium workload users. For a blade with half the number of cores and roughly half the GHz, that is quite impressive and a massive endorsement of the Intel chipset architecture.

Heavy Workload Test

I re-ran the tests with a LoginVSI Heavy Workload. Again each blade was configured according to Citrix best practices: 8x 2008 R2 SP1 VMs with 4 vCPUs each, so that the number of vCPUs equals the number of logical CPUs. Each VM was configured with 16GB of memory, for a total of 128GB per blade.

The VSImax results get really interesting with the LoginVSI heavy workload test. Here is a summary of the LoginVSI workloads. The Heavy workload is “higher on memory and CPU consumption because more applications are running in the background.”

Shown below is the AMD blade with a Heavy Workload VSImax score of 61. As expected the VSImax score drops due to the heavier workload. Note the similar high VSIbase score to the previous AMD test and how maximum response times start to exceed 6000ms after only 26 users. A VSImax score of 61 is a maximum of 7 users per VM. We’re heading into really poor territory now.

Heavy Workload - AMD

AMD Opteron 6278 Heavy Workload

Shown below is the Intel blade with a Heavy Workload VSImax score of 129. This is a drop of only 5 users from the Medium Workload test, which is remarkable. The Intel blade appears to perform better when the workload is increased. Maximum response times have improved and only exceed 6000ms at 90 users (and never exceed 10000ms, unlike the Medium Workload test). A VSImax score of 129 ensures that the number of users per VM remains at 16 even on a heavy workload.

Heavy Workload - Intel

Intel Xeon E5-2670 Heavy Workload

Conclusion: The difference between the two results is startling. The high frequency of maximum response times in the AMD test shows how the blade is simply struggling to cope with the task of processing user logons and launching and using standard desktop applications.

These numbers are hard to believe, but increasing the workload shows an even bigger gap between the AMD and Intel blades. There is now a 68 user increase in user density by moving from AMD to Intel. If you have a higher proportion of heavy users in your environment, you will see even greater gains by moving from AMD to Intel. In this case you are looking at a 111% gain in user density with comparable Intel hardware.
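For anyone who wants to check the heavy workload numbers, the calculation can be sketched like this in Python. The helper name `density_gain` is hypothetical; the VSImax scores and the VM count are the ones reported above.

```python
from typing import Tuple


def density_gain(baseline: int, candidate: int) -> Tuple[int, float]:
    """Return (extra users, percentage gain) of candidate over baseline."""
    extra = candidate - baseline
    return extra, 100.0 * extra / baseline


VM_COUNT = 8                      # XenApp VMs per blade
amd_heavy, intel_heavy = 61, 129  # Heavy Workload VSImax scores

extra_users, gain_pct = density_gain(amd_heavy, intel_heavy)

print(f"Extra users on Intel: {extra_users}")
print(f"Density gain: {gain_pct:.0f}%")
# Whole users per VM, rounded down.
print(f"AMD users per VM:   {amd_heavy // VM_COUNT}")
print(f"Intel users per VM: {intel_heavy // VM_COUNT}")
```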


The clear winner here, by a large margin, is the Intel Sandy Bridge Xeon E5-2670 processor blade. Although the Intel blade will be more expensive due to the more expensive processor, it more than pays for itself by offering a far higher user density and a surprising ability to cope with heavy workloads.

VSIMax Summary


I’m still scratching my head here, as on paper the AMD blade appears to offer a decent performance/price alternative to the Intel blade, but the results do not support this. Although it offers twice the number of cores and almost double the GHz to the hypervisor, it is not able to translate this into a similar user experience. Although the Intel has a higher SPECint_rate2006 benchmark score, I never thought this would translate into more than double the user density (a 111% increase) when testing with LoginVSI.

I would be interested to do a comparison between two blades where the AMD blade has a higher SPECint_rate2006 benchmark score, to see at what level a lower-spec Intel blade can outperform its AMD rival. My guess is that even the entry level Xeon E5-2620 (SPECint_rate2006 score 396) would be able to match the top of the range Opteron 6284 SE (SPECint_rate2006 score 573).

As the workload gets heavier, the results skew even more in Intel’s favour. A heavier workload need not necessarily come from your users’ behaviour. It has been documented by Citrix and ProjectVRC that moving from Office 2010 to Office 2013 results in a 20-30% increase in the user workload. After reviewing these results I know which processor I would rather have in my SBC environment.

In other words, choosing Intel over AMD not only provides better user density, lower CapEx and OpEx costs (due to the smaller infrastructure footprint, licensing, etc.) and an improved ability to cope with heavier workloads, but can also provide some future proofing if you are planning on upgrading to Office 2013.

Clearly the AMD Bulldozer architecture has some advantages over the Intel Sandy Bridge, but server based computing (SBC) is not one of them.

Steer clear if you can.

Improving Citrix PVS 6.1 write cache performance on ESXi 5 with WcHDNoIntermediateBuffering

I’ve been doing a lot of Citrix XenApp 6.5 and PVS 6.1 performance tuning in our ESXi 5 environment recently. This post is about an interesting Citrix PVS registry setting that is no longer enabled by default in PVS 6.1. Credit to Citrix guru Alex Crawford for alerting me to this.

The setting is called WcHDNoIntermediateBuffering. There is a current article on the Citrix website, CTX126042, but it is out of date and only applies to PVS 5.x.

What I noticed in our ESXi 5 environment was that if you compared an IOmeter test on your write cache volume with one on the read-only C: of the PVS image, you would see a huge IO penalty incurred when writes are redirected by PVS to the .vdiskcache file. In my testing with IOmeter, I would regularly achieve ~27000 IOPS (shown below) with a VDI test on the persistent disk.

Persistent Disk IO without PVS


When the same test was run against the read-only C:, where the PVS driver has to intercept every write and redirect it to the .vdiskcache file, IOPS would drop to around 1000 (27 times lower), which is a pretty massive penalty.

WcHDNoIntermediateBuffering Disabled


Clearly this bottleneck would have an impact on write cache performance and latency. It would directly affect write-intensive operations such as user logons and application launches, which would in turn degrade the user experience.

WcHDNoIntermediateBuffering enables or disables intermediate buffering which aims to improve system performance. In PVS 5.x, PVS used an algorithm to determine whether the setting was enabled based on the free space available on the write cache volume if no registry value was set (default setting).

This is no longer the case: in PVS 6.x WcHDNoIntermediateBuffering is disabled unless you explicitly enable it. I have confirmed this with Citrix Technical Support. Why was it disabled? I’m not sure, but it was probably too onerous for Citrix to support. There are two current articles relating to issues with the setting: CTX131112 and CTX128038.

With PVS 6.1 the behaviour of the “HKLM\SYSTEM\CurrentControlSet\Services\BNIStack\Parameters\WcHDNoIntermediateBuffering” value is as follows:

  • No value present – (Disabled)
  • REG_DWORD=0 (Disabled)
  • REG_DWORD=1 (Disabled)
  • REG_DWORD=2 (Enabled)

As you can see the default behaviour is now disabled and the only way to enable WcHDNoIntermediateBuffering is to set the value to 2.
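The behaviour in the list above, and a way to produce a .reg file for the change, can be sketched in Python as follows. Treat this as an illustration only: the function names are made up, the mapping just encodes what is described in this post for PVS 6.1, and the script only builds the text of a .reg file (it does not touch the registry, and saving it in an encoding that regedit accepts is left to you).

```python
from typing import Optional

# Key and value name as given above (HKLM expanded to its full name).
KEY_PATH = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\BNIStack\Parameters"
VALUE_NAME = "WcHDNoIntermediateBuffering"


def is_enabled(value: Optional[int]) -> bool:
    """PVS 6.1 behaviour as described in this post: only REG_DWORD=2 enables it."""
    return value == 2


def reg_file_text(value: int = 2) -> str:
    """Build the text of a .reg file that sets the DWORD value."""
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{KEY_PATH}]",
        f'"{VALUE_NAME}"=dword:{value:08x}',
        "",
    ]
    return "\r\n".join(lines)


for candidate in (None, 0, 1, 2):
    state = "Enabled" if is_enabled(candidate) else "Disabled"
    print(f"{candidate!r}: {state}")

print(reg_file_text())
```

As with any change to the PVS image, this would need to be applied with the vDisk in private mode and tested before rolling it out.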

In testing in our ESXi 5 environment, with XenApp VMs running on virtual hardware version 8 with an eager-zeroed persistent disk on a SAS storage pool and the paravirtual SCSI adapter, I saw a 20x increase in IO with WcHDNoIntermediateBuffering enabled. Throughput with WcHDNoIntermediateBuffering enabled is 76% of the true IO of the disk, which is a much more manageable penalty.

WcHDNoIntermediateBuffering Enabled


Enabling WcHDNoIntermediateBuffering increased IOPS in our IOmeter VDI tests from 1000 IOPS to over 20000 IOPS, a pretty massive 20x increase.
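To put the three IOPS figures side by side, here is a rough Python sketch. The numbers are the rounded figures quoted in this post, and 20000 is a lower bound (the result was "over 20000"), so the share of raw disk IO computed here comes out a little below the 76% measured in the actual test.

```python
raw_disk_iops = 27000       # persistent disk without PVS redirection (approx.)
buffering_off_iops = 1000   # read-only C: with the setting disabled (approx.)
buffering_on_iops = 20000   # lower bound: the test reported "over 20000"

penalty_off = raw_disk_iops / buffering_off_iops    # how much slower than raw disk
speedup = buffering_on_iops / buffering_off_iops    # gain from enabling the setting
share_of_raw = buffering_on_iops / raw_disk_iops    # fraction of raw disk IO recovered

print(f"Penalty with setting disabled: {penalty_off:.0f}x")
print(f"Speedup from enabling it:      {speedup:.0f}x")
print(f"Share of raw disk IO:          at least {share_of_raw:.0%}")
```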

Bottom Line: While CPU will be the bottleneck in most XenApp environments, if you are looking for an easy win, enabling this setting will align write cache IO performance closer to the true IO of your disk, eliminating a write cache bottleneck and improving the user experience on your PVS clients. We’ve rolled this into production without any issues and I recommend you do too.

Update 15/08/2013: Since upgrading to PVS 6.1 HF16 I’ve not seen any deterioration in IOmeter tests between our persistent disk and the read-only C:\. This may be due to improvements in HF16 or changes in our XenApp image, but this is good news nonetheless as there is now no IO penalty on the System drive with WcHDNoIntermediateBuffering enabled.

Recreating the test in your environment:

I used a simple VDI test to produce these results that included 80% writes / 20% reads with 100% Random IO on 4KB for 15 minutes.
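If you want to translate an IOPS figure from this test profile into write/read IOPS and throughput, a small Python sketch is shown here. It assumes 4KB means 4096 bytes and that the 80/20 split applies evenly to the total; the names are my own and not part of IOmeter.

```python
BLOCK_SIZE_BYTES = 4 * 1024   # 4KB transfers, as in the test profile
WRITE_PERCENT = 80            # 80% writes / 20% reads


def describe(total_iops: int) -> dict:
    """Split total IOPS into writes/reads and convert to MiB per second."""
    write_iops = total_iops * WRITE_PERCENT // 100
    read_iops = total_iops - write_iops
    throughput_mib_s = total_iops * BLOCK_SIZE_BYTES / (1024 * 1024)
    return {
        "write_iops": write_iops,
        "read_iops": read_iops,
        "throughput_mib_s": throughput_mib_s,
    }


# IOPS figures quoted earlier in this post.
for label, iops in [("persistent disk", 27000), ("read-only C:, setting disabled", 1000)]:
    stats = describe(iops)
    print(f"{label}: {stats['write_iops']} write IOPS, "
          f"{stats['read_iops']} read IOPS, "
          f"{stats['throughput_mib_s']:.1f} MiB/s")
```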

Follow these instructions to run the same test:

  1. Download the attachment and rename it to iometer.icf.
  2. Spin up your XenApp image in standard mode
  3. Install IOmeter
  4. Launch IOmeter
  5. Open iometer.icf
  6. Select the computer name
  7. Select your Disk Target (C:, D:, etc)
  8. Click Go
  9. Save Results
  10. Monitor the Results Display to see Total I/O per second