Citrix XenApp Showdown: AMD Opteron 6278 vs Intel Xeon E5-2670


I recently had the chance to use LoginVSI to compare similar-generation AMD and Intel blade offerings from a single vendor, to see whether there is any significant difference between the two available chipsets when deploying server-based computing environments built on Citrix XenApp.

This produced some really interesting results… Allow me to introduce ‘The Citrix XenApp Showdown: AMD Opteron 6278 vs Intel Xeon E5-2670’.

These processors were chosen because they are two similar-spec HPC blades from the same vendor. However, it is clear from a July 2012 AMD document titled HPC Processor Comparison that the Opteron 6278 has both a lower price and a lower SPECint_rate2006 benchmark score than the Intel Xeon E5-2670.

AMD Positioning Guidance

AMD HPC Processor Comparison

This test was conducted on identical infrastructure, i.e. the same storage, the same 10Gb network, the same Citrix XenApp 6.5 infrastructure and the same PVS 6.1 image, and uses the well-known benchmarking tool LoginVSI to produce a VSImax score that determines how many concurrent users each blade can safely handle before the user experience deteriorates.

The environment used for the test includes:

  • ESXi 5.1
  • Citrix XenApp 6.5
  • Citrix PVS 6.1
  • LoginVSI 4.0.4
  • Virtual Machines running Windows Server 2008 R2 SP1

In the left corner we have the AMD blade based on the Bulldozer 6200 Opteron Processor:

  • AMD Processor – Dual 16-core Opteron 6278 at 2.4GHz with 256GB memory

In the right corner we have the Intel blade based on the Sandy Bridge E5-2600 Xeon processor:

  • Intel Processor – Dual 8-core Xeon E5-2670 at 2.6GHz with 256GB memory

At a high level the blades are pretty similar: dual sockets, similar-generation processors released within three months of each other in 2012, similar clock speeds and the same number of logical processors. The architectures of the AMD and Intel blades are, however, quite different.

The AMD blade appears to have a big advantage with a total of 32 physical cores; it has no Hyper-Threading equivalent, so it presents 32 logical CPUs to ESXi. The Intel blade has 16 physical cores, but with Hyper-Threading enabled it also presents 32 logical CPUs to ESXi.

Because the AMD blade offers twice as many physical cores as the Intel blade, ESXi reports almost twice the total GHz available on the AMD blade (right) compared to the Intel blade (left), as shown in the picture below.

AMD vs Intel Total GHz

Intel (L) 41.5 GHz vs AMD (R) 76.8GHz total available

ESXi reports the Intel blade as 16 × 2.599 = 41.5GHz and the AMD blade as 32 × 2.4 = 76.8GHz, so at face value you would expect the AMD blade to offer almost double the performance.
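For illustration, here is that arithmetic as a minimal Python sketch (the clock speeds are simply the values ESXi reports; turbo boost and per-core behaviour are ignored):

```python
# Total GHz each blade presents to ESXi: sockets x cores per socket x clock.
blades = {
    "Intel Xeon E5-2670": {"sockets": 2, "cores_per_socket": 8,  "ghz": 2.599},
    "AMD Opteron 6278":   {"sockets": 2, "cores_per_socket": 16, "ghz": 2.4},
}

for name, b in blades.items():
    cores = b["sockets"] * b["cores_per_socket"]
    print(f"{name}: {cores} physical cores, {cores * b['ghz']:.1f} GHz total")

# Intel Xeon E5-2670: 16 physical cores, 41.6 GHz total (ESXi shows 41.5)
# AMD Opteron 6278:   32 physical cores, 76.8 GHz total
```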

I’m not going to go into too much detail here as it is well documented elsewhere, but essentially AMD and Intel have come up with two different approaches to solving the same problem: CPU under-utilisation. Intel relies on a single complex core and uses Hyper-Threading to supply it with two threads concurrently, increasing utilisation of that core.

AMD chose to split the core in two: rather than having one complex core, they opted for two simpler cores that share components, with each core running its own execution thread. This is how AMD is able to offer 16-core processors against Intel’s 8-core parts, giving the AMD blade twice the available GHz of the Intel.

Each approach clearly has its own benefits and trade-offs… There is a comparison between the processors available at cpuboss.com. To summarise, the Intel Xeon is more expensive, has a higher clock speed and more L3 cache (20MB vs 15.3MB), while the AMD is cheaper, has 8x the L2 cache (16MB vs 2MB) and double the cores.

But which is more suited to Citrix XenApp?

Medium Workload Test

The first test run was the default LoginVSI Medium Workload Test.

Each blade was configured according to Citrix best practices: 8x 2008 R2 SP1 VMs with 4vCPU each, so that the total number of vCPUs equals the number of logical CPUs. Each VM was given 16GB of memory, for a total of 128GB per blade.
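As a quick sanity check, here is that sizing rule sketched in Python (the 4vCPU and 16GB figures are the values used in this test, not universal constants):

```python
# XenApp VM sizing used in these tests: enough 4vCPU VMs to consume every
# logical CPU on the host, each VM with 16GB of RAM.
def plan_host(logical_cpus, host_ram_gb, vcpus_per_vm=4, ram_per_vm_gb=16):
    vms = logical_cpus // vcpus_per_vm
    return {
        "vms": vms,
        "total_vcpus": vms * vcpus_per_vm,          # should equal logical_cpus
        "vm_ram_gb": vms * ram_per_vm_gb,
        "ram_headroom_gb": host_ram_gb - vms * ram_per_vm_gb,
    }

# Both blades present 32 logical CPUs and have 256GB of RAM installed.
print(plan_host(logical_cpus=32, host_ram_gb=256))
# {'vms': 8, 'total_vcpus': 32, 'vm_ram_gb': 128, 'ram_headroom_gb': 128}
```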

Shown below is the AMD blade with a Medium Workload VSImax score of 83. Note the high VSIbase score (4633), which indicates the performance of the system with no load on the environment; the lower the score, the better the baseline performance, and this figure is used to determine the performance threshold.

There is a high number of maximum response times (in red). The user experience starts to suffer almost immediately, and the maximum response times spike above 6000ms after only 24 users have logged on (3 users per VM). The VSImax score indicates that you would be hard pressed to run more than 10 users per VM, which is pretty poor.

AMD Medium Workload

AMD Opteron 6278 Medium Workload

Shown below is the Intel blade test with a Medium Workload and a VSImax score of 134. No official VSImax was reached, but the blue X indicates VSImax at 150 users; subtracting the 16 stuck sessions gives a corrected VSImax of 134. For anyone with doubts, this figure is consistent with the other medium workload tests that we ran.

In comparison to the AMD Opteron 6278, note the much lower VSIbase score for the Intel Xeon E5-2670 (2217), indicating better baseline system performance, and the complete absence of high maximum response times, indicating a more reliable user experience. Maximum response times only start to exceed 6000ms around the 90 user mark, showing that the blade is able to process user logons and run applications in the background consistently. 134 users equals a much more respectable 16 users per VM for the Intel blade.

Intel Medium Workload

Intel Xeon E5-2670 Medium Workload

Conclusion: There is a pretty impressive 51 user increase in density between the AMD and Intel blades on a Medium Workload. In other words, if you replace your AMD blades with comparable Intel blades you are looking at roughly a 61% gain in user density for medium workload users. For a blade with half the number of cores and half the GHz, that is quite impressive and a massive endorsement of the Intel architecture.

Heavy Workload Test

I re-ran the tests with a LoginVSI Heavy Workload. Again, each blade was configured according to Citrix best practices: 8x 2008 R2 SP1 VMs with 4vCPU each, so that the total number of vCPUs equals the number of logical CPUs, and 16GB of memory per VM for a total of 128GB per blade.

The VSImax results get really interesting with the LoginVSI heavy workload test. Here is a summary of the LoginVSI workloads. The Heavy workload is “higher on memory and CPU consumption because more applications are running in the background.”

Shown below is the AMD blade with a Heavy Workload VSImax score of 61. As expected, the VSImax score drops with the heavier workload. Note the similarly high VSIbase score to the previous AMD test, and how maximum response times start to exceed 6000ms after only 26 users. A VSImax score of 61 equates to a maximum of 7 users per VM; we are heading into really poor territory now.

Heavy Workload - AMD

AMD Opteron 6278 Heavy Workload

Shown below is the Intel blade test with a Heavy Workload VSImax score of 129. This is a drop of only 5 users from the Medium workload test, which is remarkable; the Intel blade appears to cope well as the workload is increased. Maximum response times have improved, only exceeding 6000ms at around 90 users (and never exceeding 10000ms, unlike the medium workload test). A VSImax score of 129 keeps the number of users per VM at 16 even on a heavy workload.

Heavy Workload - Intel

Intel Xeon E5-2670 Heavy Workload

Conclusion: The difference between the two results is startling. The high frequency of maximum response times in the AMD test show how the blade is simply struggling to cope with the task of processing user logons and launching and using standard desktop applications.

These numbers are hard to believe, but increasing the workload shows an even bigger gap between the AMD and Intel blades. There is now a 68 user increase in user density by moving from AMD to Intel. If you have a higher proportion of heavy users in your environment, you will see even greater gains by moving from AMD to Intel. In this case you are looking at a 111% gain in user density with comparable Intel hardware.
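The percentages quoted above come straight from the VSImax scores; here is a quick sketch of the arithmetic:

```python
# User-density gain of the Intel blade over the AMD blade, per workload,
# using the VSImax scores from the tests above (AMD, Intel).
results = {"Medium": (83, 134), "Heavy": (61, 129)}

for workload, (amd, intel) in results.items():
    extra = intel - amd
    print(f"{workload}: +{extra} users ({100 * extra / amd:.0f}% over the AMD blade)")

# Medium: +51 users (61% over the AMD blade)
# Heavy:  +68 users (111% over the AMD blade)
```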

Summary

The clear winner here, by a large margin, is the Intel Sandy Bridge Xeon E5-2670 blade. Although the Intel blade will cost more due to the more expensive processor, it more than pays for itself by offering far higher user density and a surprising ability to cope with heavy workloads.

VSIMax Summary

VSIMax Summary

I’m still scratching my head here, as the AMD blade appears on paper to offer a decent performance/price alternative to the Intel blade, but the results do not support this. Although it offers twice the number of cores and almost double the available GHz to the hypervisor, it is not able to translate this into a comparable user experience. The Intel has a higher SPECint_rate2006 benchmark score, but I never expected that to translate into a more-than-double (111%) increase in user density when testing with LoginVSI.

I would be interested to compare two blades where the AMD blade has the higher SPECint_rate2006 benchmark score, to see at what level a lower-spec Intel blade can still outperform its AMD rival. My guess is that even the entry-level Xeon E5-2620 (SPECint_rate2006 score 396) would be able to match the top-of-the-range Opteron 6284 SE (SPECint_rate2006 score 573).

As the workload gets heavier, the results skew even further in Intel’s favour. A heavier workload need not come from your users’ behaviour alone: it has been documented by Citrix and ProjectVRC that moving from Office 2010 to Office 2013 results in a 20-30% increase in the user workload. After reviewing these results I know which processor I would rather have in my SBC environment.

In other words, choosing Intel over AMD not only provides better user density, lower CapEx and OpEx (due to the smaller infrastructure footprint, licensing, etc.) and an improved ability to cope with heavier workloads, but also provides some future-proofing if you are planning to upgrade to Office 2013.

Clearly the AMD Bulldozer architecture has some advantages over the Intel Sandy Bridge, but server based computing (SBC) is not one of them.

Steer clear if you can.


Citrix XenApp: Is it worth upgrading to B200 M3 to improve user density?


We are currently running Cisco UCS B200 M2 blades in our XenApp cluster. Now that the M2s are end of life and we are beginning to procure newer Cisco UCS B200 M3s, I am starting to wonder what the benefits would be of moving our XenApp cluster to the newer Intel E5-2600 processor family. I would expect a decent increase, as the M3 adds 8 logical CPUs and can therefore support two additional 4vCPU XenApp VMs per blade (the Citrix recommendation is to use 4vCPU VMs and align the total number of vCPUs to the number of logical CPUs)… so I’m hoping to increase user density by around 30 users per host.

But how do we know for sure? All things being equal (same storage, same XenApp environment, same PVS image, same network), we have an Intel Xeon X5680 vs a Xeon E5-2680: 12 vs 16 physical cores, 24 vs 32 logical CPUs and 39.888GHz vs 43.184GHz. However, it is difficult to quantify the increase in user density without running live users on the blade or using software that can calculate the additional number of users that can be accommodated without compromising the user experience.

Into the ring enters “LoginVSI”… the de facto load testing tool for virtual desktop environments. LoginVSI generates a VSImax score, which is the maximum number of user workloads your VM, blade or environment can support before the user experience degrades. We will use the same LoginVSI test and the same applications to create a VSImax baseline on the B200 M2, and then compare this figure to the VSImax generated on the B200 M3 to calculate the increase in user density that can be safely accommodated by the M3.

In the left corner is the Cisco UCS B200 M2, a half-width two-socket blade based on Intel’s Nehalem 5600 processor. In our case we are running an X5680 two-socket, six-core-per-socket blade at 3.324GHz, rated at 130W, with 12MB of cache and 1333MHz DDR3 DIMMs, for a total of 12 cores and 39.888GHz. In the right corner is the Cisco UCS B200 M3, a half-width two-socket blade based on Intel’s Sandy Bridge E5-2600 processor. Our test blade is an E5-2680 two-socket, eight-core-per-socket blade at 2.699GHz, rated at 130W, with 20MB of cache and 1600MHz DDR3 DIMMs, for a total of 16 cores and 43.184GHz.

Shown below is a LoginVSI 150 user test with a Medium No Flash workload on a single B200 M2 running 6x XenApp VMs with 4vCPU and 12GB RAM each. The image below shows a VSImax score of 105, which is very similar to our current real user load per blade. As you can see the user experience degrades quite rapidly after 100 users.

B200 M2 X5680 VSImax

B200 M2 X5680 – Medium No Flash VSImax

The same test was run against a B200 M3 E5-2680 with 16 physical cores at 2.699GHz, for a total of 32 logical CPUs and 43.184GHz.

Shown below is the same LoginVSI 150 user test with a Medium No Flash workload on a single B200 M3, but this time running 8x XenApp VMs with 4vCPU and 12GB RAM each. The image below shows a VSImax score of 141, with little degradation until the 120 user mark, meaning the host was able to safely handle an additional 36 users without compromising the user experience. Not bad.

B200 M3 VSImax

B200 M3 E5-2680 – Medium No Flash VSImax

What happens when the workload is increased?

I ran the tests again with a LoginVSI Medium Workload. Here is a good link for the differences in LoginVSI workloads.

With a Medium Workload the VSImax on the M2 drops to 86, a drop of 19 users.

B200 M2 Medium Workload

B200 M2 X5680 – Medium VSImax

The VSImax on the M3 drops to 123, a similar drop of 18 users. The difference between the two blades improves slightly to 37 users, or 43% over the M2 VSImax.

B200 M3 150 user - Medium VSImax

B200 M3 150 user – Medium VSImax

Analysing the Results:

Has the user (and host) density increased? Indeed it has: the B200 M3 improved the VSImax by 36 users (34%) with a Medium No Flash workload and 37 users (43%) with a Medium workload over the B200 M2, so user and host density has clearly increased.
Is it worth it? That depends on your environment and your phase of deployment. If your M2s are due to be replaced, this 34-43% increase will make quite a big difference if you have thousands of XenApp workloads to support. If your M2s still have some life left in them but you are looking at procuring new hardware to support additional XenApp workloads, then factor in an additional 34-43% users per blade with the Cisco UCS B200 M3.
What is interesting is that the number of users per physical core has barely changed:
  • B200 M2: 105 Medium No Flash users divided by 12 pCPU is 8.75 users per core.
  • B200 M3: 141 Medium No Flash users divided by 16 pCPU is 8.8 users per core.

It’s impressive that the M3 supports the same number of users per core as the M2, given that CTX129761 states that "processor speed has a direct impact on the number of XenApp users that can be supported per processor." In our test the M3 is clocked substantially lower than the M2 (2.7GHz vs 3.33GHz), so Intel processors are clearly becoming more efficient per clock.
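Here are those per-core figures worked out in a small sketch, with a users-per-GHz column added (my own metric, not something from CTX129761) to show where the efficiency gain comes from:

```python
# Users per physical core and per GHz, Medium No Flash workload.
tests = {
    "B200 M2 (X5680)":   {"vsimax": 105, "pcores": 12, "ghz": 3.324},
    "B200 M3 (E5-2680)": {"vsimax": 141, "pcores": 16, "ghz": 2.699},
}

for blade, t in tests.items():
    per_core = t["vsimax"] / t["pcores"]
    per_ghz = t["vsimax"] / (t["pcores"] * t["ghz"])
    print(f"{blade}: {per_core:.2f} users/core, {per_ghz:.2f} users/GHz")

# B200 M2 (X5680):   8.75 users/core, 2.63 users/GHz
# B200 M3 (E5-2680): 8.81 users/core, 3.27 users/GHz
```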

Bring on 12 core… 🙂

Trend Micro Deep Security and Citrix XenApp: The effect of Agentless AV on VSImax


I’ve been doing some benchmarking recently on the two-socket, 6-core, 3.3GHz B200 M2s used in our dedicated XenApp cluster (each ESXi host providing a total of 39.888GHz) to quantify the impact of AV protection on VSImax. (If you haven’t heard of LoginVSI before, it is a load testing tool for virtual desktop environments. VSImax is the maximum number of user workloads your environment can support before the user experience degrades (response times > 4 seconds), and it is a great benchmark because it can be compared across different platforms.)

We use Trend Micro Deep Security 9.1 in our environment to provide agentless anti-malware protection for our XenApp VMs. The Deep Security Virtual Appliance (DSVA) provides real-time scanning via the vShield Endpoint API, using a custom XenApp policy that includes all the anti-virus best practices for Citrix XenApp and Citrix PVS.

Test Summary:

  1. Testing Tool: LoginVSI 3.6 with Medium No Flash workload
  2. Citrix XenApp anti-malware policy: Real-Time Scanning enabled, with all the best practice directory, file and extension exclusions set, plus the recommendation to disable Network Directory Scan and only scan files on write.
  3. Deep Security Virtual Appliance (DSVA): Deployed with the default settings: 2vCPU, 2GB RAM, no CPU reservation and a 2 GB memory reservation.

Shown below is a LoginVSI 150 user test with a medium (no Flash) workload on a single B200 M2 running 6x VMs with 4vCPU and 12GB RAM each with agentless protection disabled. The image below shows a VSImax score of 105, which is very similar to our current real user load per blade.

VSIMax with No AV

VSIMax with No AV

Shown below is the same 150 user test with a medium (No Flash) workload on a single B200 M2 running 6x VMs with 4vCPU and 12GB RAM each with agentless anti malware protection enabled. The image below shows a VSImax score of 101.

VSIMax with AV

VSIMax with AV

The impact on VSImax with Deep Security agentless protection enabled is only 4 users per blade, a 3.8% user penalty. Shown below is the CPU usage of the DSVA during the LoginVSI test: it peaks at 550MHz, roughly 1.4% of the total MHz available on the host (39888MHz). An acceptable penalty to keep our security boys happy!

DSVA CPU MHz

DSVA CPU MHz
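For completeness, here are the overhead figures worked out explicitly (a trivial sketch using only the numbers from this test):

```python
# CPU and user-density overhead of agentless AV in this test.
host_mhz = 12 * 3324            # B200 M2: 12 cores x 3.324GHz = 39888 MHz
dsva_peak_mhz = 550             # peak DSVA CPU usage during the LoginVSI run
vsimax_no_av, vsimax_av = 105, 101

print(f"DSVA CPU overhead: {100 * dsva_peak_mhz / host_mhz:.1f}% of host capacity")
print(f"User density penalty: {vsimax_no_av - vsimax_av} users "
      f"({100 * (vsimax_no_av - vsimax_av) / vsimax_no_av:.1f}%)")

# DSVA CPU overhead: 1.4% of host capacity
# User density penalty: 4 users (3.8%)
```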

Improving Citrix PVS 6.1 write cache performance on ESXi 5 with WcHDNoIntermediateBuffering


I’ve been doing a lot of Citrix XenApp 6.5 and PVS 6.1 performance tuning in our ESXi 5 environment recently. This post is about an interesting Citrix PVS registry setting that is no longer enabled by default in PVS 6.1. Credit to Citrix guru Alex Crawford for alerting me to it.

The setting is called WcHDNoIntermediateBuffering. There is an article on the Citrix website, CTX126042, but it is out of date and only applies to PVS 5.x.

What I noticed in our ESXi 5 environment was that if you compared an IOmeter test on the write cache volume with the same test on the read-only PVS C: drive, you would see a huge IO penalty incurred when writes are redirected by PVS to the .vdiskcache file. In my testing with IOmeter I would regularly achieve ~27000 IOPS (shown below) with a VDI test on the persistent disk.

Persistent Disk IO without PVS

Persistent Disk IO without PVS

When the same test was run against the read-only C: drive, where the PVS driver has to intercept every write and redirect it to the .vdiskcache file, IOPS dropped to around 1000 (a 27x reduction), which is a pretty massive penalty.

WcHDNoIntermediateBuffering Disabled

WcHDNoIntermediateBuffering Disabled

Clearly this bottleneck affects write cache performance and latency, and directly hits write-intensive operations such as user logons and application launches, which in turn degrades the user experience.

WcHDNoIntermediateBuffering enables or disables intermediate buffering, which aims to improve system performance. In PVS 5.x, if no registry value was set (the default), PVS used an algorithm based on the free space available on the write cache volume to determine whether the setting should be enabled.

This is no longer the case: in PVS 6.x, WcHDNoIntermediateBuffering is disabled by default and no algorithm is applied. I have confirmed this with Citrix Technical Support. Why was it changed? I’m not sure; probably it was too onerous for Citrix to support. Here are two current articles relating to issues with the setting: CTX131112 and CTX128038.

With PVS 6.1 the behaviour of the “HKLM\SYSTEM\CurrentControlSet\Services\BNIStack\Parameters\WcHDNoIntermediateBuffering” value is as follows:

  • No value present – (Disabled)
  • REG_DWORD=0 (Disabled)
  • REG_DWORD=1 (Disabled)
  • REG_DWORD=2 (Enabled)

As you can see the default behaviour is now disabled and the only way to enable WcHDNoIntermediateBuffering is to set the value to 2.
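If you prefer to script the change rather than edit the registry by hand, here is a minimal Python sketch (not an official Citrix tool; run it as administrator inside the vDisk image while it is in private/maintenance mode, then reseal and reboot the target devices):

```python
# Set WcHDNoIntermediateBuffering to 2 (enabled) on a PVS 6.x target device.
import winreg

KEY_PATH = r"SYSTEM\CurrentControlSet\Services\BNIStack\Parameters"

with winreg.CreateKeyEx(winreg.HKEY_LOCAL_MACHINE, KEY_PATH, 0,
                        winreg.KEY_SET_VALUE) as key:
    # 2 = enabled; 0, 1 or no value at all mean disabled in PVS 6.x
    winreg.SetValueEx(key, "WcHDNoIntermediateBuffering", 0,
                      winreg.REG_DWORD, 2)

print("WcHDNoIntermediateBuffering set to 2 (enabled); reboot to apply.")
```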

In testing in our ESXi 5 environment, with XenApp VMs running on virtual hardware version 8 with an eager-zeroed persistent disk on a SAS storage pool and the paravirtual SCSI adapter, I saw a 20x increase in IO with WcHDNoIntermediateBuffering enabled. Throughput with WcHDNoIntermediateBuffering enabled is 76% of the true IO of the disk, which is a much more manageable penalty.

WcHDNoIntermediateBuffering Enabled

WcHDNoIntermediateBuffering Enabled

Enabling WcHDNoIntermediateBuffering increased IOPS in our IOmeter VDI tests from 1000 IOPS to over 20000 IOPS, a pretty massive 20x increase.

Bottom Line: While CPU will be the bottleneck in most XenApp environments, if you are looking for an easy win, enabling this setting will align write cache IO performance closer to the true IO of your disk, eliminating a write cache bottleneck and improving the user experience on your PVS clients. We’ve rolled this into production without any issues and I recommend you do too.

Update 15/08/2013: Since upgrading to PVS 6.1 HF16 I have not seen any deterioration in IOmeter tests between our persistent disk and the read-only C:\. This may be due to improvements in HF16 or to changes in our XenApp image, but it is good news nonetheless, as there is now no IO penalty on the system drive with WcHDNoIntermediateBuffering enabled.

Recreating the test in your environment:

I used a simple VDI test to produce these results: 80% writes / 20% reads, 100% random IO, 4KB blocks, for 15 minutes.

Follow these instructions to run the same test:

  1. Download the attachment and rename it to iometer.icf.
  2. Spin up your XenApp image in standard mode
  3. Install IOmeter
  4. Launch IOmeter
  5. Open iometer.icf
  6. Select the computer name
  7. Select your Disk Target (C:, D:, etc)
  8. Click Go
  9. Save Results
  10. Monitor the Results Display to see Total I/O per second
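If you don’t have IOmeter to hand, here is a rough Python approximation of the same profile (80% writes / 20% reads, 100% random, 4KB blocks). It is only a sketch: it goes through the OS page cache rather than driving the disk directly as IOmeter does, so the absolute numbers will not match, but it is enough to compare the write cache volume against the read-only C: drive. The test file path and duration are placeholders to adjust for your environment.

```python
# Crude 4KB random 80/20 write/read loop, reporting approximate IOPS.
import os, random, time

TEST_FILE = r"D:\iotest.bin"   # place this on the volume you want to test
FILE_SIZE = 1 * 1024**3        # 1GB working set
BLOCK = 4096                   # 4KB blocks, matching the IOmeter profile
DURATION = 60                  # seconds (the original test ran for 15 minutes)

# Pre-create the working set so random offsets always land on allocated space.
with open(TEST_FILE, "wb") as f:
    f.truncate(FILE_SIZE)

buf = os.urandom(BLOCK)
ops = 0
deadline = time.time() + DURATION

# buffering=0 avoids Python-level buffering; the OS page cache still applies,
# which is why this only approximates IOmeter's raw-disk figures.
with open(TEST_FILE, "r+b", buffering=0) as f:
    while time.time() < deadline:
        f.seek(random.randrange(FILE_SIZE // BLOCK) * BLOCK)
        if random.random() < 0.8:   # 80% writes
            f.write(buf)
        else:                       # 20% reads
            f.read(BLOCK)
        ops += 1

os.remove(TEST_FILE)
print(f"{ops} operations in {DURATION}s (~{ops / DURATION:.0f} IOPS)")
```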