SRM doesn’t support RecoverPoint point-in-time failover?


I have to admit I was a little surprised when I found out from EMC that SRM does not support RecoverPoint point-in-time failover.

Funny how VMware couldn’t tell me this? They just passed the buck to EMC… typical!

What’s the point of purchasing a product like EMC Recoverpoint if you cannot use it to its full potential? Well, that’s not entirely true, you can, you just have to do it outside of SRM!

Maybe SRM 5 has spoilt me a little. It does what it is supposed to do extremely well, but I just assumed the SRM RecoverPoint SRA would integrate with RecoverPoint’s image access and allow you to pick a previous point in time if required during a failover.

Alas, this is not the case. You cannot pick a point in time with SRM, it always uses the latest image only.

This is a feature request for SRM (VMware employees, if you are reading this): when I perform a Test Failover, I would like the ability to pick a previous point in time, if required, before the failover commences.

What if you have performed a Disaster Recovery failover and your latest image is corrupt? How do you then roll back to a previous journal entry at your Recovery Site?

These are scenarios I don’t quite fully understand, and I’m going to do some testing to see if I can combine some SRM and RecoverPoint steps to at least partially automate the process. The thought of using RecoverPoint natively (enabling image access, mounting LUNs at the recovery site, rescanning hosts, registering VMs in vCenter, and so on) has really put me off using RecoverPoint’s point-in-time features.
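For what it’s worth, the vCenter half of that manual process can be scripted. Below is a rough pyVmomi sketch of the rescan-and-register part only; the vCenter name, credentials, datastore and VMX paths are made-up placeholders, and you would still enable image access and present the point-in-time copy from RecoverPoint itself first.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

# Placeholder connection details - substitute your own
si = SmartConnect(host='vcenter-dc2.example.local', user='administrator', pwd='********',
                  sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

# Rescan HBAs and VMFS on every host (in practice you would filter to the recovery cluster)
host_view = content.viewManager.CreateContainerView(content.rootFolder, [vim.HostSystem], True)
hosts = list(host_view.view)
for host in hosts:
    host.configManager.storageSystem.RescanAllHba()
    host.configManager.storageSystem.RescanVmfs()

# Register the recovered VMs from the now-visible point-in-time datastore
datacenter = content.rootFolder.childEntity[0]                       # assumes a single datacenter
resource_pool = datacenter.hostFolder.childEntity[0].resourcePool    # assumes the first cluster
for vmx in ['[RP_PIT_DS01] vm01/vm01.vmx', '[RP_PIT_DS01] vm02/vm02.vmx']:
    datacenter.vmFolder.RegisterVM_Task(path=vmx, asTemplate=False,
                                        pool=resource_pool, host=hosts[0])

Disconnect(si)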

More on this to follow.


What to do if a reprotect fails in SRM… Protection group has protected VMs with placeholders which need to be repaired


I had this issue today where a reprotect failed after a Planned Migration. I thought it was worth running through what I had to do to resolve the issue without performing a ‘Force Cleanup’ Reprotect as there is currently no KB article describing this workaround.

In my case the planned migration went ahead as planned without issues. All VMs were powered on at the Recovery Site and the Recovery Plan completed successfully.

When it came to the reprotect, however, it failed at Step 3 ‘Cleanup storage’ with the error: ‘Error – Protection group ‘xxx’ has protected VMs with placeholders which need to be repaired.’

The reprotect was cancelled, and when I looked at the protection group, only 3 of the 11 virtual machines were in an ‘OK’ state. The other 8 showed a number of different error messages, including the SRM placeholder datastore could not be accessed, insufficient space, etc. Nothing that seemed to correlate.

I tried the reprotect again without the force cleanup option and it failed again, so I removed protection from all the VMs with errors and ran the reprotect once more. This time it completed fine after a few tense minutes.

To get SRM back to a protected state, I then had to delete the placeholder VMs at the Recovery Site from disk and manually reprotect all the VMs.
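If you have more than a handful of placeholders to tidy up, the ‘delete from disk’ part can also be scripted. This is only a sketch: it assumes SRM tags its placeholder VMs with the com.vmware.vcDr extension key, the vCenter address and VM names are placeholders, and the reprotect itself is still done in the SRM UI.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host='vcenter-dc2.example.local', user='administrator', pwd='********',
                  sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

placeholder_names = {'vm04', 'vm05', 'vm06'}    # the VMs that showed errors in the protection group

vm_view = content.viewManager.CreateContainerView(content.rootFolder, [vim.VirtualMachine], True)
for vm in vm_view.view:
    managed_by = vm.config.managedBy if vm.config else None
    # Only touch VMs that SRM itself manages (assumed com.vmware.vcDr extension key)
    if vm.name in placeholder_names and managed_by and managed_by.extensionKey == 'com.vmware.vcDr':
        print('Deleting placeholder %s from disk' % vm.name)
        vm.Destroy_Task()

Disconnect(si)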

 

Hope this helps…

 

What to do if a SRM cleanup operation fails…


I recently encountered an issue where we ran an SRM Test Failover and afterwards it failed to clean up correctly.

When the cleanup operation fails, what I normally do is run the Force Cleanup and continue on with my life. How wrong I was…

What happened next is I ran a planned migration and, because the force cleanup had not worked correctly, not all virtual machines were protected. When the storage failed over, only 3 of the 8 VMs powered up at the Recovery Site. We ended up in an SRM failed state and had to manually fail back the storage and reinstall SRM. It was a complete disaster and a big waste of a weekend.

So… this post outlines what you should do when a cleanup operation fails. As usual, I learnt the hard way!

If a cleanup operation fails:

  1. Run the force cleanup to try and finish the cleanup operation.
  2. Once the Force Cleanup completes, check the following components manually to confirm that it completed successfully.
  3. Open the Protection Group in SRM and open the protection group status for the virtual machines.
  4. Select refresh and confirm all VMs are still protected – their status should be ‘OK’.
  5. If any are not OK, select Reprotect VMs to fix the issues and recreate the placeholder VMs
  6. Change to the vCenter datastore view.
  7. Confirm the snap datastore for the Test Failover has been removed
  8. If the snap datastore still exists (in italics or normal text), manually unmount and detach the snap datastore from all hosts (a scripted version of steps 8 to 11 is sketched below).
  9. Once the datastore has been unmounted and detached from all hosts, right-click the datacenter (DC1 or DC2) and execute a ‘Rescan for Datastores’.
  10. On the next screen, untick ‘scan for new storage devices’
  11. Confirm the snap datastore has been removed.
And now you can carry on with your life… and your planned migrations.
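If you would rather script steps 8 to 11 than click through every host, here is a rough pyVmomi sketch. The snap datastore name and vCenter details are made up, so sanity check that the datastore really is the leftover SRM snap copy before detaching anything.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

SNAP_DS = 'snap-0a1b2c3d-Datastore01'    # assumed name of the leftover snap datastore

si = SmartConnect(host='vcenter-dc1.example.local', user='administrator', pwd='********',
                  sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

ds_view = content.viewManager.CreateContainerView(content.rootFolder, [vim.Datastore], True)
snap_ds = next(ds for ds in ds_view.view if ds.name == SNAP_DS)
vmfs_uuid = snap_ds.info.vmfs.uuid
naa_ids = {extent.diskName for extent in snap_ds.info.vmfs.extent}
hosts = [mount.key for mount in snap_ds.host]        # every host the snap datastore is mounted on

# Step 8: unmount the VMFS volume from every host first...
for host in hosts:
    host.configManager.storageSystem.UnmountVmfsVolume(vmfsUuid=vmfs_uuid)

# ...then detach the backing device(s) and rescan (steps 9 to 11: VMFS only, no new device scan)
for host in hosts:
    storage = host.configManager.storageSystem
    for lun in storage.storageDeviceInfo.scsiLun:
        if lun.canonicalName in naa_ids:
            storage.DetachScsiLun(lunUuid=lun.uuid)
    storage.RescanVmfs()

Disconnect(si)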

ESXi Hosts Not Responding APD – PowerPath PPVE 5.7 SP3 Build 173 and SRM 5


I’ve recently uncovered an issue with SRM 5 and the latest released version of PowerPath for ESXi (PPVE 5.7 SP3 Build 173) where PowerPath is not handling detached devices properly after an SRM failover.

This is a known issue with SRM and PowerPath documented in VMware KB2016510 – ‘SRM TestFailover cleanup operation times out and fails with the error: Cannot detach SCSI LUN. Operation timed out: 900 seconds.’

This wasn’t the exact operation we had been performing: we had been undertaking Planned Migrations in the week preceding the incident rather than Test Failovers, and there were no errors reported in SRM. In this post I wanted to document our symptoms, so if you have a Vblock with SRM and you notice hosts becoming disconnected in vCenter, don’t panic… read on!

We had been running SRM 5 for a few months, but it seems we recently reached a tipping point after a period of extensive testing of SRM planned migrations, test failovers and clean-ups. While we didn’t have any errors with our cleanup operations as per the above VMware KB article, out of the blue our ESXi hosts started to drop out of vCenter.

As we performed Planned Migrations from the Recovery to the Protected Site and back again, SRM was unmounting and detaching LUNs and PowerPath was incorrectly detaching the devices. Over time this caused the ESXi hosts to stop responding in vCenter as they went into an APD (all paths down) state. First it was one host, and the following week it was five. Thankfully the VMs were not affected, but the hosts were completely unresponsive, even through the DCUI, and we found the only fix was to gracefully shut down the virtual machines via the OS and reboot the ESXi hosts. It was a real pain. Troubleshooting the issue was compounded because lockdown mode was enabled and SSH/ESXi Shell was disabled.

The good news: this is a known issue with PowerPath/VE that EMC is aware of. It is detailed in the EMC knowledgebase article emc284091, “Esxcfg-rescan hangs after unmapping LUN from ESX with Powerpath/VE”. The root cause, as per emc284091: ‘This is a known issue with Powerpath/VE 5.7 where it is not handling detached devices properly. Detaching a device results to setting the device state to OFF and Powerpath/VE is not properly handling this state.’

We were advised by VMware not to perform any more SRM failovers until we had installed PowerPath 5.7 P01. Thankfully EMC will supply you with an early beta to resolve the issue, as P01 is not due out until Q3 2012. We were supplied with PowerPath/VE 5.7 P01 b002 and this appears to have solved the problem.

If you want to try and identify the fault yourself, look out for the following error message – ‘PowerPath: EmcpEsxLogEvent:1260:Error:emcp:MpxEsxVolProbe: Device naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx probe already in progress’

There is also a SCSI sense code that you will normally find in the vmkernel.log but in our case we did not see it because I had to reboot the host to gather logs:

WARNING: NMP: nmpDeviceAttemptFailover:599:Retry world failover device “xxxxxxxxxxxxxxx” – issuing command 0x4125001e09c0

The above is the SCSI sense byte series; the PPVE hotfix will now recognise it and respond accordingly.
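If you want to trawl the logs from a vm-support bundle yourself, something like the following quick Python sketch (nothing official, just a grep in script form) will pull out both of the messages above. The log paths are whatever you point it at.

import re
import sys

PATTERNS = [
    re.compile(r'MpxEsxVolProbe: Device naa\.\w+ probe already in progress'),
    re.compile(r'nmpDeviceAttemptFailover:\d+:\s*Retry world failover device'),
]

for path in sys.argv[1:]:                 # e.g. var/log/vmkernel.log from the bundle
    with open(path, errors='replace') as log:
        for line_no, line in enumerate(log, 1):
            if any(p.search(line) for p in PATTERNS):
                print('%s:%d: %s' % (path, line_no, line.rstrip()))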

SIOC is fully supported by SRM 5…er…except for planned migrations!


I was having some issues with SIOC and SRM 5 Planned Migrations: I noticed that my planned migrations were failing when SIOC was enabled.

This got me thinking about whether SIOC is even supported with SRM 5, but I couldn’t find any documentation online either way. It looks like any mention of it has been omitted from the official documentation.

So after a bit of digging here is what I’ve found from VMware:

1) SIOC is supported for use with SRM – you can use SRM to protect SIOC-enabled datastores.

2) To execute a “planned migration” with SRM, you will need to disable SIOC first (on the datastores). You cannot do a “planned migration” with SIOC enabled.

Let’s start with the good news: SIOC is supported by SRM 5, so you can leave it enabled on all your replicated datastores.

This leads us to point 2 – there are a few caveats:

As per KB2004605, you cannot unmount a datastore with SIOC enabled. If you are going to initiate a Planned Migration, you need to disable SIOC first on your protected site (active) LUNs. This is because SRM needs to unmount the active LUNs before it breaks the mirror, sets the read-only LUNs at your Recovery Site to read-write and mounts them on all ESXi hosts.

If you attempt a Planned Migration without disabling SIOC, the unmounting of the LUNs, and therefore the Planned Migration, will fail.
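Until VMware automates this, you can at least script the disable/re-enable dance around a Planned Migration. The following pyVmomi sketch uses the generic vSphere ConfigureDatastoreIORM_Task API rather than anything SRM-specific; the vCenter name and datastore names are placeholders, and you would re-enable SIOC the same way (enabled=True) once the migration and reprotect complete.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host='vcenter-dc1.example.local', user='administrator', pwd='********',
                  sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

replicated = {'DC1_Replicated_DS01', 'DC1_Replicated_DS02'}   # protected-site (active) datastores

ds_view = content.viewManager.CreateContainerView(content.rootFolder, [vim.Datastore], True)
for ds in ds_view.view:
    if ds.name in replicated and ds.iormConfiguration and ds.iormConfiguration.enabled:
        print('Disabling SIOC on %s' % ds.name)
        spec = vim.StorageResourceManager.IORMConfigSpec(enabled=False)
        content.storageResourceManager.ConfigureDatastoreIORM_Task(datastore=ds, spec=spec)

Disconnect(si)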

There are other instances where a mounted datastore would need to be unmounted. Consider the following scenario; I haven’t had a chance to test this, but it is what I think will happen:

  1. For whatever reason your protected site (DC1) goes offline.
  2. You log in to SRM at your Recovery Site (DC2) and initiate your Disaster Recovery plan.
  3. The Protected Site (DC1) array is unavailable, so SRM is unable to synchronise changes, but it continues the recovery.
  4. SRM instructs RecoverPoint/SRDF to break the mirror and convert the read-only Recovery Site (DC2) LUNs to read-write, and SRM mounts them in vCenter.
  5. SRM powers on your VMs. Job done!
  6. But wait, the old protected site (DC1) eventually comes back online.
  7. You log back in to SRM and hit Reprotect to start replicating back the other way.
  8. SRM tries to unmount the LUNs in vCenter at DC1 before it begins replicating back the other way, but it cannot because SIOC is enabled (a quick pre-check for this is sketched below).
  9. The reprotect fails.
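A quick pre-flight check before hitting Reprotect (or starting a Planned Migration) could look something like this. Again, just a sketch with made-up connection details, reusing the same pyVmomi calls as the earlier snippet.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host='vcenter-dc1.example.local', user='administrator', pwd='********',
                  sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

ds_view = content.viewManager.CreateContainerView(content.rootFolder, [vim.Datastore], True)
blockers = [ds.name for ds in ds_view.view if ds.iormConfiguration and ds.iormConfiguration.enabled]
if blockers:
    print('SIOC still enabled on: %s - unmounts (and the reprotect) will fail.' % ', '.join(blockers))

Disconnect(si)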

It seems clumsy to me that SRM isn’t aware of SIOC. It doesn’t matter whether it’s during a planned migration or a reprotect; if you have to keep disabling and re-enabling SIOC it’s a pain in the arse.

Clearly this isn’t going to happen a lot once you go live, and it’s an annoyance at worst, but this is the sort of minor issue that a polished product like SRM 5 shouldn’t have. Maybe I’m being so critical because it is such a good product now – they’ve raised my expectations!

I’ve raised a feature request with VMware to have this automated in a future release, and I’ve been told the documentation will be updated to ‘state the obvious’.

Maybe I am blissfully ignorant of the complexity involved, but as an enterprise end user it looks like a gap that needs fixing.

Manual steps introduce uncertainty and risk, and this looks like an issue that should be solved.