What to do if an SRM cleanup operation fails…

I recently encountered an issue where we ran an SRM Test Failover and afterwards it failed to clean up correctly.

When the cleanup operation fails what I normally do is run the Force Cleanup and continue on with my life. How wrong I could be…

What happened next is I ran a planned migration and, because the force cleanup had not worked correctly, not all virtual machines were protected. When the storage failed over, only 3 of the 8 VMs powered up in the Recovery Site. We ended up in an SRM failed state and had to manually fail back the storage and reinstall SRM. It was a complete disaster and a big waste of a weekend.

So… this post outlines what you should do when a cleanup operation fails… As usual I learnt the hard way…!

If a cleanup operation fails:

  1. Run the force cleanup to try and finish the cleanup operation.
  2. Once Force Cleanup completes, check the following components manually to confirm that the force cleanup completed successfully.
  3. Open the Protection Group in SRM and open the protection group status for the virtual machines.
  4. Select refresh and confirm all VMs are still protected; their status should be ‘OK’.
  5. If any are not OK, select Reprotect VMs to fix the issues and recreate the placeholder VMs.
  6. Change to the vCenter datastore view.
  7. Confirm the snap datastore for the Test Failover has been removed.
  8. If the snap datastore still exists in italics or normal text, manually unmount and detach the snap datastore from all hosts.
  9. Once the datastore has been unmounted and detached from all hosts, right-click the datacenter (DC1 or DC2) and execute a ‘Rescan for Datastores’.
  10. On the next screen, untick ‘Scan for New Storage Devices’.
  11. Confirm the snap datastore has been removed.
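
If you would rather not eyeball long lists, the two manual checks above (VM protection status and leftover snap datastores) can be sketched in a few lines of Python. This is only a rough helper, not anything SRM provides: the function names are made up, the VM statuses and datastore names are assumed to be copied out of the SRM and vCenter views by hand, and it assumes the snap datastore still has the default "snap-<hex id>-<original name>" label.

```python
import re

# Assumption: the Test Failover snapshot copy keeps the default
# "snap-<hex id>-<original name>" label and has not been renamed.
SNAP_PATTERN = re.compile(r"^snap-[0-9a-fA-F]+-")


def unprotected_vms(statuses):
    """Return the names of VMs whose protection status is anything other than 'OK'."""
    return sorted(name for name, status in statuses.items() if status != "OK")


def leftover_snap_datastores(datastore_names):
    """Return datastore names that still look like Test Failover snapshot copies."""
    return sorted(name for name in datastore_names if SNAP_PATTERN.match(name))


if __name__ == "__main__":
    # Hypothetical values, typed in from the protection group and datastore views.
    statuses = {"vm01": "OK", "vm02": "Not Configured", "vm03": "OK"}
    datastores = ["DS_PROD_01", "snap-1a2b3c4d-DS_PROD_01"]
    print("VMs to fix:", unprotected_vms(statuses))
    print("Datastores to unmount:", leftover_snap_datastores(datastores))
```

Both lists should come back empty before you go anywhere near a planned migration.
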
And now you can carry on with your life… and your planned migrations.