VMware Disaster Recovery To Azure Step-By-Step – Part 4

Azure Disaster Recovery

In the 1st part, we saw how to deploy and configure the Azure Site Recovery Configuration Server, in this 2nd part we will see how to configure Azure.

In the 2nd part, we saw the Azure configuration and how to enable replication of our on-premises VMs.

In the 3rd part, we saw how to test Failover on Azure.

Image result for azure site recovery

Validation

During my Failover, I change the color theme on my Storefront then validate

Reprotect and fail back machines to an on-premises site after failover to Azure

After failover of on-premises VMware VMs to Azure, the first step in failing back to your on-premises site is to reprotect the Azure VMs that were created during failover.

Before you begin

If you used a template to create your virtual machines, ensure that each virtual machine has its own UUID for the disks. If the on-premises virtual machine’s UUID clashes with the UUID of the master target because both were created from the same template, reprotection fails. Deploy another master target that wasn’t created from the same template. Note the following information:

  • If you are trying to fail back to an alternate vCenter, make sure that your new vCenter and the master target server are discovered. A typical symptom is that the datastores aren’t accessible or aren’t visible in the Reprotect dialog box.
  • To replicate back to on-premises, you need a failback policy. This policy is automatically created when you create a forward direction policy. Note the following information:
    • This policy is auto associated with the configuration server during creation.
    • This policy isn’t editable.
    • The set values of the policy are (RPO Threshold = 15 Mins, Recovery Point Retention = 24 Hours, App Consistency Snapshot Frequency = 60 Mins).
  • During reprotection and failback, the on-premises configuration server must be running and connected.
  • If a vCenter server manages the virtual machines to which you’ll fail back, make sure that you have the required permissions for discovery of VMs on vCenter servers.
  • Delete snapshots on the master target server before you reprotect. If snapshots are present on the on-premises master target or on the virtual machine, reprotection fails. The snapshots on the virtual machine are automatically merged during a reprotect job.
  • All virtual machines of a replication group must be of the same operating system type (either all Windows or all Linux). A replication group with mixed operating systems currently isn’t supported for reprotect and failback to on-premises. This is because the master target must be of the same operating system as the virtual machine. All the virtual machines of a replication group must have the same master target.
  • A configuration server is required on-premises when you fail back. During failback, the virtual machine must exist in the configuration server database. Otherwise, failback is unsuccessful. Make sure that you make regularly scheduled backups of your configuration server. If there’s a disaster, restore the server with the same IP address so that failback works.
  • Reprotection and failback require a site-to-site (S2S) VPN or ExpressRoute private peering to replicate data. Provide the network so that the failed-over virtual machines in Azure can reach (ping) the on-premises configuration server. You need to deploy a process server in the Azure network of the failed-over virtual machine(s). This process server must also be able to communicate with the on-premises Configuration Server and Master Target Server.
  • In case the IP addresses of replicated items were retained on failover, S2S or ExpressRoute connectivity should be established between Azure virtual machines and the failback NIC of the configuration server. Note that IP address retention requires configuration server to have two NICs – one for source machines connectivity and one for Azure failback connectivity. This is to avoid overlap of subnet address ranges of the source and failed over virtual machines.
  • Make sure that you open the following ports for failover and failback:
Ports for failover and failback

Deploy a process server in Azure

You need a process server in Azure before you fail back to your on-premises site:

  • The process server receives data from the protected virtual machine(s) in Azure, and then sends data to the on-premises site.
  • A low-latency network is required between the process server and the protected virtual machine. Hence, it is recommended that you deploy a process server in Azure. The replication performance is higher if the process server is closer to the replicating virtual machine (the failed-over machine in Azure).
  • For a proof of concept, you can use the on-premises process server and ExpressRoute with private peering.

To deploy a process server in Azure:

  1. If you need to deploy a process server in Azure, see Set up a process server in Azure for failback.
  2. The Azure VMs send replication data to the process server. Configure networks so that the Azure VMs can reach the process server.
  3. Remember that replication from Azure to on-premises can happen only over the S2S VPN or over the private peering of your ExpressRoute network. Ensure that enough bandwidth is available over that network channel.

Deploy a separate master target server

The master target server receives failback data. By default, the master target server runs on the on-premises configuration server. However, depending on the volume of failed-back traffic, you might need to create a separate master target server for failback. Here’s how to create one:

  • Create a Linux master target server for failback of Linux VMs. This is required. Note that, Master target server on LVM is not supported.
  • Optionally, create a separate master target server for Windows VM failback. To do this, run Unified Setup again, and select to create a master target server. Learn more.

After you create a master target server, do the following tasks:

  • If the virtual machine is present on-premises on the vCenter server, the master target server needs access to the on-premises virtual machine’s Virtual Machine Disk (VMDK) file. Access is required to write the replicated data to the virtual machine’s disks. Ensure that the on-premises virtual machine’s datastore is mounted on the master target’s host with read/write access.
  • If the virtual machine isn’t present on-premises on the vCenter server, the Site Recovery service needs to create a new virtual machine during reprotection. This virtual machine is created on the ESX host on which you create the master target. Choose the ESX host carefully, so that the failback virtual machine is created on the host that you want.
  • You can’t use Storage vMotion for the master target server. Using Storage vMotion for the master target server might cause the failback to fail. The virtual machine can’t start because the disks aren’t available to it. To prevent this from occurring, exclude the master target servers from your vMotion list.
  • If a master target undergoes a Storage vMotion task after reprotection, the protected virtual machine disks that are attached to the master target migrate to the target of the vMotion task. If you try to fail back after this, detachment of the disk fails because the disks are not found. It then becomes hard to find the disks in your storage accounts. You need to find them manually and attach them to the virtual machine. After that, the on-premises virtual machine can be booted.
  • Add a retention drive to your existing Windows master target server. Add a new disk and format the drive. The retention drive is used to stop the points in time when the virtual machine replicates back to the on-premises site. Following are some criteria of a retention drive. If these criteria aren’t met, the drive isn’t listed for the master target server:
    • The volume isn’t used for any other purpose, such as a target of replication.
    • The volume isn’t in lock mode.
    • The volume isn’t a cache volume. The master target installation can’t exist on that volume. The custom installation volume for the process server and master target isn’t eligible for a retention volume. When the process server and master target are installed on a volume, the volume is a cache volume of the master target.
    • The file system type of the volume isn’t FAT or FAT32.
    • The volume capacity is nonzero.
    • The default retention volume for Windows is the R volume.
    • The default retention volume for Linux is /mnt/retention.
  • You must add a new drive if you’re using an existing process server/configuration server machine or a scale or process server/master target server machine. The new drive must meet the preceding requirements. If the retention drive isn’t present, it doesn’t appear in the selection drop-down list on the portal. After you add a drive to the on-premises master target, it takes up to 15 minutes for the drive to appear in the selection on the portal. You can also refresh the configuration server if the drive doesn’t appear after 15 minutes.
  • Install VMware tools or open-vm-tools on the master target server. Without the tools, the datastores on the master target’s ESXi host can’t be detected.
  • Set the disk.EnableUUID=true setting in the configuration parameters of the master target virtual machine in VMware. If this row doesn’t exist, add it. This setting is required to provide a consistent UUID to the VMDK so that it mounts correctly.
  • The ESX host on which the master target is created must have at least one virtual machine file system (VMFS) datastore attached to it. If no VMFS datastores are attached, the Datastore input on the reprotect page is empty and you can’t proceed.
  • The master target server can’t have snapshots on the disks. If there are snapshots, reprotection and failback fail.
  • The master target can’t have a Paravirtual SCSI controller. The controller can only be an LSI Logic controller. Without an LSI Logic controller, reprotection fails.
  • For any instance, the master target can have at most 60 disks attached to it. If the number of virtual machines being reprotected to the on-premises master target has more than a total number of 60 disks, reprotects to the master target begin to fail. Ensure that you have enough master target disk slots, or deploy additional master target servers.

Enable reprotection

After a virtual machine boots in Azure, it takes some time for the agent to register back to the configuration server (up to 15 minutes). During this time, you won’t be able to reprotect and an error message indicates that the agent isn’t installed. If this happens, wait for a few minutes, and then try reprotection again:

  • Select Vault > Replicated items. Right-click the virtual machine that failed over, and then select Re-Protect. Or, from the command buttons, select the machine, and then select Re-Protect.
This image has an empty alt attribute; its file name is image-64.png
  • Verify that the Azure to On-premises direction of protection is selected.
  • In Master Target Server and Process Server, select the on-premises master target server and the process server.
  • For Datastore, select the datastore to which you want to recover the disks on-premises. This option is used when the on-premises virtual machine is deleted, and you need to create new disks. This option is ignored if the disks already exist. You still need to specify a value.
  • Select the retention drive.
  • The failback policy is automatically selected.
This image has an empty alt attribute; its file name is image-65.png
  • Select OK to begin reprotection. A job begins to replicate the virtual machine from Azure to the on-premises site. You can track the progress on the Jobs tab. When the reprotection succeeds, the virtual machine enters a protected state.

Note the following information:

  • If you want to recover to an alternate location (when the on-premises virtual machine is deleted), select the retention drive and datastore that are configured for the master target server. When you fail back to the on-premises site, the VMware virtual machines in the failback protection plan use the same datastore as the master target server. A new virtual machine is then created in vCenter.
  • If you want to recover the virtual machine on Azure to an existing on-premises virtual machine, mount the on-premises virtual machine’s datastores with read/write access on the master target server’s ESXi host.
  • You can also reprotect at the level of a recovery plan. A replication group can be reprotected only through a recovery plan. When you reprotect by using a recovery plan, you must provide the values for every protected machine.
  • Use the same master target server to reprotect a replication group. If you use a different master target server to reprotect a replication group, the server can’t provide a common point in time.
  • The on-premises virtual machine is turned off during reprotection. This helps ensure data consistency during replication. Don’t turn on the virtual machine after reprotection finishes.

After the virtual machine has entered a protected state, you can initiate a failback. The failback shuts down the virtual machine in Azure and boots the on-premises virtual machine. Expect some downtime for the application. Choose a time for failback when the application can tolerate downtime.

Fail back VMware VMs and physical servers from Azure to an on-premises site

Overview of failback

After you have failed over to Azure, you can fail back to your on-premises site by executing the following steps:

  • Reprotect the virtual machines on Azure so that they start to replicate to VMware virtual machines in your on-premises site. As part of this process, you also need to:
    • Set up an on-premises master target. Use a Windows master target for Windows virtual machines and a Linux master target for Linux virtual machines.
    • Set up a process server.
    • Start reprotect to turn off the on-premises virtual machine and synchronize the Azure virtual machine’s data with the on-premises disks.
  • After your virtual machines on Azure replicate to your on-premises site, you start a failover from Azure to the on-premises site.
  • Validate Direction and click on OK
  • You can check the Status of the synchronization
  • After your data fails back, you reprotect the on-premises virtual machines again so that they start replicating to Azure.
  • Select the VM and start Failover
  • Click on OK
  • You can follow the process
  • When process is finished, click on Virtual machines
  • The VM is deallocated but still present (and the original VMware VM has been started on-premises)
  • Commit is required to remove the failed-over virtual machine from Azure. Right-click the protected item, and then select Commit. A job removes the failed-over virtual machines in Azure.
  • When Commit is finished click on Virtual machines
  • VM is not anymore there

Reprotect from on-premises to Azure

After failback finishes and you have started commit, the recovered virtual machines in Azure are deleted. Now, the virtual machine is back on the on-premises site, but it won’t be protected. To start to replicate to Azure again, do the following:

  • Select Vault > Setting > Replicated items, select the virtual machines that failed back, and then select Re-Protect.
  • Enter the value of the process server that needs to be used to send data back to Azure.
  • Select OK to begin the reprotect job.

Validate Failback

Now that the Failback process is finished, we can validate that change made during Failover on the StoreFront has been replicated with the on-premises StoreFront VM.

Feedback

I really like the Azure Site Recovery tool, it’s not quick to deploy, configure and test, but it seems reliable.

My main concern here is that if you want to Failback, you need your protected VM to be configured with MBR and Boot BIOS enabled, which is not the default neither VM recommendations.

It will involve some steps to modify VM before protection but it’s necessary if you want to be able to successfully run a Failback.

It could be interesting and helpful if Microsoft remove this limitations to have a BIOS Boot to allow Failback.

Even if this was for test purpose and to discover the product, I am confident and hope to be able to implement it for customers in the near future.

Microsoft article stand that you need a process server in Azure before you fail back to your on-premises site, however I did not need it, I used my on-premise process server.

Thanks

Arnaud