Several people I work with are very interested in using AWS for two use cases in particular: backup and disaster recovery (DR). I took on a project to find a recovery method that is both efficient and fast for both scenarios, and I built this environment to address them. My observations are below:
Overview
Backup Requirements
The requirement is offsite copies of data, stored for a length of time that depends on regulatory and corporate needs. I chose Veeam as my backup software for this use case for a number of reasons.
Firstly, it is compatible with Hyper-V, vSphere, bare metal, AWS, and Azure.
Secondly, it has a number of recovery options, such as single file and Exchange, and the list goes on and on.
Architecture
In my environment I have Hyper-V and bare-metal Windows servers on-premises, and in Amazon I use S3 and a Veeam backup server (EC2) with a local EBS repository. I chose an EBS repository on the AWS Veeam server because recovering from it is faster than pulling data from S3 (for DR recovery, detailed later). I DO NOT have Hyper-V in the cloud, as it is not supported.
The Workflow is something like this:
Server/VM -> Backup to local Veeam repository (capacity tier to S3 for archive) -> Copy job to AWS Veeam repository.
On-Premises Veeam Configuration
I have two backup jobs (one for bare metal, one for Hyper-V) that save to a local repository as the performance tier and to Amazon S3 as the capacity tier. This lets you keep the long-term archive in the cloud. I also have a copy job that places a copy on the AWS Veeam local repository (for faster DR recovery).
AWS Veeam Configuration
On the AWS Veeam server I have no backup jobs created; only the two repositories (one local and one S3) are configured.
AWS VPC / Networking
In my VPC I have created a few IP subnets:
- “Backup/VDI” subnet – Uses a different IP range from the on-premises environment and has a VPN to on-premises. It hosts the backup server and also AWS WorkSpaces for DR access.
- Server Recovery subnet – Uses the same IP range as on-premises for ease of recovery, but is NOT connected to on-premises in any way, because the overlapping addresses would cause routing conflicts.
Under normal conditions the routing looks like this:
In the example below, my “Backup/VDI” subnets are .100-.102 and my on-premises networks are .0 and .1, which are connected over a VPN.
My recovery network uses a separate route table that (under normal circumstances) contains only local networks; both route tables are part of the same VPC. In a DR event I would simply change the target of the .0 and .1 routes above to local, so that traffic goes to the Server Recovery subnet instead of over the VPN.
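The DR route change described above can be modeled in a few lines of Python. This is only a sketch of the idea, not a set of AWS API calls: the CIDR ranges and the `vgw-example` target are hypothetical placeholders I picked for illustration, and the route table is represented as a plain dictionary.

```python
import ipaddress

# Hypothetical values for illustration only; substitute your own ranges.
ON_PREM_CIDRS = ["10.0.0.0/24", "10.0.1.0/24"]
VPN_TARGET = "vgw-example"  # placeholder, not a real gateway ID
LOCAL_TARGET = "local"


def normal_routes():
    """Routes for the on-premises ranges under normal conditions (via VPN)."""
    return {cidr: VPN_TARGET for cidr in ON_PREM_CIDRS}


def failover(routes):
    """Return a copy of the routes with every VPN target switched to local."""
    return {
        cidr: LOCAL_TARGET if target == VPN_TARGET else target
        for cidr, target in routes.items()
    }


def lookup(routes, address):
    """Return the target of the most specific route matching the address."""
    ip = ipaddress.ip_address(address)
    best = None
    for cidr, target in routes.items():
        network = ipaddress.ip_network(cidr)
        if ip in network and (best is None or network.prefixlen > best[0]):
            best = (network.prefixlen, target)
    return None if best is None else best[1]


routes = normal_routes()
dr_routes = failover(routes)
print(lookup(routes, "10.0.1.25"))     # target under normal conditions
print(lookup(dr_routes, "10.0.1.25"))  # target after the DR switch
```

The point of keeping `failover` as a pure function is that the normal table is left untouched, so failing back is just a matter of re-applying the original routes.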
AWS IAM Permission requirements
A number of permissions are required to complete a successful recovery. I have pasted the JSON below (the formatting is terrible, but it saves space):
{ "Version": "2012-10-17", "Statement": [{ "Action": [ "ec2:DescribeInstances", "ec2:RunInstances", "ec2:TerminateInstances", "ec2:StartInstances", "ec2:StopInstances", "ec2:ModifyInstanceAttribute", "ec2:DescribeImages", "ec2:ImportImage", "ec2:DeregisterImage", "ec2:DescribeVolumes", "ec2:CreateVolume", "ec2:ModifyVolume", "ec2:ImportVolume", "ec2:DeleteVolume", "ec2:AttachVolume", "ec2:DetachVolume", "ec2:CreateSnapshot", "ec2:DescribeSnapshots", "ec2:DeleteSnapshot", "ec2:DescribeSubnets", "ec2:DescribeNetworkInterfaces", "ec2:DescribeSecurityGroups", "ec2:DescribeKeyPairs", "ec2:CreateKeyPair", "ec2:DeleteKeyPair", "ec2:DescribeAvailabilityZones", "ec2:DescribeVpcs", "ec2:DescribeConversionTasks", "ec2:DescribeImportImageTasks", "ec2:DescribeVolumesModifications", "ec2:CancelImportTask", "ec2:CancelConversionTask", "ec2:CreateTags", "ec2:DescribeAccountAttributes", "ec2:DescribeDhcpOptions", "ec2:DescribeVpcAttribute", "iam:GetRole", "iam:CreateRole", "iam:PutRolePolicy", "iam:DeleteRolePolicy", "s3:CreateBucket", "s3:ListBucket", "s3:ListAllMyBuckets", "s3:DeleteBucket", "s3:PutObject", "s3:DeleteObject", "s3:GetBucketLocation", "s3:PutLifeCycleConfiguration", "s3:GetObject", "s3:RestoreObject", "s3:AbortMultiPartUpload", "s3:ListBucketMultiPartUploads", "s3:ListMultipartUploadParts" ], "Effect": "Allow", "Resource": "*" }] }
You can also find them here: https://helpcenter.veeam.com/docs/backup/vsphere/restore_amazon_permissions.html?ver=100
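Because a one-line policy like this is hard to review by eye, a small script can group the actions by service before you attach it. The sketch below is my own helper, not part of Veeam or AWS tooling; it also converts typographic quotes to straight ones first, since a policy copied from a web page often picks those up and JSON parsers reject them. The sample policy is a shortened stand-in, not the full list.

```python
import json
from collections import defaultdict

# Map typographic double quotes to plain ones so json.loads accepts the text.
QUOTE_MAP = str.maketrans({"\u201c": '"', "\u201d": '"'})


def load_policy(text):
    """Parse a policy document, tolerating curly quotes from copy and paste."""
    return json.loads(text.translate(QUOTE_MAP))


def actions_by_service(policy):
    """Group the policy's actions by service prefix (ec2, iam, s3, ...)."""
    statements = policy["Statement"]
    if isinstance(statements, dict):  # a single statement may not be in a list
        statements = [statements]
    grouped = defaultdict(list)
    for statement in statements:
        actions = statement.get("Action", [])
        if isinstance(actions, str):  # a single action may be a bare string
            actions = [actions]
        for action in actions:
            service, _, name = action.partition(":")
            grouped[service].append(name)
    return {service: sorted(names) for service, names in grouped.items()}


# Shortened sample with curly quotes, standing in for the full policy.
sample = (
    "{ \u201cVersion\u201d: \u201c2012-10-17\u201d, \u201cStatement\u201d: [{ "
    "\u201cAction\u201d: [ \u201cec2:RunInstances\u201d, \u201cec2:ImportImage\u201d, "
    "\u201cs3:GetObject\u201d ], \u201cEffect\u201d: \u201cAllow\u201d, "
    "\u201cResource\u201d: \u201c*\u201d }] }"
)

for service, names in actions_by_service(load_policy(sample)).items():
    print(service, len(names), names)
```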
Veeam / AWS Recovery Considerations and Steps
The beauty of using Veeam is that it can convert any VM or bare-metal machine to EC2. I did run into a couple of issues, but I found the following article, which helps greatly:
https://docs.aws.amazon.com/vm-import/latest/userguide/vmie_prereqs.html#prepare-vm-image
The error I was getting was: FirstBootFailure: This import request failed because the instance failed to boot and establish network connectivity
The cause is easier to see if we think about the VM's history: it started as Windows on Hyper-V or bare metal, and after Veeam converts it to an EC2 instance it needs new drivers to support the underlying AWS networking (Citrix). In our case the antivirus was preventing those drivers from being installed, and without them the instance had no network connectivity. Our solution was to acquire and install the Citrix drivers on the on-premises servers ahead of time. This has been quite successful and has had no downsides, since the drivers are essentially dormant while the hardware they support does not exist on the VM.
Restore / Recovery Steps
For my test I wanted to recover servers into AWS from both the S3 archives and the AWS Veeam repository to prove out my architecture. Several steps are needed to accomplish this:
1. Rescan the local AWS Veeam repository – This step scans the repository for any backup files. I found this was not being done automatically, so before a DR recovery I run it to catalog the backups on the local AWS Veeam repository (in my case a direct-attached disk).
2. Next, go to “Home”, select the VM to recover, and choose “Restore to Amazon EC2”.
3. Select your region
4. Select a name and use any tags necessary
5. Select an instance size and any disk options
6. Select a VPC and security group. This is where I select the Server Recovery subnet instead of the “Backup/VDI” subnet.
7. Select whether or not you want a malware scan upon recovery
8. Deploy Proxy Server – The security group needs access to the recovery subnet on ports 22 and 443. To make life easy, I created a security group that allows communication between the local subnets.
9. Lastly, click Finish to start the restore. When it is complete, you will see the VM in EC2.
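For the proxy requirement in step 8, the ingress rules can be written down as data before anything is created in the console. The sketch below is a hypothetical helper of mine that builds one TCP rule per port; the `10.0.0.0/24` range is an assumed placeholder for the recovery subnet, not my real addressing.

```python
import ipaddress


def proxy_ingress_rules(cidr, ports=(22, 443)):
    """Build one TCP ingress rule per port for the given source CIDR."""
    # ip_network raises ValueError for a malformed CIDR or one with host bits set.
    network = ipaddress.ip_network(cidr)
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "IpRanges": [
                {"CidrIp": str(network), "Description": f"Veeam proxy, TCP {port}"}
            ],
        }
        for port in ports
    ]


# Assumed recovery subnet range, for illustration only.
rules = proxy_ingress_rules("10.0.0.0/24")
for rule in rules:
    print(rule["FromPort"], rule["IpRanges"][0]["CidrIp"])
```

The dictionaries follow the shape that the boto3 EC2 client takes in the `IpPermissions` argument of `authorize_security_group_ingress`, so the list could be passed there along with your own security group ID if you prefer scripting the rule over clicking through the console.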
Accessing the data is another story. I considered two options: a client VPN to the subnet, or AWS WorkSpaces. I chose WorkSpaces because it is a more seamless process for end users. In addition, due to coronavirus we already use WorkSpaces for remote users under normal circumstances, since they have VPN access to on-premises resources and prevent the problems that BYOD can introduce (viruses across the VPN, etc.). WorkSpaces also gives us the flexibility to scale up and down on a monthly basis to meet the demands of the organization.