Designing an architecture for your application resiliency objectives
It is important to ensure that you're meeting your defined disaster recovery (DR) objectives and the DR metric thresholds for your application. Review the following sample application architectures to understand how you can meet your recovery objectives by using IBM Cloud®.
The resiliency options, proposed profiles, and associated information are presented so that you can define your application's DR requirement levels. Information that is stated is not a warranty and IBM® will not issue credits for failure to meet an objective. These recovery time objective (RTO) and recovery point objective (RPO) examples are presented as a reference for additional steps that can be taken to achieve different levels of resiliency. RTO is the maximum duration of time within which an application should be restored after any type of disaster. RPO is the point to which data is restored in the event of a disaster. Refer to the Service Level Agreements (SLAs) for any commitments and credits that are issued upon failure to meet any committed SLAs. For more information about the recovery strategy classes, see Planning your applications recovery strategy objectives.
Hybrid disaster recovery: Microservices
Business continuity is a top priority for modern applications. The following example shows patterns and best practices to implement DR for hybrid microservice applications in an active/passive configuration.
Some of the significant considerations that shape a hybrid DR microservices architecture include the following:
- Business continuity
- Business processes can continue to be processed despite man-made or natural disasters.
- Operational flexibility
- A well-designed solution with code and data in multiple sites gives flexibility in where applications and data are deployed, based on user need and traffic.
- Cost reduction
- When the DR site is in the cloud, not all of the resources need to be running at all times. This results in cost savings compared to a cold standby at an on-premises site.
Review the following functional requirements:
- Support for both SQL and NoSQL databases
- Failover and failback of the whole application stack must be completely automated, or activated by a person with a single action
- Monitor availability of both primary and backup hosting services
- Alert in case of failures
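The monitoring and alerting requirements can be illustrated with a minimal sketch. It assumes that some external probe reports whether a site is healthy, and that an alert should fire once after a run of consecutive failures. The `SiteMonitor` class and the threshold of three failures are hypothetical and are not part of any IBM Cloud API.

```python
from dataclasses import dataclass


@dataclass
class SiteMonitor:
    """Tracks probe results for one hosting site (hypothetical helper)."""
    name: str
    failure_threshold: int = 3   # assumption: alert after 3 failed probes in a row
    consecutive_failures: int = 0

    def record(self, healthy: bool) -> bool:
        """Record one probe result; return True when an alert should fire."""
        if healthy:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        # Fire exactly once, at the moment the threshold is reached.
        return self.consecutive_failures == self.failure_threshold


primary = SiteMonitor("primary")
probe_results = [True, False, False, False, False, True]
alerts = [primary.record(result) for result in probe_results]
```

Requiring several consecutive failures before alerting avoids paging on a single dropped probe. The same monitor would be run against the backup site so that a silent failure there is noticed before a failover is needed.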
Cloud architecture based on two multizone regions (MZRs)
To design a resilient architecture, you need to consider the individual blocks of your solution and their specific capabilities.
In this example, three different infrastructure solutions are used across two MZRs. To achieve the goal of having a resilient architecture, the required deployment footprint differs between IaaS solutions such as VMware and Kernel-based virtual machine (KVM), Kubernetes platforms (Kubernetes Service and Red Hat OpenShift on IBM Cloud), and a fully managed serverless service such as Code Engine.
Network architecture
Review the following network architecture assumptions:
- As a baseline, the architecture and procedures that are used today in a dedicated deployment also work in a public isolated deployment. This might not be the preferred architecture, but it should be used as a starting point. Public cloud offers more flexibility, agility, and customer control, which might allow changes to current procedures that save money, reduce effort, or provide benefits beyond those available today with a dedicated cloud.
- Users of the applications will not see any difference in their usage of the application, whether it's running in dedicated or public locations.
- Administrators decide when a failover occurs and trigger it manually.
- All networking is preconfigured in both sites before any sort of disaster declaration.
- VPC subnets in all locations will have nonoverlapping CIDR blocks.
- Routes for CIDR blocks in one multizone region (MZR) are advertised to the other MZRs so that no routing changes are required in a disaster situation.
- Event Streams Enterprise, IBM Cloudant, and other services that use dedicated hardware need to be preprovisioned, because provisioning might require human intervention and take several days. Event Streams might have automated provisioning on the Enterprise plan, but IBM Cloudant provisioning is manual per the IBM Cloudant documentation.
- When applications deploy to production, they get deployed to both US South and US East regions, and deployment validation testing is performed in dark mode (in the backup MZR). The testing can be just enough to validate that the app is ready should it be necessary to fail over.
- There is currently no feature set in IBM Cloud Code Engine to replicate the configuration of one project to another one in a different region. Therefore, the customer team needs to ensure that all necessary configuration changes are applied to the IBM Cloud Code Engine projects in every region.
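The assumption in the list above that VPC subnets use nonoverlapping CIDR blocks can be checked before any disaster declaration. The following is a minimal sketch that uses Python's standard `ipaddress` module. The subnet names and address ranges are hypothetical, and one overlap is included deliberately to show what a finding looks like.

```python
from ipaddress import ip_network
from itertools import combinations


def find_overlaps(subnets: dict[str, str]) -> list[tuple[str, str]]:
    """Return every pair of subnet names whose CIDR blocks overlap."""
    nets = {name: ip_network(cidr) for name, cidr in subnets.items()}
    return [
        (a, b)
        for a, b in combinations(nets, 2)
        if nets[a].overlaps(nets[b])
    ]


# Hypothetical subnet plan across the two MZRs.
subnets = {
    "us-south-1": "10.240.0.0/24",
    "us-south-2": "10.240.64.0/24",
    "us-east-1": "10.241.0.0/24",
    "us-east-2": "10.240.0.128/25",  # deliberately overlaps us-south-1
}

overlaps = find_overlaps(subnets)
print(overlaps)
```

Running a check like this as part of network provisioning helps ensure that routes advertised between MZRs stay unambiguous.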
Something to consider in a hybrid architecture is the flow of traffic to the public internet. If there are no public VLANs or other direct connectivity to the public internet from IBM Cloud, all traffic between IBM Cloud and the public internet gets routed through your on-premises network.
Therefore, any workload running in the US East region must traverse the IBM private network all the way from Washington DC to the west coast. In many instances, multiple trips might be needed to complete requests to on-premises systems. Consider latency when aiming for a true active/active workload between the US South and US East regions in this model.
Application profiles
One approach is to build a set of architecture profiles that represent most apps. These profiles include options for various classes of service, compute requirements, and all the other components that are listed.
Continuous availability profile
The continuous availability recovery class applies to applications that require the platform to be available in less than an hour in the event of a disaster where the primary MZR becomes unavailable.
Continuous availability profile - Compute
In order to meet continuous availability RTO/RPO goals, these components will need to be preconfigured and at their production workload capacity.
The Kubernetes Service clusters in US East might need to be only large enough to support continuous availability and advanced recovery class workloads. All applications can always be deployed there (as they are today in Dedicated), but standard recovery and no-recovery class applications can be stopped. The idea is that applications with higher availability requirements become available immediately, with enough capacity to run them, if a failure occurs. To save money, the cluster is scaled up to match the primary cluster and support the lower-requirement workloads only when needed. These applications have a longer RTO, within which the scaling operation can complete.
The same technique can be applied to updates to the Code Engine applications or Kubernetes Service clusters. The first step is to scale up the backup cluster to full capacity, matching the primary cluster, before starting the upgrade. Then, perform a blue/green update in which the workload is switched to the backup while the primary cluster is upgraded. After the primary cluster is upgraded and verified, the workload is switched back, and the backup cluster can be scaled back down and then upgraded.
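The standby sizing idea described above can be sketched as simple arithmetic. The example assumes that each workload has a recovery class and a vCPU requirement, and that worker nodes have a fixed vCPU size. The workload names, vCPU figures, and the 4-vCPU worker size are all hypothetical.

```python
import math

# Recovery classes that must run in the backup MZR without waiting for scale-up.
ALWAYS_ON = {"continuous availability", "advanced recovery"}


def workers_needed(workloads, vcpu_per_worker, classes=None):
    """Worker nodes needed for the workloads in the given recovery classes.

    workloads: iterable of (name, recovery_class, vcpu) tuples.
    classes: set of recovery classes to include, or None for all of them.
    """
    total_vcpu = sum(
        vcpu for _, recovery_class, vcpu in workloads
        if classes is None or recovery_class in classes
    )
    return math.ceil(total_vcpu / vcpu_per_worker)


# Hypothetical workload inventory.
workloads = [
    ("orders", "continuous availability", 12),
    ("catalog", "advanced recovery", 8),
    ("reports", "standard recovery", 16),
    ("sandbox", "no recovery", 6),
]

standby_workers = workers_needed(workloads, 4, ALWAYS_ON)  # steady-state backup size
full_workers = workers_needed(workloads, 4)                # size after scale-up
```

The gap between the two numbers is the capacity that is added only during a disaster or before an upgrade, which is where the cost saving comes from.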
How does the failover happen? For example, using Code Engine:
Two Code Engine projects are provisioned, one in US South (primary MZR) and the other in US East (backup MZR). During normal operation, all traffic is routed to the project in US South. To manage traffic between the two MZRs, follow the Code Engine instructions.
Any configuration that is applied to one Code Engine project must be applied to both projects to maintain consistency between the primary and backup instances.
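Because the configuration is kept consistent by hand, a periodic drift check between the two projects can help. The following is a minimal sketch that assumes the configuration of each project has already been exported into a flat dictionary by some other means. The keys and values shown are hypothetical and do not represent a real Code Engine export format.

```python
def config_drift(primary: dict, backup: dict, ignore: frozenset = frozenset()) -> dict:
    """Return {key: (primary_value, backup_value)} for every differing key.

    A key that is missing on one side is reported with None for that side.
    Keys in `ignore` are skipped, for settings that differ on purpose.
    """
    drift = {}
    for key in sorted((primary.keys() | backup.keys()) - ignore):
        primary_value = primary.get(key)
        backup_value = backup.get(key)
        if primary_value != backup_value:
            drift[key] = (primary_value, backup_value)
    return drift


# Hypothetical exported settings for the two projects.
primary_cfg = {
    "image": "registry.example/app:1.4.2",
    "max_scale": 10,
    "min_scale": 2,
    "env.LOG_LEVEL": "info",
}
backup_cfg = {
    "image": "registry.example/app:1.4.1",
    "max_scale": 10,
    "min_scale": 0,  # standby is intentionally scaled down
}

drift = config_drift(primary_cfg, backup_cfg, ignore=frozenset({"min_scale"}))
```

In this example the check reports a stale image version and a missing environment variable in the backup project, while the intentional difference in minimum scale is ignored.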
In the event that a failover needs to be triggered, these are the basic steps to follow.
- Make sure that the app is deployed in Code Engine in the backup MZR.
- Make sure that any data that is needed by the app is available in the backup MZR. This can be accomplished in different ways, depending on the service: IBM Cloudant has bidirectional replication, PostgreSQL uses read-only replicas that can be promoted to be leaders, and MongoDB and Redis rely on backup and restore procedures.
- If necessary, scale up the application in the backup MZR to handle the expected load.
- When the data is ready and the Code Engine application is appropriately scaled, start the application.
- Update the DNS CNAME record to point to the endpoint of the Code Engine application in the backup MZR.
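The steps above can be sketched as an ordered runbook that stops at the first step that fails, so that DNS is switched only after the application and data are ready. This is an illustration only: `run_failover` is a hypothetical helper, and each step is a placeholder callable that would be replaced by real checks or actions in your environment.

```python
from typing import Callable

Step = tuple[str, Callable[[], bool]]


def run_failover(steps: list[Step]) -> list[str]:
    """Run steps in order; raise at the first step that reports failure."""
    completed = []
    for name, action in steps:
        if not action():
            raise RuntimeError(f"failover halted at step: {name}")
        completed.append(name)
    return completed


# Placeholder actions; each lambda stands in for a real check or operation.
steps: list[Step] = [
    ("verify app is deployed in backup MZR", lambda: True),
    ("verify data is available in backup MZR", lambda: True),
    ("scale up app in backup MZR", lambda: True),
    ("start app in backup MZR", lambda: True),
    ("update DNS CNAME to backup endpoint", lambda: True),
]

completed = run_failover(steps)
```

Keeping the DNS update last means that user traffic never reaches a backup application that is missing data or capacity.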
Continuous availability profile application - IBM Cloudant
In this scenario, IBM Cloudant will replicate the data automatically. In the event of a disaster, the only change that is needed will be to reconfigure DNS to point to the cluster in US East.
- Benefits
- Fastest recovery time, dependent only on the time it takes to verify that the data was replicated and make the DNS switch.
- Impacts
- Cost. To support the RTO, the only viable option is the active/hot-standby model.
- Development teams need to update their deployment pipelines to also deploy their apps to the standby IBM Cloud Code Engine project every time they deploy to production, and to validate that the deployment was successful.
Continuous availability profile application - PostgreSQL
In this scenario, the data is replicated automatically by IBM Cloud Databases to a read-only replica in the backup MZR. In the event of a disaster, the application team needs to manually trigger a promotion of the read-only replica to become the leader. This action takes some time because a read-only replica of the service instance is not configured with a high availability (HA) topology. When the promotion occurs, several steps happen to elevate the instance to an HA configuration.
The only other change that is needed is to reconfigure DNS to point to the cluster in US East when the database promotion is complete.
- Benefits
- Faster recovery time, because the data is already replicated to the backup MZR. There is still some delay while the database is reconfigured to an HA configuration.
- Impacts
- Cost. To support the recovery time objective, the only viable option is the active/hot-standby model.
- Development teams need to update their deployment pipelines to also deploy their apps to the standby IBM Cloud Code Engine project every time they deploy to production, and to validate that the deployment was successful.
- Manual intervention is required to trigger and monitor the promotion of the read-only replica to leader status.
For more details, see Configuring read-only replicas.
Here are the basic steps:
- Create an instance in the primary MZR (for example, Dallas).
- Create a read-only replica in the DR MZR (for example, Washington, DC). The read-only replica is a single-zone instance.
- If the primary MZR is unavailable, promote the read-only replica to leader. The DR MZR now hosts the leader.
- Promotion updates the configuration in the DR MZR to be MZR resilient, meaning that additional nodes are added.
- Promotion takes a full backup of the database in the DR MZR.
- The instance in the DR MZR becomes the leader, and ties to the original leader in the primary MZR are broken.
- The original instance in the primary MZR can be deleted. This deletes all backups in the primary MZR.
- If the original instance is not deleted (in which case its backups are still available), a new instance can be created by restoring from a backup, should that become necessary.
- The backups that are taken in the original MZR are still accessible even if that MZR is unavailable, because they are stored in cross-regional IBM Cloud Object Storage buckets.
- If a read-only replica is promoted to leader, the original leader is no longer viable.
- To move the data or workload back to the original MZR, create a read-only replica in the original MZR and promote it to be the leader. This creates a new instance of the database; it does not restore the original instance.
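The one-way nature of promotion can be made concrete with a small in-memory model. This is a toy illustration of the steps above, not a representation of the Cloud Databases API: the `Instance` class, the `promote` function, the role names, and the region identifiers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    region: str
    role: str        # "leader", "replica", or "retired"
    multizone: bool  # True when deployed with an HA topology


def promote(leader: Instance, replica: Instance) -> Instance:
    """Promote a read-only replica; the previous leader is no longer viable."""
    if replica.role != "replica":
        raise ValueError("only a read-only replica can be promoted")
    leader.role = "retired"   # ties to the original leader are broken
    replica.role = "leader"
    replica.multizone = True  # additional nodes are added during promotion
    return replica


# Failover: the single-zone replica in the DR MZR becomes the leader.
original = Instance("us-south", "leader", multizone=True)
dr_replica = Instance("us-east", "replica", multizone=False)
current = promote(original, dr_replica)

# Failback: a NEW replica is created in the original MZR and promoted.
failback_replica = Instance("us-south", "replica", multizone=False)
current = promote(current, failback_replica)
```

After failback, the leader is in the original region again, but it is a different instance from the one that existed before the disaster, so anything that references the old instance must be updated.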
Additional database options
Cloud Databases offers several open source database systems as fully managed services. They are:
- Databases for PostgreSQL
- Databases for EnterpriseDB
- Databases for Redis
- Databases for Elasticsearch
- Databases for etcd
- Databases for MongoDB
- Messages for RabbitMQ
All of these services share the same characteristics:
- For high availability, they are deployed in clusters. Details can be found in the documentation of each service.
- Each cluster is spread over multiple zones.
- Data is replicated across the zones.
- Users can scale up storage and memory resources for an instance. See the example in the documentation on scaling for Databases for Redis for details.
- Backups are taken daily or on demand. Details are documented for each service. Here is an example of backup documentation for Databases for PostgreSQL.
- Data at rest, backups, and network traffic are encrypted.
- Each service can be managed using the Cloud Databases CLI plug-in.
Continuous availability profile application - IBM Cloud Object Storage
In this scenario, the data is always available in both MZRs through cross-regional buckets. Therefore, the only change that is needed is to update the DNS routing to point to the backup MZR.
Advanced recovery profile
The advanced recovery class requires that the platform is available in less than an hour in the event of a disaster where the primary MZR becomes unavailable. To achieve this with IBM Cloud Code Engine, it is necessary to have a fully configured instance of the IBM Cloud Code Engine project up and running in hot standby mode in the backup MZR. This includes:
- Platform
- All applications deployed with the same version as deployed in primary MZR
- Application
- All applications deployed with the same version as deployed in primary MZR
- Dependent services provisioned and necessary data replication strategy in place (bidirectional replication, read-only replica, backup and restore, and so on)
- Service credentials and service bindings
When planning for an advanced recovery profile application, more options are available to application owners in terms of whether preprovisioning of resources is required. There is a tradeoff between the cost of maintaining preprovisioned compute capacity, such as IBM Cloud Code Engine or Kubernetes Service, and the risk that provisioning or scaling up the needed capacity takes longer than the RTO.
Cloud capacity is not infinite. It is important to consider the possibility of capacity constraints in the backup MZR in the event the entire primary MZR is lost.
Recovery times for databases that use backup and restore depend on the size of the backup. Application owners need to take this into account when determining their RTO requirements.
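A rough estimate shows how backup size interacts with the RTO. The sketch below assumes a constant restore throughput plus a fixed overhead for provisioning and cutover. Both figures are placeholders that need to be measured in a real restore test; they are not published IBM Cloud numbers.

```python
def restore_estimate(backup_gb, restore_gb_per_min, overhead_min, rto_min):
    """Return (estimated_minutes, fits_within_rto) for a backup-and-restore recovery."""
    estimated = overhead_min + backup_gb / restore_gb_per_min
    return estimated, estimated <= rto_min


# Hypothetical inputs: 5 GB/min restore rate, 10 min fixed overhead, 60 min RTO.
small_db = restore_estimate(backup_gb=200, restore_gb_per_min=5, overhead_min=10, rto_min=60)
large_db = restore_estimate(backup_gb=600, restore_gb_per_min=5, overhead_min=10, rto_min=60)
```

Under these assumed figures, the smaller database restores within the objective and the larger one does not, which would push the larger one toward a replication-based strategy or a longer RTO.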
Advanced recovery profile application - MongoDB
In this scenario, the data is not replicated to the backup MZR. When a disaster is declared, the application team needs to create a new database instance in the backup MZR by restoring the data from backup.
The backup from the database instance in the primary MZR is available in the backup MZR even if the primary MZR is completely unavailable.
After the data is restored, the only other change that is needed is to change the DNS routing to point to the backup MZR.
Advanced recovery profile application - Bare Metal Servers and Virtual Servers on Classic Infrastructure
Virtual Servers and Bare Metal Servers offer the capability to achieve a multi-region architecture. You can provision servers in multiple locations on IBM Cloud.
When preparing for such an architecture using Virtual Servers and Bare Metal Servers, consider file storage, backups, recovery, and databases, including whether to use a database as a service or to install a database on a virtual server.
The following architecture demonstrates the deployment of a multi-region architecture using Virtual Servers in an active/passive architecture where one region is active and the second region is passive.
The components that are required for such architecture are as follows:
- Users access the application through IBM Cloud Internet Services (CIS).
- CIS routes traffic to the active location.
- Within a location, a load balancer redirects traffic to a server.
- Databases are deployed on a virtual server. Backup is enabled and replication is set up between regions. The alternative is to use a database as a service, a topic discussed later in the tutorial.
- IBM Cloud File Storage for Classic stores the application images and files. File Storage for Classic can take a snapshot at a given date and time. The snapshot can then be reused in another region, which is a manual step.
The tutorial Use Virtual Servers to build highly available and scalable web app implements this architecture.
Back up and restore procedures
Refer to Managing Cloud Databases backups for the backup and restore procedures.