Cyber resiliency pattern on VPC
The cyber resiliency reference architecture provides an overview and details for designing a secure cyber recovery solution on Virtual Private Cloud (VPC). Ransomware attacks attempt to encrypt, exfiltrate, or otherwise render primary and backup copies of data and configuration inoperable. The key objective of cyber recovery is protection: backing up workloads to a secure data bunker and validating the available recovery points in an isolated cleanroom environment. In the event of a ransomware attack, valid recovery points are then restored to a new and clean recovery environment. Unlike backup and DR solutions, where the primary objective might be a low RPO, the focus here is on clean recovery and on returning the business to working order swiftly and effectively.
The reference architecture is built on the VSI on VPC secure landing zone architecture and incorporates many of the cyber resiliency principles that are outlined in the Well Architected Framework.
Architecture diagram
The high-level architecture diagram gives an overview of the different environments and components that constitute the solution. It provides a generic overview of the requirements that drive the architectural decisions.
- The source environment with the workloads to protect might be on premises or on a public cloud. Similarly, the recovery environment where the workloads are restored might be on premises or on the same or a different public cloud. For the rest of this reference architecture, we consider IBM Cloud VPC as both the source and recovery environment.
- Management VPC hosts the secure access components: VPN Gateway and Bastion Host. External clients connect to the cyber resiliency environment through the management VPC. All access is logged for security audit. Identity and access management restricts access to this environment to the cyber admins team.
- DataMover VPC is a virtually air-gapped environment that hosts the data mover component. It backs up data from the source environment by using a pull mechanism and stores the snapshots in immutable storage.
- Cleanroom VPC is an isolated environment that hosts tools to analyze the backup snapshots, restore them, and identify clean recovery points. It can be torn down and rebuilt as needed.
- A recovery environment that runs on a separate VPC hosts the minimal set of workloads (and data) for business continuity in the event of a cyberattack. When needed, this environment is built as a clean infrastructure into which a valid and verified recovery point is restored, and its endpoints are exposed in lieu of the source environment.
The following detailed reference architecture expands the high-level diagram for a single zone deployment. For a multi-zone deployment, use the principles laid out in the multi-zone resiliency patterns.
- The production VPC (which maps to the source environment discussed earlier) shows a nonexhaustive list of workloads that need to be protected. It might be in a separate IBM Cloud account within the enterprise account structure.
- The choice of the Data Mover component, which does the backup and restore, is left open. The architecture allows for different products with different deployment models, workload support, and functional capabilities.
- Transit Gateway connects the DataMover VPC to the Production VPC and to the Recovery VPC, and connects the Management VPC to the other VPCs.
- Virtual Air-Gap is enforced through VPC Access Control Lists and Security Groups. Alternatively, a network firewall appliance might provide these functions.
- All VPC resources use Virtual Private Endpoints to connect to Cloud services, ensuring that all traffic remains on the private backbone.
- The Air-Gap can be extended to Cloud Services, such as Cloud Object Storage, by using Context Based Restrictions (CBRs) that allow access to the direct endpoints of the service instances only from the Data Mover VPC.
- Cloud Object Storage buckets are created with versioning and the Object Lock feature enabled to provide Write-Once-Read-Many (WORM) capability. Multiple buckets can be created, each with policies that match the data retention and immutable storage requirements for the workload backups.
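As a rough illustration of the virtual air-gap, the following sketch models an access control list as an ordered rule list with first-match evaluation, and defines one rule set for the closed state and one for the open state. It is a plain-Python model under stated assumptions, not a call into any IBM Cloud API: the subnet ranges and the names `Rule`, `is_allowed`, `CLOSED`, and `OPEN` are hypothetical, and ports, protocols, and traffic direction are deliberately left out.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network

# Hypothetical address plan, chosen only for this example.
DATA_MOVER_SUBNET = "10.20.0.0/24"
SOURCE_SUBNET = "10.10.0.0/24"
MANAGEMENT_SUBNET = "10.30.0.0/24"
ANY = "0.0.0.0/0"


@dataclass(frozen=True)
class Rule:
    action: str       # "allow" or "deny"
    source: str       # source CIDR
    destination: str  # destination CIDR

    def matches(self, src: str, dst: str) -> bool:
        return (ip_address(src) in ip_network(self.source)
                and ip_address(dst) in ip_network(self.destination))


def is_allowed(rules, src: str, dst: str) -> bool:
    """Evaluate rules in order; the first matching rule decides."""
    for rule in rules:
        if rule.matches(src, dst):
            return rule.action == "allow"
    return False  # nothing matched: treat as denied


# Closed air-gap: only management traffic reaches the data mover.
CLOSED = [
    Rule("allow", MANAGEMENT_SUBNET, DATA_MOVER_SUBNET),
    Rule("allow", DATA_MOVER_SUBNET, MANAGEMENT_SUBNET),
    Rule("deny", ANY, ANY),
]

# Open air-gap: additionally allow data mover <-> selected source subnet.
OPEN = [
    Rule("allow", MANAGEMENT_SUBNET, DATA_MOVER_SUBNET),
    Rule("allow", DATA_MOVER_SUBNET, MANAGEMENT_SUBNET),
    Rule("allow", DATA_MOVER_SUBNET, SOURCE_SUBNET),
    Rule("allow", SOURCE_SUBNET, DATA_MOVER_SUBNET),
    Rule("deny", ANY, ANY),
]

if __name__ == "__main__":
    for name, rules in (("closed", CLOSED), ("open", OPEN)):
        print(name, is_allowed(rules, "10.20.0.5", "10.10.0.7"))
```

Keeping the two states as complete rule sets, rather than adding and removing individual rules, makes it easier to confirm that closing the air-gap returns the environment to a known baseline.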
The Flow
The detailed architecture diagram shows the steps that are followed to back up, verify, and recover workloads. Initially, using secure access through the bastion host, the data mover components are deployed and configured to protect workloads in the source environment. By default, the virtual air-gap is closed, with rules that deny external access to data mover resources.
- At scheduled intervals, as a pre-step to the backup task, the virtual air-gap is opened by applying rules that allow traffic in and out of the DataMover VPC to selected hosts or subnets in the source environment.
- The Data Mover component runs the backup task, often with the help of agents that are installed on workload endpoints, to pull data into local storage. The local storage might act as a cache that holds the data before it is moved to permanent storage.
- After the backup task completes, the virtual air-gap is closed by applying rules that deny all but necessary management traffic.
- The backup snapshots are written to Cloud Object Storage buckets as either the primary or a secondary immutable copy of the data. This operation might be asynchronous or synchronous with the backup task, depending on its SLA.
- A cleanroom VPC is provisioned to verify the latest snapshot for ransomware or malware signatures. It is a clean and isolated environment and can be destroyed partially or fully after verification.
- Verification of backups can be offline or online. Offline verification uses a mount server with forensic tools that analyze the backup for ransomware signatures, encryption patterns, or entropy scores, and might be faster when data sizes are large. In some cases, the snapshot is restored into a live instance, and the application is deployed and tested to verify the backup. Successfully verified backups of related workloads represent a clean logical recovery point group to be restored together.
- When a cyberattack is detected on the source environment, automation tools are used to provision a clean recovery environment. The recovery environment might be identical to the source environment or be a subset that is sufficient to bring up the minimal set of required workloads.
- A clean recovery point group is selected and restored to bring up the workloads and expose them in place of the infected source environment.
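The entropy scores mentioned in the offline verification step can be illustrated with a short sketch. It computes the Shannon entropy of a byte sequence in bits per byte and flags data that looks uniformly random, which is typical of encrypted content. The `looks_encrypted` helper and its 7.5 threshold are assumptions made for this example, not values from any forensic product, and compressed files also score high, so a real tool would combine this signal with others.

```python
import math
import os
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of data in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # frequency of each byte value
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_encrypted(data: bytes, threshold: float = 7.5) -> bool:
    """Flag data whose entropy is at or above a hypothetical threshold."""
    return shannon_entropy(data) >= threshold


# Repetitive plain text scores low; random bytes score close to 8.
plain = b"customer,order,amount\n" * 2000
random_like = os.urandom(64 * 1024)

if __name__ == "__main__":
    print("plain text :", round(shannon_entropy(plain), 2), looks_encrypted(plain))
    print("random data:", round(shannon_entropy(random_like), 2), looks_encrypted(random_like))
```

Comparing the score of a file across successive snapshots is one possible use: a sudden jump for files that were previously low-entropy could indicate that they were encrypted between backups.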
The new recovery environment now becomes the source, and the Data Mover component is reconfigured to start protecting these workloads. When forensic analysis and cleaning of the original source environment are complete, the workloads might be restored back to it, or the recovery environment itself might be promoted as the new production environment.
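The scheduled backup cycle from the flow, which resumes against the new source, can be sketched as an open, back up, close sequence. The sketch uses a context manager so that the virtual air-gap is closed again even when the backup task fails. Everything here is hypothetical: `AirGap` stands in for whatever mechanism applies the ACL or security group rules, and `backup` and `copy_to_immutable` are placeholder callables for the data mover product. Copying to immutable storage after the air-gap closes is one possible ordering, since the flow allows the copy to be synchronous or asynchronous.

```python
from contextlib import contextmanager


class AirGap:
    """Hypothetical stand-in for the component that applies air-gap rules."""

    def __init__(self):
        self.state = "closed"  # closed by default
        self.log = []

    def open(self, targets):
        self.state = "open"
        self.log.append(("open", tuple(targets)))

    def close(self):
        self.state = "closed"
        self.log.append(("close",))


@contextmanager
def opened(air_gap, targets):
    """Open the air-gap for the given targets and always close it again."""
    air_gap.open(targets)
    try:
        yield air_gap
    finally:
        air_gap.close()


def run_backup_cycle(air_gap, targets, backup, copy_to_immutable):
    """Run one cycle: open air-gap, pull backup, close air-gap, store copy."""
    with opened(air_gap, targets):
        snapshot = backup(targets)
    # The air-gap is closed again before the snapshot is copied.
    return copy_to_immutable(snapshot)


if __name__ == "__main__":
    gap = AirGap()
    result = run_backup_cycle(
        gap,
        ["10.10.0.7"],                              # hypothetical source host
        backup=lambda hosts: {"hosts": list(hosts)},
        copy_to_immutable=lambda snap: ("stored", snap),
    )
    print(result, gap.state, gap.log)
```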
Design concepts
The Architecture Design Framework provides a consistent approach to designing cloud solutions by addressing requirements across a set of "aspects" and "domains", which are technology-agnostic architectural areas that need to be considered for any enterprise solution. See Introduction to the Architecture Design Framework for more details.
Requirements
The following table outlines the requirements that are addressed in this architecture.
Aspect | Requirements |
---|---|
Compute | |
Storage | |
Networking | |
Security | |
Resiliency | |
Service Management | |
Components
The following table outlines the products or services that are used in the architecture for each aspect.
Aspects | Architecture components | How the component is used |
---|---|---|
Compute | VPC VSIs | Data mover components are deployed on a cluster of VSIs to provide scale and high availability |
Storage | VPC Block Storage | Block Storage for VPC VSI |
Storage | Cloud Object Storage | Cloud Object Storage buckets with Object Lock enabled |
Networking | VPC Virtual Private Network (VPN) | Remote access to manage resources in a private network |
Networking | Transit Gateway | Connects across VPCs and source and recovery environments |
Networking | Virtual Private Gateway & Virtual Private Endpoint (VPE) | For private network access to Cloud Services, for example, Key Protect, Cloud Object Storage, and so on |
Networking | Public Gateway | For resource access to the internet |
Security | IAM | IBM Cloud Identity & Access Management |
Security | BYO Bastion Host on VPC VSI | Remote access with Privileged Access Management |
Security | VPCs, Subnets, Security Groups and ACLs | Network isolation and Virtual Air-Gap |
Security | Key Protect or HPCS | Hardware security module (HSM) and Key Management Service |
Security | Secrets Manager | Certificate and Secrets Management |
Security | Context Based Restrictions | Enforce access restrictions for service instances based on a rule's criteria |
Delivery pipeline | Toolchain | Pipeline for build and deploy |
Resiliency | Cloud Object Storage | Data archived in Cloud Object Storage cross-region buckets |
Service Management | IBM Cloud Monitoring | Apps and operational monitoring |
Service Management | IBM Log Analysis | Apps and operational logs |
Service Management | Activity Tracker Event Routing | Audit logs |
Service Management | Schematics | Automated deployment that uses Deployable Architecture and Ansible actions |