Understanding high availability and disaster recovery for Databases for PostgreSQL
High availability (HA) is the ability of a service or workload to remain operational and accessible in the presence of unexpected failures, continuing to provide processing capability according to a predefined service level. For services, availability is defined in the Service Level Agreement, and it covers both planned and unplanned events, such as maintenance, failures, and disasters. Disaster recovery (DR) is the process of recovering a service instance to a working state after a rare, major incident or wide-scale failure, such as a physical disaster that affects an entire region, corruption of a database, or the loss of a service contributing to a workload, where the impact exceeds what the high availability design can handle.
Databases for PostgreSQL is a regional service that fulfills the defined Service Level Objectives (SLO) with the Standard plan; for details, see the Service Level Agreement (SLA). For the available IBM Cloud regions and data centers for Databases for PostgreSQL, see Service and infrastructure availability by location.
High availability architecture
Databases for PostgreSQL provides replication, failover, and high availability features to protect your databases and data from infrastructure maintenance, upgrades, and some failures. Deployments contain a cluster with two data members: a leader and a replica. The replica is kept up to date through asynchronous replication, and a distributed consensus mechanism maintains cluster state and handles failovers. If the leader becomes unreachable, the cluster initiates a failover: the replica is promoted to leader, and a new member joins the cluster as a replica. The leader and replica are always placed in different zones of a multizone region (MZR). If the replica fails, a new replica is created. If a zone failure causes a member to fail, the new replica is created in a surviving zone.
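When your application connects, it can verify which role the member it reached is serving. The following minimal sketch, assuming psycopg2 and a hypothetical `DATABASE_URL` environment variable that holds your deployment's connection string, uses the standard PostgreSQL `pg_is_in_recovery()` function to distinguish the leader from a replica:

```python
# Minimal sketch: report whether the connected member is the leader or a
# replica. DATABASE_URL is a hypothetical environment variable holding the
# deployment's connection string.
import os

import psycopg2

conn = psycopg2.connect(os.environ["DATABASE_URL"])
with conn.cursor() as cur:
    # pg_is_in_recovery() is true on a streaming replica, false on the leader
    cur.execute("SELECT pg_is_in_recovery()")
    is_replica = cur.fetchone()[0]
    print("replica" if is_replica else "leader")
conn.close()
```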
You can extend high availability further by adding PostgreSQL members to the cluster for greater in-region redundancy, or by provisioning read-only replicas for cross-regional failover or read offloading.
Review the PostgreSQL documentation on replication techniques to understand the constraints and tradeoffs that are associated with the asynchronous replication strategy that is deployed by default.
In scenarios where a database becomes critically unhealthy, such as a server crash on the leader, Databases for PostgreSQL attempts a failover. This auto-failover capability is capped at 16 MB of data lag from leader to replica (roughly a few rows of data once PostgreSQL storage overhead is accounted for) and is not performed if the lag threshold is exceeded. If the potential loss of up to 16 MB of data is intolerable for your application, see Synchronous replication below.
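To observe how close a replica is running to that threshold, you can query the standard `pg_stat_replication` view on the leader. A minimal sketch, assuming psycopg2, a role that is permitted to read `pg_stat_replication`, and a placeholder connection string:

```python
# Minimal sketch: compare per-replica streaming lag on the leader against the
# 16 MB auto-failover cap described above. The connection string is a
# placeholder; run this against the leader.
import psycopg2

LAG_CAP_BYTES = 16 * 1024 * 1024  # auto-failover is skipped above this lag

conn = psycopg2.connect("postgresql://admin:PASSWORD@HOST:PORT/ibmclouddb?sslmode=verify-full")
with conn.cursor() as cur:
    cur.execute(
        """
        SELECT application_name,
               pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
        FROM pg_stat_replication
        """
    )
    for name, lag_bytes in cur.fetchall():
        status = "over cap" if lag_bytes is not None and lag_bytes > LAG_CAP_BYTES else "ok"
        print(f"{name}: {lag_bytes} bytes behind leader ({status})")
conn.close()
```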
Workloads that programmatically access the cluster must follow the client availability retry logic to maintain availability.
The service performs controlled failovers at times under normal operation; service maintenance, for example, triggers a controlled failover. These failovers are no-data-loss events, but they reset active connections, and there is a period of up to 15 seconds where reconnection attempts can fail. Unplanned failovers might also occur due to unforeseen events in the operating environment; these can take up to 45 seconds, but generally complete in less than 30.
High availability features
Databases for PostgreSQL supports the following high availability features:
Feature | Description | Consideration |
---|---|---|
Automatic failover | Standard on all clusters and resilient against a zone or single member failure | |
Member count | Minimum of two members; the default Standard deployment has two members. A two-member cluster automatically recovers from a single member or zone failure (with data loss up to the lag threshold). While a new replica is synchronizing, the cluster is exposed to data loss from a second failure. A three-member cluster (see adding PostgreSQL members) is resilient to the failure of two members within the same failure period. | Three members are required for synchronous replication |
Synchronous replication | Improves RPO by adding remote member sync to the data write path. Refer to Synchronous replication below. | Performance impact and cost. |
Read-only replica | Read-only replicas can provide local access in remote regions, improving availability when network latency or connectivity issues affect access to the primary region. | All write requests must be directed exclusively to the read-write cluster associated with the read replica |
Synchronous replication in Databases for PostgreSQL
By default, streaming replication is asynchronous. If the leader crashes, some transactions that were committed might not have been replicated to the replica, causing data loss. Cloud Databases keeps this potential data loss to a minimum (see the lag threshold above); however, synchronous replication offers the ability to confirm that all changes made by a transaction have been synced to a replica, ensuring consistency across the cluster. This consistency comes from confirming that writes are written to a secondary before returning success to the connecting client. For the variables that control synchronous replication, see synchronous_commit on the Changing Configuration page.
Synchronous replication brings replica availability into the primary write path: if no replica is available to acknowledge a write, the write hangs until one becomes available. For this reason, synchronous replication requires at least three members to function reliably and is not supported on two-member deployments. You must horizontally scale to at least three members before enabling synchronous replication; see adding PostgreSQL members.
While unlikely, it is possible that more than one replica might become unavailable simultaneously. If this happens, the primary database will not be able to complete any writes until a replica comes back online, effectively blocking all write traffic to your database. When you decide to use synchronous replication, weigh the relative costs and benefits of higher data durability versus potential availability issues.
Configuring synchronous replication can significantly increase write latency and reduce overall throughput. For optimal performance, use synchronous replication only on the specific databases or workloads that require the highest degree of data durability.
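One way to confine that cost is to raise durability only where it matters: standard PostgreSQL lets a session or transaction override `synchronous_commit`. The following minimal sketch assumes that synchronous replication is already enabled at the deployment level and uses a hypothetical `payments` table and a placeholder connection string:

```python
# Minimal sketch: request synchronous durability for one critical transaction
# while leaving other writes on the deployment's default setting. The table
# name and connection string are hypothetical.
import psycopg2

conn = psycopg2.connect("postgresql://admin:PASSWORD@HOST:PORT/ibmclouddb?sslmode=verify-full")
with conn:  # commits on success, rolls back on error
    with conn.cursor() as cur:
        # SET LOCAL applies only to the current transaction
        cur.execute("SET LOCAL synchronous_commit = 'on'")
        cur.execute(
            "INSERT INTO payments (account_id, amount) VALUES (%s, %s)",
            (42, 19.99),
        )
conn.close()
```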
Disaster recovery architecture
The general strategy for disaster recovery is to create a new database, like the Restore database below. The new database can be populated from a backup of the source database that was created before the disaster. If the production database is still available, the new database can instead be created with the point-in-time restore feature.
Disaster recovery features
Databases for PostgreSQL supports the following disaster recovery features:
Feature | Description | Consideration |
---|---|---|
Backup restore | Create a database from a previously created backup; see Managing Cloud Databases backups. | New connection strings for the restored database must be referenced throughout the workload. |
Point-in-time restore | Create a database from the live production database by using point-in-time recovery. | This is only possible if the active database is available and the disaster falls within the supported RPO window; it is not useful if the production cluster is unavailable. New connection strings for the restored database must be referenced throughout the workload. |
Promote read replica | Create a read-only replica in the same or a remote region when planning for a disaster, then promote the read-only replica to recover from one. | A previously created read replica must be available. New connection strings for the promoted database must be referenced throughout the workload. |
Planning for disaster recovery
The disaster recovery steps must be practiced regularly. As you build your plan, consider the following failure scenarios and resolutions.
Failure | Resolution |
---|---|
Hardware failure (single point) | IBM provides a database that is resilient from a single point of hardware failure within a zone; no configuration is required. |
Zone failure | [Automatic failover](#postgresql-high-availability). The database members are distributed between zones. Configuring three members provides additional resiliency against multiple zone failures. Synchronous replication reduces RPO at the expense of performance. |
Data corruption | Backup restore or point-in-time restore. Use the restored database in production, or as source data to correct the corruption in the production database. |
Regional failure | Backup restore: use the restored database in production. Promote read replica: promote a read-only replica to a read/write database and use it in production. |
Application-level high availability
Applications that communicate over networks and cloud services are subject to transient connection failures. You want to design your applications to retry connections when errors are caused by a temporary loss in connectivity to your deployment or to IBM Cloud.
Because Databases for PostgreSQL is a managed service, regular updates and database maintenance occur as part of normal operations. Maintenance can occasionally cause short intervals where your database is unavailable, and it can also trigger a graceful failover that requires clients to retry and reconnect. It takes a short time for the cluster to determine which member is a replica and which is the leader, so you might also see a short connection interruption. Failovers generally take less than 30 seconds.
Your applications must be designed to handle temporary interruptions to the database, implement error handling for failed database commands, and implement retry logic to recover from a temporary interruption.
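A minimal sketch of such retry logic, assuming psycopg2 and a hypothetical `DATABASE_URL` environment variable; the attempt count and backoff values are illustrative and should be tuned to your workload:

```python
# Minimal sketch: retry a query with exponential backoff so that a controlled
# or unplanned failover (typically under 30 seconds) does not surface as an
# application error. DATABASE_URL is a hypothetical environment variable.
import os
import time

import psycopg2


def run_with_retry(sql, params=None, attempts=5):
    delay = 1  # seconds between retries, doubled on each failure
    for attempt in range(attempts):
        conn = None
        try:
            conn = psycopg2.connect(os.environ["DATABASE_URL"], connect_timeout=10)
            with conn.cursor() as cur:
                cur.execute(sql, params)
                return cur.fetchall()
        except psycopg2.OperationalError:
            # Connections are refused or reset while a new leader is elected
            if attempt == attempts - 1:
                raise
            time.sleep(delay)
            delay = min(delay * 2, 16)
        finally:
            if conn is not None:
                conn.close()


print(run_with_retry("SELECT now()"))
```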
Extended periods of database unavailability or connection interruption are not expected. If you experience more than a minute with no connectivity, open a support case with details so that we can investigate.
Connection limits
Databases for PostgreSQL sets the maximum number of connections to your PostgreSQL database to 115. Of these, 15 connections are reserved for the superuser to maintain the state and integrity of your database, and 100 connections are available for you and your applications. After the connection limit is reached, any attempt to start a new connection results in an error. To prevent overwhelming your deployment with connections, use connection pooling, or scale your deployment and increase its connection limit. See the Managing PostgreSQL connections page for more information.
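As an illustration of pooling, the following sketch shares a bounded psycopg2 connection pool instead of opening a connection per request. The pool sizes and the `DATABASE_URL` environment variable are assumptions to adapt to your workload:

```python
# Minimal sketch: share a bounded pool so the application stays well below the
# 100 connections available to it. Pool sizes and DATABASE_URL are assumptions.
import os

from psycopg2.pool import ThreadedConnectionPool

pool = ThreadedConnectionPool(
    minconn=2,
    maxconn=20,  # keep the total across all app instances under the limit
    dsn=os.environ["DATABASE_URL"],
)

conn = pool.getconn()
try:
    with conn.cursor() as cur:
        cur.execute("SELECT 1")
        print(cur.fetchone())
finally:
    pool.putconn(conn)  # return the connection for reuse instead of closing it
```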
Your responsibilities for HA and DR
The following information can help you create and continuously practice your plan for HA and DR.
When restoring a database from backups or using point-in-time restore, a new database is created with new connection strings. Existing workloads and processes must be adjusted to consume the new connection strings. Promoting a read replica to a standalone cluster has a similar impact, although read-only portions of the workload that already use the replica are not affected.
A recovered database might also need the same customer-created dependencies as the original database; make sure that these and other services exist in the recovery region:
- IBM® Key Protect for IBM Cloud®
- Hyper Protect Crypto Services
Remember that deleting a database also deletes its associated backups. However, deleted databases may be recoverable within a limited timeframe. Refer to the documentation for specific details on database recovery procedures.
It is not possible to copy backups off IBM Cloud, so consider using database-specific tools, such as pg_dump, for additional backups. These might be required to recover from a malicious database deletion followed by a reclamation-delete of the database. Careful management of IAM access to databases can help reduce exposure to this problem.
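As one possible approach, the following sketch drives the standard `pg_dump` client from Python to produce a logical backup that you can store outside IBM Cloud. It assumes that the PostgreSQL client tools are installed where it runs; the output path and the `DATABASE_URL` environment variable are assumptions:

```python
# Minimal sketch: take a supplementary logical backup with pg_dump for storage
# outside IBM Cloud. Output path and DATABASE_URL are assumptions.
import os
import subprocess
from datetime import datetime, timezone

stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
outfile = f"/backups/ibmclouddb-{stamp}.dump"

# --format=custom writes a compressed archive restorable with pg_restore
subprocess.run(
    ["pg_dump", "--format=custom", f"--file={outfile}", os.environ["DATABASE_URL"]],
    check=True,
)
print(f"wrote {outfile}")
```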
The following checklist associated with each feature can help you create and practice your plan.
- Backup restore
- Verify backups are available at the desired frequency to meet RPO requirements. Managing Cloud Databases backups documents backup frequency. Consider a script using IBM Cloud® Code Engine - Working with the Periodic timer (cron) event producer to create additional on-demand backups to improve RPO if the criticality and size of the database allow. However, given PostgreSQL's PITR capabilities, carefully evaluate the need for additional backups.
- There are some restrictions on database restore regions; verify that your restore goals can be achieved by reading Managing Cloud Databases backups.
- Verify that the retention period of the backups meets your requirements.
- Schedule test restores regularly to verify that the actual restore times meet the defined RTO. Remember that database size significantly impacts restore time. Consider strategies to minimize restore times, such as breaking down large databases into smaller, more manageable units and purging unused data.
- Verify that the Key Protect service and any required keys are available in the recovery region.
- Point-in-time restore
- Verify the procedures covered earlier.
- Verify desired backup is in the window.
- Promote read replica
- Verify that a read replica exists in the recovery region.
- Practice the promotion process - create a temporary read replica in the desired region. The temporary replica can be promoted to read/write and some testing performed with little impact to production.
To find out more about responsibility ownership between the customer and IBM Cloud for using Databases for PostgreSQL, see Shared responsibilities for Cloud Databases.
Stay informed: IBM notifications
Updates affecting customer workloads are communicated through IBM Cloud notifications. To stay informed about planned maintenance, announcements, and release notes related to this service, refer to the Monitoring notifications and status page. In addition, regularly review the Version policy page for the latest updates on End-of-Life versions and dates.