»High Availability and Disaster Recovery
Vault running on the HashiCorp Cloud Platform (HCP) is fully managed by HashiCorp and provides push-button deployment, fully managed clusters and upgrades, backups, and monitoring. HCP Vault is designed to avoid downtime whenever possible by using cloud architecture best practices to deliver a highly available environment. HCP Vault’s critical operational infrastructure is hosted on AWS across multiple availability zones (AZs), with data resiliency and on-call support to minimize downtime and support disaster recovery.
This document provides an overview of the key built-in capabilities that support recovery efforts in the event of downtime or a disaster:
- 3-Node Highly Available Clusters
- Data Resiliency
- Encryption Key Ownership
- Backup and Recovery Features
- Support Coverage
- Incident Response
- Disaster Recovery Use Cases Not in Scope
Users who choose HCP Vault entrust HashiCorp to manage Disaster Recovery (DR) and High Availability (HA) of the Vault servers. As part of this managed offering, HashiCorp will use commercially reasonable efforts to maximize the availability of HashiCorp Cloud services, and provide uptime guarantees based on service level agreements (SLA). For details on HCP Vault support packages, visit the Enterprise Support website.
»3-Node Highly Available Clusters
All enterprise production-grade HCP Vault clusters (i.e., Starter, Standard, and Plus) consist of 3 High Availability (HA) nodes spanning different AZs within one region. HCP Vault has a number of orchestration mechanisms in place, including integrated monitoring with leading observability platforms, to ensure nodes are health-checked regularly. Unhealthy nodes are quickly identified, triaged, and replaced or remediated as needed. This fault-tolerant architecture allows the cluster to withstand the failure of an individual node, for example in the event of an isolated hardware failure or an issue within one of the cloud provider's data centers. Regional outages are discussed in more detail below.
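To make the fault-tolerance claim concrete, the sketch below works out how many node failures a cluster can absorb under majority-quorum consensus. It assumes the three nodes form a quorum-based cluster, as Vault's Integrated Storage does; this page does not state which storage backend HCP Vault uses, and the function names are hypothetical.

```python
def quorum_size(nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return nodes // 2 + 1


def failure_tolerance(nodes: int) -> int:
    """Number of nodes that can fail while a majority stays available."""
    return nodes - quorum_size(nodes)


if __name__ == "__main__":
    # Assumption: majority quorum. A 3-node cluster needs 2 healthy nodes,
    # so it can lose 1 node (for example, one AZ) and keep serving requests.
    for n in (1, 3, 5):
        print(f"nodes={n} quorum={quorum_size(n)} tolerated_failures={failure_tolerance(n)}")
```

Under this assumption, a 3-node cluster spread across three AZs tolerates the loss of one node, which matches the single-AZ failure scenario described above; losing two nodes at once would leave no majority.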
»Data Resiliency
All HCP Vault nodes have attached encrypted volumes. Automated snapshots are taken daily for production-grade clusters and stored in encrypted blob storage in the control plane. Users can initiate additional snapshots at any time from the UI. Snapshots currently reside within the US only.
»Encryption Key Ownership
A unique Key Management Service (KMS) cryptographic key is used for automatic unsealing of Vault and encrypting all user snapshots. This key is managed in the user’s dedicated HCP account using the cloud provider’s KMS and is configured to be trusted by the HCP Vault compute instances. This key is managed using carefully crafted, secure policies and all usage is audited. The key is not shared between clusters.
»Backup and Recovery Features
HCP Vault includes several built-in resiliency features in response to outages. This section provides an overview of typical outage scenarios and best practices for users to consider in order to minimize the impact of an outage.
An HCP platform outage does not impact running clusters. The API and UI may be affected, but clusters will remain intact and continue to operate to support established machine-to-machine Vault use cases such as dynamic secrets generation. During a platform outage, snapshots, seal/unseal, and the ability to generate admin tokens will be unavailable. In addition, replication between existing clusters will continue to work, but setting up a new secondary (or any new cluster) will not be available. Production clusters have automated daily snapshots, and users can opt to create snapshots more frequently. Snapshots are stored for up to 30 days after they are created. This retention period is not user-configurable.
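The retention rule above can be sketched as a small date calculation, which may help when planning how far back a restore can reach. This is an illustration only: the helper names are hypothetical, and treating the 30-day limit as an exact cutoff is an assumption, since the text says only "up to 30 days".

```python
from datetime import datetime, timedelta, timezone

# Retention period stated in the text; not user-configurable.
RETENTION = timedelta(days=30)


def expires_at(created: datetime) -> datetime:
    """Latest time a snapshot is expected to still exist (assumed exact cutoff)."""
    return created + RETENTION


def is_retained(created: datetime, now: datetime) -> bool:
    """True if a snapshot created at `created` is still within retention at `now`."""
    return created <= now < expires_at(created)


if __name__ == "__main__":
    created = datetime(2024, 1, 1, tzinfo=timezone.utc)
    now = datetime(2024, 1, 20, tzinfo=timezone.utc)
    print("expires:", expires_at(created).isoformat())
    print("retained:", is_retained(created, now))
```

Because retention cannot be extended, any recovery plan that relies on snapshots older than 30 days is out of scope for the built-in mechanism.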
Best Practice for Admin Tokens: Users can mitigate the risk of being unable to generate admin tokens during a platform outage by configuring appropriate authentication methods in advance, rather than relying on the cluster admin token. See our documentation on Authentication methods for more information.
To understand the impact of a regional cloud provider outage on HCP Vault, it’s important to note how the HashiCorp Virtual Network (HVN) makes HCP networking possible. An HVN allows you to delegate an IPv4 CIDR range to HCP, which the platform then uses to automatically create a private network on the cloud provider. All clusters are placed within the same region in which the HVN is created. While HCP Vault is already a redundant cluster of three nodes split across availability zones, it does not support geographically redundant (multi-region) clusters in Active-Active or Active-Standby mode. As noted above, nodes are dispersed across multiple AZs to tolerate AZ failures.
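Since an HVN is defined by a delegated IPv4 CIDR range, it can be useful to sanity-check a candidate range before creating the network. The sketch below uses Python's standard `ipaddress` module. The function name is hypothetical, and the two checks it applies (a private address range, and no overlap with networks you already use) are assumptions about sensible planning, not requirements stated on this page.

```python
import ipaddress


def check_hvn_cidr(hvn_cidr: str, existing_cidrs: list[str]) -> list[str]:
    """Return a list of detected problems; an empty list means none were found."""
    problems = []
    # strict=True rejects inputs with host bits set, e.g. "10.0.0.1/16".
    hvn = ipaddress.ip_network(hvn_cidr, strict=True)
    if hvn.version != 4:
        problems.append(f"{hvn} is not an IPv4 range")
        return problems
    # is_private covers the RFC 1918 ranges plus other reserved blocks.
    if not hvn.is_private:
        problems.append(f"{hvn} is not a private range")
    for cidr in existing_cidrs:
        other = ipaddress.ip_network(cidr, strict=True)
        if other.version == 4 and hvn.overlaps(other):
            problems.append(f"{hvn} overlaps existing network {other}")
    return problems


if __name__ == "__main__":
    # Hypothetical values: one candidate HVN range and one existing network.
    print(check_hvn_cidr("172.25.16.0/20", ["10.0.0.0/16"]))
    print(check_hvn_cidr("10.0.0.0/20", ["10.0.0.0/16"]))
```

Checking for overlap up front matters mainly if the HVN will later be connected to networks you already operate, which is an assumption about how the range will be used.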
»Cluster Deletion and Snapshot Restore Limitations
Once a cluster is deleted, all affiliated resources (including snapshots and audit logs) are deleted with it. Therefore, snapshots cannot be used to recover a deleted cluster. Snapshots can also only be used within the cluster they are created in, which means that snapshots cannot be restored in another region.
»Support Coverage
Clusters are monitored 24/7, with on-call staff available to debug production cluster issues. All production-grade clusters are coupled with either Silver or Gold level support.
»Incident Response
In the event of a critical incident, incident response times are stipulated in the Support agreement of the SLA. HashiCorp will use commercially reasonable efforts to maximize the availability of HashiCorp Cloud services, and provide uptime guarantees based on service level agreements (SLA). Audit logs include key metrics that capture activity and performance. You can view a full list of metrics here. Platform events are reported in audit logs within a minute of the event, and can be viewed for up to 365 days. If a cluster is deleted, its audit logs are deleted with it.
»Disaster Recovery Use Cases Not in Scope
While HCP Vault covers a large amount of DR functionality through HA, there are a few use cases that are not currently supported. These include:
- No cross-region failover or cross-region disaster replication. Production-grade clusters are isolated to three nodes within the same region. If a cluster goes down in one region, it cannot be restored to another region. DR secondaries and DR replication are not available at this time.
- No snapshot restoration is possible after a cluster has been deleted. Once a cluster is deleted, all snapshots and audit logs are removed with it.
Organizations that select HCP Vault entrust our expert teams to manage areas of technical operations, including disaster recovery, monitoring, and upgrades. Based on our experience supporting thousands of commercial Vault clusters, HCP Vault brings this expertise directly to users and reduces the manual overhead required to successfully operate Vault. We will continue to invest in HCP Vault to cover more DR scenarios. HCP uptime and incidents can be viewed at https://status.hashicorp.com/.