High Availability and Disaster Recovery
This document provides guidelines on how to configure Self-Hosted Styra DAS to improve both availability and mean time to recovery (MTTR) in case of failure.
In order to maximize availability, the target Kubernetes cluster should be configured to span multiple availability zones.
The Kubernetes scheduler will attempt to spread pods across nodes by default. If stronger guarantees are required, the Kubernetes scheduler can also be configured to preferentially schedule pods in different availability zones.
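As a rough illustration of what such a scheduling preference can look like, the Python sketch below builds a generic Kubernetes topologySpreadConstraints entry keyed on the standard topology.kubernetes.io/zone node label and prints it as JSON. The function name and the app label value are hypothetical. How, or whether, such a constraint can be applied through the Styra DAS Helm Chart is not covered here, so treat this as an example of the Kubernetes feature rather than a Styra-specific setting.

```python
import json


def zone_spread_constraint(app_label: str, hard: bool = False) -> dict:
    """Build one topologySpreadConstraints entry that spreads pods across zones.

    app_label is a hypothetical value of the pods' "app" label.
    hard=False yields a preference (ScheduleAnyway); hard=True makes the
    scheduler refuse placements that would violate the skew (DoNotSchedule).
    """
    return {
        "maxSkew": 1,
        # Well-known node label that Kubernetes uses for the availability zone.
        "topologyKey": "topology.kubernetes.io/zone",
        "whenUnsatisfiable": "DoNotSchedule" if hard else "ScheduleAnyway",
        "labelSelector": {"matchLabels": {"app": app_label}},
    }


if __name__ == "__main__":
    # The fragment below would sit under a pod template's "spec" section.
    fragment = {
        "topologySpreadConstraints": [zone_spread_constraint("example-das-service")]
    }
    print(json.dumps(fragment, indent=2))
```

ScheduleAnyway matches the "preferentially" wording above: pods are still scheduled when an even spread is impossible, for example while one zone is down.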
Styra DAS Services
As of Self-Hosted Styra DAS version 0.10.0, the best recommendations for scaling Styra DAS services can be found in the Helm Chart itself.
The profile field in the values.yaml file can be set to any of the following values:
- demo: This profile sets all services to run at a lower capacity, and is intended for testing an installation without consuming too many resources. It can be useful when first configuring Self-Hosted Styra DAS, but is not recommended for production use.
- production: This profile sets all services' replicas, CPU, and memory requests and limits to more production-ready values. These settings are a best estimate for an optimal balance between availability, scalability, and resource consumption for an average installation, and may still need to be adjusted on a per-customer basis.
- custom: This profile allows the user to override all services' replicas, CPU, and memory requests and limits to custom values. These values can be set in the
The values used for the
production profiles can be found by downloading the Helm Chart and inspecting a
<SERVICE_NAME>.tpl file in
Depending on the cloud provider used, it may also be possible to configure the backing database for Styra DAS to run in multiple availability zones.
- Amazon RDS: Configuring and managing a Multi-AZ Deployment.
- GCP Cloud SQL for PostgreSQL: High Availability
Most incidents that may happen when running Styra DAS are expected to either recover automatically, or be recoverable in-place. Some examples are:
- Any Kubernetes Pods that fail should be automatically re-created by their backing Deployments.
- Any other Kubernetes objects should be re-installable by running the Helm installation command again.
- The backing data store should be recoverable from a backup, as long as the Self-Hosted Styra DAS version is the same as the one that was running when the backup was taken.
In the rare instance where an entire region becomes unavailable, however, a more manual recovery process may be needed.
Recovering from a region-wide outage is possible without data loss, but requires manual processes. Self-Hosted Styra DAS does not currently support automatic database failover.
To prepare for disaster recovery, configure Styra DAS across multiple geographically distant regions as follows:
In the secondary region, prepare a standby Kubernetes cluster on which Styra DAS can be installed if the primary cluster in the primary region fails.
In the primary region, configure the database to asynchronously replicate its data to a read-only replica in the secondary region.
In case of a disaster, execute the following:
Promote the read replica in the secondary region to a new standalone database instance.
- Amazon RDS: Promoting a read replica
- GCP Cloud SQL: Manage read replicas
Deploy Styra DAS to the secondary cluster and configure it to use the newly created database instance.
Change the DNS name for your Styra DAS installation to point to the new cluster. It is important to re-use the same DNS name as the previous installation in order to allow any currently running OPAs to gracefully cut over to the new installation.
In order to best support secondary region failover, Styra recommends using a short TTL for any DNS entry pointing to your Self-Hosted Styra DAS installation. This will reduce the mean time to fail over for any OPAs that rely on the address.
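After the DNS change, it can be useful to confirm that the name actually resolves to the new cluster before declaring the failover complete. The Python sketch below polls the local resolver until a hostname returns an expected IPv4 address or a timeout expires. The function names, the hostname das.example.com, and the address 203.0.113.10 are all hypothetical placeholders. The check only reflects what the machine running it sees, so OPAs behind other resolvers or caches may cut over at a different moment, bounded roughly by the record's TTL.

```python
import socket
import time
from typing import Callable, Set


def resolve_ipv4(hostname: str) -> Set[str]:
    """Return the IPv4 addresses the local resolver currently gives for hostname."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def wait_for_cutover(
    hostname: str,
    expected_ip: str,
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
    resolver: Callable[[str], Set[str]] = resolve_ipv4,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until hostname resolves to expected_ip; return False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if expected_ip in resolver(hostname):
                return True
        except OSError:
            # socket.gaierror is an OSError; treat a failed lookup as "not yet".
            pass
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    # Hypothetical values; replace with your installation's DNS name and
    # the address of the new cluster's ingress.
    ok = wait_for_cutover("das.example.com", "203.0.113.10", timeout_s=60.0)
    print("cutover visible" if ok else "still resolving to the old address")
```

The resolver and sleep parameters are injectable so the polling logic can be exercised without real DNS lookups or delays.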