Fail Open Mitigation
Kubernetes Validating Webhooks that apply to all resources should not fail closed. Instead, Styra recommends that users run the Validating Webhook in fail open mode, and use the Styra DAS monitoring features to discover resources that have been admitted to the cluster due to a webhook failure.
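In Kubernetes, fail open behavior is controlled by the failurePolicy field of each webhook in a ValidatingWebhookConfiguration: Ignore admits the request when the webhook call fails, while Fail rejects it. The following Python sketch checks a configuration that has already been loaded as a dictionary (for example from the output of kubectl get validatingwebhookconfiguration <name> -o json) for webhooks that would fail closed. The function name and the sample webhook names are hypothetical and only illustrate the check.

```python
def fail_closed_webhooks(config):
    """Return the names of webhooks in a ValidatingWebhookConfiguration that fail closed."""
    closed = []
    for webhook in config.get("webhooks", []):
        # In admissionregistration.k8s.io/v1 an omitted failurePolicy defaults to "Fail".
        policy = webhook.get("failurePolicy", "Fail")
        if policy != "Ignore":
            closed.append(webhook.get("name", "<unnamed>"))
    return closed


# Hypothetical configuration used only to illustrate the check.
sample = {
    "apiVersion": "admissionregistration.k8s.io/v1",
    "kind": "ValidatingWebhookConfiguration",
    "webhooks": [
        {"name": "validating.example.com", "failurePolicy": "Ignore"},
        {"name": "strict.example.com", "failurePolicy": "Fail"},
        {"name": "defaulted.example.com"},
    ],
}

print(fail_closed_webhooks(sample))
```

An empty result means every webhook in the configuration fails open.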
Collect the Relevant Metrics
The first step is to set up automated collection of metrics from Styra DAS and use these to trigger alerts. Set up Prometheus monitoring as described in the metrics documentation.
Track the violations metric from <das-id>.styra.com/v1/timeseries/metrics. This metric counts the number of objects on your Kubernetes cluster that violate an enforce or monitor rule, that is, the number of resources on your cluster that violate your policy. It includes labels that indicate the system ID and system name it applies to.
These resources may have been admitted to the cluster before the installation of Styra DAS, or they may have been admitted due to a webhook failure. To alert on this metric, you first need to bring it to zero. This may require cleaning up resources that violate your policy or modifying the policy so that those resources no longer violate it.
Once the violations metric is zero, set up an alert to trigger anytime this metric is greater than zero.
At that point, the administrator should view the Compliance tab for the offending system, which will show the full JSON specifications of all resources on the cluster that violate your policy. The compliance data for a Kubernetes system is available at the endpoint <das-id>.styra.com/v1/data/systems/${SystemID}/policy/com.styra.kubernetes.validating/monitor.
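The alert condition described above (violations greater than zero) can be sketched in Python as follows. This is a minimal illustration, assuming the metrics are scraped in the Prometheus text exposition format and that the labels are named system_id and system_name; the label names, the function name, and the sample values are assumptions, not the documented Styra DAS output. In practice the same condition would normally be expressed as a Prometheus alerting rule.

```python
import re

# Matches one sample line: metric name, optional {labels}, then the value.
_SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
_LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def systems_with_violations(metrics_text, metric_name="violations"):
    """Return {system: count} for every system whose metric value is above zero."""
    offenders = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE_RE.match(line)
        if match is None or match.group("name") != metric_name:
            continue
        value = float(match.group("value"))
        if value > 0:
            labels = dict(_LABEL_RE.findall(match.group("labels") or ""))
            # Label names are assumed; fall back to a placeholder if absent.
            key = labels.get("system_name") or labels.get("system_id") or "<unknown>"
            offenders[key] = offenders.get(key, 0.0) + value
    return offenders


# Hypothetical scrape output used only to illustrate the check.
sample_metrics = """\
# TYPE violations gauge
violations{system_id="abc123",system_name="prod"} 2
violations{system_id="def456",system_name="staging"} 0
errors{system_id="abc123",system_name="prod"} 1
"""

print(systems_with_violations(sample_metrics))
```

A non-empty result identifies the systems whose Compliance tab should be reviewed.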
Policy Considerations
Styra DAS supports two types of policy violations:
- enforce: When a rule is in enforce mode, it does not admit violating resources to the cluster.
- monitor: When a rule is in monitor mode, it admits the resource, but reports it to the violations metric and it will appear in the Compliance tab.
Because monitor rules count as violations, they must not be used in policies where fail open mitigation is needed. Instead, place all rules in enforce mode, so that every violation on the cluster is a resource that should not have been admitted.
Scenarios
This section describes the possible scenarios where OPA will not be able to respond to a request, causing the webhook to fail open.
OPA Cannot Pull Policies From Styra DAS
Description
OPA periodically pulls policy bundles from Styra DAS and this can fail if OPA cannot compile the bundle.
Detection
Each system has an errors metric, available at <das-id>.styra.com/v1/systems/metrics, that indicates operational errors. This count will be greater than zero if an OPA has not updated its bundle within five minutes.
The most likely source of this error is that the policy on Styra DAS is incompatible with the OPA version.
Recovery
- Verify that you can download the policy bundle at <das-id>.styra.com/v1/bundles/systems/${SystemID}.
- Once this bundle is downloaded, attempt to open it with OPA using the command opa run ${bundle-filename.tgz}.
- If the bundle does not compile, try it with the same OPA version that is deployed on Styra DAS. This can be found at <das-id>.styra.com/v1/system/version.
- Upgrade the OPAs on the cluster. This can be done by changing the container image or by re-running the system's install command.
- Verify that the system is no longer reporting decisions with error status by examining new decisions in the system's decision log.
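Before opening a downloaded bundle with opa run, it can help to confirm that the file is a readable archive rather than, say, an error page saved to disk. OPA bundles are gzipped tarballs, so the Python sketch below lists the files inside one. It does not compile the policy and is not a substitute for the opa run step; the helper names and the sample bundle contents are hypothetical, and the sample is built in memory so the sketch runs without network access.

```python
import io
import tarfile


def list_bundle_files(bundle_bytes):
    """Return the sorted file names inside an OPA bundle (a gzipped tarball)."""
    with tarfile.open(fileobj=io.BytesIO(bundle_bytes), mode="r:gz") as archive:
        return sorted(m.name for m in archive.getmembers() if m.isfile())


def build_sample_bundle():
    """Build a tiny in-memory bundle; a stand-in for a real download."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        contents = {
            "policy/example.rego": b"package example\n",
            "data.json": b"{}",
        }
        for name, content in contents.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(content)
            archive.addfile(info, io.BytesIO(content))
    return buffer.getvalue()


files = list_bundle_files(build_sample_bundle())
rego_files = [name for name in files if name.endswith(".rego")]
print(files)
print(rego_files)
```

For a real download, read the file in binary mode (for example open(path, "rb").read()) and pass the bytes to list_bundle_files. A bundle with no .rego files, or one that cannot be opened at all, points to a problem with the download rather than with policy compilation.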
Network Partition
Description
The API server cannot reach OPA because of a networking issue.
Detection
The Kubernetes API Server logs validating webhook errors.
Recovery
- Remove the validating webhook configured to call out to OPA.
- Fix the network partition. Note that once the validating webhook is removed, Kubernetes network providers will be able to recover on their own as long as there is no physical networking issue.
- Restart the pods from the Styra DAS timeseries deployment. This step is optional; it resets the timeseries metrics and causes them to be recomputed.
- Check the Styra DAS compliance dashboard to find any resources on the cluster that violate your policy, and manually remediate these violations.
- Re-install OPA on your cluster using the appropriate system's install command.
OPA Is Not Responsive
Description
OPA receives requests but is unable to respond to them. There are several potential root causes for this, including:
- OPA is in a crash loop.
- OPA liveness checks are failing, causing Kubernetes to kill the OPA pod.
Detection
The Kubernetes pod restart count for OPA will be greater than zero and no OPA pods will be in a running state.
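The detection condition above can be sketched in Python against the JSON that kubectl get pods -o json returns for the OPA pods. This is a minimal sketch under one stated assumption: a pod is treated as running only when its phase is Running and every container reports a running state, because a pod in a crash loop can keep the Running phase while its container is waiting. The function names and the sample pod list are hypothetical.

```python
def pod_is_running(pod):
    """True when the pod phase is Running and all its containers are running."""
    status = pod.get("status", {})
    containers = status.get("containerStatuses", [])
    return (
        status.get("phase") == "Running"
        and bool(containers)
        and all("running" in c.get("state", {}) for c in containers)
    )


def opa_unresponsive(pod_list):
    """True when restarts have occurred and no pod is currently running."""
    items = pod_list.get("items", [])
    restarts = sum(
        c.get("restartCount", 0)
        for pod in items
        for c in pod.get("status", {}).get("containerStatuses", [])
    )
    running = sum(1 for pod in items if pod_is_running(pod))
    return restarts > 0 and running == 0


# Hypothetical pod lists used only to illustrate the check.
crash_looping = {"items": [{"status": {
    "phase": "Running",
    "containerStatuses": [{
        "name": "opa",
        "restartCount": 7,
        "state": {"waiting": {"reason": "CrashLoopBackOff"}},
    }],
}}]}
healthy = {"items": [{"status": {
    "phase": "Running",
    "containerStatuses": [{"name": "opa", "restartCount": 0, "state": {"running": {}}}],
}}]}

print(opa_unresponsive(crash_looping), opa_unresponsive(healthy))
```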
Recovery
- Determine the root cause of the failure. Possibilities include:
  a. The liveness check timeout is too short.
  b. OPA is CPU constrained and cannot respond to requests in time.
  c. There is a bug in OPA that causes it to repeatedly panic.
- For causes (a) and (b), increase the amount of CPU available to OPA by increasing the Kubernetes deployment's resource limits. For cause (c), downgrade or upgrade to a different version of OPA.
OPA Policy Evaluation Takes Too Long
Description
It is possible to author very inefficient policies, causing OPA to use excessive CPU and not respond to requests within the validating webhook's timeout. Another cause is the use of the http.send OPA built-in.
Detection
The errors metric, available at <das-id>.styra.com/v1/timeseries/metrics, indicates the number of decisions for which OPA returned an error instead of a response. This metric will be greater than zero if OPA evaluation times out. Additionally, you may see CPU throttling in the Kubernetes metrics.
Recovery
- Revert the policy to a known good state.
- Manually inspect the slow policy to look for iterations over very large datasets. This can be tricky and may require Styra's support.
- If the policy uses the http.send built-in, the responsiveness of the endpoint invoked may need to be improved. Alternatively, it may be possible to load the data into OPA so that it does not require an HTTP call.
- Publish the revised policy on Styra DAS.