Fail Open Mitigation

Kubernetes Validating Webhooks that apply to all resources should not fail closed. Instead, Styra recommends that users run the Validating Webhook in fail open mode, and use the Styra DAS monitoring features to discover resources that have been admitted to the cluster due to a webhook failure.
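
Whether a webhook fails open or closed is controlled by the failurePolicy field of each entry in a ValidatingWebhookConfiguration: Ignore fails open and Fail fails closed. As a minimal sketch, the following Python function lists the webhooks that would fail closed, assuming the configuration has already been exported as JSON (for example with kubectl get validatingwebhookconfiguration -o json). The webhook names in the example are hypothetical.

```python
def fail_closed_webhooks(config):
    """Return the names of webhooks in the configuration that fail closed."""
    closed = []
    for hook in config.get("webhooks", []):
        # admissionregistration.k8s.io/v1 defaults failurePolicy to "Fail".
        if hook.get("failurePolicy", "Fail") != "Ignore":
            closed.append(hook.get("name", "<unnamed>"))
    return closed


# Hypothetical configuration, for illustration only.
example_config = {
    "kind": "ValidatingWebhookConfiguration",
    "webhooks": [
        {"name": "validating.example.com", "failurePolicy": "Ignore"},
        {"name": "strict.example.com", "failurePolicy": "Fail"},
    ],
}

print(fail_closed_webhooks(example_config))
```

An empty result suggests that every webhook in the configuration is set to fail open.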

Collect the Relevant Metrics

The first step is to set up automated collection of metrics from Styra DAS and use these to trigger alerts. Set up Prometheus monitoring as described in the metrics documentation.

Track the violations metric from <das-id>.styra.com/v1/timeseries/metrics. This metric counts the objects on your Kubernetes cluster that violate an enforce or monitor rule, and it carries labels that identify the system ID and system name it applies to.

These resources may have been admitted to the cluster before Styra DAS was installed, or they may have been admitted because of a webhook failure. Before you can alert on this metric, you need to bring it to zero, either by cleaning up the resources that violate your policy or by modifying the policy so that those resources no longer violate it.

Once the violations metric is zero, set up an alert that triggers whenever the metric is greater than zero.
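
In practice this alert would normally live in your Prometheus alerting rules. To illustrate the condition itself, here is a rough Python sketch that scans metrics text and reports the systems whose violations count is above zero. It assumes the endpoint returns the Prometheus text exposition format and that the labels are named system_id and system_name; both are assumptions, so check them against the actual output. The sample text is made up for the example.

```python
import re

# One sample line of the Prometheus text format, for example:
#   violations{system_id="abc",system_name="prod"} 3
SAMPLE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def systems_with_violations(metrics_text, metric="violations"):
    """Return {system_name: count} for every system with a count above zero."""
    firing = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE.match(line)
        if match is None or match.group("name") != metric:
            continue
        labels = dict(LABEL.findall(match.group("labels") or ""))
        value = float(match.group("value"))
        if value > 0:
            # "system_name" is an assumed label name.
            firing[labels.get("system_name", "<unknown>")] = value
    return firing


# Made-up sample input, for illustration only.
sample_metrics = """\
# TYPE violations gauge
violations{system_id="1a2b",system_name="prod"} 2
violations{system_id="3c4d",system_name="staging"} 0
"""

print(systems_with_violations(sample_metrics))
```

A non-empty result corresponds to the alert firing for the listed systems.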

When the alert fires, the administrator should view the Compliance tab for the offending system, which shows the full JSON specifications of all resources on the cluster that violate your policy. The compliance data for a Kubernetes system is available at the endpoint <das-id>.styra.com/v1/data/systems/${SystemID}/policy/com.styra.kubernetes.validating/monitor.

Policy Considerations

Styra DAS supports two types of policy violations:

  1. enforce: When a rule is in enforce mode, it does not admit violating resources to the cluster.

  2. monitor: When a rule is in monitor mode, it admits the resource but counts it in the violations metric and lists it in the Compliance tab.

Because monitor rules count as violations, they must not be used in policies where fail open mitigation is needed. Instead, place all rules in enforce mode, so that every violation on the cluster is a resource that should not have been admitted.

Scenarios

This section describes the possible scenarios where OPA will not be able to respond to a request, causing the webhook to fail open.

OPA Cannot Pull Policies From Styra DAS

Description

OPA periodically pulls policy bundles from Styra DAS. This can fail if OPA cannot compile the bundle.

Detection

Each system exposes an errors metric at <das-id>.styra.com/v1/systems/metrics that indicates operational errors. This count will be greater than zero if an OPA has not updated its bundle within five minutes.

The most likely source of this error is that the policy on Styra DAS is incompatible with the OPA version.

Recovery

  1. Verify that you can download the policy bundle at <das-id>.styra.com/v1/bundles/systems/${SystemID}.

  2. Once this bundle is downloaded, attempt to open it with OPA using the command opa run ${bundle-filename.tgz}.

  3. If the bundle does not compile, try it with the same OPA version that is deployed on Styra DAS. This can be found at <das-id>.styra.com/v1/system/version.

  4. Upgrade the OPAs on the cluster. This can be done by changing the container image or by re-running the system's install command.

  5. Verify that the system is no longer reporting decisions with error status by examining new decisions in the system's decision log.
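
Before running opa run against the downloaded file, it can help to confirm that the download really is a valid bundle and to see which policy files it contains. OPA bundles are gzipped tarballs, so the following Python sketch lists the .rego files inside one. It does not compile anything; compilation is still checked by OPA itself. The example bundle is built in memory with hypothetical file names so that the sketch runs without network access.

```python
import io
import tarfile


def list_bundle_policies(bundle_bytes):
    """Return the sorted paths of all .rego files in a gzipped tar bundle."""
    with tarfile.open(fileobj=io.BytesIO(bundle_bytes), mode="r:gz") as bundle:
        return sorted(
            member.name
            for member in bundle.getmembers()
            if member.isfile() and member.name.endswith(".rego")
        )


def make_example_bundle():
    """Build a tiny in-memory bundle (hypothetical contents) for the demo."""
    files = {"policy/rules.rego": b"package policy\n", "data.json": b"{}"}
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as bundle:
        for name, body in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(body)
            bundle.addfile(info, io.BytesIO(body))
    return buffer.getvalue()


print(list_bundle_policies(make_example_bundle()))
```

For a real bundle, read the downloaded file in binary mode and pass its bytes to list_bundle_policies. A tarfile.ReadError at this point suggests that the download is not a valid gzipped tarball, for example because an error page was saved instead.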

Network Partition

Description

The API server cannot reach OPA because of a networking issue.

Detection

The Kubernetes API Server logs validating webhook errors.

Recovery

  1. Remove the validating webhook configured to call out to OPA.

  2. Fix the network partition; note that once the validating webhook is removed, Kubernetes network providers will be able to fix themselves as long as there is not a physical networking issue.

  3. Restart the pods of the Styra DAS timeseries deployment. This step is optional; it resets the timeseries metrics and causes them to be recomputed.

  4. Check the Styra DAS compliance dashboard to find any resources on the cluster that violate your policy, and manually remediate these violations.

  5. Re-install OPA on your cluster using the appropriate system's install command.

OPA Is Not Responsive

Description

OPA receives requests but is unable to respond to them. There are several potential root causes for this, including:

  • OPA is in a crash loop.

  • OPA liveness checks are failing, causing Kubernetes to kill the OPA pod.

Detection

The Kubernetes pod restart count for OPA will be greater than zero and no OPA pods will be in running state.
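
This check can be sketched in Python against the JSON printed by kubectl get pods -o json for the OPA pods. The fields used (status.containerStatuses, restartCount, and state) are part of the Kubernetes Pod API. Because a pod's phase can remain Running while its container is crash looping, the sketch looks at container state rather than pod phase. The sample data is hypothetical.

```python
def opa_is_unresponsive(pod_list):
    """True when OPA containers have restarted and no pod is fully running."""
    restarts = 0
    running_pods = 0
    for pod in pod_list.get("items", []):
        statuses = pod.get("status", {}).get("containerStatuses", [])
        restarts += sum(s.get("restartCount", 0) for s in statuses)
        # Count a pod as running only if every container reports a running state.
        if statuses and all("running" in s.get("state", {}) for s in statuses):
            running_pods += 1
    return restarts > 0 and running_pods == 0


# Hypothetical pod lists, for illustration only.
crashing = {"items": [{"status": {"phase": "Running", "containerStatuses": [
    {"name": "opa", "restartCount": 7,
     "state": {"waiting": {"reason": "CrashLoopBackOff"}}},
]}}]}
healthy = {"items": [{"status": {"phase": "Running", "containerStatuses": [
    {"name": "opa", "restartCount": 0,
     "state": {"running": {"startedAt": "2024-01-01T00:00:00Z"}}},
]}}]}

print(opa_is_unresponsive(crashing), opa_is_unresponsive(healthy))
```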

Recovery

  1. Determine the root cause of the failure. Possibilities include:

    a. The liveness check timeout is too short.

    b. OPA is CPU constrained and cannot respond to requests in time.

    c. There is a bug in OPA that causes it to repeatedly panic.

  2. For causes (a) and (b), increase the amount of CPU available to OPA by increasing the Kubernetes deployment's resource limits. For cause (c), downgrade or upgrade to a different version of OPA.
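
One way to raise the CPU limit is to apply a patch to the OPA deployment with kubectl patch. The Python sketch below only builds the patch document. The container name opa and the limit of 1 CPU are assumptions for illustration, and the deployment name and namespace depend on how the system was installed.

```python
import json


def cpu_limit_patch(container_name, cpu_limit):
    """Build a strategic merge patch that sets one container's CPU limit."""
    patch = {
        "spec": {"template": {"spec": {"containers": [
            # Containers are merged by name, so other fields are left untouched.
            {"name": container_name, "resources": {"limits": {"cpu": cpu_limit}}},
        ]}}}
    }
    return json.dumps(patch)


# "opa" and "1" are hypothetical values; substitute your own.
print(cpu_limit_patch("opa", "1"))
```

The printed JSON can be passed to kubectl patch deployment with the --patch flag, using the name and namespace of your OPA deployment.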

OPA Policy Evaluation Takes Too Long

Description

It is possible to author very inefficient policies, causing OPA to use excessive CPU and not respond to requests within the validating webhook's timeout. Another cause is the use of the http.send OPA built-in.

Detection

The errors metric, available at <das-id>.styra.com/v1/timeseries/metrics, counts the decisions for which OPA returned an error instead of a result. This metric will be greater than zero if OPA evaluation times out. Additionally, you may see CPU throttling in the Kubernetes metrics.

Recovery

  1. Revert the policy to a known good state.

  2. Manually inspect the slow policy to look for iterations over very large datasets. This can be tricky and may require Styra's support.

  3. If the policy uses the http.send built-in, the responsiveness of the invoked endpoint may need to be improved. Alternatively, it may be possible to load the data into OPA so that the policy does not require an HTTP call.

  4. Publish the revised policy on Styra DAS.