Skip to main content

ADR-0010: Custom Inheritance Behavior for Affinity and Tolerations

Status:ACCEPTED
Date:2021-11-22
Author(s):Max Maass max.maass@iteratec.com

Context

Kubernetes-based cloud environments allow controlling which nodes are used for which workloads through two mechanisms: Affinity and Tolerations. These can be selected by setting them in the Kubernetes job.

Problem 1: Sources of Configuration

The secureCodeBox operator is managing Kubernetes jobs for us, and templating their values from the provided Scan (for a concrete scan), ScanType (for a type of scan, e.g. nmap), ParseDefinition (for the parser related to a scan), or ScanCompletionHook (for a hook) into the job itself. Since they each pull information from different places, supporting affinity and tolerations for all of them requires adding them in multiple locations. Generally, such settings are pulled into the job from one of two (or in some cases three) places:

  • The helm values (configured via the values.yaml during install of the ScanType, ParseDefinition or ScanCompletionHook)
  • The Scan specification for the running scan
  • The CascadingScan specification when creating a cascaded scan (or, more precisely: both the Scan spec of the parent scan and whatever information is given in the ScanSpec of the CascadingScan)

Specified as a table, this is where values for the different jobs scheduled by the operator normally come from:

Job typeHelm valuesScan.spec CascadingScan.spec.scanSpec
Scan ✅
Parse 
Hook 

This presents us with a problem: All three job types should be configurable with an affinity and tolerations, but two of them only read their relevant configuration from the helm values (provided during install). This makes it impossible to ensure that all jobs triggered by a single Scan use a specific affinity or toleration (which may be different from the default). We can address this issue in two ways:

Option 1: Accept And Move On

One option is to accept that this is the case and leave it unchanged. This is unsatifactory, as affinity and tolerations are a powerful tool, not only for controlling the cost of cloud deployments (by using cheaper node types, like preemptible nodes), but also for other aspects like controlling the geographic location of nodes, the presence of special node features, and other aspects. In some cases, there may be no one valid default value for a single ScanType that is correct for all Scans using it. Additionally, for other features that can have defaults set by the helm values, it is possible to add to these defaults using fields in the Scan definition.

Option 2: Use Affinity and Tolerations from Scan in all Jobs

The other option is to deviate from the usual method of setting values for the jobs by making all three types of jobs (scans, parsers and hooks) use the affinity and tolerations defined in the Scan (if it defines them), and fall back to the defaults from the Helm values otherwise. This allows the user to specify affinity and tolerations in one place (the Scan), and be confident that any jobs started by the Scan will use the same affinity and toleration settings. The downside is that now, values set in the Scan will influence the execution of parsers and even hooks, which is different from the behavior of the system in other places, where settings on a Scan will not impact the hooks and parsers.

To summarize, this would make the table look as follows:

Job typeHelm valuesScan.spec CascadingScan.spec.scanSpec
Scan ✅
Parse 
Hook 

(There are no checkmarks on the CascadingScan.spec.scanSpec column for parser and hook because it is merged into the Scan.spec when creating the cascaded scan. From there, it will influence the parser and hook the same way it would if it had been directly added to the parent Scan.)

Problem 2: Merging vs. Replacing Defaults

Normally, defaults set in the helm values are merged with any additional values provided in the Scan, and the same merging behavior governs combining the Scan.spec of the triggering scan with the CascadingScan.spec.scanSpec of a cascading scan (assuming inheritance is enabled). However, since the affinity is defined as a deeply nested dictionary, merging is both technically challenging and may lead to unexpected results. In the worst case, it can lead to an invalid configuration, or one that is impossible to schedule because of conflicting requirements. It is thus advisable to replace any default affinity with one that is specified in the Scan.spec (for Helm values) or the CascadingScan.spec.scanSpec (for cascaded scans), instead of attempting to merge them. However, this raises the question of how to handle tolerations.

Option 1: Consistency With Other Values

One option is to have it behave in the same way it is done for labels, environmental variables, etc.: merging the default with the values provided in the Scan. The downside of this approach is that it is purely "additive": It is never possible to create a set of tolerations that does not include the default. Additionally, it is inconsistent with the behavior of the affinity, which is the most closely related feature.

Option 2: Consistency with Affinity

The alternative is to replace instead of merge for the tolerations as well. This is inconsistent with the other values, but it ensures that affinity and tolerations behave the same.

Decision

For the first problem, we choose to use Option 2: Affinity and tolerations defined in the Scan will be used by all jobs related to this scan. They will also be inherited by default, although a special cascading scan flag can be used to disable this (inheritAffinity / inheritTolerations, as per the standard naming scheme). If no values are defined in the Scan, the default values from the Helm install are used.

For the second problem, we choose Option 2 as well: Both affinity and tolerations will have the more specific value (Scan.spec for scans, CascadingScan.spec.scanSpec for cascaded scans) replace the more general value (Helm values for scans, Scan.spec of the parent scan for cascaded scans), assuming the more specific value is set. If the value is not set (i.e., the key is omitted in the configuration), the more general value is used. If the more specific key is set to an empty value ([] for tolerations, {} for affinity), it will still replace the more general value and thus remove any affinity or toleration that was previously configured.

Consequences

This decision leads to a system that is less concerned with consistency in behavior between unrelated features (e.g., tolerations and environmental variables), and more interested in consistency between related features (affinity and tolerations) and convenience for the operator (changing the scheduling behavior of parsers and hooks from the scan definition).