ADR-0009: Architecture for pre-populating the file system of scanners


Status:	ACCEPTED
Date:	2021-10-07
Author(s):	Max Maass max.maass@iteratec.com, Jannik Hollenbach jannick.hollenbach@iteratec.com

Context

SecureCodeBox should be able to run scanners that require access to the source code of the target. This could include secret scanners like Gitleaks or SAST tools like Semgrep. To achieve this and expose the feature to the user, we need to extend the syntax of the scan definitions and implement the pre-loading functionality.

The current scanners do not support dynamically loading data from a Git repository (except Gitleaks) or other sources. Thus, we need a solution for retrieving data from a source and exposing it to the scanner, without modifying the scanners or their images. In our view, the best way to achieve this is using Kubernetes init containers. They can be used to retrieve data and place it in a volume they share with the scanner, and they can be trivially attached to any existing scanner container using built-in features of Kubernetes.

Question 1: Implementing Different Pre-Population Data Sources

To be useful, the feature will need to support a number of different data sources. At a minimum, Git repositories and file downloads via curl should be supported, but further data sources may be desireable (e.g., SVN, curl+unzip, en- or decryption of data using openssl CLI, ...). Thus, the extensibility of the pre-populator architecture should be considered.

Option 1: Modular design using CRDs

For scanners, secureCodeBox uses a modular architecture where additional scanners can be added without modifying the controller. This allows people without knowledge of Go to contribute new scanners without worrying about implementation details of the operator, at the price of having to install the scanners individually via helm.

The pre-populators could be designed in a similar way: A general syntax in the ScanSpec defines which downloader should be started, and parameterizes it using arguments, similar to how the scanners are parameterized. It would use a general-purpose syntax (with the same YAML structure for all downloaders), which can be directly transformed into a Kubernetes specification, like it is done for the scan jobs. Power users can define their own pre-populator jobs, which should be a lot simpler than defining a scanner, as it does not require a parser, sidecar, etc.

A configuration may look something like this, using a fictional semgrep scanner job (disregard the mountPoint for now, it will be discussed later in the ADR):

apiVersion: "execution.securecodebox.io/v1"
kind: Scan
metadata:
  name: "semgrep-juiceshop"
spec:
  scanType: "semgrep"
  parameters:
    - "--config"
    - "p/ci"
    - "--config"
    - "/etc/rules/semgrep-rules.yaml"
    - "/etc/repo"
  prepopulateFilesystemFrom:
    - mountPoint: /etc/repo
      prepopulateType: "git"
      parameters:
        - "--recursive"
        - "https://github.com/bkimminich/juice-shop.git"
        - "/etc/repo/juice-shop/"
    - mountPoint: /etc/rules
      prepopulateType: "curl"
      parameters:
        - "https://example.com/semgrep-rules.yaml"
        - "--output"
        - "/etc/rules/semgrep-rules.yaml"

Advantages

Modular and extensible
No task-specific business logic inside the controller
Easy to also support the feature for hooks
User can specify the exact options they want the tool to use (e.g., git clone with or without submodules, curl with specific authentication headers, ...)
Consistent architecture with how the scanners are designed

Disadvantages

Integration into the controller would require changes in several places (@J12934: Can you elaborate a bit on that?)
Each pre-populator would have to be installed separately using helm
Building a "combination" pre-loader like "curl+unzip" may require wrapper scripts, or splitting it into two preloaders (one that downloads and one that unzips).

Option 2: Implementation Inside the Operator

If we want to avoid adding another CRD and dealing with the overhead of managing the scanners that way (including the overhead of installing the preloaders etc.), the feature can also be implemented directly inside the operator. For this, the ScanSpec could be extended with specific fields for the different pre-populators (i.e., the Git downloader could explicitly ask for the repository, access key, etc., while the curl downloader would have other flags and options). The controller would then translate the spec into a Kubernetes job with the correct parameters and add it as an init container. Optionally, we could also let users specify their own custom jobs that use a container and script of their choice. This would allow them to run special scripts that are specific to their deployment and that would not be useful to send upstream to the project.

An example configuration could look like this:

apiVersion: "execution.securecodebox.io/v1"
kind: Scan
metadata:
  name: "semgrep-juiceshop"
spec:
  scanType: "semgrep"
  parameters:
    - "--config"
    - "p/ci"
    - "--config"
    - "/etc/rules/semgrep-rules.yaml"
    - "/etc/repo"
  prepopulateFilesystemFrom:
    - mountPoint: /etc/repo
      git:
        url: "https://github.com/bkimminich/juice-shop.git"
        accessToken: "ABC123..."
        recurseSubmodules: true
    - mountPoint: /etc/rules
      http:
        url: "https://example.com/semgrep-rules.yaml"
    - mountPoint: /etc/repo
      custom:
        image: "myuser/mypreparationimage"
        parameters:
          - "my"
          - "custom"
          - "parameters"

Advantages

No need for a new CRD or separate helm installs for all pre-populators
More beginner-friendly way of specifying what the pre-populator jobs should do
More approachable method for running custom pre-populator scripts (no need to place a number of YAML files in the right places inside the secureCodeBox directory structure)

Disadvantages

Requires implementing core functionality of the feature inside the controller instead of the YAML files
Experts may prefer to parameterize their Git and curl jobs themselves instead of relying on our flags
Much higher barrier of entry for contributors that want to send new pre-populator jobs upstream (requires knowledge of the controller)
Allowing custom pre-populator images is a security concern for the infrastructure in a scenario where secureCodeBox is hosted by one company and used by another (i.e., the SaaS model).

Option 3: Expose the initContainer Config in the ScanSpec CRD

Another solution would be to simply allow the user to specify their own initContainer(s) in the ScanSpec CRD. This would allow them to create their own, arbitrary pre-populators, completely independently of how we support the feature. There is precedent for this as well: Scanners can define init containers in their values.yaml files. The documentation could provide configuration snippets for common tasks as a baseline for users to base their own configurations on.

An example config may look like this:

apiVersion: "execution.securecodebox.io/v1"
kind: Scan
metadata:
  name: "semgrep-juiceshop"
spec:
  scanType: "semgrep"
  parameters:
    - "--config"
    - "p/ci"
    - "--config"
    - "/etc/rules/semgrep-rules.yaml"
    - "/etc/repo"
  volumes:
    - name: repo
      emptyDir: {}
    - name: rules
      emptyDir: {}
  volumeMounts:
    - name: repo
      mountPath: "/etc/repo"
    - name: rules
      mountPath: "/etc/rules"
  initContainers:
    - name: git-clone
      image: bitnami/git
      command:
        - git
        - clone
        - "--recursive"
        - https://github.com/bkimminich/juice-shop.git
        - /etc/repo
      volumeMounts:
        - name: repo
          mountPath: "/etc/repo"
    - name: curl-rules
      image: busybox
      command:
        - curl
        - "--output"
        - "/etc/rules/semgrep-rules.yaml"
        - "https://example.com/semgrep-rules.yaml"
      volumeMounts:
        - name: rules
          mountPath: "/etc/rules"

Advantages

Maximum flexibility: Users can do anything that can be done with init containers
No need for us to implement any feature-specific code, aside from one change to the ScanJob CRD

Disadvantages

Assumes the user is familiar with how to use init containers (although we can of course provide example configurations in the documentation)
In a scenario where secureCodeBox is hosted by one company and used by another (i.e., the SaaS model), letting users specify their own init containers is a security concern for the infrastructure

Question 2: Specifying Persisted Folders

Transferring data from the init container to the scanner requires using volumes that are shared between the init container(s) and the scan job. These volumes can be defined in multiple ways.

Option 1: User-managed Kubernetes Volumes

The easiest solution is to allow the user to specify the volumes and volumeMounts themselves. This is already supported by the ScanSpec CRD, so it would just need to be added to the pre-populator CRD as well. Then, the user can specify exactly which container should see which volume under which path in the file system. This would be the obvious solution if we choose option 3 for question 1 (fully user-specified init containers), but can also work for option 1, and technically also with option 2, although it would likely not be a good fit.

Advantages

Full control for the user
No need to add any custom code to manage volumes
Validation is handled by Kubernetes

Disadvantages

Requires the user to know how to create and handle volumes (although we can provide code snippets in the documentation)
In a scenario where secureCodeBox is hosted by one company and used by another (i.e., the SaaS model), letting users specify their own volume mounts is a security concern for the infrastructure (Denial of service through resource consumption, mounting host directories, ...). However, this is already possible right now.

Option 2: Limited User-Managed Volumes

A variation on option 1 would be to allow the user to specify the names of volumes, but not the type (i.e., we force it to be one of an allow-listed set of types: emptyDir, secret, ...). This would require slightly more validation code, but limit the attack surface from the volume mounts. However, this would likely be a separate ADR, as it would also impose new limitations on scans that do not use the pre-populate filesystem / init container feature. I will thus not consider it in greater detail here.

Option 3: secureCodeBox-Managed volumes

Finally, we can also abstract away the details of the Kubernetes volume definitions and simply have the user specify which folders they want to modify with their pre-populate job. An example of such a configuration is given in Option 1 and 2 of Question 1. In this case, the controller would be responsible for determining which volumes are needed, and creating and mounting them to the right containers. While this may seem a deceptively simple concepts, it hides a lot of complexity.

Advantages

Does not expose volume definitions to the user, thereby decreasing the attack surface towards malicious users in a SaaS setting
Does not require the user to know how to define and mount Kubernetes volumes

Disadvantages

Requires a lot of edge case handling for the controller (what if one pre-populator wants to access /etc/repo and another /etc/repo/some/folder? Do both get a mount for /etc/repo? What if the folder /etc/repo/some/folder does not exist when the relevant job is started?)
Has a high capacity for causing unintuitive and hard-to-debug issues for the user, who cannot benefit from Kubernetes documentation on volumes to understand and resolve them
Reduces flexibility for power users

Decision

We decided to utilize the built-in initContainer functionality of Kubernetes and expose it through the ScanTypes, combined with the existing syntax for volume management. Information on how to use the feature will be added to the documentation to explain how common operations (Git clone, etc.) can be performed, for people who do not have experience with Kubernetes init containers.

Consequences

The possibility of using init containers adds a large number of new possible features to the secureCodeBox, and lays the groundwork for adding new types of scanners. We will gather experience with how the feature is used, and may come back to the architecture and make changes if a simpler interface is desired.

Context​

Question 1: Implementing Different Pre-Population Data Sources​

Option 1: Modular design using CRDs​

Advantages​

Disadvantages​

Option 2: Implementation Inside the Operator​

Advantages​

Disadvantages​

Option 3: Expose the initContainer Config in the ScanSpec CRD​

Advantages​

Disadvantages​

Question 2: Specifying Persisted Folders​

Option 1: User-managed Kubernetes Volumes​

Advantages​

Disadvantages​

Option 2: Limited User-Managed Volumes​

Option 3: secureCodeBox-Managed volumes​

Advantages​

Disadvantages​

Decision​

Consequences​

Context

Question 1: Implementing Different Pre-Population Data Sources

Option 1: Modular design using CRDs

Advantages

Disadvantages

Option 2: Implementation Inside the Operator

Advantages

Disadvantages

Option 3: Expose the initContainer Config in the ScanSpec CRD

Advantages

Disadvantages

Question 2: Specifying Persisted Folders

Option 1: User-managed Kubernetes Volumes

Advantages

Disadvantages

Option 2: Limited User-Managed Volumes

Option 3: secureCodeBox-Managed volumes

Advantages

Disadvantages

Decision

Consequences