Limitations with current database operators in Kubernetes: StatefulSet (Part 3).

The StatefulSet is a useful API for defining applications with state. However, it falls short of encoding all the complexities of databases effectively. In this blog post we explain what those limitations are and what the future of databases on Kubernetes looks like.

As briefly mentioned in Part 2, the fragmentation of the database ecosystem in Kubernetes is partly due to the use of the StatefulSet.

The StatefulSet is a workload API for stateful deployments. Pods inside the set are identified with unique, ordered names (unlike in a ReplicaSet), so that in the event of updates or failures they can be recreated with the same identity and associated resources. Independent PVCs can be associated with each node. A simple deployment looks like this:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: "nginx"
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi

Though at first glance a StatefulSet might offer useful abstractions and utilities to handle applications with state, it is far from a silver bullet. Since it only offers generic building blocks, it is hard to accommodate all the nuances and deployment strategies of each database under the StatefulSet.

Thus, each operator reimplements the same abstraction layers, patches and workarounds over and over. This makes it hard to achieve consistent, production-grade quality across the whole operator ecosystem. It is also a burden for organizations, which once again need DBA teams to bridge the gap with each technology.

In this blog post we are going to detail some of these limitations when using StatefulSets to deploy databases.

StatefulSet as an abstract black box

Like any other Kubernetes deployment, the StatefulSet is a black box with a declarative API. The user supplies a target state (a YAML file) and the internal logic does its best to reach that state. The process is largely hidden from the outside. Once started, the control flow can neither be inspected nor changed: developers can only observe state changes of the containers (i.e. created, scheduled, failed…) but never take part in the process itself.

This lack of flexibility makes it frustrating to implement the database requirements that demand custom logic inside that flow. For example, a PostgreSQL master cannot be removed during a scale-down, and a Redis node is not ready to receive queries until a join command is executed.

Kubernetes includes two main utilities to cover these types of use cases within a StatefulSet: lifecycle hooks and the OnDelete update strategy. However, both have important tradeoffs and limitations.

Hooks

Hooks are part of the Pod lifecycle and enable containers to run handlers every time a pod is created (PostStart) or deleted (PreStop). Hooks are either HTTP or Exec actions, with bash scripts being the usual execution model. PreStop hooks are synchronous and block termination until completion, while PostStart hooks run asynchronously alongside the container entrypoint.

These primitives are useful to perform join operations after a node is up or start decommission actions before a node is going down.
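In a Pod template, such hooks could look roughly like the sketch below. The script paths under /scripts are hypothetical helpers that would have to be baked into a custom image:

```yaml
# Sketch: lifecycle hooks on a database container.
# join-cluster.sh and decommission-node.sh are placeholder scripts.
containers:
- name: db
  image: redis:6.2
  lifecycle:
    postStart:          # runs asynchronously alongside the entrypoint
      exec:
        command: ["/bin/sh", "-c", "/scripts/join-cluster.sh"]
    preStop:            # synchronous: blocks termination until it exits
      exec:
        command: ["/bin/sh", "-c", "/scripts/decommission-node.sh"]
```
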

Nevertheless, you usually need custom Docker images that bundle the bash scripts (or an initContainer), which limits the use of official upstream images. A developer can no longer use up-to-date images from the database maintainers but has to wait for the operator developer to build their own image, which is slow and adds new attack surface to the threat model.

OnDelete update strategy

The OnDelete update strategy does not automatically apply spec changes to Pods; a Pod is only updated when it is removed. It is an alternative to the default RollingUpdate strategy, which automatically updates Pods in reverse ordinal order and, as seen before, might not be suitable for many databases.
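Selecting the strategy is a one-line change in the StatefulSet spec:

```yaml
spec:
  updateStrategy:
    type: OnDelete   # Pods pick up spec changes only when they are deleted
```
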

OnDelete is part of the legacy behaviour prior to Kubernetes 1.6, so its future availability is not guaranteed, and it requires significant heavy lifting on the operator side to build an extra layer that coordinates the updates.

PVCs on the StatefulSet are limited

Another important feature of the StatefulSet is that it natively handles PVC volumes. Nevertheless, that support is still far from perfect.

A Persistent Volume Claim (PVC) is a Kubernetes object for requesting external storage. Plugins (e.g. CSI drivers) expose arbitrary storage systems such as AWS EBS as storage for the Pods. A PVC can be assigned to a Pod and has its own lifecycle: whenever the PVC size is updated, the volume is expanded (if possible) and the Pod is restarted.
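For reference, a standalone PVC looks like this; gp2 is just an example of an AWS EBS storage class, and expansion only works if the class has allowVolumeExpansion enabled:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-standalone
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: gp2        # example expandable storage class
  resources:
    requests:
      storage: 10Gi            # editing this value triggers volume expansion
```
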

PVCs can also be defined directly on the StatefulSet through volumeClaimTemplates, so that an independent storage volume is created for each node in the set. It is a helpful abstraction, since per-node storage is a common requirement among stateful applications.

However, there are some life cycle features on the PVC that are not available when defined through the StatefulSet.

Resize

First, PVC volumes described in a StatefulSet's volumeClaimTemplates cannot be resized (ISSUE-68737) by changing the YAML file.

The only way to support resizing is to delete the StatefulSet without deleting its Pods (--cascade=false), apply the change directly to each PVC and let the PVC lifecycle expand the storage, and then recreate the StatefulSet with the new size. Note that this is only an accidental side effect.
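Using the example StatefulSet from above, the manual workaround is roughly the following sketch. The PVC names follow the <template>-<set>-<ordinal> convention, --cascade=orphan is the newer spelling of --cascade=false, and statefulset-2gi.yaml stands for a hypothetical updated manifest:

```shell
# 1. Delete the StatefulSet object but keep its Pods and PVCs.
kubectl delete statefulset web --cascade=orphan

# 2. Patch each PVC with the new size; the volume is expanded
#    if the storage class supports it.
kubectl patch pvc www-web-0 -p '{"spec":{"resources":{"requests":{"storage":"2Gi"}}}}'
kubectl patch pvc www-web-1 -p '{"spec":{"resources":{"requests":{"storage":"2Gi"}}}}'

# 3. Recreate the StatefulSet with the updated volumeClaimTemplates size.
kubectl apply -f statefulset-2gi.yaml
```
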

As with the OnDelete strategy above, supporting this workaround requires substantial implementation effort on the operator side. Alternatively, users can execute the commands themselves, but that is error-prone and not recommended, since the operator is the “owner” of the StatefulSet.

Because of the work required, and because it is a largely undocumented side effect, most operators do not implement this workaround and cannot resize the storage automatically. Overall, this is probably the biggest showstopper for an organization adopting databases on Kubernetes.

Decommission

Second, PVC volumes are not automatically decommissioned and removed when a node in the StatefulSet is deleted (ISSUE-55045). There is active work in progress to potentially fix this for the upcoming 1.22 release.

Again, this is not a problem of the PVC itself but of how the StatefulSet fails to interact with it. Someone has to delete the PVC manually to remove the data: either the operator or the organization via some manual workflow.
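For example, with the StatefulSet from above, scaling down leaves the PVC of the removed node behind, and it has to be cleaned up explicitly:

```shell
# Scale the set down from 2 replicas to 1; Pod web-1 is removed.
kubectl scale statefulset web --replicas=1

# Its PVC survives the scale-down and must be deleted by hand.
kubectl delete pvc www-web-1
```
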

Conclusion

Though at first glance the StatefulSet is a useful primitive for creating deployments with state, it falls short of encoding all the meaningful complexities required when running databases, whether because of the lack of flexibility in its logic or the inability to configure and manage PVCs.

As a result, it is not uncommon for operators to build similar abstraction layers again and again, apply the same patches and work out the same security and operational models. This unexpected extra burden explains why so few database operators are ready for production.

If the goal is to make databases native to Kubernetes, a paradigm shift is required. The community should aim to create a database framework (instead of relying on the StatefulSet black box) that combines stateful logic to handle PVCs and rolling updates with custom, modular support for specific databases.

Besides, the ecosystem needs more open protocols, helper libraries and standards around database deployments and workflows. Otherwise, as already happens today, each operator needs to reinvent the same stack on its own.

With those in place, deployments could share a common model, more operators would be production-ready from day one, and workflows to manage databases would be more consistent.