Limitations with current database operators in Kubernetes: Introduction (Part 1).

Kubernetes is the go-to solution for orchestrating containers. Although there is a thriving ecosystem for stateless and user applications, organizations still face challenges and limitations when running the data layer.

Today, Kubernetes is a key infrastructure component for many organizations. What started as a standalone container orchestrator has evolved into a powerful management framework that spans multiple stages of the application lifecycle, not just deployment.

Although the ecosystem for stateless and user applications keeps expanding, there are still limitations and concerns when working on the data layer (queues, databases, schedulers…).

Natively, Kubernetes includes two main resources to handle stateful applications: Persistent Volume Claims (PVC) and StatefulSets. However, these static definitions are not enough to describe the complex logic of the database lifecycle. Instead, developers rely on Kubernetes operators to deploy databases.

Currently, there are more than 30 database operators listed on OperatorHub.io, many of them developed by the creators of the databases themselves. Nevertheless, developers will not find the same streamlined experience with databases as with stateless applications.

Current landscape

There are many operations involved in the database lifecycle. The job does not end after the initial deployment (day 1); most of the work happens afterwards, during routine maintenance (day 2). These are some of the common tasks:

  • Provision: Install required software.
  • Discovery: Connect nodes in the cluster.
  • Distribution: Make the cluster available to the end user.
  • Scale: Horizontal or vertical scaling of the resources.
  • Failure recovery: Handle node failures and restarts.
  • Monitoring: Retrieve metrics, insights and workload analysis.
  • Security: In-transit and at-rest encryption.

Because we are on Kubernetes, there is already native support for some of these tasks, especially the ones that handle the lifecycle of the nodes in the cluster. For instance, containers provision resources, Service endpoints provide access and dynamically load balance user requests, and StatefulSets manage ordered updates, scaling and failure recovery.
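As a rough illustration of the ordering that StatefulSets provide: with the default pod management policy, pods are named with ordinals (name-0, name-1, and so on), created in increasing ordinal order and deleted in decreasing order. The function below is only a toy model of that behaviour, not Kubernetes code; `scale_plan` and its parameters are hypothetical names chosen for this sketch.

```python
def scale_plan(name, current, desired):
    """Return the ordered (action, pod) steps to go from `current` to `desired` replicas.

    Toy model of StatefulSet ordering; `name` is an assumed StatefulSet name.
    """
    if current < 0 or desired < 0:
        raise ValueError("replica counts must be non-negative")
    if desired >= current:
        # Scale up: create pods in increasing ordinal order.
        return [("create", f"{name}-{i}") for i in range(current, desired)]
    # Scale down: delete pods in decreasing ordinal order.
    return [("delete", f"{name}-{i}") for i in range(current - 1, desired - 1, -1)]
```

Calling `scale_plan("db", 3, 1)` yields the deletion of `db-2` followed by `db-1`, which is why a database operator can usually predict which member will leave the cluster next.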

For other tasks that are more workflow oriented, like monitoring and security, there are many open cloud native projects that integrate with Kubernetes, such as Prometheus to collect metrics and Istio or Linkerd to provide service mesh encryption with certificate rotation.

Now, on top of these building blocks provided by Kubernetes, each operator needs to implement the custom logic that bridges the gap between the imperative API of the database and the declarative API of Kubernetes. The most critical part relates to cluster coordination actions every time a node is added to, removed from or updated in the set. For example, migrating data off an Elasticsearch node before it is removed from the cluster, or updating the topology configuration file on each ClickHouse server when a new shard is added.
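The gap between a declarative spec and an imperative database API can be illustrated with a minimal reconcile sketch. Everything here is hypothetical: `FakeCluster`, `drain` and `reconcile` are illustrative names, not the API of any real operator or of Elasticsearch, and the in-memory cluster stands in for a real database. The point is only the ordering: when the desired node count drops, data is moved off a node before that node is removed.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCluster:
    """In-memory stand-in for a database cluster (hypothetical, for illustration)."""
    # Maps node name -> list of shard ids stored on that node.
    nodes: dict = field(default_factory=dict)
    # Records the imperative calls issued, in order.
    log: list = field(default_factory=list)

    def add_node(self, name):
        self.nodes[name] = []
        self.log.append(f"add {name}")

    def drain(self, name):
        """Move every shard off `name` onto the remaining nodes, round-robin."""
        targets = sorted(n for n in self.nodes if n != name)
        if not targets:
            raise RuntimeError("cannot drain the last node")
        for i, shard in enumerate(self.nodes[name]):
            self.nodes[targets[i % len(targets)]].append(shard)
        self.nodes[name] = []
        self.log.append(f"drain {name}")

    def remove_node(self, name):
        if self.nodes[name]:
            raise RuntimeError(f"{name} still holds data")
        del self.nodes[name]
        self.log.append(f"remove {name}")


def reconcile(cluster, desired_replicas):
    """Converge the cluster towards the declared replica count, one step at a time."""
    if desired_replicas < 1:
        raise ValueError("at least one replica is required")
    while len(cluster.nodes) < desired_replicas:
        cluster.add_node(f"db-{len(cluster.nodes)}")
    while len(cluster.nodes) > desired_replicas:
        # Remove the highest ordinal first, and only after its data is safe.
        victim = f"db-{len(cluster.nodes) - 1}"
        cluster.drain(victim)
        cluster.remove_node(victim)
```

The user only declares `desired_replicas`; the sequence of imperative calls (add, drain, remove) is the part that every operator has to implement for its own database, which is where the differences between operators start to appear.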

Ideally, the ecosystem should aim for two objectives if databases on Kubernetes are meant to be a replacement for hosted solutions. First, the operators should cover as many features of the lifecycle as possible. Second, the user experience should be as consistent as possible across operators.

However, this is not true in practice. The current operator ecosystem is highly fragmented: there is no consistency among operators, and each one uses its own strategies and opinionated patterns to manage its database.

Despite many successful case studies, like the ones from Zalando and Shopify running databases in Kubernetes, the lack of a unified workflow makes it difficult for an organization to reliably use a database in production. It requires a knowledgeable DevOps team to understand each operator, its individual capabilities and how it fits into the current security and deployment models of the company. In this blog series we will try to explain these limitations in more detail, and how they can affect the organization.