Day 2 for the Operator Ecosystem

March 19, 2020

This blog post was originally published on Devops.Com

The original focus for Kubernetes was very much on stateless applications, which don’t rely on coordination between instances or on sharing data between client sessions. In practice, that usually means we can stop and start our stateless applications without losing data or affecting client connections.

If all applications were stateless, application deployment and management would be simple, but unfortunately that doesn’t cover every use case. The second class of applications is stateful: the application retains data or state across its entire lifecycle. While older monolithic applications such as SQL databases fit this paradigm, there are also many architectures in which the individual elements making up a service are clustered together in some way. They may share state data between them, or they may need to be kept synchronized.

One of the fundamental features of Kubernetes is that the scheduler can move pods between nodes. That works well for stateless applications but less well for stateful ones, which need their storage and network identity to stay the same during operation unless a change is handled in a very specific way. To handle some of the complexities of stateful applications, Kubernetes 1.5 added StatefulSets, which gave us stable network identities and stable storage. But this doesn’t solve all of our problems, since Kubernetes still doesn’t know what’s actually going on inside the pods.
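To make "stable network and storage" concrete, here is a minimal sketch of the identity each StatefulSet pod keeps when it is rescheduled. The function name and the example values are illustrative, and the sketch assumes the default `cluster.local` cluster domain and a volume claim template named `data`.

```python
def stateful_pod_identities(name, service, namespace, replicas,
                            claim_template="data"):
    """List the identity each StatefulSet pod keeps across rescheduling.

    Assumes the default cluster domain (cluster.local) and a single
    volume claim template; both are illustrative choices.
    """
    identities = []
    for ordinal in range(replicas):
        pod = f"{name}-{ordinal}"  # pods are named <statefulset>-<ordinal>
        identities.append({
            "pod": pod,
            # Stable DNS name provided through the governing headless service.
            "dns": f"{pod}.{service}.{namespace}.svc.cluster.local",
            # Each pod is bound to its own claim: <template>-<pod name>.
            "pvc": f"{claim_template}-{pod}",
        })
    return identities


for identity in stateful_pod_identities("cassandra", "cassandra", "default", 3):
    print(identity)
```

The names depend only on the ordinal, so a replacement pod comes back with the same hostname and the same volume. What happens inside that pod is still invisible to Kubernetes.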

Typically these kinds of applications have a set of lifecycle states that require actions to be performed in a logical order to maintain operation. Stateful applications tend to coordinate between their elements, so if we simply start and stop instances we may disrupt the correct behavior of the application: we may need to wait for data to be re-balanced, or an upgrade may involve multiple steps. The same applies to many other operations you are likely to perform, such as scaling. These actions also tend to be unique to the application, so a Cassandra cluster is managed differently from a Kafka cluster. That domain-specific knowledge of how to administer this class of applications isn’t captured in the paradigms that Kubernetes gives us out of the box, so how do we automate and package that knowledge?
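As a rough illustration of why ordering matters, the sketch below upgrades a simulated cluster one node at a time and waits for each node to finish re-balancing before touching the next. Everything here is hypothetical: the `Node` class, the tick-based "re-balancing" and the log messages stand in for whatever checks a real application would need.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    version: str
    rebalance_ticks_left: int = 0  # simulated: > 0 means data is still moving

    def is_settled(self) -> bool:
        return self.rebalance_ticks_left == 0


def rolling_upgrade(nodes, target_version, rebalance_ticks=2):
    """Upgrade nodes strictly one at a time, waiting for each to settle."""
    log = []
    for node in nodes:
        if node.version == target_version:
            continue  # already upgraded, nothing to do
        node.version = target_version
        node.rebalance_ticks_left = rebalance_ticks
        log.append(f"upgraded {node.name}")
        # Do not move on until this node has finished re-balancing.
        while not node.is_settled():
            node.rebalance_ticks_left -= 1
            log.append(f"waiting on {node.name}")
    return log


cluster = [Node("node-0", "3.11"), Node("node-1", "3.11"), Node("node-2", "3.11")]
for line in rolling_upgrade(cluster, "4.0"):
    print(line)
```

The "wait until settled" condition is the application-specific part. For one system it might mean streaming has finished, for another that replicas are back in sync.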

The solution to this problem is the Operator pattern. Operators encapsulate operational tasks in code, so we can orchestrate application lifecycle actions using Kubernetes APIs. Operators encode domain-specific knowledge about the lifecycle actions required for the application—how to scale it, how to upgrade it, how to recover from failure scenarios and so on.

When we install an Operator into our cluster, we get a controller, which is the code that manages the lifecycle of the application and where the domain-specific knowledge is encapsulated. We also get a set of custom resource definitions (CRDs), which extend the Kubernetes API with new resource types that we can then populate with data and instantiate.
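The relationship between a custom resource and its controller can be sketched as a reconcile function: it compares the desired state declared in the resource with the state observed in the cluster and returns the next action. This is a simplified model rather than a real controller. The `KafkaCluster` kind, the `example.com/v1alpha1` API group and the action names are all made up for illustration.

```python
# A hypothetical custom resource, written as the dict a client would see.
kafka = {
    "apiVersion": "example.com/v1alpha1",  # made-up API group
    "kind": "KafkaCluster",                # made-up resource type
    "metadata": {"name": "events"},
    "spec": {"replicas": 3},
}


def reconcile(resource, observed_replicas):
    """Return the single next action that moves actual state toward spec."""
    desired = resource["spec"]["replicas"]
    if observed_replicas < desired:
        # Add the broker with the next free ordinal.
        return ("add_broker", observed_replicas)
    if observed_replicas > desired:
        # Remove the highest ordinal, draining its data first.
        return ("drain_and_remove_broker", observed_replicas - 1)
    return ("noop", None)


def converge(resource, observed_replicas):
    """Call reconcile repeatedly, applying one action per pass."""
    actions = []
    while True:
        action, ordinal = reconcile(resource, observed_replicas)
        if action == "noop":
            return actions
        actions.append((action, ordinal))
        observed_replicas += 1 if action == "add_broker" else -1


print(converge(kafka, 1))
```

Taking one step per pass and then observing again is what lets a controller pick up where it left off after a restart or a failed step.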

There are now many different ways to build Operators. The first to emerge was the Operator Framework, which originated at CoreOS, but since then a whole range of alternatives has appeared, including Kubebuilder, Metacontroller and KUDO. These all take different approaches to building Operators, from SDKs requiring Go expertise and deep Kubernetes knowledge to the KUDO approach of a polymorphic Operator, which can be configured very simply by using YAML.
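To give a feel for the declarative end of that spectrum, the sketch below describes an operation as data and lets a small generic engine work out the execution order. It is only loosely inspired by the idea of configuring a generic Operator through a declarative file. The field names (`phases`, `steps`, `strategy`) and the `execution_waves` function are assumptions made for this example, not KUDO’s schema or engine.

```python
# A hypothetical "deploy" plan, expressed as data rather than code.
deploy_plan = {
    "phases": [
        {
            "name": "zookeeper",
            "strategy": "serial",
            "steps": [{"name": "install-zk"}, {"name": "wait-for-quorum"}],
        },
        {
            "name": "brokers",
            "strategy": "parallel",
            "steps": [{"name": "broker-0"}, {"name": "broker-1"}],
        },
    ]
}


def execution_waves(plan):
    """Flatten a plan into waves; steps within a wave may run together."""
    waves = []
    for phase in plan["phases"]:  # phases always run in the order listed
        steps = [step["name"] for step in phase["steps"]]
        if phase.get("strategy", "serial") == "parallel":
            waves.append(steps)                   # one wave with every step
        else:
            waves.extend([s] for s in steps)      # one wave per step
    return waves


for wave in execution_waves(deploy_plan):
    print(wave)
```

The design choice is that the engine stays the same for every application. Only the plan changes, which is why no Go code is needed to describe a new Operator in this style.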

For cluster administrators managing large clusters, this proliferation of development methodologies and management interfaces can be problematic. The applications that Operators were developed to manage are almost always part of a larger application stack, with dependencies between them. If we have a lot of Operators written in different ways running on our clusters, how can we ensure they will interoperate with each other, and how do we validate and test them?

When we depend on another service, we primarily care about two things: how to access and use it, and what it does. The KUDO project is working on the Kubernetes Operator Interface (KOI) to address this, enabling Operators to compose easily with each other based on a well-known set of CRDs. This interface is not just about connection strings and secrets, either; KOI aims to answer the question, “What does this Operator actually do?” By exporting an Operator’s behaviors as a CRD, tooling such as CLIs and GUIs can be written that better abstracts the tasks users need to perform, rather than requiring them to understand a library of custom resources.

Defining behavior is one thing; validating it is another. How do we write standardized tests that confirm an Operator actually does what it says it will do? To address this problem, the KUDO project has also developed a test framework, Kuttl, which allows Operator developers to write tests that prove correct behavior. Kuttl has wider use cases outside of the KUDO project, with the ability to test conformance for all kinds of Kubernetes resources, so it is being spun out into a separate project.
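One general way to phrase such a test is to apply a change and then assert that the resulting cluster state contains the fields we expect, ignoring everything else. The sketch below shows that partial-match idea in plain Python. It is not Kuttl’s API; the `matches` helper and the sample objects are hypothetical.

```python
def matches(expected, actual):
    """True if every field in `expected` is present and equal in `actual`.

    Extra fields in `actual` are ignored; lists must match element by element.
    """
    if isinstance(expected, dict):
        return isinstance(actual, dict) and all(
            key in actual and matches(value, actual[key])
            for key, value in expected.items()
        )
    if isinstance(expected, list):
        return (
            isinstance(actual, list)
            and len(expected) == len(actual)
            and all(matches(e, a) for e, a in zip(expected, actual))
        )
    return expected == actual


# What the test expects after the Operator has acted.
expected = {"kind": "StatefulSet", "status": {"readyReplicas": 3}}

# What a (simulated) cluster reports; it carries many more fields.
observed = {
    "kind": "StatefulSet",
    "metadata": {"name": "events", "generation": 4},
    "spec": {"replicas": 3},
    "status": {"replicas": 3, "readyReplicas": 3},
}

print(matches(expected, observed))
```

Matching on a subset keeps the test focused on behavior, here "three replicas become ready", rather than on the full shape of an object that the cluster fills in with defaults.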

These kinds of efforts around standardization, definitions and testing will enable us to build ecosystems of Operators that naturally compose with each other, no matter the implementation. Users will be able to build application stacks they can be confident will interoperate consistently, rather than a constellation of unrelated Operators strung together by assuming the right configuration values. Collaboration and cross-project working are key to the future of the Operator ecosystem, and more and more collaborative efforts are now emerging across the industry with these goals in mind.
