June 21, 2024


The Joy of Technology

Whose Cluster Is It Anyway? – Grape Up

While researching how enterprises adopt Kubernetes, we can outline a common scenario: implementing a Kubernetes cluster in a company often starts as a proof of concept. Either developers decide they want to try something new, or the CTO does some research and decides to give it a try because it sounds promising. Typically, there is no roadmap, no real plan for the next steps, and no decision to go to production.

First steps with a Kubernetes cluster in an enterprise

And then it is a huge success – a Kubernetes cluster makes managing deployments easier, it's simple to use for developers, cheaper than the previously used platform, and it just works for everyone. The security team creates the firewall rules and approves the configuration of the network overlay and load balancers. Operators create their CI/CD pipelines for cluster deployments, backups, and daily tasks. Developers rewrite configuration parsing and communication to fully utilize ConfigMaps, Secrets, and cluster-internal routing and DNS. In no time you are one click away from scrapping the existing infrastructure and moving everything to Kubernetes.
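As a sketch of what "fully utilizing ConfigMaps" can look like, the manifest below injects a ConfigMap into a Deployment as environment variables and points the application at a cluster-internal DNS name. All names here are illustrative, not taken from any particular setup:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: orders-config            # illustrative name
data:
  DATABASE_HOST: "orders-db.default.svc.cluster.local"  # cluster-internal DNS
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders
spec:
  replicas: 2
  selector:
    matchLabels:
      app: orders
  template:
    metadata:
      labels:
        app: orders
    spec:
      containers:
        - name: orders
          image: registry.example.com/orders:1.0.0   # hypothetical registry
          envFrom:
            - configMapRef:
                name: orders-config    # configuration injected as env vars
```

Changing the configuration then becomes a matter of updating the ConfigMap and restarting the pods, instead of rebuilding the image.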

This might be the point when you start thinking about providing support for your cluster and the applications in it. Your Kubernetes cluster might serve an internal development team, or act as a PaaS for external teams. In either case, you need a way to triage all support cases and decide which team or person is responsible for which part of cluster management. Let's first split this into two scenarios.

A Kubernetes Cluster per team

If the decision is to give a full cluster or clusters to a team, there is no resource sharing, so there is less to worry about. Still, someone has to draw the line and decide where the cluster operators' responsibility ends and where the developers' begins.

The easiest way would be to give the team full admin access to the cluster, some volumes for persistent data, and a set of LBs (or even one LB for ingress), and delegate the management to the development team. Such a solution is not possible in most cases, as it requires a lot of experience from the development team to properly manage the cluster and keep it stable. Also, creating a cluster for even a small team is not always optimal from a resource perspective.
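Delegating a whole cluster can be expressed with built-in RBAC by binding the team's group to the predefined cluster-admin role. This is a minimal sketch; the group name depends on your authentication provider and is purely illustrative:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: team-a-admins          # illustrative name
subjects:
  - kind: Group
    name: team-a               # group as reported by your auth provider
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: cluster-admin          # built-in role: full access to everything
  apiGroup: rbac.authorization.k8s.io
```

One binding like this hands over the entire cluster, which is exactly why it demands so much operational maturity from the team receiving it.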

The other problem is that when each team manages its own whole cluster, the way clusters are run can diverge greatly. Some teams choose the NGINX ingress controller, others Traefik. At the end of the day, it is much easier to monitor and manage uniform clusters.

Shared cluster

The alternative is to share the same cluster between multiple teams. Quite a lot of configuration is required to make sure teams don't interfere with each other and can't affect other teams' operations, but this approach adds a lot of flexibility in resource management and greatly limits the number of clusters that have to be managed, for example in terms of backing them up. It is also useful when teams work on the same project or on a set of projects that use the same resources or communicate closely – communication between separate clusters is possible using a service mesh or just load balancers, but keeping the workloads in one shared cluster may be the most performant solution.
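The usual building block for this kind of isolation is a namespace per team combined with a ResourceQuota, so one team cannot starve the others. A minimal sketch, with illustrative names and limits:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: team-a
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "10"           # total CPU the namespace may request
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    persistentvolumeclaims: "5"  # cap on storage claims
```

With quotas in place, a runaway deployment in one namespace hits its own ceiling instead of exhausting the shared nodes.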

Responsibility levels

If the dev team does not possess the skills required to manage a Kubernetes cluster, then the responsibility has to be split between them and the operators. Let's go through four examples of this kind of distribution:

Not a developer responsibility

This is probably the hardest version for the operators' team: the development team is only responsible for building the Docker image and pushing it to the correct container registry. Kubernetes on its own helps a lot with making sure that a new version rollout does not result in a broken application, via deployment strategies and health checks. But if something silently breaks, it may be hard to figure out whether it is a cluster failure, a result of the application update, or even a database model change.
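The deployment strategy and health checks mentioned above live in the Deployment manifest. The sketch below (illustrative names, and an assumed /healthz endpoint on port 8080) uses a rolling update gated by a readiness probe, so a broken image never replaces the running version:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never take a healthy pod down before its replacement is ready
      maxSurge: 1         # roll out one extra pod at a time
  selector:
    matchLabels:
      app: orders
  template:
    metadata:
      labels:
        app: orders
    spec:
      containers:
        - name: orders
          image: registry.example.com/orders:1.1.0   # hypothetical registry
          readinessProbe:          # gates the rollout on the new pod actually serving traffic
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:           # restarts the container if it stops responding
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 30
```

Note that probes only catch what they test: a rollout can pass its health checks and still silently break a feature, which is the triage problem described above.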

Developers can manage deployments, pods, and configuration resources

This is a better scenario. When developers are responsible for the whole application deployment by creating manifests, all configuration resources, and doing rollouts, they can and should do a smoke test afterwards to make sure everything remains operational. Additionally, they can check the logs to see what went wrong and debug in the cluster.

This is also the point where the security or operations team needs to start thinking about securing the cluster. There are settings on the pod level which can elevate the workload's privileges, change the group it runs as, or mount host system directories. Blocking such settings can be done, for example, via Open Policy Agent. Obviously, there should be no access to other namespaces, especially kube-system, but this can easily be enforced with just the built-in RBAC.
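Scoping developers to their own namespace with built-in RBAC can be sketched as a Role plus a RoleBinding; unlike a ClusterRoleBinding, these grant nothing outside the namespace they live in. Names and the exact resource list are illustrative:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: app-developer
  namespace: team-a
rules:
  - apiGroups: [""]              # core API group
    resources: ["pods", "pods/log", "configmaps", "secrets", "services"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-developer-binding
  namespace: team-a              # binding is namespaced, so access stops here
subjects:
  - kind: Group
    name: team-a-developers      # group as reported by your auth provider
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: app-developer
  apiGroup: rbac.authorization.k8s.io
```

RBAC handles the "which resources, in which namespace" question; the pod-level privilege settings still need a policy engine such as OPA on top.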

Developers can manage all namespace level resources

If the previous setup worked, maybe we can give developers more power? We can, especially if we set quotas on everything we can. Let's first go through the additional resources that are now available and see if anything seems risky (we have stripped out the uncommon ones for clarity). Below you can see them gathered into two groups:

Safe ones:

  • Job
  • PersistentVolumeClaim
  • Ingress
  • PodDisruptionBudget
  • DaemonSet
  • HorizontalPodAutoscaler
  • CronJob
  • ServiceAccount

The ones we recommend blocking:

  • NetworkPolicy
  • ResourceQuota
  • LimitRange
  • RoleBinding
  • Role

This is not a definitive guide, just a hint. Whether to allow NetworkPolicy really depends on the network overlay configuration and the security rules we want to enforce. ServiceAccount is also arguable, depending on the use case. The other ones are commonly used to manage the resources in a shared cluster and the access to it, so they should be available mainly to the cluster administrators.
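As an example of why administrators may want to keep NetworkPolicy to themselves: a single policy like the sketch below (namespace name illustrative) locks down all inbound traffic in a namespace, and a developer-created policy could just as easily undo the isolation the platform team intended.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: team-a
spec:
  podSelector: {}        # empty selector: applies to every pod in the namespace
  policyTypes:
    - Ingress            # no ingress rules listed, so all inbound traffic is denied
```

Note that this only takes effect if the network overlay (CNI plugin) actually enforces NetworkPolicy, which is exactly why the recommendation depends on the overlay configuration.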

DevOps multifunctional teams

Last, but not least, the famous and probably hardest-to-achieve approach: multifunctional teams and the DevOps role. Let's start with the first one – moving some of the operators to work in the same team, in the same room, as the developers solves a lot of problems. There is no going back and forth trying to keep backlogs, sprints, and tasks in sync across multiple teams – the work is prioritized within the team and treated as a team effort. No more waiting 3 weeks for a small change because the whole ops team is busy with a mission-critical project. No more fighting for a change that is top priority for the project but keeps getting pushed down the queue.

Unfortunately, this means each team needs its own operators, which may be expensive and is rarely possible. As a solution to that problem comes the mythical DevOps position: a developer with operator skills who can part-time create and manage cluster resources, deployments, and CI/CD pipelines, and part-time work on the code. The required skill set is very broad, so it is not easy to find someone for the position, but it is getting popular and may revolutionize the way teams work. Sadly, this position is often described as an alias for the SRE role, which is not really the same thing.

Triage, delegate, and fix

The responsibility split is done, so now we only need to decide on the incident response scenarios: how we triage issues and figure out which team is responsible for fixing them (for example, by monitoring cluster health and correlating it with the failure), alerting, and, of course, on-call schedules. There are a lot of tools available for just that.

In the end, there is always the question "whose cluster is it?" – and if everyone knows which part of the cluster they manage, there are no misunderstandings and no blaming each other for a failure. And failures get resolved much faster.