Do You Need Multi-Clusters?
Evaluating CNCF multi-cluster solutions and going our own way
Multi-cluster is something every company should embrace once its cloud-deployed services reach a certain scale. According to VMware's State of Kubernetes 2020 report, 33% of Kubernetes adopters operate 26 clusters or more, and 20% run more than 50.
Whether you use AWS, GKE, or another cloud provider, whether you build the Kubernetes platform yourself or not, and whether you use auto-scaling or not, you will find that a single cluster is limited in capacity. The official Kubernetes guidance recommends that even the "largest" cluster stay within these limits:
No more than 110 pods per node
No more than 5,000 nodes
No more than 150,000 total pods
No more than 300,000 total containers
Many may think their current cluster is performing well and that the need for multi-clusters is still far away. In reality, they will be on the way to multi-clusters sooner than they expect.
Motivation
The single cluster we run bears a heavy burden: it contains about 2,000 namespaces and 3,000 pods, yet most of the resources in the cluster are GCP-related, including more than 10,000 StorageBuckets, more than 10,000 IAM resources, and a considerable number of Bigtable- and BigQuery-related ones. Meanwhile, some 50 operators are running dozens of internal or open-source controllers.
In terms of performance, the GKE auto-scaler currently works well. Still, we started to pursue a multi-cluster strategy and migrate our resources and services to multiple clusters to solve the following issues.
- Scalability. When a certain operator is deployed or upgraded, the pods that depend on it have to restart, causing a thundering herd across the entire cluster, since these pods may manage many resources in each namespace. When the APIServer then fails to respond to requests, errors like the ones below appear, and pods are restarted repeatedly for hours by Kubernetes' failure-restart mechanism. The single cluster can therefore no longer meet our need to upgrade quickly, onboard new users, and deploy new resources.
"error": "Get "https://10.1.1.1:443/apis?timeout=32s": context deadline exceeded"Failed to get API Group-Resources", "error": "context deadline exceeded"
- Availability and user experience. As part of the company-wide CI/CD pipeline, our service affects almost every service that is or will be deployed, for example when a PR is merged or master is built. The blast radius of the single cluster is so big that considerable effort is needed to work around our single point of failure whenever a service requires an urgent deployment.
- Isolation. The resources in the cluster can be grouped into two sets by change frequency: stable ones and unstable ones. We can keep the stable resources (P99) from being affected by other operators or resources by moving them into a relatively stable environment.
- Performance. Some cluster-level operators run into various performance problems as the number of resources they manage grows toward 10k. We need to reduce their pod resource overhead and improve the user experience.
- Multiple environments. The dozens of operators running in our cluster are beyond our direct control: most operator maintainers have their own publishing and deployment permissions, and the entire cluster faces availability trouble as soon as one of them has a problem. Moreover, it is hard to predict the issues that may occur in production because of the big gap between the current test environment and production. So we need a safer way to test and release these operators.
Multi-clusters address all of these issues: they greatly lighten the pressure on the current APIServer, speed up our upgrades and the rollout of new functionality, lessen the impact on users, shrink the blast radius, and improve the user experience.
Besides, "don't put all your eggs in one basket" also seems to be the trend in cloud strategy. In VMware's latest 2021 report, 36% of users are already pursuing a multi-cloud strategy, and it is on many other companies' roadmaps.
Deeply bound to GCP as we are, it makes sense for us to move closer to a multi-cloud strategy, and deploying multi-clusters is the very first step.
How to set up multi-clusters
Our research started with the existing CNCF tools that could support and meet our needs. Besides the pains listed above, we focused on what we need in practice.
- Cluster deployment, to quickly deploy multiple clusters and solve the network issues between them.
- Cluster discovery, a central service that finds the cluster to operate on when user resources are dispersed across multiple clusters. It should also provide something like consistent hashing so that existing resources do not have to be redeployed to different clusters when new clusters are added later (see the sketch after this list).
- GCP resource and custom CRD support. We need support for the GCP resources defined by Google Config Connector as well as for the resources defined by our internal CRDs.
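To make the cluster-discovery requirement more concrete, below is a minimal Go sketch of the consistent-hashing idea. It is only an illustration under assumed names: the cluster names, namespace names, and replica count are made up, and a real implementation would live behind the central discovery service described above.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// hashRing is a minimal consistent-hash ring mapping namespaces to clusters.
type hashRing struct {
	points []uint32          // sorted hash points on the ring
	owners map[uint32]string // hash point -> cluster name
}

// newHashRing places `replicas` virtual nodes per cluster on the ring to
// smooth out the distribution.
func newHashRing(clusters []string, replicas int) *hashRing {
	r := &hashRing{owners: map[uint32]string{}}
	for _, c := range clusters {
		for i := 0; i < replicas; i++ {
			h := hashKey(fmt.Sprintf("%s#%d", c, i))
			r.points = append(r.points, h)
			r.owners[h] = c
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// Lookup returns the cluster responsible for a namespace. Adding a new
// cluster only remaps the namespaces that land on its new ring segments,
// so existing resources mostly stay where they already are.
func (r *hashRing) Lookup(namespace string) string {
	h := hashKey(namespace)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.owners[r.points[i]]
}

func hashKey(s string) uint32 {
	f := fnv.New32a()
	f.Write([]byte(s))
	return f.Sum32()
}

func main() {
	ring := newHashRing([]string{"cluster-a", "cluster-b", "cluster-c"}, 50)
	for _, ns := range []string{"team-payments", "team-search", "team-data"} {
		fmt.Printf("namespace %s -> %s\n", ns, ring.Lookup(ns))
	}
}
```

The virtual nodes (the replicas parameter) keep the namespace distribution roughly even across clusters.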
Meanwhile, we also need to adapt to multi-clusters in quite a few places.
- To group resources by target cluster for each request in CI/CD; sometimes we need to initiate operations on multiple clusters simultaneously (see the sketch after this list).
- To make sure each operator supports multi-clusters.
- To provide a UI or CLI tool so that users can discover which cluster holds their resources and run CLI operations against it.
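As an illustration of the first item, here is a small Go sketch that fans grouped resources out to several clusters at once. The applyToCluster function and the manifest paths are placeholders for whatever the pipeline actually does; only the grouping-plus-concurrent-apply shape is the point.

```go
package main

import (
	"fmt"
	"sync"
)

// applyToCluster is a placeholder for whatever the pipeline really does per
// cluster, e.g. shelling out to kubectl or talking to the API server.
func applyToCluster(cluster string, manifests []string) error {
	fmt.Printf("applying %d manifests to %s\n", len(manifests), cluster)
	return nil
}

func main() {
	// Manifests already grouped by the cluster that owns their namespace,
	// for example via the discovery service sketched earlier.
	groups := map[string][]string{
		"cluster-a": {"team-a/deployment.yaml"},
		"cluster-b": {"team-b/deployment.yaml", "team-b/iam.yaml"},
	}

	var wg sync.WaitGroup
	errs := make(chan error, len(groups))
	for cluster, manifests := range groups {
		wg.Add(1)
		go func(cluster string, manifests []string) {
			defer wg.Done()
			if err := applyToCluster(cluster, manifests); err != nil {
				errs <- fmt.Errorf("%s: %w", cluster, err)
			}
		}(cluster, manifests)
	}
	wg.Wait()
	close(errs)

	for err := range errs {
		fmt.Println("apply failed:", err)
	}
}
```

A buffered error channel keeps the goroutines from blocking while still surfacing per-cluster failures at the end.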
With all these needs in mind, let's take a look at how the community implements multi-clusters: with which strategies and which tools.
Strategy
Generally, there are two multi-cluster strategies: Kubernetes-centric and network-centric.
- Kubernetes-centric: manage multiple clusters by building a control-plane layer on top of them that distributes and schedules resource-management requests; the clusters stay isolated from one another, and each manages different resources.
- Network-centric: enable communication between clusters through network configuration so that resources are replicated and kept consistent across all of them.
You may find these two strategies similar to the classic ways of conquering database bottlenecks. Exactly! Kubernetes-centric is like the master-master pattern, relying on a piece of middleware (the control plane) to handle and distribute requests, while network-centric is like the master-slave pattern, where the communication between master and slave keeps the data eventually consistent.
In our case, the network-centric approach is not applicable, but the Kubernetes-centric strategy works well to lighten the burden on each cluster.
Tools
Tools from cloud providers, such as GKE's multi-cluster support, are out. They are basically network-centric and come with two major problems.
- They are not really open source, which makes customization or further development hard.
- They are tightly coupled to the cloud provider itself, which works against future multi-cloud or hybrid-cloud strategies.
Now let's see which popular community tools we can choose from.
Kubefed
This popular multi-cluster tool has two versions: v1 kube-federation (already abandoned) and v2 kubefed.
As we can see from kubefed's overall architecture, it essentially consists of two CRDs, Type Configuration and Cluster Configuration, and implements its two main functions through the CRDs' respective controllers and admission webhooks.
- Cross-cluster distribution. By abstracting the three concepts of Templates, Placement, and Overrides, it can deploy resources (such as a Deployment) to different clusters and achieve multi-cluster scaling.
- Scheduling. It implements resource scheduling with clusterSelector and ReplicaSchedulingPreference.
In addition, kubefed supports many other features, such as high availability, easier resource migration, and multi-cloud and hybrid-cloud deployment.
However, it is not the one we are looking for.
- It is a complex tool with a steep learning curve for most developers.
- It does not support customized distribution strategies and cannot meet our requirement to isolate resources by stability. For example, ReplicaSchedulingPreference only supports Deployments and ReplicaSets.
- It does not support custom CRDs. We would need extra effort to manage the GCP resources from Config Connector.
GitOps
GitOps is already a very popular CI/CD approach for deploying Kubernetes resources to a cluster, with tools such as FluxCD, Fleet, and ArgoCD, most of which support multi-clusters. Is it something we can rely on?
FluxCD supports multi-cluster and multi-tenant setups, uses kustomization and helm to add new clusters, and supports resource isolation.
After evaluation, it is not the one, either.
- No dynamic distribution. Resource allocation has to be injected in advance, in a way similar to go template, rather than allocated dynamically, and our own CI/CD tools cannot be seamlessly integrated with it.
- Side effects. The overhead of applying helm and kustomize is too high.
Other GitOps tools share these disadvantages, so they do not fit either.
Karmada
Karmada is a CNCF sandbox project that supports multi-cluster and multi-cloud strategies.
Its design is partially inherited from kubefed v1 and v2: it forms a control plane with an APIServer, a Scheduler, and a controller-manager, and implements resource scheduling and distribution through two CRDs, PropagationPolicy and OverridePolicy.
Karmada has many advantages, including but not limited to,
- Cluster management. The unified API supports cluster registration and lifecycle management.
- Declarative resource scheduling. It supports customized scheduling strategies and dynamically schedules resources to the appropriate cluster according to tags and states.
Karmada's main advantages lie in processing native Kubernetes resources, but it does not support CRDs out of the box.
Karmada knows how to parse native Kubernetes resources, but custom resources defined by CRDs (or added through something like an aggregated API server) can only be treated as opaque resources, because Karmada lacks knowledge of their structure, so the advanced scheduling algorithms cannot be applied to them.
That is, we would have to develop a Resource Interpreter to get features such as dynamic scheduling, which is the biggest reason why we did not pick it.
Our own way
Realizing that we use Kubernetes in a rather "unique" way, different from most scenarios in the community, we decided to build our own implementation to fully meet our needs.
There is actually not that much to do in the transition from the current single cluster to multi-clusters.
- Cluster management. We have already used terraform to create the current cluster, so we can quickly create additional clusters with terraform once the network issue is solved and enough IPs are available.
- Scheduling. We need a central scheduling service once we distribute resources to different clusters by namespace: given a namespace, the service returns the corresponding cluster context so that the CI/CD service can deploy the resource to the right cluster. This service should also support gRPC so that it can be unified with the distribution strategy used within the company.
- Golang support. We have many self-developed operators, so operator maintainers will want a Golang package. This is a nice-to-have, though, because they can also get cluster information through the RPC interface.
- Automation. With a single cluster, we built the cluster, installed tools, and deployed the various operators manually through the CLI or simple scripts. With multi-clusters, deploying clusters one by one manually becomes inefficient and error-prone, so automation tooling and complete scripts are necessary.
- CLI tool. We want to give users the same experience with multi-clusters as they had with one cluster, where they could log in to our cluster to view their resources. For that, we need to implement a kubectl plugin that encapsulates the relevant logic and forwards the kubectl request to the corresponding context as long as the user passes in the namespace (see the sketch after this list).
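As a rough sketch of what such a plugin could look like, the Go snippet below resolves the target context from the namespace and builds a client-go client against it. The lookupContext stub stands in for the central scheduling service, and its hard-coded mapping and context names are invented for illustration; only the client-go calls are real.

```go
package main

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// lookupContext stands in for the central scheduling service: given a
// namespace, it returns the kubeconfig context of the cluster that owns it.
// The mapping here is hard-coded purely for illustration.
func lookupContext(namespace string) (string, error) {
	routing := map[string]string{
		"team-a": "cluster-a-context",
		"team-b": "cluster-b-context",
	}
	if ctx, ok := routing[namespace]; ok {
		return ctx, nil
	}
	return "", fmt.Errorf("no cluster registered for namespace %q", namespace)
}

func main() {
	namespace := os.Args[1]

	ctxName, err := lookupContext(namespace)
	if err != nil {
		panic(err)
	}

	// Build a client against the resolved kubeconfig context, mirroring
	// what `kubectl --context <ctx>` does under the hood.
	rules := clientcmd.NewDefaultClientConfigLoadingRules()
	overrides := &clientcmd.ConfigOverrides{CurrentContext: ctxName}
	cfg, err := clientcmd.NewNonInteractiveDeferredLoadingClientConfig(rules, overrides).ClientConfig()
	if err != nil {
		panic(err)
	}

	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	pods, err := client.CoreV1().Pods(namespace).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("namespace %s lives in context %s and has %d pods\n", namespace, ctxName, len(pods.Items))
}
```

In the real plugin, lookupContext would call the scheduling service over gRPC, and the resolved context would be used to forward the original kubectl request.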
Basically, the changes in architecture from a single cluster to multi-clusters are as follows.
As you can see, this approach is more efficient for us than any existing tool and fits our current situation better.
The end
This has been a review of the problems we encountered, the discussions we had, and the thoughts they provoked while implementing multi-clusters. We did not end up using any of the community tools, but researching their features, pros, and cons definitely helped us pave our own way.
I hope it also gives you a big picture of multi-cluster implementation and helps when you find yourself at the same crossroads.
Thanks for reading!
Reference
- Kubernetes Federation Evolution: https://kubernetes.io/blog/2018/12/12/kubernetes-federation-evolution/
- Understanding Multi-Cluster Kubernetes, Ambassador Labs (getambassador.io)
- Simplifying multi-clusters in Kubernetes, Cloud Native Computing Foundation (cncf.io)
- fluxcd/flux2-multi-tenancy: Manage multi-tenant clusters with Flux (github.com)
- karmada-io/karmada: Open, Multi-Cloud, Multi-Cluster Kubernetes Orchestration (github.com)
- Kubernetes Multi-Clusters: How & Why To Use Them, BMC Software Blogs