FAQs

General

Q: What is Kubernetes?

Kubernetes, also known as K8s, is an open-source system for automating the deployment, scaling and management of containerized applications (see the Kubernetes docs).

Q: What is SKE?

STACKIT Kubernetes Engine (SKE) is a robust, scalable and managed Kubernetes service. SKE delivers CNCF-compliant Kubernetes clusters and makes it easy to deploy standard Kubernetes applications and containerized workloads. Customized Kubernetes clusters can be easily created as self-service via the STACKIT Cloud Portal.

Q: How does SKE work?

The SKE service ensures that all control plane components of your SKE cluster are up and running. You can access the cluster with a kubeconfig to deploy and operate your applications. Components running in the background take care of OS and Kubernetes version updates. SKE also provides cost-saving mechanisms such as hibernation and auto-scaling of node pools.

Q: Why should I use SKE?

SKE is a CNCF-certified managed Kubernetes provider. It manages your cluster's control plane, so you can focus on running your applications without having to worry about cluster management.

Kubernetes

Q: Can I use my existing Kubernetes tools for SKE clusters?

Yes, SKE provides a fully upstream-compliant Kubernetes service.
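
For example, once you have downloaded the kubeconfig for your cluster, the standard tooling works as usual (file names are placeholders):

# Point kubectl (or helm, k9s, etc.) at the downloaded kubeconfig
export KUBECONFIG=./my-ske-kubeconfig.yaml
kubectl get nodes
kubectl apply -f deployment.yaml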

Q: Which Kubernetes versions can I choose?

For an overview of the currently supported versions, please have a look at the cluster creation wizard in the STACKIT Cloud Portal. Generally, we try to support new versions as soon as possible after their release and support the three latest minor versions.

Q: Can I run Windows containers on SKE?

No. SKE currently does not support Windows nodes.

Q: Which container runtimes can I use?

Currently, Docker and containerd are supported, depending on the operating system you select for your nodes.

Cluster Management

Q: How long does it take to create an SKE cluster?

Cluster creations normally take about 5-10 minutes. During this time, SKE will automatically create all required nodes and start the control plane components. It will also make sure that all components (e.g. the Kubernetes API server) are ready, so you can jump right in once your cluster is available.

Q: How does the deletion of clusters work?

The deletion process removes all cluster resources in an orderly manner and can therefore take some time. SKE charges only apply to clusters that are not in the deletion process, so you do not have to pay for clusters that are already being deleted.

Q: What do I need to backup?

The control plane of your cluster is backed up by the SKE team. This type of backup is fully automated and used for disaster recovery. It is not intended to be a customer service tool. The data on the worker nodes and data in Persistent Volumes is not backed up automatically. Please use a tool like Velero to backup these resources yourself.
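
As a rough sketch, backing up one namespace with Velero could look like this (bucket configuration is omitted and names are placeholders; see the Velero docs for the provider-specific setup):

# Back up all resources of a namespace (plus volume snapshots, if configured)
velero backup create my-app-backup --include-namespaces my-app

# Restore from that backup later
velero restore create --from-backup my-app-backup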

Q: Can I temporarily shut down my SKE cluster?

Yes, clusters can be hibernated. Hibernation deletes all existing worker nodes and therefore drastically reduces the cost of your cluster. The control plane and Persistent Volume Claims (PVCs) are retained, however. As soon as you wake up the cluster, everything is restored as it was before the shutdown.

Q: Can I restrict access to the Kubernetes API server?

Currently, all Kubernetes API servers are reachable with public IP addresses. You can restrict access to the Kubernetes resources based on ServiceAccounts, Roles and RoleBindings.
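
As an illustration, a minimal read-only setup based on these primitives could look like the following (names and namespace are hypothetical):

# ServiceAccount that should only be allowed to read Pods in one namespace
kubectl create serviceaccount ci-reader -n my-app

# Role granting read access to Pods
kubectl create role pod-reader --verb=get --verb=list --verb=watch --resource=pods -n my-app

# Bind the Role to the ServiceAccount
kubectl create rolebinding ci-reader-pods --role=pod-reader --serviceaccount=my-app:ci-reader -n my-app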

Q: How can I use audit logging for my SKE cluster?

STACKIT offers an Audit Log service. Audit logs are the key to answering the question: "Who did what, when and where?" 

All requests sent via the SKE API are stored in the STACKIT Audit Log. This includes: cluster creation, changes to the SKE cluster (e.g. updates to the Kubernetes version, changes to node pools, credentials rotation, etc.), deletion of the cluster and much more. 

You can also request audit logging for the control plane of your Kubernetes cluster. To do this, you need an SKE cluster and your own Splunk index to store the audit logs. Please contact STACKIT support with this information.

Important notes:

  • It is your responsibility to deploy and maintain the Splunk index
  • You cannot configure the granularity and scope of the audit events yourself after activation - contact STACKIT support if you want to make changes

Q: Does an OS or Kubernetes update affect the running workload?

No. All updates are executed as rolling updates, so they should not affect any running applications as long as your application's Pod Disruption Budget is configured correctly.
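
As a minimal sketch, a Pod Disruption Budget that keeps at least one replica of a hypothetical app available during a rolling update could be created like this:

# Requires the app to run more than one replica, otherwise the drain will block
kubectl create poddisruptionbudget my-app-pdb --selector=app=my-app --min-available=1 -n my-app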

Q: How do I update my Kubernetes cluster to a new version?

There are three update types: automatic updates (Kubernetes patch version updates are applied automatically), manual updates (via the STACKIT Cloud Portal; required for Kubernetes minor version updates) and forceful updates (if a version has expired, the update is performed forcefully).

For more information see: Updating & maintaining a SKE Cluster.

Q: How can I update my worker nodes' operating system?

Node pool OS version updates are generally applied automatically during the cluster's maintenance window, as long as this option is not disabled. If you want to try a new preview version you have the option to update manually by editing the node pool in the STACKIT Cloud Portal.

For more information take a look at Updating & maintaining a SKE Cluster.

Q: Why does my node roll take longer than expected?

Whenever a rolling node update is performed (e.g. Kubernetes or OS version update), SKE automatically tries to drain each node one by one (unless configured otherwise) by evicting each Pod.

Sometimes, PodDisruptionBudgets (PDBs) can prolong the update. For example, a PodDisruptionBudget may be configured for an application with Allowed disruptions set to zero. Because our system respects PodDisruptionBudgets, SKE cannot continue draining the node(s). As a result, the node roll is blocked until it runs into the SKE pre-defined hard limit for drain operations, which is 2 hours. After that, Pods that are still on the node are forcefully deleted with terminationGracePeriod set to zero.
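
To check whether a PodDisruptionBudget is blocking a drain, you can inspect the allowed disruptions across all namespaces:

# An ALLOWED DISRUPTIONS value of 0 means the drain cannot evict these Pods
kubectl get pdb --all-namespaces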

Q: What controls the number of nodes running in my cluster?

The number of nodes provisioned in the cluster is determined by the min and max values configured in the node pools. In addition, SKE clusters use the Cluster Autoscaler to dynamically resize the cluster. The number of nodes increases when new Pods cannot be scheduled due to insufficient infrastructure resources or when the Cluster Autoscaler detects that more nodes are required. However, the Cluster Autoscaler will never scale below the minimum or above the maximum values configured via the node pools.
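
A typical trigger for a scale-up is Pods stuck in the Pending state because no node has sufficient resources; you can inspect this as follows (Pod and namespace names are placeholders):

# Pending Pods whose events show "FailedScheduling" typically trigger a scale-up
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
kubectl describe pod <pending-pod> -n <namespace>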

For more information see: When does Cluster Autoscaler change the size of a cluster?

Node Configuration

Q: Which Operating Systems can I use for my nodes?

Currently, only Flatcar/CoreOS is supported as OS. Windows nodes are not supported.

Q: Can I run different VM types in one cluster?

Yes, by using multiple node pools with different VM types. These can be scaled independently of each other.

Q: Which VM flavors can I choose from for my nodes?

You can use all flavors offered by the STACKIT IaaS service with more than one vCPU and 2 GB of RAM. Please have a look at the IaaS docs for more information.

Q: Can I use the STACKIT Availability Zones in SKE?

Yes, SKE supports all Availability Zones that are supported by the STACKIT IaaS service.

Q: How can I add taints and labels to nodes?

You can edit labels and taints in the node pool editing screen in the STACKIT Cloud Portal. These are then set for all nodes in the edited node pool.
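
You can verify the result on the nodes afterwards; note that tainted nodes only schedule Pods with a matching toleration (the node name is a placeholder):

# Show the labels of all nodes
kubectl get nodes --show-labels

# Show the taints of a single node
kubectl describe node <node-name> | grep -i taints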

Q: How can I change the container runtime for my node pool?

Open the list of your SKE clusters in the STACKIT Cloud Portal and click on the cluster whose container runtime you want to change. The container runtime is specified per node pool, so select "Node pools" on the left side of the cluster view. Next, click on the node pool; the "Overview" tab contains a "Pool Configuration" menu. Press "Edit", change "docker" to "containerd" in the "Container Runtime" field and save the changes. Our managed Kubernetes will then start a rolling update for you.
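
After the rolling update has finished, you can verify the active runtime per node:

# The CONTAINER-RUNTIME column shows e.g. containerd://<version>
kubectl get nodes -o wide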

Networking & Load Balancers

Q: Can I configure a static IP address for my load balancer?

Our SKE Services/load balancers can be configured for inbound traffic, e.g. the use of existing Floating IPs (see Using existing public IPs for load balancing -SKE-). These configurations have no impact on outbound traffic via the OpenStack router IP (learn more).
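
As a rough sketch, a Service requesting a specific inbound IP could look like this (the IP is a documentation placeholder; the exact mechanism for binding an existing Floating IP is described in the linked article):

# Service of type LoadBalancer requesting a specific inbound IP
kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  type: LoadBalancer
  loadBalancerIP: 198.51.100.10
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 8080
EOF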

Q: Can I configure a static IP address for my egress cluster traffic?

No, it is not possible to assign a specific IP address for egress traffic. When an SKE cluster is created, an IPv4 address is automatically chosen from the available pool at STACKIT. This egress IP address remains fixed throughout the cluster's lifecycle and cannot be modified.

All egress traffic is routed through the router within your Kubernetes cluster. To identify the router's IP address, you can follow one of these methods:

  • Portal: currently not possible
  • IaaS API:
    1. Use the ListNetworks endpoint to retrieve the network ID
    2. Use the GetNetwork endpoint to get the network's publicIp
  • OpenStack (see the combined sketch below):
    1. List the OpenStack routers: openstack router list (the router is named shoot--xyz--name, where name is the cluster name)
    2. Show the router details: openstack router show ID
    3. The egress IP can be found in external_fixed_ips.ip_address within the external_gateway_info field
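
Put together, the OpenStack steps could look like this (assuming the openstack CLI is configured for your project; the router ID is a placeholder):

# 1. Find the router that belongs to your cluster
openstack router list

# 2. Show its external gateway info; the egress IP is in external_fixed_ips.ip_address
openstack router show <router-id> -c external_gateway_info -f json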

Q: Why is my local Pod to Service communication restricted after updating to Kubernetes 1.25.x?

If a loadBalancerSourceRanges list is set in the Service, it is also applied to local communication since Kubernetes version 1.25.x.
If a Pod needs to reach the Service, the podCIDR therefore also needs to be added to the loadBalancerSourceRanges in the Service. The podCIDR can be found in the shoot-info ConfigMap within the kube-system namespace:

kubectl get configmap -n kube-system shoot-info -oyaml
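
For illustration, adding the podCIDR to an existing Service could then look like this (Service name, namespace and CIDR values are placeholders; keep your original source ranges in the list):

# Replace the source ranges with the original allow-list plus the podCIDR
kubectl patch service my-service -n my-namespace \
  -p '{"spec":{"loadBalancerSourceRanges":["203.0.113.0/24","100.96.0.0/11"]}}'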

Others

Q: What if I have open questions that are not answered in any of our docs?

Feel free to create a service request in the STACKIT Help Center.

Known Issues

Node Pool rollouts result in downtime when using multiple AZs per pool (added 2024-09-02)

This only occurs if your cluster has node pools with more than one AZ configured and the maxSurge value is less than the number of configured AZs.

The reason for this behavior is that each node pool is internally split into one "sub-pool" per AZ. The maxSurge and maxUnavailable values are distributed across these sub-pools; if they are set too low, some sub-pools end up with undesired values.

Example:
A node pool with multiple AZs (eu01-1, eu01-2, eu01-3) and maxSurge=1 & maxUnavailable=0 is split into three sub-pools:

  • eu01-1: maxSurge=1 & maxUnavailable=0
  • eu01-2: maxSurge=0 & maxUnavailable=0
  • eu01-3: maxSurge=0 & maxUnavailable=0

For sub-pools with maxSurge=0 & maxUnavailable=0, the rollout is performed as if maxSurge=0 & maxUnavailable=1 were set, thus resulting in downtime for the affected AZs.

Mitigation:
Set maxSurge to at least the number of configured AZs. We will also introduce additional validation for this.

Example:
A node pool with multiple AZs (eu01-1, eu01-2 & eu01-3) should have maxSurge=3.

Image pull fails with: "toomanyrequests: You have reached your pull rate limit."

This is not an issue specific to SKE; it affects all Docker Hub users.

Docker introduced pull rate limits for Docker Hub free users in November 2020. Since then, the free plan has been limited to 100 pulls per 6 hours for anonymous users and 200 pulls per 6 hours for authenticated users.

For more information please take a look at How to deal with Docker Hub pull request limits -SKE-.
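
One common mitigation is to pull images as an authenticated user via an image pull Secret (credentials, namespace and Secret name are placeholders):

# Create a Docker Hub pull secret
kubectl create secret docker-registry dockerhub-creds \
  --docker-username=<user> --docker-password=<token> -n my-app

# Let the default ServiceAccount use it for image pulls
kubectl patch serviceaccount default -n my-app \
  -p '{"imagePullSecrets":[{"name":"dockerhub-creds"}]}'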

Cluster is stuck in unhealthy state

Sometimes your cluster gets stuck in the "unhealthy" state. This can have multiple causes. Sometimes resources such as the VMs for the node pools take longer than expected to create. In this case, the SKE service will regularly try to reconcile the cluster to fix the issue, and the cluster will automatically return to the "healthy" state.

Another common reason for the unhealthy state could be quota issues that stop some resources from being created. You can verify this by taking a look at your quotas in the STACKIT Cloud Portal.

If your cluster gets stuck in the unhealthy state and the issue persists, feel free to create a service request in the STACKIT Help Center.