In Kubernetes, scheduling refers to making sure that Pods are matched to Nodes so that the kubelet can run them. It is the kube-scheduler's job to schedule Pods to specific nodes in the Kubernetes cluster.
The kube-scheduler constantly watches for newly created Pods that have no Node assigned, and for every Pod it discovers, it finds the best Node to run it on.
The kube-scheduler filters cluster nodes based on the resource requests and limits of each container in the created Pod. The nodes that meet the scheduling requirements for a Pod are known as feasible nodes.
All the feasible Nodes are assigned a score, and the Node with the highest score is picked by the scheduler to run the Pod. The scheduler then notifies the API server about this decision, and this process is called binding.
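You can observe the result of this filtering, scoring and binding for any Pod by checking its events; once the Pod is bound, the default-scheduler records a Scheduled event for it. A minimal sketch (my-pod is just a placeholder name):
root@kube-master:~# kubectl describe pod my-pod
The Events section at the end of the output shows the scheduling decision, or FailedScheduling messages if no feasible Node was found.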
There are different ways to configure a Pod so that it gets scheduled on a specific node. In this tutorial we will discuss two of them:
- Node Selector
- Node Affinity
nodeSelector
nodeSelector is the basic and recommended form of node selection constraint. We can simply assign a node label as a key-value pair in the nodeSelector field of the PodSpec.
For the Pod to be eligible to run on a node, the node must have each of the indicated key-value pairs as labels (having additional labels will not affect its behaviour).
Let us understand this process step by step:
Step 1: Assign a Label to the Node
- List the nodes in your cluster, along with their labels by running the following command:
root@kube-master:~# kubectl get nodes --show-labels
- Now choose one of your cluster nodes and add a label to it:
root@kube-master:~# kubectl label nodes kube-worker1 workload=prod
node/kube-worker1 labeled
- Verify the assigned label:
root@kube-master:~# kubectl get nodes kube-worker1 --show-labels
Another way to verify this is to run the following command:
root@kube-master:~# kubectl describe node kube-worker1
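You can also verify the label by listing only the nodes that carry it, using a label selector:
root@kube-master:~# kubectl get nodes -l workload=prod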
Step 2: Schedule a Pod using nodeSelector
In this step we will configure the Pod manifest file with a nodeSelector field so that it gets scheduled on the Node of our choice, which is kube-worker1 in our case.
root@kube-master:~/nodeSelector# cat nodeSelector.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nodeselector-demo
  labels:
    env: prod
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  nodeSelector:
    workload: prod
- Let's apply our Pod configuration.
root@kube-master:~/nodeSelector# kubectl apply -f nodeSelector.yaml
pod/nodeselector-demo created
- Verify that it worked and the Pod got scheduled on the Node it was assigned to:
root@kube-master:~/nodeSelector# kubectl get pods -o wide
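If you only want the name of the Node the Pod landed on, a jsonpath query works as well:
root@kube-master:~/nodeSelector# kubectl get pod nodeselector-demo -o jsonpath='{.spec.nodeName}'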
Along with the labels we assign manually, all Nodes are assigned some built-in node labels (see the example after this list), such as:
- kubernetes.io/hostname
- failure-domain.beta.kubernetes.io/zone
- failure-domain.beta.kubernetes.io/region
- topology.kubernetes.io/zone
- topology.kubernetes.io/region
- beta.kubernetes.io/instance-type
- node.kubernetes.io/instance-type
- kubernetes.io/os
- kubernetes.io/arch
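You can display any of these built-in labels as extra columns when listing nodes, for example:
root@kube-master:~# kubectl get nodes -L kubernetes.io/os,kubernetes.io/arch,node.kubernetes.io/instance-type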
Affinity and anti-affinity
Along with nodeSelector, Kubernetes has the affinity/anti-affinity feature, which greatly expands the types of constraints you can express while configuring resources.
Using the affinity feature gives you the following benefits:
- You can indicate rules as "soft"/"preference" rather than a hard requirement, so even if the scheduler can't satisfy them, the Pod will still be scheduled.
- The affinity/anti-affinity language is more expressive: it offers more matching rules besides exact matches combined with a logical AND.
The affinity feature consists of two types of affinity: node affinity and inter-pod affinity/anti-affinity.
Node affinity
Conceptually, it does the same job as nodeSelector but in a more expressive manner.
There are two types of node affinity that exist today:
- Hard type: requiredDuringSchedulingIgnoredDuringExecution
- Soft type: preferredDuringSchedulingIgnoredDuringExecution
Let us understand them by taking examples.
Schedule a Pod using required node affinity
Here is our Pod manifest file.
root@kube-master:~/affinity# cat required_affinity.yaml
apiVersion: v1
kind: Pod
metadata:
  name: required-affinity-demo
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: workload
            operator: In
            values:
            - staging
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
If you observe the above manifest file, we have a Pod configured with the requiredDuringSchedulingIgnoredDuringExecution type of node affinity and a match expression on the key workload with value staging. This means that the Pod will only be scheduled on a node that has a workload=staging label.
The node affinity syntax supports the following operators: In, NotIn, Exists, DoesNotExist, Gt, Lt. You can use NotIn and DoesNotExist to achieve node anti-affinity behavior, as shown in the sketch after this list.
- If you specify both nodeSelector and nodeAffinity, both must be satisfied for the Pod to be scheduled onto a candidate node.
- If you specify multiple nodeSelectorTerms associated with nodeAffinity types, then the Pod can be scheduled onto a node if one of the nodeSelectorTerms can be satisfied.
- If you specify multiple matchExpressions associated with a nodeSelectorTerm, then the Pod can be scheduled onto a node only if all matchExpressions are satisfied.
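Here is a minimal sketch (reusing the workload label key from this tutorial) of a required node affinity term that uses NotIn to keep a Pod off staging nodes; it would sit under spec in a Pod manifest, just like the required_affinity.yaml example above:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          # do not schedule this Pod on nodes labelled workload=staging
          - key: workload
            operator: NotIn
            values:
            - staging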
- Let us apply the manifest to create the Pod:
root@kube-master:~/affinity# kubectl apply -f required_affinity.yaml
pod/required-affinity-demo created
- Check the Pod status and the Node it got scheduled on:
root@kube-master:~/affinity# kubectl get pods -o wide
If you look at the output above, the Pod has not been scheduled on any available Node because the required condition is not met, and the Pod is still in the Pending state.
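To understand why the Pod is stuck in Pending, check its events:
root@kube-master:~/affinity# kubectl describe pod required-affinity-demo
The Events section at the bottom of the output shows FailedScheduling messages explaining which requirement could not be satisfied.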
- Now if we assign the required label to one of our cluster nodes, the above Pod will get scheduled there.
root@kube-master:~/affinity# kubectl label nodes kube-worker2 workload=staging
root@kube-master:~/affinity# kubectl get pods -o wide
Schedule a Pod using preferred node affinity
Here is our Pod manifest file.
root@kube-master:~/affinity# cat preferred_affinity.yaml
apiVersion: v1
kind: Pod
metadata:
  name: preferred-affinity-demo
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: workload
            operator: In
            values:
            - preprod
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
If you observe the above manifest file, we have a Pod configured with the preferredDuringSchedulingIgnoredDuringExecution type of node affinity and a match expression on the key workload with value preprod. This means that the Pod will prefer a node that has a workload=preprod label, but can still run elsewhere if no such node exists.
The weight field above is in the range 1-100. For each node that meets all of the scheduling requirements (resource requests, required affinity expressions, and so on), the scheduler computes a sum by iterating through the elements of this field and adding the weight to the sum if the node matches the corresponding matchExpressions. This score is then combined with the scores of other priority functions for the node. The node(s) with the highest total score are the most preferred.
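As a minimal sketch of how the weights add up, the preferredDuringSchedulingIgnoredDuringExecution block below could replace the one in preferred_affinity.yaml above; the disktype=ssd label is purely an illustrative assumption. A node with only workload=preprod would get 80 added to its score, a node with only disktype=ssd would get 20, and a node with both labels would get 100, making it the most preferred, all other factors being equal.
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        preference:
          matchExpressions:
          - key: workload
            operator: In
            values:
            - preprod
      - weight: 20
        preference:
          matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
We will keep using the original single-preference manifest below.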
root@kube-master:~/affinity# kubectl apply -f preferred_affinity.yaml
pod/preferred-affinity-demo created
Even though we don't have any Node in our cluster with the workload=preprod label, the Pod still got scheduled on one of the cluster nodes, in our case the kube-worker2 node.
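You can confirm this the same way as before:
root@kube-master:~/affinity# kubectl get pods -o wide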
Inter-pod affinity and anti-affinity
Inter-pod affinity and anti-affinity allow you to constrain which nodes your Pod is eligible to be scheduled on based on the labels of Pods already running on a node, rather than the labels of the node itself.
There are two types of pod affinity and anti-affinity:
- Hard type: requiredDuringSchedulingIgnoredDuringExecution
- Soft type: preferredDuringSchedulingIgnoredDuringExecution
Inter-pod affinity is specified in the podAffinity field under affinity in the PodSpec, and inter-pod anti-affinity is specified in the podAntiAffinity field under affinity in the PodSpec.
An example manifest file:
root@kube-master:~/affinity# cat pod_affinity_antiaffinity.yaml
apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: env
            operator: In
            values:
            - Prod
        topologyKey: topology.kubernetes.io/zone
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: env
              operator: In
              values:
              - Staging
          topologyKey: topology.kubernetes.io/zone
  containers:
  - name: with-pod-affinity
    image: nginx
    imagePullPolicy: IfNotPresent
In the above Pod manifest file, the Pod affinity rule says that the Pod can be scheduled onto a node only if that node is in the same zone as at least one already-running Pod that has a label with key env and value Prod.
The Pod anti-affinity rule says that the scheduler should prefer not to place the Pod onto a node if that node is in the same zone as a Pod with a label having key env and value Staging (it is a soft rule, so the Pod can still be scheduled there if necessary).
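For the required Pod affinity rule above to be satisfiable, the nodes must carry the topology.kubernetes.io/zone label (managed clusters usually set this automatically) and at least one running Pod in that zone must have the env=Prod label. A minimal sketch of preparing this on our demo cluster (the zone value zone-a and the Pod name backend are illustrative assumptions):
root@kube-master:~/affinity# kubectl label nodes kube-worker1 kube-worker2 topology.kubernetes.io/zone=zone-a
root@kube-master:~/affinity# kubectl run backend --image=nginx --labels=env=Prod
root@kube-master:~/affinity# kubectl apply -f pod_affinity_antiaffinity.yaml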
Inter-pod affinity and anti-affinity require a substantial amount of processing, which can slow down scheduling in large clusters significantly. The Kubernetes developers do not recommend using them in clusters larger than several hundred nodes.
Summary
This is all about Kubernetes Node Selectors and Node Affinity.
Hope you like the tutorial. Stay tuned and don't forget to provide your feedback in the response section.
Happy Learning!