Kubernetes Scheduling - Node Selectors and Node Affinity


In Kubernetes, scheduling refers to matching Pods to Nodes so that the kubelet can run them. It is the kube-scheduler's job to schedule Pods onto specific nodes in the Kubernetes cluster.

kube-scheduler constantly watches for newly created Pods that have no Node assigned and, for every Pod it discovers, finds the best Node to run it on.

The kube-scheduler first filters the cluster nodes based on the resource requests and limits of each container in the created Pod. The nodes that meet the scheduling requirements for a Pod are known as feasible nodes.

All the feasible Nodes are then assigned a score, and the Node with the highest score is picked by the scheduler to run the Pod. The scheduler then notifies the API server of this decision in a process called binding.

There are different ways to configure a Pod to be scheduled onto a specific node. In this tutorial we will discuss two of them:

  • Node Selector
  • Node Affinity

nodeSelector

nodeSelector is the simplest and most basic form of node selection constraint. We simply specify node labels as key-value pairs in the nodeSelector field of the PodSpec.

For the Pod to be eligible to run on a node, the node must have each of the indicated key-value pairs as labels (additional labels on the node will not affect its behaviour).

Let us understand this process step by step:

Step 1: Assign a Label to the Node

  • List the nodes in your cluster, along with their labels by running the following command:
root@kube-master:~# kubectl get nodes --show-labels

[Screenshot: node list with labels]

  • Now choose one of your cluster nodes and add a label to it:
root@kube-master:~# kubectl label nodes kube-worker1 workload=prod
node/kube-worker1 labeled
  • Verify the assigned label:
root@kube-master:~# kubectl get nodes kube-worker1 --show-labels

[Screenshot: kube-worker1 labels, including workload=prod]

Another way to verify this is to run the following command:

root@kube-master:~# kubectl describe node kube-worker1

[Screenshot: kubectl describe node output showing the assigned label]
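You can also filter the node list by the label itself. A quick check, assuming the workload=prod label assigned above:

```shell
# List only the nodes that carry the workload=prod label
kubectl get nodes -l workload=prod
```

If the labeling worked, kube-worker1 is the only node in the output.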

Step 2: Schedule a Pod using required nodeSelector

In this step we will configure the Pod manifest with a nodeSelector field so that the Pod gets scheduled onto the Node of our choice, which is kube-worker1 in our case.

root@kube-master:~/nodeSelector# cat nodeSelector.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nodeselector-demo
  labels:
    env: prod
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  nodeSelector:
    workload: prod
  • Let's apply our Pod configuration.
root@kube-master:~/nodeSelector# kubectl apply -f nodeSelector.yaml
pod/nodeselector-demo created
  • Verify that it worked and the Pod got scheduled onto the Node it was assigned to.
root@kube-master:~/nodeSelector# kubectl get pods -o wide

[Screenshot: nodeselector-demo Pod running on kube-worker1]

Along with the labels we assign manually, all Nodes carry some built-in labels, such as:

  • kubernetes.io/hostname
  • topology.kubernetes.io/zone
  • topology.kubernetes.io/region
  • failure-domain.beta.kubernetes.io/zone (deprecated)
  • failure-domain.beta.kubernetes.io/region (deprecated)
  • node.kubernetes.io/instance-type
  • beta.kubernetes.io/instance-type (deprecated)
  • kubernetes.io/os
  • kubernetes.io/arch
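These built-in labels can be used in a nodeSelector without any manual labeling. A minimal sketch (the Pod name is hypothetical) pinning a Pod to Linux nodes:

```yaml
# Sketch: nodeSelector using a built-in label set by the kubelet
apiVersion: v1
kind: Pod
metadata:
  name: linux-only-demo      # hypothetical name
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  nodeSelector:
    kubernetes.io/os: linux  # built-in label, present on every Linux node
```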

Affinity and anti-affinity

Along with nodeSelector, Kubernetes has the affinity/anti-affinity feature, which greatly expands the types of constraints you can express while configuring resources.

Using the affinity feature gives you the following benefits:

  • You can mark rules as "soft"/"preference" rather than hard requirements. By doing so we ensure that even if the scheduler cannot satisfy the rules, the Pod will still be scheduled.

  • The affinity/anti-affinity language is more expressive: besides exact matches combined with a logical AND, you can use set-based matching rules.

The affinity feature consists of two kinds: node affinity and inter-pod affinity/anti-affinity.

Node affinity

Conceptually it does the same job as nodeSelector but in a more expressive manner.

There are two types of node affinity:

  • Hard type :

requiredDuringSchedulingIgnoredDuringExecution

  • Soft type :

preferredDuringSchedulingIgnoredDuringExecution

Let us understand them with examples.

Schedule a Pod using required node affinity

Here is our Pod manifest file.

root@kube-master:~/affinity# cat required_affinity.yaml
apiVersion: v1
kind: Pod
metadata:
  name: required-affinity-demo
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: workload
            operator: In
            values:
            - staging
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent

If you observe the above manifest file, the Pod is configured with requiredDuringSchedulingIgnoredDuringExecution node affinity and a match expression with key workload and value staging. This means that the Pod will only be scheduled on a node that has a workload=staging label.

The node affinity syntax supports the following operators: In, NotIn, Exists, DoesNotExist, Gt, Lt. You can use NotIn and DoesNotExist to achieve node anti-affinity behaviour.
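For instance, a PodSpec fragment like the following (a sketch, reusing the workload label from this tutorial) keeps a Pod away from production nodes:

```yaml
# Sketch: node anti-affinity via the NotIn operator --
# the Pod is only eligible for nodes NOT labeled workload=prod
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: workload
          operator: NotIn   # NotIn / DoesNotExist give anti-affinity behaviour
          values:
          - prod
```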

If you specify both nodeSelector and nodeAffinity, both must be satisfied for the pod to be scheduled onto a candidate node.

If you specify multiple nodeSelectorTerms associated with nodeAffinity types, then the pod can be scheduled onto a node if one of the nodeSelectorTerms can be satisfied.

If you specify multiple matchExpressions within a single nodeSelectorTerm, then the pod can be scheduled onto a node only if all matchExpressions are satisfied.
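The two rules above can be seen side by side in one fragment. A sketch (the label values are only illustrative): the two nodeSelectorTerms are ORed, while the matchExpressions inside the first term are ANDed:

```yaml
# Sketch: a node qualifies if it satisfies term 1 OR term 2
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:            # term 1: workload=staging AND kubernetes.io/os=linux
        - key: workload
          operator: In
          values: ["staging"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
      - matchExpressions:            # term 2 (alternative): workload=preprod
        - key: workload
          operator: In
          values: ["preprod"]
```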

  • Let us apply the manifest to create the Pod:
root@kube-master:~/affinity# kubectl apply -f required_affinity.yaml
pod/required-affinity-demo created
  • Check whether the Pod has been scheduled:
root@kube-master:~/affinity# kubectl get pods -o wide

[Screenshot: required-affinity-demo Pod in Pending state]

If you look at the output above, the Pod has not been scheduled onto any available Node because no node satisfies the required condition, and the Pod remains in the Pending state.

  • Now if we assign the required label to one of our cluster nodes, the above Pod will get scheduled there.
root@kube-master:~/affinity# kubectl label nodes kube-worker2 workload=staging

root@kube-master:~/affinity# kubectl get pods -o wide

[Screenshot: required-affinity-demo Pod running on kube-worker2]

Schedule a Pod using preferred node affinity

Here is our Pod manifest file.

root@kube-master:~/affinity# cat preferred_affinity.yaml
apiVersion: v1
kind: Pod
metadata:
  name: preferred-affinity-demo
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: workload
            operator: In
            values:
            - preprod
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent

If you observe the above manifest file, the Pod is configured with preferredDuringSchedulingIgnoredDuringExecution node affinity and a match expression with key workload and value preprod. This means that the Pod will prefer a node that has a workload=preprod label, but can still run elsewhere.

The weight field above takes a value in the range 1-100. For each node that meets all of the hard scheduling requirements (resource requests, requiredDuringScheduling affinity expressions, and so on), the scheduler computes a sum by iterating through the elements of this field and adding the weight to the sum if the node matches the corresponding matchExpressions. This score is then combined with the scores of other priority functions for the node. The node(s) with the highest total score are the most preferred.
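To make the scoring concrete, here is a sketch with two weighted preferences (the disktype label is a hypothetical example): a node matching both expressions adds 80 + 20 = 100 to its score, a node matching only the first adds 80.

```yaml
# Sketch: two preferences with different weights
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 80                 # strong preference
      preference:
        matchExpressions:
        - key: workload
          operator: In
          values: ["preprod"]
    - weight: 20                 # weak preference (hypothetical label)
      preference:
        matchExpressions:
        - key: disktype
          operator: In
          values: ["ssd"]
```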

root@kube-master:~/affinity# kubectl apply -f preferred_affinity.yaml
pod/preferred-affinity-demo created

[Screenshot: preferred-affinity-demo Pod running on kube-worker2]

As the output above shows, no Node in our cluster has the workload=preprod label, but the Pod still got scheduled onto one of the cluster nodes, in our case kube-worker2.

Inter-pod affinity and anti-affinity

Inter-pod affinity and anti-affinity allow you to constrain which nodes your Pod is eligible to be scheduled on based on the labels of Pods already running on those nodes, rather than the labels of the nodes themselves.

There are two types of pod affinity and anti-affinity:

  • Hard type :

    requiredDuringSchedulingIgnoredDuringExecution

  • Soft type :

preferredDuringSchedulingIgnoredDuringExecution

Inter-pod affinity is specified as field podAffinity of field affinity in the PodSpec. And inter-pod anti-affinity is specified as field podAntiAffinity of field affinity in the PodSpec.

An example manifest file:

root@kube-master:~/affinity# cat pod_affinity_antiaffinity.yaml
apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: env
            operator: In
            values:
            - Prod
        topologyKey: topology.kubernetes.io/zone
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: env
              operator: In
              values:
              - Staging
          topologyKey: topology.kubernetes.io/zone
  containers:
  - name: with-pod-affinity
    image: nginx
    imagePullPolicy: IfNotPresent

In the above Pod manifest file, the pod affinity rule says that the Pod can be scheduled onto a node only if that node is in the same zone as at least one already-running Pod that has a label with key "env" and value "Prod".

The pod anti-affinity rule says that the pod should not be scheduled onto a node if that node is in the same zone as a pod with label having key "env" and value "Staging".
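A common use of pod anti-affinity is spreading replicas across nodes by using kubernetes.io/hostname as the topologyKey, so each node counts as its own topology domain. A sketch (the Deployment name and app=web label are illustrative):

```yaml
# Sketch: each replica repels other pods carrying app=web on the same
# hostname (i.e. the same node), so the 3 replicas land on 3 different nodes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                    # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values: ["web"]
            topologyKey: kubernetes.io/hostname
      containers:
      - name: nginx
        image: nginx
        imagePullPolicy: IfNotPresent
```

With a hard (required) rule like this, a fourth replica would stay Pending on a three-node cluster; switching to the preferred form relaxes that.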

Inter-pod affinity and anti-affinity require a substantial amount of processing, which can significantly slow down scheduling in large clusters. The Kubernetes developers do not recommend using them in clusters larger than several hundred nodes.

Summary

This is all about Kubernetes Node Selectors and Node Affinity.

Hope you liked the tutorial. Stay tuned, and don't forget to provide your feedback in the response section.

Happy Learning!
