🚨 Real Incident

A 500-node cluster started having pods stuck in Pending after a major infrastructure expansion. The team had just added 50 new nodes via the cluster autoscaler. All 50 showed Ready in kubectl get nodes. Yet dozens of pods from three different Deployments remained Pending for over two hours.

The team spent the first 90 minutes staring at resource limits. CPU utilization across existing nodes was around 78 percent. They bumped node instance types. Still Pending. Then someone ran kubectl describe pod and read the Events section carefully. The message was there the whole time:

0/500 nodes are available: 450 Insufficient cpu, 50 node(s) had untolerated taint {node.kubernetes.io/new-node: }.

The autoscaler applied a node.kubernetes.io/new-node:NoSchedule taint during node initialization to prevent scheduling before the node was fully warmed up. The taint was supposed to be removed automatically once the node passed its readiness checks. It wasn't, because of a version mismatch in the autoscaler config. The pods had no matching toleration, so the scheduler correctly rejected all 50 new nodes, and the existing 450 were full. The fix was either a one-line toleration added to the affected Deployments or removal of the stuck taint with kubectl taint node. The team spent two hours on resources before checking taints. This article exists so you don't make the same mistake.
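
For reference, a toleration matching the stuck taint might look like the fragment below. It is a sketch that assumes the taint key from the incident and is meant to sit inside each affected Deployment's pod template. Whether tolerating an initialization taint is safe depends on why the autoscaler applies it in the first place.

```yaml
# Fragment of a Deployment manifest (not a complete object).
# Assumes the taint key reported in the incident above.
spec:
  template:
    spec:
      tolerations:
        - key: "node.kubernetes.io/new-node"
          operator: "Exists"      # match the key regardless of value
          effect: "NoSchedule"
```

Using operator: Exists avoids having to guess the taint's value, which was empty in this case.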

Why You Need to Understand the Scheduler

Most engineers treat the Kubernetes scheduler as a black box. A pod gets created, it shows up running somewhere, done. That model works fine until the day a pod gets stuck in Pending and you have no idea why. At that point, if you don't understand how the scheduler thinks, you're going to waste hours chasing the wrong problem.

The most common cause of pods stuck in Pending is not insufficient resources. It is scheduling constraints the team forgot about: a taint that was never tolerated, a node affinity rule that excludes every node in the cluster, a topology spread constraint with DoNotSchedule that can't be satisfied because one AZ is down. Resources are easy to see — CPU and memory numbers are right there in dashboards. Scheduling constraints are invisible unless you know where to look.

This article covers the complete Kubernetes scheduling pipeline: how the scheduler filters nodes, how it ranks the survivors, what makes a pod unschedulable, and how to debug any Pending pod in under five minutes once you know the process.

What the Scheduler Actually Does

The Kubernetes scheduler (kube-scheduler) does one thing: it assigns pods to nodes. That is the entirety of its job. It does not run pods. It does not pull container images. It does not communicate with the container runtime. It watches the API server for pods that have no nodeName set in their spec, runs its decision algorithm, and records the chosen node by creating a Binding through the API server, which sets spec.nodeName on the pod. Then it moves on. The kubelet on the chosen node sees the assignment and actually starts the pod.

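This separation of concerns matters for debugging. If a pod is Pending and has no node assigned (its PodScheduled condition is False), the kubelet is not involved yet and the problem lies entirely in the scheduler's decision logic. If a pod has been scheduled but is not starting, the problem is in the kubelet, the container runtime, or image pulling, not the scheduler. Such a pod can still report the Pending phase while its containers are being created, so check whether a node has been assigned before blaming the scheduler.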

🎯 Interview Tip

Interviewers love asking “what happens when you create a pod?” The correct answer traces the path: API server validation → etcd persistence → scheduler watches for unscheduled pods → scheduler assigns nodeName → kubelet on target node watches for pods assigned to it → kubelet calls CRI to start containers. The scheduler's role is narrow but critical — if it can't find a node, nothing else happens.

The Scheduling Cycle: Filter, Score, Bind

The scheduler processes pods in a priority queue. Each pod goes through a two-phase decision process before being bound to a node.


  ┌─────────────────────────────────────────────────────────────────────────┐
  │                      KUBERNETES SCHEDULING CYCLE                        │
  └─────────────────────────────────────────────────────────────────────────┘

  Pod Created / Updated
         │
         ▼
  ┌─────────────────────────┐
  │   Scheduling Queue      │  Priority-sorted queue of unscheduled pods
  │   (PriorityQueue)       │
  └───────────┬─────────────┘
              │
              ▼
  ┌─────────────────────────────────────────────────────────────────────────┐
  │                          FILTER PHASE                                   │
  │                                                                         │
  │  All Nodes (e.g. 500)                                                   │
  │       │                                                                 │
  │       ├─ NodeUnschedulable      → remove node-1 (cordoned)             │
  │       ├─ NodeResourcesFit       → remove node-2,3 (no CPU/mem)         │
  │       ├─ TaintToleration        → remove node-4..53 (tainted)          │
  │       ├─ NodeAffinity           → remove node-54..80 (wrong labels)    │
  │       ├─ PodTopologySpread      → remove node-81..90 (maxSkew breach)  │
  │       ├─ VolumeBinding          → remove node-91 (PVC zone mismatch)   │
  │       └─ InterPodAffinity       → remove node-92 (conflicting pod)     │
  │                                                                         │
  │  Feasible Nodes: node-100, node-101, node-102  (3 remaining)           │
  └─────────────────────────────────────────────────────────────────────────┘
              │
              ▼
  ┌─────────────────────────────────────────────────────────────────────────┐
  │                          SCORE PHASE                                    │
  │                                                                         │
  │  node-100:  LeastAllocated=85  BalancedAlloc=90  NodeAffinity=0  = 87  │
  │  node-101:  LeastAllocated=60  BalancedAlloc=55  NodeAffinity=25 = 62  │
  │  node-102:  LeastAllocated=72  BalancedAlloc=68  ImageLocality=20 = 73 │
  │                                                                         │
  │  Winner: node-100  (highest combined score)                             │
  └─────────────────────────────────────────────────────────────────────────┘
              │
              ▼
  ┌─────────────────────────┐
  │       BIND PHASE        │  scheduler writes pod.spec.nodeName = node-100
  │   (API Server + etcd)   │  kubelet on node-100 picks up and runs pod
  └─────────────────────────┘

Phase 1 — Filtering

The filter phase runs a series of filter plugins (called predicates in older releases) against candidate nodes. Each plugin is a yes/no gate. If a node fails any plugin, it is removed from the feasible set, and the remaining plugins are skipped for that node. Nodes are evaluated in parallel, and in large clusters the scheduler can stop searching once it has found enough feasible nodes (controlled by percentageOfNodesToScore). Here are the most important filter plugins and what they check.

NodeResourcesFit

Checks whether the node has enough allocatable CPU and memory to satisfy the pod's resource requests (not limits). The scheduler uses the sum of all existing pod requests on the node plus the new pod's requests. If the total exceeds the node's allocatable capacity, the node is filtered out.

Critical nuance: a node can be at 95% CPU utilization but still pass this check if pods on it have low requests. The scheduler only sees requests. This is how clusters get overcommitted in practice.

NodeUnschedulable

Checks node.spec.unschedulable. When you run kubectl cordon node-1, this flag is set to true, and the scheduler immediately stops placing new pods there. Existing pods are not affected — that's what drain is for.
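
As an illustration, an abridged view of a cordoned Node object might look like the following. The node name is taken from the example above, most fields are omitted, and the exact output depends on your Kubernetes version.

```yaml
# Abridged Node object after `kubectl cordon node-1`
apiVersion: v1
kind: Node
metadata:
  name: node-1
spec:
  unschedulable: true                          # set by cordon
  taints:
    - key: node.kubernetes.io/unschedulable    # added by the control plane
      effect: NoSchedule
```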

TaintToleration

Compares the node's taints against the pod's tolerations. A taint is a key-value-effect triple applied to a node. A toleration is a matching declaration on the pod. If the node has a taint with effect NoSchedule and the pod has no matching toleration, the node is filtered out. This is the plugin that caused the incident in the opening. We cover this in depth below.

NodeAffinity

Evaluates the pod's spec.affinity.nodeAffinity rules and any nodeSelector field. If the pod has a requiredDuringSchedulingIgnoredDuringExecution rule that the node doesn't satisfy, the node is filtered out. Preferred affinity rules are not evaluated in the filter phase; they show up later in scoring.

PodTopologySpread

Evaluates topologySpreadConstraints with whenUnsatisfiable: DoNotSchedule. If placing the pod on this node would violate the maxSkew constraint, the node is filtered out. For example, with maxSkew: 1 across zones, if zone A already has one more matching pod than zone B, the scheduler won't place another pod in zone A, because that would raise the skew to 2.

VolumeBinding

Checks whether PersistentVolumeClaims referenced by the pod can be bound to PersistentVolumes available on the node. This is particularly important for storage classes with volumeBindingMode: WaitForFirstConsumer, where the volume is only provisioned in the zone where the pod is being scheduled. If the StorageClass is zone-specific and the node is in the wrong zone, this filter eliminates it.
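
A minimal sketch of such a StorageClass is shown below. The class name and provisioner are placeholders, not values from any real cluster; substitute the CSI driver your platform uses.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: zonal-ssd                         # hypothetical name
provisioner: csi.example.com              # placeholder; use your cluster's CSI driver
volumeBindingMode: WaitForFirstConsumer   # delay binding until a pod is scheduled
reclaimPolicy: Delete
```

With this mode a PVC stays unbound until a pod that uses it is scheduled, so an unbound PVC is not by itself a sign of trouble.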

InterPodAffinity (filtering)

Checks whether placing this pod on the node would violate a required pod affinity or anti-affinity rule, either the new pod's own rules or the anti-affinity rules of pods already running in the topology domain. If an existing pod has a required anti-affinity rule that matches the new pod's labels, the node is eliminated. This is how pod anti-affinity enforces spreading, and how it can trap you during an AZ failure.

Phase 2 — Scoring

Once the filter phase produces a set of feasible nodes, the scoring phase ranks them. Each scoring plugin assigns a score from 0 to 100. Scores are multiplied by a plugin weight and summed. The node with the highest total score wins. If there is a tie, the scheduler picks one at random.

LeastAllocated

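Prefers nodes with lower allocation ratios, measured by requests rather than live usage. A node whose pods request 20% of its allocatable CPU and memory scores higher than one at 80%. In recent releases this is the default scoring strategy of the NodeResourcesFit plugin rather than a standalone plugin. It spreads pods across nodes by default, which is what most clusters want.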
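
In recent Kubernetes releases this preference is expressed as the scoring strategy of the NodeResourcesFit plugin. The sketch below spells out what is, to the best of our knowledge, the default behavior in a scheduler configuration file. You only need a file like this if you want to change it, and managed control planes often do not let you.

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: LeastAllocated    # MostAllocated would bin-pack instead
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
```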

BalancedResourceAllocation

Penalizes nodes where the requested fractions of CPU and memory are significantly imbalanced. A node at 80% CPU and 10% memory scores lower than one at 45% CPU and 40% memory. The goal is to avoid leaving nodes with lots of one resource but none of another, which wastes capacity.

NodeAffinity (scoring)

preferredDuringSchedulingIgnoredDuringExecution rules are evaluated here. Each preferred rule has a weight (1–100). Nodes matching the preference get a score proportional to the weight. Multiple preferred rules accumulate. This is how you express “I prefer memory-optimized nodes but will accept anything” without hard-blocking on the others.

InterPodAffinity

Scores nodes higher if they or nearby nodes (within the topology key) already run pods that match a pod affinity selector. Useful for co-locating related workloads to reduce network latency.

ImageLocality

Gives a higher score to nodes that already have the pod's container images cached locally. For large images (>1 GB) this can meaningfully reduce pod startup time by avoiding a registry pull. The score grows with the total size of the pod's images already present on the node.

⚡ Production Tip

The scoring phase only matters when you have multiple feasible nodes. If your filter phase reduces the candidate set to one node, scoring is irrelevant — the pod goes there regardless of score. Over-constrained workloads (multiple required node affinities, strict topology spread, required anti-affinity) often end up with a single feasible node and no real flexibility. That's fine when it's intentional, dangerous when it's accidental.

Taints and Tolerations: The Gatekeeper Model

Taints and tolerations are the mechanism for repelling pods from nodes. A taint says “stay away unless you explicitly declare you're okay with this.” A toleration is that declaration. The model is intentionally asymmetric: you taint nodes, and pods must opt in. This is safer than requiring pods to opt out of everything they can't handle.


  NODE TAINT                               POD TOLERATION
  ──────────────────────────────────────────────────────────────────────────

  key:   gpu                               key:   gpu
  value: "true"           ◄───MATCH──►    value: "true"
  effect: NoSchedule                       effect: NoSchedule
                                           operator: Equal
                          RESULT: Pod CAN be scheduled on this node

  ──────────────────────────────────────────────────────────────────────────

  key:   gpu                               (no toleration)
  value: "true"           ◄──NO MATCH─►
  effect: NoSchedule
                          RESULT: Pod CANNOT be scheduled on this node

  ──────────────────────────────────────────────────────────────────────────

  key:   node.kubernetes.io/not-ready      (no value needed with operator: Exists)
  effect: NoExecute        ◄──MATCH──►    key:   node.kubernetes.io/not-ready
                                           operator: Exists
                                           tolerationSeconds: 300
                          RESULT: Pod tolerates NotReady for 300 s then evicted

Taint Effects

Effect	New Pods	Existing Pods	Use Case
NoSchedule	Blocked (hard)	Not affected	Dedicated GPU nodes, spot instances, maintenance prep
PreferNoSchedule	Avoided (soft)	Not affected	Soft isolation, overflow capacity
NoExecute	Blocked (hard)	Evicted (unless tolerated)	Node draining, NotReady/Unreachable handling

System-Applied Taints

Kubernetes applies certain taints automatically in response to node conditions. These are prefixed with node.kubernetes.io/ and you will see them constantly in production:

node.kubernetes.io/not-ready — node's Ready condition is False
node.kubernetes.io/unreachable — node controller cannot reach the node
node.kubernetes.io/memory-pressure — node is reporting memory pressure
node.kubernetes.io/disk-pressure — node is reporting disk pressure
node.kubernetes.io/pid-pressure — node is running low on process IDs
node.kubernetes.io/unschedulable — node has been cordoned
node.kubernetes.io/network-unavailable — node's network is not configured

DaemonSet pods automatically get tolerations for not-ready and unreachable (along with several other node-condition taints) because they need to run on every node regardless of health conditions. If you write a DaemonSet for logging or monitoring, you do not need to add these tolerations manually; the DaemonSet controller adds them.

NoExecute and Eviction

NoExecute is the most powerful taint effect because it affects running pods, not just new ones. When a node becomes NotReady, Kubernetes automatically adds a node.kubernetes.io/not-ready:NoExecute taint. Pods with no toleration for this taint are evicted right away. In practice most pods are not, because the DefaultTolerationSeconds admission plugin gives every pod that does not declare its own a toleration for not-ready and unreachable with tolerationSeconds: 300, so eviction starts after five minutes. Pods that set their own tolerationSeconds value are evicted after that many seconds.

This is the mechanism behind Kubernetes's node failure handling. It's not magic — it's taints and tolerations working together with the node lifecycle controller.

⚠️ Common Mistake

Adding a toleration with no tolerationSeconds for NoExecute means the pod will never be evicted from a NotReady node. This is correct for certain DaemonSets and stateful workloads, but disastrous for stateless replicas. If an entire AZ goes down and your pods have an unbounded not-ready:NoExecute toleration, those pod slots are never freed, and Kubernetes won't reschedule replacements because it thinks the original pods are still running.
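
For a stateless service, a bounded toleration might look like the fragment below. The 60-second value is purely illustrative, not a recommendation. Choose it from how quickly you want replacements scheduled versus how often your nodes flap.

```yaml
# Fragment of a pod template for a stateless service
tolerations:
  - key: "node.kubernetes.io/not-ready"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 60     # evicted 60 s after the taint appears
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 60
```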

Production Pattern: Dedicated Node Pools

The standard way to create dedicated node pools in Kubernetes is the taint-and-toleration pattern. You taint all nodes in the pool with a meaningful key-value pair and NoSchedule. Only pods that need those nodes add the matching toleration. Everything else is blocked.

# Step 1: Taint all GPU nodes so only GPU workloads land on them
kubectl taint nodes node-gpu-1 node-gpu-2 node-gpu-3 \
  hardware=gpu:NoSchedule

# Step 2: Label GPU nodes for affinity matching
kubectl label nodes node-gpu-1 node-gpu-2 node-gpu-3 \
  hardware=gpu \
  accelerator=nvidia-a100

---
# Step 3: GPU workload pod spec
apiVersion: v1
kind: Pod
metadata:
  name: model-training-job
spec:
  tolerations:
    - key: "hardware"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"

  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: hardware
                operator: In
                values:
                  - gpu
              - key: accelerator
                operator: In
                values:
                  - nvidia-a100

  containers:
    - name: trainer
      image: pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime
      resources:
        requests:
          nvidia.com/gpu: "1"
          cpu: "8"
          memory: "32Gi"
        limits:
          nvidia.com/gpu: "1"
          cpu: "16"
          memory: "64Gi"

⚡ Production Tip

Always pair a taint with a node affinity rule when creating dedicated pools. The taint keeps unwanted pods off the GPU nodes, but a toleration only permits scheduling onto tainted nodes; it does not attract the pod to them. Without the affinity rule, your GPU workload can still land on regular nodes. Defense in depth: the taint keeps others out, and the affinity rule keeps your workload where it belongs.

Node Affinity: Where Pods Want to Run

Node affinity lets pods express preferences or requirements about which nodes they land on, based on node labels. It is the more powerful successor to nodeSelector. Understanding the difference between the two affinity types and when to use each is essential.

Type	Phase	Behavior	Pod Stuck if Unmet?
nodeSelector	Filter	Hard requirement, simple equality	Yes
requiredDuringSchedulingIgnoredDuringExecution	Filter	Hard requirement, complex expressions	Yes
preferredDuringSchedulingIgnoredDuringExecution	Score	Soft preference with weight, will schedule elsewhere	No

IgnoredDuringExecution: What It Means

Both current affinity types end in IgnoredDuringExecution. This means: once a pod is running, if the node's labels change and the pod's affinity rule would no longer match, Kubernetes does not evict the pod. The rule is only enforced at scheduling time.

The counterpart, requiredDuringSchedulingRequiredDuringExecution, would evict pods when their affinity is violated at runtime. This feature was planned but has never shipped in stable form. Don't design systems around it.

Required Affinity: Hard Rules

Use requiredDuringSchedulingIgnoredDuringExecution when a pod genuinely cannot run without certain node characteristics. GPU workloads need GPU nodes. Workloads that read from local NVMe need nodes with those disks. Security-sensitive workloads need nodes in a compliant pool. If no matching node exists, the pod stays Pending — and that is correct behavior. It is better for the pod to wait than to run in an environment that cannot support it.

Preferred Affinity: Soft Preferences

Use preferredDuringSchedulingIgnoredDuringExecution when you want to express a preference without making it a hard requirement. The weight field (1–100) controls how strongly the scheduler favors matching nodes. You can stack multiple preferred rules with different weights. The scheduler normalizes scores so the highest-weight matches have more pull.

This is particularly useful for cost optimization: prefer on-demand nodes in a first-tier weight, prefer spot nodes as a lower-weight fallback, and let the scheduler sort it out based on capacity.
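
A sketch of that two-tier preference is shown below. The label key example.com/capacity-type and its values are hypothetical; real clusters expose capacity type under provider-specific labels, so check what your nodes actually carry.

```yaml
# Fragment of a pod spec; the label key and values are hypothetical
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80                # first tier: on-demand nodes
        preference:
          matchExpressions:
            - key: example.com/capacity-type
              operator: In
              values: ["on-demand"]
      - weight: 20                # fallback tier: spot nodes
        preference:
          matchExpressions:
            - key: example.com/capacity-type
              operator: In
              values: ["spot"]
```

Because both rules are preferences, the pod still schedules on a node matching neither if nothing better is feasible.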

Pod Affinity and Anti-Affinity

Where node affinity is about pod-to-node relationships, pod affinity and anti-affinity express pod-to-pod relationships: “schedule me near pods with these labels” or “do not schedule me near pods with these labels.” The topology key defines what “near” means.

Pod Affinity: Co-location

Pod affinity schedules a pod onto a node where pods matching a label selector already exist within the same topology domain. Common use case: a web server that calls a caching layer over a low-latency network path benefits from being on the same node or in the same rack as the cache.
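
As a sketch, the web server's pod spec could express this with a required pod affinity term. The app: cache label is an assumption about how the cache pods are labeled.

```yaml
# Fragment of the web server's pod spec; the app=cache label is hypothetical
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: cache
        topologyKey: kubernetes.io/hostname   # "near" means the same node
```

Switching topologyKey to topology.kubernetes.io/zone would relax "near" to mean the same zone.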

Pod Anti-Affinity: Spreading

Pod anti-affinity prevents pods from being co-located within a topology domain. The most common production use case is HA: you want replicas of the same Deployment spread across different nodes (kubernetes.io/hostname) or different availability zones (topology.kubernetes.io/zone).


  ZONE SPREAD WITH POD ANTI-AFFINITY  (topologyKey: topology.kubernetes.io/zone)

  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐
  │     us-east-1a   │  │     us-east-1b   │  │     us-east-1c   │
  │                  │  │                  │  │                  │
  │  ┌────────────┐  │  │  ┌────────────┐  │  │  ┌────────────┐  │
  │  │  api-pod-1 │  │  │  │  api-pod-2 │  │  │  │  api-pod-3 │  │
  │  └────────────┘  │  │  └────────────┘  │  │  └────────────┘  │
  │                  │  │                  │  │                  │
  │  anti-affinity   │  │  anti-affinity   │  │  anti-affinity   │
  │  blocks 2nd pod  │  │  blocks 2nd pod  │  │  blocks 2nd pod  │
  └──────────────────┘  └──────────────────┘  └──────────────────┘

  With required anti-affinity:
  - AZ failure (us-east-1b goes down) → api-pod-2 needs to reschedule
  - us-east-1a already has api-pod-1  → anti-affinity blocks it
  - us-east-1c already has api-pod-3  → anti-affinity blocks it
  - RESULT: api-pod-2 stays Pending forever ← THE INCIDENT

⚠️ Common Mistake

Using requiredDuringSchedulingIgnoredDuringExecution for pod anti-affinity across zones is dangerous for small replica counts. With 3 replicas across 3 zones, if one zone fails, the evicted pod cannot reschedule because both remaining zones already have a pod, and the required anti-affinity blocks them. The pod stays Pending until the failed zone recovers. Use preferredDuringSchedulingIgnoredDuringExecution or TopologySpreadConstraints instead for most workloads.
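
The preferred form might be written as in the fragment below. The app: api label is a stand-in for your own pod labels, and the weight of 100 is simply the maximum allowed value.

```yaml
# Fragment of a pod spec; the app=api label is hypothetical
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: api
          topologyKey: topology.kubernetes.io/zone
```

With this form the scheduler still favors an empty zone, but a displaced replica can land next to a sibling when no empty zone is left.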

Performance Warning at Scale

Pod affinity and anti-affinity are computationally expensive. Each scheduling decision has to compare the labels and affinity rules of existing pods against the candidate nodes, so the cost grows with both pod count and node count. The Kubernetes documentation warns that these rules can slow scheduling significantly in large clusters and advises against using them in clusters larger than several hundred nodes. If you have large clusters, prefer TopologySpreadConstraints, which achieve the same goal more efficiently.

🚨 Real Incident: AZ Failure with Required Anti-Affinity

A production API had 3 replicas, each in a different AZ, enforced by a required pod anti-affinity rule with topologyKey: topology.kubernetes.io/zone. During a cloud provider incident, us-east-1b became unavailable. The pod in us-east-1b was evicted. The scheduler tried to reschedule it. us-east-1a already had a replica, so anti-affinity blocked it. us-east-1c already had a replica, so anti-affinity blocked it there too. There was no remaining zone. The pod sat Pending for four hours until the AZ recovered. The service ran at 66% capacity the entire time with no autoscaling relief possible because the replica count couldn't increase (adding a replica hit the same anti-affinity wall). The fix: downgrade anti-affinity from required to preferred and switch to TopologySpreadConstraints with ScheduleAnyway as the fallback policy.

Topology Spread Constraints

TopologySpreadConstraints are the modern, purpose-built replacement for complex anti-affinity rules. They solve the same problem — distributing pods across failure domains — with cleaner semantics, better performance, and a configurable fallback policy when the ideal spread cannot be achieved.

Key Fields

maxSkew: the maximum allowed difference in pod count between any two topology domains. maxSkew: 1 means no zone can have more than one extra pod compared to any other zone.
topologyKey: the node label used to define topology domains. Use topology.kubernetes.io/zone for AZ spread and kubernetes.io/hostname for per-node spread.
whenUnsatisfiable: what to do when the constraint cannot be satisfied. DoNotSchedule keeps the pod Pending (hard constraint). ScheduleAnyway allows scheduling but still tries to minimize skew (soft constraint).
labelSelector: which pods to count when evaluating the spread. Typically matches the Deployment's pod template labels.

# Spread a Deployment across 3 AZs with maxSkew of 1
# This means: at most 1 pod difference between any two zones
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend
  namespace: production
spec:
  replicas: 9
  selector:
    matchLabels:
      app: web-frontend
  template:
    metadata:
      labels:
        app: web-frontend
    spec:
      topologySpreadConstraints:
        # Primary constraint: spread across AZs
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: web-frontend
          # minDomains: 3  # require at least 3 zones (k8s 1.25+)

        # Secondary constraint: prefer an even spread across nodes (soft)
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: web-frontend

      containers:
        - name: frontend
          image: myregistry/web-frontend:v1.8.0
          resources:
            requests:
              cpu: "250m"
              memory: "256Mi"

⚡ Production Tip

For production workloads spanning 3 AZs, use whenUnsatisfiable: DoNotSchedule on the zone-level spread constraint only if you have enough replicas (at least 3) and the cluster is sized to handle a full AZ loss. If you have 3 replicas and might lose an AZ, use ScheduleAnyway: it degrades gracefully instead of blocking rescheduling. Reserve DoNotSchedule for the per-node constraint where the failure mode is less catastrophic.
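
For a small replica count, the soft zone-level variant might look like the fragment below. The app: api label is a hypothetical placeholder for your pod template labels.

```yaml
# Fragment of a pod template; the app=api label is hypothetical
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway   # prefer balance, never block scheduling
    labelSelector:
      matchLabels:
        app: api
```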

TopologySpread vs Anti-Affinity

The key advantage of TopologySpreadConstraints over pod anti-affinity is the fallback behavior. Anti-affinity with required has no fallback: if the constraint can't be satisfied, the pod stays Pending. TopologySpreadConstraints with ScheduleAnyway will schedule the pod in the best available location even if perfect balance isn't possible. Use TopologySpreadConstraints by default and reserve anti-affinity for cases where co-location is genuinely a correctness concern, not just a preference.

Priority Classes and Preemption

When the cluster is full and a high-priority pod needs to schedule, the scheduler can evict lower-priority pods to make room. This is preemption, and it is controlled by PriorityClass resources.

# platform-critical: reserved for core cluster infrastructure you operate
# (user-defined classes cannot exceed 1000000000 or use the system- name prefix)
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: platform-critical
value: 1000000000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "Core cluster infrastructure. Do not use for application workloads."

---
# production-high: critical production services
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production-high
value: 1000000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "Production services with SLA requirements."

---
# production-standard: normal production workloads
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production-standard
value: 100000
globalDefault: true
preemptionPolicy: PreemptLowerPriority
description: "Default for production workloads. Applied if no priorityClassName is set."

---
# development: batch jobs, dev/staging environments
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: development
value: 1000
globalDefault: false
preemptionPolicy: Never
description: "Development and batch jobs. Never preempts other pods."

---
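# Example usage in a Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
  namespace: production
spec:
  selector:                # required: must match the template labels
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
    spec:
      priorityClassName: production-high
      containers:
        - name: payment-service
          image: myregistry/payment:v3.1.0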

How Preemption Works

When a pod cannot be scheduled because no node passes the filter phase, the scheduler checks whether evicting lower-priority pods on any node would make room for it. If so, it selects a set of victims (minimizing disruption), deletes them with their normal graceful termination period, and records the chosen node in the pending pod's status.nominatedNodeName. The nominated pod does not immediately jump to that node. It re-enters the scheduling cycle and is normally placed there once the victims have terminated, although the scheduler may pick a different node if one becomes available first.
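
While the victims are terminating, the preempting pod's status might look roughly like the abridged sketch below. The node name is illustrative and most fields are omitted.

```yaml
# Abridged status of a pod waiting on preemption (node name is illustrative)
status:
  phase: Pending
  nominatedNodeName: node-17
  conditions:
    - type: PodScheduled
      status: "False"
      reason: Unschedulable
```

A Pending pod with a nominatedNodeName is usually a sign that preemption is in progress rather than that scheduling has failed outright.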

⚠️ Common Mistake

The most dangerous priority class anti-pattern is giving development or batch jobs a higher priority value than production services. This seems absurd but happens because teams inherit a cluster where globalDefault: true was set on a high-priority class, and all new workloads get that priority unless explicitly set otherwise. A development team deploys a batch job with the default priority. The cluster fills up. The batch job preempts a production payment service. Set globalDefault: true only on your standard production priority class, and explicitly set priorityClassName: development on all batch and non-production workloads.

🚨 Real Incident: Dev Jobs Preempting Production

A data engineering team added a priority class for their nightly ETL jobs and set the value to 10000000 because they wanted their jobs to be “really high priority.” The production PriorityClass was set to 1000000. On a Tuesday morning when the cluster was at 87% utilization, the ETL jobs started running and preempted 14 production API pods. Latency spiked. PagerDuty fired. The on-call engineer spent 20 minutes trying to understand why production pods were in ContainerCreating while the cluster appeared to have capacity. Once the priority class values were discovered and corrected, recovery was immediate. The lesson: document priority class values, make them a required review item for any new workload onboarding, and use a clear numeric hierarchy: system at 2B, prod at 1M, staging at 100K, dev/batch at 1K or below.

System-Reserved Priority Classes

Kubernetes reserves priority values above 1,000,000,000 (1 billion) for system components. The built-in classes system-cluster-critical (2,000,000,000) and system-node-critical (2,000,001,000) are used by components like CoreDNS and kube-proxy. Never set application workloads above 1 billion. The API server rejects any user-defined PriorityClass with a value above this threshold, and the system- name prefix is reserved for the built-in classes.

Resource Requests and Scheduling

The scheduler only uses resource requests for scheduling decisions. Limits are enforced by the kubelet and container runtime at runtime, but the scheduler ignores them entirely when choosing a node. This is one of the most commonly misunderstood aspects of Kubernetes resource management.

The practical implication: a pod with requests.cpu: 100m and limits.cpu: 4000m occupies only 100m of CPU in the scheduler's accounting. Ten such pods consume just 1 CPU of a node's allocatable capacity on paper, yet together they can try to burst to 40 cores under load. This is how clusters get overcommitted: intentionally on CPU (which is compressible) and dangerously on memory (which is not).

⚠️ Common Mistake

Setting memory limits much higher than requests is dangerous. If multiple pods on a node all burst their memory usage simultaneously, the OOM killer starts terminating processes. The scheduler has no way to predict this because it only sees requests. For memory, requests and limits should be close together (or equal for critical services). For CPU, a 2–4x difference between requests and limits is reasonable because CPU throttling is graceful; OOM is not.
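
Applied to a single container, that guidance might look like the fragment below. The numbers are illustrative only; size them from your workload's measured usage.

```yaml
# Fragment of a container spec; the numbers are illustrative
resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
  limits:
    cpu: "2"          # 4x the request: excess CPU use is throttled
    memory: "1Gi"     # equal to the request: no memory overcommit
```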

LimitRange for Sane Defaults

In namespaces where teams deploy workloads without setting resource requests, use a LimitRange to inject defaults. Without requests, the filter phase treats the pod as if it needs zero resources, so it can be packed onto an already-full node, where it is among the first candidates for eviction or an OOM kill when memory runs short. A LimitRange with sensible defaults (e.g., 100m CPU, 128Mi memory) prevents this failure mode.
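
A minimal LimitRange along those lines is sketched below. The object name, namespace, and default limits are assumptions for illustration. Only the 100m CPU and 128Mi memory request defaults come from the text above.

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-requests       # hypothetical name
  namespace: team-sandbox      # hypothetical namespace
spec:
  limits:
    - type: Container
      defaultRequest:          # applied when a container sets no requests
        cpu: 100m
        memory: 128Mi
      default:                 # applied when a container sets no limits
        cpu: 500m
        memory: 256Mi
```

The defaults are applied at admission time, so pods that already exist in the namespace are not changed until they are recreated.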

Debugging Pending Pods: A Systematic Approach

When a pod is stuck in Pending, follow this process in order. Do not skip steps. The answer is almost always in the first two.

🔍 Troubleshooting Tip

The Events section of kubectl describe pod is your primary diagnostic tool. It contains the scheduler's exact rejection message. A message like “0/500 nodes are available: 50 node(s) had untolerated taint, 450 node(s) had insufficient cpu” tells you exactly what happened. Read it before doing anything else.

# Step 1: Get the Events section — this is where the answer lives
kubectl describe pod <pod-name> -n <namespace>

# Look for lines like:
#   Warning  FailedScheduling  0/50 nodes are available:
#     10 node(s) had untolerated taint {dedicated: gpu},
#     25 node(s) didn't match Pod's node affinity/selector,
#     15 node(s) had insufficient cpu.

# Step 2: Check node capacity and allocations
kubectl describe node <node-name> | grep -A 10 "Allocated resources"

# Step 3: List all taints on nodes
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

# Step 4: Check if PVC is bound (if pod uses PersistentVolumeClaim)
kubectl get pvc -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace>

# Step 5: Inspect the PodScheduled condition (its reason and message repeat the scheduler's verdict)
kubectl get pod <pod-name> -n <namespace> -o json | jq '.status.conditions'

# Step 6: Check priority class assignment
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.priorityClassName}'

# Step 7: Scheduler logs (if you have cluster-admin access)
kubectl logs -n kube-system -l component=kube-scheduler --tail=100 | grep <pod-name>

Reading the Scheduler's Rejection Message

The scheduler's rejection message in the Events section always follows the pattern: N/M nodes are available: X node(s) had [reason], Y node(s) had [reason]... (Recent scheduler versions print resource shortages without the node(s) prefix, for example 15 Insufficient cpu.) The reasons map directly to filter plugins:

insufficient cpu / insufficient memory → NodeResourcesFit
untolerated taint → TaintToleration
didn't match Pod's node affinity → NodeAffinity (covers both node affinity and nodeSelector)
didn't match pod topology spread constraints → PodTopologySpread
node(s) had volume node affinity conflict → VolumeBinding
didn't match pod anti-affinity rules → InterPodAffinity (pod anti-affinity)
node(s) were unschedulable → NodeUnschedulable (cordoned)

Full Deployment Example

This is a production-grade Deployment spec combining tolerations, required and preferred node affinity, preferred pod anti-affinity, and topology spread constraints.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
        tier: backend
    spec:
      # ── Tolerations: allow scheduling on dedicated compute nodes ──
      tolerations:
        - key: "workload-type"
          operator: "Equal"
          value: "compute"
          effect: "NoSchedule"

      # ── Node Affinity: require prod nodes, prefer high-memory ────
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: environment
                    operator: In
                    values:
                      - production
                  - key: kubernetes.io/arch
                    operator: In
                    values:
                      - amd64
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 80
              preference:
                matchExpressions:
                  - key: node-class
                    operator: In
                    values:
                      - memory-optimized
            - weight: 20
              preference:
                matchExpressions:
                  - key: node-class
                    operator: In
                    values:
                      - compute-optimized

        # ── Pod Anti-Affinity: prefer one pod per zone (soft rule) ───
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - api-server
                topologyKey: topology.kubernetes.io/zone

      # ── Topology Spread: enforce even spread across zones ─────────
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: api-server

      containers:
        - name: api-server
          image: myregistry/api-server:v2.4.1
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "2000m"
              memory: "2Gi"

Custom Schedulers and Scheduler Extenders

Kubernetes supports running multiple schedulers simultaneously. Pods explicitly name the scheduler they want via spec.schedulerName. If unset, the default kube-scheduler handles them. Custom schedulers are useful for specialized workloads: ML frameworks that need gang scheduling (all pods start or none do), batch systems with custom bin-packing strategies, or hardware-aware placement for specialized accelerators.
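
Opting a pod into a non-default scheduler is a one-field change. In this sketch the scheduler name gang-scheduler, the pod name, and the image are all hypothetical; the name must match a scheduler that is actually running in the cluster, or the pod will sit in Pending.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-worker-0        # hypothetical name
spec:
  schedulerName: gang-scheduler  # hypothetical; kube-scheduler ignores pods that name another scheduler
  containers:
    - name: worker
      image: registry.example.com/trainer:1.0.0   # placeholder image
      resources:
        requests:
          cpu: "4"
          memory: "16Gi"
```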

Scheduler extenders are external services that the built-in scheduler calls over HTTP during the filter and score phases. They allow custom logic without replacing the entire scheduler. A common use case is a cluster with custom hardware labels that an external inventory system manages. The extender can query the inventory system during scheduling to validate hardware availability.

The scheduler framework (introduced in Kubernetes 1.15 and stabilized by 1.19) is the preferred modern approach to scheduler customization. It exposes well-defined extension points at each phase (PreFilter, Filter, PostFilter, PreScore, Score, Reserve, Permit, Bind) that plugins can implement without forking the scheduler binary.

🎯 Interview Tip

If asked about custom schedulers, mention the scheduler framework extension points. Knowing the difference between a scheduler extender (out-of-process HTTP call during scheduling) and a scheduler plugin (in-process, implements the framework interface) demonstrates production-level knowledge. Extenders add latency to every scheduling decision; plugins do not.

Common Mistakes

Looking at resource utilization instead of resource requests when pods are Pending.
Not reading the Events section of kubectl describe pod before investigating elsewhere.
Using required pod anti-affinity across zones with as many replicas as AZs (or more), blocking rescheduling during AZ failures.
Setting node affinity with an operator of In but using a label that doesn't exist on any node.
Forgetting that taints applied by the cluster autoscaler during node initialization may not be removed automatically.
Setting memory limits much higher than requests, enabling silent overcommitment.
Not setting resource requests at all — pods get packed onto full nodes and OOM-killed.
Setting all workloads to the same high priority class, making preemption meaningless.
Using a high priority value for development or batch jobs without realizing they will preempt production.
Applying NoExecute tolerations without tolerationSeconds, blocking eviction and rescheduling during node failures.
Mixing DoNotSchedule topology spread with an insufficient number of AZ replicas, causing Pending pods during an AZ event.
Using nodeSelector and affinity rules simultaneously without realizing both must be satisfied.
Not labeling new nodes before deploying workloads with affinity rules that require those labels.
Applying pod anti-affinity to singleton pods (one replica), which forces that pod to a node with no matching pods — which is every node, so the rule has no effect.
Using pod affinity across different namespaces without setting the namespaces field, causing the rule to match nothing.
Running large-scale clusters with required pod anti-affinity on high-replica Deployments, causing severe scheduler performance degradation.

Interview Questions and Answers

Beginner

Q: What is the difference between a taint and a toleration?

A taint is applied to a node and repels pods that don't explicitly tolerate it. A toleration is applied to a pod and allows it to be scheduled on nodes with matching taints. Taints are the lock; tolerations are the key. A pod with a toleration can still be scheduled on untainted nodes — tolerations don't restrict, they only permit.

Q: Why is my pod in Pending state?

Run kubectl describe pod <name> and read the Events section. The scheduler writes its rejection reason there. Common causes: insufficient CPU or memory on all nodes, a taint on all nodes that the pod doesn't tolerate, a node affinity rule that no node satisfies, a topology spread constraint that can't be met, or a PVC that can't be bound.

Q: What does `kubectl cordon` do?

It sets node.spec.unschedulable = true on the node, which causes the scheduler to stop placing new pods there. Existing pods continue running. It is used to prepare a node for maintenance. kubectl drain cordons the node and also evicts existing pods.

Q: What is the difference between resource requests and limits?

Requests are what the scheduler uses to find a node with enough capacity. They represent the guaranteed minimum the pod needs. Limits are the maximum the container is allowed to use; exceeding the CPU limit causes throttling, exceeding the memory limit causes an OOM kill. The scheduler ignores limits entirely.

Q: What is a PriorityClass?

A cluster-scoped resource that assigns a numeric priority value to pods. Pods with higher priority can preempt (evict) lower-priority pods when the cluster is full and a high-priority pod needs to schedule. The built-in system-cluster-critical and system-node-critical classes have the highest values and protect core cluster infrastructure.

Intermediate

Q: What is the difference between `required` and `preferred` node affinity?

requiredDuringSchedulingIgnoredDuringExecution is a hard constraint evaluated in the filter phase. If no node matches, the pod stays Pending. preferredDuringSchedulingIgnoredDuringExecution is a soft preference evaluated in the scoring phase. If no node matches the preference, the scheduler picks the best available node anyway. Use required for correctness (e.g., GPU workloads must have GPUs), preferred for optimization (e.g., prefer cheaper instance types).

Q: Why should you avoid required pod anti-affinity for small replica counts?

With N replicas and required anti-affinity across zones, you need at least N healthy zones. If a zone fails, the evicted pod cannot reschedule because all remaining zones already have a replica that the anti-affinity rule forbids. The pod stays Pending until the zone recovers. Use preferred anti-affinity or TopologySpreadConstraints with whenUnsatisfiable: ScheduleAnyway for resilient degraded behavior.
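
The ScheduleAnyway alternative could be sketched as follows. The Deployment name, labels, image, and request sizes are assumptions made for illustration.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                      # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway   # soft: skew lowers a node's score but never blocks placement
          labelSelector:
            matchLabels:
              app: web
      containers:
        - name: web
          image: registry.example.com/web:1.0.0   # placeholder image
          resources:
            requests:
              cpu: "250m"
              memory: "256Mi"
```

With three healthy zones this tends toward one replica per zone. If a zone disappears, the displaced replica can still land in one of the remaining zones instead of waiting in Pending.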

Q: How does the scheduler handle a full cluster?

If no node can fit the pod, the pod stays in the scheduling queue. If the pod has a higher priority than existing pods, the scheduler's PostFilter phase (preemption) searches for nodes where evicting lower-priority pods would free enough resources. If found, it nominates that node, evicts the lower-priority pods, and re-queues the high-priority pod. If no preemption candidate exists, the pod waits until capacity is available (e.g., a node is added or an existing pod is deleted).

Q: What is `topologySpreadConstraints` and how does it differ from pod anti-affinity?

TopologySpreadConstraints explicitly control the maximum skew (pod count difference) between topology domains. They are more declarative, more performant, and support a ScheduleAnyway fallback that anti-affinity lacks. Anti-affinity uses binary yes/no logic per node, while topology spread uses a counting model across domains. For distributing replicas across AZs, topology spread is almost always the right choice.

Q: How does the scheduler use resource requests vs actual usage?

The scheduler only uses requests, not actual usage. It maintains an in-memory model of each node's allocated capacity by summing the requests of all scheduled pods. It does not query metrics-server or node utilization data. This means a cluster can appear “schedulable” while being dangerously overloaded at runtime. Vertical Pod Autoscaler helps bridge this gap by right-sizing requests based on observed usage.

Advanced

Q: Explain the scheduling framework extension points and when you'd implement each.

The scheduler framework has eight main extension points in order: PreFilter (normalize pod state, fail fast before per-node evaluation), Filter (eliminate infeasible nodes), PostFilter (preemption logic when filter produces no candidates), PreScore (compute state for scoring), Score (rank feasible nodes), Reserve (claim resources optimistically), Permit (hold pod in waiting state for gang scheduling), and Bind (write nodeName to the API). You'd implement Filter and Score for custom node selection logic. Reserve and Permit are used for gang scheduling, where all pods in a group must schedule simultaneously or not at all.

Q: How would you optimize scheduling performance for a 5,000-node cluster?

First, tune percentage-based scoring: the scheduler can stop searching once it has found a configurable percentage of feasible nodes, set via percentageOfNodesToScore. When the field is unset, the scheduler picks an adaptive value that shrinks as the cluster grows, from about 50% at 100 nodes to about 10% at 5,000 nodes, with a floor of 5%. At 5,000 nodes, evaluating 10% (500 nodes) is usually sufficient, so lower it further only if scheduling latency is still a problem. Second, avoid required pod anti-affinity at scale, since it is O(pods × nodes) per scheduling decision. Replace it with topology spread. Third, partition the cluster into multiple node pools with affinity rules so the scheduler evaluates a smaller candidate set per workload type. Fourth, consider running multiple scheduler instances with cluster partitioning if a single instance becomes a bottleneck.

Q: What happens to DaemonSet pods during node initialization and why?

In current Kubernetes versions, the DaemonSet controller creates one pod per eligible node and pins it with a node affinity term that matches that node; the default scheduler then performs the binding. (Older releases had the controller set spec.nodeName directly, bypassing kube-scheduler.) DaemonSet pods are automatically given tolerations for common system taints including node.kubernetes.io/not-ready, node.kubernetes.io/unreachable, node.kubernetes.io/disk-pressure, node.kubernetes.io/memory-pressure, node.kubernetes.io/pid-pressure, and node.kubernetes.io/unschedulable. This ensures that critical DaemonSets (log collectors, monitoring agents) run on nodes even during transient failure states, because those are exactly the conditions when you need them most.

Q: How does preemption interact with PodDisruptionBudgets?

The preemption algorithm takes PodDisruptionBudgets into account when selecting victims, but only on a best-effort basis. The scheduler first looks for victims whose eviction would not violate a PDB (that is, would not push available pods below minAvailable or unavailable pods above maxUnavailable) and prefers nodes where no budget is broken. If no PDB-safe victim set can be found, preemption still happens and lower-priority pods are removed despite their budgets. PDBs are a strong guarantee against voluntary evictions such as kubectl drain, but not against preemption, so protect critical workloads with an appropriate PriorityClass as well.
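
For reference, a PDB is a separate object that selects pods by label. This sketch targets the api-server Deployment from the full example earlier; the object name and the minAvailable value of 2 are assumptions.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb           # hypothetical name
  namespace: production
spec:
  minAvailable: 2                # assumed: with 3 replicas, at most one may be disrupted at a time
  selector:
    matchLabels:
      app: api-server
```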

Q: Describe a scenario where ImageLocality scoring causes unexpected pod placement behavior.

Imagine a Deployment with 10 replicas where the first 3 replicas were all placed on node-A because of prior scheduling decisions. The container image is now cached on node-A. When replica 4 needs to schedule, ImageLocality gives node-A a score boost. If other score signals are close, the pod may land on node-A instead of a less-loaded node. Over time, this can cause subtle clustering: new pods gravitating toward nodes that already ran them, even when you want even distribution. For workloads where spread matters more than startup time, you can disable ImageLocality via the scheduler plugin configuration, or ensure topology spread constraints dominate by using DoNotSchedule mode.
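
Disabling the plugin could look roughly like the profile below. This is a sketch that assumes you control the scheduler's configuration file, which is usually not the case on managed control planes.

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    plugins:
      score:
        disabled:
          - name: ImageLocality  # removes image-cache locality from node scoring in this profile
```

The file is passed to kube-scheduler with the --config flag. Other score plugins in the profile keep their default weights.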

Best Practices

Always read the Events section of kubectl describe pod before investigating resource limits.
Use TopologySpreadConstraints as your default HA spread mechanism instead of required pod anti-affinity.
Set both CPU and memory requests on every container. No exceptions for production workloads.
Keep memory requests close to limits (<2x ratio) to prevent dangerous overcommitment.
Use a LimitRange in every application namespace to enforce default resource requests.
Establish a clear PriorityClass hierarchy and assign it in cluster onboarding, not ad hoc.
Set globalDefault: true on your standard production priority class, not the highest one.
Always pair taints with node affinity rules for dedicated node pools — defense in depth.
Add tolerationSeconds to NoExecute tolerations to allow graceful handling without permanent tolerance.
Label nodes with zone, instance type, hardware class, and environment before deploying workloads that depend on those labels.
Prefer preferredDuringScheduling over requiredDuringScheduling for everything except hard correctness requirements.
Use whenUnsatisfiable: ScheduleAnyway for zone spread in multi-AZ clusters to enable graceful degradation during AZ failures.
Audit all taints on all nodes after autoscaler version upgrades — initialization taints can persist unexpectedly.
Monitor scheduler latency and queue depth as first-class metrics, not just pod count.
Tune percentageOfNodesToScore in the scheduler config for large clusters if the adaptive default does not deliver the scheduling throughput you need.
Use minDomains in TopologySpreadConstraints (k8s 1.25+) to require a minimum number of topology domains before scheduling.
Test your scheduling constraints in a staging cluster before deploying to production — a misspecified affinity rule is silent until nodes are missing.
Document the intended scheduling behavior for every Deployment as part of the service runbook.
Review priority class values as part of any workload onboarding checklist.
Use kubectl get events -A --field-selector reason=FailedScheduling to quickly scan for scheduling failures across the cluster.

FAQ

Can a pod with a toleration be forced onto a tainted node?

Not by the toleration alone. A toleration allows scheduling on a tainted node but does not require it. Once a taint is tolerated, the scheduler treats that node like any other candidate, so the pod may just as well land on an untainted node. To force a pod onto a specific tainted node, you need both a toleration and a required node affinity or nodeSelector matching that node.

What happens if I add a new taint to a node that has running pods without that toleration?

It depends on the taint effect. NoSchedule and PreferNoSchedule only affect new scheduling decisions, so existing pods are not evicted. NoExecute evicts pods that have no matching toleration immediately, and evicts pods whose matching toleration sets tolerationSeconds once that period has elapsed.
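
A time-bounded NoExecute toleration might be written like this. The taint key maintenance=planned, the pod name, and the 300-second window are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cache-warmer             # hypothetical name
spec:
  tolerations:
    - key: "maintenance"         # hypothetical taint key
      operator: "Equal"
      value: "planned"
      effect: "NoExecute"
      tolerationSeconds: 300     # keep running for 5 minutes after the taint appears, then get evicted
  containers:
    - name: app
      image: registry.example.com/cache-warmer:1.0.0   # placeholder image
```

Omitting tolerationSeconds makes the toleration permanent, so the pod stays on the node for as long as the taint is present.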

Can I run multiple schedulers in one cluster?

Yes. You can run custom schedulers alongside kube-scheduler. Pods opt in to a specific scheduler via spec.schedulerName. If the named scheduler is not running, the pod stays Pending indefinitely. The default scheduler only handles pods with no schedulerName or schedulerName: default-scheduler.

Why does the scheduler sometimes place multiple pods on the same node despite anti-affinity?

If you used preferred anti-affinity (not required), the constraint is not enforced; it only lowers the score of nodes that already run matching pods. If the other nodes score lower for other reasons, the scheduler may still place pods together. Use required anti-affinity if co-location is genuinely unacceptable, with the understanding of the failure mode during AZ outages.

How does the cluster autoscaler interact with the scheduler?

The cluster autoscaler watches for Pending pods and determines whether adding a new node would allow them to schedule. It uses the scheduler's filter logic (simulated) to predict whether a new node of a given type would satisfy the pod's constraints. If a pod is Pending due to a taint/toleration mismatch, the autoscaler may spin up a node, apply the same taint, and the pod will still be Pending. Always verify that the pod's constraints are satisfiable by the node pools the autoscaler can provision.

What is the difference between `nodeSelector` and node affinity?

nodeSelector is a simple key-value map that requires an exact label match. It only supports equality. Node affinity supports rich expressions with In, NotIn, Exists, DoesNotExist, Gt, and Lt operators. It also supports soft preferences via preferredDuringScheduling. nodeSelector is still supported and works fine for simple cases. Use affinity when you need complex matching logic or soft preferences.
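
Side by side, the two styles look roughly like this. The disktype label and its values are hypothetical; node-class is borrowed from the Deployment example earlier in the article.

```yaml
# Pod A: nodeSelector, an exact match on one label value
apiVersion: v1
kind: Pod
metadata:
  name: selector-demo            # hypothetical name
spec:
  nodeSelector:
    disktype: ssd                # hypothetical node label
  containers:
    - name: app
      image: registry.example.com/app:1.0.0   # placeholder image
---
# Pod B: node affinity, a set-based hard rule plus a soft preference
apiVersion: v1
kind: Pod
metadata:
  name: affinity-demo            # hypothetical name
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: disktype
                operator: In
                values: ["ssd", "nvme"]       # either value is acceptable
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 50
          preference:
            matchExpressions:
              - key: node-class
                operator: In
                values: ["memory-optimized"]
  containers:
    - name: app
      image: registry.example.com/app:1.0.0   # placeholder image
```

Pod B expresses two things nodeSelector cannot: a choice between several label values and a preference that only influences scoring.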

Can I see the score each node received during scheduling?

Not directly from kubectl. You can enable verbose scheduler logging (--v=10) to see per-node scores in the scheduler logs. In production, this level of verbosity generates enormous log volume and should only be used temporarily for debugging specific scheduling decisions.

What happens when a PVC is in Pending state?

The pod stays Pending waiting for the PVC to bind. For WaitForFirstConsumer storage classes, the PVC won't bind until the pod is scheduled, which creates a chicken-and-egg situation. The scheduler resolves this by including volume binding in the filter phase: it considers which nodes have accessible storage and uses that to break the deadlock. If no node can access the required storage, both the pod and PVC stay Pending.
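
The binding mode is set on the StorageClass, not on the PVC. In this sketch the class name is hypothetical and the provisioner assumes the AWS EBS CSI driver; substitute whichever CSI driver your cluster runs.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: topology-aware-ssd       # hypothetical name
provisioner: ebs.csi.aws.com     # assumption: AWS EBS CSI driver is installed
volumeBindingMode: WaitForFirstConsumer   # delay binding until a pod using the PVC is being scheduled
```

With this mode, a PVC that shows Pending before any pod references it is expected behavior rather than a fault.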

Does the scheduler consider actual CPU and memory utilization when placing pods?

No. The default scheduler uses only the sum of pod requests on each node. It does not query metrics-server. The LeastAllocated scoring strategy of the NodeResourcesFit plugin (LeastRequestedPriority in older releases) ranks nodes by their allocation ratio (requests vs capacity), not actual utilization. If you want utilization-aware scheduling, you need a custom scheduler plugin or extender that queries metrics-server or a monitoring system.

What is `maxSkew` and what value should I use?

maxSkew defines the maximum allowed difference in pod count between any two topology domains. With maxSkew: 1 and 3 AZs, the pod distribution might be 3-3-3 or 4-3-3 but not 5-3-3. Use maxSkew: 1 for strict balance. Use maxSkew: 2 or higher if you're okay with some imbalance in exchange for fewer Pending pods during scaling. For Deployments with fewer replicas than zones, set maxSkew: 1 with ScheduleAnyway to spread as evenly as possible without blocking scheduling.

How do I drain a node without disrupting production workloads?

Use kubectl drain <node> --ignore-daemonsets --delete-emptydir-data with PodDisruptionBudgets in place. The drain command respects PDBs by default and will wait or fail if evicting a pod would violate its budget. Set appropriate minAvailable values on your PDBs before draining. For critical services, use kubectl drain --timeout=300s to avoid indefinite hangs if a PDB cannot be satisfied.

What is gang scheduling and when do you need it?

Gang scheduling ensures that a group of pods all schedule simultaneously or none do. It is required for distributed ML training jobs (PyTorch DDP, Horovod) where all workers must start together or the job fails. The default scheduler does not support gang scheduling. You need either a separate batch scheduler such as Volcano or a scheduler plugin such as Coscheduling, which implements the Permit extension point to hold pods in a “waiting” state until the full gang can schedule.

Can I schedule a pod to a specific node directly?

Yes, by setting spec.nodeName directly on the pod spec. This bypasses the scheduler entirely: filtering, scoring, and NoSchedule taints never come into play. The kubelet on that node still runs its own admission checks, so a pod that does not fit is marked Failed (for example with OutOfcpu) instead of waiting for capacity. This is useful for debugging but dangerous in production because it skips the scheduler's safety checks. Use node affinity with a kubernetes.io/hostname label selector instead if you need to target a specific node while keeping those checks in place.
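
The safer hostname-based variant could be sketched like this. The node name node-a, the pod name, and the image are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pinned-debug-pod         # hypothetical name
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values: ["node-a"]   # placeholder node name
  containers:
    - name: debug
      image: registry.example.com/debug-tools:1.0.0   # placeholder image
      resources:
        requests:
          cpu: "100m"
          memory: "128Mi"
```

Because the pod still goes through kube-scheduler, it stays Pending with a readable FailedScheduling event if the node is tainted or full.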

Are DaemonSet pods managed by kube-scheduler?

In current Kubernetes versions, yes. Older releases had the DaemonSet controller set spec.nodeName itself before the scheduler ever saw the pod. Today the controller creates one pod per eligible node and adds a node affinity term that pins the pod to that node, and the default scheduler performs the binding. The controller also adds tolerations for the standard node condition taints, so system agents like log collectors and monitoring keep running on nodes that are under pressure or marked unschedulable.

How do I configure the scheduler for a high-throughput cluster?

Key configuration options in the KubeSchedulerConfiguration API: percentageOfNodesToScore (override the adaptive default, for example 10–20% for large clusters), podInitialBackoffSeconds and podMaxBackoffSeconds (control retry delays for unschedulable pods), and plugin profiles (enable/disable/reorder plugins per scheduler profile). Replicas of a single scheduler behind leader election provide availability, not throughput, because only the leader schedules. For very high throughput, run separately named schedulers and partition workloads using schedulerName per team or workload type.
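
Put together, a configuration file using those fields might look like the sketch below. The percentage is illustrative rather than a recommendation, and the two backoff values shown are the scheduler's defaults.

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
percentageOfNodesToScore: 10     # illustrative: score about 10% of nodes once enough feasible ones are found
podInitialBackoffSeconds: 1      # first retry delay for an unschedulable pod (default)
podMaxBackoffSeconds: 10         # upper bound on the retry delay (default)
profiles:
  - schedulerName: default-scheduler
```

The file is loaded with kube-scheduler --config. On managed control planes these settings are typically not exposed.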

Key Takeaways

The scheduler assigns pods to nodes in two phases: filter (eliminate ineligible nodes) and score (rank eligible nodes). It does not run pods.
The most common cause of Pending pods is scheduling constraints, not resource exhaustion. Always read the Events section of kubectl describe pod first.
Taints block all pods except those with explicit tolerations. System taints (node.kubernetes.io/*) are applied automatically by the node lifecycle controller and the autoscaler.
Required node affinity and required pod anti-affinity are hard constraints that keep pods Pending if unsatisfiable. Use them only when correctness demands it; use preferred rules or TopologySpreadConstraints for optimization.
TopologySpreadConstraints are the modern replacement for pod anti-affinity for HA spread. They are more performant, more declarative, and support configurable fallback behavior.
PriorityClasses control preemption. A misconfigured priority hierarchy where dev jobs outrank production is a silent cluster reliability bomb.
The scheduler uses resource requests, not limits or actual utilization. Set accurate requests to get accurate scheduling decisions.
NoExecute taints evict running pods. Always use tolerationSeconds unless you genuinely need indefinite tolerance of a failure condition.

About the author

Ravi Kapoor

Senior DevOps Engineer & Technical Writer

CKA & AWS SA-Pro Certified · 9 yrs, Atlassian & Fintech · Kubernetes open-source contributor

Ravi is a senior DevOps engineer with 9 years of experience building cloud-native infrastructure at Atlassian and multiple fintech companies. CKA and AWS Solutions Architect Professional certified, he has managed Kubernetes clusters serving millions of daily users and contributes to open-source tooling.
