— Production cluster, 2:17 AM —

PagerDuty fires. Half your pods are Pending. Users are hitting 502s.

You run kubectl describe pod. It says:

0/3 nodes are available: 3 Insufficient cpu.

You check the nodes. They have CPU. kubectl top nodes shows 40% usage. Not full.

You delete the pending pods. New ones appear. Also Pending.

You scale the deployment down, then up. Still Pending.

You restart the Scheduler. Wait — where is the Scheduler? Is it a pod? A service? Which node?

You check kube-system. You see kube-scheduler-control-plane. You describe it. It looks fine.

You check the Controller Manager. You check etcd. Everything looks fine. The cluster looks fine. Half your pods won't start.

Your team lead wakes up 38 minutes later. They type four words:

kubectl describe node worker-1

Scroll to Allocated resources. CPU requests: 1900m / 2000m allocatable. 95% reserved. Not 40% used — 95% reserved.

The fix: one line in the Deployment. Reduce the CPU request from 500m to 200m.

Pods schedule immediately. 2:58 AM. Incident closed.

38 minutes lost because one distinction, requests vs. usage, was missing from your model. You are about to have the complete model.

The Scheduler does not check how much CPU a node is actually using. It checks how much CPU is requested — the number written in the pod spec. That one distinction is invisible until a production incident teaches it to you at 2 AM. It should not be.
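To make the distinction concrete, here is a minimal sketch of the check the Scheduler effectively performs, using the numbers from the incident. The `Node` class and `fits` function are illustrative names, not Kubernetes APIs, and the 800m usage figure is simply 40% of 2000m.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    allocatable_m: int  # allocatable CPU in millicores
    requested_m: int    # sum of CPU requests of pods already bound to the node
    used_m: int         # live usage, the number `kubectl top nodes` reports


def fits(node: Node, pod_request_m: int) -> bool:
    """Scheduler-style check: compares requests to allocatable and ignores usage."""
    return node.requested_m + pod_request_m <= node.allocatable_m


worker_1 = Node("worker-1", allocatable_m=2000, requested_m=1900, used_m=800)

print(f"usage:    {worker_1.used_m / worker_1.allocatable_m:.0%}")       # 40%
print(f"reserved: {worker_1.requested_m / worker_1.allocatable_m:.0%}")  # 95%
print("500m pod fits:", fits(worker_1, 500))  # False
print("100m pod fits:", fits(worker_1, 100))  # True
```

Notice that `used_m` never appears inside `fits`. That is the whole incident in one line.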

This article covers every component, every connection, every flow. Pod creation, deletion, rolling updates, crash recovery. All of it — so the next time something breaks, you know exactly which component to look at.

The Interview Question

You are in a senior DevOps or platform engineering interview. The interviewer puts down their notes and says:

“Walk me through the Kubernetes architecture — every component, what it does, and what happens from the moment I run kubectl apply to the moment my pod is running.”

Most people answer with a list: API Server, etcd, Scheduler, Controller Manager, kubelet, kube-proxy. The interviewer nods along. Then the follow-ups start:

❓ “Which components actually talk to etcd directly?”

❓ “What exactly does the Scheduler write when it picks a node?”

❓ “You run kubectl delete pod on a pod in a ReplicaSet. How many API calls happen and who makes them?”

❓ “Your node loses power at 3 AM. How does Kubernetes know? How long does it take? What does it do?”

❓ “What is the difference between what the Controller Manager does and what the Scheduler does?”

The list is not the answer. The connections are the answer. Let's build both.

The Mental Model — Read This Before Anything Else

Before the technical depth, one analogy. Think of how a large restaurant chain operates.

Restaurant Chain World	Kubernetes World
The front counter — where all orders come in	API Server — the only entry point; nothing bypasses it
The order database — every order ever placed	etcd — stores the complete desired and actual state of the cluster
The regional manager who assigns orders to kitchens	Scheduler — decides which node a pod runs on
The quality manager who enforces "always have 3 burgers ready"	Controller Manager — ensures desired state matches actual state
Kitchen manager at each location	kubelet — manages pods on each specific node
The delivery routing system at each location	kube-proxy — routes network traffic to the right pod on that node
The actual cook who makes the food	Container Runtime (containerd) — actually runs the container
The customer placing the order	kubectl — the CLI client you use to talk to Kubernetes
A specific order / meal	Pod — the smallest deployable unit
"Keep 3 of that burger available at all times"	ReplicaSet — maintains desired replica count
"Roll out the new recipe gradually"	Deployment — manages rolling updates and rollbacks

Keep this table in your head. Every section below maps directly back to it.

The Full Architecture Diagram

Two node types. Seven core components. Every connection flows through one hub. Here is each side of that picture, separately, so nothing gets cramped.

Kubernetes Architecture — Full Cluster View

Two things to notice immediately:

The Control Plane manages the cluster — it does not run your application pods.
Every Worker Node runs the same three components: kubelet, kube-proxy, and a container runtime.

🧠 Memory Trick

Master node = the brain. Worker node = the hands. The brain decides what to do. The hands do it. But both need the same nervous system (kubelet) to function.

Why Kubernetes Uses Pods (Not Containers Directly)

Before going deep on components, there is a question that trips up even experienced engineers: why does Kubernetes schedule a Pod instead of just scheduling a container? You can run one container per pod. So what is the point of the wrapper?

The wrong answer: “it's just how Kubernetes works.” The right answer has six distinct reasons, and each one explains a pattern you will see constantly in production.


  Why not just run containers directly?

  ┌─────────────────────────────────────────────────────────────────────┐
  │                              POD                                     │
  │                                                                      │
  │  Shared network namespace ─── all containers share the same IP       │
  │  Shared storage volumes   ─── all containers can access the same disk│
  │                                                                      │
  │  ┌──────────────────┐   ┌──────────────────┐   ┌────────────────┐  │
  │  │  Main Container  │   │  Sidecar (logs)  │   │ Init Container │  │
  │  │  (your app)      │   │  (Fluentd)       │   │ (runs first,   │  │
  │  │                  │   │                  │   │  then exits)   │  │
  │  │  localhost:8080  │◄──│  reads same logs │   │                │  │
  │  └──────────────────┘   └──────────────────┘   └────────────────┘  │
  │                                                                      │
  │  All three share: same IP, same volumes, same node                   │
  │  The Scheduler places ALL of them on the same node — atomically      │
  └─────────────────────────────────────────────────────────────────────┘

  If you scheduled containers independently:
  → App container on node-1, log sidecar on node-3. localhost doesn't work.
  → Init container finishes on node-2. Main container starts on node-1. Volume gone.
  → Scheduler picks nodes one at a time. Partial placement. Broken state.

  Pod = the atomic unit. Schedule together. Share everything. Fail together.

Pod Feature	What It Enables	Real Example
Shared network namespace	All containers in the pod share the same IP and port space	App on :8080 and log shipper on :24224 both reachable via localhost
Shared storage volumes	Containers can read/write the same filesystem path	App writes logs to /var/log — Fluentd sidecar reads and ships them
Sidecar pattern	Augment any app with a second container without modifying the app	Envoy proxy, secrets injector, metrics exporter — none touch app code
Init containers	Run setup tasks before the main container starts	Wait for DB to be ready, seed config files, set file permissions
Atomic scheduling	All containers in a pod land on the same node together	No split-brain: app and its sidecar are always co-located
Shared lifecycle	All containers start and stop together as one unit	Evict the pod — all containers go. Reschedule — all come back together
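The features in this table map onto a handful of fields in the Pod spec. The sketch below builds a Pod manifest as a Python dict and prints it as JSON, which kubectl accepts alongside YAML. The pod name, container names, and the `example/...` images are hypothetical; only the field layout (`volumes`, `initContainers`, `containers`, `volumeMounts`) is standard.

```python
import json

# One mount definition reused by every container: this is the shared volume.
shared_logs = {"name": "logs", "mountPath": "/var/log/app"}

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "web-with-sidecar"},  # hypothetical name
    "spec": {
        "volumes": [{"name": "logs", "emptyDir": {}}],
        # Init container: runs to completion before the main containers start.
        "initContainers": [{
            "name": "prepare-logs",
            "image": "busybox:1.36",
            "command": ["sh", "-c", "touch /var/log/app/app.log"],
            "volumeMounts": [shared_logs],
        }],
        "containers": [
            # Main app: writes logs into the shared volume.
            {"name": "app", "image": "example/app:1.0",  # hypothetical image
             "ports": [{"containerPort": 8080}],
             "volumeMounts": [shared_logs]},
            # Sidecar: reads the same files and ships them elsewhere.
            {"name": "log-shipper", "image": "example/log-shipper:1.0",  # hypothetical image
             "volumeMounts": [shared_logs]},
        ],
    },
}

print(json.dumps(pod, indent=2))
```

There is one `spec`, so there is one scheduling decision. Every container listed in it lands on the same node and shares the same pod IP.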

🧠 Memory Trick

Think of a Pod as an apartment, not a room. A room holds one person. An apartment holds a household — people who share a kitchen, a door, an address. The Scheduler books apartments, not individual rooms. Every tenant in the apartment gets the same address (IP) and shares the same space (volumes). You could book rooms separately, but then your roommates end up in different buildings and suddenly nobody shares a kitchen anymore.

🚨 Interview Trap

“A Pod is just a container.” A single-container Pod looks like a container but is not one. The Pod is the scheduling and networking unit. The container is the runtime unit. Kubernetes never schedules containers; it schedules Pods. The container runtime (containerd) sits one layer lower: over CRI it sees only a pod sandbox (the shared namespaces) and the containers inside it, and it knows nothing about scheduling, ReplicaSets, or Services. They are different abstractions at different layers. Conflating them is the first mental-model mistake, and it feeds most later confusion about how Kubernetes works.

The Control Plane (Master Node)

The control plane is the decision-making layer. It does not run your application containers. It runs the components that decide where, when, and how your application containers run.


  Control Plane — Internal Wiring

  kubectl apply -f deployment.yaml
           │
           ▼
  ┌─────────────────────────────────────────────────────┐
  │                   kube-apiserver                     │
  │  • Authenticates your request (who are you?)         │
  │  • Authorizes it (are you allowed?)                  │
  │  • Validates the YAML (is it correct?)               │
  │  • Writes the desired state to etcd                  │
  │  • The ONLY component that talks to etcd             │
  └───────────┬──────────────────────────┬──────────────┘
              │                          │
              ▼                          ▼
  ┌───────────────────┐      ┌─────────────────────────────┐
  │       etcd        │      │   kube-controller-manager    │
  │                   │      │                              │
  │ Stores desired    │      │ Watches API Server for       │
  │ and actual state  │      │ unmet desired states.        │
  │ for every object  │      │ "3 replicas wanted, 2 exist" │
  │ in the cluster.   │      │ → creates a new pod record.  │
  │                   │      │                              │
  │ Lost etcd = lost  │      │ Contains 30+ controllers:    │
  │ cluster. Back it  │      │ ReplicaSet, Node, Job,       │
  │ up. Always.       │      │ Endpoint, Namespace...       │
  └───────────────────┘      └─────────────────┬────────────┘
                                               │
                                               ▼
                             ┌─────────────────────────────┐
                             │      kube-scheduler          │
                             │                              │
                             │ Watches for pods with no     │
                             │ assigned node.               │
                             │ Scores every eligible node.  │
                             │ Picks the best one.          │
                             │ Writes the node name to      │
                             │ the pod spec in etcd.        │
                             │ That's it. Does not start    │
                             │ the pod. Just assigns.       │
                             └──────────────────────────────┘

kube-apiserver — The Front Door

The API Server is the only entry point into Kubernetes. Every single operation (creating a pod, deleting a deployment, listing nodes, watching for changes) goes through it. No other component coordinates with another directly; they all read and write shared state through the API Server.

When you run kubectl apply, you are sending an HTTP request to the API Server. When the Scheduler picks a node for a pod, it writes that decision through the API Server. When kubelet reports a pod status, it sends that through the API Server. It is not a component — it is the backbone.

What it does in sequence for every incoming request:

Authentication — who are you? (certificates, tokens, OIDC)
Authorization — are you allowed to do this? (RBAC)
Admission control — does the request pass policy checks? (webhooks, limits)
Validation — is the YAML schema correct?
Write to etcd — store the desired state
Notify watchers — tell controllers, schedulers, kubelets something changed
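The sequence above can be sketched as a chain of checks followed by a write and a fan-out. This is a toy model, not the API Server's real code: the RBAC table, the replica cap, the `store` dict standing in for etcd, and all function names are assumptions made for illustration.

```python
class Rejected(Exception):
    """Raised by any stage to stop the request before it reaches storage."""


def authenticate(req):
    if req.get("user") is None:
        raise Rejected("401: unknown caller")


def authorize(req):
    # Toy RBAC table: the (user, verb, resource) tuples that are allowed.
    allowed = {("alice", "create", "deployments")}
    if (req["user"], req["verb"], req["resource"]) not in allowed:
        raise Rejected("403: forbidden")


def admit(req):
    # Toy admission policy: cap the replica count.
    if req["object"].get("replicas", 1) > 10:
        raise Rejected("admission: too many replicas")


def validate(req):
    if "name" not in req["object"]:
        raise Rejected("422: missing name")


STAGES = [authenticate, authorize, admit, validate]
store = {}     # stands in for etcd
watchers = []  # callbacks standing in for open watch connections


def handle(req):
    for stage in STAGES:
        stage(req)  # any stage may raise and end the request here
    key = f'{req["resource"]}/{req["object"]["name"]}'
    store[key] = req["object"]      # write desired state
    for notify in watchers:         # tell everyone who is watching
        notify(key, req["object"])
    return key


events = []
watchers.append(lambda key, obj: events.append(key))

ok = {"user": "alice", "verb": "create", "resource": "deployments",
      "object": {"name": "web", "replicas": 3}}
print(handle(ok))  # deployments/web

try:
    handle({**ok, "user": "mallory"})
except Rejected as err:
    print("rejected:", err)  # rejected: 403: forbidden
```

Nothing in `handle` starts a container. It stores an object and notifies watchers, which is exactly the division of labour the trap below is about.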

🚨 Interview Trap

“The API Server processes requests and executes them.” Wrong. The API Server stores desired state and notifies watchers. It does not execute anything. Execution happens in kubelet via the container runtime. The API Server is a database with a REST interface and a notification system. It does not start containers.

etcd — The Database

etcd is a distributed key-value store. It is the only persistent store for cluster state in Kubernetes. Everything the cluster knows (every pod, every service, every secret, every config map, every node, every replica count) lives in etcd.

If etcd is gone, the cluster cannot function. Existing pods keep running (they are managed by kubelet, which does not need etcd directly), but you cannot create, delete, or modify anything, and controllers cannot react to failures. The control plane is effectively frozen until etcd comes back.

Critical rule: only the API Server talks to etcd. No other component reads from or writes to etcd directly. The Controller Manager, Scheduler, and kubelet all go through the API Server to read and write state.

control-plane node

# List the first 30 keys stored in etcd (run on a control-plane node)
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  get /registry --prefix --keys-only | head -30

🔥 Production Reality

etcd is the most critical component in your cluster and also the most under-backed-up. If you lose etcd data without a backup, you lose the entire cluster configuration — every Deployment, Secret, ConfigMap, RBAC rule, and CRD. Back up etcd. Automated. Daily minimum. Test the restore. Most teams never test the restore until they need it.

kube-scheduler — The Dispatcher

The Scheduler has one job: watch for pods that have no node assigned, and assign them a node.

It does this in two phases for each pod:

Filtering — eliminate nodes that cannot run this pod. Reasons: not enough CPU, not enough memory, pod has a nodeSelector that does not match, node has a taint the pod does not tolerate, node affinity rules exclude it.
Scoring — rank the remaining nodes. Prefers nodes with more free resources, nodes where related pods already live (affinity), nodes in different zones (anti-affinity), etc.

The Scheduler picks the highest-scoring node and writes nodeName: worker-node-2 to the pod spec through the API Server. That is its entire contribution. It does not start the pod. It does not tell kubelet anything directly. It writes a field to the pod object, and kubelet notices because kubelet is watching for pods assigned to its node.
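The two phases fit in a few lines. The sketch below is a deliberately small model: it filters on CPU requests and taints only, and it scores by unreserved CPU. The real Scheduler runs many more filter and score plugins, and every class and function name here is illustrative.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    name: str
    allocatable_m: int
    requested_m: int
    taints: set = field(default_factory=set)


@dataclass
class Pod:
    name: str
    cpu_request_m: int
    tolerations: set = field(default_factory=set)
    node_name: Optional[str] = None


def feasible(node: Node, pod: Pod) -> bool:
    """Filtering: does the request fit, and does the pod tolerate every taint?"""
    fits = node.requested_m + pod.cpu_request_m <= node.allocatable_m
    tolerated = node.taints <= pod.tolerations
    return fits and tolerated


def score(node: Node, pod: Pod) -> int:
    """Scoring (toy rule): prefer the node with the most unreserved CPU left."""
    return node.allocatable_m - node.requested_m - pod.cpu_request_m


def schedule(pod: Pod, nodes: list) -> Optional[str]:
    candidates = [n for n in nodes if feasible(n, pod)]
    if not candidates:
        return None  # nothing passed filtering: the pod stays Pending
    best = max(candidates, key=lambda n: score(n, pod))
    pod.node_name = best.name             # the Scheduler's only output
    best.requested_m += pod.cpu_request_m
    return best.name


nodes = [
    Node("worker-1", 2000, 1900),
    Node("worker-2", 2000, 1200),
    Node("worker-3", 2000, 600, taints={"dedicated"}),
]

web = Pod("web", cpu_request_m=500)
first_choice = schedule(web, nodes)
print(first_choice)   # worker-2

big = Pod("big", cpu_request_m=900)
second_choice = schedule(big, nodes)
print(second_choice)  # None
```

worker-3 has the most free CPU but never reaches scoring, because its taint removes it during filtering. The second pod finds no feasible node at all, which is the Pending state from the opening story.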

🧠 Memory Trick

The Scheduler is the regional manager who looks at all the kitchens, checks which ones have capacity, and marks the order “send to Kitchen 3.” It does not cook. It does not call Kitchen 3. It marks the ticket, puts it back in the system, and Kitchen 3's manager notices and picks it up.

⚡ Pro Tip

The Pending pod problem from the opening story? The Scheduler filtered out all nodes in step 1 because none passed the CPU check. When filtering removes all candidates, the pod stays Pending. The error message tells you exactly why: 0/3 nodes are available: 3 Insufficient cpu. Check resource requests on the pod and allocatable resources on the nodes.

Scheduler Decision Process — Filter → Score → Assign

kube-controller-manager — The Watchdog

The Controller Manager is a single process that runs many controllers. Each controller watches the API Server for a specific type of gap between desired state and actual state — and acts to close that gap.

Controller	What It Watches	What It Does
ReplicaSet Controller	ReplicaSets	Creates/deletes pods to match desired replica count
Deployment Controller	Deployments	Creates/updates ReplicaSets for rolling updates
Node Controller	Node heartbeats	Marks nodes NotReady, evicts pods after timeout
Endpoint Controller	Pods & Services	Updates Endpoints list when pods start/stop
Namespace Controller	Namespaces	Cleans up resources when namespace is deleted
Job Controller	Jobs	Creates pods for Jobs, tracks completions
ServiceAccount Controller	Namespaces	Creates default ServiceAccount in new namespaces

Each controller runs a reconciliation loop: observe current state, compare to desired state, act to close the gap. They never store state — they watch the API Server and write back to it.
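As a rough illustration, one pass of a ReplicaSet-style loop might look like the sketch below. The `reconcile` function and the dict-shaped pods are assumptions for the example; the real controller reads and writes through the API Server instead of returning a list.

```python
import uuid


def reconcile(desired_replicas: int, pods: list) -> list:
    """One pass of observe -> compare -> act. Returns the pod list after acting."""
    # Observe: ignore pods that have failed.
    alive = [p for p in pods if p["phase"] != "Failed"]
    # Compare: how far is actual from desired?
    gap = desired_replicas - len(alive)
    # Act: create or delete just enough pod objects to close the gap.
    if gap > 0:
        alive += [{"name": f"web-{uuid.uuid4().hex[:5]}", "phase": "Pending"}
                  for _ in range(gap)]
    elif gap < 0:
        alive = alive[:desired_replicas]
    return alive


pods = reconcile(3, [])           # nothing exists yet: 3 pod objects created
pods[0]["phase"] = "Failed"       # one pod crashes
pods = reconcile(3, pods)         # the gap is 1 again: one replacement created
pods_scaled = reconcile(2, pods)  # desired state drops to 2: one pod removed

print(len(pods), len(pods_scaled))  # 3 2
```

The function keeps nothing between calls. Everything it needs arrives as arguments, which mirrors how a controller re-derives its view from the API Server on every pass.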

🚨 Interview Trap

“The Controller Manager creates pods.” Technically it creates pod objects in etcd (through the API Server). But the pod does not run until kubelet on the assigned node picks it up and tells the container runtime to start it. Creating a pod object and running a container are two completely different steps performed by two completely different components.

The Worker Node

Worker nodes run your application pods. Each worker node runs three components: kubelet, kube-proxy, and a container runtime. That is it. Everything else on the worker node is your application.


  Worker Node — Internal Wiring

  ┌──────────────────────────────────────────────────────────────┐
  │                        Worker Node                           │
  │                                                              │
  │  ┌─────────────────────────────────────────────────────┐    │
  │  │                      kubelet                         │    │
  │  │  • Watches API Server for pods assigned to this node │    │
  │  │  • Tells container runtime to pull image & run       │    │
  │  │  • Reports pod status back to API Server             │    │
  │  │  • Restarts crashed containers (liveness probe)      │    │
  │  │  • Mounts volumes, injects env vars, secrets         │    │
  │  └──────────────────────────┬──────────────────────────┘    │
  │                             │ CRI (Container Runtime Interface)
  │                             ▼                                │
  │  ┌─────────────────────────────────────────────────────┐    │
  │  │            Container Runtime (containerd)            │    │
  │  │  • Pulls the image from registry                     │    │
  │  │  • Creates the container                             │    │
  │  │  • Manages container lifecycle                       │    │
  │  └──────────────────────────┬──────────────────────────┘    │
  │                             │                                │
  │  ┌──────────┐  ┌──────────┐ │  ┌──────────┐                 │
  │  │  Pod A   │  │  Pod B   │ │  │  Pod C   │                 │
  │  │ container│  │ container│ │  │ container│                 │
  │  └──────────┘  └──────────┘ │  └──────────┘                 │
  │                             │                                │
  │  ┌─────────────────────────────────────────────────────┐    │
  │  │                    kube-proxy                        │    │
  │  │  • Maintains iptables/ipvs rules on this node        │    │
  │  │  • Routes traffic to the correct pod IP              │    │
  │  │  • Makes Services work (ClusterIP, NodePort, LB)     │    │
  │  └─────────────────────────────────────────────────────┘    │
  └──────────────────────────────────────────────────────────────┘

kubelet — The Node Manager

kubelet is an agent that runs on every node — worker nodes and the control-plane node. It has one job: make sure the containers that should be running on its node are running.

kubelet watches the API Server for pods assigned to its node. When a new one appears, kubelet:

Pulls the pod spec from the API Server
Sets up volumes and mounts secrets
Calls the container runtime via CRI (Container Runtime Interface) to pull the image
Calls the container runtime to create and start the container, with environment variables injected
Runs liveness and readiness probes
Reports pod status back to the API Server continuously

kubelet is the component that makes pods real. Everything above it — the API Server, Scheduler, Controller Manager — deals in YAML objects and desired state. kubelet is where desired state becomes an actual running process.

😅 Senior Engineer Confession

Most engineers think of kubelet as infrastructure plumbing and never look at its logs. Then something goes wrong with a pod (it gets stuck in ContainerCreating, or keeps restarting, or the image fails to pull) and they spend an hour in kubectl describe when the answer is often sitting in journalctl -u kubelet -f on the node. kubelet logs are the most useful and most ignored logs in any Kubernetes cluster.

kube-proxy — The Traffic Cop

kube-proxy runs on every node. It watches the API Server for Service and Endpoint objects and maintains network rules (iptables or IPVS) that make Services work.

When you create a Service with type ClusterIP, kube-proxy sets up iptables rules on every node so that traffic sent to the Service's ClusterIP gets routed to one of the pod IPs behind it. When a pod behind that Service gets deleted or added, the Endpoint object updates and kube-proxy updates the iptables rules on every node.

kube-proxy is why Services work as a stable IP address even as pods come and go behind it. The Service IP is virtual — no process listens on it. The iptables rules kube-proxy maintains intercept packets and redirect them to real pod IPs.
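A toy model of that redirect, assuming random backend selection (the common description of iptables mode) and made-up IP addresses. The `rules` dict stands in for the kernel rules kube-proxy programs; `sync_rules` and `route` are illustrative names.

```python
import random

# Virtual Service address -> real pod addresses behind it.
rules = {}


def sync_rules(service_vip: str, endpoints: list) -> None:
    """What kube-proxy does when the endpoint list for a Service changes."""
    rules[service_vip] = list(endpoints)


def route(dest: str, rng=random):
    """What the node does to a packet addressed to a Service's virtual IP."""
    backends = rules.get(dest)
    if not backends:
        return None  # no rule, or no ready pods behind the Service
    return rng.choice(backends)


VIP = "10.96.0.10:80"  # hypothetical ClusterIP and port
sync_rules(VIP, ["10.244.1.5:8080", "10.244.2.7:8080"])
print(route(VIP))      # one of the two pod addresses

# A pod is deleted: the endpoint list shrinks and the rules are rewritten.
sync_rules(VIP, ["10.244.2.7:8080"])
print(route(VIP))      # 10.244.2.7:8080
```

Clients keep using the same virtual address throughout. Only the mapping behind it changes.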

⚡ Pro Tip

In high-traffic clusters, iptables-mode kube-proxy becomes a bottleneck — iptables rule lookup is O(n) in the number of rules. Switch to IPVS mode for better performance. IPVS uses a hash table (O(1) lookup) and supports more load-balancing algorithms. Set mode: ipvs in the kube-proxy ConfigMap.

Container Runtime — The Cook

The container runtime is the component that actually runs containers. kubelet does not run containers directly — it calls the container runtime via the CRI (Container Runtime Interface) and tells it what to do.

Modern Kubernetes uses containerd (the most common) or CRI-O. Kubernetes 1.24 removed dockershim, the built-in adapter that let kubelet drive Docker Engine, so Docker Engine is no longer supported out of the box. containerd, which Docker itself uses internally, remains the most common runtime.

What kubelet asks the runtime	What the runtime does
Pull image nginx:1.25	Downloads layers from registry, stores locally
Create a container from this image	Creates container filesystem, namespaces, cgroups
Start the container	Runs the container process
Stop the container	Sends SIGTERM, then SIGKILL after grace period
Remove the container	Deletes container and its filesystem
Get container status	Returns running/stopped/exited and exit code

Who Talks to Whom — Every Connection

This is the most important thing to understand about Kubernetes architecture. There is one rule that governs all communication:

Every component talks to the API Server. No component coordinates with another directly; the one local handoff is kubelet calling the container runtime over CRI on its own node.


  Who Talks to Whom — Every Connection

  kubectl ──────────────────────────────► API Server
                                              │
                         ┌──────────────────── ┤ reads/writes
                         ▼                    │
                        etcd ◄────────────────┘

  Controller Manager ───────────────────► API Server  (watches for desired state)
  Scheduler ────────────────────────────► API Server  (watches for unscheduled pods)
  kubelet (each worker) ────────────────► API Server  (watches for pod assignments)
  kube-proxy (each worker) ─────────────► API Server  (watches for Service changes)

  Rule: EVERY component talks to the API Server.
        NO component talks directly to etcd (except API Server).
        NO component talks directly to another component.
        The API Server is the only hub. Everything else is a spoke.

The API Server is not one component in a web of connections — it is the only hub. All other components are spokes. They watch the API Server for changes relevant to them and write their results back through the API Server.

This design is intentional. It means:

Any component can restart independently without breaking others (they re-watch on reconnect)
All state is in one place (etcd, via the API Server) — not scattered across components
Components are loosely coupled — they react to state changes, not direct calls
You can watch the API Server yourself and see exactly what every component sees

🧠 Memory Trick

The API Server is like the restaurant's order board. Every role — cook, manager, driver — looks at the same board. Nobody calls each other directly. The cook does not call the driver. The driver does not call the quality manager. Everyone watches the board and reacts. The board is the API Server.

Desired State, Current State, and the Reconciliation Loop

Everything in Kubernetes flows from one idea: you declare what you want, and the system continuously works to make reality match that declaration. This is not a one-time operation. It is a loop that never stops.

Desired state is what you wrote in your YAML and what lives in etcd. Three replicas. nginx:1.25. Port 80. That is the truth you declared.

Current state is what is actually running — what kubelet is reporting back, what containerd is managing, what the Node Controller is observing. Two replicas running. One node down. One pod crashed.

The gap between desired and current is what every controller exists to close.


  The Reconciliation Loop — How Every Controller Works

  ┌─────────────────────────────────────────────────────────────────┐
  │                                                                  │
  │   ┌──────────────┐                                              │
  │   │   OBSERVE    │  Watch the API Server for the object type    │
  │   │              │  this controller cares about                 │
  │   └──────┬───────┘                                              │
  │          │                                                       │
  │          ▼                                                       │
  │   ┌──────────────┐                                              │
  │   │   COMPARE    │  Desired state (what etcd says should exist) │
  │   │              │  vs.                                         │
  │   │              │  Actual state (what is really running)       │
  │   └──────┬───────┘                                              │
  │          │                                                       │
  │          ▼                                                       │
  │   ┌──────────────┐                                              │
  │   │     ACT      │  If gap exists → take the smallest action    │
  │   │              │  to close it. Create a pod. Delete a pod.    │
  │   │              │  Update an endpoint. Evict a pod.            │
  │   └──────┬───────┘                                              │
  │          │                                                       │
  │          └──────────────────────────────────► loop forever      │
  │                                                                  │
  └─────────────────────────────────────────────────────────────────┘

  Every controller runs this loop. Always. The cluster never "finishes."
  It just converges closer to desired state with every iteration.


  Desired State vs Current State — The Gap Kubernetes Always Closes

  Desired State (stored in etcd)       Current State (what is running)
  ─────────────────────────────        ─────────────────────────────
  replicas: 3                          replicas: 2
  image: nginx:1.25                    image: nginx:1.25
  node: any                            node: worker-1, worker-2

  Gap: want 3, have 2
       ↓
  ReplicaSet Controller detects the gap
       ↓
  Creates 1 new pod object (desired state updated)
       ↓
  Scheduler assigns it to worker-3
       ↓
  kubelet starts it
       ↓
  Current state = 3 running. Gap = 0. Loop continues watching.

  This gap-detection-and-close pattern runs for every object in the cluster.
  Not just pods. Nodes, endpoints, jobs, namespaces — all of it.

This is why Kubernetes is called eventually consistent. The cluster does not instantly jump from current state to desired state. It converges — one reconciliation loop at a time, one controller action at a time. After a node failure, it might take 5–6 minutes before the cluster fully converges. During that window, current state does not match desired state. That is not a bug. That is the system working exactly as designed.
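Where does the 5–6 minute figure come from? A back-of-the-envelope calculation, assuming two commonly cited defaults: a 40-second grace period before a silent node is marked NotReady, and a 300-second toleration before its pods are evicted. Both values are assumptions here; they differ across versions and are configurable, so check your own cluster.

```python
# Assumed values, not read from a cluster.
node_monitor_grace_period_s = 40  # silence before the node is marked NotReady
toleration_seconds = 300          # how long pods tolerate a not-ready node

detect_and_evict_s = node_monitor_grace_period_s + toleration_seconds

print(f"{detect_and_evict_s} s = {detect_and_evict_s / 60:.1f} min")  # 340 s = 5.7 min
```

Scheduling and starting the replacement pods comes on top of that.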

🧠 Memory Trick

Think of a thermostat, not a light switch. A light switch goes from off to on instantly. A thermostat sets a desired temperature and continuously fires the heater until actual temperature reaches it — then keeps watching in case it drops again. Kubernetes is a thermostat. Every controller is a thermostat for its object type. The cluster never finishes. It just gets closer.

🚨 Interview Trap

“Kubernetes is real-time.” No — it is eventually consistent. When you run kubectl apply, the API Server confirms it stored your desired state. The reconciliation loop then runs asynchronously to close the gap. There is always a window — sometimes milliseconds, sometimes minutes — where desired and current state do not match. Designing systems that assume instant consistency leads to race conditions that only appear under load or after node failures.

Stateless Components and Leader Election

Every control-plane component except etcd (API Server, Scheduler, Controller Manager) is stateless. This is not accidental. It is the design choice that makes high availability possible.

Why stateless?

Stateless means the component stores nothing locally. All state lives in etcd, accessed through the API Server. If the Scheduler crashes, the replacement Scheduler starts up, connects to the API Server, reads the current state of all pods, and picks up exactly where the previous one left off — because the state was never in the Scheduler to begin with. It was always in etcd.

This is also why restarting any control-plane component during an incident is safe. You are not losing state. You are restarting a stateless process that will re-derive everything it needs from the API Server in seconds.

Leader election — why 3 Schedulers don't fight each other

In a highly available cluster you run 3 master nodes, each with its own Scheduler and Controller Manager instance. But only one of each should be making decisions at any time — two Schedulers racing to assign the same pod to different nodes would cause chaos.

Kubernetes solves this with leader election: each component instance competes to hold a lease — a lock object stored in etcd. The instance that holds the lease is the leader. The others are standbys that watch the lease and do nothing.
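The lease logic is small enough to sketch. This is a simplified model with explicit timestamps instead of a clock, and it leaves out the optimistic-concurrency check a real update against the API Server relies on. `Lease` and `try_acquire_or_renew` are illustrative names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    holder: Optional[str] = None
    renew_time: float = 0.0
    duration_s: float = 15.0


def try_acquire_or_renew(lease: Lease, candidate: str, now: float) -> bool:
    """Return True if `candidate` is the leader after this attempt."""
    expired = lease.holder is None or now - lease.renew_time > lease.duration_s
    if lease.holder == candidate or expired:
        lease.holder = candidate
        lease.renew_time = now
        return True
    return False  # someone else holds a live lease: stay on standby


lease = Lease()
results = [
    try_acquire_or_renew(lease, "scheduler-a", now=0),   # True: lease was free
    try_acquire_or_renew(lease, "scheduler-b", now=5),   # False: A holds it
    try_acquire_or_renew(lease, "scheduler-a", now=10),  # True: A renews
    try_acquire_or_renew(lease, "scheduler-b", now=20),  # False: renewed 10s ago
    # A stops renewing (crash, partition)...
    try_acquire_or_renew(lease, "scheduler-b", now=30),  # True: 20s > 15s, B takes over
    try_acquire_or_renew(lease, "scheduler-a", now=31),  # False: A is now the standby
]
print(results, lease.holder)
```

The last attempt is the interesting one: instance A is alive again, but the lease belongs to B, so A does no work.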


  Leader Election — Why 3 Schedulers Don't Conflict

  HA Control Plane: 3 master nodes, each running Scheduler + Controller Manager

  master-1  ─── Scheduler instance A  ◄── LEADER (holds the lease)
  master-2  ─── Scheduler instance B      standby (watching the lease)
  master-3  ─── Scheduler instance C      standby (watching the lease)

  Only the LEADER does work. Standbys watch a lease object in etcd.

  If leader stops renewing the lease:
       ↓
  Standby B or C wins the next election
       ↓
  New leader starts doing work immediately
       ↓
  No duplicate scheduling. No conflicts. No human involved.

  Lease object lives at:
  /registry/leases/kube-system/kube-scheduler
  /registry/leases/kube-system/kube-controller-manager

  Check who holds the lease:
  kubectl get lease -n kube-system

kubectl — inspect leader election

# See which instance currently holds the Scheduler lease
kubectl get lease kube-scheduler -n kube-system -o yaml

# Key fields:
# holderIdentity: master-1_abc123   ← current leader
# renewTime: 2026-06-23T02:17:00Z   ← updated every few seconds
# leaseDurationSeconds: 15          ← if not renewed in 15s, election fires

⚡ Pro Tip

If a control-plane component seems stuck — not reacting to changes — check whether it still holds the lease. A component can be running but have lost the election (network partition, clock skew, resource starvation). The process is alive. The lock is gone. It is a standby pretending to be a leader. kubectl get lease -n kube-system tells you immediately.

How a Pod Gets Created — The Complete Flow

This is the question most interviews ask. Here is the full journey — first as a narrative flow, then as a precise sequence table showing exactly which component does what and what it writes.

Pod Creation — Full Journey (kubectl apply to Pod Running)

The key insight: eight steps, six components, all coordinated through the API Server, none of them calling each other directly. Here is the same flow as a sequence table, the format that makes the handoffs obvious:


  Pod Creation — Who Does What and When

  Step │ Component             │ Action                                            │ Writes to
  ─────┼───────────────────────┼───────────────────────────────────────────────────┼───────────────────────
   1   │ kubectl               │ POST /apis/apps/v1/namespaces/default/deployments │ → API Server
   2   │ API Server            │ Auth → validate → store                           │ Deployment → etcd
   3   │ Deployment Controller │ Sees new Deployment (want RS, have 0)             │ ReplicaSet → etcd
   4   │ ReplicaSet Controller │ Sees new RS (want 3 pods, have 0)                 │ 3 Pod objects → etcd
       │                       │ Pods are now Pending (no node assigned)           │
   5   │ Scheduler             │ Sees Pending pods, scores nodes, picks one        │ nodeName → etcd
   6   │ kubelet               │ Sees pod assigned to its node                     │ calls containerd (CRI)
   7   │ containerd            │ Pulls image, creates + starts container           │ container running
   8   │ kubelet               │ Reports pod status = Running                      │ pod status → etcd

  Total time from kubectl apply to Running: typically 5–30 seconds.
  Every step goes through the API Server. Nothing skips the queue.

At each step, a component watches for a specific type of state change, acts to close a gap, and writes the result back. The next component picks up from where the previous one left off — without the previous one ever knowing the next one exists.
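To make the handoff pattern concrete, here is a toy Python model of it. Everything in it is hypothetical: the store dictionary stands in for the API Server plus etcd, the functions stand in for the real controllers, and the scheduler uses round-robin instead of real scoring. The one property it is meant to show is that no function ever calls another; each one only reads the shared state and writes back.

```python
# Toy model: every "component" only reads and writes a shared store
# (standing in for the API Server + etcd). No component calls another.
store = {"deployments": {}, "replicasets": {}, "pods": {}}

def kubectl_apply(name, replicas):
    store["deployments"][name] = {"replicas": replicas}

def deployment_controller():
    # Gap to close: a Deployment exists but its ReplicaSet does not.
    for name, dep in store["deployments"].items():
        rs_name = f"{name}-rs"
        if rs_name not in store["replicasets"]:
            store["replicasets"][rs_name] = {"replicas": dep["replicas"]}

def replicaset_controller():
    # Gap to close: fewer Pod objects than the ReplicaSet wants.
    for rs_name, rs in store["replicasets"].items():
        owned = [p for p in store["pods"].values() if p["owner"] == rs_name]
        for i in range(len(owned), rs["replicas"]):
            store["pods"][f"{rs_name}-{i}"] = {
                "owner": rs_name, "nodeName": None, "phase": "Pending"}

def scheduler(nodes):
    # Gap to close: pods with no node. Round-robin replaces real scoring here.
    pending = [p for p in store["pods"].values() if p["nodeName"] is None]
    for i, pod in enumerate(pending):
        pod["nodeName"] = nodes[i % len(nodes)]

def kubelet(node):
    # Gap to close: a pod assigned to this node that is not running yet.
    for pod in store["pods"].values():
        if pod["nodeName"] == node and pod["phase"] == "Pending":
            pod["phase"] = "Running"  # stands in for the CRI call to containerd

nodes = ["worker-1", "worker-2", "worker-3"]
kubectl_apply("my-app", replicas=3)

# In a real cluster these loops run concurrently; this is one possible order.
deployment_controller()
replicaset_controller()
scheduler(nodes)
for node in nodes:
    kubelet(node)

for name, pod in sorted(store["pods"].items()):
    print(name, pod["nodeName"], pod["phase"])
```

Running any of the functions a second time changes nothing, because each one acts only when it sees a gap between desired and actual state.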

  You                      kubectl apply
    ▼
  API Server               Deployment object created in etcd
    ▼
  Deployment Controller    ReplicaSet object created
    ▼
  ReplicaSet Controller    3 Pod objects created (Pending, no node)
    ▼
  Scheduler                nodeName written to each pod spec
    ▼
  kubelet                  Container runtime pulls image
    ▼
  Container Runtime        Container starts. Pod is Running

⚡ Pro Tip

You can watch this flow happen in real time. Run kubectl get events -n default --watch in one terminal, then kubectl apply -f your-deployment.yaml in another. You will see every step (SuccessfulCreate, Scheduled, Pulling, Pulled, Created, Started) as it happens. That is the API Server event stream. Every component writes to it.

The Biggest Myth in Kubernetes: Who Actually Creates the Pod?

Ask ten Kubernetes engineers who creates a pod. At least six will say the Scheduler. This is the single most common wrong answer in Kubernetes interviews — and it reveals a gap in the mental model that breaks everything downstream.

The Scheduler never creates a pod. It never touches a container. It never calls kubelet. It writes one field to one object and stops.


  The Biggest Myth: "The Scheduler Creates Pods"

  ✗ What engineers think:
  ─────────────────────────────────────────────────────────
  kubectl apply → API Server → Scheduler → [Scheduler creates the pod]

  ✓ What actually happens:
  ─────────────────────────────────────────────────────────
  kubectl apply → API Server → etcd (pod object stored, Pending)
                                    │
                               Scheduler watches
                                    │
                               Scheduler PICKS a node
                               Scheduler writes nodeName to the pod spec
                               Scheduler is DONE. It never touches a container.
                                    │
                               kubelet on that node watches
                                    │
                               kubelet reads the pod spec
                               kubelet calls containerd via CRI
                                    │
                               containerd pulls the image
                               containerd creates the container
                               containerd starts the container
                                    │
                               Pod is Running

  ─────────────────────────────────────────────────────────
  Scheduler  →  decides WHICH node
  kubelet    →  creates the pod's containers ON that node
  containerd →  runs the containers

  Three different components. Three different responsibilities.
  None of them creates what the others create.


Three different components. Three different responsibilities. Zero overlap. The Scheduler does not know containerd exists. containerd does not know the Scheduler exists. kubelet is the bridge between the two layers: it watches the API Server for pods assigned to its node and calls containerd to run them.

😅 Senior Engineer Confession

I have seen engineers restart the Scheduler because pods were stuck in ContainerCreating. ContainerCreating means the Scheduler already did its job — the pod has a node assigned. The problem is in kubelet or containerd on that node, not the Scheduler. Restarting the Scheduler did nothing. It was already done. Knowing which component owns which phase of pod creation eliminates an entire category of wrong debugging moves.
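One way to make that ownership concrete is a lookup from the status you see in kubectl get pods to the component worth checking first. The sketch below is in Python; the FIRST_SUSPECT table and the first_suspect helper are hypothetical, and they encode only the cases discussed in this article, not every status Kubernetes can report.

```python
# Status shown by `kubectl get pods` -> where to look first.
# Hypothetical helper; covers only the cases discussed in this article.
FIRST_SUSPECT = {
    "Pending": "Scheduler events (requests, taints, affinity)",
    "ContainerCreating": "kubelet / container runtime on the assigned node",
    "CrashLoopBackOff": "the application itself (kubelet is restarting it)",
}

def first_suspect(status: str) -> str:
    """Return the place to inspect first for a given pod status."""
    return FIRST_SUSPECT.get(status, "start with kubectl describe pod")

for status in ("Pending", "ContainerCreating", "CrashLoopBackOff", "Unknown"):
    print(f"{status:<18} -> {first_suspect(status)}")
```

The fallback is deliberate: when the status is not one you recognise, the Events section of kubectl describe pod is still the cheapest first read.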

How a Pod Gets Deleted

Deletion in Kubernetes is not instant. It is a graceful protocol with a configurable grace period. Here is why — and what happens.

Pod Deletion — Graceful Termination Sequence

The grace period exists so your application can finish in-flight requests before the container is killed. The default is 30 seconds. Your application should handle SIGTERM and shut down cleanly within that window.

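When the pod is marked for deletion, two things start in parallel: the endpoints controller removes the pod from Service routing, and kubelet begins the shutdown sequence on the node. Kubernetes does not guarantee which finishes first, so a pod can still receive a few new requests after SIGTERM arrives. That is why zero-downtime deletion usually needs help from the application: finish in-flight work after SIGTERM instead of exiting at once, or add a short preStop hook, so that routing updates propagate before the process exits.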
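The grace period only helps if the application reacts to SIGTERM. A minimal Python sketch of that pattern follows. The names shutting_down, handle_sigterm, and serve are hypothetical, and a real server would plug its own request loop in where handle_one_request is called.

```python
import signal
import threading
import time

shutting_down = threading.Event()

def handle_sigterm(signum, frame):
    # kubelet sends SIGTERM first. Only set a flag here and let the
    # main loop finish what it is doing.
    shutting_down.set()

# Signal handlers can only be installed from the main thread.
if threading.current_thread() is threading.main_thread():
    signal.signal(signal.SIGTERM, handle_sigterm)

def serve(handle_one_request, drain_seconds=0.0):
    """Process requests until SIGTERM arrives, then drain and return."""
    while not shutting_down.is_set():
        handle_one_request()
    # Optional pause so in-flight work can finish before the process exits.
    time.sleep(drain_seconds)
    return "clean shutdown"
```

Whatever drain time you choose has to fit inside the pod's grace period (30 seconds by default), otherwise SIGKILL cuts it short.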

🚨 Interview Trap

“If I delete a pod, it's gone immediately.” No. The pod enters Terminating state. It is removed from Service routing while kubelet sends SIGTERM; the two happen in parallel, not in a guaranteed order. Then the grace period runs. Then SIGKILL if the container is still alive. The container runtime does the actual kill. The pod object in etcd is only removed after kubelet confirms the container is gone. From your command to gone: up to 30+ seconds. Plan accordingly.

How Updates Work — Rolling Deployments

When you update a Deployment, Kubernetes does not delete all pods and recreate them. That would cause downtime. Instead it does something smarter — and the mechanism is also how rollbacks work, which most people think is magic until they see how obvious it is.

Rolling Update — Old ReplicaSet Scales Down, New Scales Up

The key that most explanations skip: the old ReplicaSet is kept at 0 replicas, not deleted. It is sitting there with the old pod spec intact, scaled to zero, waiting. That is not an accident or a bug. It is the rollback mechanism. kubectl rollout undo reverses the scale: the old ReplicaSet goes back to 3, the new one goes to 0. Nothing is rebuilt from scratch: the pod template is already stored, so the Deployment Controller only has to swap two numbers, and the ReplicaSet Controller starts fresh pods from the old template (the images are usually still cached on the nodes). How many old ReplicaSets are kept is set by revisionHistoryLimit, 10 by default.
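The scale-up, scale-down walk can be modelled in a few lines of Python. This is a simplified sketch under stated assumptions: the function rolling_update is hypothetical, every new pod is treated as Ready instantly (real pods are not), and only the two ReplicaSet counts are tracked. It uses maxSurge: 1 and maxUnavailable: 0 as defaults for illustration.

```python
def rolling_update(replicas, max_surge=1, max_unavailable=0):
    """Yield (old, new) replica counts for each step of a toy rollout.

    Assumes every new pod becomes Ready instantly, which real pods do not.
    """
    if max_surge == 0 and max_unavailable == 0:
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    old, new = replicas, 0
    yield old, new
    while new < replicas or old > 0:
        # Scale the new ReplicaSet up without exceeding replicas + max_surge in total.
        new += min(replicas - new, replicas + max_surge - (old + new))
        # Scale the old ReplicaSet down while keeping replicas - max_unavailable pods.
        old -= min(old, (old + new) - (replicas - max_unavailable))
        yield old, new

steps = list(rolling_update(3))
for old, new in steps:
    print(f"old RS: {old}  new RS: {new}")

# A rollback is the same walk with the roles swapped: the old ReplicaSet
# (kept at 0) is scaled back up while the new one is scaled down.
```

At the end of the walk the old ReplicaSet is at 0 but still exists, which is exactly what makes the reverse walk possible.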

🚨 Interview Trap

“kubectl rollout undo redeploys the old version from scratch.” No. It scales the old ReplicaSet back up and the new one back down — the old pod spec was never deleted. This is also why kubectl rollout history shows previous revisions: the old ReplicaSets are still there. --to-revision=3 just says which one to scale back up. The whole thing is a scale operation, not a redeploy.

kubectl

# Watch a rolling update in real time
kubectl rollout status deployment/my-app

# Check rollout history — each revision is a ReplicaSet still on disk
kubectl rollout history deployment/my-app

# Undo the last update — scales old RS up, new RS to 0
kubectl rollout undo deployment/my-app

# Roll back to a specific revision
kubectl rollout undo deployment/my-app --to-revision=2

🔥 Production Reality

A rolling update with no readiness probe is a ticking clock. Without a probe, kubelet reports a new pod Ready the moment its containers start, and the Deployment Controller has no other signal to go on. If your app takes 15 seconds to warm up, those first 15 seconds of traffic hit a pod that is running but not ready. No errors in kubectl. Errors in production. The readiness probe is not optional configuration. It is the signal that tells the Deployment Controller when it is safe to kill an old pod. Without it, the rollout is guessing, and it guesses wrong whenever startup takes longer than an instant.

How Crash Recovery Works

Kubernetes is often called “self-healing.” That phrase covers three different mechanisms, depending on what crashes. Here is each one:

Node Failure — What Kubernetes Does and When

The CrashLoopBackOff state comes from Scenario 1: the container keeps crashing and kubelet keeps restarting it, but adds exponential backoff between restarts (10s, 20s, 40s, up to 5 minutes). This prevents a broken container from hammering the container runtime.
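The backoff schedule is easy to reproduce. The Python sketch below generates the delays described above (start at 10 seconds, double each time, cap at 5 minutes). The function name crashloop_delays is hypothetical, and the sketch models only the delay sequence, not kubelet's actual restart bookkeeping.

```python
from itertools import islice

def crashloop_delays(initial=10, factor=2, cap=300):
    """Yield restart delays in seconds: 10, 20, 40, ... capped at 5 minutes."""
    delay = initial
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)

first_eight = list(islice(crashloop_delays(), 8))
print(first_eight)       # delay before each of the first eight restarts
print(sum(first_eight))  # total seconds spent waiting across those restarts
```

The cap is reached on the sixth restart, which is why a pod that has been crash-looping for a while appears to retry only once every five minutes.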

The most important thing to understand about Scenario 3 (node failure): with default settings, Kubernetes waits approximately 40 seconds before marking the node NotReady, and then another 5 minutes before evicting pods. That means well over 5 minutes of downtime (about 5 minutes 40 seconds, plus the time to start a replacement) if a node dies and you only have one replica.

debugging crash recovery

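# Check liveness and readiness probe configuration (-E enables the | alternation)
kubectl describe pod my-pod | grep -E -A 5 "Liveness|Readiness"

# Check restart count (RESTARTS column)
kubectl get pod my-pod -o wide

# Check why the previous container instance exited
kubectl describe pod my-pod | grep -A 5 "Last State"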

# Check node conditions
kubectl describe node worker-1 | grep -A 5 Conditions

# Watch kubelet logs for crash details
journalctl -u kubelet -f --since "5 minutes ago"

😅 Senior Engineer Confession

The most common production incident that looks like “Kubernetes is broken” but is not: a pod with no liveness probe that is running but not responding. Kubernetes reports it as Running and Healthy. Traffic routes to it. Users see errors. The fix is not Kubernetes — it is adding a liveness probe that the application actually responds to. I have seen this sink releases at companies of every size.

What Happens When kubelet Restarts

Most engineers assume that restarting kubelet restarts all the pods on that node. This is wrong — and the reason why is one of the most elegant design decisions in Kubernetes.

kubelet and the container runtime (containerd) are separate processes. containerd manages containers independently. kubelet uses containerd but does not own it. When kubelet stops, containerd keeps running. The containers keep running. The pods are still up. Users see nothing.


  What Happens When kubelet Restarts

  Before restart:
  ┌─────────────────────────────────────────┐
  │  kubelet (running)                       │
  │  containerd (running)                    │
  │  Pod A (running) ──── container-A-123    │
  │  Pod B (running) ──── container-B-456    │
  └─────────────────────────────────────────┘

  kubelet process exits (upgrade, crash, OOM)
  ┌─────────────────────────────────────────┐
  │  kubelet (STOPPED)                       │
  │  containerd (still running)              │
  │  Pod A ──── container-A-123  (STILL UP) │
  │  Pod B ──── container-B-456  (STILL UP) │
  └─────────────────────────────────────────┘
  containerd is independent. Containers keep running.
  No downtime for running pods.

  kubelet restarts
  ┌─────────────────────────────────────────┐
  │  kubelet (running again)                 │
  │  kubelet queries containerd for current  │
  │  running containers                      │
  │  kubelet re-syncs with API Server        │
  │  kubelet resumes probe checks            │
  └─────────────────────────────────────────┘
  Reconciliation takes ~seconds.
  Pods that crashed during the gap get restarted.
  Pods that were healthy continue uninterrupted.

When kubelet comes back, it does three things in order:

Queries containerd for the list of currently running containers on this node.
Re-syncs with the API Server to get the desired pod spec for each one.
Reconciles — any container that should be running but is not gets restarted; any that should not be running gets stopped.

The whole re-sync takes seconds. Pods that were healthy during the restart continue uninterrupted. Pods that crashed during the kubelet downtime get restarted immediately on re-sync because kubelet sees the gap.
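The three steps above amount to a set comparison between what the API Server says should run and what the runtime reports as running. A minimal Python sketch, assuming pods can be identified by name alone; the reconcile function and the example pod names are hypothetical.

```python
def reconcile(desired: set[str], running: set[str]) -> dict[str, list[str]]:
    """Compare what should run on the node with what the runtime reports."""
    return {
        "start": sorted(desired - running),  # should run, but is not running
        "stop": sorted(running - desired),   # running, but no longer wanted
        "keep": sorted(desired & running),   # healthy, left untouched
    }

# Hypothetical node state right after a kubelet restart.
desired_from_api_server = {"pod-a", "pod-b", "pod-c"}
running_in_containerd = {"pod-a", "pod-b", "pod-old"}

plan = reconcile(desired_from_api_server, running_in_containerd)
print(plan)
```

The "keep" bucket is the reason a kubelet restart is uneventful for healthy pods: nothing in the plan touches them.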

safe kubelet restart procedure

# Restart kubelet safely (node stays in service, pods keep running)
sudo systemctl restart kubelet

# Watch kubelet come back
journalctl -u kubelet -f

# Verify pods on the node are still running (all namespaces, filtered by node)
kubectl get pods -A -o wide --field-selector spec.nodeName=worker-1

🔥 Production Reality

kubelet restarts are the standard way to apply kubelet configuration changes — new flags, updated config file, kubeconfig rotation. Teams that avoid restarting kubelet because they fear downtime are carrying unnecessary operational risk. Restarts are safe. The separation between kubelet and containerd exists precisely so you can upgrade and restart the management layer without touching running workloads.

Two Production Disasters

Disaster 1: The Scheduler Was Right the Whole Time (e-commerce, Black Friday, 2:43 AM)

Black Friday. Traffic spikes. On-call engineer scales the payment service from 6 to 18 replicas. 12 new pods appear. All Pending. Error: 0/4 nodes available: 4 Insufficient memory.

First thing they check: kubectl top nodes. Worker-1: 38% memory. Worker-2: 41%. Worker-3: 44%. Worker-4: 39%. None of the nodes look full. They delete the Pending pods and reapply. Still Pending.

They check the Scheduler. kubectl logs kube-scheduler-control-plane -n kube-system. Clean. No errors. They try cordoning and uncordoning worker-4 to force a reschedule. Still Pending. They try patching the pod directly to remove the memory request. The API Server rejects it — you cannot patch a Pending pod's resource spec in place.

34 minutes in. Team lead joins the Slack thread and pastes one command:

the command that ended the incident

kubectl describe node worker-1 | grep -A 6 "Allocated resources"

# Output:
Allocated resources:
  Resource  Requests        Limits
  --------  --------        ------
  memory    7424Mi / 8Gi    0 (0%)

7.4Gi requested on an 8Gi node. kubectl top showed 38% used. 90% was requested. The Scheduler checks requests. Not usage. Those are different numbers. They had been different for months and nobody had noticed.

Fix: kubectl set resources deployment/payment --requests=memory=256Mi. All 12 pods scheduled in 4 seconds. Time lost: 34 minutes. The Scheduler was correct the entire time — it had been printing the exact reason why. They were reading the wrong metric.
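The check that mattered in this incident fits in one line of arithmetic. Here is a hedged Python sketch using the numbers from the output above. The function fits is a hypothetical simplification of the Scheduler's resource filter, and the 1024Mi "before" request is an assumed value for illustration, since the incident write-up does not state the original request.

```python
def fits(allocatable_mi: int, requested_mi: int, pod_request_mi: int) -> bool:
    """Scheduler-style check: compare requests against allocatable.

    Actual usage (what kubectl top shows) is not an input at all.
    """
    return requested_mi + pod_request_mi <= allocatable_mi

ALLOCATABLE_MI = 8 * 1024   # 8Gi node from the output above
REQUESTED_MI = 7424         # already reserved by existing pods

print(f"requested: {REQUESTED_MI * 100 // ALLOCATABLE_MI}%")
print(f"free for scheduling: {ALLOCATABLE_MI - REQUESTED_MI}Mi")
print(fits(ALLOCATABLE_MI, REQUESTED_MI, 1024))  # assumed original request
print(fits(ALLOCATABLE_MI, REQUESTED_MI, 256))   # request after the fix
```

Notice that the 38% usage figure never appears in the function. That is the whole incident in one signature.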

Disaster 2: Someone Almost Made It Much Worse (fintech startup, Tuesday morning)

A team rolling out a new version of their API server. 8 replicas. maxSurge: 1, maxUnavailable: 0. Halfway through (4 new pods Running, 4 old pods still up) one AWS availability zone goes soft. Three worker nodes start sending intermittent heartbeats. The rollout stalls: the Deployment Controller will not remove more old pods until new ones report Ready. kubectl rollout status hangs.

kubectl get pods shows: 4 new pods Running, 3 old pods Running, 2 old pods Terminating (stuck), 1 new pod ContainerCreating (stuck). A senior engineer, unfamiliar with the rollout mechanics, decides the deployment is corrupted. They type:

the command nobody should have typed

kubectl delete replicaset api-server-7d9f4b8c6

That is the old ReplicaSet. The one keeping the 3 surviving old pods alive while the rollout was stalled. Deleting it would have killed 3 of the 7 pods serving production traffic, leaving 4 new pods and a degraded AZ.

A junior engineer, 3 months in, stops them: “Wait — if we delete the old ReplicaSet, those 3 pods die immediately. We only have 4 new ones and the AZ is still flaky.”

Nobody types anything for 90 seconds.

The AZ stabilizes. The stuck pods terminate. The Deployment Controller picks the rollout back up. The rollout completes. All 8 new pods running. Old ReplicaSet scales to 0 on its own. Total downtime: zero. Total commands run during the incident: zero. Total time from AZ hiccup to full recovery: 7 minutes.

The junior engineer understood why the old ReplicaSet was still there. The senior engineer was about to delete the safety net holding the service up. Architecture knowledge, 3 months in, prevented a production outage.

🔥 Production Reality

Understanding Kubernetes architecture does not just help you pass interviews. It determines whether you spend 34 minutes reading the wrong metric, or 4 seconds running the right command. It determines whether you delete the thing keeping production alive, or correctly identify it as the safety net. Every minute of architectural confusion is a minute of incident duration.

The Wall of Shame

Seven mistakes. All extremely common. All made by engineers who could list the components but did not understand the connections.

1. Thinking the Scheduler starts pods

“The Scheduler picks a node and... that's it. It does not whistle to kubelet. It writes a sticky note on the pod that says 'you belong on node-3'. kubelet on node-3 sees the note and handles the rest. The Scheduler's job ends the moment it writes nodeName.”

What happens: Incorrect interview answers. Wrong mental model for debugging scheduling issues.

Fix: Scheduler assigns. kubelet runs. Two completely separate steps by two completely separate components.

2. Thinking you can talk directly to etcd to fix things

“Reaching into etcd to edit a value directly is like rewriting the restaurant's order database rows without telling the front counter, the manager, or the kitchen. Everything that was watching that data now has a stale view. Inconsistency spirals. The cluster can enter states nobody planned for.”

What happens: Cluster inconsistency, controllers looping, audit logs missing the change.

Fix: Always go through the API Server. Use kubectl or the Kubernetes API. Never edit etcd directly.

3. Deleting a pod to “restart” a Deployment

“Killing one burger to make the kitchen make a fresh one works, but you get the same recipe. Same image. Same broken config. Same bug. If the pod is broken because of what is in it, deleting it restarts the broken version. Only changing the Deployment spec (a new image, a fixed config) gets you something different.”

What happens: Same broken pod comes back. Issue persists. Time wasted.

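Fix: kubectl rollout restart deployment/my-app for a clean restart of all pods with a rolling update. If the pod spec itself is the problem, fix the spec (image, config, env) and let the rollout replace the pods.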

4. Interpreting resource usage numbers as what the Scheduler sees

“The Scheduler is a hiring manager who reads resumes, not a heart-rate monitor. It does not check how much CPU your pod is actually using. It reads what the pod says it needs (requests). A pod that uses 50m CPU but requested 500m blocks 500m for scheduling. kubectl top is not what the Scheduler sees. kubectl describe node is.”

What happens: Nodes look underloaded in kubectl top but pods stay Pending. The Black Friday disaster above.

Fix: Check allocatable vs. requested with kubectl describe node | grep -A 5 Allocated.

5. Running applications without a readiness probe

“Running no readiness probe is like hiring a chef whose employment contract says 'I'm ready to cook the moment I walk in the door.' Doesn't matter that they need 2 minutes to wash their hands, put on their apron, and find their station. Orders start arriving immediately. The first ten orders get cooked wrong. The restaurant gets bad reviews. You never added a readiness probe.”

What happens: Traffic hits pods before they are actually ready. Users see errors during deployments and restarts.

Fix: Add a readiness probe to every container. Use the healthz endpoint if the app has one.

6. Blaming kube-proxy when pod-to-pod networking breaks

“kube-proxy is the traffic cop at the Service intersection — ClusterIP, NodePort, LoadBalancer. It does not build the roads between pods. That is the CNI plugin's job. Debugging kube-proxy when two pods cannot reach each other by IP is like calling the toll booth operator because the road between two cities has a pothole. The toll booth does not know the road exists. Wrong department entirely.”

What happens: 40-minute debugging session in kube-proxy logs for a Calico misconfiguration. Wrong component. Wrong logs. Wrong fix.

Fix: Pod-to-pod networking broken → check CNI plugin (Calico/Flannel/Cilium). Service routing broken → check kube-proxy. These are different systems.

7. Trying to “trigger” the Scheduler or Controller Manager by restarting them

“The Scheduler is not a person who fell asleep at their desk. It is a loop — watching the API Server, reacting to what it sees, writing back. If your pods are Pending and you restart the Scheduler, you have rebooted a security camera to try to fix a traffic jam. The camera was never causing the jam. The jam is in the data — wrong resource requests, missing toleration, broken node. Fix the data. The Scheduler will react within seconds on its own.”

What happens: Brief Scheduler downtime added to the incident. Pods still Pending. Root cause untouched.

Fix: Scheduler and Controller Manager are reactive loops — they respond to state. Fix the state, not the component. Start with kubectl describe pod and read the Events.

Why Kubernetes Is Designed This Way

Every architectural decision in Kubernetes has a reason. Understanding the why is what separates engineers who use Kubernetes from engineers who understand it. Here are the five design decisions that define everything else.

Why does everything go through the API Server?

Because the API Server is the only component that can enforce authentication, authorization, and admission control in one place. If components talked directly to each other, you would need to implement security at every point of contact, and you would miss some. The single hub means one place to audit, one place to enforce policy, one place to watch. It also means every component is loosely coupled: they can restart independently, miss messages temporarily, and catch up from the API Server's state on reconnection. A component can fall behind for a moment, but it cannot stay out of sync, because there is only one source of truth to catch up from.

Why desired state instead of imperative commands?

Imperative systems are fragile: if the command fails halfway, the system is in an unknown partial state. Desired state is self-correcting: you declare what you want, and controllers continuously reconcile until reality matches the declaration. If a node dies mid-operation, nothing is lost — the desired state is still in etcd, and controllers pick up exactly where they left off when the node recovers. Desired state also makes auditing trivial: what is in etcd is what you asked for, and the diff between etcd and running reality is visible at any moment.

Why separate scheduling from execution?

The Scheduler has global visibility — it can see all nodes, all resource pressures, all affinity rules, all topology constraints. kubelet has local execution — it knows one node intimately: its volumes, its running containers, its probe results. Combining them would mean every node needs global cluster state to make local decisions, and the global optimizer needs per-node execution details it does not care about. Separation keeps each component focused: the Scheduler optimizes placement globally, kubelet executes locally. Neither leaks into the other's concern.

Why does kubelet run on every node — including the control-plane node?

kubelet is not a worker-node component. It is a node component. Every node, regardless of its role, needs something to manage the containers running on it and report their status back to the cluster. On a kubeadm-style cluster, the control-plane node runs etcd, the API Server, the Scheduler, and the Controller Manager, all as containers, all managed by kubelet. If kubelet did not run on the control-plane node, nobody would be watching those containers. They would crash and not restart. The cluster would be unmanaged at its most critical layer.

Why Pods instead of containers as the scheduling unit?

Containers are isolated by design — separate namespaces, separate filesystems, separate network stacks. But many real workloads need tight coupling: an app and its log shipper need to share a filesystem. An app and its proxy need to share a network interface (localhost). An app and its secrets injector need to share an ephemeral volume. Scheduling containers independently makes tight coupling impossible — they could land on different nodes and suddenly localhost does not work. The Pod groups containers that must be co-located, co-scheduled, and co-managed into one atomic unit. The scheduler books the apartment. The containers are the tenants.

⚡ Pro Tip

These five “why” answers are what interviewers at senior and staff level are actually listening for. Listing components is the expected answer. Explaining why the architecture was designed this way — the tradeoffs, the constraints, the deliberate choices — is the answer that gets an offer.

Cloud-Managed vs Self-Managed Kubernetes

Every Kubernetes cluster has the same components. What changes between EKS, GKE, AKS, and a self-managed cluster is who is responsible for which components — and what you can control.

Concern	Self-Managed (kubeadm)	Cloud-Managed (EKS / GKE / AKS)
Control plane visibility	You SSH into master nodes, see all components, read all logs	Control plane is hidden — you get an API endpoint, not a node
etcd access	Direct access — you back it up, you restore it, you own it	Managed by the provider — backup/restore is their responsibility
Control plane upgrades	Manual — kubeadm upgrade, one version at a time, your risk	Provider-managed — click a button or run one command
Worker node upgrades	Manual — drain, upgrade OS/kubelet, uncordon	Managed node groups / node pools handle this automatically
API Server HA	You configure 3 master nodes, load balancer, etcd quorum	Provider runs multi-AZ HA automatically — invisible to you
Static Pods	Control plane runs as static pods in /etc/kubernetes/manifests/	No static pods — provider manages control plane differently
Custom admission webhooks	Full control — install any webhook	Full control — same Kubernetes API, same webhook support
Cost model	You pay for master node VMs (3× for HA)	Provider charges a cluster fee ($0.10/hr on EKS) + worker nodes

The architecture is the same. The API, the objects, the controllers, the kubelet — identical. The difference is operational ownership. On EKS you cannot SSH into the master node because it does not exist as a node you can access. You interact with the same API Server, but the machine it runs on belongs to AWS.

🧠 Memory Trick

Cloud-managed Kubernetes is like renting a fully-managed kitchen — the appliances are there, you cook, but you do not fix the oven when it breaks. Self-managed is owning the kitchen: full control, full responsibility. Same food either way. Different maintenance burden.

📖 Deep Dive: How the Control Plane Boots Itself

On self-managed clusters, the control plane components (etcd, kube-apiserver, kube-scheduler, kube-controller-manager) run as Static Pods — managed directly by kubelet from YAML files in /etc/kubernetes/manifests/. This is how Kubernetes solves the bootstrap problem: kubelet starts before the API Server exists, reads the manifest files, and brings the control plane up from disk. We have a full article on this: Kubernetes Static Pods Explained →

Best Practices

Set resource requests and limits on every container. Requests are what the Scheduler uses to place pods. Without them, pods can land on full nodes and get OOMKilled. Without limits, a runaway container can starve its neighbors.
Add readiness and liveness probes to every container. Readiness gates traffic. Liveness triggers restarts. Without them, broken pods look healthy to Kubernetes and traffic hits them anyway.
Back up etcd. Test the restore. Use etcdctl snapshot save on a schedule. A backup you have never restored is not a backup — it is a comfort object.
Run at least 2 replicas of every stateless service. One replica means a single node failure causes downtime for up to 5+ minutes while Kubernetes reschedules. Two replicas means the other pod handles traffic while the replacement starts.
Use kubectl rollout status to monitor deployments. Do not assume a deployment succeeded because kubectl apply returned. The API Server accepted the YAML. The rollout is still ongoing.
Check kubelet logs first when pods are stuck. journalctl -u kubelet -f on the node. This is where image pull failures, volume mount errors, and container runtime issues surface.
Use pod disruption budgets (PDB) for critical services. PDBs cap how many pods of a given type can be voluntarily disrupted at once. kubectl drain evicts pods through the Eviction API, which honors the budget. Without one, node drains can take down too many replicas simultaneously.

FAQ

What is the difference between the Master Node and the Control Plane?

They often mean the same thing but are not identical. The Control Plane is the set of components that manage the cluster: API Server, etcd, Scheduler, Controller Manager. The Master Node is the physical or virtual machine that runs those components. In a highly available cluster, you typically have 3 Master Nodes running the Control Plane components (with etcd forming a quorum). In a single-node dev cluster like kind or minikube, the same node runs both Control Plane and your workloads.

What is the cloud-controller-manager?

A fifth control-plane component, present in clusters that integrate with a cloud provider. It handles cloud-specific logic: provisioning LoadBalancer Services (creating an AWS ELB, for example), managing node labels that reflect cloud instance types, and handling cloud-specific node lifecycle events. It sits between the Kubernetes control loop and the cloud provider API so that the core Kubernetes code stays cloud-agnostic. On bare-metal clusters, it is not present.

Why does kubectl sometimes return before my pod is actually running?

kubectl returns when the API Server accepts and stores your request — not when your pod is running. After kubectl apply returns, the Deployment Controller still needs to create a ReplicaSet, the ReplicaSet Controller needs to create pod objects, the Scheduler needs to assign nodes, and kubelet needs to pull the image and start the container. All of that happens asynchronously. Use kubectl rollout status deployment/my-app to wait for the actual rollout to complete.

Interview Corner

Questions You Should Be Able to Answer at Any Level

Q: What is the role of the API Server in Kubernetes?

The API Server is the central gateway for all cluster communication. It authenticates, authorizes, validates, and persists every operation. It is the only component that reads from and writes to etcd. All other components communicate with each other by watching and writing to the API Server — never directly to each other.

Q: What does the Scheduler actually write when it assigns a pod to a node?

It sets the nodeName field in the pod spec, via the API Server. Concretely, the Scheduler posts a Binding object for the pod, and the API Server records the chosen node in spec.nodeName. That is the entire act of scheduling. It does not contact kubelet and it does not start the container. kubelet on that node watches for pods whose nodeName matches its own node and takes it from there.

Q: Which components talk to etcd directly?

Only the API Server. Every other component — Controller Manager, Scheduler, kubelet, kube-proxy — reads and writes state through the API Server. etcd is not accessible to other components directly. This is an architectural guarantee.

Q: A pod is in Pending state. What does that mean and how do you debug it?

Pending means the pod has been accepted by the API Server and stored in etcd, but the Scheduler has not yet assigned it to a node. The most common causes: no nodes with sufficient resources (check resource requests vs. allocatable), no nodes matching the pod's nodeSelector or affinity rules, or all eligible nodes are tainted and the pod has no matching toleration. Debug with kubectl describe pod — the Events section at the bottom will show the Scheduler's reason for not placing the pod.

Q: What is the Controller Manager's relationship to the Scheduler? Do they interact?

They do not interact directly. The Controller Manager creates pod objects in etcd (via the API Server) to match desired replica counts. Those pod objects have no nodeName set — they are Pending. The Scheduler independently watches for Pending pods and assigns nodes. Both components watch the same API Server. Neither knows the other exists. The API Server is the only connection between them.
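A toy model makes that separation concrete. Everything here is hypothetical: ApiStore stands in for the API Server plus etcd, the two functions stand in for the ReplicaSet Controller and the Scheduler, and round-robin placement replaces real node scoring. The real components use watches rather than direct function calls.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pod:
    name: str
    node_name: Optional[str] = None  # empty until the scheduler binds the pod


class ApiStore:
    """Hypothetical in-memory stand-in for the API Server plus etcd."""

    def __init__(self) -> None:
        self.pods: dict[str, Pod] = {}


def replica_controller(store: ApiStore, app: str, desired: int) -> None:
    """Create pod objects until the desired count exists. Never picks a node."""
    for i in range(len(store.pods), desired):
        name = f"{app}-{i}"
        store.pods[name] = Pod(name)


def scheduler(store: ApiStore, nodes: list[str]) -> None:
    """Bind every unassigned pod to a node. Never creates pods."""
    unassigned = [p for p in store.pods.values() if p.node_name is None]
    for i, pod in enumerate(unassigned):
        pod.node_name = nodes[i % len(nodes)]  # round-robin, not real scoring


store = ApiStore()
replica_controller(store, "my-app", desired=3)
pending = [p.name for p in store.pods.values() if p.node_name is None]
scheduler(store, ["worker-1", "worker-2"])
placement = {p.name: p.node_name for p in store.pods.values()}

print(pending)    # all three pods exist but have no node yet
print(placement)  # every pod now carries a node name
```

Neither function calls the other or even receives a reference to it. The only thing they share is the store, which mirrors how the two real components share only the API Server.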

Q: How does Kubernetes know when a node has died?

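kubelet on each node sends a heartbeat to the API Server roughly every 10 seconds, renewing the node's Lease object and periodically updating lastHeartbeatTime in the node status. The Node Controller (inside the Controller Manager) watches these. If heartbeats stop for ~40 seconds (node-monitor-grace-period), it marks the node NotReady. After another ~5 minutes, the pods on that node are evicted so they can be rescheduled elsewhere. Older releases controlled that delay with the pod-eviction-timeout flag; current releases use taint-based eviction, where the node receives a NoExecute taint and each pod tolerates it for 300 seconds by default.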
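The timeline can be sketched as a small function. The two constants are assumptions copied from the approximate figures in the answer above, and node_state is a hypothetical helper, not part of any Kubernetes API; exact defaults vary by version and cluster configuration.

```python
NODE_MONITOR_GRACE_SECONDS = 40  # assumed, from the "~40 seconds" above
EVICTION_WAIT_SECONDS = 300      # assumed, from the "~5 minutes" above


def node_state(seconds_since_heartbeat: float) -> str:
    """Rough view of a node as a function of heartbeat silence."""
    if seconds_since_heartbeat < NODE_MONITOR_GRACE_SECONDS:
        return "Ready"
    if seconds_since_heartbeat < NODE_MONITOR_GRACE_SECONDS + EVICTION_WAIT_SECONDS:
        return "NotReady"
    return "NotReady, pods evicted"


for t in (10, 60, 400):
    print(f"{t:>3}s without heartbeat: {node_state(t)}")
```

With these assumed values, pods on a dead node are not replaced until roughly 340 seconds after the last heartbeat, which is why a node failure takes minutes, not seconds, to heal.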

Q: What is the difference between kubelet and the container runtime?

kubelet is the node agent that decides what should run — it reads pod specs from the API Server and manages the pod lifecycle (probes, volumes, status reporting). The container runtime (containerd, CRI-O) is what actually runs containers — pulling images, creating namespaces, starting processes. kubelet calls the container runtime via the CRI interface. kubelet is the manager. The container runtime is the executor.

Q: What happens when you run kubectl delete deployment my-app?

kubectl sends a DELETE request to the API Server. With the default background cascading deletion, the API Server removes the Deployment object, and the garbage collector (a controller inside the Controller Manager) follows ownerReferences to delete the ReplicaSet and then its pods. kubelet on each node sees pods with deletionTimestamp set, sends SIGTERM to each container, waits for the grace period, kills anything still running, and reports back. The API Server then removes the pod objects from etcd. With --cascade=foreground the order reverses: the dependents are deleted first and the Deployment object is removed last.

🎤 The 60-Second Answer

🎤 Say This Out Loud Until You Own It

“Kubernetes has two types of nodes. The Control Plane — which contains the API Server, etcd, the Scheduler, and the Controller Manager — is the brain. Worker Nodes run your application pods. Each Worker Node runs kubelet, kube-proxy, and a container runtime like containerd.

Every component communicates through the API Server. Nothing talks directly to anything else. etcd is the only storage — and only the API Server touches etcd directly.

When you run kubectl apply, the API Server validates and stores the Deployment in etcd. The Deployment Controller sees it, creates a ReplicaSet, which creates Pod objects in Pending state. The Scheduler sees Pending pods, scores nodes, and writes a node name to each pod spec. kubelet on that node sees a pod assigned to it, tells containerd to pull the image, starts the container, and reports Running back to the API Server.

If a container crashes, kubelet restarts it — no other component involved. If a node dies, the Node Controller marks it NotReady and evicts pods after ~5 minutes. The ReplicaSet Controller sees the count drop and creates replacement pods. The Scheduler assigns them to healthy nodes. kubelet starts them. Cluster heals.”

If you can say that without looking at notes, you genuinely understand it. That is the offer.

If You Remember Only Five Things

API Server is the only front door.

Every component — kubectl, Scheduler, Controller Manager, kubelet, kube-proxy — talks to the API Server. Nothing talks directly to anything else. Nothing bypasses it.

Scheduler chooses the node.

It scores all eligible nodes and writes nodeName to the pod spec. That is its entire job. It does not create containers. It does not contact kubelet. It writes one field and stops.

kubelet creates and manages pods on that node.

When kubelet sees a pod assigned to its node, it calls the container runtime to pull the image and start the container. It also runs probes, mounts volumes, and restarts crashed containers. Scheduler decides. kubelet executes.

containerd runs the containers.

kubelet calls containerd via CRI. containerd pulls the image, creates the container process, and manages its lifecycle. containerd does not know about Pods, nodes, or Kubernetes. It knows about containers.

Controllers continuously reconcile desired and actual state.

The Controller Manager runs watch loops that compare what etcd says should exist with what actually exists. Every gap triggers an action. This is why Kubernetes is self-healing — not magic, just a reconciliation loop that never stops.
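One pass of such a loop can be sketched as a pure function. This is a minimal illustration, not the ReplicaSet Controller's real logic: reconcile and its action strings are hypothetical, and the real controller issues API calls instead of returning text.

```python
def reconcile(desired_replicas: int, running_pods: list[str]) -> list[str]:
    """Return the actions needed to close the gap in one pass of the loop."""
    gap = desired_replicas - len(running_pods)
    if gap > 0:
        # Too few pods: create the missing ones.
        return [f"create pod #{i + 1}" for i in range(gap)]
    if gap < 0:
        # Too many pods: remove the surplus from the end of the list.
        return [f"delete {name}" for name in running_pods[gap:]]
    return []  # desired == actual, nothing to do


print(reconcile(3, ["web-a"]))                    # two pods missing
print(reconcile(1, ["web-a", "web-b", "web-c"]))  # two pods too many
print(reconcile(2, ["web-a", "web-b"]))           # already converged
```

The function keeps no memory of what it did last time. It looks only at the current gap, so running it again after a crash or a restart produces the right actions, which is what makes the pattern self-healing.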

The Pending pods at 2:17 AM were caused by a gap between what the Scheduler sees (resource requests) and what monitoring shows (resource usage). One piece of architecture knowledge, that the Scheduler filters on requests and not on actual usage, turns a roughly 40-minute incident into a 4-minute fix.

The components themselves are not complicated. What they do individually is straightforward. The power is in understanding how they connect — the single hub of the API Server, the watch-and-react pattern, the clean separation between deciding (control plane) and doing (worker node). Once that clicks, every Kubernetes behavior has a logical explanation.

The cluster is not magic. It is a very well-designed distributed system where seven components, coordinating almost entirely through the API Server, collectively manage complexity that would be impossible to handle manually. Now you know how.

About the author

Ravi Kapoor

Senior DevOps Engineer & Technical Writer

CKA & AWS SA-Pro Certified · 9 yrs at Atlassian & Fintech · Kubernetes open-source contributor

Ravi is a senior DevOps engineer with 9 years of experience building cloud-native infrastructure at Atlassian and multiple fintech companies. CKA and AWS Solutions Architect Professional certified, he has managed Kubernetes clusters serving millions of daily users and contributes to open-source tooling.

