Kubernetes | 🔥 Production-Critical | ⚡ Senior Engineer Level

Kubernetes Architecture Explained: Control Plane, Worker Nodes, and How It All Works Together

The interview question that separates engineers who have run Kubernetes in production from engineers who have read about it, answered at the architecture level, with the counterintuitive truth about what actually happens when the control plane goes down.

“The kubelet doesn't need the control plane to keep your pods alive. The control plane needs itself to prove it's still in charge.”

Updated June 8, 2026 | 22 min read | Has diagnosed 2 silent cluster deaths

-- PagerDuty Alert --

It's 2:17 AM.

etcd is down. The API server is unresponsive. kubectl hangs on every command. You check the workloads:

✅ Pods are still Running

✅ Application traffic is flowing

✅ Health checks are passing

✅ No user-facing errors

But the control plane is dead. You can't deploy. You can't scale. You can't see anything. The cluster is frozen.

And it had been frozen for 47 minutes before anyone noticed, because from the outside, everything looked completely fine.

You don't know why. You're about to.

Three Months Later. A Different Kind of War Room.

No PagerDuty this time. A whiteboard. A platform engineering interview at a company you have been targeting for two years. The interviewer uncaps a marker and writes one sentence:

“Explain the Kubernetes architecture to me.”

You breathe. You've survived incidents in this architecture. You know exactly where the bodies are buried. You've stared at a frozen cluster at 2 AM and understood, component by component, why it was frozen but still serving traffic. This question is yours.

The Answer That Gets You to Round Two, Then Eliminated

You draw the standard flow. Confident. Correct. And completely insufficient.

User
▼
kubectl
▼
API Server
▼
etcd
▼
Pods

The interviewer nods. Writes something down. Then looks up slowly.

Interviewer keeps going:

❓ “What is etcd and why would the cluster stop working if it went down?”

❓ “The API server is stateless. What does that mean and why does it matter?”

❓ “The scheduler picked a node for my pod. Who actually starts the container?”

❓ “How does the kubelet know what pods it should be running?”

❓ “If the entire control plane goes down, do running pods die?”

Five questions. Most candidates freeze at question one or give the wrong answer to question five, the counterintuitive one. Engineers who answer all five with precision walk out with the offer. Let's do this.

Before the architecture details: here is the mental model that makes everything else stick. Kubernetes is a hospital. A very well-engineered, occasionally dysfunctional, distributed hospital.

Hospital World | Kubernetes World
Hospital records room | etcd: source of truth; every resource, every config, every secret stored here
Reception / Nurse's station | API Server: all requests go through here; nothing bypasses it
Bed assignment coordinator | Scheduler: decides WHERE a pod runs; does not start the container itself
Hospital administration | Controller Manager: notices beds are empty and fills them
Ward nurse on each floor | kubelet: carries out the actual orders on each node
Internal phone directory | kube-proxy: routes calls between departments (Services to Pods)
A patient room | Pod: the unit of scheduling and resource allocation
The patient and care team | Container: the actual workload running inside the pod

Hold that analogy. Every question below maps exactly to a part of that hospital, except the phone directory is iptables rules in the Linux kernel, and the records room runs Raft consensus. Let's go.

Q1: What Is etcd, and Why Would the Cluster Stop Working If It Went Down?

Most people say “etcd is a database for Kubernetes.” That is like saying “the hospital records room is a filing cabinet.” Technically accurate. Completely missing the point of why it matters at 3 AM.

etcd is a distributed key-value store that uses the Raft consensus algorithm. Every single resource in your Kubernetes cluster lives here and only here: every pod spec, every Deployment, every Secret, every ConfigMap, every RBAC policy, every node object, every custom resource. The API server stores nothing. The scheduler stores nothing. The controller manager stores nothing. etcd is the database. Everything else is logic on top of it.

Raft requires a quorum (a majority of members) to agree on any write before committing. With 3 etcd nodes, you need 2 to agree. Lose 2 nodes and quorum is gone: writes stop, and the cluster is at best readable from stale data. Lose all 3 and the API server cannot read current state either, except from its in-memory watch cache, which goes stale as soon as it stops receiving updates from etcd.
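The quorum arithmetic is worth seeing once in code. This is a minimal sketch of the majority rule only, not of Raft itself; the function names are illustrative.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority still exists."""
    return members - quorum(members)


for n in (1, 2, 3, 4, 5):
    print(f"{n} members: quorum={quorum(n)}, tolerated failures={tolerated_failures(n)}")
```

Note that 4 members tolerate the same single failure as 3, which is why etcd clusters are normally sized with an odd number of members.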

Here is the counterintuitive part: kubectl get pods can still work for a while even when etcd is down, because the API server maintains an in-memory watch cache of objects it has already loaded. Whether a given read is served from that cache depends on the request and the Kubernetes version, and you might get a stale response for a few minutes. But kubectl apply? Dead. Every write has to be committed to etcd. No etcd, no writes. The cluster is frozen. This is exactly the 2:17 AM scenario.
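A toy model makes the read/write asymmetry concrete. Everything here (ToyEtcd, ToyApiServer, EtcdUnavailable) is hypothetical and greatly simplified; the real kube-apiserver cache is fed by watches and has more nuanced read semantics.

```python
class EtcdUnavailable(Exception):
    """Raised by the toy store when it cannot accept requests."""


class ToyEtcd:
    def __init__(self):
        self.data = {}
        self.up = True

    def put(self, key, value):
        if not self.up:
            raise EtcdUnavailable("etcd is down: write rejected")
        self.data[key] = value


class ToyApiServer:
    """Writes go through to the store; reads are answered from a local cache."""

    def __init__(self, etcd):
        self.etcd = etcd
        self.watch_cache = {}

    def write(self, key, value):
        self.etcd.put(key, value)        # fails immediately if the store is down
        self.watch_cache[key] = value    # cache only learns about committed writes

    def read(self, key):
        return self.watch_cache.get(key)  # may be stale, but still answers


etcd = ToyEtcd()
api = ToyApiServer(etcd)
api.write("pods/web", "Running")
etcd.up = False                           # simulate the outage
print(api.read("pods/web"))               # cached value is still served
```

After the simulated outage, any further api.write call raises EtcdUnavailable while api.read keeps returning the last committed value.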

🚨 Interview Trap

The wrong answer is “the cluster would stop working immediately.” Bold but imprecise, which signals you have not actually operated this. The correct answer has two parts: writes stop immediately (nothing new can be created, updated, or deleted), but reads may still work for a short window via the API server's watch cache, and running pods keep running indefinitely because the kubelet is independent. That distinction is what the interviewer is probing for.

🧠 Memory Trick

etcd is the hospital records room. The API server is reception. If the records room burns down, reception can still answer questions from memory for a few minutes: “Mr Smith is in Ward 3, I just checked.” But they cannot admit new patients, update treatment plans, or sign any paperwork. The hospital is frozen in its last known state. That is etcd going down.

Q2: The API Server Is Stateless. What Does That Mean and Why Does It Matter?

The wrong answer is “stateless means it doesn't remember anything.” That walks right past the important part, which is the architectural consequence of statelessness in a distributed system under production load.

Stateless means the API server stores nothing of its own. Every write goes to etcd. Every read comes from etcd, or from the watch cache that mirrors etcd. The API server holds zero persistent state between requests. Shut it down, restart it, and it is functionally identical β€” because all state it needs lives in etcd.

The architectural consequence: you can run 3 API servers behind a load balancer and they are perfectly equivalent. Any request can go to any API server. No session affinity. No hot/cold primary. No split-brain risk. This is why HA control planes are simple to build: add API servers to the load balancer pool. That's it. Horizontal scaling with zero coordination overhead.

Compare this to etcd, which is stateful. etcd nodes know about each other, elect a leader, and coordinate writes through Raft. etcd has exactly one leader at a time. The API servers have no such concept: they are all leaders, all the time, all serving the same data from etcd.

🔥 Production Reality

In practice, API server memory grows with watch connections. Each kubectl logs -f, each kubectl get pods -w, each controller watch: these are all persistent HTTP connections to the API server. In a team of 50 engineers with aggressive CI/CD tooling, the API server can hold thousands of concurrent watch connections and use 2 to 4 GB of memory. “Stateless” does not mean “small”. It means the state it holds is transient and can be recreated from etcd on restart.

Q3: The Scheduler Picked a Node. Who Actually Starts the Container?

This question separates people who have memorized a diagram from people who have traced the actual sequence of events. The wrong answer: “the scheduler starts the container.” The scheduler has never started a container in its life. It is a bed assignment coordinator. It decides where. It does not move anyone.

The kubelet starts the container. Here is the exact sequence:

  1. The scheduler selects a node and calls the API server to write spec.nodeName on the pod object. The scheduler's involvement ends here.
  2. The kubelet on that node is watching the API server for pods assigned to its node. It sees the newly bound pod within milliseconds.
  3. The kubelet reads the pod spec and calls the container runtime (containerd or CRI-O) through the Container Runtime Interface (CRI) to pull the image if not already cached and start the container.
  4. The container runtime starts the container process, sets up Linux namespaces (network, PID, mount, UTS), and applies cgroup resource limits.
  5. The CNI plugin runs as part of step 3, invoked by the container runtime when it creates the pod sandbox, to set up pod networking: typically creating the veth pair, assigning the pod IP, and wiring up the bridge or overlay routing.
  6. The kubelet begins polling liveness and readiness probes and writes pod status updates back to the API server.

The scheduler's job ends with the binding decision. Everything that follows is the kubelet.


  Pod created with no spec.nodeName
          │
          ▼
  ┌──────────────────────────────────────────────────────────────────┐
  │                          kube-scheduler                          │
  │                                                                  │
  │  PHASE 1: FILTER (Predicates)                                    │
  │  ├─ NodeResourcesFit  - enough CPU + memory?                     │
  │  ├─ NodeAffinity      - nodeSelector / affinity rules?           │
  │  ├─ TaintToleration   - pod tolerates node taints?               │
  │  ├─ PodTopologySpread - topology spread constraints met?         │
  │  └─ VolumeBinding     - required PVs available on node?          │
  │          │                                                       │
  │          ▼  (feasible nodes only)                                │
  │                                                                  │
  │  PHASE 2: SCORE (Priorities)                                     │
  │  ├─ LeastAllocated    - spread load across nodes (default)       │
  │  ├─ InterPodAffinity  - co-locate or anti-affinity               │
  │  ├─ ImageLocality     - prefer nodes with image already pulled   │
  │  └─ NodeResourcesBalancedAllocation - balanced CPU/memory ratio  │
  │          │                                                       │
  │          ▼  (winning node selected)                              │
  │                                                                  │
  │  BIND: writes pod.spec.nodeName via API server                   │
  │  Scheduler's job ends here. Kubelet takes over.                  │
  └──────────────────────────────────────────────────────────────────┘
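The two phases in the diagram above can be sketched in a few lines of Python. This is a simplified illustration, not the scheduler's code: it models only a resource-fit filter and a least-allocated style score, and the Node and Pod classes, field names, and example numbers are all assumptions made up for the sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    cpu_total_m: int     # allocatable CPU in millicores
    cpu_free_m: int      # CPU not yet requested by other pods
    mem_total_mi: int    # allocatable memory in MiB
    mem_free_mi: int     # memory not yet requested by other pods


@dataclass
class Pod:
    name: str
    cpu_request_m: int
    mem_request_mi: int


def filter_nodes(pod: Pod, nodes: List[Node]) -> List[Node]:
    """Phase 1: drop nodes that cannot fit the pod's requests."""
    return [
        n for n in nodes
        if n.cpu_free_m >= pod.cpu_request_m and n.mem_free_mi >= pod.mem_request_mi
    ]


def least_allocated_score(pod: Pod, node: Node) -> float:
    """Phase 2: 0-100 score; more free capacity after placement scores higher."""
    cpu_left = (node.cpu_free_m - pod.cpu_request_m) / node.cpu_total_m
    mem_left = (node.mem_free_mi - pod.mem_request_mi) / node.mem_total_mi
    return 100 * (cpu_left + mem_left) / 2


def schedule(pod: Pod, nodes: List[Node]) -> Optional[str]:
    """Return the chosen node name, or None if the pod would stay Pending."""
    feasible = filter_nodes(pod, nodes)
    if not feasible:
        return None
    best = max(feasible, key=lambda n: least_allocated_score(pod, n))
    return best.name


nodes = [
    Node("node-a", cpu_total_m=4000, cpu_free_m=200, mem_total_mi=8192, mem_free_mi=512),
    Node("node-b", cpu_total_m=4000, cpu_free_m=3000, mem_total_mi=8192, mem_free_mi=6000),
    Node("node-c", cpu_total_m=4000, cpu_free_m=1000, mem_total_mi=8192, mem_free_mi=2000),
]
print(schedule(Pod("web", cpu_request_m=500, mem_request_mi=256), nodes))
```

Here node-a is filtered out for lack of CPU, and node-b beats node-c on score because it keeps more headroom after placement. Returning only a node name mirrors the point of this section: the scheduler's output is a decision, not a running container.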

🧠 Memory Trick

The scheduler is the bed assignment coordinator. It looks at all available beds, picks the best one, and writes it on the patient's chart. The ward nurse (kubelet) on that floor reads the chart, makes the bed, gets the IV running, and calls the pharmacist (container runtime) to bring the medication. The scheduler has moved on to the next patient before the nurse has even started.

⚡ Pro Tip

When a pod is stuck in Pending, most engineers check node capacity manually. The right move: always run kubectl describe pod <name> first. The Events section gives you the scheduler's exact reason in one line: “0/5 nodes are available: 3 Insufficient cpu, 2 node(s) had untolerated taint.” That single line replaces 15 minutes of manual investigation.
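If you collect these events in tooling, the message is easy to break apart. The sketch below assumes the simple comma-separated shape of the example message quoted above; real scheduler messages can carry extra detail, so treat the parser (and the name parse_scheduling_failure) as illustrative only.

```python
import re


def parse_scheduling_failure(message: str):
    """Split a 'N/M nodes are available: ...' message into counts per reason."""
    head = re.match(r"(\d+)/(\d+) nodes are available: (.*)", message.strip())
    if head is None:
        raise ValueError("unrecognized scheduler message")
    available, total, rest = int(head.group(1)), int(head.group(2)), head.group(3)
    reasons = {}
    for part in rest.rstrip(".").split(", "):
        m = re.match(r"(\d+) (.+)", part)
        if m:
            reasons[m.group(2)] = int(m.group(1))
    return available, total, reasons


msg = "0/5 nodes are available: 3 Insufficient cpu, 2 node(s) had untolerated taint."
print(parse_scheduling_failure(msg))
```

The per-reason counts add up to the total node count in this example, which is a quick sanity check that every node was accounted for.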

Q4: How Does the Kubelet Know What Pods It Should Be Running?

The wrong answer is “it polls the API server.” Polling is a system from 2003. Kubernetes does not poll. Polling is slow, noisy, and scales linearly with cluster size. It would also mean pod startup latency is bounded by the poll interval, which is not what you observe.

The kubelet watches the API server using a long-lived HTTP connection. This is the watch mechanism: a ?watch=true query parameter on the pod resource, filtered with a field selector on spec.nodeName=this-node. The API server holds that connection open and streams events down it in real time: pod added, pod modified, pod deleted.

When a new pod is scheduled to the kubelet's node, the kubelet receives the event immediately: not on the next poll interval, but within milliseconds of the API server writing the change to etcd. The kubelet then runs its reconciliation loop: “what should be running according to the API server” versus “what is actually running on this node according to the container runtime.” Any delta triggers action.
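The shape of that event stream can be modeled without a cluster. The sketch below is an in-memory stand-in, not the Kubernetes client: the Event tuple and the watch_pods and apply_events helpers are hypothetical, though ADDED, MODIFIED, and DELETED are the event types a real watch delivers.

```python
from typing import Iterable, Iterator, NamedTuple, Set


class Event(NamedTuple):
    type: str        # "ADDED", "MODIFIED", or "DELETED"
    pod: str
    node_name: str   # stands in for the pod's spec.nodeName


def watch_pods(events: Iterable[Event], node_name: str) -> Iterator[Event]:
    """Yield only the events for pods bound to this node, as they arrive."""
    for event in events:
        if event.node_name == node_name:
            yield event


def apply_events(events: Iterable[Event]) -> Set[str]:
    """Fold a stream of events into the set of pods the node should run."""
    desired: Set[str] = set()
    for event in events:
        if event.type in ("ADDED", "MODIFIED"):
            desired.add(event.pod)
        elif event.type == "DELETED":
            desired.discard(event.pod)
    return desired


stream = [
    Event("ADDED", "web-1", "node-a"),
    Event("ADDED", "web-2", "node-b"),
    Event("ADDED", "db-1", "node-a"),
    Event("DELETED", "web-1", "node-a"),
]
print(apply_events(watch_pods(stream, "node-a")))
```

The generator never asks "anything new?"; it reacts to whatever the stream pushes, which is the difference between watching and polling.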

This reconciliation loop is not specific to the kubelet. It is the core design pattern of every controller in Kubernetes. Desired state lives in etcd. Actual state is observed from the real world. The controller closes the gap. When the gap is zero, nothing happens. When it opens (pod died, deployment updated, node went down), the controller acts.


  ┌───────────────────────────────────────────────────────────────┐
  │         Controller Reconciliation Loop (every controller)     │
  │                                                               │
  │   ┌────────────┐  Watch   ┌──────────────────────────────┐    │
  │   │ API Server │ ───────▶ │   Informer / Work Queue      │    │
  │   └────────────┘          └──────────────┬───────────────┘    │
  │                                          │                    │
  │                                          ▼                    │
  │                             ┌─────────────────────────┐       │
  │                             │   Reconcile()           │       │
  │                             │                         │       │
  │                             │   desired ← spec        │       │
  │                             │   actual  ← status      │       │
  │                             │                         │       │
  │                             │   if desired != actual: │       │
  │                             │     take action         │       │
  │                             └────────────┬────────────┘       │
  │                                          │                    │
  │                             ┌────────────▼────────────┐       │
  │                             │  API Server (write)     │       │
  │                             │  Update status          │       │
  │                             └─────────────────────────┘       │
  └───────────────────────────────────────────────────────────────┘
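The loop in the diagram above reduces to a pure function: compare desired with actual and emit the actions that close the gap. This sketch uses a replica count as the example; the action tuples and the name reconcile are illustrative, not a real controller API.

```python
from typing import List, Optional, Tuple

Action = Tuple[str, Optional[str]]   # ("create", None) or ("delete", pod_name)


def reconcile(desired_replicas: int, running_pods: List[str]) -> List[Action]:
    """Return the actions needed to move actual state toward desired state."""
    gap = desired_replicas - len(running_pods)
    if gap > 0:
        return [("create", None)] * gap                    # too few: create the difference
    if gap < 0:
        return [("delete", name) for name in running_pods[:-gap]]  # too many: delete the surplus
    return []                                              # gap is zero: nothing to do


print(reconcile(3, ["web-1"]))                    # two creates
print(reconcile(1, ["web-1", "web-2", "web-3"]))  # two deletes
print(reconcile(2, ["web-1", "web-2"]))           # no actions
```

Because the function only looks at the current gap, running it again after a partial failure is safe: it recomputes what is still missing instead of replaying a script.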

🚨 Interview Trap

“The kubelet polls etcd directly” sounds plausible and is completely wrong. No component talks to etcd except the API server. This is by design: one security boundary, one audit log, one place to enforce RBAC. The kubelet talks to the API server. The API server talks to etcd. This hub-and-spoke architecture is not accidental. It is a security decision that every senior engineer should be able to articulate.

Q5: If the Entire Control Plane Goes Down, Do Running Pods Die?

This is the question that earns the offer or ends the interview, depending on which side of the answer you land on.

The wrong answer: “yes, pods would die without the control plane.” That sounds reasonable. It is completely wrong. And if you lived through the 2:17 AM incident at the top of this article, you already know why.

No. Running pods do not die when the control plane goes down. The kubelet is independent. It is a systemd service on each worker node. It does not need the API server to keep containers running. It needs the API server to receive new instructions, but once a pod is running, the kubelet manages it locally.

What actually breaks when the control plane disappears:

  • No new pods can be scheduled. The scheduler is gone.
  • No deployments, no scaling, no config changes. The controller manager is gone.
  • No kubectl anything. The API server is gone.
  • No new Service endpoint updates. The endpoint controller is gone.

What keeps working:

  • Running pods keep running. kubelet is local and independent of the control plane.
  • Health checks keep firing. Liveness and readiness probes still execute on their configured intervals.
  • Container restarts still happen. If a process crashes, the kubelet restarts it according to restartPolicy.
  • Traffic to existing pods keeps flowing. kube-proxy already wrote the iptables rules into the Linux kernel. The kernel does not care that the control plane is gone.
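The restart decision in the list above is purely local, which is why it survives a control plane outage. A minimal sketch, assuming the documented restartPolicy values (Always, OnFailure, Never) and the commonly documented crash-loop back-off of 10 seconds doubling up to a 5 minute cap; exact timings can differ by version and configuration, and the function names are illustrative.

```python
def should_restart(restart_policy: str, exit_code: int) -> bool:
    """Decide locally whether an exited container gets restarted."""
    if restart_policy == "Always":
        return True
    if restart_policy == "OnFailure":
        return exit_code != 0
    if restart_policy == "Never":
        return False
    raise ValueError(f"unknown restartPolicy: {restart_policy}")


def backoff_seconds(restart_count: int, base: int = 10, cap: int = 300) -> int:
    """Exponential back-off before the next restart, capped at `cap` seconds."""
    return min(base * 2 ** restart_count, cap)


print(should_restart("OnFailure", exit_code=1))
print([backoff_seconds(n) for n in range(7)])
```

Nothing in either function needs the API server, only the pod spec the kubelet already holds and the exit code reported by the container runtime.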

The cluster is frozen in amber. Everything that was running keeps running. Nothing new can happen. This is exactly the 2:17 AM incident. The applications were fine for 47 minutes. The platform was frozen and nobody noticed.

🔥 Production Reality

The practical implication: during a control plane outage, your SLO may still be met for existing traffic, but your ability to respond to incidents is zero. You cannot roll back a bad deploy. You cannot scale under a traffic spike. You cannot change a secret. A 15-minute control plane outage during normal load may be invisible to users. The same outage during a traffic spike is a prolonged incident you cannot mitigate. This is why HA control planes with 3+ nodes are non-negotiable for production, not a nice-to-have that you get to when there is time.

The Full Architecture Diagram

Now that the five questions are answered, the diagram means something. Every box is a component you can inspect, a process you can restart, a metric you can alert on. Every arrow is a watch connection or an API call, not magic.


┌───────────────────────────────────────────────────────────────────────────────┐
│                              KUBERNETES CLUSTER                               │
│                                                                               │
│  ┌───────────────────────────── CONTROL PLANE ─────────────────────────────┐  │
│  │                                                                         │  │
│  │  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐      │  │
│  │  │  kube-apiserver │    │  kube-scheduler │    │  kube-ctrl-mgr  │      │  │
│  │  │                 │    │                 │    │                 │      │  │
│  │  │  REST / gRPC    │    │  Filter+Score   │    │  20+ controllers│      │  │
│  │  │  Auth / RBAC    │    │  Binding write  │    │  Leader elect   │      │  │
│  │  │  Admission      │    │                 │    │                 │      │  │
│  │  │  Watch/Notify   │    │                 │    │                 │      │  │
│  │  └────────┬────────┘    └────────┬────────┘    └────────┬────────┘      │  │
│  │           │                      │ watch                │ watch         │  │
│  │           │◄─────────────────────┴──────────────────────┘               │  │
│  │           │  (only component that talks to etcd)                        │  │
│  │  ┌────────▼────────┐    ┌────────────────────────────────────────┐      │  │
│  │  │      etcd       │    │      cloud-controller-manager          │      │  │
│  │  │                 │    │                                        │      │  │
│  │  │  Raft consensus │    │  LB provisioning, Node lifecycle       │      │  │
│  │  │  Distributed KV │    │  Pluggable per cloud provider          │      │  │
│  │  │  SSD required!  │    │                                        │      │  │
│  │  └─────────────────┘    └────────────────────────────────────────┘      │  │
│  └─────────────────────────────────────────────────────────────────────────┘  │
│                                                                               │
│  ┌────────── WORKER NODE 1 ──────────┐  ┌────────── WORKER NODE 2 ──────────┐ │
│  │                                   │  │                                   │ │
│  │  ┌──────────┐  ┌───────────────┐  │  │  ┌──────────┐  ┌───────────────┐  │ │
│  │  │  kubelet │  │  kube-proxy   │  │  │  │  kubelet │  │  kube-proxy   │  │ │
│  │  │          │  │               │  │  │  │          │  │               │  │ │
│  │  │ CRI call │  │ iptables/IPVS │  │  │  │ CRI call │  │ iptables/IPVS │  │ │
│  │  │ Probes   │  │ Service VIPs  │  │  │  │ Probes   │  │ Service VIPs  │  │ │
│  │  └────┬─────┘  └───────────────┘  │  │  └────┬─────┘  └───────────────┘  │ │
│  │       │                           │  │       │                           │ │
│  │  ┌────▼──────────────────────┐    │  │  ┌────▼──────────────────────┐    │ │
│  │  │  Container Runtime (CRI)  │    │  │  │  Container Runtime (CRI)  │    │ │
│  │  │  containerd / CRI-O       │    │  │  │  containerd / CRI-O       │    │ │
│  │  └───────────────────────────┘    │  │  └───────────────────────────┘    │ │
│  │                                   │  │                                   │ │
│  │  ┌───────────────────────────┐    │  │  ┌───────────────────────────┐    │ │
│  │  │  CNI (Calico / Cilium)    │    │  │  │  CNI (Calico / Cilium)    │    │ │
│  │  └───────────────────────────┘    │  │  └───────────────────────────┘    │ │
│  │                                   │  │                                   │ │
│  │  Pods: [app-a] [app-b]            │  │  Pods: [app-c] [app-d]            │ │
│  └───────────────────────────────────┘  └───────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────────────────────┘

The key structural insight: the API server is the only component that talks to etcd. Everything else (scheduler, controller manager, kubelet, kube-proxy) talks to the API server. This single-hub design gives you one security boundary, one audit log, one place to enforce RBAC. All distributed consistency comes from etcd. All control logic comes from the control plane components. All execution happens on the nodes. The layering is clean and intentional.

Component Deep Dives

Controller Manager: Twenty-Plus Reconciliation Loops in One Binary

The controller manager is one process running many controllers. Each controller watches specific resource types and acts to close the gap between desired and actual state. The built-in controllers include:

  • ReplicaSet controller: ensures the correct number of pod replicas exist at all times
  • Deployment controller: manages rolling updates by creating and scaling ReplicaSets
  • StatefulSet controller: manages ordered, stable pod identity for stateful workloads
  • Job and CronJob controllers: run batch workloads to completion, on a schedule for CronJobs
  • Node controller: marks nodes unreachable after heartbeat timeout, evicts pods from dead nodes
  • Endpoints and EndpointSlice controllers: update Endpoints and EndpointSlices when pod readiness changes
  • Namespace controller: cleans up resources when namespaces are deleted
  • ServiceAccount controller: creates default service accounts in new namespaces
  • PersistentVolume controller: binds PVCs to available PVs

In an HA setup, multiple controller manager replicas run simultaneously, one per control plane node. Only one is the leader at any time. Leadership is implemented via a Lease object in kube-system. The leader renews it every few seconds. If it fails to renew, a follower acquires the lease and becomes the new leader. Failover takes up to 15 seconds by default. During that window, no reconciliation actions happen, but existing workloads keep running, and the backlog clears as soon as the new leader starts processing.
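The lease mechanics can be sketched as a small state machine. This is a toy model assuming the 15 second lease duration mentioned above; the real implementation also relies on optimistic concurrency in the API server so that two candidates cannot both win, which this sketch leaves out. The Lease class and try_acquire_or_renew function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    duration_seconds: float = 15.0
    holder: Optional[str] = None
    renew_time: float = 0.0


def try_acquire_or_renew(lease: Lease, candidate: str, now: float) -> bool:
    """Renew if we already hold the lease; take over only if it has expired."""
    expired = lease.holder is None or now - lease.renew_time > lease.duration_seconds
    if lease.holder == candidate or expired:
        lease.holder = candidate
        lease.renew_time = now
        return True
    return False


lease = Lease()
print(try_acquire_or_renew(lease, "cm-a", now=0))    # cm-a becomes leader
print(try_acquire_or_renew(lease, "cm-b", now=5))    # lease still fresh: cm-b stays follower
print(try_acquire_or_renew(lease, "cm-a", now=10))   # leader renews
print(try_acquire_or_renew(lease, "cm-b", now=26))   # no renewal for 16s: cm-b takes over
```

The gap between the last renewal and the takeover is the failover window during which no controller is reconciling.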

⚡ Pro Tip

Do not set CPU limits on the controller manager. CPU limits cause throttling, not termination. A throttled controller manager means reconciliation lag: deployments that appear created in the API server but whose pods take minutes to materialize. Set CPU requests for node scheduling purposes, but leave CPU limits unset. Memory limits are fine and prevent a runaway controller from taking down the whole control plane node.
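The throttling arithmetic is simple enough to check by hand. On Linux, a CPU limit is enforced as a quota of CPU time per scheduling period, and the default period is 100 ms. The sketch below assumes that default and a single-threaded burst of work; the function names are illustrative.

```python
import math

CFS_PERIOD_US = 100_000  # default enforcement period: 100 ms


def cfs_quota_us(cpu_limit_millicores: int) -> int:
    """CPU time the container may use per period, in microseconds."""
    return cpu_limit_millicores * CFS_PERIOD_US // 1000


def periods_needed(cpu_time_needed_us: int, cpu_limit_millicores: int) -> int:
    """How many periods a burst of CPU work is spread across under the limit."""
    return math.ceil(cpu_time_needed_us / cfs_quota_us(cpu_limit_millicores))


print(cfs_quota_us(500))             # quota per 100 ms period for a 500m limit
print(periods_needed(150_000, 500))  # periods needed for 150 ms of CPU work
```

With a 500m limit, a reconcile burst needing 150 ms of CPU is stretched across three periods, with the process sitting throttled for the rest of each one. That idle time is the reconciliation lag described above.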

Scheduler: Predicates, Priorities, and the Binding Decision

The scheduler is a single-purpose component with a surprisingly rich two-phase algorithm. When it finds a pod with no spec.nodeName, it runs across all candidate nodes:

Filter phase (predicates) eliminates nodes that cannot run the pod. Binary questions: does this node have 500m CPU available? Does the pod tolerate the node's taints? Does the node satisfy the pod's nodeAffinity rules? Is it in a Ready condition? Does it have the required volume topology? Nodes failing any predicate are eliminated. The result is a list of feasible nodes.
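The taint and toleration check is a good example of a binary filter. The sketch below follows the documented matching rules in simplified form (an empty effect on a toleration matches any effect, and the Exists operator ignores the value); the Taint and Toleration classes and the node_is_feasible helper are illustrative names, not the scheduler's types.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Taint:
    key: str
    value: str = ""
    effect: str = "NoSchedule"


@dataclass(frozen=True)
class Toleration:
    key: str = ""
    operator: str = "Equal"   # "Equal" or "Exists"
    value: str = ""
    effect: str = ""          # empty matches every effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """True if this single toleration matches this single taint."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.operator == "Exists":
        return toleration.key == "" or toleration.key == taint.key
    return toleration.key == taint.key and toleration.value == taint.value


def node_is_feasible(taints: List[Taint], tolerations: List[Toleration]) -> bool:
    """Filter check: every hard taint must be matched by some toleration."""
    hard = [t for t in taints if t.effect in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, taint) for tol in tolerations) for taint in hard)


control_plane = [Taint("node-role.kubernetes.io/control-plane", effect="NoSchedule")]
print(node_is_feasible(control_plane, []))
print(node_is_feasible(
    control_plane,
    [Toleration("node-role.kubernetes.io/control-plane", operator="Exists")],
))
```

An ordinary pod with no tolerations is filtered off the control plane node; only a pod that explicitly tolerates the taint passes.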

Score phase (priorities) ranks the feasible nodes. Each scoring plugin assigns a score from 0 to 100. The default scoring considers: remaining resource capacity (least allocated spreads workloads; most allocated bin-packs; you configure the policy), inter-pod affinity preferences, topology spread constraints, and image locality (preferring nodes that already have the container image cached). The winning node gets the pod via a Binding object written to the API server.

The scheduler is extensible via the Scheduling Framework. You can write plugins that hook into filter, score, reserve, permit, pre-bind, bind, and post-bind phases. Multiple schedulers can run in one cluster: pods specify their scheduler via spec.schedulerName. GPU workload teams often run a custom scheduler that accounts for GPU topology and memory bandwidth constraints the default scheduler cannot model.

Production Disasters

🔥 Disaster Story 1: etcd Disk Full, the Silent Cluster Death

The setup: Single-node etcd, stacked on the control plane node, sharing the same EBS volume as the OS and application logs. The cluster had been running for 14 months. etcd storage had grown slowly as the team added custom resources, Helm releases, and Secrets. Nobody had checked disk usage. Nobody had set up a disk utilization alert for the etcd volume.

What happened: At 11:42 AM on a Tuesday, etcd ran out of disk space. Writes began failing with no space left on device errors internally. The API server started returning 500 Internal Server Error on any mutating operation. kubectl apply, kubectl create, and kubectl delete all hung. But kubectl get pods still worked, because the API server watch cache was still serving reads. From the application monitoring dashboard: everything green.

What the team checked first: Application logs. Deployment status. Pod health. All green. Two hours passed before someone ran kubectl get pods -n kube-system and noticed the etcd pod was cycling through OOMKill states, which was actually the disk-full condition manifesting as a crash loop, not a straightforward disk alert.

The root cause: etcd's data directory at /var/lib/etcd had grown to fill a 30 GB EBS volume. OS logs were also growing on the same partition. etcd entered maintenance-only mode and refused all writes once the volume hit 100%.

The fix: Resized the EBS volume online. Cleared old OS logs. Ran etcd defragmentation: etcdctl defrag --endpoints=.... Cluster recovered in 8 minutes after the disk was cleared.

The lesson: etcd disk full looks like an application problem from the outside. Every new pod, every Helm release, every config change silently fails with a 500. Running workloads keep running. The symptom is “I can't deploy anything”, which is exactly how application-layer problems also present. Always check the control plane first when all deployments fail simultaneously. One command reveals it: kubectl exec etcd-X -n kube-system -- df -h /var/lib/etcd.
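The missing alert in this story is a few lines of code. A minimal sketch using only the Python standard library; the 80% threshold and the function names are assumptions for illustration, and in practice this check would live in your monitoring system rather than in a script.

```python
import shutil


def disk_usage_percent(path: str) -> float:
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def should_alert(percent_used: float, threshold: float = 80.0) -> bool:
    """Fire well before 100%, while there is still room to react."""
    return percent_used >= threshold


# On a control plane node you would point this at the etcd data directory,
# for example disk_usage_percent("/var/lib/etcd"). The current directory is
# used here only so the sketch runs anywhere.
percent = disk_usage_percent(".")
print(f"{percent:.1f}% used, alert={should_alert(percent)}")
```

Alerting on the filesystem that holds the data directory would also have caught the OS logs growing on the shared volume.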

🔥 Disaster Story 2: Control Plane Node Down During a Deployment

The setup: Single control plane node. The team had discussed HA for months but had not implemented it because the cluster was “working fine.” Peak traffic hours. A payment service deployment was 60% complete: some pods on the new version, some on the old.

What happened: The control plane node suffered a hardware fault, a memory error that caused a kernel panic. The API server vanished. The scheduler vanished. The controller manager vanished. The deployment was frozen at 60%, with two versions of the payment service simultaneously handling production traffic.

The blast radius: The new version had a data migration incompatibility with the old version. Mixed-version traffic caused transaction errors for 23 minutes, a window the team could not close because they had no way to roll back or scale either version. kubectl was completely unavailable. The cluster was a frozen, split-brain payment processor.

The resolution: The control plane node rebooted after the transient hardware error cleared. The API server came back. The deployment controller saw the stalled rollout and continued. Full recovery: 31 minutes from the hardware fault.

The lesson: A single control plane node is not a cost optimization. It is a deferred incident with a known detonation profile. The cost of a three-node control plane is trivially small compared to the blast radius of a deployment frozen mid-flight during peak traffic. The team had HA implemented two weeks later. Three etcd nodes. Three API server nodes. Load balancer in front. They have not had a control plane outage since.
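The choice of three nodes follows from majority-quorum arithmetic. Here is a small sketch of that math; the function names are hypothetical, and the formula is the standard Raft majority rule that etcd uses.

```shell
#!/usr/bin/env bash
# Majority quorum for an N-member etcd (Raft) cluster: floor(N/2) + 1.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Members that can fail while the remaining ones still form a majority.
fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

# Print the trade-off for common cluster sizes.
for n in 1 2 3 4 5; do
  echo "members=$n quorum=$(quorum "$n") tolerates=$(fault_tolerance "$n")"
done
```

Going from three members to four raises the quorum from 2 to 3 without tolerating any additional failure, which is why even-sized clusters buy nothing.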

The Wall of Shame

😅 Senior Engineer Confession

Every item on this list has been discovered in production, by engineers who knew better, at companies with mature infrastructure teams. Architecture knowledge does not prevent these. Checklists and post-mortems prevent these.
  1. Single etcd node in production. Running your only hospital records room in a tent. Fire season is approaching. A single etcd node means one disk failure, one kernel panic, one bad cloud availability zone day kills your entire cluster's ability to change state. Running pods survive. The platform is frozen. Your on-call is helpless. Three nodes. Always three nodes. The Raft quorum math is not a suggestion.
  2. No etcd backups. A hospital that doesn't keep patient records because “we've never needed them before.” etcd is the only stateful component in the entire control plane. Lose it without a backup and you lose the cluster: all Deployments, all Secrets, all RBAC bindings, all custom resources. Gone. The CronJob that snapshots to S3 every 6 hours takes 15 minutes to set up. Do it today. Then test the restore. A backup you have never restored is not a backup. It is a feeling.
  3. Running workloads on control plane nodes. The hospital administrator doing surgery between meetings. Control plane nodes run etcd, the API server, the scheduler, and the controller manager. These processes are sensitive to CPU and memory pressure. A noisy tenant workload causing memory pressure on a control plane node can trigger an OOMKill on the API server. That is a full control plane outage. Taint your control plane nodes. Keep workloads off them. The compute you save is not worth what it costs when the incident fires.
  4. Not setting resource requests on workloads. A patient room with no bed. It exists on paper. The nurse has nowhere to put anyone. The scheduler makes node placement decisions based on resource requests, not actual usage. A pod with no requests is scheduled as if it needs zero CPU and zero memory. The node fills up with “free” workloads until the kernel OOMKills something. Usually not your workload. Usually something important. Set requests. Always. Even rough approximations are far better than nothing.
  5. Storing secrets in ConfigMaps. Writing passwords on sticky notes and calling it “encrypted” because they are yellow. ConfigMaps are meant for non-sensitive configuration, and RBAC usually treats them that way: the built-in view role, for example, can read ConfigMaps but not Secrets. Kubernetes Secrets are also not encrypted by default (they are only base64-encoded), but they can be encrypted at rest with an EncryptionConfiguration, and their access is separately scoped via RBAC. The distinction matters. Use Secrets for secrets. Enable encryption at rest.
  6. Not using namespaces. A hospital with no departments. Pediatrics and surgery share a waiting room. The billing department accidentally prescribes antibiotics. Nobody can find anything. In a single-namespace cluster, RBAC becomes all-or-nothing, resource quotas cannot be scoped, and kubectl get pods returns 400 entries spanning monitoring, databases, CI runners, and your application. Namespaces are free. Use them deliberately.
  7. Default ServiceAccount with cluster-admin. Giving every visitor to the hospital a master key to the entire building “to save time at reception.” The default service account is automatically mounted into every pod that does not specify one. Binding that account to cluster-admin (a common shortcut when debugging RBAC) means every pod in that namespace can read and modify any resource in the entire cluster. One compromised container becomes a cluster-admin. Least-privilege RBAC is not bureaucracy. It is blast radius reduction.
  8. No Horizontal Pod Autoscaler on variable-load workloads. A hospital that sets staffing levels in January and never adjusts for flu season. Static replica counts mean you are either over-provisioned (expensive) or under-provisioned (your 9 AM Monday deployment gets hammered before anyone scales manually). HPA reacts to actual CPU and memory usage and scales automatically within a configured range. The metric you choose to scale on matters: CPU works for CPU-bound workloads, but request rate via KEDA or custom metrics is more responsive for I/O-bound services.
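To make item 8 concrete, here is a rough sketch of the core HPA scaling rule: desired replicas are the current replicas scaled by the ratio of observed metric to target metric, rounded up, then clamped to the configured range. The function name is hypothetical, the inputs are assumed to be positive integers (for example CPU utilization in percent), and the real controller also applies a tolerance band and stabilization windows that this sketch leaves out.

```shell
#!/usr/bin/env bash
# desired = ceil(current_replicas * current_metric / target_metric), clamped to [min, max].
# Arguments are positive integers, e.g. CPU utilization in percent.
hpa_desired_replicas() {
  local current="$1" metric="$2" target="$3" min="$4" max="$5"
  # Integer ceiling division: (a + b - 1) / b
  local desired=$(( (current * metric + target - 1) / target ))
  if [ "$desired" -lt "$min" ]; then desired="$min"; fi
  if [ "$desired" -gt "$max" ]; then desired="$max"; fi
  echo "$desired"
}

# 4 replicas at 90% CPU against a 60% target, range 2..10
hpa_desired_replicas 4 90 60 2 10
```

The same arithmetic explains why a badly chosen target causes flapping: a target far below normal utilization keeps the ratio above 1 and pins the workload at its maximum.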

Production Best Practices

  1. Three etcd nodes minimum in production. Odd numbers only. Five for large or critical clusters. Raft quorum math does not care about your budget justification; it only cares about majority.
  2. External etcd on dedicated SSD storage for clusters with 100+ nodes. Stacked etcd is fine for smaller clusters. At scale, etcd on dedicated io2 volumes with provisioned IOPS eliminates a whole category of latency-induced leader elections and disk contention incidents.
  3. Automate etcd backups and test the restore. Snapshot every 6 hours. Upload to S3 or equivalent. Run a restore drill in a non-production cluster at minimum quarterly. A backup you have never restored is a feeling of security, not actual security.
  4. Taint control plane nodes to prevent workload scheduling. node-role.kubernetes.io/control-plane:NoSchedule is set by kubeadm automatically. Verify it is present. Never remove it without a plan for what replaces the isolation it provides.
  5. Set memory limits on control plane components, but not CPU limits. Memory limits prevent OOMKill cascades from taking down the whole control plane node. CPU limits cause throttling that manifests as mysterious reconciliation lag. Set requests for both; limits only for memory.
  6. Monitor etcd disk latency and volume usage. Alert when WAL fsync p99 exceeds 100ms. Alert at 70% disk utilization on the etcd volume. Do not wait for 100%: compaction and defragmentation need headroom to run without contending with live writes.
  7. Load balancer in front of API servers for HA setups. All kubelets and workers should talk to the load balancer address, not individual API server IPs. Use a VIP or cloud LB. Bake the controlPlaneEndpoint into the kubeadm config at cluster bootstrap, because changing it later requires certificate regeneration.
  8. Rotate control plane certificates before expiry. kubeadm certs expire after 1 year by default. kubeadm certs check-expiration shows all expiry dates. Set a calendar reminder for 30 days before expiry. Run kubeadm certs renew all in a maintenance window, or configure automatic rotation. Certificate expiry is a 100% avoidable outage that still happens every year across the industry.
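The 30-day reminder in item 8 can be automated rather than left to a calendar. This is a minimal sketch under stated assumptions: the function names are hypothetical, the default 30-day window mirrors the advice above, and the commented usage assumes GNU date and the kubeadm default certificate path that appears elsewhere in this article.

```shell
#!/usr/bin/env bash
# Whole days between two Unix timestamps (expiry minus now), truncated toward zero.
days_until_expiry() {
  local expiry_epoch="$1" now_epoch="$2"
  echo $(( (expiry_epoch - now_epoch) / 86400 ))
}

# Succeed (exit 0) when the certificate is inside the reminder window.
# The window defaults to 30 days.
needs_renewal() {
  local days_left="$1" window_days="${2:-30}"
  [ "$days_left" -le "$window_days" ]
}

# Example on a control plane node (GNU date assumed):
#   end=$(openssl x509 -enddate -noout -in /etc/kubernetes/pki/etcd/server.crt | cut -d= -f2)
#   left=$(days_until_expiry "$(date -d "$end" +%s)" "$(date +%s)")
#   needs_renewal "$left" && echo "renew soon: $left days left"
```

Keeping the arithmetic separate from the openssl call makes the threshold logic easy to check without a real certificate on hand.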

FAQ

Does the kubelet run on control plane nodes?

Yes, in kubeadm-deployed clusters. The kubelet on each control plane node manages the static pods for the control plane components themselves: kube-apiserver, kube-scheduler, kube-controller-manager, and etcd are all static pods that the kubelet starts from manifests in /etc/kubernetes/manifests/. If the kubelet dies on a control plane node, all control plane components on that node die with it. Monitor the kubelet service on control plane nodes just as closely as the components it manages.

Can I run Kubernetes without etcd?

Not with the standard upstream API server. etcd is the only supported backend store. k3s uses an embedded SQLite or an external relational database via a compatibility shim, but it is a different distribution with different operational characteristics. For standard Kubernetes, etcd is not optional: it is the foundation the entire control plane is built on.

What is the difference between a Deployment and a ReplicaSet?

A ReplicaSet ensures N copies of a pod are running. A Deployment manages ReplicaSets: it creates a new ReplicaSet for each rollout and scales the old one down as the new one scales up. This gives you rolling updates and rollback history. You almost never create ReplicaSets directly. Create Deployments and let the Deployment controller manage the ReplicaSets on your behalf.

How does the API server handle authentication without a user database?

It does not maintain a user database. It supports multiple authentication strategies simultaneously: client certificates (X.509 certs signed by the cluster CA), bearer tokens (ServiceAccount JWTs, OIDC tokens from an external identity provider), and webhook authentication (the API server calls an external service and receives a pass/fail decision). If any strategy succeeds, the request is authenticated. Most production clusters use OIDC tokens from an identity provider (Okta, Azure AD, Google) for human users and ServiceAccount tokens for machine-to-machine access.

What is the cloud controller manager and do I need it?

The cloud controller manager handles cloud-provider-specific logic: provisioning load balancers when you create a LoadBalancer Service, programming cloud network routes for pod CIDR blocks, and syncing node objects with the cloud provider's instance lifecycle events. On EKS, GKE, or AKS it is already running and managed for you. On bare metal or with a non-standard cloud provider, you either do not need it or you install the relevant community provider-specific version.

🎀 The 60-Second Interview Answer

Back in the interview room. Five questions answered. Here is how you deliver the complete answer, covering the naive path and the architectural depth that gets you the offer:

🎀 Say This Out Loud Until You Own It

“A Kubernetes cluster has two planes: the control plane, which makes decisions, and the data plane (the worker nodes), which executes them. The control plane has four main components. etcd is the distributed key-value store where all cluster state lives: every pod spec, every secret, every deployment definition. The API server is a stateless REST gateway in front of etcd: it handles authentication, RBAC authorization, and admission control, then persists or reads from etcd. No other component talks to etcd directly. Everything goes through the API server.

The scheduler watches for pods with no node assigned, runs a two-phase filter-and-score algorithm across candidate nodes, and writes the assignment back through the API server. The controller manager is a collection of reconciliation loops (ReplicaSet, Deployment, Node, Endpoint controllers), each watching the API server for drift between desired and actual state and closing the gap.

On each worker node, the kubelet watches the API server for pods assigned to its node via a long-lived HTTP watch connection, not polling. When it sees a new pod, it calls the container runtime (containerd or CRI-O) via the CRI to pull images and start containers. The kubelet does the execution. The scheduler only makes the placement decision. kube-proxy programs iptables or IPVS rules on each node to implement the Service virtual IP abstraction.

Here is the counterintuitive part: if the entire control plane goes down, running pods do not die. The kubelet is a local systemd service. It keeps managing its pods independently. What breaks is every decision that requires the control plane: no scheduling, no deployments, no scaling, no kubectl. The cluster is frozen in its last known state: not dead, frozen.

Critical production detail: etcd needs dedicated SSD storage and three nodes for quorum. A full etcd disk causes the API server to return 500s on all writes while reads still work from cache. It looks like an application problem for 45 minutes until someone checks etcd disk usage. That is the architecture gotcha that separates people who have operated this from people who have only read about it.”

If you can say that in one breath, you're getting the job.

etcd and control plane health: the commands you reach for first
# ── ETCD HEALTH CHECKS ──────────────────────────────────────────────────────

# Endpoint health: run from inside the etcd pod
kubectl exec -it etcd-<node-name> -n kube-system -- sh -c \
  "ETCDCTL_API=3 etcdctl endpoint health \
   --endpoints=https://127.0.0.1:2379 \
   --cacert=/etc/kubernetes/pki/etcd/ca.crt \
   --cert=/etc/kubernetes/pki/etcd/server.crt \
   --key=/etc/kubernetes/pki/etcd/server.key"

# Member list: all members with their peer and client URLs
# (leader status is shown by `etcdctl endpoint status -w table`, not by member list)
kubectl exec -it etcd-<node-name> -n kube-system -- sh -c \
  "ETCDCTL_API=3 etcdctl member list \
   --endpoints=https://127.0.0.1:2379 \
   --cacert=/etc/kubernetes/pki/etcd/ca.crt \
   --cert=/etc/kubernetes/pki/etcd/server.crt \
   --key=/etc/kubernetes/pki/etcd/server.key \
   -w table"

# Performance check: runs a write load test against the cluster, so use it with care.
# etcd's guidance is disk fsync p99 under 10ms.
kubectl exec -it etcd-<node-name> -n kube-system -- sh -c \
  "ETCDCTL_API=3 etcdctl check perf \
   --endpoints=https://127.0.0.1:2379 \
   --cacert=/etc/kubernetes/pki/etcd/ca.crt \
   --cert=/etc/kubernetes/pki/etcd/server.crt \
   --key=/etc/kubernetes/pki/etcd/server.key"

# Disk usage: a full etcd disk is a silent cluster death
kubectl exec -it etcd-<node-name> -n kube-system -- df -h /var/lib/etcd

# ── CONTROL PLANE HEALTH ─────────────────────────────────────────────────────

# All control plane pods in kube-system
kubectl get pods -n kube-system -o wide | grep -E 'apiserver|scheduler|controller|etcd'

# Who holds the scheduler and controller-manager leader lease?
kubectl get lease -n kube-system

# ── WHEN KUBECTL ITSELF DOESN'T WORK ─────────────────────────────────────────

# SSH to a control plane node and use crictl directly
crictl ps | grep -E 'apiserver|scheduler|controller|etcd'
crictl logs <container-id>

# Check the kubelet: it manages the static pods
systemctl status kubelet
journalctl -u kubelet -f --since "5 minutes ago"

Key Takeaways

  • etcd is the only stateful component in the control plane. Every other component can be recreated from etcd. Lose etcd without a backup and the cluster is gone.
  • The API server is stateless: it stores nothing. This is why you can scale it horizontally behind a load balancer with zero coordination overhead.
  • The scheduler decides where. The kubelet executes. These are completely separate processes with separate failure modes and separate blast radii.
  • The kubelet watches the API server via long-lived connections, not polling. No component except the API server talks to etcd directly.
  • Running pods survive a complete control plane outage. The kubelet is independent. The cluster is frozen in amber, not dead.
  • A full etcd disk is a silent cluster death. Reads work from watch cache. Writes fail. It looks like an application problem. Check etcd disk usage first, always.
