PagerDuty Alert
It's 2:17 AM.
You check the cluster:
✓ Deployment shows 3/3 pods Running
✓ Load balancer is healthy
✓ Pods are Running
✗ Endpoints: empty
Users are getting 503s. The load balancer has no backends.
You check one pod. Running. You check its readiness. Not Ready.
The readinessProbe is failing. Silently. The app started 4 seconds ago. The database connection pool hasn't warmed up yet. initialDelaySeconds: 0. The probe fired immediately. Failed. Pod got marked Not Ready. Traffic stopped.
The pod is still Running. Nobody got an alert about it.
You don't know why. You're about to.
Three Months Later. A Different Kind of War Room.
No PagerDuty this time. Fluorescent lights. A senior SRE interview loop. The interviewer, calm and unhurried, writes one question on the whiteboard:
"Walk me through a pod's lifecycle from creation to termination."
You take a breath. You've lived this at 2 AM. You know every state. You start drawing.
The Answer That Gets 80% of Candidates Eliminated
Most engineers draw the same thing. Confident. Correct. Incomplete.
The interviewer nods. Writes something down. Then looks up.
The interviewer keeps going:
- "What is the difference between Running and Ready?"
- "My pod is stuck in Pending. What are the three most likely causes?"
- "What is an Init container and how is it different from a sidecar?"
- "What happens after kubectl delete pod, step by step?"
- "My pod keeps restarting. CrashLoopBackOff. The logs show nothing. Why?"
Five questions. Most candidates get through one and a half before the interviewer knows. Engineers who answer all five, with specifics, with failure modes, with production war stories, walk out with the offer.
Before we go deep, here's the mental model that makes every state click immediately. Think of a pod as a new employee starting at a company.
| New Employee World | Kubernetes Pod World |
|---|---|
| Hired but not yet onboarded (waiting for a desk, equipment, access) | Pending: accepted by the API server, but not yet scheduled or containers not yet started |
| First day setup (getting their laptop imaged) | ContainerCreating: image pull, CNI setup, volume mount |
| Security clearance and onboarding training (must complete before work) | Init containers: run sequentially to completion first |
| Badge activated, at their desk | Running: container process is executing |
| Actually productive (they know what they're doing) | Ready: readinessProbe passed, receiving traffic |
| At their desk, badge works, but not yet taking customer calls | Running but Not Ready: probe failing, no traffic routed |
| Weekly check-in meeting (are you still functional?) | livenessProbe: is this container deadlocked? |
| Are you ready to take on work right now? | readinessProbe: can this pod serve requests? |
| Initial onboarding grace period (don't fire them in the first 2 weeks) | startupProbe: protected window for slow startup |
| Handover period (finishing current work before leaving) | Terminating: pod draining in-flight requests |
| The notice period (stay long enough to hand off properly) | preStop hook: drain window before SIGTERM |
| Your last day is in 30 seconds | SIGTERM: graceful shutdown signal |
| Security is escorting you out right now | SIGKILL: no handlers, no cleanup, immediate exit |
Hold that analogy. Everything below is the exact same thing, except the onboarding training runs in a container, the notice period is a shell script sleeping for 10 seconds, and security escorting you out is the container runtime sending SIGKILL to PID 1.
Q1: What Is the Difference Between Running and Ready?
Most people say they're basically the same thing. That answer is like saying a restaurant being open and a restaurant being ready to serve you are basically the same thing. One is the chef arriving. The other is the food being cooked.
Running means the container process started. PID 1 is executing. That's it. The application inside might be loading. It might be failing every health check. It might be in a crash loop. Kubernetes does not care. Running just means the process did not immediately die.
Ready means the readinessProbe passed (or that no readinessProbe is configured, in which case the container counts as ready as soon as it starts). The application declared it is able to receive traffic. Running is a phase; Ready is a condition. Kubernetes tracks them separately.
A pod can be Running for hours while never becoming Ready. The readinessProbe keeps failing. The pod keeps sitting there: alive, processing nothing, excluded from every Service endpoint, completely invisible to traffic. No restart. No alert. Just silence.
When a pod transitions from Ready back to Not Ready, because its readinessProbe started failing mid-life, it is removed from the Service's EndpointSlice. kube-proxy reprograms iptables. Traffic stops. The pod is still Running. The container is still alive. Nothing restarts. The application is simply no longer reachable via the Service. This is the most common source of silent 503s in production.
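The distinction is easy to check mechanically. The sketch below is illustrative only: it assumes a status dictionary shaped like the output of kubectl get pod <name> -o json, and the helper names summarize_pod and receives_service_traffic are hypothetical. It ignores terminating pods and Services that publish not-ready addresses.

```python
def summarize_pod(pod):
    """Return the pod phase and whether its Ready condition is True."""
    status = pod.get("status", {})
    conditions = {c["type"]: c["status"] for c in status.get("conditions", [])}
    return {
        "phase": status.get("phase", "Unknown"),
        "ready": conditions.get("Ready") == "True",
    }


def receives_service_traffic(pod):
    """Simplified rule: eligible for endpoints only when Running AND Ready."""
    summary = summarize_pod(pod)
    return summary["phase"] == "Running" and summary["ready"]


# Hypothetical status for the 2:17 AM pod: phase Running, condition Not Ready.
pod = {
    "status": {
        "phase": "Running",
        "conditions": [
            {"type": "PodScheduled", "status": "True"},
            {"type": "ContainersReady", "status": "False"},
            {"type": "Ready", "status": "False"},
        ],
    }
}

print(summarize_pod(pod))
print(receives_service_traffic(pod))
```

The phase alone says nothing about traffic; the Ready condition is the field that decides it.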
Q2: My Pod Is Stuck in Pending. What Are the Three Most Likely Causes?
"Check the logs" is the wrong first move. Pending pods have no logs, because no container has started yet. The right first move is kubectl describe pod <name>: read the Events section. Kubernetes will tell you exactly why. Three causes account for the vast majority of Pending pods.
Cause 1: Insufficient Resources
No node in the cluster has enough allocatable CPU or memory to satisfy the pod's resource requests. The Events section will show:
0/5 nodes are available: 3 Insufficient cpu, 2 Insufficient memory
Note the word allocatable, not capacity. Nodes reserve CPU and memory for system processes. A 4 GiB node might have only 3.2 GiB allocatable. Check the gap with kubectl describe node <name> | grep -A5 Allocatable. Fix: scale the cluster, reduce the pod's requests to measured actual usage, or add a node with the right instance type.
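The allocatable-versus-capacity gap can be made concrete with a little arithmetic. This is a rough sketch, not the scheduler's actual implementation: the helper names are hypothetical, the node figures are made up to match the 4 GiB example, and only the m CPU suffix and binary memory suffixes (Ki, Mi, Gi, Ti) are handled.

```python
_MEMORY_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_cpu_millicores(quantity):
    """'500m' -> 500, '2' -> 2000, '0.5' -> 500."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def parse_memory_bytes(quantity):
    """'512Mi' -> 536870912. Plain integers are treated as bytes."""
    for suffix, factor in _MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[:-2]) * factor)
    return int(quantity)


def insufficient_resources(allocatable, already_requested, pod_request):
    """Return the resources that do not fit, in the spirit of the scheduler event."""
    shortfalls = []
    cpu_needed = (parse_cpu_millicores(already_requested["cpu"])
                  + parse_cpu_millicores(pod_request["cpu"]))
    if cpu_needed > parse_cpu_millicores(allocatable["cpu"]):
        shortfalls.append("cpu")
    memory_needed = (parse_memory_bytes(already_requested["memory"])
                     + parse_memory_bytes(pod_request["memory"]))
    if memory_needed > parse_memory_bytes(allocatable["memory"]):
        shortfalls.append("memory")
    return shortfalls


# Hypothetical 4 GiB node: roughly 3.2 GiB is allocatable after system reservations.
allocatable = {"cpu": "1900m", "memory": "3276Mi"}
already_requested = {"cpu": "1000m", "memory": "2816Mi"}
pod_request = {"cpu": "500m", "memory": "512Mi"}

print(insufficient_resources(allocatable, already_requested, pod_request))
```

Measured against the 4096Mi capacity the pod would fit; measured against allocatable it does not, which is why the event reports Insufficient memory.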
Cause 2: Taint / Toleration Mismatch
Nodes can be tainted to repel pods. GPU nodes typically carry nvidia.com/gpu=present:NoSchedule. Spot instances often carry a taint such as node.kubernetes.io/spot:NoSchedule (the exact key varies by provider and cluster setup). If your pod has no matching toleration, the scheduler filters out every tainted node and the pod sits in Pending forever. The Events section will show:
N node(s) had taint X that the pod didn't tolerate
Fix: add the correct toleration to your pod spec, or audit whether the pod should be running on those nodes at all.
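To see why a pod is filtered out, it helps to write the matching rule down. The following is a simplified sketch of taint/toleration matching, not the scheduler's code; the function names are hypothetical, and PreferNoSchedule is treated as non-blocking because it is only a soft preference.

```python
def toleration_matches(taint, toleration):
    """Simplified taint/toleration match (key, operator, value, effect)."""
    tol_effect = toleration.get("effect", "")
    # An empty toleration effect matches every taint effect.
    if tol_effect and tol_effect != taint.get("effect"):
        return False
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key", "")
    if not key:
        # An empty key is only meaningful with Exists, and then matches all taints.
        return operator == "Exists"
    if key != taint.get("key"):
        return False
    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def blocking_taints(node_taints, tolerations):
    """Taints that would keep the pod off this node."""
    return [
        taint for taint in node_taints
        if taint.get("effect") in ("NoSchedule", "NoExecute")
        and not any(toleration_matches(taint, tol) for tol in tolerations)
    ]


gpu_taint = {"key": "nvidia.com/gpu", "value": "present", "effect": "NoSchedule"}

# No toleration: the node is filtered out.
print(blocking_taints([gpu_taint], []))

# Exists ignores the value, so this toleration clears the taint.
tolerated = [{"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}]
print(blocking_taints([gpu_taint], tolerated))
```

A common mistake is an Equal toleration whose value does not match the taint's value; in this sketch that still leaves the taint blocking.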
Cause 3: PersistentVolumeClaim Not Bound
If the pod spec references a PVC that is still in Pending phase (waiting for a PersistentVolume to be provisioned, or waiting for a storage class to be configured), the pod waits too. Forever. Check with kubectl get pvc. A Pending PVC means a Pending pod. The fix is almost never in the pod spec. It is in the StorageClass configuration, the PV provisioner logs, or the missing CSI driver installation.
⚡ Pro Tip
Run kubectl get events -n <ns> --sort-by=.metadata.creationTimestamp | tail -20 for any Pending pod. The scheduler posts an event naming each filter the pod failed. You will usually have a diagnosis in under 30 seconds.

Q3: What Is an Init Container and How Is It Different from a Sidecar?
Most engineers know init containers exist. Fewer can explain the lifecycle distinction that makes them actually useful.
Init containers run to completion before any app container starts. In sequence. Not in parallel. If one fails, the kubelet restarts it, respecting the pod's restartPolicy (with restartPolicy: Never, the whole pod is marked Failed instead). The app container does not start until every init container has exited with code 0. This is not a race. It is a guarantee.
This matters enormously. Without init containers, your application has to handle the case where the database isn't ready yet, with retry logic, backoff, and error handling in application code. With init containers, the database being ready is a startup precondition. The app starts knowing everything it needs is available.
Common production patterns:
- Wait for a dependency: poll with nc or curl until Postgres on port 5432 accepts connections. The main container never starts on a cluster where the DB is still coming up.
- Database migrations: run python manage.py migrate before the app starts. Every replica of the app starts knowing the schema is current. No coordination logic needed.
- Secret or certificate setup: fetch a TLS certificate from Vault, write it to a shared emptyDir volume. The app container mounts the same volume and finds ready-to-use certs.
- Config rendering: populate a config file from environment-specific templates. Write to shared volume. App starts with final config already present.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app   # must match spec.selector.matchLabels
    spec:
      # -- Init containers: run SEQUENTIALLY before any app container starts --
      initContainers:
        # Step 1: Wait for Postgres to be reachable.
        # Pod never moves forward until this passes.
        - name: wait-for-postgres
          image: busybox:1.36
          command:
            - sh
            - -c
            - |
              until nc -z postgres-service 5432; do
                echo "Waiting for Postgres..."; sleep 2;
              done
          resources:
            requests: { cpu: "10m", memory: "16Mi" }
            limits: { cpu: "10m", memory: "16Mi" }
        # Step 2: Run database migrations.
        # Uses the same image, so schema knowledge is shared.
        # App container starts knowing schema is current.
        - name: run-migrations
          image: myregistry/web-app:v3.1.0
          command: ["python", "manage.py", "migrate", "--noinput"]
          envFrom:
            - secretRef: { name: db-credentials }
          resources:
            requests: { cpu: "100m", memory: "128Mi" }
            limits: { cpu: "100m", memory: "128Mi" }
      # -- Main app container --
      containers:
        - name: web-app
          image: myregistry/web-app:v3.1.0
          ports:
            - containerPort: 8000
          envFrom:
            - secretRef: { name: db-credentials }
          resources:
            requests: { cpu: "250m", memory: "256Mi" }
            limits: { cpu: "250m", memory: "256Mi" }
          readinessProbe:
            httpGet: { path: /health/, port: 8000 }
            periodSeconds: 5
            failureThreshold: 3

The Key Difference from Sidecars
Init containers exit. Sidecars run alongside the app.
A sidecar (a log forwarder, a metrics exporter, a service mesh proxy) is a regular container that starts with the app containers and runs for the entire life of the pod. An init container's entire purpose is to exit. Once it exits successfully, it is gone. If you see a pod in Init:0/2 state, that means zero of two init containers have completed so far and the first one is still running, which is expected, not a problem.
⚡ Pro Tip
Native sidecars (alpha in Kubernetes 1.28, enabled by default from 1.29) are init containers declared with restartPolicy: Always. They start before app containers, run alongside them, and terminate after them. This solves the startup ordering problem (your Envoy proxy is ready before your app makes any network calls) and the Job completion problem (sidecar exits cleanly when the Job completes, instead of blocking forever). If you are on 1.29+, migrate your logging and proxy sidecars to this pattern.

Q4: What Happens After kubectl delete pod, Step by Step?
The simple answer: SIGTERM, then SIGKILL after 30 seconds. That answer describes about 40% of what actually happens. The 60% it omits is where the 503s come from.
1. API server sets deletionTimestamp on the pod object. The pod enters Terminating state. This is not an immediate deletion. The pod object stays in etcd until all finalizers are cleared.
2. EndpointSlice controller removes the pod from the Service endpoints. This is fast: milliseconds. But what happens next is slow.
3. kube-proxy on every node reprograms iptables. kube-proxy watches for EndpointSlice changes and updates iptables rules on its node. This can take roughly 2 to 15 seconds per node. On a 50-node cluster under load, propagation can take longer. During this window, new connections are still being routed to the pod that is about to receive SIGTERM.
4. preStop hook runs, in parallel with steps 2 and 3. The kubelet runs the pod's preStop hook on the node where the pod lives. A preStop: exec: sleep 10 hook keeps the pod alive for 10 seconds, giving kube-proxy time to finish propagating the endpoint removal before the app starts shutting down.
5. SIGTERM sent to PID 1 in each container. After the preStop hook completes, SIGTERM is sent. Your app should handle this: stop accepting new connections, drain in-flight requests, close database connections cleanly, flush write buffers, then exit.
6. terminationGracePeriodSeconds keeps counting down. Default is 30 seconds, measured from when the preStop hook started (not when SIGTERM was sent). If preStop runs for 10 seconds and app draining takes 25 seconds, you need a grace period of at least 35 seconds, not the default 30.
7. SIGKILL. If the process is still running when the grace period expires, the container runtime sends SIGKILL. No handlers. No cleanup. Everything in flight dies immediately. This is what you are trying to avoid.

kubectl delete pod / rolling update / HPA scale-down
t=0s
  1. API server sets deletionTimestamp on pod (grace period countdown starts)
  2. EndpointSlice controller removes pod IP
     ASYNC (takes 2-15s to propagate through kube-proxy on all nodes)
  ---- PARALLEL ----
  3. preStop hook runs (exec or httpGet)
     e.g. sleep 10: this sleep IS the kube-proxy drain window
t=10s (preStop completes)
  4. SIGTERM → PID 1 in each container
     App should: stop listener, drain in-flight requests, close DB, exit
t=0s + terminationGracePeriodSeconds (default 30s) = t=30s
  5. If process still running: SIGKILL
     No handlers. No cleanup. In-flight requests die immediately.

THE RACE: Steps 2 and 3 run in parallel. Without preStop sleep, SIGTERM arrives before kube-proxy finishes reprogramming iptables. New connections still arrive at the dying pod. TCP RST → 502/503 errors.
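The grace-period arithmetic in the steps above can be sketched as a small calculation. This is a simplified model, with hypothetical function names: the countdown starts when preStop starts, the app needs a fixed drain time after SIGTERM, and the kubelet's short extension for an overrunning preStop hook is ignored.

```python
def shutdown_timeline(pre_stop_s, drain_s, grace_s=30):
    """Model one pod shutdown. All times are seconds since deletion (t=0)."""
    sigterm_at = min(pre_stop_s, grace_s)   # SIGTERM follows the preStop hook
    finish_at = pre_stop_s + drain_s        # when the app would exit on its own
    sigkilled = finish_at > grace_s         # the countdown started at t=0
    return {
        "sigterm_at": sigterm_at,
        "exit_at": grace_s if sigkilled else finish_at,
        "sigkilled": sigkilled,
    }


def recommended_grace_period(pre_stop_s, max_request_s, buffer_s=10):
    """preStop duration + longest request + a safety buffer."""
    return pre_stop_s + max_request_s + buffer_s


# preStop sleeps 10s, draining takes 25s, default grace period of 30s.
print(shutdown_timeline(10, 25))
# Same pod with terminationGracePeriodSeconds: 40.
print(shutdown_timeline(10, 25, grace_s=40))
print(recommended_grace_period(10, 20))
```

With the default 30 seconds the app is killed 5 seconds short of finishing its drain; raising the grace period to 35 or more lets it exit on its own.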
Q5: My Pod Keeps Restarting. CrashLoopBackOff. The Logs Show Nothing. Why?
CrashLoopBackOff means the container crashed, Kubernetes is applying exponential backoff before restarting it (10s, 20s, 40s, 80s... up to 5 minutes), and you are asking why. The absence of logs means the container crashed before it could write any. There are four causes.
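The backoff schedule described above is just a capped doubling. A minimal sketch, assuming the 10-second base and 5-minute cap quoted in the text (the function name is hypothetical):

```python
def crashloop_delays(restarts, base_s=10, cap_s=300):
    """Delay before each restart: doubles from base_s, capped at cap_s."""
    return [min(base_s * 2 ** attempt, cap_s) for attempt in range(restarts)]


delays = crashloop_delays(7)
print(delays)        # [10, 20, 40, 80, 160, 300, 300]
print(sum(delays))   # 910 seconds spent waiting across seven restarts
```

After the fifth restart every further attempt waits the full five minutes, which is why a crash-looping pod appears to restart "every five minutes".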
Cause 1: OOMKilled
The container hit its memory limit. The Linux kernel OOM killer fired and terminated the process before the application could write a single log line. Check kubectl describe pod <name>. Under Last State you will see Reason: OOMKilled and exit code 137 (128 + SIGKILL). Fix: increase the memory limit to p99 usage plus 20-30% headroom, or fix the memory leak in the application.
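Exit codes above 128 follow the "128 + signal number" convention, which is where 137 comes from. The helper below is an illustrative sketch with a hypothetical name; it assumes a POSIX system, where signal 9 is SIGKILL and 15 is SIGTERM. Note that 137 on its own only says SIGKILL; the Reason: OOMKilled field is what confirms the OOM killer.

```python
import signal


def describe_exit_code(code):
    """Translate a container exit code into a short human-readable hint."""
    if code == 0:
        return "exited cleanly"
    if code > 128:
        number = code - 128
        try:
            name = signal.Signals(number).name
        except ValueError:
            return f"exit code {code} (no known signal {number})"
        return f"killed by {name} (128 + {number})"
    return f"application error (exit code {code})"


for code in (0, 1, 137, 143):
    print(code, "->", describe_exit_code(code))
```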
Cause 2: Missing Environment Variable or Secret
The application panics at startup when a required environment variable is missing or has the wrong value. This often happens before any logger is initialized: the app crashes at line 3 of its startup sequence and never writes anything. Check that all required env vars are set and all referenced Secrets and ConfigMaps exist in the namespace. kubectl describe pod will show CreateContainerConfigError if a Secret or ConfigMap reference is missing entirely, but it will not catch bad values.
Cause 3: Wrong Command or Entrypoint
The container exits immediately with a non-zero code because the command or args in the pod spec override the Dockerfile's ENTRYPOINT or CMD in a way that is broken. The runtime tries to start the process, the binary does not exist at that path, and the container exits. No logs. Check the pod spec's containers[].command and containers[].args against the Dockerfile. Run kubectl exec on a running pod with the same image and check the binary path manually.
Cause 4: Init Container Failure
The app container never starts because an init container is failing in a loop. From the outside it can look like CrashLoopBackOff on the app container (kubectl get pods shows the status as Init:CrashLoopBackOff). Check with kubectl logs <pod> -c <init-container-name>. The init container may have logs even when the app container does not.
🚨 Interview Trap
"kubectl logs <pod>" is the first thing most people say. In CrashLoopBackOff, the current container may have just restarted and not written any logs yet. Always use kubectl logs <pod> --previous to see logs from the container instance that actually crashed. This is the single most commonly forgotten flag in Kubernetes debugging.

Probe Types Deep Dive
Three probes. Completely different consequences on failure. The most dangerous Kubernetes mistake is treating them as variations of the same thing.
| Probe | On Failure | On Success | The Question It Answers |
|---|---|---|---|
| readinessProbe | Removed from endpoints (no kill) | Added to endpoints, receives traffic | Can this pod serve requests right now? |
| livenessProbe | Container killed + restarted | Nothing (normal) | Is this container deadlocked and beyond recovery? |
| startupProbe | Container killed (after threshold) | Liveness + readiness activate | Has the slow startup sequence finished? |
Container starts
│
├── startupProbe ONLY (blocks liveness + readiness until it passes)
│     budget = failureThreshold × periodSeconds
│     e.g. 30 × 10s = 300s window for a slow JVM startup
│     If startupProbe fails beyond threshold → container killed
│
▼ startupProbe passes (or not configured)
│
├── livenessProbe ───── FAILS → container KILLED + restarted
├── readinessProbe ──── FAILS → removed from Service endpoints (NOT killed)
│     Pod still Running. Just gets no traffic.
▼
┌─────────────────────────────────────────────────────────────────┐
│ readiness PASS → pod IP added to EndpointSlice                  │
│ readiness FAIL → pod IP removed from EndpointSlice (no kill)    │
│ liveness FAIL  → container restarted (restartPolicy applies)    │
└─────────────────────────────────────────────────────────────────┘
DANGER ZONE: identical endpoints and thresholds for both probes
───────────────────────────────────────────────────────────────────
Traffic spike → /health slow → BOTH probes fail simultaneously
→ liveness kills pod that was merely overwhelmed
→ replacement pod starts cold, under full load
→ fails immediately → death spiral
readinessProbe: Your Traffic Gate
The readinessProbe endpoint should check real dependencies. Database connectivity. Cache availability. Feature flag service health. If any critical dependency is down, the pod should fail readiness. It gets removed from rotation. Healthy pods absorb the traffic. When the dependency recovers, readiness passes and the pod re-enters rotation. No restart, no downtime. Just graceful load shifting.
Use a lower failure threshold for readiness. You want to shed traffic fast (2 failures × 5s period = 10 seconds to remove from endpoints) but recover fast too (1 success × 5s = back in rotation in 5 seconds).
livenessProbe: For Deadlocks Only
The livenessProbe endpoint should be lightweight and check nothing external. No database query. No cache ping. No downstream service call. It should answer one question: is this process still able to do work, or is it deadlocked and spinning forever on a lock it will never acquire?
If your liveness probe checks the database and the database goes down, all pods get killed simultaneously. You go from a partial outage (readiness removing pods from rotation one by one) to a complete outage (every pod restarting at once). This has happened. It will happen again in a cluster where someone thought "more comprehensive health check = better."
Use a much higher failure threshold for liveness. A pod should need to be unresponsive for 45 to 150 seconds before being killed. A threshold of 3 with a 15-second period (3 × 15 = 45 seconds) is a reasonable starting point.
startupProbe: The JVM Saver
JVMs take 30 to 120 seconds to warm up. Large Python applications take time to import their dependency graph. Services that load ML models at startup take what they take. Without a startupProbe, you have two bad options: set a large initialDelaySeconds on the liveness probe (makes deadlock detection slow forever) or skip liveness entirely (no deadlock recovery).
The startupProbe is the correct solution. Configure failureThreshold × periodSeconds to equal your maximum expected startup time. The startupProbe runs exclusively during startup. Once it passes, it is done. Liveness activates with tight thresholds. A slow startup does not trigger a false liveness kill. A post-startup deadlock is caught quickly.
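The probe budgets quoted in this section all come from the same multiplication. A rough sketch follows, with a hypothetical helper name; it is an approximation that ignores timeoutSeconds and the exact timing of the first probe.

```python
def failure_window_s(period_seconds, failure_threshold, initial_delay_seconds=0):
    """Approximate seconds of continuous failure before the probe takes effect."""
    return initial_delay_seconds + period_seconds * failure_threshold


startup_budget = failure_window_s(period_seconds=10, failure_threshold=30)
liveness_window = failure_window_s(period_seconds=15, failure_threshold=3)
readiness_window = failure_window_s(period_seconds=5, failure_threshold=2)

print(startup_budget)    # 300: seconds a slow JVM has to come up
print(liveness_window)   # 45: seconds of unresponsiveness before a restart
print(readiness_window)  # 10: seconds before the pod is pulled from endpoints
print(liveness_window / readiness_window)  # 4.5: liveness is far more patient
```

Comparing windows rather than raw thresholds is what matters: a liveness threshold of 3 looks close to a readiness threshold of 2, but the different periods make the liveness window 4.5 times longer.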
# production-ready Deployment probe configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0   # never reduce capacity below desired
  template:
    metadata:
      labels:
        app: api          # must match spec.selector.matchLabels
    spec:
      terminationGracePeriodSeconds: 60   # must be > preStop + app drain time
      containers:
        - name: api
          image: myregistry/api:v2.4.1
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "500m"       # == request → Guaranteed QoS
              memory: "512Mi"
          # -- startupProbe --
          # Gives JVM / slow apps up to 30 × 10s = 300s to become live.
          # Liveness is DISABLED until this passes.
          startupProbe:
            httpGet:
              path: /healthz/startup
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
            failureThreshold: 30   # 300s total budget
          # -- livenessProbe --
          # Checks ONLY that the process is not deadlocked.
          # NEVER checks external deps (DB, cache); see the livenessProbe section.
          livenessProbe:
            httpGet:
              path: /healthz/live   # lightweight, no DB query
              port: 8080
            periodSeconds: 15
            failureThreshold: 3     # 45s before restart: high bar
          # -- readinessProbe --
          # Checks real dependencies. Fails = removed from endpoints.
          # Lower threshold: shed traffic fast.
          readinessProbe:
            httpGet:
              path: /healthz/ready  # checks DB ping, cache, etc.
              port: 8080
            periodSeconds: 5
            failureThreshold: 2     # 10s to shed traffic
          # -- preStop --
          # sleep 10 = kube-proxy drain window before SIGTERM.
          # Without this: 502s on every rolling deployment.
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 10"]
QoS Classes: Who Gets Evicted First When the Node Runs Out of Memory
Kubernetes assigns every pod one of three QoS classes based entirely on the relationship between its resource requests and limits. This class determines eviction order when a node is under memory pressure. Most engineers don't know their pod's QoS class until they learn it from a post-mortem.
| QoS Class | Condition | Evicted | OOM Kill Risk |
|---|---|---|---|
| Guaranteed | Every container: requests == limits (CPU + memory) | Last | Only if container exceeds its own limit |
| Burstable | At least one container: requests < limits (or only requests set) | Second | When node is under memory pressure |
| BestEffort | No container has any requests or limits set | First. Always. | Very high under any memory pressure |
With Guaranteed QoS, requests equal limits, so the capacity the scheduler reserves on the node is exactly what the container is allowed to use. The Linux cgroup hard-limits the container to its specified memory. If the container allocates beyond its limit, it is OOMKilled, but that is a container-level event, not a node-level eviction. During node-level memory pressure, Guaranteed pods are the last ones the kubelet touches.
For every production-critical service: set requests equal to limits. kubectl get pod <name> -o jsonpath='{.status.qosClass}' tells you the class. It should return Guaranteed.
# Guaranteed QoS: every container in the pod has requests == limits.
# Result: last to be evicted under node memory pressure.
# Also more predictable: no CPU throttling surprises under load.
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-example        # placeholder name
spec:
  containers:
    - name: app
      image: myregistry/app:v1.0.0      # placeholder image
      resources:
        requests:
          cpu: "1"
          memory: "1Gi"
        limits:
          cpu: "1"        # == request → Guaranteed
          memory: "1Gi"   # == request → Guaranteed
    - name: sidecar
      image: myregistry/sidecar:v1.0.0  # placeholder image
      resources:
        requests:
          cpu: "50m"
          memory: "64Mi"
        limits:
          cpu: "50m"      # == request → both containers Guaranteed
          memory: "64Mi"
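The classification rule from the QoS table is short enough to write out. This is a simplified sketch, not the kubelet's implementation: the function name is hypothetical, quantities are compared as literal strings (so "1" and "1000m" would count as different), and an unset request is treated as defaulting to the limit.

```python
def qos_class(containers):
    """Classify a pod from its containers' cpu/memory requests and limits."""
    anything_set = False
    guaranteed = True
    for container in containers:
        resources = container.get("resources", {})
        requests = resources.get("requests", {})
        limits = resources.get("limits", {})
        for name in ("cpu", "memory"):
            limit = limits.get(name)
            request = requests.get(name, limit)  # unset request defaults to limit
            if limit is not None or request is not None:
                anything_set = True
            if limit is None or request != limit:
                guaranteed = False
    if not anything_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


pod_containers = [
    {"name": "app", "resources": {
        "requests": {"cpu": "1", "memory": "1Gi"},
        "limits": {"cpu": "1", "memory": "1Gi"}}},
    {"name": "sidecar", "resources": {
        "requests": {"cpu": "50m", "memory": "64Mi"},
        "limits": {"cpu": "50m", "memory": "64Mi"}}},
]

print(qos_class(pod_containers))                       # Guaranteed
print(qos_class([{"name": "app", "resources": {
    "requests": {"cpu": "100m", "memory": "64Mi"}}}]))  # Burstable
print(qos_class([{"name": "app"}]))                    # BestEffort
```

One sidecar without limits is enough to drop the whole pod from Guaranteed to Burstable, which is why the class is worth verifying on the live pod rather than assuming it from the main container's spec.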
Production Disasters
Disaster 1: The Missing initialDelaySeconds
A new Python microservice was deployed. The database connection pool took 8 seconds to initialize: establishing 20 connections with TLS handshakes, running a preflight query, and registering a schema validator. Completely normal behavior. The readinessProbe was configured with initialDelaySeconds: 0.
The probe fired at t=1 second. The app was still initializing. The /health/ready endpoint returned 503 because the connection pool was not yet available. Failure. Pod marked Not Ready. The Service had no endpoints. Every incoming request returned 503.
The team spent 40 minutes reading application code, checking database logs, reviewing the network configuration, and rebooting the deployment three times. The app was fine every single time. The probe was the problem.
Fix: set initialDelaySeconds: 10 on the readinessProbe, enough time for the connection pool to initialize. Or better: add a startupProbe with a generous budget, keep initialDelaySeconds: 0 on the readinessProbe, and let the startupProbe protect the initialization window.
The lesson: a pod can look Running and healthy while never serving a single request. Check the endpoints first. kubectl get endpoints <service-name> with an empty result is the fastest diagnosis in the playbook.
Disaster 2: OOMKilled in Silence
A Node.js API service had a memory limit of 256Mi. Average production load: 180Mi. Comfortable headroom. The team set it and forgot it.
A customer submitted a batch export request that triggered a known (but unpatched) memory inefficiency in the JSON serialization code. Memory usage started climbing from 180Mi toward the roughly 310Mi the export needed. Within 90 seconds the container hit its 256Mi limit. The Linux kernel OOM killer fired. The container was terminated before it wrote a single log line, because the application logger had not yet flushed its buffer.
The container restarted (restartPolicy: Always). Under normal load at the next restart: 180Mi. Fine. The batch job was still in the queue. The customer resubmitted. Memory climbed again. OOMKilled again. CrashLoopBackOff. No application logs visible.
The team spent 90 minutes reading application code, adding debug logging, and deploying test builds looking for the crash. Nobody checked kubectl describe pod until a very tired senior engineer ran it and saw Reason: OOMKilled in the Last State section. The investigation was over in 30 seconds.
Fix: increase memory limit to 512Mi. Profile the JSON serializer. Add a Prometheus alert for container memory usage above 80% of limit. In every service. Before the next batch job.
Senior Engineer Confession
kubectl describe pod is the most valuable command in Kubernetes debugging and the most commonly skipped. Every crash loop investigation should start with it. Not the logs. Not the metrics dashboard. Not Slack. The pod description. It has the exit code, the OOM reason, the probe failure details, and the event history all in one place. Bookmark nothing. Just remember: describe pod first, ask questions second.

The Wall of Shame
- No readinessProbe at all. This is the Kubernetes equivalent of sending customers to a restaurant the moment the chef walks in the door: before the kitchen is set up, before the menu is printed, before the gas is on. Traffic arrives the instant the container starts. The database connection pool is still initializing. Users get 500s. The app is perfectly healthy. It just isn't ready yet. Consequence: every deployment spikes errors for the startup duration of your app, proportional to how slow it is. Add a readinessProbe to every container. It is not optional.
- Same readiness and liveness failureThreshold. You built a smoke detector that also calls the fire brigade and simultaneously burns the building down. Readiness failing means shed load. Liveness failing means this process is beyond recovery, so kill it. Using identical thresholds means a temporarily slow pod gets killed instead of just paused. The replacement starts cold, under full load, fails the same probe, gets killed again. The death spiral is self-reinforcing. The liveness failure window (failureThreshold × periodSeconds) should be at least 3x the readiness window. Always.
- initialDelaySeconds: 0 on a slow-starting app. You handed a new employee their first live customer call 10 seconds after they walked in the door. Before they know the product. Before they know the systems. Before they know where the bathroom is. The employee is not incompetent. They were just not given time to onboard. The probe fires before the app is remotely ready. Fails. The pod stays Not Ready until a later probe succeeds, and if the livenessProbe is configured the same way, the container is killed before it ever gets there. Use a startupProbe for slow starters. It exists for exactly this reason.
- No preStop hook. The last person to leave the office turns off all the servers without warning anyone who's still working. SIGTERM arrives. The Go process exits in 200ms. kube-proxy is still reprogramming iptables on 40 nodes. New connections arrive at the dead process IP. TCP RST. nginx reports 502. Every deployment looks like a partial outage. Because it is. Add preStop: exec: sleep 10 to every Deployment. This is not configuration tuning. It is correctness.
- terminationGracePeriodSeconds: 0. SIGKILL is not a graceful shutdown. It is a hostage situation. The process has no say. In-flight requests die mid-response. Write buffers are not flushed. Database transactions may be partially committed. The only valid reason for terminationGracePeriodSeconds: 0 is a runaway process that you absolutely must kill immediately in an emergency. In a Deployment template, it is a 503 generator dressed up as configuration.
- Memory limit without a measured baseline. You told a chef they can use as much kitchen space as they want, but if they exceed some number you picked arbitrarily while writing the manifest, you will evict them mid-service with no warning. The customer's order dies on the pass. Profile your service. Measure p99 memory usage under load. Set limit to p99 plus 25% headroom. Not 256Mi because it sounded reasonable in a Slack message.
- CrashLoopBackOff ignored for days. The fire alarm has been going off for three days. Everyone decided it must be a false positive. It is not a false positive. CrashLoopBackOff means the application is crashing and restarting. The backoff protects the cluster, not the application. The application is crashing every five minutes, losing state, dropping connections, generating errors that users are silently absorbing. CrashLoopBackOff is a P1. Treat it like one.
- Running as root in the container. You gave every temporary contractor full admin rights to the building's control room because setting up limited access seemed complicated. A container running as root with a process breakout vulnerability becomes a node compromise, not just a container compromise. Set securityContext.runAsNonRoot: true and securityContext.readOnlyRootFilesystem: true on every container. The "it'll complicate the Dockerfile" objection is not a security argument.
Best Practices
- Add a readinessProbe to every container. No exceptions. Traffic routing without a readiness gate is a deployment-time error generator.
- Use a startupProbe for any app with startup time over 15 seconds. JVMs, ML model loaders, services that run migrations at startup. Set failureThreshold × periodSeconds to your maximum startup budget.
- Use separate endpoints for liveness and readiness. Liveness checks only that the process is alive. Readiness checks real dependencies. Never the same endpoint.
- Make the liveness failure window (failureThreshold × periodSeconds) at least 3x the readiness window. Slow under load is not the same as broken beyond recovery.
- Never check external dependencies in the livenessProbe. Database down should mean readiness fails (pods removed from rotation), not liveness fails (all pods restarted simultaneously).
- Add preStop sleep to every Deployment. 10 to 15 seconds. This is the kube-proxy drain window. Without it, every rolling deployment drops connections during the endpoint propagation window.
- Set terminationGracePeriodSeconds to preStop duration + max request duration + 10 seconds. If preStop sleeps 10s and requests can take 20s: set 40s minimum.
- Use Guaranteed QoS for production-critical services. Set requests equal to limits. Run kubectl get pod <name> -o jsonpath='{.status.qosClass}' to verify.
FAQ
Does Running mean the app is healthy?
No. Running means at least one container process is executing. The application inside may be initializing, failing probes, or in a crash loop. Always check the READY column in kubectl get pods. That column reflects the readinessProbe state, not just whether the container process is alive.
What is the difference between pod eviction and OOMKill?
OOMKill is a container-level event: the Linux kernel terminates a specific container that exceeded its cgroup memory limit. The pod continues, and the container restarts according to the pod's restartPolicy. Eviction is a pod-level event: the kubelet terminates the entire pod because of node-level resource pressure. The evicted pod itself is not moved. If a controller such as a Deployment owns it, a replacement pod is created and scheduled, usually on another node.
Can a pod be Running but receive zero traffic?
Yes. This is the most important thing in this article. A pod can be Running with a failing readinessProbe for hours. It will never receive traffic. It will not restart. It will not alert. It will just sit there, healthy from the container runtime's perspective, invisible to the Service. Always check endpoints: kubectl get endpoints <service>. An empty result is a complete outage.
What happens if preStop takes longer than terminationGracePeriodSeconds?
The grace period countdown starts when the preStop hook begins, not when it finishes. If preStop is still running when the grace period expires, the kubelet grants only a small one-off extension (2 seconds) before the container is force-killed with SIGKILL. Your carefully written graceful shutdown code gets essentially no time to run. Set terminationGracePeriodSeconds to at least preStop duration + expected app drain time + a buffer.
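A sketch of that sizing rule, assuming a 10-second preStop sleep, requests that can take up to 20 seconds, and a 10-second buffer. The container name and image are placeholders, and the exec hook assumes the image ships a sleep binary.

```yaml
# Fragment of a pod template spec; the numbers are illustrative assumptions.
spec:
  terminationGracePeriodSeconds: 40   # 10s preStop + 20s max request + 10s buffer
  containers:
    - name: app                               # hypothetical container name
      image: registry.example.com/app:1.0.0   # placeholder image
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "10"]  # drain window; SIGTERM is sent only after this returns
```

With these values the app still has about 30 seconds between SIGTERM and SIGKILL. Raising the sleep without raising the grace period eats directly into that window.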
How do I check what QoS class a pod has?
kubectl get pod <name> -o jsonpath='{.status.qosClass}' returns Guaranteed, Burstable, or BestEffort. Run this on every production pod. If you see BestEffort, that is an incident waiting for a busy Tuesday to happen.
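For reference, a resources block that should yield the Guaranteed class might look like the sketch below. The CPU and memory figures are arbitrary examples and the container name and image are placeholders. The class only applies when every container in the pod sets both CPU and memory with requests equal to limits.

```yaml
# Fragment of a pod template spec; the values are illustrative assumptions.
containers:
  - name: app                               # hypothetical container name
    image: registry.example.com/app:1.0.0   # placeholder image
    resources:
      requests:
        cpu: "500m"
        memory: "512Mi"
      limits:
        cpu: "500m"        # equal to the request
        memory: "512Mi"    # equal to the request
```

Setting requests lower than limits on any container makes the pod Burstable, and omitting requests and limits on all containers makes it BestEffort.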
🎤 The 60-Second Interview Answer
Back in the interview room. The whiteboard is still there. You've answered all five follow-up questions. Here is how you deliver the complete answer:
🎤 Say This Out Loud Until You Own It
"A pod starts in Pending: the API server has accepted it but no node has been assigned. The scheduler runs its filtering and scoring algorithm, picks a node, and patches the pod with that node's name. The kubelet on that node picks it up, pulls the image via CRI, sets up the network namespace via CNI, mounts volumes, and runs init containers in sequence. Every init container must complete successfully before any app container starts.
Once app containers start, the pod phase moves to Running, but Running does not mean ready to serve traffic. That is the readinessProbe's job. The probe fires on a schedule. When it passes, the pod IP is added to the Service's EndpointSlice. kube-proxy updates iptables on every node. Only then does the pod receive traffic.
The readinessProbe and livenessProbe do completely different things. Readiness failure removes the pod from endpoints without killing anything. Liveness failure kills and restarts the container. Using the same endpoint and thresholds for both is how you get a death spiral under load: a traffic spike makes the health endpoint slow, both probes fail, liveness kills pods that were merely overwhelmed, replacements start cold, fail immediately, repeat.
On termination: kubectl delete sets deletionTimestamp. The EndpointSlice controller removes the pod IP. kube-proxy propagates that removal through iptables asynchronously, which typically takes 2 to 15 seconds. Without a preStop hook, SIGTERM arrives before kube-proxy has finished, and new connections still arrive at the dying process. That is the source of 502s on every rolling deployment that lacks the hook. A preStop sleep of 10 seconds creates the drain window.
Finally, QoS class, determined by requests versus limits, controls eviction order under node memory pressure. Guaranteed class (requests equal limits) is last to be evicted. BestEffort (no requests, no limits) is first, always, immediately."
If you can say that in one breath, you're getting the job.
Key Takeaways
- Running means the container process started. Ready means the readinessProbe passed. These are not the same thing.
- A pod can be Running with zero traffic, zero restarts, and zero alerts, for hours, if the readinessProbe keeps failing.
- Readiness failure removes from endpoints. Liveness failure kills and restarts. Different probes. Different endpoints. Different thresholds.
- Without a preStop sleep, every rolling deployment has a race condition between SIGTERM and kube-proxy endpoint propagation.
- terminationGracePeriodSeconds must be greater than preStop duration plus app drain time, or SIGKILL cuts your drain short.
- Init containers are a startup precondition: the app container does not start until every init container has exited with code 0.
- QoS class is determined by requests vs. limits. Guaranteed (requests == limits) is last evicted. BestEffort is first. Always.
- kubectl describe pod first. Always. Before logs. Before metrics. Before Slack. The answer is almost always there.