Kubernetes | 🔥 Production-Critical | ⚡ Senior Engineer Level

Kubernetes Pod Lifecycle Explained: Running Is Not Ready

The interview question that separates operators from deployers. Pod phases, init containers, readiness vs. liveness, graceful termination, QoS classes, and the silent 503 that takes hours to notice because every dashboard shows green.

“The pod is Running. The app is healthy. The readinessProbe is failing. Traffic stopped four seconds after deploy. Nobody got an alert.”

Updated June 8, 2026 | 22 min read | Has saved 3 on-call rotations

PagerDuty Alert

It's 2:17 AM.

You check the cluster:

✅ Deployment shows 3/3 pods Running

✅ Load balancer is healthy

✅ Pods are Running

❌ Endpoints: empty

Users are getting 503s. The load balancer has no backends.

You check one pod. Running. You check its readiness. Not Ready.

The readinessProbe is failing. Silently. The app started 4 seconds ago. The database connection pool hasn't warmed up yet. initialDelaySeconds: 0. The probe fired immediately. Failed. Pod got marked Not Ready. Traffic stopped.

The pod is still Running. Nobody got an alert about it.

You don't know why. You're about to.

Three Months Later. A Different Kind of War Room.

No PagerDuty this time. Fluorescent lights. A senior SRE interview loop. The interviewer, calm and unhurried, writes one question on the whiteboard:

“Walk me through a pod's lifecycle from creation to termination.”

You take a breath. You've lived this at 2 AM. You know every state. You start drawing.

The Answer That Gets 80% of Candidates Eliminated

Most engineers draw the same thing. Confident. Correct. Incomplete.

Pending
▼
Running
▼
Succeeded / Failed

The interviewer nods. Writes something down. Then looks up.

Interviewer keeps going:

❓ “What is the difference between Running and Ready?”

❓ “My pod is stuck in Pending. What are the three most likely causes?”

❓ “What is an Init container and how is it different from a sidecar?”

❓ “What happens after kubectl delete pod, step by step?”

❓ “My pod keeps restarting. CrashLoopBackOff. The logs show nothing. Why?”

Five questions. Most candidates get through one and a half before the interviewer knows. Engineers who answer all five, with specifics, with failure modes, with production war stories, walk out with the offer.

Before we go deep, here's the mental model that makes every state click immediately. Think of a pod as a new employee starting at a company.

New Employee World | Kubernetes Pod World
Hired but not yet onboarded (waiting for a desk, equipment, access) | Pending: accepted by the cluster, but not yet scheduled or containers not yet started
First day setup (getting their laptop imaged) | ContainerCreating: image pull, CNI setup, volume mount
Security clearance and onboarding training (must complete before work) | Init containers: run sequentially to completion first
Badge activated, at their desk | Running: container process is executing
Actually productive (they know what they're doing) | Ready: readinessProbe passed, receiving traffic
At their desk, badge works, but not yet taking customer calls | Running but Not Ready: probe failing, no traffic routed
Weekly check-in meeting (are you still functional?) | livenessProbe: is this container deadlocked?
Are you ready to take on work right now? | readinessProbe: can this pod serve requests?
Initial onboarding grace period (don't fire them in the first 2 weeks) | startupProbe: protected window for slow startup
Handover period (finishing current work before leaving) | Terminating: pod draining in-flight requests
The notice period (stay long enough to hand off properly) | preStop hook: drain window before SIGTERM
Your last day is in 30 seconds | SIGTERM: graceful shutdown signal
Security is escorting you out right now | SIGKILL: no handlers, no cleanup, immediate exit

Hold that analogy. Everything below is the exact same thing, except the onboarding training runs in a container, the notice period is a shell script sleeping for 10 seconds, and security escorting you out is the Linux kernel sending SIGKILL to PID 1.

Q1: What Is the Difference Between Running and Ready?

Most people say they're basically the same thing. That answer is like saying a restaurant being open and a restaurant being ready to serve you are basically the same thing. One is the chef arriving. The other is the food being cooked.

Running means the container process started. PID 1 is executing. That's it. The application inside might be loading. It might be failing every health check. It might be in a crash loop. Kubernetes does not care. Running just means the process started and has not exited.

Ready means the readinessProbe passed. The application declared it is able to receive traffic. Ready requires Running, but Running guarantees nothing about Ready.

A pod can be Running for hours while never becoming Ready. The readinessProbe keeps failing. The pod keeps sitting there: alive, processing nothing, excluded from every Service endpoint, completely invisible to traffic. No restart. No alert. Just silence.

When a pod transitions from Ready back to Not Ready, because its readinessProbe started failing mid-life, it is immediately removed from the Service's EndpointSlice. kube-proxy reprograms iptables. Traffic stops. The pod is still Running. The container is still alive. Nothing restarts. The application is simply no longer reachable via the Service. This is the most common source of silent 503s in production.
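
For illustration, here is roughly what the status of a Running-but-Not-Ready pod looks like in the output of kubectl get pod <name> -o yaml. This is a trimmed sketch, not captured output: the container name and timestamp are placeholders.

```yaml
# Illustrative excerpt of a pod's status (placeholder name and timestamp).
status:
  phase: Running                   # the container process has started
  conditions:
    - type: PodScheduled
      status: "True"
    - type: Initialized            # all init containers completed
      status: "True"
    - type: ContainersReady
      status: "False"              # the readinessProbe is failing
      reason: ContainersNotReady
    - type: Ready
      status: "False"              # pod is excluded from Service endpoints
      reason: ContainersNotReady
  containerStatuses:
    - name: web-app
      ready: false
      restartCount: 0              # nothing restarted: readiness never kills
      state:
        running:
          startedAt: "2026-01-01T02:13:00Z"
```

The phase says Running while the Ready condition says False. Dashboards that only chart the phase will show green for this pod.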

🚨 Interview Trap

The trap is saying “Not Ready means the pod is unhealthy and will restart.” Wrong. Not Ready means no traffic. The pod stays alive. If you want a restart, that's the livenessProbe's job. Liveness failing and readiness failing are not the same event, should not have the same thresholds, and should not check the same endpoint. Conflating them is how you build a death spiral. The interviewer is testing whether you know the difference between traffic routing and container lifecycle. Most people don't.

Q2: My Pod Is Stuck in Pending β€” What Are the Three Most Likely Causes?

“Check the logs” is the wrong first move. Pending pods have no logs because no container has started yet. The right first move is: kubectl describe pod <name>, then read the Events section. Kubernetes will usually tell you exactly why. Three causes account for the large majority of Pending pods.

Cause 1: Insufficient Resources

No node in the cluster has enough allocatable CPU or memory to satisfy the pod's resource requests. The Events section will show:

0/5 nodes are available: 3 Insufficient cpu, 2 Insufficient memory

Note the word allocatable, not capacity. Nodes reserve CPU and memory for system processes. A 4 GiB node might have only 3.2 GiB allocatable. Check the gap with kubectl describe node <name> | grep -A5 Allocatable. Fix: scale the cluster, reduce the pod's requests to measured actual usage, or add a node with the right instance type.

Cause 2: Taint / Toleration Mismatch

Nodes can be tainted to repel pods. GPU nodes often carry a taint such as nvidia.com/gpu=present:NoSchedule. Spot instances are commonly tainted as well, for example with node.kubernetes.io/spot:NoSchedule (the exact key varies by provider and cluster setup). If your pod has no matching toleration, the scheduler filters out every tainted node and the pod sits in Pending forever. The Events section will show:

N node(s) had taint X that the pod didn't tolerate

Fix: add the correct toleration to your pod spec, or audit whether the pod should be running on those nodes at all.
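
A minimal sketch of such a toleration, assuming the node carries the nvidia.com/gpu=present:NoSchedule taint; substitute whatever taint kubectl describe node actually reports on your nodes.

```yaml
# Pod spec fragment: tolerate the GPU taint so the scheduler stops
# filtering those nodes out. Key, value, and effect must match the taint.
spec:
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Equal"
      value: "present"
      effect: "NoSchedule"
```

A toleration only permits scheduling onto the tainted nodes. It does not attract the pod to them; that takes a nodeSelector or node affinity.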

Cause 3: PersistentVolumeClaim Not Bound

If the pod spec references a PVC that is still in Pending phase (waiting for a PersistentVolume to be provisioned, or waiting for a storage class to be configured), the pod waits too. Forever. Check with kubectl get pvc. A Pending PVC means a Pending pod. The fix is almost never in the pod spec. It is in the StorageClass configuration, the PV provisioner logs, or the missing CSI driver installation.
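
As a rough sketch, this is the part of a claim that most often causes the mismatch. The claim name, size, and the StorageClass name standard are assumptions; compare storageClassName against the output of kubectl get storageclass.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim                # hypothetical name
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: standard      # assumption: must name a StorageClass that exists
  resources:
    requests:
      storage: 10Gi               # hypothetical size
```

If storageClassName points at a class that does not exist, or no provisioner serves it, the claim stays Pending and so does every pod that mounts it.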

⚑ Pro Tip

Run kubectl get events -n <ns> --sort-by=.metadata.creationTimestamp | tail -20 for any Pending pod. The scheduler posts a specific event for every filter it failed. You will have a diagnosis in under 30 seconds.

Q3: What Is an Init Container and How Is It Different from a Sidecar?

Most engineers know init containers exist. Fewer can explain the lifecycle distinction that makes them actually useful.

Init containers run to completion before any app container starts. In sequence. Not in parallel. If one fails, it restarts, respecting the pod's restartPolicy. The app container does not start until every init container has exited with code 0. This is not a race. It is a guarantee.

This matters enormously. Without init containers, your application has to handle the case where the database isn't ready yet: retry logic, backoff, error handling in application code. With init containers, the database being ready is a startup precondition. The app starts knowing everything it needs is available.

Common production patterns:

  • Wait for a dependency: poll with nc or curl until Postgres on port 5432 accepts connections. The main container never starts on a cluster where the DB is still coming up.
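  • Database migrations: run python manage.py migrate before the app starts. Every replica of the app starts knowing the schema is current. No coordination logic is needed in the app itself, as long as the migration step tolerates two replicas running it at the same time.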
  • Secret or certificate setup: fetch a TLS certificate from Vault, write it to a shared emptyDir volume. The app container mounts the same volume and finds ready-to-use certs.
  • Config rendering: populate a config file from environment-specific templates. Write to shared volume. App starts with final config already present.
production init container pattern
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app   # must match spec.selector.matchLabels
    spec:
      # ── Init containers: run SEQUENTIALLY before any app container starts ──
      initContainers:

        # Step 1: Wait for Postgres to be reachable.
        # Pod never moves forward until this passes.
        - name: wait-for-postgres
          image: busybox:1.36
          command:
            - sh
            - -c
            - |
              until nc -z postgres-service 5432; do
                echo "Waiting for Postgres..."; sleep 2;
              done
          resources:
            requests: { cpu: "10m", memory: "16Mi" }
            limits:   { cpu: "10m", memory: "16Mi" }

        # Step 2: Run database migrations.
        # Uses the same image as the app, so schema knowledge is shared.
        # App container starts knowing schema is current.
        - name: run-migrations
          image: myregistry/web-app:v3.1.0
          command: ["python", "manage.py", "migrate", "--noinput"]
          envFrom:
            - secretRef: { name: db-credentials }
          resources:
            requests: { cpu: "100m", memory: "128Mi" }
            limits:   { cpu: "100m", memory: "128Mi" }

      # ── Main app container ──────────────────────────────────────────────
      containers:
        - name: web-app
          image: myregistry/web-app:v3.1.0
          ports:
            - containerPort: 8000
          envFrom:
            - secretRef: { name: db-credentials }
          resources:
            requests: { cpu: "250m", memory: "256Mi" }
            limits:   { cpu: "250m", memory: "256Mi" }
          readinessProbe:
            httpGet: { path: /health/, port: 8000 }
            periodSeconds: 5
            failureThreshold: 3

The Key Difference from Sidecars

Init containers exit. Sidecars run alongside the app.

A sidecar (a log forwarder, a metrics exporter, a service mesh proxy) is a regular container that starts with the app containers and runs for the entire life of the pod. An init container's entire purpose is to exit. Once it exits successfully, it is gone. If you see a pod in Init:0/2 state, that means zero of two init containers have completed and the first one is still running, which is expected during startup, not a problem.

🧠 Memory Trick

Init container = setup crew. They arrive first, do their work, then leave. The chef cannot start until the kitchen is ready, the gas is on, and the ingredients are prepped. Sidecar = the expeditor who runs alongside the chef all night, never leaving until the kitchen closes. One exits by design. The other stays by design.

⚑ Pro Tip

Kubernetes 1.28 introduced native sidecar containers as an alpha feature, and they are enabled by default from 1.29 onward: init containers with restartPolicy: Always. They start before app containers, run alongside them, and terminate after them. This solves the startup ordering problem (your Envoy proxy is ready before your app makes any network calls) and the Job completion problem (sidecar exits cleanly when the Job completes, instead of blocking forever). If you are on 1.29+, migrate your logging and proxy sidecars to this pattern.
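
A minimal sketch of the native sidecar pattern. The pod name, container names, and images are placeholders; the only part that matters is restartPolicy: Always on an entry under initContainers.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-sidecar                    # hypothetical name
spec:
  initContainers:
    - name: log-forwarder
      image: example.com/log-forwarder:1.0  # placeholder image
      restartPolicy: Always                 # this field makes it a native sidecar
  containers:
    - name: app
      image: example.com/app:1.0            # placeholder image
```

Without the restartPolicy line, log-forwarder would be an ordinary init container and the app would wait forever for it to exit.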

Q4: What Happens After kubectl delete pod β€” Step by Step?

The simple answer: SIGTERM, then SIGKILL after 30 seconds. That answer describes about 40% of what actually happens. The 60% it omits is where the 503s come from.

  1. API server sets deletionTimestamp on the pod object. The pod enters Terminating state. This is not an immediate deletion. The pod object stays in etcd until all finalizers are cleared.
  2. EndpointSlice controller removes the pod from the Service endpoints. This is fast: milliseconds. But what happens next is slow.
  3. kube-proxy on every node reprograms iptables. kube-proxy watches for EndpointSlice changes and updates iptables rules on its node. This takes 2 to 15 seconds per node. On a 50-node cluster under load, propagation can take longer. During this window, new connections are still being routed to the pod that is about to receive SIGTERM.
  4. preStop hook runs, in parallel with step 3. The kubelet runs the pod's preStop hook on the node where the pod lives. A preStop: exec: sleep 10 hook keeps the pod alive for 10 seconds, giving kube-proxy time to finish propagating the endpoint removal before the app starts shutting down.
  5. SIGTERM sent to PID 1 in each container. After the preStop hook completes, SIGTERM is sent. Your app should handle this: stop accepting new connections, drain in-flight requests, close database connections cleanly, flush write buffers, then exit.
  6. terminationGracePeriodSeconds bounds the whole sequence. Default is 30 seconds, measured from when the preStop hook started (not when SIGTERM was sent), so preStop time counts against it. If preStop runs for 10 seconds and app draining takes 25 seconds, you need a grace period of at least 35 seconds, not the default 30.
  7. SIGKILL. If the process is still running when the grace period expires, the container runtime sends SIGKILL. No handlers. No cleanup. Everything in flight dies immediately. This is what you are trying to avoid.
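
The budget arithmetic from step 6 can be sketched as a pod template fragment. The 10-second margin, the container name, and the image are assumptions for illustration; the 10s preStop and 25s drain come from the example above.

```yaml
# Pod template fragment: size the grace period to cover preStop AND drain.
spec:
  terminationGracePeriodSeconds: 45   # 10s preStop + 25s drain = 35s, plus 10s margin (assumed)
  containers:
    - name: app                       # hypothetical container
      image: example.com/app:1.0      # placeholder image
      lifecycle:
        preStop:
          exec:
            # Needs a shell and sleep in the image; distroless images will fail here.
            command: ["/bin/sh", "-c", "sleep 10"]
```

With the default of 30 seconds, the same workload would be SIGKILLed about 5 seconds before its drain finished.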

  kubectl delete pod / rolling update / HPA scale-down
  t=0s
  │  1. API server sets deletionTimestamp on pod
  │     (the terminationGracePeriodSeconds clock starts here)
  │  2. EndpointSlice controller removes pod IP  ← ASYNC (takes 2-15s to propagate
  │                                                 through kube-proxy on all nodes)
  │
  │  ──── PARALLEL ────────────────────────────────────────────────────────────
  │
  │  3. preStop hook runs (exec or httpGet)
  │     e.g. sleep 10  ← this sleep IS the kube-proxy drain window
  │
  t=10s (preStop completes)
  │  4. SIGTERM → PID 1 in each container
  │     App should: stop listener, drain in-flight requests, close DB, exit
  │
  t=30s (terminationGracePeriodSeconds, default 30s, counted from t=0, so preStop time is included)
  │
  │  5. If process still running: SIGKILL
  │     No handlers. No cleanup. In-flight requests die immediately.
  │
  ─────────────────────────────────────────────────────────────────────────
  THE RACE: Steps 2 and 3 run in parallel. Without preStop sleep, SIGTERM
  arrives before kube-proxy finishes reprogramming iptables. New connections
  still arrive at the dying pod. TCP RST → 502/503 errors.
  ─────────────────────────────────────────────────────────────────────────

🔥 Production Reality

Steps 3 and 4 happen in parallel. This is the race condition. Without a preStop sleep, SIGTERM arrives before kube-proxy has finished reprogramming iptables on all nodes. If your Go service exits immediately on SIGTERM (because nobody wrote a signal handler), and kube-proxy still has stale rules on 30 other nodes, those nodes keep routing new connections to a dead process. TCP RST. nginx reports 502. Every rolling deployment has this problem if you have no preStop hook. Almost nobody has a preStop hook.

Q5: My Pod Keeps Restarting. CrashLoopBackOff. The Logs Show Nothing. Why?

CrashLoopBackOff means the container crashed, Kubernetes is applying exponential backoff before restarting it (10s, 20s, 40s, 80s... up to 5 minutes), and you are asking why. The absence of logs means the container crashed before it could write any. There are four causes.

Cause 1: OOMKilled

The container hit its memory limit. The Linux kernel OOM killer fired and terminated the process before the application could write a single log line. Check kubectl describe pod <name>. Under Last State you will see Reason: OOMKilled and exit code 137 (128 + 9, the signal number of SIGKILL). Fix: increase the memory limit to p99 usage plus 20-30% headroom, or fix the memory leak in the application.
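
The same information is available in structured form from kubectl get pod <name> -o yaml. A trimmed, illustrative excerpt (the container name and restart count are placeholders):

```yaml
# Illustrative excerpt of status.containerStatuses for an OOMKilled container.
status:
  containerStatuses:
    - name: api                     # placeholder name
      restartCount: 7               # placeholder count
      state:
        waiting:
          reason: CrashLoopBackOff  # currently backing off before the next restart
      lastState:
        terminated:
          reason: OOMKilled         # why the previous instance died
          exitCode: 137             # 128 + 9 (SIGKILL)
```

The current state only says the container is backing off. The cause lives under lastState.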

Cause 2: Missing Environment Variable or Secret

The application panics at startup when a required environment variable is missing or has the wrong value. This often happens before any logger is initialized: the app crashes at line 3 of its startup sequence and never writes anything. Check that all required env vars are set and all referenced Secrets and ConfigMaps exist in the namespace. kubectl describe pod will show CreateContainerConfigError if a Secret or ConfigMap reference is missing entirely, but it will not catch bad values.
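
A small sketch of the reference chain worth checking. The variable name DATABASE_URL, the key url, and the image are hypothetical; db-credentials is the Secret name used in the init container example earlier.

```yaml
# Container fragment: an env var sourced from a Secret.
containers:
  - name: api
    image: example.com/api:1.0     # placeholder image
    env:
      - name: DATABASE_URL         # hypothetical variable the app requires
        valueFrom:
          secretKeyRef:
            name: db-credentials   # Secret must exist in the pod's namespace
            key: url               # hypothetical key; it must exist in that Secret
```

Kubernetes validates that the reference resolves. It cannot tell whether the value behind it is a working connection string.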

Cause 3: Wrong Command or Entrypoint

The container exits immediately with a non-zero exit code (commonly 127 or 128 when the binary cannot be found) because the command or args in the pod spec override the Dockerfile's ENTRYPOINT or CMD in a way that is broken: command replaces ENTRYPOINT, args replaces CMD. The runtime tries to start the process, finds the binary does not exist, and gives up. No logs. Check the pod spec's containers[].command and containers[].args against the Dockerfile. Run kubectl exec on a running pod with the same image and check the binary path manually.
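
To make the override mapping concrete, a sketch with a hypothetical binary path and flag. The image name is the one from the init container example; the path and flag are illustrative only.

```yaml
# Container fragment: how pod spec fields map onto the image's Dockerfile.
containers:
  - name: web-app
    image: myregistry/web-app:v3.1.0
    command: ["/usr/local/bin/server"]   # replaces ENTRYPOINT (hypothetical path)
    args: ["--port=8000"]                # replaces CMD (hypothetical flag)
```

Setting only command discards the image's CMD as well, which is a common way to end up starting a binary without the arguments it expects.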

Cause 4: Init Container Failure

The app container never starts because an init container is failing in a loop. From the outside it can look like CrashLoopBackOff on the app container (kubectl get pods reports the status as Init:CrashLoopBackOff). Check with kubectl logs <pod> -c <init-container-name>. The init container may have logs even when the app container does not.

🚨 Interview Trap

“Check kubectl logs <pod>” is the first thing most people say. In CrashLoopBackOff, the current container may have just restarted and not written any logs yet. Always use kubectl logs <pod> --previous to see logs from the container instance that actually crashed. This is the single most commonly forgotten flag in Kubernetes debugging.

Probe Types Deep Dive

Three probes. Completely different consequences on failure. The most dangerous Kubernetes mistake is treating them as variations of the same thing.

Probe | On Failure | On Success | The Question It Answers
readinessProbe | Removed from endpoints (no kill) | Added to endpoints, receives traffic | Can this pod serve requests right now?
livenessProbe | Container killed + restarted | Nothing (normal) | Is this container deadlocked and beyond recovery?
startupProbe | Container killed (after threshold) | Liveness + readiness activate | Has the slow startup sequence finished?

  Container starts
       │
       │◄── startupProbe ONLY (blocks liveness + readiness until it passes)
       │    budget = failureThreshold × periodSeconds
       │    e.g.  30 × 10s = 300s window for a slow JVM startup
       │    If startupProbe fails beyond threshold → container killed
       │
       ▼  startupProbe passes (or not configured)
       │
       │◄── livenessProbe ────── FAILS → container KILLED + restarted
       │◄── readinessProbe ───── FAILS → removed from Service endpoints (NOT killed)
       │                                  Pod still Running. Just gets no traffic.
       ▼
  ┌─────────────────────────────────────────────────────────────────┐
  │  readiness PASS  → pod IP added to EndpointSlice                │
  │  readiness FAIL  → pod IP removed from EndpointSlice (no kill)  │
  │  liveness  FAIL  → container restarted (restartPolicy applies)  │
  └─────────────────────────────────────────────────────────────────┘

  DANGER ZONE: identical endpoints and thresholds for both probes
  ──────────────────────────────────────────────────────────────────
  Traffic spike → /health slow → BOTH probes fail simultaneously
               → liveness kills pod that was merely overwhelmed
               → replacement pod starts cold, under full load
               → fails immediately → death spiral

readinessProbe: Your Traffic Gate

The readinessProbe endpoint should check real dependencies. Database connectivity. Cache availability. Feature flag service health. If any critical dependency is down, the pod should fail readiness. It gets removed from rotation. Healthy pods absorb the traffic. When the dependency recovers, readiness passes and the pod re-enters rotation. No restart, no downtime. Just graceful load shifting.

Use a lower failure threshold for readiness. You want to shed traffic fast (2 failures × 5s period = 10 seconds to remove from endpoints) but recover fast too (1 success × 5s = back in rotation in 5 seconds).

livenessProbe: For Deadlocks Only

The livenessProbe endpoint should be lightweight and check nothing external. No database query. No cache ping. No downstream service call. It should answer one question: is this process still able to do work, or is it deadlocked and spinning forever on a lock it will never acquire?

If your liveness probe checks the database and the database goes down, all pods get killed simultaneously. You go from a partial outage (readiness removing pods from rotation one by one) to a complete outage (every pod restarting at once). This has happened. It will happen again in a cluster where someone thought “more comprehensive health check = better.”

Use a much higher failure threshold for liveness. A pod should need to be unresponsive for 45 to 150 seconds before being killed. A threshold of 3 with a 15-second period (3 × 15 = 45 seconds) is a reasonable starting point.

startupProbe: The JVM Saver

JVMs can take 30 to 120 seconds to warm up. Large Python applications take time to import their dependency graph. Services that load ML models at startup take what they take. Without a startupProbe, you have two bad options: set a large initialDelaySeconds on the liveness probe (a fixed guess that is either too short for the slowest start or wastefully long for every other one) or skip liveness entirely (no deadlock recovery).

The startupProbe is the correct solution. Configure failureThreshold × periodSeconds to equal your maximum expected startup time. The startupProbe runs exclusively during startup. Once it passes, it is done. Liveness activates with tight thresholds. A slow startup does not trigger a false liveness kill. A post-startup deadlock is caught quickly.

production probe configuration (save this)
# production-ready Deployment probe configuration
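apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0     # never reduce capacity below desired
  template:
    metadata:
      labels:
        app: api            # must match spec.selector.matchLabels
    spec: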
      terminationGracePeriodSeconds: 60  # must be > preStop + app drain time

      containers:
        - name: api
          image: myregistry/api:v2.4.1

          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "500m"    # == request → Guaranteed QoS
              memory: "512Mi"

          # ── startupProbe ──────────────────────────────────────────
          # Gives JVM / slow apps up to 30 × 10s = 300s to become live.
          # Liveness is DISABLED until this passes.
          startupProbe:
            httpGet:
              path: /healthz/startup
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
            failureThreshold: 30   # 300s total budget

          # ── livenessProbe ─────────────────────────────────────────
          # Checks ONLY that the process is not deadlocked.
          # NEVER checks external deps (DB, cache); see the livenessProbe section above.
          livenessProbe:
            httpGet:
              path: /healthz/live   # lightweight, no DB query
              port: 8080
            periodSeconds: 15
            failureThreshold: 3    # 45s before restart: high bar

          # ── readinessProbe ────────────────────────────────────────
          # Checks real dependencies. Fails = removed from endpoints.
          # Lower threshold: shed traffic fast.
          readinessProbe:
            httpGet:
              path: /healthz/ready  # checks DB ping, cache, etc.
              port: 8080
            periodSeconds: 5
            failureThreshold: 2    # 10s to shed traffic

          # ── preStop ───────────────────────────────────────────────
          # sleep 10 = kube-proxy drain window before SIGTERM.
          # Without this: 502s on every rolling deployment.
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 10"]

😅 Senior Engineer Confession

The most common probe configuration in production is no startupProbe, identical endpoints for readiness and liveness, and a failureThreshold of 3 for both. It works fine at low load. Under a traffic spike, when the health endpoint gets slow, both probes fail at the same time. Liveness kills pods that were merely overwhelmed, not broken. The replacements start cold, under full load, fail immediately, restart immediately. The death spiral is perfectly symmetric. Every team discovers this on a Tuesday afternoon in November. Or a Black Friday morning.

QoS Classes: Who Gets Evicted First When the Node Runs Out of Memory

Kubernetes assigns every pod one of three QoS classes based entirely on the relationship between its resource requests and limits. This class determines eviction order when a node is under memory pressure. Most engineers don't know their pod's QoS class until they learn it from a post-mortem.

QoS Class | Condition | Evicted | OOM Kill Risk
Guaranteed | Every container: requests == limits (CPU + memory) | Last | Only if container exceeds its own limit
Burstable | At least one container: requests < limits (or only requests set) | Second | When node is under memory pressure
BestEffort | No container has any requests or limits set | First. Always. | Very high under any memory pressure

Guaranteed QoS means the scheduler only places the pod on a node that has at least that much allocatable resource still unreserved. The Linux cgroup hard-limits the container to its specified memory. If the container allocates beyond its limit, it is OOMKilled, but that is a container-level event, not a node-level eviction. During node-level memory pressure, Guaranteed pods are the last ones the kubelet touches.

For every production-critical service: set requests equal to limits. kubectl get pod <name> -o jsonpath='{.status.qosClass}' tells you the class. It should return Guaranteed.

Guaranteed QoS (production-critical services only)
# Guaranteed QoS: every container in the pod has requests == limits.
# Result: last to be evicted under node memory pressure.
# Also more predictable: no CPU throttling surprises under load.
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: app
      resources:
        requests:
          cpu: "1"
          memory: "1Gi"
        limits:
          cpu: "1"       # == request → Guaranteed
          memory: "1Gi"  # == request → Guaranteed
    - name: sidecar
      resources:
        requests:
          cpu: "50m"
          memory: "64Mi"
        limits:
          cpu: "50m"     # == request → both containers Guaranteed
          memory: "64Mi"

🔥 Production Reality

BestEffort pods (those with no resource requests or limits) are evicted first under any memory pressure, with no warning and no respect for PodDisruptionBudgets. The most common way to accidentally create a BestEffort pod: a developer sets resources in the Deployment template, a CI pipeline applies a patch that removes the resources field, nobody notices because the pod still deploys and runs. Until the node has a bad Tuesday and your BestEffort production pod is gone. Add a LimitRange to every namespace with a sensible default. BestEffort should never reach production.
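
A minimal sketch of such a LimitRange. The object name, the namespace, and every number here are assumptions to be replaced with values measured for your workloads.

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-container-resources   # hypothetical name
  namespace: production               # hypothetical namespace
spec:
  limits:
    - type: Container
      defaultRequest:                 # applied when a container sets no requests
        cpu: "100m"
        memory: "128Mi"
      default:                        # applied when a container sets no limits
        cpu: "500m"
        memory: "256Mi"
```

A container that arrives with an empty resources field gets these defaults at admission, so the pod lands in Burstable instead of BestEffort. It is a safety net, not a substitute for explicit requests and limits.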

Production Disasters

Disaster 1: The Missing initialDelaySeconds

A new Python microservice was deployed. The database connection pool took 8 seconds to initialize: establishing 20 connections with TLS handshakes, running a preflight query, and registering a schema validator. Completely normal behavior. The readinessProbe was configured with initialDelaySeconds: 0.

The probe fired at t=1 second. The app was still initializing. The /health/ready endpoint returned 503 because the connection pool was not yet available. Failure. Pod marked Not Ready. The Service had no endpoints. Every incoming request returned 503.

The team spent 40 minutes reading application code, checking database logs, reviewing the network configuration, and rebooting the deployment three times. The app was fine every single time. The probe was the problem.

Fix: set initialDelaySeconds: 10 on the readinessProbe, which is enough time for the connection pool to initialize. Or better: add a startupProbe with a generous budget, keep initialDelaySeconds: 0 on the readinessProbe, and let the startupProbe protect the initialization window.

The lesson: a pod can look Running and healthy while never serving a single request. Check the endpoints first. kubectl get endpoints <service-name> with an empty result is the fastest diagnosis in the playbook.

Disaster 2: OOMKilled in Silence

A Node.js API service had a memory limit of 256Mi. Average production load: 180Mi. Comfortable headroom. The team set it and forgot it.

A customer submitted a batch export request that triggered a known (but unpatched) memory inefficiency in the JSON serialization code. Memory usage climbed from 180Mi toward the roughly 310Mi the export needed, and within 90 seconds the container hit its 256Mi limit. The Linux kernel OOM killer fired. The container was terminated before it wrote a single log line because the application logger had not yet flushed its buffer.

The container restarted (restartPolicy: Always). Under normal load at the next restart: 180Mi. Fine. The batch job was still in the queue. The customer resubmitted. Memory climbed again. OOMKilled again. CrashLoopBackOff. No application logs visible.

The team spent 90 minutes reading application code, adding debug logging, and deploying test builds looking for the crash. Nobody checked kubectl describe pod until a very tired senior engineer ran it and saw Reason: OOMKilled in the Last State section. The investigation was over in 30 seconds.

Fix: increase memory limit to 512Mi. Profile the JSON serializer. Add a Prometheus alert for container memory usage above 80% of limit. In every service. Before the next batch job.
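
One way to express that alert, as a hedged sketch of a Prometheus rule file. It assumes the cluster exposes cAdvisor's container_memory_working_set_bytes and kube-state-metrics' kube_pod_container_resource_limits; the group name, alert name, 5-minute hold, and severity label are illustrative choices.

```yaml
groups:
  - name: container-memory              # hypothetical group name
    rules:
      - alert: ContainerMemoryNearLimit # hypothetical alert name
        # Working set divided by the memory limit, per container.
        expr: |
          max by (namespace, pod, container) (
            container_memory_working_set_bytes{container!=""}
          )
          /
          max by (namespace, pod, container) (
            kube_pod_container_resource_limits{resource="memory"}
          )
          > 0.8
        for: 5m                         # assumption: ignore short spikes
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is above 80% of its memory limit"
```

A 90-second climb like the one in this incident would outrun a 5-minute hold, so a shorter for: value may be the better trade for bursty workloads.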

😅 Senior Engineer Confession

kubectl describe pod is the most valuable command in Kubernetes debugging and the most commonly skipped. Every crash loop investigation should start with it. Not the logs. Not the metrics dashboard. Not Slack. The pod description. It has the exit code, the OOM reason, the probe failure details, and the event history all in one place. Bookmark nothing. Just remember: describe pod first, ask questions second.

The Wall of Shame

😅 Senior Engineer Confession

Every item on this list is in production right now, at a company you have heard of, written by engineers who knew better and were moving fast. The fire alarm is on. Nobody has time to look at it.
  1. No readinessProbe at all. This is the Kubernetes equivalent of sending customers to a restaurant the moment the chef walks in the door: before the kitchen is set up, before the menu is printed, before the gas is on. Traffic arrives the instant the container starts. The database connection pool is still initializing. Users get 500s. The app is perfectly healthy. It just isn't ready yet. Consequence: every deployment spikes errors for the startup duration of your app, proportional to how slow it is. Add a readinessProbe to every container. It is not optional.
  2. Same readiness and liveness failureThreshold. You built a smoke detector that also calls the fire brigade and simultaneously burns the building down. Readiness failing means shed load. Liveness failing means this process is beyond recovery: kill it. Using identical thresholds means a temporarily slow pod gets killed instead of just paused. The replacement starts cold, under full load, fails the same probe, gets killed again. The death spiral is symmetric and self-reinforcing. Liveness failureThreshold should be at least 3x readiness. Always.
  3. initialDelaySeconds: 0 on a slow-starting app. You handed a new employee their first live customer call 10 seconds after they walked in the door. Before they know the product. Before they know the systems. Before they know where the bathroom is. The employee is not incompetent. They were just not given time to onboard. The probe fires before the app is remotely ready. Fails. A failing readinessProbe keeps the pod Not Ready and out of rotation until it finally passes. A failing livenessProbe is worse: it restarts the container before startup ever completes. Use a startupProbe for slow starters. It exists for exactly this reason.
  4. No preStop hook. The last person to leave the office turns off all the servers without warning anyone who's still working. SIGTERM arrives. The Go process exits in 200ms. kube-proxy is still reprogramming iptables on 40 nodes. New connections arrive at the dead process IP. TCP RST. nginx reports 502. Every deployment looks like a partial outage. Because it is. Add preStop: exec: sleep 10 to every Deployment. This is not configuration tuning. It is correctness.
  5. terminationGracePeriodSeconds: 0. SIGKILL is not a graceful shutdown. It is a hostage situation. The process has no say. In-flight requests die mid-response. Write buffers are not flushed. Database transactions may be partially committed. The only valid reason for terminationGracePeriodSeconds: 0 is a runaway process that you absolutely must kill immediately in an emergency. In a Deployment template, it is a 503 generator dressed up as configuration.
  6. Memory limit without a measured baseline. You told a chef they can use as much kitchen space as they want, but if they exceed some number you picked arbitrarily while writing the manifest, you will evict them mid-service with no warning. The customer's order dies on the pass. Profile your service. Measure p99 memory usage under load. Set limit to p99 plus 25% headroom. Not 256Mi because it sounded reasonable in a Slack message.
  7. CrashLoopBackOff ignored for days. The fire alarm has been going off for three days. Everyone decided it must be a false positive. It is not a false positive. CrashLoopBackOff means the application is crashing and restarting. The backoff protects the cluster, not the application. The application is crashing every five minutes, losing state, dropping connections, generating errors that users are silently absorbing. CrashLoopBackOff is a P1. Treat it like one.
  8. Running as root in the container. You gave every temporary contractor full admin rights to the building's control room because setting up limited access seemed complicated. A container running as root with a process breakout vulnerability becomes a node compromise, not just a container compromise. Set securityContext.runAsNonRoot: true and securityContext.readOnlyRootFilesystem: true on every container. The “it'll complicate the Dockerfile” objection is not a security argument.
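
For item 8, a container-level securityContext might look like the sketch below. The container name, image, and UID are hypothetical. allowPrivilegeEscalation and the /tmp emptyDir are additions beyond what the list names, included on the assumption that the app needs some writable scratch space once the root filesystem is read-only.

```yaml
# Fragment of a pod template; names and the UID are illustrative.
spec:
  containers:
    - name: api                                 # hypothetical container name
      image: registry.example.com/api:1.0.0     # hypothetical image
      securityContext:
        runAsNonRoot: true             # kubelet refuses to start the container as UID 0
        runAsUser: 10001               # assumed non-root UID that exists in the image
        readOnlyRootFilesystem: true
        allowPrivilegeEscalation: false
      volumeMounts:
        - name: tmp
          mountPath: /tmp              # writable scratch space on an otherwise read-only filesystem
  volumes:
    - name: tmp
      emptyDir: {}
```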

Best Practices

  1. Add a readinessProbe to every container. No exceptions. Traffic routing without a readiness gate is a deployment-time error generator.
  2. Use a startupProbe for any app with startup time over 15 seconds. JVMs, ML model loaders, services that run migrations at startup. Set failureThreshold × periodSeconds to your maximum startup budget.
  3. Use separate endpoints for liveness and readiness. Liveness checks only that the process is alive. Readiness checks real dependencies. Never the same endpoint.
  4. Set liveness failureThreshold at least 3x higher than readiness. Slow under load is not the same as broken beyond recovery.
  5. Never check external dependencies in the livenessProbe. Database down should mean readiness fails (pods removed from rotation), not liveness fails (all pods restarted simultaneously).
  6. Add preStop sleep to every Deployment. 10 to 15 seconds. This is the kube-proxy drain window. Without it, every rolling deployment drops connections during the endpoint propagation window.
  7. Set terminationGracePeriodSeconds to preStop duration + max request duration + 10 seconds. If preStop sleeps 10s and requests can take 20s: set 40s minimum.
  8. Use Guaranteed QoS for production-critical services. Set requests equal to limits for both CPU and memory, on every container in the pod. Run kubectl get pod <name> -o jsonpath='{.status.qosClass}' to verify.
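
Put together, the practices above could look something like this Deployment. It is a sketch under stated assumptions, not a drop-in manifest: the names, image, port, probe paths (/healthz/live and /healthz/ready), and resource numbers are all hypothetical, and the 20-second maximum request duration is taken from the worked example in item 7.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api                            # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      # preStop 10s + longest request 20s + 10s buffer
      terminationGracePeriodSeconds: 40
      containers:
        - name: api
          image: registry.example.com/api:1.0.0   # hypothetical image
          ports:
            - containerPort: 8080
          startupProbe:                # other probes are held off until this passes
            httpGet:
              path: /healthz/live
              port: 8080
            periodSeconds: 5
            failureThreshold: 24       # 24 x 5s = 120s startup budget
          readinessProbe:              # checks real dependencies; failure only removes the pod from endpoints
            httpGet:
              path: /healthz/ready
              port: 8080
            periodSeconds: 5
            failureThreshold: 3
          livenessProbe:               # process-only check; failure restarts the container
            httpGet:
              path: /healthz/live
              port: 8080
            periodSeconds: 5
            failureThreshold: 9        # 3x the readiness threshold
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "10"]   # requires a sleep binary in the image
          resources:                   # requests == limits for CPU and memory gives Guaranteed QoS
            requests:
              cpu: 500m
              memory: 512Mi
            limits:
              cpu: 500m
              memory: 512Mi
```

The liveness endpoint is deliberately the cheap one. If the database goes down, /healthz/ready fails and the pods leave rotation, while /healthz/live keeps passing and nothing gets restarted.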

FAQ

Does Running mean the app is healthy?

No. Running means at least one container process is executing. The application inside may be initializing, failing probes, or in a crash loop. Always check the READY column in kubectl get pods, which reflects the readinessProbe state and not just whether the container process is alive.

What is the difference between pod eviction and OOMKill?

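OOMKill is a container-level event: the Linux kernel terminates a specific container that exceeded its cgroup memory limit. The pod continues and the container restarts (restartPolicy applies). Eviction is a pod-level event: the kubelet terminates the entire pod due to node-level resource pressure. The evicted pod itself is not moved. It ends in the Failed phase, and its controller (a Deployment's ReplicaSet, for example) creates a replacement that the scheduler can place on another node. A bare pod with no controller is not recreated.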
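
The two cases also look different in the pod's status. The excerpts below are illustrative and heavily trimmed, in the shape kubectl get pod <name> -o yaml returns. The container name and restart count are made up.

```yaml
# OOMKill: the pod stays on its node; one container was killed and restarted.
status:
  phase: Running
  containerStatuses:
    - name: api                # hypothetical container name
      restartCount: 4          # example value
      lastState:
        terminated:
          reason: OOMKilled
          exitCode: 137        # 128 + 9 (SIGKILL)
---
# Eviction: the whole pod has failed; any replacement is a new pod.
status:
  phase: Failed
  reason: Evicted
```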

Can a pod be Running but receive zero traffic?

Yes. This is the most important thing in this article. A pod can be Running with a failing readinessProbe for hours. It will never receive traffic. It will not restart. It will not alert. It will just sit there, healthy from the container runtime's perspective, invisible to the Service. Always check endpoints: kubectl get endpoints <service>. An empty result is a complete outage.
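
Nothing alerts on this state by default, but a rule can be added. The sketch below is a Prometheus rule that assumes kube-state-metrics is installed and exposing kube_pod_status_phase and kube_pod_status_ready. The group name, alert name, severity, and 5-minute window are placeholders.

```yaml
groups:
  - name: pod-readiness                  # placeholder group name
    rules:
      - alert: PodRunningButNotReady     # placeholder alert name
        # Pods whose phase is Running while their Ready condition is not true.
        expr: |
          (kube_pod_status_phase{phase="Running"} == 1)
          and on (namespace, pod)
          (kube_pod_status_ready{condition="true"} == 0)
        for: 5m                          # assumed; should exceed your normal startup time
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} is Running but not Ready"
```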

What happens if preStop takes longer than terminationGracePeriodSeconds?

The grace period countdown starts when termination begins, and the preStop hook runs inside it. The hook does not get a separate budget. If preStop uses up the whole grace period, the Kubernetes documentation describes only a small one-off extension (2 seconds) before the container is force-killed with SIGKILL. Your carefully written graceful shutdown code gets little or no time to run. Set terminationGracePeriodSeconds to at least preStop duration + expected app drain time + a buffer.
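
As a sketch of that sizing rule, the fragment below pairs a long preStop hook with a grace period that actually covers it. The 45-second hook, 15-second drain time, container name, and image are assumptions chosen for illustration.

```yaml
# Pod template fragment; durations are illustrative.
spec:
  # Too short: the 45s hook alone exceeds 30s, so the app never gets to drain.
  # terminationGracePeriodSeconds: 30
  #
  # Sized to fit: 45s preStop + 15s assumed app drain + 10s buffer.
  terminationGracePeriodSeconds: 70
  containers:
    - name: api                                 # hypothetical container name
      image: registry.example.com/api:1.0.0     # hypothetical image
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "45"]            # requires a sleep binary in the image
```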

How do I check what QoS class a pod has?

kubectl get pod <name> -o jsonpath='{.status.qosClass}' returns Guaranteed, Burstable, or BestEffort. Run this on every production pod. If you see BestEffort, that is an incident waiting for a busy Tuesday to happen.
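
For reference, here is a minimal sketch of what produces each class. The pod names, image, and resource values are hypothetical. Only the shape of the resources block matters.

```yaml
# Guaranteed: every container sets CPU and memory, with requests == limits.
apiVersion: v1
kind: Pod
metadata:
  name: qos-guaranteed
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0.0
      resources:
        requests: {cpu: 500m, memory: 256Mi}
        limits: {cpu: 500m, memory: 256Mi}
---
# Burstable: at least one request or limit is set, but not the Guaranteed pattern.
apiVersion: v1
kind: Pod
metadata:
  name: qos-burstable
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0.0
      resources:
        requests: {cpu: 250m, memory: 128Mi}
        limits: {memory: 256Mi}
---
# BestEffort: no requests and no limits on any container.
apiVersion: v1
kind: Pod
metadata:
  name: qos-besteffort
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0.0
```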

🎤 The 60-Second Interview Answer

Back in the interview room. The whiteboard is still there. You've answered all five follow-up questions. Here is how you deliver the complete answer:

🎤 Say This Out Loud Until You Own It

“A pod starts in Pending: the API server has accepted it but no node has been assigned. The scheduler runs its filtering and scoring algorithm, picks a node, and patches the pod with that node's name. The kubelet on that node picks it up, pulls the image via CRI, sets up the network namespace via CNI, mounts volumes, and runs init containers in sequence. Every init container must complete successfully before any app container starts.

Once app containers start, the pod phase moves to Running, but Running does not mean ready to serve traffic. That is the readinessProbe's job. The probe fires on a schedule. When it passes, the pod IP is added to the Service's EndpointSlice. kube-proxy updates iptables on every node. Only then does the pod receive traffic.

The readinessProbe and livenessProbe do completely different things. Readiness failure removes the pod from endpoints, with no kill. Liveness failure kills and restarts the container. Using the same endpoint and thresholds for both is how you get a death spiral under load: a traffic spike makes the health endpoint slow, both probes fail, liveness kills pods that were merely overwhelmed, replacements start cold, fail immediately, repeat.

On termination: kubectl delete sets deletionTimestamp. The EndpointSlice controller removes the pod IP. kube-proxy propagates that removal through iptables asynchronously, a process that takes 2 to 15 seconds. Without a preStop hook, SIGTERM arrives before kube-proxy has finished, and new connections still arrive at the dying process. That is the source of 502s on every rolling deployment that lacks the hook. A preStop sleep of 10 seconds creates the drain window.

Finally, QoS class, determined by requests versus limits, controls eviction order under node memory pressure. Guaranteed class (requests equal limits) is last to be evicted. BestEffort (no requests, no limits) is first, always, immediately.”

If you can say that in one breath, you're getting the job.

Key Takeaways

  • Running means the container process started. Ready means the readinessProbe passed. These are not the same thing.
  • A pod can be Running with zero traffic, zero restarts, and zero alerts, for hours, if the readinessProbe keeps failing.
  • Readiness failure removes from endpoints. Liveness failure kills and restarts. Different probes. Different endpoints. Different thresholds.
  • Without a preStop sleep, every rolling deployment has a race condition between SIGTERM and kube-proxy endpoint propagation.
  • terminationGracePeriodSeconds must be greater than preStop duration plus app drain time or SIGKILL cuts your drain short.
  • Init containers are a startup precondition: the app container does not start until every init container has exited with code 0.
  • QoS class is determined by requests vs. limits. Guaranteed (requests == limits) is last evicted. BestEffort is first. Always.
  • kubectl describe pod first. Always. Before logs. Before metrics. Before Slack. The answer is almost always there.
