PagerDuty Alert
It's 2:17 AM.
You check the cluster:
✓ Pods are Running
✓ Deployment is Healthy
✓ Service exists
✓ Ingress exists
Yet your users are staring at a 502.
You don't know why. You're about to.
Three Months Later. A Different Kind of War Room.
No PagerDuty this time. Fluorescent lights. A Google interview loop. The interviewer, calm and unhurried, writes one question on the whiteboard:
"Walk me through what happens when a request travels from the Internet to a Pod."
You take a breath. You've debugged this path at 2 AM. You've survived incidents in it. You know this cold.
The Answer That Gets 80% of Candidates Eliminated
You draw the flow. Confident. Clean.
The interviewer nods. Writes something down. Then looks up.
Interviewer keeps going:
- "What component actually decides which Pod receives the request?"
- "Where does kube-proxy participate in forwarding packets?"
- "What is the role of iptables or IPVS?"
- "What does the CNI plugin do?"
- "If kube-proxy crashes, does traffic stop immediately?"
Five questions. Most candidates stumble on question one and never recover. Engineers who can answer all five, and explain why, walk out with the offer. Let's answer every single one.
Before we go kernel-deep, here's the mental model that makes everything click. Imagine your HTTP request is a late-night food delivery order:
| Food Delivery World | Kubernetes World |
|---|---|
| You (hungry, typing a URL) | Browser / API client |
| Google Maps finding the address | DNS resolution |
| The building's front entrance | Cloud Load Balancer: everyone enters here |
| The actual building | Kubernetes Node |
| The silent elevator system | kube-proxy / iptables: routes you without asking |
| Reception desk checking your name | Ingress Controller: hostname & path routing |
| Restaurant manager knowing who's free | Service: picks a healthy pod |
| The reservation book | EndpointSlice: lists available pod IPs |
| Hallways connecting the kitchens | CNI plugin: invisible until they catch fire |
| The chef | Pod |
| The person actually cooking | Container process |
Hold that analogy. Everything below is the exact same thing, except the elevator is iptables rules in the Linux kernel, the hallways are veth pairs, and the reception desk is nginx running in a pod. Let's go.
Q1: What Component Actually Decides Which Pod Gets the Request?
Most people say "the Service." That answer is like saying "the recipe decides what you eat." A recipe is instructions. Someone still has to cook.
The correct answer is: the Linux kernel, via iptables rules that kube-proxy programmed.
Here is the uncomfortable truth about a Kubernetes Service: the ClusterIP, that stable virtual IP like 10.96.45.12, does not exist anywhere as a real IP address. No process listens on it. No socket is bound to it. It lives exclusively in iptables rules in the Linux kernel on every node in your cluster. When a packet arrives destined for that IP, the kernel intercepts it in the PREROUTING hook (or the OUTPUT hook, for traffic that originates on the node itself) and rewrites the destination to a real pod IP before the packet even reaches a routing decision. This is called DNAT.
# What kube-proxy ACTUALLY writes into the Linux kernel.
# Run this on any node and see it yourself:
sudo iptables -t nat -L KUBE-SERVICES -n | grep 10.96.45.12
# -> KUBE-SVC-EQCHZ7S2PJ72OHAY tcp -- 0.0.0.0/0 10.96.45.12 tcp dpt:80
# That KUBE-SVC chain does probabilistic load balancing:
# KUBE-SEP-ABCDEF  33% of new connections -> DNAT to 10.244.1.5:8080 (Pod 1)
# KUBE-SEP-GHIJKL  50% of the remainder   -> DNAT to 10.244.2.8:8080 (Pod 2)
# KUBE-SEP-MNOPQR  100% of what is left   -> DNAT to 10.244.3.11:8080 (Pod 3)
# ClusterIP 10.96.45.12 has NO process listening on it.
# No socket. No port binding. It exists ONLY in these rules.
# These iptables chains ARE the Service.

The chain is: PREROUTING → KUBE-SERVICES → KUBE-SVC-xxx (probabilistic fan-out across pods) → KUBE-SEP-xxx (DNAT to a specific pod IP:port). The "load balancing" is probabilistic math at the kernel level. No queue awareness. No connection counting. Just: each of the three pods ends up with roughly a third of new connections. Simple. Fast. Happens at wire speed.
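The 33% / 50% / 100% figures on the KUBE-SEP rules look lopsided, but each rule only sees the traffic that the earlier rules did not take. A minimal sketch of that arithmetic follows. The function name `sep_probabilities` is made up for illustration, and it assumes rule i of n carries a match probability of 1/(n - i), which is what the percentages listed above correspond to.

```shell
#!/bin/sh
# sep_probabilities N
# Print, for N endpoints, the probability written on each rule and the
# overall share of new connections that endpoint ends up receiving.
sep_probabilities() {
  awk -v n="$1" 'BEGIN {
    remaining = 1.0                 # fraction of traffic not yet claimed
    for (i = 0; i < n; i++) {
      p = 1 / (n - i)               # probability on rule i (hypothetical model)
      share = remaining * p         # overall share for this endpoint
      printf "rule %d: probability %.4f, overall share %.4f\n", i + 1, p, share
      remaining -= share
    }
  }'
}

sep_probabilities 3
# rule 1: probability 0.3333, overall share 0.3333
# rule 2: probability 0.5000, overall share 0.3333
# rule 3: probability 1.0000, overall share 0.3333
```

The last rule always matches, so no connection falls off the end of the chain.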
Q2: Where Does kube-proxy Participate in Forwarding Packets?
It doesn't.
kube-proxy is not a proxy. Despite literally being named kube-proxy. It is a controller: a DaemonSet pod that watches the Kubernetes API for Service and EndpointSlice changes and translates them into iptables (or IPVS) rules in the Linux kernel. Once it writes those rules, kube-proxy has zero involvement in forwarding packets. It could crash immediately after writing the rules and every existing connection would be completely unaffected.
Memory Trick
The "kube-proxy" name is a historical accident. In Kubernetes v1.0 it literally was a userspace proxy: every packet went through a real process. The iptables mode arrived in v1.1 and became the default in v1.2. The name stayed, and it has been confusing engineers ever since.
When should you care about IPVS?
iptables evaluates rules linearly: O(n) for the first packet of every new connection. With 5,000 Services and 5 pods each, kube-proxy maintains on the order of 25,000 service and endpoint rules, and a new connection can walk thousands of them before it finds a match. That is not "a little slow." That is a real performance problem. IPVS uses kernel hash tables: O(1) lookup regardless of Service count, real scheduling algorithms (round-robin, least connections, source hashing for sticky sessions), and far less lock contention on rule updates. For any cluster past ~500 Services, IPVS is a requirement you don't know you have yet.
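To see how close a node is to that territory, it helps to count what kube-proxy has actually programmed. The sketch below reads `iptables-save -t nat` output on stdin and tallies the three kinds of rules. The function name `count_kube_rules` and the sample input are illustrative assumptions, and the exact rule layout can differ between Kubernetes versions.

```shell
#!/bin/sh
# count_kube_rules: summarise kube-proxy's NAT rules.
# Reads `iptables-save -t nat` output on stdin.
count_kube_rules() {
  awk '
    /^-A KUBE-SERVICES / { svc++ }   # match rules, one per Service port
    /^-A KUBE-SVC-/      { fan++ }   # fan-out rules across endpoints
    /^-A KUBE-SEP-/      { sep++ }   # per-endpoint rules (DNAT targets)
    END {
      printf "services=%d fanout=%d endpoint=%d total=%d\n",
             svc, fan, sep, svc + fan + sep
    }
  '
}

# Hypothetical sample input; on a real node use:
#   sudo iptables-save -t nat | count_kube_rules
count_kube_rules <<'EOF'
-A KUBE-SERVICES -d 10.96.45.12/32 -p tcp -m tcp --dport 80 -j KUBE-SVC-EQCHZ7S2PJ72OHAY
-A KUBE-SVC-EQCHZ7S2PJ72OHAY -m statistic --mode random --probability 0.33333333349 -j KUBE-SEP-ABCDEF
-A KUBE-SVC-EQCHZ7S2PJ72OHAY -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-GHIJKL
-A KUBE-SVC-EQCHZ7S2PJ72OHAY -j KUBE-SEP-MNOPQR
-A KUBE-SEP-ABCDEF -p tcp -m tcp -j DNAT --to-destination 10.244.1.5:8080
-A KUBE-SEP-GHIJKL -p tcp -m tcp -j DNAT --to-destination 10.244.2.8:8080
-A KUBE-SEP-MNOPQR -p tcp -m tcp -j DNAT --to-destination 10.244.3.11:8080
EOF
# services=1 fanout=3 endpoint=3 total=7
```

Tracking the `services` figure over time gives an early warning well before latency graphs do.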
# Check current kube-proxy mode
kubectl get configmap kube-proxy -n kube-system -o yaml | grep mode
# Switch to IPVS (do this before 500 Services, not after)
kubectl edit configmap kube-proxy -n kube-system
# Change: mode: ""
# To: mode: "ipvs"
kubectl rollout restart ds/kube-proxy -n kube-system
# Confirm virtual servers are populated
ipvsadm -Ln | grep -A 4 "10.96.45.12"
# TCP 10.96.45.12:80 rr
# -> 10.244.1.5:8080 Masq 1 0 0
# -> 10.244.2.8:8080 Masq 1 0 0

Q3: What Do iptables/IPVS Actually Do?
They do DNAT: Destination Network Address Translation. They rewrite the destination IP and port of a packet in the kernel, before any routing decision happens, so the rest of the network stack processes it as if it had always been headed to the pod.
The return trip works because of conntrack, the Linux connection tracking table. When iptables rewrites a destination, conntrack records the mapping: "packets matching this 5-tuple were NATed from X to Y." When the pod responds, the kernel looks up the reply in conntrack and automatically reverses the NAT, restoring the ClusterIP as the source address of the reply. The caller never learns the pod's real IP.
Production Reality
The conntrack table has a fixed size, set by nf_conntrack_max. Under high traffic with many short-lived connections, this table fills up and new connections are dropped at the kernel level: below iptables, below kube-proxy, below everything your monitoring tracks. The only trace is a "nf_conntrack: table full, dropping packet" line in the kernel log. The ALB sees timeouts. The ingress logs show upstream errors. Your pods look completely healthy. This is conntrack exhaustion. Check it with: wc -l /proc/net/nf_conntrack vs sysctl net.netfilter.nf_conntrack_max

Q4: What Does the CNI Plugin Do?
After iptables rewrites the packet destination to a real pod IP, something has to physically deliver it to that pod. That's the CNI plugin's job.
Every pod lives in its own isolated Linux network namespace: its own routing table, its own iptables, its own interfaces. It cannot grab packets off the node's eth0. It needs a private door. CNI creates that door.
When a pod starts, CNI creates a veth pair: a virtual ethernet cable where one end appears as eth0 inside the pod and the other sits on the host. With bridge-based plugins, all host-side veth ends connect to a bridge (cni0) on the host, which handles local pod-to-pod traffic. Cross-node traffic goes through the plugin's strategy: VXLAN encapsulation (Flannel), native BGP routes (Calico), or eBPF (Cilium).
Pod A (10.244.1.5)        Pod B (10.244.1.8)
      eth0                      eth0
       |                         |
     veth2a                    veth2b      <- one end in pod namespace, one on host
       |                         |
       +------------+------------+
                    |
        cni0 bridge (10.244.1.1/24)        <- local pod-to-pod traffic handled here
                    |
          eth0 (node: 10.0.0.15)
                    |
         +----------+----------+
         |  VXLAN (Flannel)    |           <- cross-node: wraps packet in UDP envelope
         |  or BGP (Calico)    |           <- cross-node: native L3 route, no encapsulation
         +----------+----------+
          eth0 (node: 10.0.0.16)
                    |
       veth---------+---------veth
        |                       |
Pod C (10.244.2.5)        Pod D (10.244.2.9)

| Plugin | Cross-node strategy | Overhead | Network Policy | Best for |
|---|---|---|---|---|
| Flannel | VXLAN (UDP wrap) | Medium (~50 B per packet) | Via external add-on only | Simplicity, dev clusters |
| Calico | BGP native L3 routing | Zero | Native L3/L4 | Production default |
| Cilium | eBPF (replaces kube-proxy too) | Very low | L3/L4/L7 | Scale & observability |
| AWS VPC CNI | Real VPC secondary IPs | None | Via security groups | EKS (default CNI) |
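The local-versus-cross-node split in the diagram can be read straight off the pod IPs. The sketch below assumes each node owns one /24 pod subnet, as in the diagram (10.244.1.0/24 on one node, 10.244.2.0/24 on the other). That is an assumption for illustration, not a rule every CNI follows, and `pod_path` is a made-up helper name.

```shell
#!/bin/sh
# pod_path SRC_IP DST_IP
# Report whether traffic between two pods stays on one node or crosses nodes.
# ASSUMPTION: every node owns exactly one /24 pod subnet.
pod_path() {
  src_net=${1%.*}   # drop the last octet: 10.244.1.5 -> 10.244.1
  dst_net=${2%.*}
  if [ "$src_net" = "$dst_net" ]; then
    echo "local: stays on the node bridge"
  else
    echo "cross-node: uses the plugin's VXLAN, BGP, or eBPF path"
  fi
}

pod_path 10.244.1.5 10.244.1.8   # Pod A -> Pod B
pod_path 10.244.1.5 10.244.2.5   # Pod A -> Pod C
# local: stays on the node bridge
# cross-node: uses the plugin's VXLAN, BGP, or eBPF path
```

When an incident only affects some pod pairs, sorting them into these two groups is a quick way to tell a bridge problem from an overlay or routing problem.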
Q5: If kube-proxy Crashes, Does Traffic Stop Immediately?
This is the question that separates engineers who memorised the docs from engineers who understand the system. The wrong answer: "yes, traffic stops."
The right answer: No. Here's exactly why.
kube-proxy writes rules into the Linux kernel. The kernel does not forget rules when the process that wrote them crashes. Those iptables entries live in kernel memory, not in the kube-proxy process. When kube-proxy dies, the kernel continues forwarding packets using the last set of rules it was given, indefinitely.
What actually breaks when kube-proxy crashes:
- New Services created after the crash get no iptables rules on that node. Their traffic is silently dropped.
- Pod IP changes from rolling deployments don't propagate. Rules may route to terminated pod IPs.
- Existing Services with existing pods keep working perfectly.
kube-proxy is a DaemonSet, so Kubernetes restarts it, typically within ~30 seconds. During that window, your cluster is frozen in time: routing correctly for everything that existed before the crash, quietly breaking anything new.
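One way to spot a node that is "frozen in time" is to compare the DNAT targets still present in its rules with the endpoint IPs the API server currently knows about. This is a rough sketch, not an official tool: `stale_dnat_targets` is a hypothetical helper, it expects a saved `iptables-save -t nat` dump and a file with one live endpoint IP per line, and the sample data is invented.

```shell
#!/bin/sh
# stale_dnat_targets RULES_FILE ENDPOINTS_FILE
# Print every DNAT target IP in RULES_FILE that is absent from ENDPOINTS_FILE.
stale_dnat_targets() {
  grep -o -- '--to-destination [0-9.]*' "$1" \
    | awk '{ print $2 }' \
    | sort -u \
    | grep -vxF -f "$2"
}

# Hypothetical sample data.
rules=$(mktemp)
endpoints=$(mktemp)
cat > "$rules" <<'EOF'
-A KUBE-SEP-ABCDEF -p tcp -m tcp -j DNAT --to-destination 10.244.1.5:8080
-A KUBE-SEP-GHIJKL -p tcp -m tcp -j DNAT --to-destination 10.244.2.8:8080
EOF
printf '10.244.1.5\n' > "$endpoints"

stale_dnat_targets "$rules" "$endpoints"
# 10.244.2.8
rm -f "$rules" "$endpoints"
```

On a live cluster the endpoint list could come from something like `kubectl get endpointslices -A -o jsonpath='{.items[*].endpoints[*].addresses[*]}' | tr ' ' '\n'`. An empty result means the node's rules and the API agree.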
The Full 7-Layer Path (Now That You Know What's Actually Happening)
Here is the complete picture. Every box is something you can inspect, log, or accidentally break in a running cluster. Every arrow is a place where an engineer has filed a 3 AM incident.
Internet: GET https://api.myapp.com/checkout
        |  DNS -> 52.10.1.200
        v
+--------------------------------------+
| AWS ALB / GCP HTTPS LB               |  TLS termination, L7 routing
+------------------+-------------------+
                   |  HTTP -> NodePort :32080
                   v
+--------------------------------------------------+
| Kubernetes Node (10.0.0.15)                      |
|                                                  |
|  iptables PREROUTING                             |
|    +-> KUBE-SERVICES                             |
|      +-> DNAT -> Ingress Controller pod IP       |
|                                                  |
|  +--------------------------------------------+  |
|  | Ingress Controller (nginx / traefik)       |  |
|  | Reads Ingress rules, routes by host/path   |  |
|  +---------------------+----------------------+  |
|                        |  HTTP -> ClusterIP:80   |
|  iptables PREROUTING                             |
|    +-> KUBE-SERVICES                             |
|      +-> DNAT -> real Pod IP (from EndpointSlice)|
|                                                  |
|  +--------------------------------------------+  |
|  | CNI (veth -> bridge -> node routing)       |  |
|  +---------------------+----------------------+  |
|                        |                         |
|  +---------------------v----------------------+  |
|  | Pod eth0: 10.244.2.5    Container: :8080   |  |
|  +--------------------------------------------+  |
+--------------------------------------------------+

Reading this diagram after understanding the five questions hits differently. The Ingress Controller is just a pod making HTTP requests to Services. Those Services are virtual IPs that don't exist anywhere. The iptables rules that do exist forward packets to real pods through veth pairs and bridges. Every layer is hiding how much work it's doing.
The 502 Race Condition Nobody Tells You About
There is a timing bug in every Kubernetes cluster that lacks a specific 15-line YAML fix. It can cause 502 errors on every rolling deployment. It affects most production clusters running right now. It is completely preventable. Almost nobody has fixed it.
Here is the scenario: a rolling deployment starts. A pod enters Terminating. The endpoint controller removes the pod IP from the EndpointSlice. kube-proxy watches for this change and needs to update iptables, but this propagation typically takes 2 to 15 seconds.
During that window, iptables still routes new TCP connections to the terminating pod's IP. But if the application process has already exited, because it received SIGTERM and shut down immediately, those new connections get a TCP RST. The Ingress Controller reports that as a 502.
The fix is embarrassingly simple:
# Pod spec (spec.template.spec in a Deployment)
spec:
  terminationGracePeriodSeconds: 60
  containers:
    - name: payment-api
      lifecycle:
        preStop:
          exec:
            # This 15-second sleep is doing more for your 502 rate
            # than any amount of readiness probe tuning.
            # Add it. Do not argue. Do not skip it.
            command: ["/bin/sh", "-c", "sleep 15"]
      readinessProbe:
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 5
        failureThreshold: 2   # fast removal from load balancing
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 15
        periodSeconds: 10
        failureThreshold: 6   # slow restart: do NOT panic-kill during traffic spikes

Pro Tip
The preStop hook runs before the container receives SIGTERM, so the sleep keeps the process serving for 15 more seconds: long enough for kube-proxy to update iptables and stop routing new connections here. This single change removes deployment-time 502s in virtually every cluster it's added to. It should be in your organisation's base Deployment template. Add it to every workload you own. Tell your team. They will thank you at 2 AM.

The Black Friday Disaster (A Story in Two Failures)
60-node EKS cluster. 120 microservices. Normal peak: 4,000 RPS. Black Friday morning: 18,000 RPS in the first hour. 4.5× the historical peak.
The checkout service handled it. The networking stack did not.
At 10:14 AM, p99 latency crossed 8 seconds. Error rate hit 22%. Pods were healthy. Memory was normal. Application logs showed nothing. The team stared at dashboards that all looked green. It took 40 minutes to find two compounding networking failures that were invisible unless you knew what to look for.
Failure 1: iptables lock contention
3,400 Services. CI/CD still running: 15 deployments in the first hour. Each deployment caused kube-proxy to rewrite iptables rules. In iptables mode, rule rewrites hold kernel write locks. Under 18,000 RPS, those lock windows added 200 to 400 ms of latency per affected packet. Intermittent. Unpredictable. Completely invisible in application metrics. Fix: freeze non-critical deployments. Latency dropped from 8 s to 1.2 s within three minutes.
Failure 2: conntrack table exhaustion
Under 18,000 RPS with short-lived HTTP connections, the Linux conntrack table filled up. New connections dropped at the kernel level: below iptables, below every monitoring layer. ALB saw timeouts. Ingress logs showed upstream errors. Pods were completely fine. Fix: sysctl -w net.netfilter.nf_conntrack_max=2000000 pushed to every node via a one-liner DaemonSet.
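Conntrack exhaustion is cheap to watch for once you know it exists. The sketch below turns the current entry count and the configured maximum into a utilisation percentage. `conntrack_usage` is a hypothetical helper; the two /proc paths are the standard ones on Linux nodes with the nf_conntrack module loaded, and the block skips them quietly when they are absent.

```shell
#!/bin/sh
# conntrack_usage COUNT MAX
# Print conntrack table utilisation as a whole-number percentage.
conntrack_usage() {
  echo $(( $1 * 100 / $2 ))
}

# On a node, read the live values if they are available.
count_file=/proc/sys/net/netfilter/nf_conntrack_count
max_file=/proc/sys/net/netfilter/nf_conntrack_max
if [ -r "$count_file" ] && [ -r "$max_file" ]; then
  pct=$(conntrack_usage "$(cat "$count_file")" "$(cat "$max_file")")
  echo "conntrack table is ${pct}% full"
fi

# Hypothetical numbers: 196,608 tracked flows against a 262,144-entry table.
conntrack_usage 196608 262144
# 75
```

Exporting this number per node and alerting somewhere around 80% would have turned this 40-minute hunt into a page with the answer in the title.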
# The 7-step networking incident playbook.
# Run in order. The answer is almost always in steps 1 to 3.
# 1. Do endpoints exist? (zero = complete silent failure)
kubectl get endpointslices -n production -l kubernetes.io/service-name=payment-api
# 2. Are pods READY, not just Running?
kubectl get pods -n production -l app=payment-api
# 3. Do Service selectors match pod labels? (most common root cause)
kubectl get svc payment-api -n production -o jsonpath='{.spec.selector}'
kubectl get pods -n production --show-labels | grep payment-api
# 4. What does the event log say?
kubectl get events -n production --sort-by='.lastTimestamp' | tail -20
# 5. Is the readinessProbe passing?
kubectl describe pod -n production -l app=payment-api | grep -A 8 "Readiness:"
# 6. What is the ingress controller complaining about?
#    (-E is required: plain grep treats | as a literal character)
kubectl logs -n ingress-nginx deploy/ingress-nginx-controller --tail=100 | grep -E "upstream|502|503"
# 7. Bypass Service: hit pod IP directly (isolates Service vs pod)
kubectl run debug --image=curlimages/curl --rm -it --restart=Never -- curl -sv http://10.244.1.5:8080/healthz

The Wall of Shame: Mistakes That Are in Production Right Now
Senior Engineer Confession
- No preStop hook on any Deployment. This is the Kubernetes equivalent of kicking your tenant out of the apartment before the new one has moved in. The address still exists in iptables. The key still works. But nobody is home, and the process already exited 10 seconds ago. Every new connection gets a TCP RST. nginx calls that a 502. This is happening on every rolling deployment in your cluster right now if you don't have this hook. Fifteen lines of YAML. That's all it takes.
- Same failureThreshold for readiness and liveness probes. You built a smoke detector that also burns the building down when it triggers. Readiness failing means "I'm overwhelmed, please stop sending me traffic." Liveness failing means "I am broken beyond recovery, please kill me." These are not the same situation. A pod that is slow during a Black Friday traffic spike does not need to be killed and restarted. It needs a moment to breathe. Liveness failureThreshold should be at least 3× readiness. Always.
- Running iptables mode past 500 Services. Imagine asking 25,000 people for directions, one by one, for every single car on the road. That is O(n) iptables evaluation at 5,000 Services × 5 pods. It is not slow in the way that running is slow. It is slow in the way that asking 25,000 people for directions is slow. IPVS is Google Maps. You have Google Maps. Use Google Maps.
- Applying Ingress YAML without installing an Ingress Controller. You hung a professionally formatted sign on the wall that says "Restaurant This Way." There is no restaurant. There is no kitchen. There is no chef. The sign looks great. Zero requests are being routed. The Ingress resource is the menu. The Ingress Controller is the kitchen. No kitchen, no food. This specific mistake is made in new cluster setups all the time. You are not alone. Install the controller.
- Hardcoding a pod IP anywhere it will outlive a deployment. Memorising a stranger's phone number instead of saving their contact. The number works. Until they get a new phone, which happens every time the pod restarts, every time a node drains, every time a deployment rolls. Pod IPs are ephemeral by design. That is their defining characteristic. Service DNS exists specifically so you never need to know a pod's IP. Use it.
- Ignoring externalTrafficPolicy. The default Cluster mode is like asking someone in Chicago to pick up your order in New York and deliver it to Los Angeles, when Los Angeles is right next door. Extra network hop, source IP NATed away, your security team can no longer tell where requests are coming from. If you have ever spent an afternoon debugging "why do all 10 million requests appear to come from the same IP," you have already paid the price of this default. Set Local.
- No PodDisruptionBudget on production workloads. A node drain during routine cluster maintenance can evict every replica of your Service simultaneously. This is the infrastructure equivalent of deciding to replace all four tyres at the same time while the car is on the motorway. Theoretically the car can be put back together. Practically it produces a very detailed post-mortem. minAvailable: 1 takes 30 seconds to write and has never once regretted existing.
- Trusting the JVM's DNS cache. The JVM decided in 1997 that caching successful DNS lookups forever was reasonable (it is still the default when a security manager is installed). Kubernetes decided in 2014 that virtual IPs change when Services are recreated. These two decisions have been having an argument in your production environment ever since, producing traffic that silently routes to a dead IP for the entire lifetime of the JVM process, while every health check reports green and nobody knows why a specific subset of users can't connect. Set networkaddress.cache.ttl=10. In every Java workload. Today.
Production Best Practices
- preStop sleep on every Deployment. 15 to 20 seconds. Non-negotiable. The single highest-ROI change for zero-downtime deployments.
- Separate readiness and liveness thresholds. Readiness = traffic routing. Liveness = process restart. Different concerns, different tuning. Liveness failureThreshold ≥ 3× readiness.
- PodDisruptionBudget on every production Service. minAvailable: 1. 30 seconds to add. Prevents maintenance-window surprises forever.
- Switch to IPVS before 500 Services. Proactively. One maintenance window. Never reactively during a latency incident.
- Alert on zero-endpoint EndpointSlices. A Service with no healthy backends is a silent complete outage. A 60-second Prometheus alert catches it before users report it.
- DNS TTL ≤ 60 seconds on production-facing records. Every 5-minute TTL is 5 minutes of "we fixed it but users can't tell yet" during failovers.
- 3+ Ingress Controller replicas with zone anti-affinity. All replicas on one node = single node failure takes down your entire ingress tier. One podAntiAffinity rule prevents this.
- Evaluate Cilium for high-throughput clusters. eBPF replaces iptables + kube-proxy with O(1) routing, L7-aware Network Policies, and p99 improvements that show up in graphs within hours of migration.
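The readiness/liveness split in the list above is easier to reason about as time budgets. As a rough approximation that ignores timeoutSeconds, a probe takes about periodSeconds × failureThreshold of continuous failure before it acts. The helper name `probe_budget` is made up, and the numbers are the ones from the probe configuration shown earlier.

```shell
#!/bin/sh
# probe_budget PERIOD_SECONDS FAILURE_THRESHOLD
# Approximate seconds of continuous failure before the probe triggers.
probe_budget() {
  echo $(( $1 * $2 ))
}

readiness=$(probe_budget 5 2)    # periodSeconds: 5,  failureThreshold: 2
liveness=$(probe_budget 10 6)    # periodSeconds: 10, failureThreshold: 6
echo "pulled from load balancing after ~${readiness}s, restarted after ~${liveness}s"
# pulled from load balancing after ~10s, restarted after ~60s
```

With these values a struggling pod stops receiving traffic after about 10 seconds but gets close to a minute to recover before it is killed, which is the gap the 3× rule is meant to guarantee.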
FAQ
Does every request go through all 7 layers?
No. Internal pod-to-pod traffic via ClusterIP skips the cloud LB and Ingress Controller. Direct pod IP access skips iptables DNAT entirely. The full path applies to external HTTPS traffic entering through an Ingress.
How long does kube-proxy take to update iptables after an EndpointSlice change?
2 to 5 seconds on a lightly loaded cluster. Up to 30 seconds on a loaded cluster with high Service count. This is the window the preStop sleep covers. In clusters with aggressive deployment cadences, consider 20 to 30 second sleeps.
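Longer sleeps have one catch: the preStop hook runs inside terminationGracePeriodSeconds, so the sleep and the application's own shutdown time both have to fit in that budget or the container is killed mid-drain. A small sanity check, with `grace_budget_ok` as a hypothetical helper and the shutdown times as example values:

```shell
#!/bin/sh
# grace_budget_ok GRACE_SECONDS PRESTOP_SLEEP SHUTDOWN_SECONDS
# Check that the preStop sleep plus the app's shutdown time fits inside
# terminationGracePeriodSeconds.
grace_budget_ok() {
  needed=$(( $2 + $3 ))
  if [ "$needed" -le "$1" ]; then
    echo "ok: $(( $1 - needed ))s of headroom"
  else
    echo "too tight: short by $(( needed - $1 ))s"
  fi
}

grace_budget_ok 60 15 20   # the 60 s grace period and 15 s sleep used above
grace_budget_ok 30 30 10   # default 30 s grace period with a 30 s sleep
# ok: 25s of headroom
# too tight: short by 10s
```

If you raise the sleep to 30 seconds, raise terminationGracePeriodSeconds with it.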
Ingress vs. LoadBalancer Service β which should I use?
LoadBalancer Service provisions one cloud LB per Service, one billing line item each. For 20 externally exposed services that is 20 LBs. One Ingress Controller handles all 20 via routing rules behind a single LB. Use Ingress for HTTP/HTTPS. Use LoadBalancer Services for non-HTTP protocols or when you need per-service cloud LB features like WAF or sticky sessions.
What is an EndpointSlice and why does it matter?
EndpointSlices are Kubernetes objects that list the current healthy pod IPs behind each Service. When a pod's readinessProbe passes, its IP is added. When it fails or terminates, it is removed. kube-proxy watches EndpointSlices to keep iptables current. kubectl get endpointslices -l kubernetes.io/service-name=your-svc showing ENDPOINTS: <none> is the fastest diagnosis of a dead Service. Check this first in any traffic incident.
Does the Ingress Controller add latency?
A few milliseconds for L7 inspection, TLS offload, header manipulation, and the extra TCP connection. In most applications this is irrelevant. In ultra-low-latency use cases (sub-5 ms budgets), consider direct LoadBalancer Services with Network Policies instead.
The 60-Second Interview Answer
Back in the interview room. The whiteboard is still there. You've answered all five follow-up questions. Here is how you deliver the complete answer, covering the simple path and the kernel detail that gets you the offer:
Say This Out Loud Until You Own It
"DNS resolves the domain to the cloud load balancer's IP. The LB terminates TLS and forwards to a NodePort on a cluster node.
On the node, kube-proxy's iptables rules DNAT that NodePort to the Ingress Controller pod's real IP. The Ingress Controller, just a pod running nginx or Traefik, reads Ingress resources, routes by hostname and path, then proxies to a Service ClusterIP.
Here's the key part: the ClusterIP is a virtual IP. No process listens on it. It exists only in iptables rules that kube-proxy programmed into the Linux kernel. Those rules DNAT again, from ClusterIP to a real pod IP selected probabilistically from the EndpointSlice.
The CNI plugin delivers the packet to the destination pod via a veth pair and a bridge, or VXLAN/BGP if the pod is on a different node. The container handles the request. The response reverses the path, with conntrack un-NATing at each hop.
Critical production detail: without a preStop sleep hook, the iptables rules are not updated before the process exits during a rolling deployment. New connections hit a terminated process. nginx reports 502. This race condition affects every cluster without the hook, which is most of them."
If you can say that in one breath, you're getting the job.
Key Takeaways
- The Service ClusterIP is a kernel fiction: no process listens on it. The iptables DNAT rules are the Service.
- kube-proxy programs the kernel and leaves. It does not proxy packets. It is named incorrectly.
- kube-proxy crashing does not stop traffic. Existing rules stay in the kernel. New changes stop propagating.
- CNI handles pod-to-pod routing with veth pairs, bridges, and VXLAN or BGP for cross-node hops.
- The 502-on-every-deployment race condition is fixed by a 15-second preStop sleep. Add it today.
- Zero endpoints in an EndpointSlice means complete silent outage. Check this first in any traffic incident.