Kubernetes | 🔥 Production-Critical | ⚡ Senior Engineer Level

How a Request Travels from the Internet to a Kubernetes Pod

The interview question that eliminates 90% of candidates, answered at the kernel level. Every layer, every iptables chain, the CNI truth, and the race condition causing 502s on every deployment at your company right now.

“The Service doesn't route traffic. The Ingress doesn't route traffic. The Linux kernel does. Let's talk.”

Updated June 8, 2026 | 25 min read | Has ended 3 war rooms early

-- PagerDuty Alert --

It's 2:17 AM.

You check the cluster:

✅ Pods are Running

✅ Deployment is Healthy

✅ Service exists

✅ Ingress exists

Yet your users are staring at a 502.

You don't know why. You're about to.

Three Months Later. A Different Kind of War Room.

No PagerDuty this time. Fluorescent lights. A Google interview loop. The interviewer, calm and unhurried, writes one question on the whiteboard:

“Walk me through what happens when a request travels from the Internet to a Pod.”

You take a breath. You've debugged this path at 2 AM. You've survived incidents in it. You know this cold.

The Answer That Gets 80% of Candidates Eliminated

You draw the flow. Confident. Clean.

User
▼
ALB
▼
Ingress Controller
▼
ClusterIP Service
▼
Pod

The interviewer nods. Writes something down. Then looks up.

Interviewer keeps going:

❓ “What component actually decides which Pod receives the request?”

❓ “Where does kube-proxy participate in forwarding packets?”

❓ “What is the role of iptables or IPVS?”

❓ “What does the CNI plugin do?”

❓ “If kube-proxy crashes, does traffic stop immediately?”

Five questions. Most candidates stumble on question one and never recover. Engineers who can answer all five, and explain why, walk out with the offer. Let's answer every single one.

Before we go kernel-deep, here's the mental model that makes everything click. Imagine your HTTP request is a late-night food delivery order:

Food Delivery World                    | Kubernetes World
You (hungry, typing a URL)             | Browser / API client
Google Maps finding the address        | DNS resolution
The building's front entrance          | Cloud Load Balancer: everyone enters here
The actual building                    | Kubernetes Node
The silent elevator system             | kube-proxy / iptables: routes you without asking
Reception desk checking your name      | Ingress Controller: hostname & path routing
Restaurant manager knowing who's free  | Service: picks a healthy pod
The reservation book                   | EndpointSlice: lists available pod IPs
Hallways connecting the kitchens       | CNI plugin: invisible until they catch fire
The chef                               | Pod
The person actually cooking            | Container process

Hold that analogy. Everything below is the exact same thing, except the elevator is iptables rules in the Linux kernel, the hallways are veth pairs, and the reception desk is nginx running in a pod. Let's go.

Q1: What Component Actually Decides Which Pod Gets the Request?

Most people say “the Service.” That answer is like saying “the recipe decides what you eat.” A recipe is instructions. Someone still has to cook.

The correct answer is: the Linux kernel, via iptables rules that kube-proxy programmed.

Here is the uncomfortable truth about a Kubernetes Service: in iptables mode, the ClusterIP (that stable virtual IP like 10.96.45.12) is not assigned to any interface. No process listens on it. No socket is bound to it. It lives exclusively in iptables rules in the Linux kernel on every node in your cluster. When a packet arrives destined for that IP, the kernel intercepts it in the PREROUTING hook (or the OUTPUT hook, for traffic generated on the node itself) and rewrites the destination to a real pod IP before the packet even reaches a routing decision. This is called DNAT.

the iptables rules that ARE your Service
# What kube-proxy ACTUALLY writes into the Linux kernel.
# Run this on any node and see it yourself:

sudo iptables -t nat -L KUBE-SERVICES -n | grep 10.96.45.12
# → KUBE-SVC-EQCHZ7S2PJ72OHAY  tcp  --  0.0.0.0/0  10.96.45.12  tcp dpt:80

# That KUBE-SVC chain does probabilistic load balancing:
# KUBE-SEP-ABCDEF  33%  → DNAT to 10.244.1.5:8080   (Pod 1)
# KUBE-SEP-GHIJKL  50%  → DNAT to 10.244.2.8:8080   (Pod 2: 50% of what remains)
# KUBE-SEP-MNOPQR  100% → DNAT to 10.244.3.11:8080  (Pod 3: everything left over)

# ClusterIP 10.96.45.12 has NO process listening on it.
# No socket. No port binding. It exists ONLY in these rules.
# These iptables chains ARE the Service.

The chain is: PREROUTING → KUBE-SERVICES → KUBE-SVC-xxx (probabilistic fan-out across pods) → KUBE-SEP-xxx (DNAT to a specific pod IP:port). The “load balancing” is probabilistic math at the kernel level, applied once per new connection. No queue awareness. No connection counting. Just: each of the three pods gets roughly a third of new connections. Simple. Fast. Happens at wire speed.
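The 33% / 50% / 100% sequence in the rules above looks lopsided, but it is what makes the split even. A minimal sketch of that arithmetic follows; it is plain Python rather than anything kube-proxy ships, and the function names are illustrative assumptions.

```python
from fractions import Fraction


def rule_probabilities(n_pods):
    """Match probability written on each KUBE-SEP jump, in rule order.

    Rule i is only evaluated if rules 0..i-1 did not match, so it must
    fire with probability 1 / (number of pods still remaining).
    """
    return [Fraction(1, n_pods - i) for i in range(n_pods)]


def effective_shares(probabilities):
    """Overall share of new connections that each rule ends up taking."""
    shares = []
    not_yet_matched = Fraction(1)
    for p in probabilities:
        shares.append(not_yet_matched * p)
        not_yet_matched *= 1 - p
    return shares


probs = rule_probabilities(3)
print([str(p) for p in probs])                    # per-rule probabilities
print([str(s) for s in effective_shares(probs)])  # overall share per pod
```

Each rule only sees the traffic the earlier rules passed over, so the conditional probabilities 1/3, 1/2 and 1 multiply out to an equal third per pod.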

🚨 Interview Trap

The wrong answer is “kube-proxy decides.” kube-proxy programs the decision: it writes the rules. But by the time a packet arrives, kube-proxy is completely uninvolved. The kernel executes the rules. kube-proxy already left. This distinction is exactly what question two is about, and most people answer Q1 and Q2 identically. Don't be that person.

Q2: Where Does kube-proxy Participate in Forwarding Packets?

It doesn't.

kube-proxy is not a proxy. Despite literally being named kube-proxy. It is a controller, usually run as a DaemonSet pod, that watches the Kubernetes API for Service and EndpointSlice changes and translates them into iptables (or IPVS) rules in the Linux kernel. Once it writes those rules, kube-proxy has zero involvement in forwarding packets. It could crash immediately after writing the rules and every existing connection would be completely unaffected.

🧠 Memory Trick

kube-proxy is like a contractor who installs your building's elevator system. Once the elevator is installed and working, the contractor goes home. The elevator runs 24/7 without them. If the contractor gets hit by a bus, the elevator keeps running. New floors (new Services) don't get added until a new contractor shows up. That's kube-proxy in one analogy.

The “kube-proxy” name is a historical accident. In Kubernetes v1.0 it literally was a userspace proxy: every packet went through a real process. iptables mode arrived in v1.1 and became the default in v1.2. The name stayed, and it has been confusing engineers ever since.

When should you care about IPVS?

iptables evaluates rules linearly: O(n) per packet. With 5,000 Services and 5 pods each, the kernel holds more than 25,000 kube-proxy rules, and the first packet of a connection may be compared against thousands of them before it matches. That is not “a little slow.” That is a performance problem. IPVS uses kernel hash tables: O(1) lookup regardless of Service count, real scheduling algorithms (round-robin, least connections, source hashing), and incremental updates instead of full rule-set rewrites. For any cluster past ~500 Services, IPVS is a requirement you don't know you have yet.
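The difference between walking a chain and hashing a key is easy to see with a toy model. The sketch below does not touch the kernel; the 5,000 Service IPs and backend names are made up for illustration, and the only point is how the number of comparisons grows.

```python
def linear_lookup(rules, dst):
    """Scan an ordered rule list the way an iptables chain is walked."""
    comparisons = 0
    for service_ip, backends in rules:
        comparisons += 1
        if service_ip == dst:
            return backends, comparisons
    return None, comparisons


def hash_lookup(table, dst):
    """Single keyed lookup, the access pattern IPVS relies on."""
    return table.get(dst), 1


# 5,000 hypothetical Services from 10.96.0.0 upward, one backend list each.
services = [(f"10.96.{i // 256}.{i % 256}", [f"pod-{i}"]) for i in range(5000)]
table = dict(services)

last_ip = services[-1][0]
print(linear_lookup(services, last_ip)[1])  # comparisons for the last Service
print(hash_lookup(table, last_ip)[1])       # comparisons with a hash table
```

A destination that matches nothing is the worst case for the linear scan: every rule is checked before the packet falls through.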

switching to IPVS: do this before you need to
# Check current kube-proxy mode
kubectl get configmap kube-proxy -n kube-system -o yaml | grep mode

# Switch to IPVS  (do this before 500 Services, not after)
kubectl edit configmap kube-proxy -n kube-system
#   Change:  mode: ""
#   To:      mode: "ipvs"
kubectl rollout restart ds/kube-proxy -n kube-system

# Confirm virtual servers are populated
ipvsadm -Ln | grep -A 4 "10.96.45.12"
# TCP  10.96.45.12:80 rr
#   -> 10.244.1.5:8080   Masq  1  0  0
#   -> 10.244.2.8:8080   Masq  1  0  0

Q3: What Do iptables/IPVS Actually Do?

They do DNAT: Destination Network Address Translation. They rewrite the destination IP and port of a packet in the kernel, before any routing decision happens, so the rest of the network stack processes it as if it was always headed to the pod.

The return trip works because of conntrack, the Linux connection tracking table. When iptables rewrites a destination, conntrack records the mapping: “packets from this source matching this 5-tuple were NATed from X to Y.” When the pod responds, the kernel looks up the response in conntrack and automatically reverses the NAT, rewriting the reply's source from the pod IP back to the ClusterIP. The caller never learns the pod's real IP.
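To make the forward rewrite and the reverse lookup concrete, here is a toy model of that table. It is a sketch under heavy simplification (TCP only, no timeouts, no states), and the class and method names are assumptions for illustration, not kernel APIs.

```python
class ConnTrack:
    """Toy model of the kernel's connection tracking table."""

    def __init__(self):
        # Key: the reply tuple we expect back. Value: the original destination.
        self.entries = {}

    def dnat(self, src, sport, dst, dport, backend, backend_port):
        """Rewrite the destination of a request packet and remember it."""
        reply_key = (backend, backend_port, src, sport)
        self.entries[reply_key] = (dst, dport)
        return (src, sport, backend, backend_port)

    def un_nat(self, src, sport, dst, dport):
        """Restore the original address on a reply packet, if tracked."""
        original = self.entries.get((src, sport, dst, dport))
        if original is None:
            return (src, sport, dst, dport)  # untracked: leave untouched
        return (original[0], original[1], dst, dport)


ct = ConnTrack()
# Client 10.244.3.7 calls the ClusterIP; the kernel picks pod 10.244.1.5.
forwarded = ct.dnat("10.244.3.7", 40000, "10.96.45.12", 80, "10.244.1.5", 8080)
# The pod replies from its own IP; conntrack puts the ClusterIP back.
reply = ct.un_nat("10.244.1.5", 8080, "10.244.3.7", 40000)
print(forwarded)
print(reply)
```

Every tracked connection occupies one entry for as long as it lives, which is why the size of this table becomes a capacity limit under load.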

🔥 Production Reality

The conntrack table has a hard size cap: nf_conntrack_max. Under high traffic with many short-lived connections, this table fills up and new connections are dropped at the kernel level, below kube-proxy, below everything your application monitoring tracks. The only trace is a “nf_conntrack: table full, dropping packet” line in the node's kernel log. The ALB sees timeouts. The ingress logs show upstream errors. Your pods look completely healthy. This is conntrack exhaustion. Check it by comparing two values on the node:
sysctl net.netfilter.nf_conntrack_count
sysctl net.netfilter.nf_conntrack_max
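The same comparison can be scripted for a node-level check. This is a minimal sketch that reads the two counters from procfs when the conntrack module is loaded; the 80% warning threshold is an arbitrary assumption for illustration, not a kernel or Kubernetes default.

```python
from pathlib import Path

PROC = Path("/proc/sys/net/netfilter")


def usage_ratio(count, maximum):
    """Fraction of the conntrack table currently in use."""
    if maximum <= 0:
        raise ValueError("nf_conntrack_max must be positive")
    return count / maximum


def read_conntrack(proc_dir=PROC):
    """Return (count, max) from procfs, or None if conntrack is unavailable."""
    try:
        count = int((proc_dir / "nf_conntrack_count").read_text())
        maximum = int((proc_dir / "nf_conntrack_max").read_text())
    except (OSError, ValueError):
        return None
    return count, maximum


def verdict(count, maximum, warn_at=0.8):
    """warn_at is an illustrative threshold chosen for this sketch."""
    return "WARN" if usage_ratio(count, maximum) >= warn_at else "OK"


reading = read_conntrack()
if reading is not None:
    print(verdict(*reading), reading)
```

On a machine without conntrack loaded the script prints nothing, so it is safe to run anywhere.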

Q4: What Does the CNI Plugin Do?

After iptables rewrites the packet destination to a real pod IP, something has to physically deliver it to that pod. That's the CNI plugin's job.

Every pod lives in its own isolated Linux network namespace: its own routing table, its own iptables, its own interfaces. It cannot grab packets off the node's eth0. It needs a private door. CNI creates that door.

When a pod starts, CNI creates a veth pair: a virtual ethernet cable where one end appears as eth0 inside the pod and the other sits on the host. In bridge-based plugins, all host-side veth ends connect to a bridge (cni0) on the host, which handles local pod-to-pod traffic. Cross-node traffic goes through the plugin's strategy: VXLAN encapsulation (Flannel), native BGP routes (Calico), or eBPF (Cilium).


  Pod A (10.244.1.5)        Pod B (10.244.1.8)
       eth0                      eth0
        |                          |
      veth2a                     veth2b   ← veth pair: one end in pod namespace, one on host
        |                          |
        └─────────────┬────────────┘
                      │
                cni0 bridge (10.244.1.1/24)   ← local pod-to-pod traffic handled here
                      │
                eth0 (node: 10.0.0.15)
                      │
            ┌─────────┴──────────┐
            │  VXLAN (Flannel)   │  ← cross-node: wraps packet in UDP envelope
            │  or BGP (Calico)   │  ← cross-node: native L3 route, zero overhead
            └─────────┬──────────┘
                eth0 (node: 10.0.0.16)
                      │
            veth──────┴──────veth
             |                |
   Pod C (10.244.2.5)    Pod D (10.244.2.9)

Plugin       | Cross-node strategy            | Overhead       | Network Policy      | Best for
Flannel      | VXLAN (UDP wrap)               | Medium (~50 B) | External only       | Simplicity, dev clusters
Calico       | BGP native L3 routing          | Zero           | Native L3/L4        | Production default
Cilium       | eBPF (replaces kube-proxy too) | Very low       | L3/L4/L7            | Scale & observability
AWS VPC CNI  | Real VPC secondary IPs         | None           | Via security groups | EKS (must-use)
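Two pieces of the picture above reduce to small calculations: where the ~50 bytes of VXLAN overhead come from, and how a node tells local delivery from a cross-node hop. The sketch below assumes IPv4 outer headers and a per-node pod CIDR, as in the diagram; the function names are illustrative.

```python
import ipaddress

# Bytes added by VXLAN over IPv4: outer IP + UDP + VXLAN header + inner Ethernet.
VXLAN_OVERHEAD = 20 + 8 + 8 + 14


def pod_mtu(node_mtu, encapsulated):
    """Largest packet a pod can send without exceeding the node's MTU."""
    return node_mtu - VXLAN_OVERHEAD if encapsulated else node_mtu


def delivery_path(dst_pod_ip, local_pod_cidr):
    """Decide whether a packet stays on the node or crosses to another one."""
    if ipaddress.ip_address(dst_pod_ip) in ipaddress.ip_network(local_pod_cidr):
        return "local bridge"
    return "cross-node"


print(pod_mtu(1500, encapsulated=True))
print(delivery_path("10.244.1.8", "10.244.1.0/24"))
print(delivery_path("10.244.2.5", "10.244.1.0/24"))
```

With a 1500-byte node MTU, encapsulation leaves 1450 bytes for the pod, which is why overlay-based clusters usually configure a smaller MTU on pod interfaces.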

😅 Senior Engineer Confession

“Flannel just works” is the most dangerous phrase in Kubernetes networking. It works until you have a latency SLA, a security audit, or a cluster with 200 Services where you suddenly care about Network Policies. Choose your CNI deliberately. Migrating CNI plugins on a running production cluster is technically possible and practically the kind of afternoon that generates a very detailed post-mortem.

Q5: If kube-proxy Crashes, Does Traffic Stop Immediately?

This is the question that separates engineers who memorised the docs from engineers who understand the system. The wrong answer: “yes, traffic stops.”

The right answer: No. Here's exactly why.

kube-proxy writes rules into the Linux kernel. The kernel does not forget rules when the process that wrote them crashes. Those iptables entries live in kernel memory, not in the kube-proxy process. When kube-proxy dies, the kernel continues forwarding packets using the last set of rules it was given, indefinitely.

What actually breaks when kube-proxy crashes:

  • New Services created after the crash get no iptables rules on that node. Their traffic is silently dropped.
  • Pod IP changes from rolling deployments don't propagate. Rules may route to terminated pod IPs.
  • Existing Services with existing pods keep working perfectly.

kube-proxy runs as a DaemonSet, so the kubelet normally restarts it within ~30 seconds. During that window, your cluster is frozen in time: routing correctly for everything that existed before the crash, quietly breaking anything new.
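The failure boundary described above can be modelled in a few lines. This is a toy simulation, not kube-proxy code: the Node class, its fields, and the IPs are assumptions chosen to show that the rules outlive the process that wrote them.

```python
class Node:
    """Toy node: kernel_rules persists even when the proxy is not running."""

    def __init__(self):
        self.kernel_rules = {}    # ClusterIP -> list of pod IPs
        self.proxy_running = True

    def sync(self, desired):
        """What kube-proxy does on a change: copy API state into the kernel."""
        if self.proxy_running:
            self.kernel_rules = {ip: list(pods) for ip, pods in desired.items()}

    def forward(self, cluster_ip):
        """What the kernel does per connection. kube-proxy is not consulted."""
        pods = self.kernel_rules.get(cluster_ip)
        return pods[0] if pods else None


api_state = {"10.96.45.12": ["10.244.1.5"]}
node = Node()
node.sync(api_state)

node.proxy_running = False                  # kube-proxy crashes
api_state["10.96.45.12"] = ["10.244.9.9"]   # a rollout replaces the pod
api_state["10.96.77.1"] = ["10.244.2.2"]    # a new Service is created
node.sync(api_state)                        # nothing reaches the kernel

print(node.forward("10.96.45.12"))  # stale pod IP from before the crash
print(node.forward("10.96.77.1"))   # no rule at all for the new Service
```

Once the proxy is running again, the next sync brings the kernel back in line with the API, which matches the "frozen in time" behaviour during the restart window.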

🚨 Interview Trap

“Nothing breaks” is equally wrong. The correct answer describes the exact failure boundary: existing kernel rules keep forwarding; new Services and pod IP changes stop propagating until kube-proxy restarts. That level of precision is what gets you the offer. The interviewer is testing whether you know the difference between “the rules in the kernel” and “the process that writes the rules.” They are different things. Most people don't know that.

The Full 7-Layer Path (Now That You Know What's Actually Happening)

Here is the complete picture. Every box is something you can inspect, log, or accidentally break in a running cluster. Every arrow is a place where an engineer has filed a 3 AM incident.


  Internet: GET https://api.myapp.com/checkout

       │  DNS → 52.10.1.200
       ▼
  ┌──────────────────────────────────────┐
  │  AWS ALB / GCP HTTPS LB              │  TLS termination, L7 routing
  └──────────────┬───────────────────────┘
                 │  HTTP → NodePort :32080
                 ▼
  ┌──────────────────────────────────────────────────┐
  │  Kubernetes Node  (10.0.0.15)                    │
  │                                                  │
  │  iptables PREROUTING                             │
  │  └─▶ KUBE-SERVICES                               │
  │       └─▶ DNAT → Ingress Controller pod IP       │
  │                                                  │
  │  ┌────────────────────────────────────────────┐  │
  │  │  Ingress Controller (nginx / traefik)      │  │
  │  │  Reads Ingress rules → routes by host/path │  │
  │  └──────────────┬─────────────────────────────┘  │
  │                 │  HTTP → ClusterIP:80           │
  │  iptables PREROUTING                             │
  │  └─▶ KUBE-SERVICES                               │
  │       └─▶ DNAT → real Pod IP (from EndpointSlice)│
  │                                                  │
  │  ┌────────────────────────────────────────────┐  │
  │  │  CNI  (veth → bridge → node routing)       │  │
  │  └──────────────┬─────────────────────────────┘  │
  │                 │                                │
  │  ┌──────────────▼─────────────────────────────┐  │
  │  │  Pod eth0: 10.244.2.5   Container: :8080   │  │
  │  └────────────────────────────────────────────┘  │
  └──────────────────────────────────────────────────┘

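Reading this diagram after understanding the five questions hits differently. The Ingress Controller is just a pod making HTTP requests to Services. Those Services are virtual IPs that don't exist anywhere. The iptables rules that do exist forward packets to real pods through veth pairs and bridges. Every layer is hiding how much work it's doing. (One caveat: some controllers, ingress-nginx included, skip the ClusterIP by default and proxy straight to the pod IPs they read from EndpointSlices. The DNAT hop in front of the controller itself is still there.)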

🧠 Memory Trick

When debugging: ask “at which layer did the packet die?” DNS wrong → never reaches the LB. NodePort unreachable → stops at the node. EndpointSlice empty → stops at iptables (no pod selected). Pod IP unreachable directly → CNI issue. Pod IP reachable but returns error → application issue. Work down the path in order. The answer is at exactly one layer.
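That "work down the path in order" rule can be written as a tiny triage helper. The sketch below is only a way to structure notes during an incident; the check names and layer labels are assumptions for illustration and nothing here queries a cluster.

```python
# Ordered from the edge of the network inward. Names are illustrative.
LAYERS = [
    ("dns_resolves", "DNS"),
    ("nodeport_reachable", "load balancer / NodePort"),
    ("endpoints_present", "Service / EndpointSlice"),
    ("pod_ip_reachable", "CNI"),
    ("app_returns_2xx", "application"),
]


def first_failing_layer(checks):
    """Return the outermost layer whose check failed, or None if all pass.

    'checks' maps each check name to True/False. A missing check counts
    as a failure, so an unverified layer is never silently skipped.
    """
    for check, layer in LAYERS:
        if not checks.get(check, False):
            return layer
    return None


observed = {
    "dns_resolves": True,
    "nodeport_reachable": True,
    "endpoints_present": False,   # e.g. selector does not match any pod
    "pod_ip_reachable": True,
    "app_returns_2xx": True,
}
print(first_failing_layer(observed))
```

Treating an unchecked layer as failed is a deliberate choice: it forces the investigation to confirm each hop instead of assuming it.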

The 502 Race Condition Nobody Tells You About

There is a timing race in every Kubernetes cluster that has not applied one specific 15-line YAML fix. It can cause 502 errors on every single rolling deployment. It affects most production clusters running right now. It is completely preventable. Almost nobody has fixed it.

Here is the scenario: a rolling deployment starts. A pod enters Terminating. Two things now happen in parallel: the kubelet starts shutting the container down, and the EndpointSlice controller takes the pod's IP out of rotation in the EndpointSlice. kube-proxy watches for that change and needs to update iptables on every node, but this propagation typically takes 2 to 15 seconds.

During that window: iptables still routes new TCP connections to the terminating pod's IP. But if the application process already exited, because it received SIGTERM and shut down immediately, those new connections get a TCP RST. The Ingress Controller reports that as a 502.

The fix is embarrassingly simple:

the YAML that eliminates 502s on every deployment
# Pod spec fragment: this is spec.template.spec of a Deployment.
spec:
  terminationGracePeriodSeconds: 60   # must cover the preStop sleep plus shutdown time
  containers:
    - name: payment-api
      lifecycle:
        preStop:
          exec:
            # This 15-second sleep is doing more for your 502 rate
            # than any amount of readiness probe tuning.
            # Add it. Do not argue. Do not skip it.
            # Needs /bin/sh and sleep to exist in the container image.
            command: ["/bin/sh", "-c", "sleep 15"]
      readinessProbe:
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 5
        failureThreshold: 2      # fast removal from load balancing
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 15
        periodSeconds: 10
        failureThreshold: 6      # slow restart: do NOT panic-kill during traffic spikes

⚡ Pro Tip

The preStop hook runs before the container receives SIGTERM, so the sleep keeps the pod serving for 15 seconds after it is marked Terminating: long enough for kube-proxy to update iptables and stop routing new connections here. This single change eliminates deployment-time 502s in virtually every cluster it's added to. It should be in your organisation's base Deployment template. Add it to every workload you own. Tell your team. They will thank you at 2 AM.
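The race comes down to two timers, and a short calculation shows why the sleep closes the gap. This is a simplified model with assumed inputs: one propagation delay for the whole cluster, and an application that stops serving a fixed time after SIGTERM.

```python
def rst_window(propagation_s, prestop_sleep_s, shutdown_s=0.0):
    """Seconds during which stale rules point at a process that has exited.

    propagation_s   : time until every node's rules drop the pod IP
    prestop_sleep_s : preStop sleep, which delays SIGTERM
    shutdown_s      : how long the app keeps serving after SIGTERM
    """
    process_gone_at = prestop_sleep_s + shutdown_s
    return max(0.0, propagation_s - process_gone_at)


def fits_grace_period(prestop_sleep_s, shutdown_s, grace_period_s):
    """The preStop hook and the shutdown both count against the grace period."""
    return prestop_sleep_s + shutdown_s <= grace_period_s


print(rst_window(propagation_s=5, prestop_sleep_s=0))    # no hook
print(rst_window(propagation_s=5, prestop_sleep_s=15))   # with the hook
print(fits_grace_period(15, 30, grace_period_s=60))
```

The second helper is a reminder that terminationGracePeriodSeconds has to be larger than the sleep plus the application's own shutdown time, otherwise the kubelet kills the container before it finishes draining.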

The Black Friday Disaster (A Story in Two Failures)

60-node EKS cluster. 120 microservices. Normal peak: 4,000 RPS. Black Friday morning: 18,000 RPS in the first hour. 4.5× the historical peak.

The checkout service handled it. The networking stack did not.

At 10:14 AM, p99 latency crossed 8 seconds. Error rate hit 22%. Pods were healthy. Memory was normal. Application logs showed nothing. The team stared at dashboards that all looked green. It took 40 minutes to find two compounding networking failures that were invisible unless you knew what to look for.

Failure 1: iptables lock contention

3,400 Services. CI/CD still running: 15 deployments in the first hour. Each deployment caused kube-proxy to rewrite iptables rules. In iptables mode, rule rewrites hold kernel write locks. Under 18,000 RPS, those lock windows added 200–400 ms of latency per affected packet. Intermittent. Unpredictable. Completely invisible in application metrics. Fix: freeze non-critical deployments. Latency dropped from 8 s to 1.2 s within three minutes.

Failure 2: conntrack table exhaustion

Under 18,000 RPS with short-lived HTTP connections, the Linux conntrack table filled up. New connections dropped at the kernel level, below iptables, below every monitoring layer. ALB saw timeouts. Ingress logs showed upstream errors. Pods were completely fine. Fix: sysctl -w net.netfilter.nf_conntrack_max=2000000 pushed to every node via a one-liner DaemonSet.
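A rough way to size the table is arrival rate multiplied by how long each entry lingers. The sketch below uses assumed inputs rather than figures from this incident: 3,000 new connections per second on one node, entries lingering about 120 seconds in TIME_WAIT (the usual nf_conntrack_tcp_timeout_time_wait value), and an arbitrary 2x headroom factor.

```python
import math


def conntrack_entries_needed(new_conns_per_s, linger_s):
    """Steady-state entries: arrival rate times how long each entry lives."""
    return math.ceil(new_conns_per_s * linger_s)


def suggested_max(new_conns_per_s, linger_s, headroom=2.0):
    """headroom is an arbitrary safety factor chosen for illustration."""
    return math.ceil(conntrack_entries_needed(new_conns_per_s, linger_s) * headroom)


# Assumed per-node load, not a measurement from the story above.
print(conntrack_entries_needed(3000, 120))
print(suggested_max(3000, 120))
```

Short-lived connections are the expensive case: each one holds its entry long after the request has finished, so the table size tracks connection churn rather than concurrent requests.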

🔥 Production Reality

The lesson is not “tune your conntrack table.” The lesson is that at scale, failures compound across networking layers in ways that look like application problems from the outside. “Pods are healthy” is not an incident summary. It is the beginning of the investigation. The only way to debug these fast is to know every layer.
the 7-step networking incident playbook
# The 7-step networking incident playbook.
# Run in order. The answer is almost always in steps 1–3.

# 1. Do endpoints exist?  (zero = complete silent failure)
kubectl get endpointslices -n production \
  -l kubernetes.io/service-name=payment-api

# 2. Are pods READY, not just Running?
kubectl get pods -n production -l app=payment-api

# 3. Do Service selectors match pod labels?  (most common root cause)
kubectl get svc payment-api -n production -o jsonpath='{.spec.selector}'
kubectl get pods -n production --show-labels | grep payment-api

# 4. What does the event log say?
kubectl get events -n production \
  --sort-by='.lastTimestamp' | tail -20

# 5. Is the readinessProbe passing?
kubectl describe pod -n production \
  -l app=payment-api | grep -A 8 "Readiness:"

# 6. What is the ingress controller complaining about?
#    (-E is required so that | means "or" instead of a literal pipe)
kubectl logs -n ingress-nginx \
  deploy/ingress-nginx-controller --tail=100 \
  | grep -E "upstream|502|503"

# 7. Bypass Service: hit pod IP directly  (isolates Service vs pod)
kubectl run debug --image=curlimages/curl \
  --rm -it --restart=Never -- \
  curl -sv http://10.244.1.5:8080/healthz
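Step 3 of the playbook is the one most easily misread by eye, so it helps to see the matching rule written out. The sketch below compares a selector with a pod's labels the way an equality-based Service selector does; the label values are made up, and the dictionaries stand in for the output of the two kubectl commands.

```python
def selector_matches(selector, labels):
    """True if every key/value in the Service selector appears on the pod.

    An empty selector returns False here, since a Service without a
    selector does not pick up any pods on its own.
    """
    if not selector:
        return False
    return all(labels.get(key) == value for key, value in selector.items())


def explain_mismatch(selector, labels):
    """List the selector entries a pod fails to satisfy."""
    return [
        f"{key}: want {value!r}, pod has {labels.get(key)!r}"
        for key, value in selector.items()
        if labels.get(key) != value
    ]


service_selector = {"app": "payment-api", "tier": "backend"}
pod_labels = {"app": "payment-api", "tier": "back-end", "pod-template-hash": "abc"}

print(selector_matches(service_selector, pod_labels))
print(explain_mismatch(service_selector, pod_labels))
```

Extra labels on the pod are harmless; a single selector key that is missing or misspelled is enough to leave the EndpointSlice empty.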

The Wall of Shame: Mistakes That Are in Production Right Now

😅 Senior Engineer Confession

Every item on this list has been made in production, by experienced engineers, at companies you have heard of, more than once. The first step is acknowledgment. The second step is YAML.
  1. No preStop hook on any Deployment. This is the Kubernetes equivalent of kicking your tenant out of the apartment before the new one has moved in. The address still exists in iptables. The key still works. But nobody is home, because the process already exited 10 seconds ago. Every new connection gets a TCP RST. nginx calls that a 502. This is happening on every rolling deployment in your cluster right now if you don't have this hook. Fifteen lines of YAML. That's all it takes.
  2. Same failureThreshold for readiness and liveness probes. You built a smoke detector that also burns the building down when it triggers. Readiness failing means “I'm overwhelmed, please stop sending me traffic.” Liveness failing means “I am broken beyond recovery, please kill me.” These are not the same situation. A pod that is slow during a Black Friday traffic spike does not need to be killed and restarted. It needs a moment to breathe. Liveness failureThreshold should be at least 3× readiness. Always.
  3. Running iptables mode past 500 Services. Imagine asking 25,000 people for directions, one by one, for every single car on the road. That is O(n) iptables evaluation at 5,000 Services × 5 pods. It is not slow in the way that running is slow. It is slow in the way that asking 25,000 people for directions is slow. IPVS is Google Maps. You have Google Maps. Use Google Maps.
  4. Applying Ingress YAML without installing an Ingress Controller. You hung a professionally formatted sign on the wall that says “Restaurant This Way.” There is no restaurant. There is no kitchen. There is no chef. The sign looks great. Zero requests are being routed. The Ingress resource is the menu. The Ingress Controller is the kitchen. No kitchen, no food. This specific mistake is made in new cluster setups at least twice a week across the industry. You are not alone. Install the controller.
  5. Hardcoding a pod IP anywhere it will outlive a deployment. Memorising a stranger's phone number instead of saving their contact. The number works. Until they get a new phone, which happens every time the pod restarts, every time a node drains, every time a deployment rolls. Pod IPs are ephemeral by design. That is their defining characteristic. Service DNS exists specifically so you never need to know a pod's IP. Use it.
  6. Ignoring externalTrafficPolicy. The default Cluster mode is like asking someone in Chicago to pick up your order in New York and deliver it to Los Angeles, when Los Angeles is right next door. Extra network hop, source IP NATed away, your security team can no longer tell where requests are coming from. If you have ever spent an afternoon debugging “why do all 10 million requests appear to come from the same IP,” you have already paid the price of this default. Set Local, and remember that nodes without a local pod then stop accepting that traffic.
  7. No PodDisruptionBudget on production workloads. A node drain during routine cluster maintenance can evict every replica of your Service simultaneously. This is the infrastructure equivalent of deciding to replace all four tyres at the same time while the car is on the motorway. Theoretically the car can be put back together. Practically it produces a very detailed post-mortem. minAvailable: 1 takes 30 seconds to write and has never once regretted existing.
  8. Trusting the JVM's DNS cache. The JVM decided in 1997 that caching successful DNS lookups forever was reasonable (still the default whenever a security manager is installed; otherwise OpenJDK caches for 30 seconds). Kubernetes decided in 2014 that virtual IPs change when Services are recreated. These two decisions have been having an argument in your production environment ever since, producing traffic that silently routes to a dead IP for as long as the cache entry lives, while every health check reports green and nobody knows why a specific subset of users can't connect. Set networkaddress.cache.ttl=10. In every Java workload. Today.

Production Best Practices

  1. preStop sleep on every Deployment. 15–20 seconds. Non-negotiable. The single highest-ROI change for zero-downtime deployments.
  2. Separate readiness and liveness thresholds. Readiness = traffic routing. Liveness = process restart. Different concerns, different tuning. Liveness failureThreshold ≥ 3× readiness.
  3. PodDisruptionBudget on every production Service. minAvailable: 1. 30 seconds to add. Prevents maintenance-window surprises forever.
  4. Switch to IPVS before 500 Services. Proactively. One maintenance window. Never reactively during a latency incident.
  5. Alert on zero-endpoint EndpointSlices. A Service with no healthy backends is a silent complete outage. A 60-second Prometheus alert catches it before users report it.
  6. DNS TTL ≤ 60 seconds on production-facing records. Every 5-minute TTL is 5 minutes of “we fixed it but users can't tell yet” during failovers.
  7. 3+ Ingress Controller replicas with zone anti-affinity. All replicas on one node = single node failure takes down your entire ingress tier. One podAntiAffinity rule prevents this.
  8. Evaluate Cilium for high-throughput clusters. eBPF replaces iptables + kube-proxy with O(1) routing, L7-aware Network Policies, and p99 improvements that show up in graphs within hours of migration.
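The probe guidance in item 2 is simple arithmetic: how long each probe tolerates failure before the kubelet acts. The sketch below uses the period and threshold values from the YAML earlier in this article; the helper names are illustrative, and the timing is approximate because it ignores probe timeouts and initial delays.

```python
def seconds_to_trip(period_s, failure_threshold):
    """Approximate time from first failed probe until the kubelet acts."""
    return period_s * failure_threshold


def liveness_is_slower(readiness, liveness, factor=3):
    """Check the 'liveness threshold at least 3x readiness' rule of thumb."""
    return liveness["failureThreshold"] >= factor * readiness["failureThreshold"]


readiness = {"periodSeconds": 5, "failureThreshold": 2}
liveness = {"periodSeconds": 10, "failureThreshold": 6}

print(seconds_to_trip(readiness["periodSeconds"], readiness["failureThreshold"]))
print(seconds_to_trip(liveness["periodSeconds"], liveness["failureThreshold"]))
print(liveness_is_slower(readiness, liveness))
```

With these values a struggling pod is pulled out of load balancing after roughly 10 seconds but is only restarted after about a minute of continuous failure.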

FAQ

Does every request go through all 7 layers?

No. Internal pod-to-pod traffic via ClusterIP skips the cloud LB and Ingress Controller. Direct pod IP access skips iptables DNAT entirely. The full path applies to external HTTPS traffic entering through an Ingress.

How long does kube-proxy take to update iptables after an EndpointSlice change?

2–5 seconds on a lightly loaded cluster. Up to 30 seconds on a loaded cluster with high Service count. This is the window the preStop sleep covers. In clusters with aggressive deployment cadences, consider 20–30 second sleeps.

Ingress vs. LoadBalancer Service β€” which should I use?

LoadBalancer Service provisions one cloud LB per Service β€” one billing line item each. For 20 externally exposed services that is 20 LBs. One Ingress Controller handles all 20 via routing rules behind a single LB. Use Ingress for HTTP/HTTPS. Use LoadBalancer Services for non-HTTP protocols or when you need per-service cloud LB features like WAF or sticky sessions.

What is an EndpointSlice and why does it matter?

EndpointSlices are Kubernetes objects that list the current healthy pod IPs behind each Service. When a pod's readinessProbe passes, its IP is added. When it fails or terminates, it is removed. kube-proxy watches EndpointSlices to keep iptables current. kubectl get endpointslices -l kubernetes.io/service-name=your-svc showing ENDPOINTS: <none> is the fastest diagnosis of a dead Service. Check this first in any traffic incident.
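The "no ready backends" check can be expressed against the JSON shape of an EndpointSlice. The sketch below works on a hand-written sample trimmed to the fields it needs (endpoints, addresses, conditions.ready); in practice the input would come from kubectl with -o json. Function names are illustrative assumptions.

```python
def ready_addresses(endpoint_slice):
    """Collect addresses of endpoints whose 'ready' condition is not False.

    A missing 'ready' condition is treated as ready, following the
    guidance in the EndpointSlice API description.
    """
    addresses = []
    for endpoint in endpoint_slice.get("endpoints") or []:
        conditions = endpoint.get("conditions") or {}
        if conditions.get("ready") is not False:
            addresses.extend(endpoint.get("addresses", []))
    return addresses


def has_no_backends(slices):
    """True when no slice of a Service offers a single ready address."""
    return not any(ready_addresses(s) for s in slices)


# Hand-written sample, trimmed to the fields used here.
sample = {
    "kind": "EndpointSlice",
    "endpoints": [
        {"addresses": ["10.244.1.5"], "conditions": {"ready": True}},
        {"addresses": ["10.244.2.8"], "conditions": {"ready": False}},
    ],
}
print(ready_addresses(sample))
print(has_no_backends([sample]))
```

A Service can be split across several slices, which is why the second helper takes a list and only reports an outage when every slice comes up empty.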

Does the Ingress Controller add latency?

A few milliseconds for L7 inspection, TLS offload, header manipulation, and the extra TCP connection. In most applications this is irrelevant. In ultra-low-latency use cases (sub-5 ms budgets), consider direct LoadBalancer Services with Network Policies instead.

🎤 The 60-Second Interview Answer

Back in the interview room. The whiteboard is still there. You've answered all five follow-up questions. Here is how you deliver the complete answer, covering the simple path and the kernel detail that gets you the offer:

🎤 Say This Out Loud Until You Own It

“DNS resolves the domain to the cloud load balancer's IP. The LB terminates TLS and forwards to a NodePort on a cluster node.

On the node, kube-proxy's iptables rules DNAT that NodePort to the Ingress Controller pod's real IP. The Ingress Controller, which is just a pod running nginx or Traefik, reads Ingress resources, routes by hostname and path, then proxies to a Service ClusterIP.

Here's the key part: the ClusterIP is a virtual IP. No process listens on it. It exists only in iptables rules that kube-proxy programmed into the Linux kernel. Those rules DNAT again, from ClusterIP to a real pod IP selected probabilistically from the EndpointSlice.

The CNI plugin delivers the packet to the destination pod via a veth pair and a bridge, or VXLAN/BGP if the pod is on a different node. The container handles the request. The response reverses the path, with conntrack un-NATing at each hop.

Critical production detail: without a preStop sleep hook, iptables won't drain before the process exits during a rolling deployment. New connections hit a terminated process. nginx reports 502. This race condition affects every cluster without the hook, which is most of them.”

If you can say that in one breath, you're getting the job.

Key Takeaways

  • The Service ClusterIP is a kernel fiction: no process listens on it. The iptables DNAT rules are the Service.
  • kube-proxy programs the kernel and leaves. It does not proxy packets. It is named incorrectly.
  • kube-proxy crashing does not stop traffic. Existing rules stay in the kernel. New changes stop propagating.
  • CNI handles pod-to-pod routing with veth pairs, bridges, and VXLAN or BGP for cross-node hops.
  • The 502-on-every-deployment race condition is fixed by a 15-second preStop sleep. Add it today.
  • Zero endpoints in an EndpointSlice means complete silent outage. Check this first in any traffic incident.
