Kubernetes · June 8, 2026 · 19 min read

etcd Explained for DevOps Engineers: The Database That Runs Your Kubernetes Cluster

A deep technical guide covering Raft consensus, MVCC, backup and restore procedures, production incidents, monitoring, and interview Q&A. Written from the perspective of someone who has restored a corrupted etcd cluster from backup while customers were paging.

🚨 Real Incident

At 11:47 PM on a Wednesday, a production Kubernetes cluster stopped accepting new workloads. kubectl apply commands hung with no output. kubectl get pods still worked. kubectl create namespace hung indefinitely. PagerDuty was lit up. After 30 minutes of investigation (checking API server logs, network connectivity, and node health), the team finally found it: etcd had run out of disk space on its data volume. The etcd alarm system had triggered, locking the cluster into read-only mode. No new objects could be created. No changes could be written. The cluster was completely frozen. The fix required compacting etcd history, defragmenting the database, and raising the storage quota. Recovery took two hours. A proper monitoring strategy would have surfaced the db_size warning days earlier.

That incident is the reason this article exists. etcd is not a secondary concern in Kubernetes operations; it is the cluster. Every pod definition, every secret, every ConfigMap, every node registration, every service endpoint: all of it lives in etcd. When etcd is unhealthy, Kubernetes is unhealthy, no matter how pristine your nodes and workloads are.

Most DevOps engineers interact with Kubernetes daily without ever thinking about etcd. That is fine, until something goes wrong. And when etcd fails, the blast radius is total. This guide gives you the knowledge to understand, operate, monitor, and recover etcd in production.

What Is etcd and Why Does Kubernetes Need It?

etcd (pronounced "et-see-dee") is a distributed, strongly consistent key-value store created by CoreOS in 2013. The name comes from /etc, the Unix directory for system configuration, combined with d for distributed. It was purpose-built to store configuration data that must be consistent across a distributed system.

Kubernetes chose etcd as its backing store for one reason: strong consistency guarantees. Kubernetes cannot afford eventual consistency. If two API server instances disagree on the current state of a Deployment, the controllers will fight each other, double-schedule pods, and produce chaos. etcd's Raft consensus algorithm guarantees that every read returns the most recent write (a property called linearizability), which makes it safe for distributed control systems.

Every Kubernetes object (pods, services, namespaces, ConfigMaps, Secrets, RBAC roles, CRDs) is serialized and stored as a key-value pair in etcd. Built-in types are stored as protobuf, while custom resources are stored as JSON. The API server is the only component that writes to or reads from etcd directly. Every other component (kube-scheduler, kube-controller-manager, kubelet) talks to the API server, which translates requests into etcd operations.

🎯 Interview Tip

When asked "what is the role of etcd in Kubernetes?" do not just say "it stores cluster state." Say: "etcd is the sole source of truth for all Kubernetes objects. It provides linearizable reads and writes using the Raft consensus algorithm. The API server is its only client. Every watch event that drives controller reconciliation originates from an etcd watch stream." That answer signals production experience.

etcd Data Model: What Lives Inside

Key-Value Store with Hierarchical Keys

etcd stores data as flat key-value pairs, but Kubernetes uses path-like keys to create a logical hierarchy. All Kubernetes objects live under the /registry/ prefix. The pattern is /registry/<resource-type>/<namespace>/<name> for namespaced resources, and /registry/<resource-type>/<name> for cluster-scoped resources.
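
The key pattern above can be sketched as a small helper. This is only an illustration of the layout described in this section; registry_key is a hypothetical name, not part of etcd or any Kubernetes client library.

```python
from typing import Optional


def registry_key(resource: str, name: str, namespace: Optional[str] = None) -> str:
    """Build the etcd key for a Kubernetes object (illustrative helper only)."""
    parts = ["/registry", resource]
    if namespace is not None:
        # Namespaced resources carry the namespace between type and name.
        parts.append(namespace)
    parts.append(name)
    return "/".join(parts)


# Namespaced resource: /registry/<resource-type>/<namespace>/<name>
print(registry_key("pods", "my-pod", namespace="default"))
# Cluster-scoped resource: /registry/<resource-type>/<name>
print(registry_key("namespaces", "kube-system"))
```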

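Common key paths include:

  /registry/pods/<namespace>/<name>
  /registry/deployments/<namespace>/<name>
  /registry/configmaps/<namespace>/<name>
  /registry/secrets/<namespace>/<name>
  /registry/namespaces/<name>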

Browsing etcd Directly

You should know how to read etcd directly. This skill is invaluable during incidents when the API server is down but etcd is healthy: it lets you verify what state is actually stored.

# Set environment variables (avoids repeating flags)
export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key

# Check cluster health
etcdctl endpoint health

# Show cluster status, including leader, db size, and raft index
etcdctl endpoint status --write-out=table

# List all members
etcdctl member list --write-out=table

# List all pods stored in etcd (values are raw bytes; --keys-only gives cleaner output)
etcdctl get /registry/pods/ --prefix --keys-only

# Sample output:
# /registry/pods/default/nginx-deployment-6d9f4dc946-2wxkj
# /registry/pods/default/nginx-deployment-6d9f4dc946-8trps
# /registry/pods/kube-system/coredns-5d78c9869d-f9bvr

# Get a specific pod object (binary protobuf: you'll see garbled text without a decoder)
etcdctl get /registry/pods/default/my-pod

# List all deployments
etcdctl get /registry/deployments/ --prefix --keys-only

# List all namespaces
etcdctl get /registry/namespaces/ --prefix --keys-only

# List all secrets (keys only; DO NOT print values in production logs)
etcdctl get /registry/secrets/ --prefix --keys-only

# List all configmaps
etcdctl get /registry/configmaps/ --prefix --keys-only

# Count total keys in etcd (--keys-only prints a blank line after each key, so count non-empty lines)
etcdctl get "" --prefix --keys-only | grep -c .

⚠️ Common Mistake

Never run etcdctl get /registry/secrets/ --prefix without --keys-only in production. Unless EncryptionConfiguration is enabled, Secret values are stored unencrypted in etcd (the base64 you see in kubectl output is only an encoding, not encryption). Printing them to a terminal exposes them in scrollback, session recordings, and any logs that capture command output. Always use --keys-only when browsing secrets.

MVCC: Why Compaction Is Not Optional

etcd uses Multi-Version Concurrency Control (MVCC) to handle concurrent reads and writes without locking. Every write to a key creates a new revision: a monotonically increasing integer that represents the global state of the store at a point in time. Old revisions are never immediately deleted; they are kept so that watchers can receive events for changes that happened while they were disconnected.

This design means etcd's database grows over time even if the total number of live objects stays constant. A cluster that has been running for a year and has processed millions of pod lifecycle events will have an etcd database many times larger than its current state. Without periodic compaction, etcd will eventually hit its storage quota and trigger a read-only alarm, which is exactly what happened in the opening incident.

Compaction discards all revisions older than a specified revision number. Defragmentation then reclaims the disk space that compaction freed (compaction marks space as free but does not shrink the file until defragmentation runs).
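
To make the grow-then-compact behavior concrete, here is a toy in-memory model. It is a sketch under simplifying assumptions: ToyMVCCStore and its methods are hypothetical names, and real etcd keeps revisions in an on-disk B+tree and returns an error for reads below the compacted revision rather than None.

```python
class ToyMVCCStore:
    """Toy model of revisioned writes and compaction (not etcd's implementation)."""

    def __init__(self):
        self.revision = 0   # global, monotonically increasing revision counter
        self.history = {}   # key -> list of (revision, value), oldest first

    def put(self, key, value):
        # Every write bumps the global revision and appends a new version.
        self.revision += 1
        self.history.setdefault(key, []).append((self.revision, value))
        return self.revision

    def get(self, key, rev=None):
        # Read the value visible at `rev` (defaults to the latest revision).
        rev = self.revision if rev is None else rev
        visible = [v for r, v in self.history.get(key, []) if r <= rev]
        return visible[-1] if visible else None

    def compact(self, rev):
        # Keep only the newest version at or below `rev`, plus anything newer.
        for key, versions in self.history.items():
            older = [(r, v) for r, v in versions if r <= rev]
            newer = [(r, v) for r, v in versions if r > rev]
            self.history[key] = older[-1:] + newer

    def stored_versions(self):
        return sum(len(v) for v in self.history.values())


store = ToyMVCCStore()
for i in range(5):
    store.put("/registry/pods/default/web", f"status-{i}")

print(store.stored_versions())   # one live object, five stored versions
store.compact(store.revision)
print(store.stored_versions())   # history discarded, latest value kept
```

The single pod in this example never changes identity, yet the store holds one version per status update until compaction runs, which is the same pattern that inflates a long-running cluster's database.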

⚑ Production Tip

Enable auto-compaction-mode: periodic with auto-compaction-retention: "1h" in your etcd configuration. This automatically compacts revisions older than 1 hour without manual intervention. Pair it with a weekly defragmentation job (run during maintenance windows, one member at a time) to keep disk usage stable.

Raft Consensus: How etcd Stays Consistent

Raft is the consensus algorithm that makes etcd a distributed system rather than a single point of failure. Understanding Raft is essential for understanding why etcd behaves the way it does under network partitions, disk pressure, and node failures.

The Raft Model

In a Raft cluster, all nodes are equal at startup. Through an election process, one node becomes the leader. All writes go through the leader. The leader replicates entries to followers and confirms a write is committed only after a majority (quorum) of nodes acknowledge it. This guarantees that even if the leader crashes immediately after confirming a write, the entry will survive because a majority of nodes already have it.


  ┌─────────────────────────────────────────────────────────────────────┐
  │                   RAFT LEADER ELECTION (Term 1)                     │
  │                                                                     │
  │   Node-1 (Follower)    Node-2 (Candidate→Leader)   Node-3 (Follower)│
  │        │                       │                        │           │
  │        │  Node-2's timeout     │                        │           │
  │        │  fires first         ─┼─ becomes Candidate     │           │
  │        │                       │                        │           │
  │        │◄── RequestVote(term=1)│                        │           │
  │        │───── VoteGranted ────►│                        │           │
  │        │                       │─ RequestVote(term=1) ─►│           │
  │        │                       │◄───── VoteGranted ─────│           │
  │        │                       │                        │           │
  │        │ Won quorum (2/3 votes)│                        │           │
  │        │                 [LEADER elected for Term 1]    │           │
  │        │                       │                        │           │
  │        │◄──── Heartbeat(term=1)│── Heartbeat(term=1) ──►│           │
  └─────────────────────────────────────────────────────────────────────┘

  ┌─────────────────────────────────────────────────────────────────────┐
  │                   RAFT LOG REPLICATION                              │
  │                                                                     │
  │   Client            Leader (Node-2)        Followers (Node-1,3)     │
  │     │                      │                         │              │
  │     │── write request ────►│                         │              │
  │     │                      │ 1. Append to local log  │              │
  │     │                      │──── AppendEntries ─────►│              │
  │     │                      │                         │ 2. Ack       │
  │     │                      │◄──── Success ───────────│              │
  │     │                      │ 3. Quorum reached       │              │
  │     │                      │    → commit entry       │              │
  │     │                      │──── Commit notify ─────►│              │
  │     │◄─── write success ───│                         │              │
  └─────────────────────────────────────────────────────────────────────┘

Leader Election in Detail

Every node in a Raft cluster has an election timeout, a randomized timer. When a follower does not hear from a leader within its election timeout period, it assumes the leader is dead and starts an election. It increments its local term number (a logical clock), transitions to candidate state, and sends RequestVote RPCs to all other nodes.

A node grants its vote to a candidate if it has not already voted in this term and the candidate's log is at least as up-to-date as its own. The first candidate to collect votes from a majority wins the election and becomes the new leader for that term. It then immediately begins sending heartbeats to all followers to prevent another election from starting.

The term number is critical: it prevents old leaders that were temporarily partitioned from overriding the new leader's decisions. Any node that receives a message with a higher term number immediately steps down to follower state.
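
The vote-granting rule described above can be sketched in a few lines. This is a simplified model of the Raft rule, not etcd's code; NodeState and grant_vote are hypothetical names, and persistence and RPC plumbing are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeState:
    current_term: int
    voted_for: Optional[str]
    last_log_term: int
    last_log_index: int


def grant_vote(node: NodeState, candidate_id: str, term: int,
               cand_last_log_term: int, cand_last_log_index: int) -> bool:
    """Decide whether `node` grants its vote to a candidate."""
    if term < node.current_term:
        return False                      # stale candidate from an older term
    if term > node.current_term:
        node.current_term = term          # newer term: adopt it and clear the old vote
        node.voted_for = None
    if node.voted_for not in (None, candidate_id):
        return False                      # already voted for someone else this term
    # "At least as up-to-date": compare last log term first, then last log index.
    log_ok = (cand_last_log_term, cand_last_log_index) >= (
        node.last_log_term, node.last_log_index)
    if log_ok:
        node.voted_for = candidate_id
    return log_ok


follower = NodeState(current_term=1, voted_for=None, last_log_term=1, last_log_index=5)
print(grant_vote(follower, "node-2", term=2, cand_last_log_term=1, cand_last_log_index=5))
print(grant_vote(follower, "node-3", term=2, cand_last_log_term=1, cand_last_log_index=9))
```

The second request is refused even though node-3 has a longer log, because the follower has already voted in term 2. That one-vote-per-term rule is what keeps two leaders from being elected in the same term.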

Log Replication

Once a leader is elected, all writes flow through it. When the API server sends a write request to etcd:

  1. The leader appends the entry to its local log with the current term number and the next log index.
  2. The leader sends AppendEntries RPCs to all followers in parallel.
  3. Followers append the entry to their logs and send an acknowledgment.
  4. Once the leader receives acknowledgments from a majority (quorum), it marks the entry as committed and applies it to the state machine.
  5. The leader then notifies followers in the next AppendEntries that the entry is committed, and they apply it too.
  6. The leader responds to the client with success.

The key insight: a write is only acknowledged to the client after it is durable on a majority of nodes. A single node failure cannot lose committed data.
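
The commit rule in step 4 reduces to a small calculation: the committed index is the highest log index that a majority of members have stored. The sketch below assumes the leader tracks one "match index" per member (itself included); commit_index is a hypothetical helper, and it omits Raft's additional rule that a leader only commits entries from its own term this way.

```python
def commit_index(match_indexes):
    """Highest log index stored on a majority of members (leader included)."""
    ordered = sorted(match_indexes, reverse=True)
    quorum = len(ordered) // 2 + 1
    # The quorum-th largest match index is present on at least `quorum` members.
    return ordered[quorum - 1]


# 3 members: leader at 7, one follower at 7, one lagging at 5
print(commit_index([7, 7, 5]))
# 5 members: two followers far behind do not block the commit
print(commit_index([9, 9, 8, 4, 4]))
```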

Why 3 Nodes Is the Minimum and When to Use 5

Quorum is floor(N/2) + 1: divide the member count by two, round down, and add one. A 3-node cluster has a quorum of 2, tolerating 1 failure. A 5-node cluster has a quorum of 3, tolerating 2 simultaneous failures.

Use 3 nodes for standard production. Use 5 nodes when you need to tolerate 2 simultaneous failures, typically in multi-availability-zone deployments where you want to survive losing an entire AZ plus one additional node. The trade-off is write latency: with 5 nodes, the leader must wait for 3 acknowledgments instead of 2 before committing, which adds round-trip time.
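
The quorum arithmetic is easy to check for every cluster size. The helper names below are illustrative only.

```python
def quorum(members: int) -> int:
    """Votes/acks needed for a majority: floor(N/2) + 1."""
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """Members that can fail while a majority is still reachable."""
    return members - quorum(members)


for n in range(1, 6):
    print(f"{n} members: quorum={quorum(n)}, tolerates {failures_tolerated(n)} failure(s)")
```

Running it shows that moving from 3 to 4 members raises the quorum from 2 to 3 without raising the number of tolerated failures.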

⚠️ Common Mistake

Never run etcd with an even number of nodes. A 4-node cluster does not provide better availability than a 3-node cluster: quorum is still (4/2)+1 = 3, so you can still only tolerate 1 failure. You have added operational complexity and write latency for no availability gain. Always use odd numbers: 1 (dev only), 3, or 5.

Network Partitions and Split-Brain Prevention

Raft's quorum requirement is what prevents split-brain. If a network partition divides a 3-node cluster into a group of 2 and a group of 1, the group of 2 can elect a leader and continue operating (they have quorum). The isolated single node cannot elect itself as leader because it cannot gather quorum. It will keep attempting elections but never succeed. No writes will be accepted by the minority partition. This is the correct behavior: it prioritizes consistency over availability.

The practical implication: in a 3-node cluster spanning 3 AZs, a single AZ failure is handled gracefully. But if all three AZs lose connectivity to one another, each partition holds a single member, none reaches quorum, and the cluster becomes fully unavailable. This is the theoretical worst case; in practice, multi-AZ network partitions this severe are extremely rare.
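
The partition scenarios above can be checked with a short function. It is a sketch that only counts members per partition; partitions_with_quorum is a hypothetical name.

```python
def partitions_with_quorum(cluster_size, partition_sizes):
    """Return the indexes of partitions large enough to elect a leader."""
    needed = cluster_size // 2 + 1
    return [i for i, size in enumerate(partition_sizes) if size >= needed]


# 3-node cluster split 2 + 1: only the two-member side keeps working
print(partitions_with_quorum(3, [2, 1]))
# 3-node cluster split 1 + 1 + 1: nobody has quorum, the cluster is unavailable
print(partitions_with_quorum(3, [1, 1, 1]))
```

Because a majority is more than half of the members, at most one partition can ever satisfy the check, which is why two leaders cannot accept writes at the same time.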

etcd Watch Mechanism: The Heartbeat of Kubernetes Controllers

The watch API is one of etcd's most important features for Kubernetes. Rather than polling for changes, every Kubernetes component that needs to respond to state changes opens a watch on a key or key prefix. etcd pushes events to watchers whenever a watched key is modified.


  ┌──────────────────────────────────────────────────────────────────┐
  │                     Kubernetes Control Plane                     │
  │                                                                  │
  │  ┌──────────────┐  ┌───────────────┐  ┌───────────────────────┐  │
  │  │kube-scheduler│  │   controller  │  │  cloud-controller-mgr │  │
  │  └──────┬───────┘  │    manager    │  └──────────┬────────────┘  │
  │         │          └───────┬───────┘             │               │
  │         │  Watch/List      │  Watch/List         │ Watch/List    │
  │         └──────────────────┼─────────────────────┘               │
  │                            │                                     │
  │                            ▼                                     │
  │                 ┌─────────────────────┐                          │
  │                 │   kube-apiserver    │◄─── kubectl / clients    │
  │                 │  (ONLY client that  │                          │
  │                 │   talks to etcd)    │                          │
  │                 └──────────┬──────────┘                          │
  │                            │                                     │
  │                            │  gRPC (port 2379)                   │
  │                            ▼                                     │
  │         ┌──────────────────────────────────────┐                 │
  │         │             etcd cluster             │                 │
  │         │                                      │                 │
  │         │  ┌────────┐  ┌────────┐  ┌────────┐  │                 │
  │         │  │ etcd-0 │  │ etcd-1 │  │ etcd-2 │  │                 │
  │         │  │(Leader)│◄►│(Follow)│◄►│(Follow)│  │                 │
  │         │  └────────┘  └────────┘  └────────┘  │                 │
  │         │    port 2380 (peer-to-peer Raft)     │                 │
  │         └──────────────────────────────────────┘                 │
  └──────────────────────────────────────────────────────────────────┘

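The watch mechanism drives the core Kubernetes control loops: the scheduler, the controller manager, and the cloud controller manager shown above all react to watch events instead of polling.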

None of these components talk to etcd directly. They all talk to the API server, which maintains watch caches internally. The API server opens a relatively small number of watch connections to etcd and multiplexes the events to potentially thousands of clients. This is why the API server is described as the gateway or proxy for etcd: it fans out events to all interested parties while keeping etcd's connection count manageable.
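
The fan-out idea can be modeled in a few lines. This is a toy sketch of the multiplexing pattern, not the API server's watch cache; WatchFanout and its methods are hypothetical names.

```python
from collections import defaultdict


class WatchFanout:
    """Toy model: one upstream watch per prefix, many downstream subscribers."""

    def __init__(self):
        self.subscribers = defaultdict(list)   # prefix -> list of callbacks

    def subscribe(self, prefix, callback):
        self.subscribers[prefix].append(callback)

    def upstream_watch_count(self):
        # One upstream watch per distinct prefix, regardless of subscriber count.
        return len(self.subscribers)

    def on_upstream_event(self, key, value):
        # Deliver a single upstream event to every subscriber of a matching prefix.
        delivered = 0
        for prefix, callbacks in self.subscribers.items():
            if key.startswith(prefix):
                for callback in callbacks:
                    callback(key, value)
                    delivered += 1
        return delivered


fanout = WatchFanout()
seen = []
for client in ("scheduler", "controller-manager", "kubelet"):
    fanout.subscribe("/registry/pods/", lambda k, v, c=client: seen.append((c, k)))

print(fanout.upstream_watch_count())                                  # one watch upstream
print(fanout.on_upstream_event("/registry/pods/default/web", "..."))  # three deliveries
```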

🎯 Interview Tip

A common interview question is: "How does the kube-scheduler know when a new pod needs to be scheduled?" The answer flows through the watch mechanism: the scheduler watches the API server for pods with an empty nodeName field. The API server gets that notification from an etcd watch. Without etcd's watch streaming, every Kubernetes component would need to poll, which would not scale beyond a few hundred nodes.

etcd Cluster Sizing

Sizing an etcd cluster requires balancing fault tolerance, write latency, and operational complexity. Here are the practical guidelines:

1-Node etcd (Development Only)

A single-node etcd requires no consensus and has no fault tolerance. The node goes down, the cluster goes down. This is appropriate for local development with kind, minikube, or k3s. It should never be used in any environment that requires availability.

3-Node etcd (Standard Production)

Three nodes is the correct choice for the vast majority of production Kubernetes clusters. Quorum requires 2 nodes, so 1 node can fail without service interruption. Spread the three nodes across three availability zones. This configuration handles the most common failure scenarios: a node crash, a disk failure, a VM restart, or an AZ maintenance event.

5-Node etcd (High-Availability Production)

Five nodes tolerate 2 simultaneous failures. This is appropriate for clusters where the cost of downtime is extremely high, for clusters spanning geographically separated data centers with real network latency between sites, or for clusters large enough that multi-node failures become statistically likely. The trade-off is that the leader must wait for 3 acknowledgments per write, which increases write latency by approximately one additional cross-AZ round trip.

Latency Requirements

Raft performance is extremely sensitive to network latency between members. etcd's default heartbeat interval is 100ms and default election timeout is 1000ms. For these defaults to work correctly, round-trip time between etcd members should be under 10ms. In practice, this means etcd members should be in the same region, typically in the same metropolitan area. Cross-region etcd clusters with 50-100ms RTT will experience constant election timeouts and leader instability.

3-Node vs. 5-Node etcd Comparison

Factor                | 3-Node Cluster                     | 5-Node Cluster
Quorum required       | 2 nodes                            | 3 nodes
Failures tolerated    | 1                                  | 2
Write latency         | Lower (2 acks needed)              | Higher (3 acks needed)
Operational overhead  | Lower                              | Higher
Cost                  | 3 control-plane nodes              | 5 control-plane nodes
Use case              | Standard production, single region | Critical clusters, multi-AZ HA
Rolling update safety | Can update 1 at a time             | Can update 2 at a time

etcd Hardware Requirements

Storage: SSD Is Mandatory

etcd writes every log entry to disk before acknowledging it to the leader. This is the fsync path that shows up in the etcd_disk_wal_fsync_duration_seconds metric. On a spinning disk (HDD), a single fsync takes 2-10ms under typical load. During busy periods, this can spike to 20-50ms.

etcd's election timeout is 1000ms by default, meaning a member that does not hear from the leader within 1 second starts an election. On a busy HDD, the disk subsystem can delay heartbeat processing long enough to trigger spurious elections. You will see etcd_server_leader_changes_total incrementing constantly, with no actual network partition occurring. The cluster is healthy in theory but thrashing in practice.

On AWS, use io1 or io2 volumes with a minimum of 3,000 IOPS provisioned. For very active clusters, provision 6,000+ IOPS. gp3 volumes (baseline 3,000 IOPS, configurable up to 16,000) are a cost-effective choice for most clusters. On GCP, use SSD persistent disks. On Azure, use Premium SSD.

Dedicated Disk

etcd's data directory must be on a dedicated disk, not the root volume. If etcd shares its disk with container image layers, log files, or OS writes, a runaway workload can fill the shared disk and trigger etcd's space alarm. Use a separate volume for /var/lib/etcd and size it at 50-100GB. Monitor it separately from the OS disk.

RAM

etcd keeps its working set in memory. For a small cluster (fewer than 100 nodes, fewer than 5,000 objects), 4GB of RAM for etcd is sufficient. For large clusters with thousands of nodes and tens of thousands of objects, 8-16GB is appropriate. etcd's memory usage grows with the number of watch connections and the size of objects being watched. Large Secret or ConfigMap objects stored in etcd are multiplied across all watchers.

CPU

etcd is not CPU-intensive. 2-4 vCPUs is sufficient for most clusters. The bottleneck is almost always disk I/O, not CPU.

etcd Performance Tuning

Heartbeat and Election Timeout

The heartbeat interval is how often the leader sends heartbeats to followers to prevent elections. The election timeout is how long a follower waits without a heartbeat before starting an election. The election timeout must be at least 5x the heartbeat interval to handle momentary delays without triggering unnecessary elections.

Defaults: heartbeat=100ms, election-timeout=1000ms. These work well when network RTT is under 10ms. If you are seeing frequent leader changes on healthy hardware, consider increasing the election timeout to 2000-5000ms to give the system more tolerance for transient latency spikes.
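
A quick sanity check for these two settings might look like the sketch below. The 5x rule comes from this section; the second check (heartbeat no shorter than the member round-trip time) is an added assumption based on general etcd tuning guidance, and check_raft_timing is a hypothetical helper name.

```python
def check_raft_timing(heartbeat_ms, election_timeout_ms, rtt_ms):
    """Return a list of problems with a heartbeat/election-timeout pair."""
    problems = []
    if election_timeout_ms < 5 * heartbeat_ms:
        problems.append("election timeout is less than 5x the heartbeat interval")
    if heartbeat_ms < rtt_ms:
        # Assumption: heartbeats should not be shorter than the member RTT.
        problems.append("heartbeat interval is shorter than the member round-trip time")
    return problems


print(check_raft_timing(100, 1000, rtt_ms=5))     # defaults on a low-latency network
print(check_raft_timing(100, 300, rtt_ms=5))      # election timeout too tight
print(check_raft_timing(100, 1000, rtt_ms=150))   # members too far apart for defaults
```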

Storage Quota

etcd's default storage quota is 2GB. This is dangerously low for production clusters. A moderately active cluster can hit 2GB within months. Set quota-backend-bytes: 8589934592 (8GB) as a baseline. For very large clusters, consider 16GB, keeping in mind that the etcd documentation treats 8GB as the suggested maximum and warns at startup when the quota exceeds it. When the quota is hit, etcd enters read-only mode and raises the space alarm, which is exactly the opening incident.
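
The quota arithmetic, and the kind of early warning that would have caught the opening incident, can be sketched as follows. The 80% warning threshold and the quota_status name are assumptions chosen for illustration, not etcd defaults.

```python
GIB = 1024 ** 3


def quota_status(db_size_bytes, quota_bytes=8 * GIB, warn_at=0.8):
    """Classify database size against the backend quota (illustrative thresholds)."""
    usage = db_size_bytes / quota_bytes
    if usage >= 1.0:
        return "alarm"   # quota reached: etcd would raise the space alarm
    if usage >= warn_at:
        return "warn"    # time to compact and defragment
    return "ok"


print(8 * GIB)                                          # value for quota-backend-bytes at 8GB
print(quota_status(1 * GIB))                            # comfortably below an 8GB quota
print(quota_status(int(1.7 * GIB), quota_bytes=2 * GIB))  # close to the 2GB default
```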

Compaction and Defragmentation

Enable automatic compaction with auto-compaction-mode: periodic and auto-compaction-retention: "1h". This compacts all revisions older than 1 hour automatically. After compaction, the on-disk file still holds the space as free pages, so defragmentation is required to actually reclaim it. Run defragmentation on a schedule (weekly to monthly, depending on write churn), one member at a time, starting with followers to avoid interrupting the leader.

# Check current revision and db size
etcdctl endpoint status --write-out=json | python3 -m json.tool

# Compact history up to current revision (frees space in MVCC store)
REV=$(etcdctl endpoint status --write-out=json | python3 -c "import sys,json; print(json.load(sys.stdin)[0]['Status']['header']['revision'])")
etcdctl compact "$REV"

# Defragment members one at a time (do NOT use --cluster in production)
etcdctl defrag --endpoints=https://10.0.0.10:2379
etcdctl defrag --endpoints=https://10.0.0.11:2379
etcdctl defrag --endpoints=https://10.0.0.12:2379

# Check alarms, then clear them once space has been reclaimed
etcdctl alarm list
etcdctl alarm disarm

# Take a snapshot backup
SNAPSHOT=/backup/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db
etcdctl snapshot save "$SNAPSHOT"

# Verify snapshot integrity (check the file that was just written)
etcdctl snapshot status "$SNAPSHOT" --write-out=table

⚑ Production Tip

When running etcdctl defrag, defragment one member at a time. Use --endpoints to target a specific member. Defragmentation causes a brief pause in service from that member while it rewrites its data file. If you defrag all members simultaneously with --cluster, you can temporarily lose quorum on a 3-node cluster. Always defrag the followers first, then the leader last.

Snapshot Count

The snapshot-count parameter controls how many applied entries trigger an etcd snapshot. A snapshot allows the WAL (write-ahead log) to be compacted. The default in etcd 3.5 is 100,000. Lowering it reduces memory usage but increases disk writes. The default is appropriate for most clusters.

Backup and Restore: The Runbook You Need Before You Need It

etcd Snapshot Backup

etcd's built-in snapshot mechanism creates a consistent point-in-time copy of the entire database. A snapshot includes all committed entries up to the revision at the time of the snapshot. It is safe to take a snapshot from any member, but taking it from the leader ensures the most recent state.

The most important rule about backups: test them. A backup you have never restored is not a backup, it is a hope. Run a quarterly restore drill into a throwaway cluster.


  BACKUP FLOW
  ───────────
  etcd Leader
      │
      │  etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M%S).db
      │
      ▼
  Snapshot file (consistent point-in-time, includes all MVCC revisions)
      │
      │  copy to S3 / GCS / NFS
      ▼
  Offsite storage (retention policy: 7 daily, 4 weekly)

  RESTORE FLOW (Disaster Recovery)
  ─────────────────────────────────
  1. Stop all kube-apiserver instances
  2. Stop all etcd members
  3. Remove corrupted data directories
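  4. etcdctl snapshot restore /backup/etcd-latest.db \
       --name etcd-0 \
       --initial-cluster etcd-0=https://10.0.0.10:2380,etcd-1=https://10.0.0.11:2380,etcd-2=https://10.0.0.12:2380 \
       --initial-cluster-token etcd-cluster-1 \
       --initial-advertise-peer-urls https://10.0.0.10:2380 \
       --data-dir /var/lib/etcd
  5. Repeat step 4 on each member with its own --name and --initial-advertise-peer-urls
     (keep --initial-cluster and --initial-cluster-token identical everywhere)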
  6. Start etcd members (they form cluster from restored snapshot)
  7. Start kube-apiserver
  8. Verify: kubectl get nodes

Automated Backup CronJob

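# etcd-backup-cronjob.yaml
# Runs inside the control plane node or a pod with access to etcd certs.
# Adjust the schedule, S3 bucket, and cert paths for your environment.
# The script calls both etcdctl and the AWS CLI: verify that the image you use
# ships both (stock etcd images typically do not include aws) and that the pod
# has S3 credentials available.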
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
  labels:
    app: etcd-backup
spec:
  schedule: "0 */6 * * *"         # Every 6 hours
  concurrencyPolicy: Forbid        # Never run two backups simultaneously
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true        # Required to reach etcd at 127.0.0.1:2379
          restartPolicy: OnFailure
          tolerations:
            - key: node-role.kubernetes.io/control-plane
              operator: Exists
              effect: NoSchedule
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          containers:
            - name: etcd-backup
              image: bitnami/etcd:3.5.12
              env:
                - name: ETCDCTL_API
                  value: "3"
                - name: ETCDCTL_ENDPOINTS
                  value: "https://127.0.0.1:2379"
                - name: ETCDCTL_CACERT
                  value: /etc/kubernetes/pki/etcd/ca.crt
                - name: ETCDCTL_CERT
                  value: /etc/kubernetes/pki/etcd/server.crt
                - name: ETCDCTL_KEY
                  value: /etc/kubernetes/pki/etcd/server.key
                - name: S3_BUCKET
                  value: "my-etcd-backups"
                - name: AWS_DEFAULT_REGION
                  value: "us-east-1"
              command:
                - /bin/sh
                - -c
                - |
                  set -e
                  TIMESTAMP=$(date +%Y%m%d-%H%M%S)
                  BACKUP_FILE="/tmp/etcd-${TIMESTAMP}.db"
                  etcdctl snapshot save "${BACKUP_FILE}"
                  etcdctl snapshot status "${BACKUP_FILE}" --write-out=table
                  aws s3 cp "${BACKUP_FILE}" "s3://${S3_BUCKET}/etcd/${TIMESTAMP}.db"
                  echo "Backup complete: ${TIMESTAMP}.db"
                  # Delete backups older than 30 days from S3.
                  # Object names look like 20260608-114700.db; the first 8 characters are the date.
                  CUTOFF=$(date -d '30 days ago' +%s)   # GNU date syntax
                  aws s3 ls "s3://${S3_BUCKET}/etcd/" | \
                    awk '{print $4}' | \
                    while read -r f; do
                      [ -n "$f" ] || continue
                      d=$(echo "$f" | cut -c1-8)
                      ts=$(date -d "$d" +%s 2>/dev/null) || continue   # skip names that are not dates
                      if [ "$ts" -lt "$CUTOFF" ]; then
                        aws s3 rm "s3://${S3_BUCKET}/etcd/$f"
                      fi
                    done
              volumeMounts:
                - name: etcd-certs
                  mountPath: /etc/kubernetes/pki/etcd
                  readOnly: true
          volumes:
            - name: etcd-certs
              hostPath:
                path: /etc/kubernetes/pki/etcd
                type: Directory

Backup Frequency Recommendations

Velero Integration

Velero is not an etcd backup tool: it backs up Kubernetes resource manifests by calling the Kubernetes API, not by reading etcd directly. However, Velero complements etcd snapshots: etcd snapshots give you binary-level recovery for disaster scenarios, while Velero gives you selective resource restoration (restore a single namespace, restore without cluster-scoped objects, etc.). Use both.

DR Runbook: Step-by-Step Restore Procedure

The following procedure restores etcd from a snapshot in a kubeadm-managed cluster:

  1. Identify the most recent valid snapshot. Run etcdctl snapshot status <file> to verify it is not corrupted.
  2. Stop kube-apiserver on all control plane nodes: mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/ (as a static pod, this stops it without systemd).
  3. Stop etcd on all members: mv /etc/kubernetes/manifests/etcd.yaml /tmp/
  4. Back up the existing data directory: mv /var/lib/etcd /var/lib/etcd.bak
  5. Restore the snapshot on each member with unique peer URLs and --name flags. All members must use the same snapshot and the same --initial-cluster-token.
  6. Restore the etcd manifest: mv /tmp/etcd.yaml /etc/kubernetes/manifests/
  7. Restore kube-apiserver: mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
  8. Verify etcd health: etcdctl endpoint health
  9. Verify Kubernetes: kubectl get nodes, kubectl get pods -A
  10. Verify controller reconciliation by checking that all Deployments report their expected replica counts.
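
Step 5 is where restores most often go wrong, so it helps to generate the per-member commands rather than type them by hand. The following is a minimal sketch, assuming etcd 3.5 or later (where etcdutl provides snapshot restore; older releases take the same flags on etcdctl snapshot restore). The snapshot path and the cluster token are hypothetical placeholders, and the member names and peer URLs simply reuse the example addresses from this article.

```python
import shlex


def restore_commands(snapshot, members, token, data_dir="/var/lib/etcd"):
    """Build one `etcdutl snapshot restore` command per member.

    `members` maps member name -> peer URL. Every command shares the same
    snapshot file, --initial-cluster string, and --initial-cluster-token;
    only --name and --initial-advertise-peer-urls differ per member.
    """
    initial_cluster = ",".join(f"{name}={url}" for name, url in members.items())
    commands = {}
    for name, url in members.items():
        argv = [
            "etcdutl", "snapshot", "restore", snapshot,
            "--name", name,
            "--initial-cluster", initial_cluster,
            "--initial-cluster-token", token,
            "--initial-advertise-peer-urls", url,
            "--data-dir", data_dir,
        ]
        commands[name] = shlex.join(argv)
    return commands


# Hypothetical inputs for illustration only.
members = {
    "etcd-0": "https://10.0.0.10:2380",
    "etcd-1": "https://10.0.0.11:2380",
    "etcd-2": "https://10.0.0.12:2380",
}
for name, command in restore_commands(
    "/backup/etcd-snapshot.db", members, "etcd-cluster-restore-1"
).items():
    print(f"[{name}] {command}")
```

Run each printed command on the matching node after step 4, then continue with step 6.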

⚠️ Common Mistake

Do not restore to a different revision than the one used for all other cluster members. If you restore node A from a snapshot at revision 1,000,000 and node B from a snapshot at revision 1,050,000, they will not form a consistent cluster. All members must be restored from the same snapshot. After restore, the cluster will reconcile to the snapshot state as controllers rerun their reconciliation loops.
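
One cheap guard against this mistake is to compare checksums of the snapshot file on every node before restoring. The sketch below is an illustration under the assumption that the snapshot copies are reachable as local paths (for example after copying them to one machine); the helper names are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def all_same_snapshot(paths):
    """True only if every path holds byte-identical content.

    An empty list returns False, so a missing input never passes the check.
    """
    return len({sha256_of(path) for path in paths}) == 1
```

Calling all_same_snapshot with the three per-node copies should return True before any member is restored; if it returns False, stop and find out which file differs.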

Monitoring etcd in Production

Key Metrics

The following Prometheus metrics are the minimum you should alert on:

# etcd-servicemonitor.yaml
# Requires kube-prometheus-stack or Prometheus Operator.
# kubeadm exposes etcd metrics on port 2381 via --listen-metrics-urls;
# without that flag, etcd serves /metrics on the client port (2379).
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: etcd
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  jobLabel: k8s-app
  endpoints:
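    - port: metrics          # named Service port for the --listen-metrics-urls listener (2381)
      interval: 30s
      scheme: https          # must match the listener; use http and drop tlsConfig if metrics are served over plain HTTP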
      tlsConfig:
        caFile: /etc/prometheus/secrets/etcd-certs/ca.crt
        certFile: /etc/prometheus/secrets/etcd-certs/server.crt
        keyFile: /etc/prometheus/secrets/etcd-certs/server.key
        insecureSkipVerify: false
  namespaceSelector:
    matchNames:
      - kube-system
  selector:
    matchLabels:
      component: etcd
---
# Key Prometheus alert rules for etcd
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: etcd-alerts
  namespace: monitoring
spec:
  groups:
    - name: etcd
      rules:
        - alert: EtcdNoLeader
          expr: etcd_server_has_leader == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "etcd member has no leader"
            description: "etcd member {{ $labels.instance }} has no leader for > 1 minute"

        - alert: EtcdHighNumberOfLeaderChanges
          expr: increase(etcd_server_leader_changes_seen_total[1h]) > 3
          for: 0m
          labels:
            severity: warning
          annotations:
            summary: "Frequent etcd leader changes"
            description: "{{ $value }} leader changes in the last hour on {{ $labels.instance }}"

        - alert: EtcdDatabaseSizeHigh
          expr: etcd_mvcc_db_total_size_in_bytes / 1024 / 1024 / 1024 > 6
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "etcd database approaching quota"
            description: "etcd db size is {{ $value | humanize }}GB. Configured quota is 8GB (etcd's default is 2GB)."

        - alert: EtcdDiskFsyncSlow
          expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.01
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "etcd WAL fsync latency high"
            description: "99th percentile WAL fsync latency is {{ $value | humanizeDuration }} on {{ $labels.instance }}"
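
To make the EtcdDiskFsyncSlow expression less of a black box, here is a simplified sketch of how a quantile is estimated from cumulative histogram buckets, which is the idea behind PromQL's histogram_quantile(). It ignores rate() and works on raw counts, and the bucket counts are invented for illustration; only the bucket boundaries follow the doubling pattern etcd uses for this metric.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Finds the bucket that contains the target rank, then interpolates
    linearly between that bucket's lower and upper bound.
    """
    assert 0 < q <= 1, "this sketch only handles 0 < q <= 1"
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies above the last finite bucket boundary.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Invented sample: cumulative counts of WAL fsync durations, bounds in seconds.
fsync_buckets = [
    (0.001, 0), (0.002, 100), (0.004, 800), (0.008, 950),
    (0.016, 995), (0.032, 1000), (math.inf, 1000),
]
p99 = histogram_quantile(0.99, fsync_buckets)
print(f"estimated p99 = {p99 * 1000:.2f} ms, alert fires: {p99 > 0.01}")
```

Because the result is interpolated inside a bucket, the reported p99 is an estimate whose precision depends on how wide the bucket is.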

🔍 Troubleshooting Tip

If you see frequent leader changes but no network issues, check etcd_disk_wal_fsync_duration_seconds first. Slow disk I/O is the most common cause of spurious elections in production clusters. To benchmark the disk, run fio --name=etcd-test --ioengine=sync --fdatasync=1 --directory=/var/lib/etcd --size=22m --bs=2300 --readwrite=write. The etcd documentation points to this style of fio test (small sequential writes, each followed by fdatasync) as the recommended disk benchmark. fio writes its test file into the target directory, so on a live member point --directory at a scratch directory on the same volume rather than at the data directory itself.
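
If fio is not installed on the node, a rough stand-in can be written in a few lines. The sketch below is not a replacement for fio: it is a minimal illustration of the same access pattern (2300-byte writes, each followed by a data sync), and the function names are hypothetical. Point it at a scratch directory on the volume you want to measure; results from a tmpfs directory say nothing about disk latency.

```python
import os
import tempfile
import time


def fdatasync_latencies(directory, writes=200, block_size=2300):
    """Write small blocks, syncing after each one; return sorted latencies in seconds."""
    sync = getattr(os, "fdatasync", os.fsync)  # fdatasync is missing on some platforms
    block = b"\0" * block_size
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(writes):
            os.write(fd, block)
            start = time.perf_counter()
            sync(fd)
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
        os.unlink(path)
    return sorted(latencies)


def percentile(sorted_values, pct):
    """Nearest-rank style percentile over an already sorted list."""
    index = min(len(sorted_values) - 1, int(len(sorted_values) * pct / 100))
    return sorted_values[index]


samples = fdatasync_latencies(tempfile.gettempdir(), writes=50)
print(f"p99 sync latency: {percentile(samples, 99) * 1000:.2f} ms")
```

Compare the p99 against the 10ms alert threshold used in the rules above.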

etcd in Managed Kubernetes: EKS, GKE, AKS

In managed Kubernetes services, the control plane (including etcd) is operated by the cloud provider. You do not have access to the etcd nodes, cannot run etcdctl against them, and cannot take or restore etcd snapshots directly.

Self-Managed vs. Managed Kubernetes etcd

Aspect | Self-Managed (kubeadm) | Managed (EKS/GKE/AKS)
etcd access | Full access via etcdctl | No direct access
Snapshot backup | Your responsibility (etcdctl) | Provider managed; use Velero for resource backup
Disk sizing | Your responsibility | Provider managed
Compaction/defrag | Your responsibility | Provider managed
Monitoring | Your responsibility (Prometheus) | Provider cloud metrics only
HA configuration | You choose 3 or 5 nodes | Provider managed (usually 3)
Disaster recovery | Full etcd restore runbook | Velero restore or cluster recreate
Interview relevance | Deep operational knowledge required | Architectural understanding still required

Even if you use EKS or GKE exclusively, you must understand etcd architecture for senior interviews. Interviewers at companies running managed Kubernetes still ask about etcd because it reveals whether you understand how Kubernetes actually works. The question "what would happen if etcd went down in your EKS cluster?" is a common senior DevOps interview question. The answer: the API server would stop serving write requests. Existing pods continue running (kubelet operates independently), but no new pods can be scheduled, no config changes can be applied, and no healing will occur.

Three Production Incident Stories

Incident 1: etcd Disk Full - Cluster Freeze at Midnight

🚨 Real Incident

The cluster described in the introduction had been running for 14 months with no maintenance performed on etcd. Auto-compaction was not enabled. The etcd database had grown from its initial 500MB to 7.8GB against an 8GB quota. When the quota alarm fired, etcd switched to read-only mode. The kube-apiserver could still serve GET requests from its in-memory watch cache, which is why kubectl get pods still worked. But any operation that required writing to etcd (kubectl apply, kubectl create, pod scheduling, deployment scaling) hung indefinitely at the API server as it waited for etcd to accept the write. Recovery steps: (1) SSH to a control plane node and set the ETCDCTL environment variables. (2) Run etcdctl alarm list to confirm the NOSPACE alarm. (3) Run etcdctl compact to the current revision. (4) Run etcdctl defrag on each member, one at a time. (5) Run etcdctl alarm disarm. (6) Raise quota-backend-bytes in the etcd config and restart etcd. Total downtime for write operations: 2 hours.
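
A back-of-the-envelope projection shows how much warning a size alert could have given. The sketch below assumes linear growth between the two sizes quoted in this incident (500MB at the start, 7.8GB after 14 months) and a 75% alert threshold on an 8GB quota; real clusters rarely grow linearly, so treat the result as an order-of-magnitude estimate only.

```python
def months_until(threshold_gb, start_gb, current_gb, elapsed_months):
    """Linear projection of when the db size crosses threshold_gb, in months from the start."""
    growth_per_month = (current_gb - start_gb) / elapsed_months
    return (threshold_gb - start_gb) / growth_per_month


quota_gb = 8.0
alert_month = months_until(0.75 * quota_gb, start_gb=0.5, current_gb=7.8, elapsed_months=14)
print(f"75% of quota crossed after about {alert_month:.1f} months")
print(f"lead time before the freeze at month 14: about {14 - alert_month:.1f} months")
```

Even if the real growth was far from linear, the point stands: a threshold alert on database size fires long before the NOSPACE alarm does.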

Incident 2: Leader Election Storm from Slow Disks

🚨 Real Incident

A production cluster was migrated from dedicated bare metal etcd nodes to virtual machines with network-attached storage. Within hours, the on-call engineer was receiving alerts: etcd_server_leader_changes_seen_total had incremented 47 times in one hour. The cluster was technically functional, but API server latency had spiked to 800ms p99 as write requests queued behind rapidly changing leaders. Root cause: the network-attached storage was provisioned on a shared storage array with no IOPS reservation. Under concurrent load from other VMs on the same array, etcd WAL fsync latency spiked to 250-400ms, well above the 100ms heartbeat interval and a large share of the 1000ms election timeout. Each time the leader's disk stalled long enough for followers to miss heartbeats, a follower started an election. The fix required migrating to dedicated SSD volumes with 6,000 provisioned IOPS. Lesson: never share etcd storage with other workloads. Disk I/O contention stays invisible until it causes election storms unless you monitor WAL fsync latency.
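
The relationship between fsync latency and etcd's timing parameters can be expressed as a small triage helper. This is a rough sketch: the category names and cut-offs are my own illustration, using the 10ms alert threshold, 100ms heartbeat interval, and 1000ms election timeout that appear elsewhere in this article, not thresholds defined by etcd.

```python
def classify_fsync(p99_ms, heartbeat_ms=100, election_timeout_ms=1000):
    """Rough triage of WAL fsync p99 latency against etcd timing parameters."""
    if p99_ms <= 10:
        return "healthy"
    if p99_ms < heartbeat_ms:
        return "slow"                 # above the 10ms alert threshold
    if p99_ms < election_timeout_ms:
        return "heartbeats at risk"   # one sync can outlast a heartbeat interval
    return "election likely"          # one sync can outlast the election timeout


for sample_ms in (4, 40, 250, 400, 1200):
    print(f"{sample_ms:>5} ms -> {classify_fsync(sample_ms)}")
```

The 250-400ms range from this incident lands in the third category: a single slow sync does not exceed the election timeout, but a few of them queued back to back can.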

Incident 3: Restore Gone Wrong - Wrong Revision, Duplicate Resources

🚨 Real Incident

After a botched upgrade that corrupted etcd's BoltDB backend, the team restored from a snapshot. The snapshot was taken 18 hours before the corruption. The restore procedure was technically correct, but one critical mistake was made: only two of the three etcd members were restored from the snapshot. The third member was restored from a more recent local backup that a junior engineer found on the node. The three members could not agree on cluster state: two members were at revision 1,200,000 and one member was at revision 1,250,000. etcd itself would not start cleanly. After 45 minutes of debugging, the team realized the mismatch and reran the restore with the same snapshot on all three nodes. The resulting cluster was 18 hours behind, which required manually re-applying 200+ ConfigMap and Secret changes from git history. Lesson: always use the same snapshot file for all members during a restore. Document the restore procedure, test it quarterly, and never improvise under pressure.

Troubleshooting etcd

API Server Log Errors That Point to etcd

When etcd has problems, the API server logs are often where you first see the symptoms:

Common Diagnostic Commands

# Check if etcd has a leader
etcdctl endpoint status --write-out=table

# Check for alarms (NOSPACE is the most common)
etcdctl alarm list

# Check current db size vs quota
etcdctl endpoint status --write-out=json | \
  python3 -c "import sys,json; d=json.load(sys.stdin); [print(m['Endpoint'], 'db_size:', m['Status']['dbSize'], 'bytes') for m in d]"

# Check if Raft is making progress (raftIndex should be incrementing)
watch -n 1 'etcdctl endpoint status --write-out=table'

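# Check etcd logs for slow fsync messages (systemd-managed etcd; on kubeadm,
# etcd is a static pod, so use: kubectl logs -n kube-system -l component=etcd)
journalctl -u etcd -f | grep -E "slow|failed|timeout|exceed"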

# Kubernetes API server error logs related to etcd
kubectl logs -n kube-system -l component=kube-apiserver | grep -E "etcd|context deadline"

# Check etcd member list and verify all are healthy
etcdctl member list --write-out=table
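
The JSON one-liner above can be grown into a small report that also flags the leader and quota usage. This is a sketch under assumptions: the sample document is invented, the field names (Endpoint, Status.dbSize, Status.leader, Status.header.member_id) reflect the etcdctl JSON output as I understand it and should be verified against your etcdctl version, and the 8GB quota is this article's example value.

```python
import json

# Invented sample shaped like `etcdctl endpoint status --write-out=json`.
SAMPLE = """
[
  {"Endpoint": "https://10.0.0.10:2379",
   "Status": {"header": {"member_id": 1}, "leader": 1,
              "dbSize": 6442450944, "raftIndex": 1200345}},
  {"Endpoint": "https://10.0.0.11:2379",
   "Status": {"header": {"member_id": 2}, "leader": 1,
              "dbSize": 6442450944, "raftIndex": 1200345}}
]
"""


def summarize(status_json, quota_bytes=8 * 1024**3):
    """Return one summary row per endpoint: leader flag and quota usage."""
    rows = []
    for member in json.loads(status_json):
        status = member["Status"]
        rows.append({
            "endpoint": member["Endpoint"],
            # A member is the leader when the reported leader ID is its own ID.
            "is_leader": status["leader"] == status["header"]["member_id"],
            "quota_used_pct": round(100 * status["dbSize"] / quota_bytes, 1),
        })
    return rows


for row in summarize(SAMPLE):
    print(row)
```

In practice you would pipe the real command output into summarize instead of using the embedded sample.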

🔍 Troubleshooting Tip

If kubectl apply hangs but kubectl get works, your first check should always be etcdctl alarm list. The split behavior (reads work, writes hang) is the classic fingerprint of an etcd NOSPACE alarm. This narrows your diagnosis from "something is wrong somewhere" to a 3-step fix in under 5 minutes.

15 Common etcd Mistakes

  1. Using the default 2GB quota in production. Raise quota-backend-bytes to 8GB, the upper bound etcd's documentation suggests for normal environments. The default has caused countless midnight incidents.
  2. Not enabling auto-compaction. Without it, etcd grows indefinitely until it hits quota.
  3. Defragmenting all members simultaneously. Defrag one member at a time to avoid losing quorum.
  4. Running etcd on HDD. HDD fsync latency causes election storms. Use SSD with dedicated IOPS.
  5. Sharing etcd disk with OS or container runtime. I/O contention from other processes causes etcd latency spikes.
  6. Running etcd with even node counts. An even-sized cluster tolerates no more failures than the odd-sized cluster one member smaller (4 members tolerate 1 failure, the same as 3).
  7. Not testing restore procedures. A backup you have never restored is not a backup.
  8. Using different snapshot files for different members during restore. All members must restore from the same snapshot.
  9. Storing large objects in etcd. etcd's default request size limit is 1.5MiB, but storing frequently updated large ConfigMaps or Secrets causes excessive etcd churn even below that limit. Use external stores for large blobs.
  10. Not monitoring etcd_server_leader_changes_seen_total. Frequent leader changes are an early warning of instability and often show up before a full outage.
  11. Running etcd members across regions with high latency. RTT over 10ms can cause election instability with the default timeout settings; if higher latency is unavoidable, raise heartbeat-interval and election-timeout to match.
  12. Forgetting TLS between etcd members and clients. etcd peer traffic and client traffic should always be TLS-encrypted, especially in cloud environments.
  13. Compacting too aggressively. Compacting to the very latest revision can cause "required revision has been compacted" errors in API server watches. Compact to a revision that is at least a few minutes old.
  14. Not allocating a dedicated network interface for etcd peer traffic. In high-throughput clusters, etcd peer replication traffic can saturate a shared NIC.
  15. Assuming managed Kubernetes means no etcd knowledge needed. You will be asked about etcd in senior SRE and platform engineer interviews regardless of whether you use managed Kubernetes.

Interview Q&A

Beginner Questions

Q: What is etcd and what does it do in Kubernetes?

etcd is a distributed key-value store that serves as the sole backing data store for Kubernetes. Every Kubernetes object (pods, services, namespaces, secrets, deployments) is stored in etcd. The Kubernetes API server is the only client of etcd. etcd provides strong consistency via the Raft consensus algorithm, ensuring that all API server instances see the same data.

Q: What happens to running pods if etcd goes down?

Running pods continue to run. The kubelet operates independently on each node: it maintains its own local state and does not need etcd to keep containers running. However, no new pods can be scheduled, no deployments can be updated, no healing will occur (if a pod crashes, the controller cannot create a replacement), and kubectl commands that require writes will fail or hang. The cluster is alive but frozen.

Q: Why does Kubernetes use etcd instead of a relational database?

etcd provides linearizable reads and writes across a distributed cluster with built-in leader election and consensus. A relational database would require additional coordination logic to handle leader election and distributed consensus. etcd's watch API also provides efficient push-based change notification, which is fundamental to how Kubernetes controllers work. A SQL database would require polling.

Q: What is the minimum number of etcd nodes for a production cluster?

Three nodes is the minimum for production. A single-node etcd has no fault tolerance. A two-node cluster cannot achieve quorum (requires 2 of 2) when one node fails, providing no benefit over a single node. Three nodes can tolerate one failure while maintaining quorum with the remaining two.
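
The quorum arithmetic behind this answer fits in two functions. The sketch below simply tabulates majority size and tolerated failures for small clusters.

```python
def quorum(members):
    """Smallest majority of a cluster with `members` voting members."""
    return members // 2 + 1


def failures_tolerated(members):
    """How many members can fail while a majority is still reachable."""
    return members - quorum(members)


for n in range(1, 8):
    print(f"{n} members: quorum={quorum(n)}, tolerates {failures_tolerated(n)} failure(s)")
```

The table makes the odd-number rule visible: going from 3 to 4 members raises the quorum from 2 to 3 without raising the number of failures the cluster survives.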

Q: How do you check if etcd is healthy?

Run etcdctl endpoint health to check if each member is responding. Run etcdctl endpoint status --write-out=table to see which member is the leader, the current database size, and the Raft index. Run etcdctl alarm list to check for any active alarms such as NOSPACE.

Intermediate Questions

Q: Explain the Raft consensus algorithm at a high level.

Raft divides time into terms. In each term, a leader is elected through a voting process. Any node that does not hear from a leader within its election timeout starts a new term and requests votes from other nodes. The first node to receive votes from a majority (quorum) becomes leader. The leader handles all writes by appending entries to its log and replicating them to followers. A write is committed only after a majority of nodes acknowledge it. This guarantees that committed entries survive the leader crashing, because they already exist on a majority of nodes.
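
The "committed after a majority acknowledges" rule can be shown with one function. This is a simplified sketch of how a leader could derive the commit index from what each member has persisted; real Raft adds a further condition (the leader only commits entries from its current term by counting replicas), which is omitted here.

```python
def commit_index(match_indexes):
    """Highest log index stored on a majority of members.

    `match_indexes` holds, for every member including the leader, the last
    log index known to be persisted on that member.
    """
    majority = len(match_indexes) // 2 + 1
    # After sorting in descending order, the value at position majority-1
    # is present on at least `majority` members.
    return sorted(match_indexes, reverse=True)[majority - 1]


print(commit_index([10, 8, 5]))          # leader at 10, followers at 8 and 5
print(commit_index([10, 10, 7, 6, 3]))   # five-member cluster
```

In the three-member case, entries 9 and 10 exist only on the leader, so they are not yet committed and would be lost if the leader crashed at that moment.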

Q: What is MVCC in etcd and why is compaction needed?

MVCC (Multi-Version Concurrency Control) means etcd keeps every historical version of every key. Each write increments a global revision counter. Old revisions are retained so that watch clients can replay changes they missed. Without compaction, the database grows indefinitely as every create, update, and delete adds a new revision. Compaction deletes all revisions older than a specified point, freeing space. Defragmentation then reclaims the freed space on disk.
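
A toy model makes the revision and compaction behavior concrete. The class below is an illustration only, not how etcd stores data internally: it keeps every version of every key in memory, lets you read at an older revision, and drops superseded versions on compaction.

```python
class TinyMVCC:
    """Toy multi-version store: every write gets a new global revision."""

    def __init__(self):
        self.revision = 0
        self.compacted = 0
        self.history = {}  # key -> list of (revision, value); None marks a delete

    def put(self, key, value):
        self.revision += 1
        self.history.setdefault(key, []).append((self.revision, value))
        return self.revision

    def delete(self, key):
        return self.put(key, None)

    def get(self, key, revision=None):
        revision = self.revision if revision is None else revision
        if revision < self.compacted:
            raise LookupError("required revision has been compacted")
        value = None
        for rev, val in self.history.get(key, []):
            if rev > revision:
                break
            value = val
        return value

    def compact(self, revision):
        """Drop versions that were superseded at or before `revision`."""
        self.compacted = revision
        for key, versions in list(self.history.items()):
            older = [v for v in versions if v[0] <= revision]
            newer = [v for v in versions if v[0] > revision]
            # Keep only the latest pre-compaction version, unless it is a delete.
            keep = older[-1:] if older and older[-1][1] is not None else []
            if keep or newer:
                self.history[key] = keep + newer
            else:
                del self.history[key]


store = TinyMVCC()
store.put("a", "v1")   # revision 1
store.put("a", "v2")   # revision 2
store.put("b", "x")    # revision 3
print(store.get("a", revision=1))
store.compact(2)
try:
    store.get("a", revision=1)
except LookupError as err:
    print("after compaction:", err)
```

The error raised after compaction mirrors the "required revision has been compacted" message that watchers see when they ask for history that no longer exists.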

Q: What is the etcd watch mechanism and how does Kubernetes use it?

etcd watches are long-lived streaming connections. A client specifies a key prefix and etcd pushes an event every time a key in that prefix is created, modified, or deleted. Kubernetes uses this extensively: the scheduler watches for unscheduled pods, controllers watch for their respective resources, kubelet watches for pods assigned to its node, and kube-proxy watches EndpointSlices. The API server maintains watch connections to etcd and multiplexes events to all controller clients.

Q: How would you recover from an etcd space alarm?

First, confirm the alarm with etcdctl alarm list. Then compact the database: get the current revision from etcdctl endpoint status and run etcdctl compact <revision>. Next, defragment each member one at a time with etcdctl defrag. Then disarm the alarm with etcdctl alarm disarm. Finally, verify recovery by checking that writes to the API server succeed and by monitoring db_size. Increase the quota in the etcd configuration to prevent recurrence.
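
The order of operations matters more than the individual commands, so it can help to generate the sequence. The sketch below only builds the command strings and does not execute anything; the revision and endpoints passed in are hypothetical example values, and the current revision would come from etcdctl endpoint status --write-out=json in practice.

```python
def nospace_recovery_plan(revision, endpoints):
    """Ordered etcdctl commands for clearing a NOSPACE alarm."""
    steps = ["etcdctl alarm list", f"etcdctl compact {revision}"]
    # Defragment one member at a time so the cluster never loses quorum.
    steps += [f"etcdctl defrag --endpoints={endpoint}" for endpoint in endpoints]
    steps.append("etcdctl alarm disarm")
    return steps


# Hypothetical revision and endpoints for illustration.
plan = nospace_recovery_plan(
    1200000,
    ["https://10.0.0.10:2379", "https://10.0.0.11:2379", "https://10.0.0.12:2379"],
)
for number, command in enumerate(plan, start=1):
    print(f"{number}. {command}")
```

Run each defrag step to completion and confirm the member is healthy before moving on to the next one.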

Q: What causes frequent leader elections in etcd?

The most common cause is slow disk I/O. etcd writes every log entry to disk (WAL fsync) before acknowledging it. If fsync takes longer than the election timeout, followers conclude the leader is dead and start elections. This happens most often when etcd is on HDD, network-attached storage with IOPS contention, or a disk that is being I/O starved by other workloads. Network issues can also cause it, but disk I/O is the more common culprit in cloud environments. Check etcd_disk_wal_fsync_duration_seconds histogram and compare it against the election timeout.

Advanced Questions

Q: Describe what happens during a network partition in a 3-node etcd cluster.

Assume nodes A, B, C where A is leader. If C is partitioned away from A and B, A and B retain quorum (2 of 3) and continue operating normally. C cannot reach A or B, so it stops hearing heartbeats and tries to start an election, but it can only count its own vote. Since C cannot reach quorum (needs 2 votes, only has 1), its election never succeeds and it keeps retrying for as long as the partition lasts. What happens when the partition heals depends on whether Raft's PreVote phase is enabled (it is by default in recent etcd releases). With PreVote, C never increments its term while isolated, so it simply receives A's next heartbeat, remains a follower, and A replicates the missing entries to it. Without PreVote, C returns with a term higher than A's, which forces A to step down and triggers one more election; C cannot win it because its log is behind, so A or B is elected and C then catches up as a follower. Either way, no committed data is lost. This self-healing behavior is built into Raft.

Q: How does etcd ensure linearizable reads?

By default, etcd reads are linearizable: before serving a read, the leader confirms it is still the leader by exchanging a round of heartbeats with a quorum. This prevents a partitioned old leader from serving stale reads. The trade-off is latency. etcd also supports serializable reads (served from any member's local state without the quorum check), which are faster but may return stale data. Kubernetes uses linearizable reads where correctness requires the latest data.

Q: What are the implications of increasing etcd's snapshot-count parameter?

The snapshot-count parameter determines how many applied Raft log entries accumulate before etcd takes a snapshot and truncates the WAL. A higher value means less frequent snapshotting, which reduces disk write amplification but increases memory usage (since more entries are held in memory) and increases the time required to replay the WAL on restart after a crash. A lower value means more frequent snapshotting and faster recovery, but more disk I/O. For most clusters, the default is appropriate (100,000 in etcd 3.5; check the value for your release). On very write-heavy clusters, lowering it to 50,000 can reduce restart recovery time.

Q: A Kubernetes controller is processing events very slowly, and you notice it keeps receiving the same watch events multiple times. What is the likely etcd cause?

The controller's watch connection is falling behind the revision stream. When a watcher's last-seen revision is compacted away, etcd closes the watch and the API server re-establishes it with a full relist. This floods the controller with a full dump of all objects in its watch scope, which it must process again as if they were all new. The fix is to ensure compaction is not too aggressive (compact to a revision that is at least 5 minutes old) and to ensure the controller's watch reconnect logic uses the ResourceVersion returned from the previous list to avoid unnecessary full relists.
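
The resume-or-relist decision can be sketched in a few lines. This is a toy model with hypothetical names, not client-go or etcd code: events are (revision, description) tuples, and a watcher that asks for a start revision older than the compacted revision gets an error and has to fall back to a full relist.

```python
class CompactedError(Exception):
    """Raised when the requested start revision is no longer available."""


def events_since(event_log, start_revision, compacted_revision):
    """Return events at or after start_revision, or fail if that history is gone."""
    if start_revision < compacted_revision:
        raise CompactedError("required revision has been compacted")
    return [event for event in event_log if event[0] >= start_revision]


def resume_watch(event_log, last_seen, compacted_revision, relist):
    """Resume from the revision after last_seen, falling back to a full relist."""
    try:
        return "incremental", events_since(event_log, last_seen + 1, compacted_revision)
    except CompactedError:
        return "relist", relist()


# History was compacted at revision 6, so only events from 6 onward remain.
log = [(6, "put b"), (7, "delete a")]
current_state = lambda: ["b"]  # stand-in for an expensive full LIST

print(resume_watch(log, last_seen=6, compacted_revision=6, relist=current_state))
print(resume_watch(log, last_seen=3, compacted_revision=6, relist=current_state))
```

The first watcher is only one revision behind and resumes cheaply; the second fell behind the compaction point and pays for a full relist, which is the behavior described in the answer above.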

Q: How would you design an etcd backup strategy for a 5-node HA cluster across 3 AZs?

Take snapshots from a follower in each AZ every 2 hours using a CronJob running on a dedicated control plane node in each AZ (to avoid putting load on the leader). Store snapshots in an S3 bucket with cross-region replication enabled. Retain 7 days of the 2-hourly snapshots, 4 weeks of daily snapshots, and 12 months of monthly snapshots. Test restore procedures quarterly in a throwaway cluster. Use Velero for application-level backup (namespace scoped) as a complement to etcd snapshots. Integrate snapshot success/failure metrics into Prometheus and alert if no successful snapshot has completed in 3 hours. Store the restore runbook in a location accessible without cluster access (e.g., a wiki or a runbook repository) because when you need to restore etcd, your cluster may not be available.
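
The tiered retention rule is easier to reason about as code. The sketch below is one possible reading of "7 days of everything, 4 weeks of dailies, 12 months of monthlies": it keeps every snapshot from the last 7 days, the newest snapshot per calendar day for 4 weeks, and the newest per calendar month for 365 days. The function name and the exact tie-breaking are assumptions, not a description of any particular tool.

```python
from datetime import datetime, timedelta


def snapshots_to_keep(timestamps, now):
    """Return the snapshot timestamps a 7-day / 4-week / 12-month policy retains."""
    keep = set()
    newest_per_day, newest_per_month = {}, {}
    for ts in sorted(timestamps):
        age = now - ts
        if age <= timedelta(days=7):
            keep.add(ts)                      # keep every recent snapshot
        if age <= timedelta(weeks=4):
            newest_per_day[ts.date()] = ts    # later snapshots overwrite earlier ones
        if age <= timedelta(days=365):
            newest_per_month[(ts.year, ts.month)] = ts
    keep.update(newest_per_day.values())
    keep.update(newest_per_month.values())
    return sorted(keep)


now = datetime(2026, 6, 8, 12)
snapshots = [
    datetime(2026, 6, 8, 10), datetime(2026, 6, 8, 8),
    datetime(2026, 5, 20, 2), datetime(2026, 5, 20, 22),
    datetime(2026, 3, 1), datetime(2026, 3, 15),
    datetime(2025, 1, 1),
]
for ts in snapshots_to_keep(snapshots, now):
    print(ts.isoformat())
```

Anything not returned by the function is a candidate for deletion from the bucket.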

20 etcd Best Practices

  1. Always use an odd number of etcd members: 3 or 5 in production, 1 only for development or test clusters. Never 2, 4, or 6.
  2. Set quota-backend-bytes to 8GB in production from day one.
  3. Enable auto-compaction-mode: periodic with auto-compaction-retention: "1h".
  4. Schedule monthly defragmentation during maintenance windows, one member at a time.
  5. Use dedicated SSD volumes for etcd data with provisioned IOPS (3,000 minimum, 6,000 recommended).
  6. Never share the etcd data volume with other workloads or the OS root volume.
  7. Spread etcd members across at least 3 availability zones.
  8. Keep RTT between etcd members under 10ms.
  9. Enable mTLS for both client-to-server and peer-to-peer etcd communication.
  10. Take etcd snapshots at least every 6 hours; before any major change, take a manual snapshot.
  11. Store etcd snapshots in at least two separate geographic locations.
  12. Run quarterly restore drills. Never treat an untested backup as valid.
  13. Alert on etcd_server_has_leader == 0 with a 1-minute window. This is a critical alert.
  14. Alert on etcd_mvcc_db_total_size_in_bytes exceeding 75% of quota.
  15. Alert on WAL fsync p99 exceeding 10ms. This is an early warning before election storms start.
  16. Alert on more than 3 leader changes per hour.
  17. Enable etcd_server_slow_apply_total monitoring. Nonzero values indicate backend issues.
  18. Use Prometheus + Grafana with the standard etcd dashboard (dashboard ID 3070 on grafana.com).
  19. Document a restore runbook and store it outside the cluster being backed up.
  20. For managed Kubernetes (EKS/GKE/AKS), use Velero for application-level backup and test Velero restores quarterly.

kubeadm HA etcd Configuration

# kubeadm-etcd-ha.yaml
# External etcd cluster configuration for kubeadm HA setup.
# Run 'kubeadm init' with this config on your first control plane node.
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.29.0
controlPlaneEndpoint: "k8s-api.internal.example.com:6443"  # Load balancer VIP
etcd:
  external:
    endpoints:
      - https://10.0.0.10:2379
      - https://10.0.0.11:2379
      - https://10.0.0.12:2379
    caFile: /etc/etcd/pki/ca.crt
    certFile: /etc/etcd/pki/apiserver-etcd-client.crt
    keyFile: /etc/etcd/pki/apiserver-etcd-client.key
networking:
  podSubnet: "10.244.0.0/16"
  serviceSubnet: "10.96.0.0/12"
---
# etcd member configuration (run on each etcd node)
# /etc/etcd/etcd.conf.yaml
name: etcd-0
data-dir: /var/lib/etcd
listen-client-urls: https://0.0.0.0:2379
advertise-client-urls: https://10.0.0.10:2379
listen-peer-urls: https://0.0.0.0:2380
initial-advertise-peer-urls: https://10.0.0.10:2380
# Keep this on one line: a folded block scalar would insert spaces after the commas.
initial-cluster: etcd-0=https://10.0.0.10:2380,etcd-1=https://10.0.0.11:2380,etcd-2=https://10.0.0.12:2380
initial-cluster-token: etcd-cluster-prod-1
initial-cluster-state: new
# Performance tuning
heartbeat-interval: 100          # ms; increase if network latency > 10ms
election-timeout: 1000           # ms; must be 5-10x heartbeat-interval
snapshot-count: 10000
quota-backend-bytes: 8589934592  # 8GB
auto-compaction-mode: periodic
auto-compaction-retention: "1h"
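# TLS (in the YAML config file these settings nest under the
# client-transport-security and peer-transport-security sections)
client-transport-security:
  cert-file: /etc/etcd/pki/server.crt
  key-file: /etc/etcd/pki/server.key
  trusted-ca-file: /etc/etcd/pki/ca.crt
  client-cert-auth: true
peer-transport-security:
  cert-file: /etc/etcd/pki/peer.crt
  key-file: /etc/etcd/pki/peer.key
  trusted-ca-file: /etc/etcd/pki/ca.crt
  client-cert-auth: true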
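
A few of the rules from this article can be checked mechanically before a config is rolled out. The sketch below lints a plain dictionary of settings (keys named as in the etcd config file); it is a hypothetical helper, not part of etcd or kubeadm, and the fallback values are the etcd defaults as I understand them (100ms heartbeat, 1000ms election timeout, 2GB quota, auto-compaction off).

```python
def lint_etcd_settings(cfg, member_count):
    """Return a list of warnings for a dict of etcd settings."""
    warnings = []
    heartbeat = cfg.get("heartbeat-interval", 100)
    election = cfg.get("election-timeout", 1000)
    if election < 5 * heartbeat:
        warnings.append("election-timeout should be at least 5x heartbeat-interval")
    if member_count % 2 == 0:
        warnings.append("even member count adds no fault tolerance")
    if cfg.get("quota-backend-bytes", 2 * 1024**3) > 8 * 1024**3:
        warnings.append("quota-backend-bytes is above the suggested 8GB maximum")
    if str(cfg.get("auto-compaction-retention", "0")) in ("", "0"):
        warnings.append("auto-compaction is not enabled")
    return warnings


article_config = {
    "heartbeat-interval": 100,
    "election-timeout": 1000,
    "quota-backend-bytes": 8589934592,
    "auto-compaction-retention": "1h",
}
print(lint_etcd_settings(article_config, member_count=3))
print(lint_etcd_settings({"heartbeat-interval": 500}, member_count=4))
```

The first call checks the settings shown above; the second shows what a misconfigured four-member cluster would trigger.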

Frequently Asked Questions

Can I use etcd for application data storage?

Technically yes, but you should not. etcd is optimized for small, infrequently-changing configuration data. It is not designed for high-throughput transactional workloads. For application state, use a purpose-built database. etcd should only store Kubernetes control plane state.

How large can a single value be in etcd?

etcd's default request size limit is 1.5MiB (--max-request-bytes), which caps how large a single value can be. Kubernetes applies its own, lower limits to some API objects; a ConfigMap, for example, is limited to 1MiB of data. Storing large objects (large ConfigMaps, CRD instances with large embedded data) fragments etcd storage and causes performance issues. If you need to store large configuration data, store a reference to an external store (S3, Vault) in the ConfigMap rather than the data itself.

What is the difference between etcdctl v2 and v3?

etcd v2 and v3 use different data models and APIs. v2 had a hierarchical directory-like data model. v3 has a flat key-value model with a prefix-based range query API. Kubernetes made the etcd v3 API its default storage backend in Kubernetes 1.6. etcdctl 3.4 and later default to the v3 API; with older etcdctl versions, set ETCDCTL_API=3 when working with Kubernetes etcd. Running v2 commands against a v3 cluster will either fail or operate on a completely separate legacy data space.

Can I run etcd as a pod inside Kubernetes?

etcd in kubeadm-managed clusters runs as a static pod, that is, a pod managed directly by the kubelet, not by the API server. The manifest lives at /etc/kubernetes/manifests/etcd.yaml. This is intentional: if etcd ran as a regular pod, a failing etcd would prevent the pod from being rescheduled, creating a circular dependency. Static pods avoid this by running independently of the API server.

What is the etcd WAL?

WAL stands for Write-Ahead Log. Before any entry is applied to the etcd key-value store, it is first written to the WAL and fsynced to disk. This ensures that if etcd crashes mid-write, it can replay the WAL on restart and recover all committed entries. The WAL is stored in /var/lib/etcd/member/wal/. WAL fsync latency is one of the most important performance metrics for etcd.

What happens if two etcd members see different leaders at the same time?

Raft's term mechanism prevents this from causing inconsistency. If an old leader (say, term 5) is partitioned and a new leader (term 6) is elected, any writes to the old leader are not committed (quorum cannot be reached without the partitioned nodes). When the partition heals, the old leader receives a message with term 6, recognizes it is behind, steps down to follower, and accepts the new leader's log. Raft guarantees that only one leader can commit writes in any given term.

How do I add a new etcd member to an existing cluster?

Use etcdctl member add <name> --peer-urls=https://<new-peer-ip>:2380 to register the new member. Then start the new etcd process with --initial-cluster-state=existing pointing to the registered cluster. The new member will receive a full snapshot from the leader and catch up. Never start a new member with --initial-cluster-state=new against an existing cluster, because it will create a conflicting cluster.

What is the etcd v3 lease mechanism?

A lease is a time-to-live (TTL) associated with a set of keys. When the lease expires, all keys attached to it are automatically deleted. Kubernetes uses etcd leases to expire Event objects after their TTL. Node heartbeats use a related but different mechanism: each node renews a Kubernetes Lease object in the kube-node-lease namespace, which is an ordinary API object stored in etcd rather than an etcd lease, despite the shared name. If the node fails to renew its Lease, the node controller marks it NotReady. This is more efficient than updating a large Node object for every heartbeat.
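
The TTL behavior of leases can be modeled in a few lines. The class below is a toy illustration with an injected clock (the `now` arguments), not the etcd client API: keys attached to a lease disappear once the lease stops being renewed.

```python
class LeaseStore:
    """Toy model of TTL leases: keys vanish when their lease expires."""

    def __init__(self):
        self.leases = {}  # lease id -> expiry time
        self.keys = {}    # key -> (value, lease id)

    def expire(self, now):
        dead = {lease for lease, expiry in self.leases.items() if expiry <= now}
        for lease in dead:
            del self.leases[lease]
        self.keys = {k: v for k, v in self.keys.items() if v[1] not in dead}

    def grant(self, lease_id, ttl, now):
        self.leases[lease_id] = now + ttl

    def keep_alive(self, lease_id, ttl, now):
        """Renew a lease; returns False if it had already expired."""
        self.expire(now)
        if lease_id not in self.leases:
            return False
        self.leases[lease_id] = now + ttl
        return True

    def put(self, key, value, lease_id):
        self.keys[key] = (value, lease_id)

    def get(self, key, now):
        self.expire(now)
        entry = self.keys.get(key)
        return entry[0] if entry else None


store = LeaseStore()
store.grant("lease-1", ttl=10, now=0)
store.put("/events/pod-started", "payload", "lease-1")
print(store.get("/events/pod-started", now=5))    # still within the TTL
store.keep_alive("lease-1", ttl=10, now=8)        # renewed until t=18
print(store.get("/events/pod-started", now=15))   # survives thanks to the renewal
print(store.get("/events/pod-started", now=30))   # lease expired, key is gone
```

The key name and lease ID are made up for the example; the point is that the key's lifetime is tied to the lease, not to the key itself.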

Can I encrypt data at rest in etcd?

Yes. Kubernetes supports EncryptionConfiguration for the API server, which encrypts specified resource types (most commonly Secrets) before writing them to etcd. The encryption happens in the API server; etcd itself does not know the data is encrypted. This protects against an attacker who gains direct access to the etcd data directory. Configure it via --encryption-provider-config on kube-apiserver.

How does etcd handle a slow follower?

The leader maintains a replication queue for each follower. If a follower falls behind, the leader buffers log entries in memory up to a configurable limit. If the follower falls so far behind that the leader has already snapshotted the needed entries, the leader sends a full snapshot to the slow follower to catch it up. In Kubernetes, this can happen during maintenance windows when a control plane node is temporarily offline.

What is the default etcd port and what runs on each port?

etcd uses up to three ports: 2379 for client-to-server communication (API server and etcdctl connect here), 2380 for server-to-server peer communication (Raft replication between members), and, in kubeadm clusters, 2381 for a separate metrics HTTP endpoint set via --listen-metrics-urls (Prometheus scrapes this). Without that flag, etcd serves /metrics on the client port. Make sure only the API server has network access to port 2379, and only etcd peers have access to port 2380.

How do I tell which etcd member is the leader?

Run etcdctl endpoint status --write-out=table. The output includes an IS LEADER column. Only one member should show true. If no member shows true, the cluster has lost quorum.

What is the difference between etcd compaction and defragmentation?

Compaction removes old revisions from etcd's logical data model, marking those pages as free in the BoltDB database file. However, the database file on disk does not shrink after compaction β€” the space is just marked as available for reuse. Defragmentation rewrites the BoltDB file, removing the free pages and actually reducing the on-disk file size. You need both: compaction to reduce logical data, defragmentation to reduce physical disk usage.

Can etcd run without TLS in production?

Technically yes, but you absolutely should not. etcd stores every Kubernetes Secret, RBAC policy, and service account token. Without TLS, any process on the network that can reach port 2379 can read all cluster secrets. In cloud environments, configure security groups to restrict etcd port access and always use TLS with mutual authentication for both client and peer connections.

What is the relationship between etcd and the Kubernetes API server cache?

The API server maintains an in-memory watch cache of all objects it has read from etcd. When clients do kubectl get without specifying --watch, the API server typically serves the response from this cache rather than querying etcd directly. This is why reads still work when etcd is in a degraded state but not completely down. The cache is populated and kept current by the API server's own watch connection to etcd.

Key Takeaways
