🚨 Real Incident
At 11:47 PM on a Wednesday, a production Kubernetes cluster stopped accepting new workloads. kubectl apply commands hung with no output. kubectl get pods still worked. kubectl create namespace hung indefinitely. PagerDuty was lit up. After 30 minutes of investigation (checking API server logs, network connectivity, node health) the team finally found it: etcd had run out of disk space on its data volume. The etcd alarm system had triggered, locking the cluster into read-only mode. No new objects could be created. No changes could be written. The cluster was completely frozen. The fix required compacting etcd history, defragmenting the database, and raising the storage quota. Recovery took two hours. A proper monitoring strategy would have surfaced the db_size warning days earlier.
That incident is the reason this article exists. etcd is not a secondary concern in Kubernetes operations: it is the cluster. Every pod definition, every secret, every ConfigMap, every node registration, every service endpoint: all of it lives in etcd. When etcd is unhealthy, Kubernetes is unhealthy, no matter how pristine your nodes and workloads are.
Most DevOps engineers interact with Kubernetes daily without ever thinking about etcd. That is fine, until something goes wrong. And when etcd fails, the blast radius is total. This guide gives you the knowledge to understand, operate, monitor, and recover etcd in production.
What Is etcd and Why Does Kubernetes Need It?
etcd (pronounced "et-see-dee") is a distributed, strongly consistent key-value store created by CoreOS in 2013. The name comes from /etc, the Unix directory for system configuration, combined with d for distributed. It was purpose-built to store configuration data that must be consistent across a distributed system.
Kubernetes chose etcd as its backing store for one reason: strong consistency guarantees. Kubernetes cannot afford eventual consistency. If two API server instances disagree on the current state of a Deployment, the controllers will fight each other, double-schedule pods, and produce chaos. etcd's Raft consensus algorithm guarantees that every read returns the most recent write, a property called linearizability, which makes it safe for distributed control systems.
Every Kubernetes object (pods, services, namespaces, ConfigMaps, Secrets, RBAC roles, CRDs) is serialized and stored as a key-value pair in etcd. Built-in types are stored as protobuf; custom resources are stored as JSON. The API server is the only component that writes to or reads from etcd directly. Every other component (kube-scheduler, kube-controller-manager, kubelet) talks to the API server, which translates requests into etcd operations.
🎯 Interview Tip
When asked "what is the role of etcd in Kubernetes?" do not just say "it stores cluster state." Say: "etcd is the sole source of truth for all Kubernetes objects. It provides linearizable reads and writes using the Raft consensus algorithm. The API server is its only client. Every watch event that drives controller reconciliation originates from an etcd watch stream." That answer signals production experience.
etcd Data Model: What Lives Inside
Key-Value Store with Hierarchical Keys
etcd stores data as flat key-value pairs, but Kubernetes uses path-like keys to create a logical hierarchy. All Kubernetes objects live under the /registry/ prefix. The pattern is /registry/<resource-type>/<namespace>/<name> for namespaced resources, and /registry/<resource-type>/<name> for cluster-scoped resources.
Common key paths include:
- /registry/pods/<namespace>/<pod-name>
- /registry/services/specs/<namespace>/<service-name>
- /registry/deployments/<namespace>/<deployment-name>
- /registry/namespaces/<namespace-name>
- /registry/secrets/<namespace>/<secret-name>
- /registry/configmaps/<namespace>/<configmap-name>
- /registry/nodes/<node-name>
- /registry/clusterroles/<role-name>
- /registry/apiextensions.k8s.io/customresourcedefinitions/<crd-name>
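As a rough illustration of the key layout above, the following shell helper builds the etcd key for an object. The function name `registry_key` is hypothetical, and the sketch only mirrors the two general patterns; it does not handle special cases such as Services, which live under `/registry/services/specs/`.

```shell
# registry_key RESOURCE NAME [NAMESPACE]
# Prints the etcd key for an object, following the pattern
# /registry/<resource-type>/[<namespace>/]<name>.
# Omit NAMESPACE for cluster-scoped resources such as nodes.
registry_key() {
  resource="$1"
  name="$2"
  namespace="${3:-}"
  if [ -n "$namespace" ]; then
    printf '/registry/%s/%s/%s\n' "$resource" "$namespace" "$name"
  else
    printf '/registry/%s/%s\n' "$resource" "$name"
  fi
}
```

For example, `registry_key pods my-pod default` prints `/registry/pods/default/my-pod`, which can then be passed to `etcdctl get`.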
Browsing etcd Directly
You should know how to read etcd directly. This skill is invaluable during incidents when the API server is down but etcd is healthy: it lets you verify what state is actually stored.
# Set environment variables (avoids repeating flags)
export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key
# Check cluster health
etcdctl endpoint health
# Show cluster status: includes leader, db size, raft index
etcdctl endpoint status --write-out=table
# List all members
etcdctl member list --write-out=table
# List all pods stored in etcd (raw bytes, use --keys-only for cleaner output)
etcdctl get /registry/pods/ --prefix --keys-only
# Sample output:
# /registry/pods/default/nginx-deployment-6d9f4dc946-2wxkj
# /registry/pods/default/nginx-deployment-6d9f4dc946-8trps
# /registry/pods/kube-system/coredns-5d78c9869d-f9bvr
# Get a specific pod object (binary protobuf; you'll see garbled text without a decoder)
etcdctl get /registry/pods/default/my-pod
# List all deployments
etcdctl get /registry/deployments/ --prefix --keys-only
# List all namespaces
etcdctl get /registry/namespaces/ --prefix --keys-only
# List all secrets (keys only; DO NOT print values in production logs)
etcdctl get /registry/secrets/ --prefix --keys-only
# List all configmaps
etcdctl get /registry/configmaps/ --prefix --keys-only
# Count total keys in etcd (grep -c . skips any blank separator lines in --keys-only output)
etcdctl get "" --prefix --keys-only | grep -c .
⚠️ Common Mistake
Never run etcdctl get /registry/secrets/ --prefix without --keys-only in production. Unless you have enabled an EncryptionConfiguration, Secret values are stored in etcd unencrypted (the base64 you see through the API is an encoding, not encryption). Printing them to a terminal leaves them in scrollback, session recordings, and any log that captures command output. Always use --keys-only when browsing secrets.
MVCC: Why Compaction Is Not Optional
etcd uses Multi-Version Concurrency Control (MVCC) to handle concurrent reads and writes without locking. Every write to a key creates a new revision: a monotonically increasing integer that represents the global state of the store at a point in time. Old revisions are never immediately deleted; they are kept so that watchers can receive events for changes that happened while they were disconnected.
This design means etcd's database grows over time even if the total number of live objects stays constant. A cluster that has been running for a year and has processed millions of pod lifecycle events will have an etcd database many times larger than its current state. Without periodic compaction, etcd will eventually hit its storage quota and trigger a read-only alarm, exactly what happened in the opening incident.
Compaction discards all revisions older than a specified revision number. Defragmentation then reclaims the disk space that compaction freed (compaction marks space as free but does not shrink the file until defragmentation runs).
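The gap between file size and live data can be put into numbers. The sketch below assumes you already have two byte counts, for example the values of the `etcd_mvcc_db_total_size_in_bytes` and `etcd_mvcc_db_total_size_in_use_in_bytes` metrics; the function names are illustrative, not part of any etcd tooling.

```shell
# reclaimable_bytes TOTAL_BYTES IN_USE_BYTES
# Space that defragmentation could hand back to the filesystem.
reclaimable_bytes() {
  echo $(( $1 - $2 ))
}

# fragmentation_pct TOTAL_BYTES IN_USE_BYTES
# Share of the database file made up of free pages, as a whole percentage.
fragmentation_pct() {
  if [ "$1" -le 0 ]; then
    echo 0
    return
  fi
  echo $(( ($1 - $2) * 100 / $1 ))
}
```

A file of 4,000,000,000 bytes with 1,000,000,000 bytes in use gives `fragmentation_pct` of 75: three quarters of the file is free pages that only a defragmentation will release.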
⚡ Production Tip
Enable auto-compaction-mode: periodic with auto-compaction-retention: "1h" in your etcd configuration. This automatically compacts revisions older than 1 hour without manual intervention. Pair it with a weekly defragmentation job (run during maintenance windows, one member at a time) to keep disk usage stable.
Raft Consensus: How etcd Stays Consistent
Raft is the consensus algorithm that makes etcd a distributed system rather than a single point of failure. Understanding Raft is essential for understanding why etcd behaves the way it does under network partitions, disk pressure, and node failures.
The Raft Model
In a Raft cluster, all nodes are equal at startup. Through an election process, one node becomes the leader. All writes go through the leader. The leader replicates entries to followers and confirms a write is committed only after a majority (quorum) of nodes acknowledge it. This guarantees that even if the leader crashes immediately after confirming a write, the entry will survive because a majority of nodes already have it.
RAFT LEADER ELECTION (Term 1)

Node-1 (Follower)        Node-2 (Candidate -> Leader)        Node-3 (Follower)
      |                              |                               |
      |                 election timeout fires first                 |
      |                      becomes Candidate                       |
      |                              |                               |
      |<- RequestVote(term=1, logIndex=0) -|                         |
      |                              |-------- RequestVote --------->|
      |-------- VoteGranted -------->|                               |
      |                              |<------- VoteGranted ----------|
      |                              |                               |
      |                   Won quorum (2/3 votes)                     |
      |               [LEADER elected for Term 1]                    |
      |                              |                               |
      |<----- Heartbeat(term=1) -----|---------- Heartbeat --------->|

RAFT LOG REPLICATION

Client                 Leader (Node-2)              Followers (Node-1, Node-3)
  |                          |                                |
  |----- write request ----->|                                |
  |                          | 1. Append to local log         |
  |                          |-------- AppendEntries -------->|
  |                          |                                | 2. Ack
  |                          |<------- Success ---------------|
  |                          | 3. Quorum reached,             |
  |                          |    commit entry                |
  |                          |-------- Commit notify -------->|
  |<---- write success ------|                                |
Leader Election in Detail
Every node in a Raft cluster has an election timeout β a random timer. When a follower does not hear from a leader within its election timeout period, it assumes the leader is dead and starts an election. It increments its local term number (a logical clock), transitions to candidate state, and sends RequestVote RPCs to all other nodes.
A node grants its vote to a candidate if it has not already voted in this term and the candidate's log is at least as up-to-date as its own. The first candidate to collect votes from a majority wins the election and becomes the new leader for that term. It then immediately begins sending heartbeats to all followers to prevent another election from starting.
The term number is critical: it prevents old leaders that were temporarily partitioned from overriding the new leader's decisions. Any node that receives a message with a higher term number immediately steps down to follower state.
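The voting rule described above can be sketched as a small shell function. This is a simplified model for illustration only, not etcd's implementation; the function name and argument layout are assumptions made for this example.

```shell
# grants_vote VOTER_TERM VOTED_FOR VOTER_LAST_TERM VOTER_LAST_INDEX \
#             CAND_ID CAND_TERM CAND_LAST_TERM CAND_LAST_INDEX
# Prints "yes" if the voter would grant its vote, otherwise "no".
# VOTED_FOR is "-" when the voter has not voted in its current term.
grants_vote() {
  voter_term="$1"; voted_for="$2"; voter_last_term="$3"; voter_last_index="$4"
  cand_id="$5"; cand_term="$6"; cand_last_term="$7"; cand_last_index="$8"

  # Reject candidates from an older term.
  if [ "$cand_term" -lt "$voter_term" ]; then echo no; return; fi

  # A higher term starts a fresh ballot for the voter.
  if [ "$cand_term" -gt "$voter_term" ]; then voted_for="-"; fi

  # At most one vote per term.
  if [ "$voted_for" != "-" ] && [ "$voted_for" != "$cand_id" ]; then
    echo no; return
  fi

  # The candidate's log must be at least as up-to-date as the voter's:
  # a later last term wins; on equal terms the longer log wins.
  if [ "$cand_last_term" -gt "$voter_last_term" ]; then echo yes; return; fi
  if [ "$cand_last_term" -eq "$voter_last_term" ] &&
     [ "$cand_last_index" -ge "$voter_last_index" ]; then
    echo yes; return
  fi
  echo no
}
```

For instance, `grants_vote 1 node-3 0 0 node-2 1 0 0` prints `no`, because the voter already voted for node-3 in term 1.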
Log Replication
Once a leader is elected, all writes flow through it. When the API server sends a write request to etcd:
- The leader appends the entry to its local log with the current term number and the next log index.
- The leader sends AppendEntries RPCs to all followers in parallel.
- Followers append the entry to their logs and send an acknowledgment.
- Once the leader receives acknowledgments from a majority (quorum), it marks the entry as committed and applies it to the state machine.
- The leader then notifies followers in the next AppendEntries that the entry is committed, and they apply it too.
- The leader responds to the client with success.
The key insight: a write is only acknowledged to the client after it is durable on a majority of nodes. A single node failure cannot lose committed data.
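One way to picture the commit rule: the leader tracks the highest log index stored on each member, and the commit point is the highest index that a majority has reached. The sketch below is a simplification under that assumption (real Raft also requires the entry at that index to belong to the leader's current term), and `commit_index` is a hypothetical helper name.

```shell
# commit_index MATCH_INDEX...
# Takes the highest stored log index of every member (leader included)
# and prints the highest index that a majority of members hold.
commit_index() {
  quorum=$(( $# / 2 + 1 ))
  # Sort indexes from highest to lowest; the quorum-th value is the
  # largest index that at least `quorum` members have reached.
  printf '%s\n' "$@" | sort -rn | sed -n "${quorum}p"
}
```

With three members at indexes 7, 7 and 2, `commit_index 7 7 2` prints 7: the lagging follower does not hold back the commit, because two of three members already store the entry.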
Why 3 Nodes Is the Minimum and When to Use 5
The quorum formula is floor(N/2) + 1: divide the member count by two, round down, and add one. A 3-node cluster has a quorum of 2, tolerating 1 failure. A 5-node cluster has a quorum of 3, tolerating 2 simultaneous failures.
Use 3 nodes for standard production. Use 5 nodes when you need to tolerate 2 simultaneous failures, typically in multi-availability-zone deployments where you want to survive losing an entire AZ plus one additional node. The trade-off is write latency: with 5 nodes, the leader must wait until 3 members (itself included) have persisted the entry instead of 2, so each commit waits on the second-fastest follower rather than the fastest.
⚠️ Common Mistake
Never run etcd with an even number of nodes. A 4-node cluster does not provide better availability than a 3-node cluster: quorum is (4/2)+1 = 3, so you can still only tolerate 1 failure. You have added operational complexity and write latency for no availability gain. Always use odd numbers: 1 (dev only), 3, or 5.
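The arithmetic is easy to check in a shell. The two function names below are made up for this example; the formulas are the ones stated above.

```shell
# quorum N: members that must agree in an N-member cluster, floor(N/2) + 1.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# failures_tolerated N: members that can be lost while quorum survives.
failures_tolerated() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

# Print a small table for common cluster sizes.
# Note that 4 members tolerate 1 failure, the same as 3 members.
for n in 1 2 3 4 5; do
  printf 'members=%s quorum=%s tolerated=%s\n' \
    "$n" "$(quorum "$n")" "$(failures_tolerated "$n")"
done
```

Shell integer division already rounds down, which is why `$1 / 2` needs no explicit floor.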
Network Partitions and Split-Brain Prevention
Raft's quorum requirement is what prevents split-brain. If a network partition divides a 3-node cluster into a group of 2 and a group of 1, the group of 2 can elect a leader and continue operating (they have quorum). The isolated single node cannot elect itself as leader because it cannot gather quorum. It will keep attempting elections but never succeed. No writes will be accepted by the minority partition. This is the correct behavior: it prioritizes consistency over availability.
The practical implication: in a 3-node cluster spanning 3 AZs, a single AZ failure is handled gracefully. But if all three AZs lose connectivity to one another, each partition holds a single member, none has quorum, and the cluster becomes fully unavailable. This is the theoretical worst case; in practice, multi-AZ network partitions this severe are extremely rare.
etcd Watch Mechanism: The Heartbeat of Kubernetes Controllers
The watch API is one of etcd's most important features for Kubernetes. Rather than polling for changes, every Kubernetes component that needs to respond to state changes opens a watch on a key or key prefix. etcd pushes events to watchers whenever a watched key is modified.
Kubernetes Control Plane

+----------------+   +--------------------+   +----------------------+
| kube-scheduler |   | controller manager |   | cloud-controller-mgr |
+-------+--------+   +---------+----------+   +----------+-----------+
        |                      |                         |
        | Watch/List           | Watch/List              | Watch/List
        +----------------------+-------------------------+
                               |
                               v
                    +----------------------+
                    |    kube-apiserver    |<--- kubectl / clients
                    |  (ONLY client that   |
                    |   talks to etcd)     |
                    +----------+-----------+
                               |
                               | gRPC (port 2379)
                               v
              +--------------------------------------+
              |             etcd cluster             |
              |                                      |
              |  +--------+  +--------+  +--------+  |
              |  | etcd-0 |  | etcd-1 |  | etcd-2 |  |
              |  |(Leader)|<>|(Follow)|<>|(Follow)|  |
              |  +--------+  +--------+  +--------+  |
              |     port 2380 (peer-to-peer Raft)    |
              +--------------------------------------+
Here is how the watch mechanism drives the core Kubernetes control loops:
- kube-scheduler watches Pods (stored under /registry/pods/) for pods with nodeName: "" (unscheduled pods). When a new pod appears, it calculates the best node and writes the binding back to the API server.
- kube-controller-manager watches Deployments, ReplicaSets, nodes, and dozens of other resources. When a Deployment changes, the Deployment controller reconciles the desired state by creating or deleting ReplicaSets.
- kubelet watches pods assigned to its node. When a new pod binding appears, it pulls images and starts containers.
- kube-proxy watches Services and EndpointSlices. When a Service endpoint is added or removed, each kube-proxy updates the iptables/IPVS rules on its own node.
None of these components talk to etcd directly. They all talk to the API server, which maintains watch caches internally. The API server opens a relatively small number of watch connections to etcd and multiplexes the events to potentially thousands of clients. This is why the API server is described as the gateway or proxy for etcd: it fans out events to all interested parties while keeping etcd's connection count manageable.
🎯 Interview Tip
A common interview question is: "How does the kube-scheduler know when a new pod needs to be scheduled?" The answer flows through the watch mechanism: the scheduler watches the API server for pods with an empty nodeName field. The API server gets that notification from an etcd watch. Without etcd's watch streaming, every Kubernetes component would need to poll, which would not scale beyond a few hundred nodes.
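To see the raw stream that the API server consumes, you can open a watch with etcdctl. This is a minimal sketch that assumes the ETCDCTL_* environment variables from the browsing section are already exported; `watch_registry` is a hypothetical wrapper, and the values in the events are binary protobuf.

```shell
# watch_registry RESOURCE
# Streams PUT and DELETE events for every key under /registry/<RESOURCE>/.
# Blocks until interrupted with Ctrl-C.
watch_registry() {
  etcdctl watch "/registry/$1/" --prefix
}
```

Running `watch_registry pods` in one terminal and creating a pod from another shows one event for the initial write and further events as the scheduler and kubelet update the object.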
etcd Cluster Sizing
Sizing an etcd cluster requires balancing fault tolerance, write latency, and operational complexity. Here are the practical guidelines:
1-Node etcd (Development Only)
A single-node etcd requires no consensus and has no fault tolerance. The node goes down, the cluster goes down. This is appropriate for local development with kind, minikube, or k3s. It should never be used in any environment that requires availability.
3-Node etcd (Standard Production)
Three nodes is the correct choice for the vast majority of production Kubernetes clusters. Quorum requires 2 nodes, so 1 node can fail without service interruption. Spread the three nodes across three availability zones. This configuration handles the most common failure scenarios: a node crash, a disk failure, a VM restart, or an AZ maintenance event.
5-Node etcd (High-Availability Production)
Five nodes tolerate 2 simultaneous failures. This is appropriate for clusters where the cost of downtime is extremely high, for clusters spanning geographically separated data centers with real network latency between sites, or for clusters large enough that multi-node failures become statistically likely. The trade-off is that the leader must wait until 3 members (itself included) have persisted each write. Replication runs in parallel, so this means waiting for the second-fastest follower instead of the fastest, which raises write latency when followers sit in different AZs.
Latency Requirements
Raft performance is extremely sensitive to network latency between members. etcd's default heartbeat interval is 100ms and default election timeout is 1000ms. For these defaults to work correctly, round-trip time between etcd members should be under 10ms. In practice, this means etcd members should be in the same region, typically in the same metropolitan area. Cross-region etcd clusters with 50-100ms RTT will experience constant election timeouts and leader instability.
3-Node vs. 5-Node etcd Comparison
| Factor | 3-Node Cluster | 5-Node Cluster |
|---|---|---|
| Quorum required | 2 nodes | 3 nodes |
| Failures tolerated | 1 | 2 |
| Write latency | Lower (2 acks needed) | Higher (3 acks needed) |
| Operational overhead | Lower | Higher |
| Cost | 3 control-plane nodes | 5 control-plane nodes |
| Use case | Standard production, single region | Critical clusters, multi-AZ HA |
| Rolling update safety | Can update 1 at a time | Can update 2 at a time |
etcd Hardware Requirements
Storage: SSD Is Mandatory
etcd writes every log entry to disk before acknowledging it to the leader. This is the fsync path that shows up in the etcd_disk_wal_fsync_duration_seconds metric. On a spinning disk (HDD), a single fsync takes 2-10ms under typical load. During busy periods, this can spike to 20-50ms.
etcd's election timeout is 1000ms by default, meaning a member that does not hear from the leader within 1 second starts an election. On a busy HDD, the disk subsystem can delay heartbeat processing long enough to trigger spurious elections. You will see etcd_server_leader_changes_seen_total incrementing constantly, with no actual network partition occurring. The cluster is healthy in theory but thrashing in practice.
On AWS, use io1 or io2 volumes with a minimum of 3,000 IOPS provisioned. For very active clusters, provision 6,000+ IOPS. gp3 volumes (baseline 3,000 IOPS, configurable up to 16,000) are a cost-effective choice for most clusters. On GCP, use SSD persistent disks. On Azure, use Premium SSD.
Dedicated Disk
etcd's data directory must be on a dedicated disk, not the root volume. If etcd shares its disk with container image layers, log files, or OS writes, a runaway workload can fill the shared disk and trigger etcd's space alarm. Use a separate volume for /var/lib/etcd and size it at 50-100GB. Monitor it separately from the OS disk.
RAM
etcd keeps its working set in memory. For a small cluster (fewer than 100 nodes, fewer than 5,000 objects), 4GB of RAM for etcd is sufficient. For large clusters with thousands of nodes and tens of thousands of objects, 8-16GB is appropriate. etcd's memory usage grows with the number of watch connections and the size of objects being watched. Large Secret or ConfigMap objects stored in etcd are multiplied across all watchers.
CPU
etcd is not CPU-intensive. 2-4 vCPUs is sufficient for most clusters. The bottleneck is almost always disk I/O, not CPU.
etcd Performance Tuning
Heartbeat and Election Timeout
The heartbeat interval is how often the leader sends heartbeats to followers to prevent elections. The election timeout is how long a follower waits without a heartbeat before starting an election. The election timeout must be at least 5x the heartbeat interval to handle momentary delays without triggering unnecessary elections.
Defaults: heartbeat=100ms, election-timeout=1000ms. These work well when network RTT is under 10ms. If you are seeing frequent leader changes on healthy hardware, consider increasing the election timeout to 2000-5000ms to give the system more tolerance for transient latency spikes.
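Before changing these values it helps to check the ratio mechanically. The sketch below encodes the 5x rule from this section and prints the corresponding `--heartbeat-interval` and `--election-timeout` flags (both in milliseconds); the function names are illustrative.

```shell
# check_timeouts HEARTBEAT_MS ELECTION_MS
# Succeeds when the election timeout is at least 5x the heartbeat interval.
check_timeouts() {
  [ "$2" -ge $(( $1 * 5 )) ]
}

# print_timeout_flags HEARTBEAT_MS ELECTION_MS
# Prints the matching etcd flags, or an error if the ratio is too small.
print_timeout_flags() {
  if check_timeouts "$1" "$2"; then
    echo "--heartbeat-interval=$1 --election-timeout=$2"
  else
    echo "election timeout ${2}ms is less than 5x heartbeat ${1}ms" >&2
    return 1
  fi
}
```

For example, `print_timeout_flags 100 2000` prints `--heartbeat-interval=100 --election-timeout=2000`, while `print_timeout_flags 100 400` fails. Apply the same values to every member of the cluster.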
Storage Quota
etcd's default storage quota is 2GB. This is dangerously low for production clusters. A moderately active cluster can hit 2GB within months. Set quota-backend-bytes: 8589934592 (8GB) as a baseline. For very large clusters, consider 16GB. When the quota is hit, etcd enters read-only mode and raises the space alarm, exactly as in the opening incident.
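The byte values are easy to get wrong by a factor of 1000 versus 1024. A small sketch, with hypothetical helper names, that derives the quota value and a 75% warning threshold:

```shell
# gib_to_bytes GIB: convert GiB to the byte count quota-backend-bytes expects.
gib_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

# alert_threshold_bytes QUOTA_BYTES PERCENT: database size at which to warn.
alert_threshold_bytes() {
  echo $(( $1 * $2 / 100 ))
}

QUOTA=$(gib_to_bytes 8)
echo "quota-backend-bytes: $QUOTA"
echo "warn at: $(alert_threshold_bytes "$QUOTA" 75) bytes"
```

For an 8GB quota this yields 8589934592 bytes, with the warning threshold at 6442450944 bytes (6GB).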
Compaction and Defragmentation
Enable automatic compaction with auto-compaction-mode: periodic and auto-compaction-retention: "1h". This compacts all revisions older than 1 hour automatically. After compaction, the on-disk file still holds the space as free pages; defragmentation is required to actually reclaim it. Run defragmentation on a schedule (monthly is usually sufficient), one member at a time, starting with followers to avoid interrupting the leader.
# Check current revision and db size
etcdctl endpoint status --write-out=json | python3 -m json.tool
# Compact history up to current revision (frees space in MVCC store)
REV=$(etcdctl endpoint status --write-out=json | python3 -c "import sys,json; print(json.load(sys.stdin)[0]['Status']['header']['revision'])")
etcdctl compact $REV
# Defragment all members (run one at a time; do NOT use --cluster in production)
etcdctl defrag --endpoints=https://10.0.0.10:2379
etcdctl defrag --endpoints=https://10.0.0.11:2379
etcdctl defrag --endpoints=https://10.0.0.12:2379
# Check and clear alarms
etcdctl alarm list
etcdctl alarm disarm
# Take a snapshot backup
SNAPSHOT=/backup/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db
etcdctl snapshot save "$SNAPSHOT"
# Verify snapshot integrity (checks the file that was just written)
etcdctl snapshot status "$SNAPSHOT" --write-out=table
⚡ Production Tip
When running etcdctl defrag, defragment one member at a time. Use --endpoints to target a specific member. Defragmentation causes a brief pause in service from that member while it rewrites its data file. If you defrag all members simultaneously with --cluster, you can temporarily lose quorum on a 3-node cluster. Always defrag the followers first, then the leader last.
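The followers-first ordering can be scripted so that nobody has to remember it during a maintenance window. The sketch below assumes you already know which endpoint is the leader (the IS LEADER column of `etcdctl endpoint status --write-out=table` shows it); `defrag_order` is a hypothetical helper.

```shell
# defrag_order LEADER_ENDPOINT ENDPOINT...
# Prints the endpoints one per line, followers first and the leader last.
defrag_order() {
  leader="$1"
  shift
  for ep in "$@"; do
    if [ "$ep" != "$leader" ]; then
      echo "$ep"
    fi
  done
  echo "$leader"
}

# Example use (commented out so the helper can be sourced safely):
# defrag_order https://10.0.0.11:2379 \
#     https://10.0.0.10:2379 https://10.0.0.11:2379 https://10.0.0.12:2379 |
# while read -r ep; do
#   etcdctl defrag --endpoints="$ep" || break
#   etcdctl endpoint health --endpoints="$ep" || break
# done
```

Stopping the loop at the first failure keeps a problem on one member from spreading to the next.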
Snapshot Count
The snapshot-count parameter controls how many applied entries trigger an etcd snapshot. A snapshot allows the WAL (write-ahead log) to be compacted. The default is 100,000. Lowering it reduces memory usage but increases disk writes. The default is appropriate for most clusters.
Backup and Restore: The Runbook You Need Before You Need It
etcd Snapshot Backup
etcd's built-in snapshot mechanism creates a consistent point-in-time copy of the entire database. A snapshot includes all committed entries up to the revision at the time of the snapshot. It is safe to take a snapshot from any member, but taking it from the leader ensures the most recent state.
The most important rule about backups: test them. A backup you have never restored is not a backup, it is a hope. Run a quarterly restore drill into a throwaway cluster.
BACKUP FLOW
-----------
etcd Leader
    |
    |  etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M%S).db
    v
Snapshot file (consistent point-in-time, includes all MVCC revisions)
    |
    |  copy to S3 / GCS / NFS
    v
Offsite storage (retention policy: 7 daily, 4 weekly)

RESTORE FLOW (Disaster Recovery)
--------------------------------
1. Stop all kube-apiserver instances
2. Stop all etcd members
3. Remove corrupted data directories
4. etcdctl snapshot restore /backup/etcd-latest.db \
     --name etcd-0 \
     --initial-cluster etcd-0=https://10.0.0.10:2380,etcd-1=https://10.0.0.11:2380,etcd-2=https://10.0.0.12:2380 \
     --initial-cluster-token etcd-cluster-1 \
     --initial-advertise-peer-urls https://10.0.0.10:2380 \
     --data-dir /var/lib/etcd
5. Repeat step 4 on each member with its own --name and peer URL
   (--initial-cluster lists every member and is identical everywhere)
6. Start etcd members (they form cluster from restored snapshot)
7. Start kube-apiserver
8. Verify: kubectl get nodes
Automated Backup CronJob
# etcd-backup-cronjob.yaml
# Runs on a control plane node, in a pod with access to the etcd certs.
# Adjust the schedule, S3 bucket, and cert paths for your environment.
# The image must provide etcdctl, the aws CLI, and GNU date; treat the image
# named below as a placeholder and verify it ships all three.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
  labels:
    app: etcd-backup
spec:
  schedule: "0 */6 * * *"        # Every 6 hours
  concurrencyPolicy: Forbid       # Never run two backups simultaneously
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true       # Required to reach etcd at 127.0.0.1:2379
          restartPolicy: OnFailure
          tolerations:
            - key: node-role.kubernetes.io/control-plane
              operator: Exists
              effect: NoSchedule
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          containers:
            - name: etcd-backup
              image: bitnami/etcd:3.5.12
              env:
                - name: ETCDCTL_API
                  value: "3"
                - name: ETCDCTL_ENDPOINTS
                  value: "https://127.0.0.1:2379"
                - name: ETCDCTL_CACERT
                  value: /etc/kubernetes/pki/etcd/ca.crt
                - name: ETCDCTL_CERT
                  value: /etc/kubernetes/pki/etcd/server.crt
                - name: ETCDCTL_KEY
                  value: /etc/kubernetes/pki/etcd/server.key
                - name: S3_BUCKET
                  value: "my-etcd-backups"
                - name: AWS_DEFAULT_REGION
                  value: "us-east-1"
              command:
                - /bin/sh
                - -c
                - |
                  set -e
                  TIMESTAMP=$(date +%Y%m%d-%H%M%S)
                  BACKUP_FILE="/tmp/etcd-${TIMESTAMP}.db"
                  etcdctl snapshot save "${BACKUP_FILE}"
                  etcdctl snapshot status "${BACKUP_FILE}" --write-out=table
                  aws s3 cp "${BACKUP_FILE}" "s3://${S3_BUCKET}/etcd/${TIMESTAMP}.db"
                  echo "Backup complete: ${TIMESTAMP}.db"
                  # Delete backups older than 30 days from S3.
                  # Object names look like 20240115-103000.db, so the first
                  # 8 characters are the date and compare as plain integers.
                  CUTOFF=$(date -d '30 days ago' +%Y%m%d)
                  aws s3 ls "s3://${S3_BUCKET}/etcd/" | \
                    awk '{print $4}' | \
                    while read -r f; do
                      d=$(echo "$f" | cut -c1-8)
                      # Skip anything that does not start with a date stamp.
                      case "$d" in ''|*[!0-9]*) continue ;; esac
                      if [ "$d" -lt "$CUTOFF" ]; then
                        aws s3 rm "s3://${S3_BUCKET}/etcd/$f"
                      fi
                    done
              volumeMounts:
                - name: etcd-certs
                  mountPath: /etc/kubernetes/pki/etcd
                  readOnly: true
          volumes:
            - name: etcd-certs
              hostPath:
                path: /etc/kubernetes/pki/etcd
                type: Directory
Backup Frequency Recommendations
- Development clusters: Daily backup, 7-day retention.
- Production clusters with low change rate: Every 6 hours, 30-day retention.
- Production clusters with high change rate: Every 1-2 hours, 30-day retention, weekly offsite copy.
- Before any major operation (cluster upgrade, large-scale deployment, CRD changes): manual snapshot immediately before the change.
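Whatever retention window you pick, the pruning decision comes down to comparing a date stamp in the file name with a cutoff. The sketch below assumes backup names of the form YYYYMMDD-HHMMSS.db, as produced by the CronJob above; `is_expired` is a hypothetical helper, and computing the cutoff with `date -d` assumes GNU date.

```shell
# is_expired FILE CUTOFF_YYYYMMDD
# Succeeds when the backup's date stamp is older than the cutoff.
# Names that do not start with 8 digits are never treated as expired.
is_expired() {
  stamp=$(printf '%s' "$1" | cut -c1-8)
  case "$stamp" in
    ''|*[!0-9]*) return 1 ;;   # not a date stamp: keep the file
  esac
  [ "${#stamp}" -eq 8 ] && [ "$stamp" -lt "$2" ]
}

# Example use (commented out; requires GNU date):
# CUTOFF=$(date -d '30 days ago' +%Y%m%d)
# is_expired 20240101-120000.db "$CUTOFF" && echo "delete"
```

Refusing to expire unrecognized names is deliberate: a pruning job should fail safe and leave unexpected files alone.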
Velero Integration
Velero is not an etcd backup tool: it backs up Kubernetes resource manifests by calling the Kubernetes API, not by reading etcd directly. However, Velero complements etcd snapshots: etcd snapshots give you binary-level recovery for disaster scenarios, while Velero gives you selective resource restoration (restore a single namespace, restore without cluster-scoped objects, etc.). Use both.
DR Runbook: Step-by-Step Restore Procedure
The following procedure restores etcd from a snapshot in a kubeadm-managed cluster:
- Identify the most recent valid snapshot. Run etcdctl snapshot status <file> to verify it is not corrupted.
- Stop kube-apiserver on all control plane nodes: mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/ (as a static pod, this stops it without systemd).
- Stop etcd on all members: mv /etc/kubernetes/manifests/etcd.yaml /tmp/
- Back up the existing data directory: mv /var/lib/etcd /var/lib/etcd.bak
- Restore the snapshot on each member with unique peer URLs and --name flags. All members must use the same snapshot and the same --initial-cluster-token.
- Restore the etcd manifest: mv /tmp/etcd.yaml /etc/kubernetes/manifests/
- Restore kube-apiserver: mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
- Verify etcd health: etcdctl endpoint health
- Verify Kubernetes: kubectl get nodes, kubectl get pods -A
- Verify controller reconciliation by checking that all Deployments report their expected replica counts.
⚠️ Common Mistake
Do not restore members from different snapshots. If you restore node A from a snapshot at revision 1,000,000 and node B from a snapshot at revision 1,050,000, they will not form a consistent cluster. All members must be restored from the same snapshot file. After the restore, the cluster state is whatever the snapshot contained, and controllers converge workloads toward it as they rerun their reconciliation loops.
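To keep the per-member commands consistent, it can help to generate them from one member list instead of typing each by hand. The sketch below only prints the commands. The member names etcd-1 and etcd-2, their peer URLs, and the helper name `restore_command` are assumptions for illustration; substitute your own values.

```shell
# Hypothetical member list; every member receives this same string.
INITIAL_CLUSTER="etcd-0=https://10.0.0.10:2380,etcd-1=https://10.0.0.11:2380,etcd-2=https://10.0.0.12:2380"

# restore_command SNAPSHOT NAME PEER_URL
# Prints (does not run) the restore command for one member.
restore_command() {
  printf 'etcdctl snapshot restore %s --name %s --initial-cluster %s --initial-cluster-token etcd-cluster-1 --initial-advertise-peer-urls %s --data-dir /var/lib/etcd\n' \
    "$1" "$2" "$INITIAL_CLUSTER" "$3"
}

# One command per member, all from the same snapshot file.
restore_command /backup/etcd-latest.db etcd-0 https://10.0.0.10:2380
restore_command /backup/etcd-latest.db etcd-1 https://10.0.0.11:2380
restore_command /backup/etcd-latest.db etcd-2 https://10.0.0.12:2380
```

Only `--name` and `--initial-advertise-peer-urls` differ between members; the snapshot path, cluster token, and member list stay identical.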
Monitoring etcd in Production
Key Metrics
The following Prometheus metrics are the minimum you should alert on:
- etcd_server_has_leader: Binary gauge. Should always be 1. Alert immediately if 0.
- etcd_server_leader_changes_seen_total: Cumulative counter. More than 3 changes in 1 hour indicates instability. Often caused by slow disks or network issues.
- etcd_server_proposals_failed_total: Failed write proposals. Any increase warrants investigation.
- etcd_disk_wal_fsync_duration_seconds: 99th percentile should be under 10ms. Above 25ms indicates disk pressure. Above 100ms causes election timeouts.
- etcd_disk_backend_commit_duration_seconds: BoltDB backend commit latency. Should be under 25ms p99.
- etcd_mvcc_db_total_size_in_bytes: Current database size. Alert at 6GB if quota is 8GB (75% usage).
- etcd_mvcc_db_total_size_in_use_in_bytes: Actual in-use size (after compaction, before defrag). A large gap between this and etcd_mvcc_db_total_size_in_bytes means defragmentation is needed.
- etcd_network_peer_round_trip_time_seconds: RTT between members. Should stay under 10ms in a healthy cluster.
- etcd_server_slow_apply_total: Count of slow applies. Nonzero values indicate backend performance problems.
# etcd-servicemonitor.yaml
# Requires kube-prometheus-stack or Prometheus Operator.
# kubeadm configures etcd to serve metrics on port 2381 via --listen-metrics-urls.
# This example scrapes over TLS with client certs; if your metrics listener is
# plain HTTP, set scheme: http and drop tlsConfig.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: etcd
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  jobLabel: k8s-app
  endpoints:
    - port: metrics               # Service port name for the etcd metrics listener
      interval: 30s
      scheme: https
      tlsConfig:
        caFile: /etc/prometheus/secrets/etcd-certs/ca.crt
        certFile: /etc/prometheus/secrets/etcd-certs/server.crt
        keyFile: /etc/prometheus/secrets/etcd-certs/server.key
        insecureSkipVerify: false
  namespaceSelector:
    matchNames:
      - kube-system
  selector:
    matchLabels:
      component: etcd
---
# Key Prometheus alert rules for etcd
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: etcd-alerts
  namespace: monitoring
spec:
  groups:
    - name: etcd
      rules:
        - alert: EtcdNoLeader
          expr: etcd_server_has_leader == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "etcd member has no leader"
            description: "etcd member {{ $labels.instance }} has no leader for > 1 minute"
        - alert: EtcdHighNumberOfLeaderChanges
          expr: increase(etcd_server_leader_changes_seen_total[1h]) > 3
          for: 0m
          labels:
            severity: warning
          annotations:
            summary: "Frequent etcd leader changes"
            description: "{{ $value }} leader changes in the last hour on {{ $labels.instance }}"
        - alert: EtcdDatabaseSizeHigh
          # Threshold is 75% of an assumed 8GB quota; adjust to your quota-backend-bytes.
          expr: etcd_mvcc_db_total_size_in_bytes / 1024 / 1024 / 1024 > 6
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "etcd database approaching quota"
            description: "etcd db size is {{ $value | humanize }}GB against an assumed 8GB quota."
        - alert: EtcdDiskFsyncSlow
          expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.01
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "etcd WAL fsync latency high"
            description: "99th percentile WAL fsync latency is {{ $value | humanizeDuration }} on {{ $labels.instance }}"
Troubleshooting Tip
If you see frequent leader changes but no network issues, check etcd_disk_wal_fsync_duration_seconds first. Slow disk I/O is the most common cause of spurious elections in production clusters. Run fio --name=etcd-test --ioengine=sync --fdatasync=1 --directory=/var/lib/etcd --size=22m --bs=2300 --readwrite=write to benchmark your disk. fio leaves a 22MB test file in the target directory, so remove it afterwards or point --directory at a scratch directory on the same volume. This fio invocation is the one commonly referenced for benchmarking etcd disks: it mimics etcd's pattern of small sequential writes, each followed by fdatasync.
etcd in Managed Kubernetes: EKS, GKE, AKS
In managed Kubernetes services, the control plane, including etcd, is operated by the cloud provider. You do not have access to the etcd nodes, cannot run etcdctl against them, and cannot take or restore etcd snapshots directly.
Self-Managed vs. Managed Kubernetes etcd
| Aspect | Self-Managed (kubeadm) | Managed (EKS/GKE/AKS) |
|---|---|---|
| etcd access | Full access via etcdctl | No direct access |
| Snapshot backup | Your responsibility (etcdctl) | Provider managed; use Velero for resource backup |
| Disk sizing | Your responsibility | Provider managed |
| Compaction/defrag | Your responsibility | Provider managed |
| Monitoring | Your responsibility (Prometheus) | Provider cloud metrics only |
| HA configuration | You choose 3 or 5 nodes | Provider managed (usually 3) |
| Disaster recovery | Full etcd restore runbook | Velero restore or cluster recreate |
| Interview relevance | Deep operational knowledge required | Architectural understanding still required |
Even if you use EKS or GKE exclusively, you must understand etcd architecture for senior interviews. Interviewers at companies running managed Kubernetes still ask about etcd because it reveals whether you understand how Kubernetes actually works. The question "what would happen if etcd went down in your EKS cluster?" is a common senior DevOps interview question. The answer: the API server would stop serving write requests. Existing pods continue running (kubelet operates independently), but no new pods can be scheduled, no config changes can be applied, and no healing will occur.
Three Production Incident Stories
Incident 1: etcd Disk Full β Cluster Freeze at Midnight
🚨 Real Incident
The cluster described in the introduction had been running for 14 months with no maintenance performed on etcd. Auto-compaction was not enabled. The etcd database had grown from its initial 500MB to 7.8GB against an 8GB quota (etcd's built-in default is only 2GB). When the quota alarm fired, etcd switched to read-only mode. The kube-apiserver could still serve GET requests from its in-memory watch cache, which is why kubectl get pods still worked. But any operation that required writing to etcd (kubectl apply, kubectl create, pod scheduling, deployment scaling) hung indefinitely at the API server as it waited for etcd to accept the write. Recovery steps: (1) SSH to a control plane node and set ETCDCTL env vars. (2) Run etcdctl alarm list to confirm the NOSPACE alarm. (3) Run etcdctl compact to the current revision. (4) Run etcdctl defrag on each member. (5) Run etcdctl alarm disarm. (6) Raise the quota in the etcd config and restart etcd. Total downtime for write operations: 2 hours.
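The recovery steps above can be sketched as a small Python helper that only builds the ordered etcdctl commands, so the plan can be reviewed before anyone runs it. The function name and the sample status JSON are hypothetical; the endpoints reuse the example IPs from this guide, and the sketch assumes the ETCDCTL_* environment variables for certificates are already set.

```python
import json

def nospace_recovery_plan(status_json, endpoints):
    """Return the etcdctl commands (as argv lists) for a NOSPACE recovery.

    status_json is the output of `etcdctl endpoint status --write-out=json`.
    """
    members = json.loads(status_json)
    # Compact up to the highest revision any member reports.
    revision = max(m["Status"]["header"]["revision"] for m in members)
    plan = [
        ["etcdctl", "alarm", "list"],               # confirm NOSPACE
        ["etcdctl", "compact", str(revision)],      # drop old revisions
    ]
    # Defragment one member at a time so quorum is never at risk.
    for endpoint in endpoints:
        plan.append(["etcdctl", "defrag", f"--endpoints={endpoint}"])
    plan.append(["etcdctl", "alarm", "disarm"])     # lift read-only mode
    plan.append(["etcdctl", "alarm", "list"])       # verify no alarm remains
    return plan

# Hypothetical status output, trimmed to the fields the sketch reads.
sample_status = json.dumps([
    {"Endpoint": "https://10.0.0.10:2379", "Status": {"header": {"revision": 1200000}}},
    {"Endpoint": "https://10.0.0.11:2379", "Status": {"header": {"revision": 1200000}}},
    {"Endpoint": "https://10.0.0.12:2379", "Status": {"header": {"revision": 1199998}}},
])
endpoints = ["https://10.0.0.10:2379", "https://10.0.0.11:2379", "https://10.0.0.12:2379"]
plan = nospace_recovery_plan(sample_status, endpoints)
for step in plan:
    print(" ".join(step))
```

Keeping the plan as data rather than executing it directly makes it easy to paste into an incident channel or feed to subprocess.run one step at a time.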
Incident 2: Leader Election Storm from Slow Disks
🚨 Real Incident
A production cluster was migrated from dedicated bare metal etcd nodes to virtual machines with network-attached storage. Within hours, the on-call engineer was receiving alerts: etcd_server_leader_changes_seen_total had incremented 47 times in one hour. The cluster was technically functional, but API server latency had spiked to 800ms p99 as write requests queued behind rapidly changing leaders. Root cause: the storage was provisioned on a shared array with no IOPS reservation. Under concurrent load from other VMs on the same array, etcd WAL fsync latency spiked to 250-400ms, several times the 100ms heartbeat interval, and back-to-back stalls were enough to push followers past the 1000ms election timeout. Each time the leader's disk stalled, a follower started an election. The fix required migrating to dedicated SSD volumes with 6,000 provisioned IOPS. Lesson: never share etcd storage with other workloads. Disk I/O contention stays invisible until it causes election storms unless you are already watching WAL fsync latency.
Incident 3: Restore Gone Wrong β Wrong Revision, Duplicate Resources
🚨 Real Incident
After a botched upgrade that corrupted etcd's BoltDB backend, the team restored from a snapshot. The snapshot was taken 18 hours before the corruption. The restore procedure was technically correct, but one critical mistake was made: only two of the three etcd members were restored from the snapshot. The third member was restored from a more recent local backup that a junior engineer found on the node. The three members could not agree on cluster state β two members were at revision 1,200,000 and one member was at revision 1,250,000. etcd itself would not start cleanly. After 45 minutes of debugging, the team realized the mismatch and reran the restore with the same snapshot on all three nodes. The resulting cluster was 18 hours behind, which required manually re-applying 200+ ConfigMap and Secret changes from git history. Lesson: always use the same snapshot file for all members during a restore. Document the restore procedure, test it quarterly, and never improvise under pressure.
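One way to guard against this mistake is to compare checksums of the snapshot file on every node before starting the restore. The sketch below is a minimal illustration in Python; the file names and payloads are placeholders created in a temporary directory, and the helper names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large snapshots do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def check_same_snapshot(paths):
    """Return (all_identical, {path: digest}) for the given snapshot files."""
    digests = {str(p): sha256_of(p) for p in paths}
    return len(set(digests.values())) == 1, digests

# Demo with stand-in files: two copies of one snapshot and one stray backup.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "etcd-0.db").write_bytes(b"snapshot-A")
    (tmp / "etcd-1.db").write_bytes(b"snapshot-A")
    (tmp / "etcd-2.db").write_bytes(b"snapshot-B")
    matching, _ = check_same_snapshot([tmp / "etcd-0.db", tmp / "etcd-1.db"])
    mixed, digests = check_same_snapshot(sorted(tmp.glob("*.db")))

print("first two members identical:", matching)
print("all three members identical:", mixed)
```

In a real restore you would compute the digest once where the snapshot is stored and compare it on each node after copying, refusing to proceed if any member differs.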
Troubleshooting etcd
API Server Log Errors That Point to etcd
When etcd has problems, the API server logs are often where you first see the symptoms:
- "context deadline exceeded": The API server timed out waiting for etcd to respond. Usually caused by overloaded etcd or disk I/O spikes.
- "etcdserver: request timed out": etcd itself is reporting a timeout on a proposal. Check disk latency and leader stability.
- "etcdserver: mvcc: required revision has been compacted": A watcher was too slow and the revision it was watching has been compacted away. The API server will re-list and reestablish the watch automatically, but this causes a spike in etcd load.
- "etcdserver: mvcc: database space exceeded": The quota alarm has fired. Run compaction, defrag, and disarm immediately.
- "dial tcp 127.0.0.1:2379: connect: connection refused": etcd is not running or the port is blocked. Check the etcd process and firewall rules.
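The mapping above is simple enough to encode as a lookup, which can help when triaging a large log dump. This is a minimal Python sketch with a hypothetical function name; the hints paraphrase the list above and the sample log lines are invented for illustration.

```python
# Ordered from most specific to least specific message fragment.
ETCD_ERROR_HINTS = [
    ("database space exceeded", "NOSPACE alarm: compact, defrag, then disarm"),
    ("required revision has been compacted", "slow watcher: expect a relist and a load spike"),
    ("etcdserver: request timed out", "proposal timeout: check disk latency and leader stability"),
    ("context deadline exceeded", "API server timed out on etcd: check etcd load and disk I/O"),
    ("connection refused", "etcd not reachable: check the process and firewall rules"),
]

def diagnose(log_line):
    """Return a triage hint for a log line, or None if nothing matches."""
    for fragment, hint in ETCD_ERROR_HINTS:
        if fragment in log_line:
            return hint
    return None

sample_lines = [
    "E0101 rpc error: code = ResourceExhausted desc = etcdserver: mvcc: database space exceeded",
    "E0101 dial tcp 127.0.0.1:2379: connect: connection refused",
    "I0101 unrelated informational message",
]
for line in sample_lines:
    print(diagnose(line))
```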
Common Diagnostic Commands
# Check if etcd has a leader
etcdctl endpoint status --write-out=table
# Check for alarms (NOSPACE is the most common)
etcdctl alarm list
# Check current db size vs quota
etcdctl endpoint status --write-out=json | \
  python3 -c "import sys,json; d=json.load(sys.stdin); [print(m['Endpoint'], 'db_size:', m['Status']['dbSize'], 'bytes') for m in d]"
# Check if Raft is making progress (raftIndex should be incrementing)
watch -n 1 'etcdctl endpoint status --write-out=table'
# Check etcd logs for slow fsync messages (systemd-managed etcd; for a kubeadm static pod, read the pod logs instead)
journalctl -u etcd -f | grep -E "slow|failed|timeout|exceed"
# Kubernetes API server error logs related to etcd
# (with a label selector kubectl returns only the last 10 lines per pod unless --tail is set)
kubectl logs -n kube-system -l component=kube-apiserver --tail=1000 | grep -E "etcd|context deadline"
# Check etcd member list and verify all members are started
etcdctl member list --write-out=table
Troubleshooting Tip
If kubectl apply hangs but kubectl get works, your first check should always be etcdctl alarm list. The split behavior (reads work, writes hang) is the classic fingerprint of an etcd NOSPACE alarm. This narrows your diagnosis from "something is wrong somewhere" to a 3-step fix in under 5 minutes.
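The inline python3 one-liner above is handy but hard to extend. A slightly longer sketch of the same idea follows: it reads the JSON from etcdctl endpoint status and reports leader, quota usage, and space that a defrag could reclaim. The function name and sample values are hypothetical, and the sketch assumes the JSON carries dbSizeInUse (falling back to dbSize when it is absent).

```python
import json

def summarize(status_json, quota_bytes):
    """Summarize `etcdctl endpoint status --write-out=json` per member."""
    rows = []
    for member in json.loads(status_json):
        status = member["Status"]
        db_size = status["dbSize"]
        rows.append({
            "endpoint": member["Endpoint"],
            # A member is the leader when the reported leader ID is its own ID.
            "is_leader": status["leader"] == status["header"]["member_id"],
            "pct_of_quota": round(100 * db_size / quota_bytes, 1),
            # Space still allocated in the file but no longer in use.
            "reclaimable_bytes": db_size - status.get("dbSizeInUse", db_size),
        })
    return rows

QUOTA = 8 * 1024 ** 3  # 8GB, matching quota-backend-bytes in this guide

# Hypothetical two-member output, trimmed to the fields used above.
sample = json.dumps([
    {"Endpoint": "https://10.0.0.10:2379",
     "Status": {"header": {"member_id": 11}, "leader": 11,
                "dbSize": 6 * 1024 ** 3, "dbSizeInUse": 2 * 1024 ** 3}},
    {"Endpoint": "https://10.0.0.11:2379",
     "Status": {"header": {"member_id": 22}, "leader": 11,
                "dbSize": 6 * 1024 ** 3, "dbSizeInUse": 2 * 1024 ** 3}},
])
rows = summarize(sample, QUOTA)
for row in rows:
    print(row)
```

A large gap between dbSize and dbSizeInUse suggests compaction has already run and a defrag would shrink the file.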
15 Common etcd Mistakes
- Using the default 2GB quota in production. Raise quota-backend-bytes to 8GB, the upper bound etcd's documentation suggests for normal environments. The default has caused countless midnight incidents.
- Not enabling auto-compaction. Without it, etcd grows indefinitely until it hits quota.
- Defragmenting all members simultaneously. Defrag one member at a time to avoid losing quorum.
- Running etcd on HDD. HDD fsync latency causes election storms. Use SSD with dedicated IOPS.
- Sharing etcd disk with OS or container runtime. I/O contention from other processes causes etcd latency spikes.
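- Running etcd with even node counts. A 4-member cluster tolerates one failure, exactly like a 3-member cluster, because quorum rises from 2 to 3. The extra member adds cost and replication traffic but no fault tolerance.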
- Not testing restore procedures. A backup you have never restored is not a backup.
- Using different snapshot files for different members during restore. All members must restore from the same snapshot.
- Storing large objects in etcd. Kubernetes has a 1.5MB limit per object, but storing frequently-updated large ConfigMaps or Secrets causes excessive etcd churn. Use external stores for large blobs.
- Not monitoring etcd_server_leader_changes_seen_total. Frequent leader changes are an early warning of instability and usually show up well before a full outage.
- Running etcd members across regions with high latency. With the default heartbeat and election timeouts, RTT well above 10ms can cause recurring election instability unless both timeouts are raised to match.
- Forgetting TLS between etcd members and clients. etcd peer traffic and client traffic should always be TLS-encrypted, especially in cloud environments.
- Compacting too aggressively. Compacting to the very latest revision can cause "required revision has been compacted" errors in API server watches. Compact to a revision that is at least a few minutes old.
- Not allocating a dedicated network interface for etcd peer traffic. In high-throughput clusters, etcd peer replication traffic can saturate a shared NIC.
- Assuming managed Kubernetes means no etcd knowledge needed. You will be asked about etcd in senior SRE and platform engineer interviews regardless of whether you use managed Kubernetes.
Interview Q&A
Beginner Questions
Q: What is etcd and what does it do in Kubernetes?
etcd is a distributed key-value store that serves as the sole backing data store for Kubernetes. Every Kubernetes object (pods, services, namespaces, secrets, deployments) is stored in etcd. The Kubernetes API server is the only client of etcd. etcd provides strong consistency via the Raft consensus algorithm, ensuring that all API server instances see the same data.
Q: What happens to running pods if etcd goes down?
Running pods continue to run. The kubelet operates independently on each node: it maintains its own local state and does not need etcd to keep containers running. However, no new pods can be scheduled, no deployments can be updated, no healing will occur (if a pod crashes, the controller cannot create a replacement), and kubectl commands that require writes will fail or hang. The cluster is alive but frozen.
Q: Why does Kubernetes use etcd instead of a relational database?
etcd provides linearizable reads and writes across a distributed cluster with built-in leader election and consensus. A relational database would require additional coordination logic to handle leader election and distributed consensus. etcd's watch API also provides efficient push-based change notification, which is fundamental to how Kubernetes controllers work. A SQL database would require polling.
Q: What is the minimum number of etcd nodes for a production cluster?
Three nodes is the minimum for production. A single-node etcd has no fault tolerance. A two-node cluster cannot achieve quorum (requires 2 of 2) when one node fails, providing no benefit over a single node. Three nodes can tolerate one failure while maintaining quorum with the remaining two.
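The arithmetic behind this answer is short enough to check directly. The Python sketch below computes quorum and tolerated failures for small cluster sizes; the function names are just illustrative.

```python
def quorum(members):
    """Smallest number of members that forms a majority."""
    return members // 2 + 1

def fault_tolerance(members):
    """How many members can fail while a majority remains."""
    return members - quorum(members)

for n in range(1, 8):
    print(f"{n} members: quorum={quorum(n)}, tolerated failures={fault_tolerance(n)}")
```

Going from 3 to 4 members raises quorum without raising the number of tolerated failures, which is why even-sized clusters are discouraged.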
Q: How do you check if etcd is healthy?
Run etcdctl endpoint health to check if each member is responding. Run etcdctl endpoint status --write-out=table to see which member is the leader, the current database size, and the Raft index. Run etcdctl alarm list to check for any active alarms such as NOSPACE.
Intermediate Questions
Q: Explain the Raft consensus algorithm at a high level.
Raft divides time into terms. In each term, a leader is elected through a voting process. Any node that does not hear from a leader within its election timeout starts a new term and requests votes from other nodes. The first node to receive votes from a majority (quorum) becomes leader. The leader handles all writes by appending entries to its log and replicating them to followers. A write is committed only after a majority of nodes acknowledge it. This guarantees that committed entries survive the leader crashing, because they already exist on a majority of nodes.
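As an illustration of the "committed once a majority acknowledges it" rule, the sketch below computes the highest log index that is stored on a majority of members. It is a simplified model rather than etcd's implementation: real Raft additionally requires the entry at that index to belong to the leader's current term. The member names and indexes are made up.

```python
def commit_index(match_index):
    """Highest log index replicated on a majority of members.

    match_index maps each member (leader included) to the last log
    index known to be stored on that member.
    """
    ordered = sorted(match_index.values(), reverse=True)
    majority = len(ordered) // 2 + 1
    # The majority-th largest value is present on at least `majority` members.
    return ordered[majority - 1]

three_nodes = {"A": 10, "B": 7, "C": 4}          # A is the leader
five_nodes = {"A": 10, "B": 9, "C": 8, "D": 3, "E": 2}
print(commit_index(three_nodes))
print(commit_index(five_nodes))
```

In the three-node example, entries above index 7 exist only on the leader, so they are not yet committed and could still be lost if the leader crashed.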
Q: What is MVCC in etcd and why is compaction needed?
MVCC (Multi-Version Concurrency Control) means etcd keeps every historical version of every key. Each write increments a global revision counter. Old revisions are retained so that watch clients can replay changes they missed. Without compaction, the database grows indefinitely as every create, update, and delete adds a new revision. Compaction deletes all revisions older than a specified point, freeing space. Defragmentation then reclaims the freed space on disk.
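A toy model can make the revision and compaction behaviour concrete. The Python class below is a deliberately simplified sketch, not how etcd's storage engine is implemented: it keeps every version of every key in a dictionary, bumps a global revision on each write, and discards superseded versions on compaction. All names are hypothetical.

```python
class CompactedError(Exception):
    """Raised when a read asks for a revision that was compacted away."""

class ToyMVCC:
    def __init__(self):
        self.revision = 0     # global revision counter
        self.compacted = 0    # revisions below this are gone
        self.history = {}     # key -> list of (revision, value); None = deleted

    def put(self, key, value):
        self.revision += 1
        self.history.setdefault(key, []).append((self.revision, value))
        return self.revision

    def delete(self, key):
        return self.put(key, None)  # a delete is just another revision

    def get(self, key, rev=None):
        rev = self.revision if rev is None else rev
        if rev < self.compacted:
            raise CompactedError(f"revision {rev} has been compacted")
        value = None
        for r, v in self.history.get(key, []):
            if r <= rev:
                value = v
        return value

    def compact(self, rev):
        for key in list(self.history):
            older = [rv for rv in self.history[key] if rv[0] <= rev]
            newer = [rv for rv in self.history[key] if rv[0] > rev]
            kept = older[-1:]            # only the newest version at or below rev
            if kept and kept[0][1] is None:
                kept = []                # drop keys that were deleted by then
            if kept or newer:
                self.history[key] = kept + newer
            else:
                del self.history[key]
        self.compacted = rev

    def stored_versions(self):
        return sum(len(v) for v in self.history.values())

store = ToyMVCC()
for value in ("v1", "v2", "v3"):
    store.put("/registry/configmaps/app", value)
store.put("/registry/pods/web", "running")
before = store.stored_versions()
store.compact(3)
after = store.stored_versions()
print(before, after, store.get("/registry/configmaps/app"))
```

Reading the current value still works after compaction; only requests for revisions older than the compaction point fail, which is the situation a lagging watcher runs into.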
Q: What is the etcd watch mechanism and how does Kubernetes use it?
etcd watches are long-lived streaming connections. A client specifies a key prefix and etcd pushes an event every time a key in that prefix is created, modified, or deleted. Kubernetes uses this extensively: the scheduler watches for unscheduled pods, controllers watch for their respective resources, kubelet watches for pods assigned to its node, and kube-proxy watches EndpointSlices. The API server maintains watch connections to etcd and multiplexes events to all controller clients.
Q: How would you recover from an etcd space alarm?
First, confirm the alarm with etcdctl alarm list. Then compact the database: get the current revision from etcdctl endpoint status and run etcdctl compact <revision>. Next, defragment each member one at a time with etcdctl defrag. Then disarm the alarm with etcdctl alarm disarm. Finally, verify recovery by checking that writes to the API server succeed and by monitoring db_size. Increase the quota in the etcd configuration to prevent recurrence.
Q: What causes frequent leader elections in etcd?
The most common cause is slow disk I/O. etcd writes every log entry to disk (WAL fsync) before acknowledging it. If fsync takes longer than the election timeout, followers conclude the leader is dead and start elections. This happens most often when etcd is on HDD, network-attached storage with IOPS contention, or a disk that is being I/O starved by other workloads. Network issues can also cause it, but disk I/O is the more common culprit in cloud environments. Check etcd_disk_wal_fsync_duration_seconds histogram and compare it against the election timeout.
Advanced Questions
Q: Describe what happens during a network partition in a 3-node etcd cluster.
Assume nodes A, B, C where A is leader. If C is partitioned away from A and B, A and B retain quorum (2 of 3) and continue operating normally. C stops receiving heartbeats and tries to start an election, but it can only count its own vote. Since C cannot reach quorum (it needs 2 votes and has 1), its election never succeeds and it keeps retrying. What happens when the partition heals depends on C's term. Recent etcd releases enable Raft's Pre-Vote phase by default, so an isolated C never actually increments its term: when it hears A's heartbeat again it simply resumes as a follower, and A replicates the missing entries to it. Without Pre-Vote, C's repeated election attempts leave it with a higher term than A, so A steps down when it sees that term and a new election runs. C cannot win that election because its log is behind, and whichever of A or B wins then brings C up to date. Either way, no committed write is lost. This self-healing behavior is built into Raft.
Q: How does etcd ensure linearizable reads?
By default, etcd read requests are also routed through the Raft consensus mechanism (linearizable reads). The leader must verify it is still the leader before serving a read, which it does by sending a round of heartbeats and confirming quorum before responding. This prevents a partitioned old leader from serving stale reads. The trade-off is latency. etcd also supports serializable reads (reading from any member without consensus verification), which are faster but may return stale data. Kubernetes uses linearizable reads for correctness.
Q: What are the implications of increasing etcd's snapshot-count parameter?
The snapshot-count parameter determines how many applied Raft log entries accumulate before etcd takes a snapshot and truncates the WAL. A higher value means less frequent snapshotting, which reduces disk write amplification but increases memory usage (since more entries are held in memory) and increases the time required to replay the WAL on restart after a crash. A lower value means more frequent snapshotting and faster recovery, but more disk I/O. For most clusters, the default of 100,000 is appropriate. On very write-heavy clusters, lowering it to 50,000 can reduce restart recovery time.
Q: A Kubernetes controller is processing events very slowly, and you notice it keeps receiving the same watch events multiple times. What is the likely etcd cause?
The controller's watch connection is falling behind the revision stream. When a watcher's last-seen revision is compacted away, etcd closes the watch and the API server re-establishes it with a full relist. This floods the controller with a full dump of all objects in its watch scope, which it must process again as if they were all new. The fix is twofold: keep compaction from being too aggressive (compact to a revision that is at least 5 minutes old), and make sure the controller's watch reconnect logic resumes from the ResourceVersion returned by the previous list to avoid unnecessary full relists.
Q: How would you design an etcd backup strategy for a 5-node HA cluster across 3 AZs?
Take a snapshot every 2 hours with a CronJob running on a dedicated control plane node in each AZ, pointing etcdctl at a follower rather than the leader to avoid putting load on the leader. Store snapshots in an S3 bucket with cross-region replication enabled. Retain 7 days of these 2-hourly snapshots, 4 weeks of daily snapshots, and 12 months of monthly snapshots. Test restore procedures quarterly in a throwaway cluster. Use Velero for application-level backup (namespace scoped) as a complement to etcd snapshots. Integrate snapshot success/failure metrics into Prometheus and alert if no successful snapshot has completed in 3 hours. Store the restore runbook in a location accessible without cluster access (e.g., a wiki or a runbook repository) because when you need to restore etcd, your cluster may not be available.
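The retention tiers and the staleness alert in this answer can be expressed as two small functions. This is a minimal Python sketch under stated assumptions: snapshots are identified only by their timestamps, "daily" and "monthly" mean the newest snapshot of each calendar day or month, and all names are hypothetical.

```python
from datetime import datetime, timedelta

def snapshots_to_keep(timestamps, now):
    """Apply the 7-day / 4-week / 12-month retention tiers."""
    keep = set()
    daily, monthly = {}, {}
    for ts in sorted(timestamps):          # ascending, so later entries win
        age = now - ts
        if age <= timedelta(days=7):
            keep.add(ts)                   # keep every recent snapshot
        elif age <= timedelta(weeks=4):
            daily[ts.date()] = ts          # newest snapshot of each day
        elif age <= timedelta(days=365):
            monthly[(ts.year, ts.month)] = ts  # newest snapshot of each month
    return keep | set(daily.values()) | set(monthly.values())

def backup_is_stale(last_success, now, max_age=timedelta(hours=3)):
    """True when the alert 'no successful snapshot in 3 hours' should fire."""
    return now - last_success > max_age

now = datetime(2024, 6, 1)
# One snapshot every 2 hours for roughly the last 60 days.
all_snapshots = [now - timedelta(hours=2 * i) for i in range(720)]
kept = snapshots_to_keep(all_snapshots, now)
print(f"kept {len(kept)} of {len(all_snapshots)} snapshots")
print("stale:", backup_is_stale(now - timedelta(hours=4), now))
```

Anything not in the returned set is a candidate for deletion; an S3 lifecycle policy could implement similar tiers, but that mapping is not shown here.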
20 etcd Best Practices
- Always use an odd number of etcd members (1, 3, or 5). Never 2, 4, or 6.
- Set quota-backend-bytes to 8GB in production from day one.
- Enable auto-compaction-mode: periodic with auto-compaction-retention: "1h".
- Schedule monthly defragmentation during maintenance windows, one member at a time.
- Use dedicated SSD volumes for etcd data with provisioned IOPS (3,000 minimum, 6,000 recommended).
- Never share the etcd data volume with other workloads or the OS root volume.
- Spread etcd members across at least 3 availability zones.
- Keep RTT between etcd members under 10ms.
- Enable mTLS for both client-to-server and peer-to-peer etcd communication.
- Take etcd snapshots at least every 6 hours; before any major change, take a manual snapshot.
- Store etcd snapshots in at least two separate geographic locations.
- Run quarterly restore drills. Never treat an untested backup as valid.
- Alert on etcd_server_has_leader == 0 with a 1-minute window. This is a critical alert.
- Alert on etcd_mvcc_db_total_size_in_bytes exceeding 75% of quota.
- Alert on WAL fsync p99 exceeding 10ms. This is an early warning before election storms start.
- Alert on more than 3 leader changes per hour.
- Monitor etcd_server_slow_apply_total. Nonzero values indicate backend issues.
- Use Prometheus + Grafana with the standard etcd dashboard (dashboard ID 3070 on grafana.com).
- Document a restore runbook and store it outside the cluster being backed up.
- For managed Kubernetes (EKS/GKE/AKS), use Velero for application-level backup and test Velero restores quarterly.
kubeadm HA etcd Configuration
# kubeadm-etcd-ha.yaml
# External etcd cluster configuration for kubeadm HA setup.
# Run 'kubeadm init' with this config on your first control plane node.
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.29.0
controlPlaneEndpoint: "k8s-api.internal.example.com:6443" # Load balancer VIP
etcd:
  external:
    endpoints:
    - https://10.0.0.10:2379
    - https://10.0.0.11:2379
    - https://10.0.0.12:2379
    caFile: /etc/etcd/pki/ca.crt
    certFile: /etc/etcd/pki/apiserver-etcd-client.crt
    keyFile: /etc/etcd/pki/apiserver-etcd-client.key
networking:
  podSubnet: "10.244.0.0/16"
  serviceSubnet: "10.96.0.0/12"
---
# etcd member configuration (one file per etcd node; adjust name and IPs on each)
# /etc/etcd/etcd.conf.yaml
name: etcd-0
data-dir: /var/lib/etcd
listen-client-urls: https://0.0.0.0:2379
advertise-client-urls: https://10.0.0.10:2379
listen-peer-urls: https://0.0.0.0:2380
initial-advertise-peer-urls: https://10.0.0.10:2380
# Keep this on one line: a folded multi-line value would insert spaces after the commas
initial-cluster: etcd-0=https://10.0.0.10:2380,etcd-1=https://10.0.0.11:2380,etcd-2=https://10.0.0.12:2380
initial-cluster-token: etcd-cluster-prod-1
initial-cluster-state: new
# Performance tuning
heartbeat-interval: 100 # ms; increase if network latency > 10ms
election-timeout: 1000 # ms; at least 5x heartbeat-interval, 10x is typical
snapshot-count: 10000
quota-backend-bytes: 8589934592 # 8GB
auto-compaction-mode: periodic
auto-compaction-retention: "1h"
# TLS (the config file nests these settings; flat names such as peer-cert-file are command-line flags)
client-transport-security:
  cert-file: /etc/etcd/pki/server.crt
  key-file: /etc/etcd/pki/server.key
  trusted-ca-file: /etc/etcd/pki/ca.crt
  client-cert-auth: true
peer-transport-security:
  cert-file: /etc/etcd/pki/peer.crt
  key-file: /etc/etcd/pki/peer.key
  trusted-ca-file: /etc/etcd/pki/ca.crt
  client-cert-auth: true
Frequently Asked Questions
Can I use etcd for application data storage?
Technically yes, but you should not. etcd is optimized for small, infrequently-changing configuration data. It is not designed for high-throughput transactional workloads. For application state, use a purpose-built database. etcd should only store Kubernetes control plane state.
How large can a single value be in etcd?
etcd has a default value size limit of 1.5MB per key. Kubernetes enforces this for API objects. Storing large objects (large ConfigMaps, CRD instances with large embedded data) fragments etcd storage and causes performance issues. If you need to store large configuration data, store a reference to an external store (S3, Vault) in the ConfigMap rather than the data itself.
What is the difference between etcdctl v2 and v3?
etcd v2 and v3 use different data models and APIs. v2 had a hierarchical directory-like data model. v3 has a flat key-value model with a prefix-based range query API. Kubernetes has used the etcd v3 API by default since Kubernetes 1.6. With etcdctl older than 3.4, always set ETCDCTL_API=3 when working with Kubernetes etcd; etcdctl 3.4 and later default to v3. Running v2 commands against a v3 cluster will either fail or operate on a completely separate legacy data space.
Can I run etcd as a pod inside Kubernetes?
etcd in kubeadm-managed clusters runs as a static pod, that is, a pod managed directly by the kubelet rather than through the API server. Its manifest lives at /etc/kubernetes/manifests/etcd.yaml. This is intentional: if etcd ran as a regular pod, a failing etcd would prevent the pod from being rescheduled, creating a circular dependency. Static pods avoid this by running independently of the API server.
What is the etcd WAL?
WAL stands for Write-Ahead Log. Before any entry is applied to the etcd key-value store, it is first written to the WAL and fsynced to disk. This ensures that if etcd crashes mid-write, it can replay the WAL on restart and recover all committed entries. The WAL is stored in /var/lib/etcd/member/wal/. WAL fsync latency is one of the most important performance metrics for etcd.
What happens if two etcd members see different leaders at the same time?
Raft's term mechanism prevents this from causing inconsistency. If an old leader (say, term 5) is partitioned and a new leader (term 6) is elected, any writes to the old leader are not committed (quorum cannot be reached without the partitioned nodes). When the partition heals, the old leader receives a message with term 6, recognizes it is behind, steps down to follower, and accepts the new leader's log. Raft guarantees that only one leader can commit writes in any given term.
How do I add a new etcd member to an existing cluster?
Use etcdctl member add <name> --peer-urls=https://<new-peer-ip>:2380 to register the new member. Then start the new etcd process with --initial-cluster-state=existing pointing to the registered cluster. The new member will receive a full snapshot from the leader and catch up. Never start a new member with --initial-cluster-state=new against an existing cluster: it will create a conflicting cluster.
What is the etcd v3 lease mechanism?
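A lease is a time-to-live (TTL) associated with a set of keys. When the lease expires, all keys attached to it are automatically deleted. Kubernetes attaches etcd leases to Event objects so that they disappear after their TTL (one hour by default). Node heartbeats use a similarly named but different mechanism: each node renews a Kubernetes Lease object in the kube-node-lease namespace, which is an ordinary API object stored in etcd rather than an etcd lease. If the node fails to renew it, the node controller marks the node NotReady. This is more efficient than updating a large Node object for every heartbeat.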
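The TTL behaviour of a lease can be modelled in a few lines. The Python class below is a toy sketch, not etcd's implementation: time is passed in explicitly so the example is deterministic, and the class and method names are hypothetical rather than etcd client API calls.

```python
class ToyLeaseStore:
    def __init__(self):
        self.data = {}      # key -> value
        self.leases = {}    # lease_id -> {"ttl", "expires", "keys"}
        self._next_id = 1

    def grant(self, ttl, now):
        """Create a lease that expires ttl seconds after now."""
        lease_id = self._next_id
        self._next_id += 1
        self.leases[lease_id] = {"ttl": ttl, "expires": now + ttl, "keys": set()}
        return lease_id

    def put(self, key, value, lease_id=None):
        self.data[key] = value
        if lease_id is not None:
            self.leases[lease_id]["keys"].add(key)

    def keep_alive(self, lease_id, now):
        """Renew the lease for another full TTL."""
        lease = self.leases[lease_id]
        lease["expires"] = now + lease["ttl"]

    def tick(self, now):
        """Expire leases whose deadline has passed and delete their keys."""
        expired = [i for i, l in self.leases.items() if l["expires"] <= now]
        for lease_id in expired:
            for key in self.leases.pop(lease_id)["keys"]:
                self.data.pop(key, None)

store = ToyLeaseStore()
lease = store.grant(ttl=10, now=0)
store.put("/events/pod-started", "Started container", lease)
store.put("/config/feature", "on")           # no lease: never expires
store.keep_alive(lease, now=8)               # deadline moves to t=18
store.tick(now=12)
alive_at_12 = "/events/pod-started" in store.data
store.tick(now=18)
alive_at_18 = "/events/pod-started" in store.data
print(alive_at_12, alive_at_18, store.data)
```

The key attached to the lease survives as long as the lease is renewed and disappears on the first tick after the deadline, while the unleased key is untouched.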
Can I encrypt data at rest in etcd?
Yes. Kubernetes supports EncryptionConfiguration for the API server, which encrypts specified resource types (most commonly Secrets) before writing them to etcd. The encryption happens in the API server; etcd itself does not know the data is encrypted. This protects against an attacker who gains direct access to the etcd data directory. Configure it via --encryption-provider-config on kube-apiserver.
How does etcd handle a slow follower?
The leader maintains a replication queue for each follower. If a follower falls behind, the leader buffers log entries in memory up to a configurable limit. If the follower falls so far behind that the leader has already snapshotted the needed entries, the leader sends a full snapshot to the slow follower to catch it up. In Kubernetes, this can happen during maintenance windows when a control plane node is temporarily offline.
What is the default etcd port and what runs on each port?
etcd uses two ports by default: 2379 for client-to-server communication (API server and etcdctl connect here) and 2380 for server-to-server peer communication (Raft replication between members). kubeadm additionally sets --listen-metrics-urls so that the metrics and health endpoints are served over plain HTTP on port 2381, bound to localhost by default; without that flag, /metrics is served on the client port. Make sure only the API server has network access to port 2379, and only etcd peers have access to port 2380.
How do I tell which etcd member is the leader?
Run etcdctl endpoint status --write-out=table. The output includes an IS LEADER column. Only one member should show true. If no member shows true, the cluster has lost quorum.
What is the difference between etcd compaction and defragmentation?
Compaction removes old revisions from etcd's logical data model, marking those pages as free in the BoltDB database file. However, the database file on disk does not shrink after compaction: the space is just marked as available for reuse. Defragmentation rewrites the BoltDB file, removing the free pages and actually reducing the on-disk file size. You need both: compaction to reduce logical data, defragmentation to reduce physical disk usage.
Can etcd run without TLS in production?
Technically yes, but you absolutely should not. etcd stores every Kubernetes Secret, RBAC policy, and service account token. Without TLS, any process on the network that can reach port 2379 can read all cluster secrets. In cloud environments, configure security groups to restrict etcd port access and always use TLS with mutual authentication for both client and peer connections.
What is the relationship between etcd and the Kubernetes API server cache?
The API server maintains an in-memory watch cache of all objects it has read from etcd. When clients do kubectl get without specifying --watch, the API server typically serves the response from this cache rather than querying etcd directly. This is why reads still work when etcd is in a degraded state but not completely down. The cache is populated and kept current by the API server's own watch connection to etcd.
Key Takeaways
- etcd is the single source of truth for all Kubernetes state. Every object you have ever created in Kubernetes lives in etcd as a key-value pair under /registry/.
- The API server is the only component that communicates with etcd. Every other Kubernetes component goes through the API server.
- Raft consensus requires a majority (quorum) to commit a write. In a 3-node cluster, 2 nodes must agree. This is what prevents split-brain.
- MVCC means etcd retains all historical revisions. Without periodic compaction, the database will grow until it hits quota and locks the cluster into read-only mode.
- SSD with dedicated IOPS is not optional for production etcd. HDD causes WAL fsync latency that triggers election storms.
- Always use odd node counts (3 or 5). Even numbers offer no additional fault tolerance.
- Set quota-backend-bytes to 8GB and enable auto-compaction from day one. The 2GB default is a time bomb.
- Back up etcd on a schedule, test restore procedures quarterly, and store the restore runbook outside the cluster.
- Monitor WAL fsync latency, leader changes, and db_size. These three metrics will catch 90% of etcd problems before they become outages.
- Even if you use managed Kubernetes (EKS, GKE, AKS), understanding etcd is required for senior platform engineering and SRE roles.