By Ravi Kapoor · 70+ Questions · Beginner to Expert

SRE Interview Questions & Answers (2026)

70+ Site Reliability Engineering questions with expert answers. SLOs, SLIs, error budgets, toil reduction, blameless post-mortems, chaos engineering, and observability design.

Beginner

Q: What is Site Reliability Engineering (SRE) and how is it different from DevOps?

**Site Reliability Engineering (SRE)** is a discipline created at Google that applies software engineering practices to operations problems. SREs treat operations as a software problem: they automate repetitive work, build observability systems, define reliability targets quantitatively, and use engineering rigor to improve system reliability.

**Core SRE principles (from the Google SRE book):**
- Define reliability targets quantitatively (SLOs)
- Eliminate toil (manual, repetitive operational work) through automation
- Balance reliability and velocity using error budgets
- Reduce operational complexity through simplicity and standardization
- Use blameless post-mortems to learn from failures without assigning blame

**SRE vs. DevOps:** DevOps is a cultural philosophy that emphasizes collaboration between Development and Operations teams, shared ownership, and continuous delivery. It is broad and applies to the entire software delivery lifecycle.

SRE is a specific implementation of DevOps principles with a strong emphasis on reliability engineering. Google describes SRE as "what happens when you ask a software engineer to design an operations team." SRE prescribes specific practices (error budgets, SLOs, toil budgets) that DevOps does not.

Key differences:
- SRE defines reliability quantitatively; DevOps focuses on collaboration and culture.
- In Google's model, operational work (on-call, tickets, manual tasks) is capped at about 50% of an SRE's time, so that at least 50% goes to engineering project work. DevOps prescribes no such split.
- DevOps is a mindset; SRE is a job function with defined responsibilities.

At many companies, the terms are used interchangeably. The distinction matters most in organizations where dedicated SRE teams exist alongside product engineering teams.

Q: What are SLI, SLO, and SLA? How do they relate to each other?

These three terms form the reliability measurement framework that distinguishes SRE from traditional operations.

**SLI (Service Level Indicator):** A quantitative measure of some aspect of the service's behavior. It is the raw metric. Examples:
- "The fraction of HTTP requests that complete in under 200ms"
- "The fraction of API calls that return a non-5xx response"
- "The fraction of login attempts that succeed"

SLIs should measure what users actually experience, not internal metrics like CPU usage.

**SLO (Service Level Objective):** A target value or range for an SLI. The reliability goal the team commits to internally. Examples:
- "99.5% of requests complete in under 200ms, measured over a 28-day rolling window"
- "99.9% of API calls return non-5xx responses per calendar month"

SLOs are internal targets. They are stricter than SLAs because there must be a buffer between "we missed our SLO" and "we breached our SLA."

**SLA (Service Level Agreement):** A contractual agreement with customers that defines the reliability commitment and the consequences for failure (credits, refunds). SLAs should always be more conservative than SLOs to give you a safety margin.

**Error Budget:** The complement of the SLO (100% minus the target), that is, the amount of unreliability permitted within the SLO window.
- SLO: 99.9% availability → Error budget = 0.1% of time = ~43.8 minutes/month
- SLO: 99.5% availability → Error budget = 0.5% of time = ~219 minutes/month

Both figures assume an average month of about 30.44 days; a 30-day window gives 43.2 and 216 minutes.

The error budget is shared between reliability incidents and the risk from new feature deployments. If the budget is depleted, feature deployments are typically paused and effort shifts to reliability work until the budget recovers.
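The error-budget arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `error_budget_minutes` and the `AVERAGE_MONTH_DAYS` constant are illustrative, not part of any standard library or tool.

```python
def error_budget_minutes(slo_percent: float, window_days: float) -> float:
    """Allowed unavailability, in minutes, for an availability SLO over a window."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    window_minutes = window_days * 24 * 60
    # The budget is the complement of the SLO applied to the whole window.
    return window_minutes * (100 - slo_percent) / 100


# Assumption: "month" means an average month (365.25 / 12 days).
AVERAGE_MONTH_DAYS = 365.25 / 12

for slo in (99.9, 99.5):
    budget = error_budget_minutes(slo, AVERAGE_MONTH_DAYS)
    print(f"SLO {slo}% -> about {budget:.0f} minutes/month")
```

Passing `window_days=28` instead gives the budget for the 28-day rolling window used in the SLO example.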

Intermediate

Q: What is toil in SRE and how do you eliminate it?

**Toil** is operational work that is manual, repetitive, automatable, tactical (reactive), and devoid of lasting value. Google's SRE model holds that SREs should spend at most 50% of their time on toil, with the other 50% dedicated to engineering work that reduces future toil and improves reliability.

**Characteristics of toil (the more that apply, the more likely the work is toil):**
- **Manual:** Requires a human to execute it.
- **Repetitive:** The same task performed over and over.
- **Automatable:** Could be replaced by software.
- **Tactical / reactive:** Triggered by an event, not part of a strategic improvement.
- **No enduring value:** Completing it does not make the system better for the future.
- **Scales linearly with service growth:** As traffic doubles, the toil doubles.

**Examples of toil:**
- Manually rotating SSL certificates every 90 days.
- Answering the same "what's the status of X?" Slack messages by looking at dashboards.
- Running weekly manual database backups.
- Provisioning developer environments step-by-step.
- Restarting a crashed service at 2 AM.

**How to eliminate toil:**
1. **Measure it first:** Track toil hours per sprint. Google recommends a 50% maximum; if you are over, prioritize elimination.
2. **Automate the immediate task:** Write the script, cron job, or operator that handles the repetitive action.
3. **Fix the underlying cause:** If a service crashes every week and needs a manual restart, fix the crash; don't just automate the restart. Automating toil without fixing root causes is "toil laundering."
4. **Build self-service tooling:** Runbooks → automated runbooks (for example, Argo Workflows) → self-service UI.
5. **Use managed services:** Replace self-managed infrastructure with managed alternatives (e.g., replace self-managed Elasticsearch with Amazon OpenSearch Service).

**Impact measurement:** Track hours-of-toil-eliminated per quarter as an SRE team productivity metric.
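As a rough illustration of the "measure it first" step, the sketch below computes the share of a sprint spent on toil and checks it against the 50% ceiling. The `SprintLog` class, its field names, and the sample hours are hypothetical; real teams would pull these numbers from their ticketing or time-tracking system.

```python
from dataclasses import dataclass

TOIL_CAP = 0.50  # the 50% ceiling described above


@dataclass
class SprintLog:
    """Hours a team logged in one sprint (hypothetical structure)."""
    toil_hours: float
    engineering_hours: float

    @property
    def toil_fraction(self) -> float:
        total = self.toil_hours + self.engineering_hours
        return self.toil_hours / total if total else 0.0


def over_cap(log: SprintLog, cap: float = TOIL_CAP) -> bool:
    """True when toil exceeds the cap and elimination work should be prioritized."""
    return log.toil_fraction > cap


# Illustrative numbers only.
sprint = SprintLog(toil_hours=45, engineering_hours=35)
print(f"toil share: {sprint.toil_fraction:.0%}, over cap: {over_cap(sprint)}")
```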

Q: Walk me through how you would conduct a blameless post-mortem.

A blameless post-mortem is a structured analysis of a significant incident, focused on systemic causes rather than individual mistakes. The goal is organizational learning, not blame assignment.

**Blameless culture rationale:** People make mistakes when working in complex systems with incomplete information. Blaming individuals for system failures:
1. Causes people to hide mistakes or cover their tracks.
2. Discourages the risk-taking necessary for fast delivery.
3. Fails to address the systemic conditions that caused the incident.

**Post-mortem structure:**

**1. Incident Timeline:** Reconstruct the sequence of events with exact timestamps: when the issue started, when it was detected, what actions were taken, when it was resolved. Use logs, monitoring data, and chat history.

**2. Impact Statement:** Quantify the impact: how many users affected? For how long? SLO impact? Revenue impact? This establishes the severity and importance of the findings.

**3. Root Cause Analysis:** Use the "5 Whys" or a fishbone diagram to find the underlying system conditions, not just the proximate cause. The root cause is rarely "human error"; it is the system condition that made that human error possible.

**4. Contributing Factors:** All the conditions that contributed to the incident: insufficient monitoring, lack of automated testing, unclear runbooks, missing alerts, knowledge silos, etc.

**5. Action Items:** Concrete, specific, assigned, and time-bounded follow-up tasks. Each action item should have an owner and a deadline. Categories:
- **Prevent recurrence:** Fix the root cause.
- **Detect sooner:** Improve alerting/monitoring to reduce MTTD.
- **Recover faster:** Improve runbooks, automate recovery, reduce MTTR.

**6. Lessons Learned:** What did the team learn? What should other teams know? These go into a searchable post-mortem database.

**What to avoid:**
- "Bob pushed a bad deploy" (blame)
- "We need to be more careful" (non-actionable)
- Incomplete action items without owners or due dates
- Not sharing post-mortems broadly (knowledge should flow across teams)
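The timeline and the detect/recover action-item categories lend themselves to simple arithmetic. The sketch below derives time-to-detect and time-to-recover for a single incident from three timestamps. The function name and the sample timestamps are hypothetical, and measuring recovery from incident start (rather than from detection) is an assumption, since teams define these metrics differently.

```python
from datetime import datetime


def incident_durations(started: datetime, detected: datetime, resolved: datetime) -> dict:
    """Per-incident detection and recovery durations, in minutes."""
    if not started <= detected <= resolved:
        raise ValueError("timestamps must be in chronological order")
    return {
        "time_to_detect_min": (detected - started).total_seconds() / 60,
        # Assumption: recovery time is measured from the start of the incident.
        "time_to_recover_min": (resolved - started).total_seconds() / 60,
    }


# Illustrative timestamps, not taken from a real incident.
durations = incident_durations(
    started=datetime(2026, 1, 10, 2, 0),
    detected=datetime(2026, 1, 10, 2, 12),
    resolved=datetime(2026, 1, 10, 2, 47),
)
print(durations)  # {'time_to_detect_min': 12.0, 'time_to_recover_min': 47.0}
```

Averaging these values across incidents in the post-mortem database gives the MTTD and MTTR trends that the action items aim to reduce.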

Q: What is chaos engineering and how do you implement it safely?

Chaos engineering is the practice of deliberately injecting failures into production (or staging) systems to discover weaknesses before they manifest as unplanned outages. The famous example is Netflix's Chaos Monkey, which randomly terminates EC2 instances to verify that systems tolerate instance failures.

**The principle:** Systems that have never been tested for failure will fail in unpredictable ways when failure eventually occurs. Chaos engineering turns unpredictable failures into known, managed ones.

**The Chaos Engineering maturity model:**

**Level 1 (chaos in staging):** Start with non-production environments. Practice failure injection safely. Build team confidence and processes.

**Level 2 (chaos in production with explicit controls):** Inject failures in production during low-traffic periods with explicit blast radius controls. Define a "halt" condition: if X metric breaches Y threshold, stop the experiment immediately.

**Level 3 (continuous chaos):** Automated chaos experiments that run on a schedule in production without manual intervention. Only viable with high system maturity and strong observability.

**How to run a chaos experiment:**
1. **Define steady state:** What does healthy look like? (e.g., error rate < 0.1%, p99 latency < 200ms)
2. **Form a hypothesis:** "We believe the system will maintain steady state when we terminate one of three database replicas."
3. **Define blast radius:** Limit the experiment to 5% of traffic or a specific region.
4. **Inject the failure:** Use a chaos engineering tool.
5. **Observe:** Watch your observability stack for deviation from steady state.
6. **Halt if steady state is broken:** Roll back the experiment and fix the issue.
7. **Document findings:** Record what happened and what systemic improvements are needed.

**Chaos engineering tools:**
- **LitmusChaos (CNCF):** Kubernetes-native chaos experiments (pod kill, node drain, network latency, CPU stress).
- **Chaos Monkey (Netflix):** Random EC2 termination.
- **Gremlin:** Commercial SaaS platform for enterprise chaos engineering with a UI and controlled experiments.
- **AWS Fault Injection Service (FIS, formerly Fault Injection Simulator):** Native AWS tool for chaos on AWS services.
- **Chaos Toolkit:** Open-source framework for defining and running chaos experiments as code.
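The experiment steps above (check steady state, inject, observe, halt, roll back) can be sketched as a small control loop. This is a tool-agnostic illustration, not the API of any of the tools listed: `inject`, `rollback`, and `read_error_rate` are hypothetical callables that you would wire to your own chaos tool and metrics backend.

```python
import time
from typing import Callable


def run_experiment(
    inject: Callable[[], None],
    rollback: Callable[[], None],
    read_error_rate: Callable[[], float],
    max_error_rate: float = 0.001,  # steady state: error rate below 0.1%
    checks: int = 10,
    interval_s: float = 1.0,
) -> bool:
    """Return True if steady state held for the whole experiment, else False."""
    # Never start an experiment on a system that is already unhealthy.
    if read_error_rate() >= max_error_rate:
        raise RuntimeError("system is not in steady state; aborting before injection")

    inject()
    try:
        for _ in range(checks):
            if read_error_rate() >= max_error_rate:
                return False  # halt condition hit: hypothesis disproved
            time.sleep(interval_s)
        return True
    finally:
        rollback()  # always remove the fault, even on halt or exception
```

Putting the rollback in a `finally` block is the key design choice here: the injected fault is removed whether the hypothesis holds, the halt condition fires, or the metrics call itself raises.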

Advanced

Q: Design an observability strategy for a microservices system.

Observability for microservices is about understanding the internal state of a system from its external outputs. The three pillars are metrics, logs, and traces, but modern observability adds profiling and events.

**The Three Pillars:**

**Metrics (time-series data):** Numerical measurements over time. Use the RED method for services:
- **Rate:** Requests per second per service.
- **Errors:** Error rate per service.
- **Duration:** Latency percentiles (p50, p90, p99, p99.9).

And the USE method for infrastructure:
- **Utilization:** CPU %, memory %, disk %.
- **Saturation:** Queue depth, run queue length.
- **Errors:** Hardware errors, network drops.

Tool: Prometheus + Grafana. For long-term retention: Thanos or Cortex. For cloud: Datadog, New Relic.

**Logs (event records):** Structured logs (JSON format) from every service. Essential for debugging specific requests or failures.
- Emit structured logs: `{"level":"error","service":"order-service","trace_id":"abc123","message":"payment failed","user_id":"u-789","latency_ms":453}`
- Include trace IDs to correlate logs with traces.
- Use log aggregation: Loki (with Grafana), ELK (Elasticsearch/Logstash/Kibana), Datadog Logs.

**Distributed Traces (request flows):** Show the path of a request through multiple services. Essential for debugging latency and failures in microservices.
- Instrument with the OpenTelemetry SDK (the vendor-neutral CNCF standard, supported by most backends).
- Store in Jaeger (open source) or Tempo (Grafana stack).
- Critical trace data: span durations, error spans, service dependencies.

**The practical strategy:**

**Instrument at the platform level:**
- Deploy the OpenTelemetry Collector as a DaemonSet on Kubernetes.
- Instrument all services with the OTel SDK (language-specific).
- Auto-instrument frameworks (Spring Boot, FastAPI, Express) where possible.

**Define SLO-based alerting:** Alert on SLO burn rates, not on raw metrics. A burn rate is how fast the error budget is being consumed relative to the rate that would exactly exhaust it at the end of the SLO window. Multi-window burn rate alerts (for example, a 1h window confirmed by a 5m window) dramatically reduce false positives.
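To make the multi-window burn-rate idea concrete, here is a minimal sketch of the decision logic. The function names are illustrative, and the default threshold of 14.4 is an assumption: it is a commonly cited paging threshold for a 1h/5m window pair on a 30-day SLO, so tune it to your own budget policy. In practice this logic usually lives in alerting rules rather than application code.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    budget = 1.0 - slo  # e.g. 0.001 for a 99.9% SLO
    if budget <= 0:
        raise ValueError("slo must be below 1.0 to leave an error budget")
    return error_ratio / budget


def should_page(
    error_ratio_5m: float,
    error_ratio_1h: float,
    slo: float = 0.999,
    threshold: float = 14.4,  # assumed threshold, see lead-in
) -> bool:
    """Page only when both the long and the short window burn too fast."""
    long_window_hot = burn_rate(error_ratio_1h, slo) >= threshold
    short_window_hot = burn_rate(error_ratio_5m, slo) >= threshold
    return long_window_hot and short_window_hot


# Sustained 2% errors in both windows: burn rate is about 20x, so page.
print(should_page(error_ratio_5m=0.02, error_ratio_1h=0.02))
# A brief spike that the 1h window has not confirmed: do not page.
print(should_page(error_ratio_5m=0.02, error_ratio_1h=0.001))
```

Requiring both windows is what cuts false positives: the long window filters out short blips, and the short window stops the alert from firing long after the problem has already recovered.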
**Correlation workflow:** When an alert fires:
1. Grafana dashboard: Is it widespread or isolated to one service? (Metrics)
2. Distributed trace for an affected request: Which service is slow or erroring? (Traces)
3. Logs for that specific service at the time of the error: What exactly went wrong? (Logs)

This metrics → traces → logs workflow maps directly onto the Grafana "LGTM" stack (Loki, Grafana, Tempo, Mimir), which is increasingly the open-source standard.
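The last step of the workflow only works if each log line carries a trace ID. The sketch below emits the JSON log shape shown earlier using Python's standard `logging` module. The `JsonFormatter` class and the convention of passing extra fields under a `fields` key are illustrative choices, not a standard; in a real service the trace ID would come from the active tracing context rather than a literal.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname.lower(),
            "service": self.service,
            "message": record.getMessage(),
        }
        # Hypothetical convention: structured fields arrive via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter("order-service"))
logger = logging.getLogger("order-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False  # avoid duplicate output through the root logger

logger.error(
    "payment failed",
    extra={"fields": {"trace_id": "abc123", "user_id": "u-789", "latency_ms": 453}},
)
```

With `trace_id` present on every line, a log aggregator can filter to exactly the request identified in the trace view.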

About the author

Ravi Kapoor

Senior DevOps Engineer & Technical Writer

CKA & AWS SA-Pro Certified · 9 yrs, Atlassian & Fintech · Kubernetes open-source contributor

Ravi is a senior DevOps engineer with 9 years of experience building cloud-native infrastructure at Atlassian and multiple fintech companies. CKA and AWS Solutions Architect Professional certified, he has managed Kubernetes clusters serving millions of daily users and contributes to open-source tooling.
