DevOps · 80+ Questions · Beginner to Expert

DevOps Interview Questions & Answers (2026)

80+ DevOps interview questions covering CI/CD, GitOps, deployment strategies, DORA metrics, secrets management, and system design. Real answers, not textbook definitions.

Beginner

Q: What is DevOps and why does it matter?

DevOps is a cultural and technical philosophy that brings Development (Dev) and Operations (Ops) teams together to shorten the software development lifecycle and deliver high-quality software continuously.

**Core principles:**

- **Collaboration:** Dev and Ops work together throughout the entire lifecycle, not as separate silos.
- **Automation:** Automate repetitive tasks: testing, building, deployment, infrastructure provisioning, monitoring.
- **Continuous delivery:** Code changes are automatically tested and deployed to production (or staging) frequently, sometimes dozens of times per day.
- **Measurement:** Track metrics that matter: deployment frequency, lead time, MTTR, change failure rate (DORA metrics).
- **Sharing:** Teams share responsibility for both building features and operating services reliably.

**Why it matters:** Before DevOps, the "wall of confusion" (developers throwing code over to operations and washing their hands of it) led to slow release cycles (quarterly or yearly), high failure rates, and blame-driven cultures. DORA research links DevOps practices to stronger organizational performance: in the 2021 State of DevOps report, elite performers deployed 973× more frequently and recovered from incidents 6,570× faster than low performers.

Q: What is CI/CD and what is the difference between CI and CD?

**Continuous Integration (CI):** CI is the practice of frequently merging developer code changes into a shared repository (multiple times per day), with each merge triggering automated builds and tests.

**What CI does:**

- Runs automated unit tests, integration tests, and linting on every commit.
- Builds the artifact (Docker image, JAR, binary).
- Reports failures immediately to the developer: "fail fast, fail early."
- Prevents "integration hell" by catching conflicts early.

**Continuous Delivery (CD):** CD extends CI by automatically deploying every successfully built artifact to a staging/pre-production environment, so the artifact is always in a releasable state. The deployment to production is automated but waits for a manual approval step.

**Continuous Deployment:** Goes one step further: every successful build is automatically deployed all the way to production without manual approval. Only practical with very high test coverage and confidence.

**Pipeline stages (typical):** Source → Build → Test (unit/integration) → Security Scan → Artifact Push → Deploy to Staging → Acceptance Tests → Deploy to Production (manual gate for Continuous Delivery, automatic for Continuous Deployment).

**Common CI/CD tools:** Jenkins, GitHub Actions, GitLab CI, CircleCI, TeamCity, Argo Workflows, Tekton.
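
The stage ordering and the manual gate described above can be illustrated with a small sketch. This is a toy model, not how any particular CI system works: `Stage`, `run_pipeline`, `build_stages`, and the `approve` callback are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    name: str
    run: Callable[[], bool]        # returns True on success
    manual_gate: bool = False      # True = ask for human approval first


def run_pipeline(stages: List[Stage], approve: Callable[[str], bool]) -> List[str]:
    """Run stages in order; stop at the first failure or rejected approval.

    Returns the names of the stages that completed successfully.
    """
    completed: List[str] = []
    for stage in stages:
        if stage.manual_gate and not approve(stage.name):
            break                  # Continuous Delivery: approval not given yet
        if not stage.run():
            break                  # fail fast: later stages never start
        completed.append(stage.name)
    return completed


def build_stages(tests_pass: bool, continuous_deployment: bool) -> List[Stage]:
    """Hypothetical five-stage pipeline; every stage except tests always succeeds."""
    ok = lambda: True
    return [
        Stage("build", ok),
        Stage("test", lambda: tests_pass),
        Stage("security-scan", ok),
        Stage("deploy-staging", ok),
        # The only difference between the two "CD" flavours is this gate.
        Stage("deploy-production", ok, manual_gate=not continuous_deployment),
    ]
```

With `continuous_deployment=False`, the run stops after `deploy-staging` until `approve` returns `True`; with `continuous_deployment=True`, a green build goes all the way to production.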

Q: What is the difference between blue-green and canary deployments?

Both are strategies to deploy new versions with reduced risk, but they work differently.

**Blue-Green Deployment:**

- Two identical production environments: "blue" (current) and "green" (new version).
- The new version is deployed to "green" and fully tested.
- Traffic is switched all at once (via load balancer or DNS change) from blue to green.
- If something goes wrong, roll back instantly by switching back to blue.
- **Pros:** Instant rollback, zero-downtime switch, simple to understand.
- **Cons:** Requires 2× production infrastructure cost. Irreversible database schema changes are a poor fit, because switching back to blue does not undo them.

**Canary Deployment:**

- The new version initially receives a small percentage of production traffic (e.g., 5%).
- Metrics are monitored (error rate, latency, conversion rate) for the canary group.
- If the canary looks healthy, traffic is gradually shifted: 5% → 25% → 50% → 100%.
- If issues are detected, the canary is removed and 100% of traffic returns to the old version.
- **Pros:** Lower risk, since only a small fraction of users are affected by bugs. Real-world production testing.
- **Cons:** More complex to implement. Requires weighted traffic routing (load balancer, ingress, or service mesh), feature flags, or header-based routing. Old and new versions run simultaneously, so schema and API compatibility are required.

**When to use which:** Blue-green for major changes requiring a clean cutover. Canary for gradual confidence-building on high-risk changes.
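
The difference in control flow between the two strategies can be sketched as follows. This is a simplified model under stated assumptions: `observe_error_rate` is a hypothetical hook standing in for a real metrics query, and the 1% error threshold is an arbitrary illustrative value.

```python
from typing import Callable, Sequence, Tuple

CANARY_STEPS = (5, 25, 50, 100)    # percent of traffic on the new version


def run_canary(
    observe_error_rate: Callable[[int], float],
    max_error_rate: float = 0.01,
    steps: Sequence[int] = CANARY_STEPS,
) -> Tuple[str, int]:
    """Shift traffic step by step; abort as soon as the canary looks unhealthy.

    observe_error_rate(percent) returns the error rate measured on the new
    version while it receives `percent` of traffic (hypothetical hook).
    Returns ("promoted", 100) or ("rolled_back", 0).
    """
    for percent in steps:
        if observe_error_rate(percent) > max_error_rate:
            return ("rolled_back", 0)   # all traffic returns to the old version
    return ("promoted", steps[-1])


def blue_green_switch(active: str, idle_healthy: bool) -> str:
    """Blue-green: a single all-or-nothing switch, gated on a health check."""
    if active not in ("blue", "green"):
        raise ValueError(f"unknown environment: {active}")
    idle = "green" if active == "blue" else "blue"
    return idle if idle_healthy else active
```

The canary function has several decision points (one per traffic step), while the blue-green function has exactly one, which is the trade-off between blast radius and simplicity described above.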

Intermediate

Q: What is GitOps and how does it differ from traditional CI/CD?

GitOps is an operational model where Git is the single source of truth for both application and infrastructure configuration. Changes to the system are made by committing to Git, and an automated agent continuously reconciles the actual system state with the desired state declared in Git.

**GitOps principles (Weaveworks / OpenGitOps):**

1. The entire system (apps + infrastructure) is described declaratively in Git.
2. The desired state is versioned, meaning any change is tracked with commit history.
3. Changes are pulled from Git by an agent (ArgoCD, Flux), not pushed by the CI system.
4. Software agents continuously ensure the actual state matches the desired state (drift correction).

**Traditional CI/CD (push model):** CI pipeline builds the artifact → pipeline runs kubectl apply or helm upgrade (pushing to the cluster). The cluster has no awareness of what "should" be running.

**GitOps (pull model):** CI pipeline builds and pushes the artifact image → updates the image tag in the Git manifest → ArgoCD/Flux agent detects the change → agent pulls and applies the change to the cluster.

**Advantages of GitOps:**

- **Audit trail:** Every change to production is a Git commit: who made it, when, and why.
- **Self-healing:** With automated sync and self-heal enabled, ArgoCD constantly watches for drift and corrects it, so a manual kubectl apply is detected and reverted.
- **Rollback = git revert:** Rollback is as simple as reverting a commit.
- **Developer ergonomics:** Developers use familiar Git workflows (PR, review, merge) for deploying.

**Tools:** ArgoCD, Flux v2, Rancher Fleet, Weave GitOps.
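
The reconciliation idea behind the pull model can be sketched in a few lines. This is only an illustration of the diff-and-apply loop, not how ArgoCD or Flux are implemented; the `State` type (service name mapped to image tag) and both function names are assumptions made for the example.

```python
from typing import Dict, List, Tuple

State = Dict[str, str]    # service name -> image tag


def diff(desired: State, actual: State) -> List[Tuple[str, str, str]]:
    """Return (action, service, tag) tuples needed to make actual match desired."""
    actions: List[Tuple[str, str, str]] = []
    for name, tag in sorted(desired.items()):
        if name not in actual:
            actions.append(("create", name, tag))
        elif actual[name] != tag:
            actions.append(("update", name, tag))
    # Anything running that Git does not declare is drift and gets removed.
    for name in sorted(actual.keys() - desired.keys()):
        actions.append(("delete", name, actual[name]))
    return actions


def reconcile(desired: State, actual: State) -> State:
    """Apply the diff once and return the resulting state (one loop iteration)."""
    new_state = dict(actual)
    for action, name, tag in diff(desired, actual):
        if action == "delete":
            del new_state[name]
        else:
            new_state[name] = tag
    return new_state
```

A real agent runs this comparison continuously, so a manual change to the cluster shows up as a non-empty diff on the next iteration, and a `git revert` simply changes `desired` back to the earlier state.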

Q: Explain the key DevOps metrics (DORA) and how to measure them.

DORA (DevOps Research and Assessment) identified four key metrics that distinguish high-performing engineering teams from low performers. These are the industry-standard measures of software delivery performance. The benchmark figures below are representative; DORA revises the exact thresholds in each annual report.

**1. Deployment Frequency:** How often an organization deploys to production. Elite teams deploy multiple times per day. Low performers deploy monthly or less. Measured by counting production deployments per time period.

**2. Lead Time for Changes:** Time from a code commit to that change running in production. Elite: less than 1 hour. Low performers: 1–6 months. Measured by comparing the commit (or merge) timestamp with the deployment timestamp.

**3. Change Failure Rate:** Percentage of deployments that cause an incident, rollback, or require a hotfix. Elite: 0–15%. Low performers: 46–60%. Measured by tracking incidents tied to deployments.

**4. Mean Time to Recovery (MTTR):** How long it takes to restore service after a production incident (DORA's reports call this "time to restore service"). Elite: less than 1 hour. Low performers: 1 week to 1 month. Measured from incident detection to service restoration.

**How to measure:**

- Deployment frequency: CI/CD system logs (GitHub Actions, Jenkins, ArgoCD).
- Lead time: Git commit timestamps vs. deployment timestamps, tools like LinearB or Sleuth.
- Change failure rate: Incident management systems (PagerDuty) linked to deployment events.
- MTTR: PagerDuty or OpsGenie incident timeline from trigger to resolution.

**A 5th metric (added by DORA in 2021):** Reliability (operational performance), meaning whether teams are meeting their SLO targets.
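
As a rough sketch of how the four metrics fall out of deployment records, assume each deployment is exported as a record with commit, deploy, and (for failures) restore timestamps. The `Deployment` fields and `dora_metrics` function are hypothetical, and recovery time is simplified here to "failed deploy to restore" rather than "incident detection to restore".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import List, Optional


@dataclass
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None   # set once a failed deploy is fixed


def dora_metrics(deploys: List[Deployment], period_days: int) -> dict:
    """Compute the four DORA metrics over a list of production deployments."""
    if not deploys:
        raise ValueError("no deployments in the period")
    lead_times = [d.deployed_at - d.committed_at for d in deploys]
    failures = [d for d in deploys if d.failed]
    restore_times = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deployment_frequency_per_day": len(deploys) / period_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deploys),
        "mean_time_to_restore": (
            sum(restore_times, timedelta()) / len(restore_times)
            if restore_times else None
        ),
    }
```

Using the median for lead time keeps one unusually slow change from dominating the figure; a mean would work too, but it is more sensitive to outliers.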

Q: How would you implement secrets management in a CI/CD pipeline?

Secrets management is a critical security topic. Secrets (API keys, DB passwords, certificates) should never be hardcoded in code, committed environment files, or CI/CD pipeline YAML.

**Tiered approach:**

**1. Build-time secrets in the CI system:** For secrets the pipeline itself needs (Docker Hub credentials, cloud credentials), use the secret store built into your CI system (GitHub Actions Secrets, GitLab CI Variables, Jenkins Credentials). These are encrypted at rest and injected as environment variables during the pipeline run.

**2. Runtime secrets in Kubernetes:** Options, starting from the baseline:

- **Kubernetes Secrets (baseline):** Values are only base64-encoded and are not encrypted at rest by default. Sufficient for low-sensitivity data. Enable encryption at rest with a KMS provider for better security.
- **HashiCorp Vault + Vault Agent Sidecar:** Vault stores secrets, the sidecar authenticates to Vault (via Kubernetes service account JWT), fetches secrets, and writes them to a shared volume. Secrets are never stored as Kubernetes Secret objects. Full audit trail. The gold standard for most enterprises.
- **External Secrets Operator (ESO):** Syncs secrets from AWS Secrets Manager, GCP Secret Manager, Azure Key Vault, or Vault into Kubernetes Secrets automatically. Secrets are managed in the cloud provider's native secret store and synced to K8s on a schedule or on-change. Simpler to operate than Vault Agent.
- **AWS Secrets Manager + IRSA (IAM Roles for Service Accounts):** For EKS clusters, use IRSA to grant Pods IAM permissions to fetch secrets from AWS Secrets Manager directly, with no sidecar required.

**3. Principles to follow:**

- Rotate secrets regularly (automate rotation where possible).
- Audit secret access (who accessed which secret when).
- Use short-lived, scoped credentials (IAM roles, Vault dynamic secrets) over long-lived static credentials.
- Never log secrets; mask them in CI output.
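
Two of the principles above (read secrets from the injected environment instead of hardcoding them, and mask them before anything is logged) can be shown with a minimal sketch. `require_secret` and `mask_secrets` are hypothetical helper names for this example; CI systems typically provide their own masking, and this does not replace it.

```python
import os
from typing import Iterable, Mapping


def require_secret(name: str, env: Mapping[str, str] = os.environ) -> str:
    """Read a secret injected by the CI system or runtime; never hardcode it."""
    value = env.get(name)
    if not value:
        # The error names the variable but never echoes a value.
        raise RuntimeError(f"missing required secret: {name}")
    return value


def mask_secrets(text: str, secrets: Iterable[str], placeholder: str = "***") -> str:
    """Replace every occurrence of each secret value before the text is logged."""
    # Longest first, so a secret that contains another one is masked completely.
    for secret in sorted((s for s in secrets if s), key=len, reverse=True):
        text = text.replace(secret, placeholder)
    return text
```

Failing loudly on a missing secret is a deliberate choice in this sketch: a silent empty default tends to surface later as a confusing authentication error.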

Advanced

Q: Design a CI/CD pipeline for a microservices architecture with 50+ services.

Managing CI/CD for 50+ microservices requires moving from per-service pipelines to a platform-level CI/CD strategy. Key concerns: build times, dependency management, deployment coordination, and observability.

**Monorepo vs. Polyrepo:**

- **Monorepo:** All services in one repo. Simpler dependency management, easier atomic changes across services. Requires tooling (Nx, Turborepo, Bazel) to build only changed services; otherwise every change triggers 50 builds.
- **Polyrepo:** Each service has its own repo. Independent velocity and ownership. Harder to make cross-service changes atomically.

**For 50+ services, recommended architecture:**

**1. Service-level CI (per service or changed service):**

- Every PR triggers: lint, unit tests, build Docker image, security scan (Trivy), push to registry.
- Change detection (in monorepo): only run CI for services with changed files.
- Standardize the CI pipeline structure: all services use the same workflow template, which reduces cognitive load and maintenance.

**2. Centralized GitOps CD (ArgoCD ApplicationSets):**

- Each service maintains its Helm chart / Kustomize manifests in a deployment repo (or "gitops repo").
- CI pipeline updates the image tag in the gitops repo on successful build.
- ArgoCD watches the gitops repo and deploys to the appropriate cluster/namespace.
- Use ArgoCD ApplicationSet with Git generator to automatically create an Argo Application for every service that follows the manifest directory convention.

**3. Promotion workflow:**

- Environments: dev (auto-deploy on merge to main) → staging (auto-deploy, integration tests) → production (manual promotion or canary gating).
- Use semantic versioning and immutable image references (Git SHA tag or image digest, never `latest`).

**4. Deployment coordination for dependent services:**

- Maintain a dependency graph or use contract testing (Pact) to prevent breaking changes.
- Deploy in dependency order using Argo Workflows, and use Argo Rollouts for traffic-shifted rollouts of each service.

**5. Observability:**

- Centralize deployment events in your monitoring system (tag Grafana annotations with deployments).
- Connect deployment events to incident detection (if error rate spikes 5 minutes after a deploy, flag the deployment as suspect).
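
Two of the mechanics above, monorepo change detection and dependency-ordered deployment, can be sketched with the standard library. The `services/<name>/...` directory layout and both function names are assumptions for this example; tools such as Nx or Bazel do change detection with far more precision than a path prefix.

```python
from graphlib import TopologicalSorter   # standard library, Python 3.9+
from typing import Dict, Iterable, List, Set


def changed_services(changed_files: Iterable[str], services_root: str = "services") -> Set[str]:
    """Map changed file paths to service names, assuming services/<name>/... layout."""
    changed: Set[str] = set()
    for path in changed_files:
        parts = path.split("/")
        # Require at least services/<name>/<file> so root-level files are ignored.
        if len(parts) >= 3 and parts[0] == services_root:
            changed.add(parts[1])
    return changed


def deploy_order(dependencies: Dict[str, Set[str]], to_deploy: Set[str]) -> List[str]:
    """Return the services in to_deploy, ordered so that dependencies go first.

    dependencies maps each service to the set of services it depends on.
    """
    full_order = TopologicalSorter(dependencies).static_order()
    return [name for name in full_order if name in to_deploy]
```

In a pipeline, the list of changed files would typically come from something like `git diff --name-only` against the merge base, and only the resulting services would be built and deployed. `TopologicalSorter` raises `CycleError` on circular dependencies, which is a useful early warning in itself.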

Q: How do you implement and measure SLOs in a DevOps context?

Service Level Objectives (SLOs) are the foundation of reliability engineering and represent the bridge between DevOps and SRE practices.

**SLO Framework:**

**SLI (Service Level Indicator):** A quantitative measure of service behavior. Example: "The proportion of HTTP requests that return a success response (2xx) within 200ms."

**SLO (Service Level Objective):** The target value for an SLI. Example: "99.5% of requests return 2xx within 200ms, measured over a 28-day rolling window."

**SLA (Service Level Agreement):** The contractual commitment to customers. It should be looser than your internal SLO (e.g., SLA: 99%, SLO: 99.5%), so the team is alerted before the contract is breached.

**Error Budget:** The allowable amount of bad behavior before the SLO is breached. With a 99.5% SLO over 28 days, the error budget is 0.5% × 28 days × 24 h × 60 min = 201.6 minutes of full outage, or the equivalent share of failed requests.

**Implementing SLOs:**

**Step 1: Define good SLIs.**

- Availability SLI: `sum(rate(http_requests_total{code!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))` (aggregate both sides with `sum`; without it the division matches series label by label and never sees the 5xx series).
- Latency SLI: `histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) < 0.5` (p99 latency under 500 ms).
- Success-ratio SLI from error counters: `1 - (sum(rate(errors_total[5m])) / sum(rate(requests_total[5m])))`

**Step 2: Implement multi-window, multi-burn-rate alerting (Google's model).** Alert on the rate of error budget consumption, not on instantaneous thresholds. Burn rate is the observed error ratio divided by the error ratio the SLO allows; a burn rate of 1 spends the budget exactly over the SLO window. The thresholds recommended in Google's SRE Workbook for a 30-day window:

- Fast burn alert: 2% of the budget consumed in 1 hour (14.4× burn rate), confirmed over a 5-minute short window → page immediately.
- Medium burn alert: 5% of the budget consumed in 6 hours (6× burn rate), confirmed over a 30-minute short window → page.
- Slow burn alert: 10% of the budget consumed in 3 days (1× burn rate), confirmed over a 6-hour short window → ticket/non-urgent alert.

Requiring both the long and the short window to exceed the burn rate avoids false positives from transient spikes.

**Step 3: Error budget policy.** Define what happens when the error budget is depleted:

- Stop new feature deployments.
- Focus the team on reliability work until budget is restored.

**Tools:** Prometheus + Grafana (manual SLO dashboards), Sloth (SLO-as-code with Prometheus rules), Google SLO Generator, Nobl9, Datadog SLO tracking.
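
The error budget arithmetic and the burn-rate idea can be checked with a short sketch. The function names are hypothetical, and the thresholds used in the usage note (14.4 and 6 for a 30-day window) are the ones given in Google's SRE Workbook, not values derived from this page.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Minutes of total failure the SLO allows over the window."""
    return (1 - slo) * window_days * 24 * 60


def burn_rate(error_ratio: float, slo: float) -> float:
    """How fast the budget is being spent; 1.0 means exactly on budget."""
    return error_ratio / (1 - slo)


def burn_rate_threshold(budget_fraction: float, alert_window_hours: float, window_days: int) -> float:
    """Burn rate that consumes budget_fraction of the budget within alert_window_hours."""
    return budget_fraction * window_days * 24 / alert_window_hours


def should_alert(long_window_ratio: float, short_window_ratio: float, slo: float, threshold: float) -> bool:
    """Multi-window check: both the long and the short window must exceed the threshold."""
    return (
        burn_rate(long_window_ratio, slo) >= threshold
        and burn_rate(short_window_ratio, slo) >= threshold
    )
```

For example, `error_budget_minutes(0.995, 28)` gives roughly 201.6, `burn_rate_threshold(0.02, 1, 30)` gives roughly 14.4, and `burn_rate_threshold(0.05, 6, 30)` gives roughly 6. The short window in `should_alert` is what lets the alert clear quickly once the error ratio drops, even though the long window still contains the earlier errors.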
