MLOps Interview Questions & Answers (2026)
65+ MLOps interview questions with expert answers. Model deployment patterns, feature stores, drift detection, CI/CD for ML, Kubeflow, MLflow, and production ML system design.
Beginner
Q: What is MLOps and why is it important?
MLOps (Machine Learning Operations) is a set of practices that combines ML, DevOps, and Data Engineering to reliably and efficiently deploy and maintain ML models in production. The term was coined by analogy with DevOps — just as DevOps brought engineering discipline to software deployment, MLOps brings it to ML model deployment.
**Why most ML projects fail without MLOps:**
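Widely cited industry estimates put the share of ML projects that never reach production at 85–90%, although the exact figure varies by survey and is hard to verify. Of those that do ship, many degrade silently over time without anyone noticing. The barriers: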
- Moving from a Jupyter notebook experiment to a reliable production system requires significant engineering.
- ML models have a "freshness" problem — the world changes, and models trained on old data become stale.
- Debugging a model that is performing poorly in production is much harder than debugging a software bug.
- Data pipelines, training pipelines, and serving infrastructure all need to be maintained.
**What MLOps solves:**
1. **Reproducibility:** Anyone can reproduce any experiment with the same data and code.
2. **Deployment:** Models go from training to production reliably through automated pipelines.
3. **Monitoring:** Automatic detection of model degradation, data drift, and prediction quality issues.
4. **Retraining:** Automated pipelines that retrain and redeploy models when needed.
5. **Governance:** Audit trails, approval workflows, and compliance for production models.
**The ML lifecycle (what MLOps covers):**
Data preparation → Feature engineering → Model training → Model evaluation → Model registry → Model deployment → Model monitoring → Retraining trigger → repeat.
Q: What is the difference between a batch prediction pipeline and a real-time inference API?
These are the two primary model serving patterns, each suited to different use cases.
**Batch Prediction Pipeline:**
- Runs periodically (hourly, daily, weekly) on a large dataset.
- Loads the model, processes all records, writes predictions to a database or file, and shuts down.
- High throughput, lower infrastructure cost.
- Predictions are not immediately available: each one can be stale by up to one batch interval, plus the run time of the job itself.
**Use cases:** Recommendation emails sent at midnight, fraud risk scores computed nightly for all accounts, demand forecasting for next week's inventory.
**Implementation:** Apache Spark, AWS SageMaker Batch Transform, dbt + model inference UDFs, Airflow DAG.
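As a rough illustration of the load, score, write, shut down shape described above, here is a minimal sketch in plain Python. The `toy_model` function, the `account_id` and `amount` fields, and the chunk size are hypothetical stand-ins; a real job would load a trained artifact and write to a database or file.

```python
def chunked(records, size):
    """Yield lists of at most `size` records so memory use stays bounded."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def toy_model(rows):
    """Hypothetical stand-in for a trained model: scores a whole chunk at once."""
    return [min(1.0, row["amount"] / 1000.0) for row in rows]


def run_batch_job(records, model, chunk_size=2):
    """Score every record and return rows ready to be written to a table or file."""
    output = []
    for chunk in chunked(records, chunk_size):
        scores = model(chunk)
        for row, score in zip(chunk, scores):
            output.append({"account_id": row["account_id"], "risk_score": score})
    return output


accounts = [
    {"account_id": "a1", "amount": 250.0},
    {"account_id": "a2", "amount": 1200.0},
    {"account_id": "a3", "amount": 40.0},
]
predictions = run_batch_job(accounts, toy_model)
```

Scoring in chunks rather than row by row is what gives batch jobs their throughput advantage, since most models are much faster on many rows at once.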
**Real-time Inference API:**
- A persistent web service that responds to individual prediction requests within milliseconds.
- Keeps the model loaded in memory for fast responses.
- Higher infrastructure cost (always-on), requires autoscaling for traffic spikes.
- Prediction is immediately available.
**Use cases:** Real-time fraud detection at payment time, chatbot responses, image classification during upload, product recommendations while the user is browsing.
**Implementation:** FastAPI + model loaded in memory, BentoML, Seldon Core, KServe on Kubernetes, AWS SageMaker Real-Time Endpoints.
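The key property of the real-time pattern, loading the model once and reusing it for every request, can be sketched without any web framework. Everything here is illustrative: `OnlineScorer`, `load_toy_model`, and the feature names are hypothetical, and a real service would wrap `handle` in an HTTP endpoint.

```python
import time


class OnlineScorer:
    """Holds a model in memory and scores one request at a time."""

    def __init__(self, load_model):
        # Loading happens once at service start-up, not per request.
        self.model = load_model()

    def handle(self, request):
        started = time.perf_counter()
        score = self.model(request["features"])
        latency_ms = (time.perf_counter() - started) * 1000.0
        return {"score": score, "latency_ms": latency_ms}


def load_toy_model():
    """Hypothetical loader; a real service would deserialize a trained artifact."""
    weights = {"amount": 0.001, "is_foreign": 0.3}

    def predict(features):
        raw = sum(w * features.get(name, 0.0) for name, w in weights.items())
        return max(0.0, min(1.0, raw))  # clamp to a probability-like range

    return predict


scorer = OnlineScorer(load_toy_model)
response = scorer.handle({"features": {"amount": 200.0, "is_foreign": 1.0}})
```

Returning the measured latency alongside the score is one simple way to feed the latency dashboards that an always-on service needs.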
**Hybrid — Near-real-time (streaming):**
A third pattern: predictions computed asynchronously as events arrive, stored in a feature store or cache, and served immediately to the application. Balances latency and cost. Example: precomputing fraud scores for all active card transactions as they arrive via Kafka.
Intermediate
Q: What is a feature store and why is it needed?
A feature store is a centralized repository for storing, sharing, and serving ML features. It solves the "training-serving skew" problem — one of the most common and costly issues in production ML systems.
**The problem it solves:**
Without a feature store, the same feature (e.g., "user's average order value in the last 30 days") might be computed differently in:
- The training pipeline (offline, using a SQL query over historical data)
- The serving API (online, using a different query or logic under time pressure)
Even small inconsistencies between training and serving features can significantly degrade model performance. This training-serving skew is often invisible and notoriously hard to debug.
**What a feature store provides:**
**Offline store:** Historical feature data for training. Enables point-in-time correct feature retrieval — ensuring you always retrieve the feature value as it existed at the time of the label, not today's value (which would cause data leakage).
**Online store:** Low-latency feature retrieval for real-time inference, typically in the single-digit-millisecond range or below depending on the backing store. Features are pre-computed and stored in a key-value store such as Redis or DynamoDB.
**Feature registry:** Catalog of available features with documentation, owners, lineage, and usage statistics.
**Feature pipelines:** Automated computation of features from raw data, keeping both stores in sync.
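Point-in-time correctness is easier to see in code. The sketch below is not how any particular feature store implements it; it assumes a feature history kept as a sorted list of `(timestamp, value)` pairs and treats a value written exactly at the label time as visible, which is a design choice.

```python
from bisect import bisect_right
from datetime import datetime


def point_in_time_value(history, as_of):
    """Return the latest feature value with timestamp <= as_of, else None.

    `history` is a list of (timestamp, value) pairs sorted by timestamp.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)
    if idx == 0:
        return None  # the feature did not exist yet at label time
    return history[idx - 1][1]


# Hypothetical history of "average order value in the last 30 days" for one user.
avg_order_value = [
    (datetime(2025, 1, 1), 40.0),
    (datetime(2025, 2, 1), 55.0),
    (datetime(2025, 3, 1), 90.0),
]

label_time = datetime(2025, 2, 15)
training_value = point_in_time_value(avg_order_value, label_time)  # value as of the label
leaky_value = avg_order_value[-1][1]  # today's value, which leaks the future
```

Training on `leaky_value` instead of `training_value` would give the model information it could not have had when the label was generated.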
**Popular feature stores:**
- **Feast (open source):** Hosted by the LF AI & Data Foundation (part of the Linux Foundation). Supports multiple offline stores (Snowflake, BigQuery, Parquet files) and online stores (Redis, DynamoDB).
- **Tecton:** Enterprise SaaS feature platform. Supports real-time, near-real-time, and batch features.
- **Hopsworks:** Open-core feature store with strong streaming support.
- **Databricks Feature Store:** Native Databricks integration, easy if you already use Databricks.
- **AWS SageMaker Feature Store:** Managed AWS option, tight SageMaker integration.
**When you need a feature store:**
When you have multiple models using the same features, when training-serving skew is causing unexplained model degradation, or when your ML team is duplicating feature computation across different notebooks and pipelines.
Q: What is data drift and model drift, and how do you detect and handle them?
Drift is the phenomenon where a model's performance degrades over time because the real-world data it sees in production no longer resembles the data it was trained on. It is one of the most pervasive problems in production ML.
**Types of drift:**
**Data drift (feature drift / covariate shift):**
The distribution of input features changes in production compared to training. Example: a customer churn model trained on pre-pandemic behavior now receives post-pandemic user patterns it has never seen.
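**Concept drift:**
The relationship between features and labels changes, so the same input features now map to different outputs than during training. Example: a spam filter trained on 2023 email patterns faces entirely new spam tactics in 2026. This is distinct from label drift (also called target shift or prior probability shift), where the overall distribution of labels changes, for example the fraud rate doubling, even if the feature-to-label relationship still holds.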
**Prediction drift:**
The distribution of model outputs changes — the model is predicting different classes or values than it did historically, even if the inputs look the same.
**Label quality degradation:**
Ground truth labels become unavailable, delayed, or unreliable.
**Detecting drift:**
**Statistical tests for feature drift:**
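- **Population Stability Index (PSI):** Compares a feature's binned distribution between a training (reference) window and a production window. A common rule of thumb: PSI below 0.1 is stable, 0.1 to 0.2 is a moderate shift worth watching, and above 0.2 indicates significant drift.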
- **Kolmogorov-Smirnov test:** For continuous features — tests if two distributions are significantly different.
- **Chi-squared test:** For categorical features.
- **Jensen-Shannon divergence:** Symmetric measure of distribution similarity.
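To make the PSI check concrete, here is a minimal sketch in plain Python. It assumes equal-width bins derived from the reference sample and a small `eps` floor for empty bins; both are common choices rather than a standard, and libraries differ in their binning strategy.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a reference and a production sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins
    if width == 0:
        width = 1.0  # degenerate reference sample: everything lands in bin 0

    def bin_fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[max(0, min(bins - 1, idx))] += 1  # clamp out-of-range values
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


training = [i / 100 for i in range(100)]           # roughly uniform on [0, 1)
production = [0.5 + i / 200 for i in range(100)]   # mass shifted to [0.5, 1)

no_drift = psi(training, training)
drifted = psi(training, production)
```

Each term `(a - e) * ln(a / e)` is non-negative, so the index only grows as the two binned distributions move apart, and identical samples score zero.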
**Monitoring model performance:**
- Compare predictions to ground truth, accounting for label latency. For fraud, labels typically arrive days to weeks later (for example via chargebacks); for product recommendations, labels (clicks or purchases) arrive within hours.
- Track business metrics (conversion rate, revenue per recommendation) as proxies for model performance when labels are unavailable.
**Tools:**
- **Evidently AI (open source):** Generates drift reports and dashboards from pandas DataFrames.
- **WhyLabs:** Enterprise ML observability platform.
- **Arize:** Production ML monitoring and debugging.
- **Fiddler:** Explainable AI and model monitoring.
**Handling drift:**
1. **Alert and investigate:** Not all drift requires retraining immediately. Investigate whether the drift is meaningful (affecting predictions) or benign.
2. **Retrain with recent data:** Most common response. Schedule regular retraining (weekly, monthly) to keep the model fresh.
3. **Automated retraining pipelines:** Trigger retraining when drift metrics exceed thresholds.
4. **Feature engineering adjustment:** If a specific feature is responsible for the drift, revisit how it is computed (for example, normalize it against a recent window) or add a feature that captures the new distributional pattern.
5. **Segment-specific models:** If drift is concentrated in a specific user segment or geography, train segment-specific models.
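The first three responses can be combined into a simple decision rule. This is only a sketch: the function name, the `metric_drop` input, and the 0.02 tolerance are hypothetical, and the 0.2 PSI threshold is the rule of thumb mentioned earlier.

```python
def drift_action(psi_by_feature, metric_drop, psi_alert=0.2, max_metric_drop=0.02):
    """Map drift signals to an action: 'retrain', 'investigate', or 'none'.

    psi_by_feature: feature name -> PSI against the training window.
    metric_drop: drop in the live quality metric versus the offline baseline.
    """
    drifted = sorted(name for name, value in psi_by_feature.items() if value > psi_alert)
    if metric_drop > max_metric_drop:
        return ("retrain", drifted)      # quality is measurably worse: trigger the pipeline
    if drifted:
        return ("investigate", drifted)  # inputs moved but quality held: alert a human
    return ("none", [])


decision = drift_action({"age": 0.05, "income": 0.31}, metric_drop=0.0)
```

Separating "inputs moved" from "quality dropped" keeps benign drift from triggering unnecessary retraining runs.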
Advanced
Q: How do you design a CI/CD pipeline for machine learning models?
ML CI/CD extends traditional software CI/CD with additional concerns: data validation, model training reproducibility, model evaluation gates, and shadow/canary model deployment.
**Stages of a production ML CI/CD pipeline:**
**1. Data Validation (CI):**
Before any training, validate that the incoming data meets expectations:
- Schema validation (correct columns, types)
- Distribution checks (no sudden shifts in key features)
- Missing value thresholds
- Row count sanity checks
Tools: Great Expectations, TensorFlow Data Validation (TFDV, part of TFX), Evidently AI.
Fail the pipeline if data quality checks fail — prevent garbage-in-garbage-out.
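The checks listed above can be sketched without any validation library. The function below is a hand-rolled illustration, not the API of Great Expectations or TFDV; the schema format, the 5% missing-value threshold, and the column names are assumptions.

```python
def validate_batch(rows, schema, max_missing_fraction=0.05, min_rows=1):
    """Return a list of human-readable failures; an empty list means the batch passes."""
    failures = []
    if not rows or len(rows) < min_rows:
        failures.append(f"row count {len(rows)} below minimum {min_rows}")
        return failures
    for column, expected_type in schema.items():
        values = [row.get(column) for row in rows]
        missing = sum(1 for v in values if v is None)
        if missing / len(rows) > max_missing_fraction:
            failures.append(f"{column}: {missing}/{len(rows)} values missing")
        wrong = sum(1 for v in values if v is not None and not isinstance(v, expected_type))
        if wrong:
            failures.append(f"{column}: {wrong} values are not {expected_type.__name__}")
    return failures


schema = {"user_id": str, "amount": float}
incoming = [
    {"user_id": "u1", "amount": 10.0},
    {"user_id": "u2", "amount": None},   # missing value
    {"user_id": "u3", "amount": "12"},   # wrong type
]
failures = validate_batch(incoming, schema)
pipeline_should_stop = bool(failures)
```

Returning every failure rather than stopping at the first makes the CI log more useful when several checks break at once.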
**2. Feature Engineering (CI):**
- Run feature computation pipeline (dbt, Spark, Airflow).
- Validate computed features against the feature store schema.
- Store versioned feature snapshots for reproducibility.
**3. Model Training (CI):**
- Train the model with the validated data.
- Log all hyperparameters, data version, code version, and environment to the experiment tracker (MLflow, Weights & Biases).
- Ensure training is deterministic (fixed random seeds, deterministic GPU ops).
- Output: a model artifact registered in the model registry.
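A minimal sketch of what "deterministic and fully logged" can look like, using only the standard library. The `train` function, the toy mean-predictor model, and the record layout are hypothetical; in practice the record would be sent to an experiment tracker such as MLflow or Weights & Biases rather than returned.

```python
import hashlib
import json
import random


def fingerprint(obj):
    """Short, stable content hash of any JSON-serializable object."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def train(rows, params):
    """Toy training run: a seeded shuffle and split, then a mean predictor."""
    rng = random.Random(params["seed"])  # local RNG, so no global state leaks in
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * params["train_fraction"])
    train_rows = shuffled[:cut]
    model = {"mean_target": sum(r["y"] for r in train_rows) / len(train_rows)}
    record = {
        "params": params,
        "data_version": fingerprint(rows),
        "model_version": fingerprint(model),
    }
    return model, record


rows = [{"x": i, "y": i % 3} for i in range(20)]
params = {"seed": 42, "train_fraction": 0.8}
_, first_run = train(rows, params)
_, second_run = train(rows, params)
```

If two runs with identical inputs produce different `model_version` values, some source of nondeterminism has slipped into the pipeline.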
**4. Model Evaluation (CI Gate):**
The most important stage — determines whether the new model is better than the current production model:
- Run evaluation on a held-out test set.
- Run fairness and bias checks (e.g., performance parity across demographic groups).
- Run performance comparison against the current champion model.
- **Gate:** Only promote if new model beats champion by a meaningful margin (e.g., >2% improvement in F1 score).
- Human review required for high-stakes deployments.
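The gate itself is usually a small piece of logic. The sketch below assumes the ">2%" margin means an absolute gain of 0.02 in F1 and adds a hypothetical cap on the F1 gap between demographic groups as the fairness check; both thresholds are illustrative.

```python
def should_promote(champion_f1, challenger_f1, group_f1, min_gain=0.02, max_group_gap=0.05):
    """Return (promote, reasons). `group_f1` maps group name -> challenger F1 on that group."""
    reasons = []
    gain = challenger_f1 - champion_f1
    if gain <= min_gain:
        reasons.append(f"F1 gain {gain:.3f} does not exceed required {min_gain:.3f}")
    gap = max(group_f1.values()) - min(group_f1.values())
    if gap > max_group_gap:
        reasons.append(f"F1 gap across groups {gap:.3f} exceeds {max_group_gap:.3f}")
    return (not reasons, reasons)


promote, reasons = should_promote(
    champion_f1=0.81,
    challenger_f1=0.84,
    group_f1={"group_a": 0.85, "group_b": 0.83},
)
```

Returning the reasons along with the decision gives the CI job something specific to print when it blocks a promotion.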
**5. Model Registry Promotion (CD):**
```
model registry stages: Staging → Production
```
Merge to main automatically promotes the model to "Staging" in the registry. A human approval (or automated performance gate) promotes to "Production."
**6. Deployment (CD):**
- Shadow deployment: run new model alongside production, compare predictions without serving them to users.
- Canary: route 5% of traffic to new model, compare real metrics (conversion, engagement).
- Full rollout if canary is healthy.
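One common way to implement the canary split is to hash a stable identifier so that each user consistently sees the same model. This is a sketch under that assumption; the `route` function and the use of a user ID as the key are illustrative, and service meshes or serving platforms often do this at the infrastructure layer instead.

```python
import hashlib


def route(user_id, canary_percent=5.0):
    """Deterministically assign a user to 'canary' or 'production'."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # bucket in 0..9999
    return "canary" if bucket < canary_percent * 100 else "production"


users = [f"user-{i}" for i in range(10_000)]
canary_share = sum(route(u) == "canary" for u in users) / len(users)
```

Because the assignment depends only on the ID, raising `canary_percent` from 5 to 20 keeps the original canary users in the canary group, which makes before and after comparisons cleaner.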
**7. Post-deployment Monitoring:**
- Feature drift detection (continuous).
- Prediction quality tracking.
- Business KPI monitoring.
- Automated rollback if performance degrades.
**Tools for complete pipeline:** Kubeflow Pipelines or Vertex AI Pipelines (orchestration), MLflow (tracking + registry), ArgoCD (Kubernetes deployment), Evidently AI (monitoring), Great Expectations (data validation).