MLOps · 65+ Questions · Beginner to Expert

MLOps Interview Questions & Answers (2026)

65+ MLOps interview questions with expert answers. Model deployment patterns, feature stores, drift detection, CI/CD for ML, Kubeflow, MLflow, and production ML system design.

Beginner

Q: What is MLOps and why is it important?

MLOps (Machine Learning Operations) is a set of practices that combines ML, DevOps, and data engineering to deploy and maintain ML models in production reliably and efficiently. The term was coined by analogy with DevOps: just as DevOps brought engineering discipline to software deployment, MLOps brings it to ML model deployment.

**Why most ML projects fail without MLOps:** Commonly cited industry estimates put the share of ML projects that never reach production at 85–90%, though the exact figure varies by survey. Of those that do reach production, many degrade silently over time. The barriers:

- Moving from a Jupyter notebook experiment to a reliable production system requires significant engineering.
- ML models have a "freshness" problem: the world changes, and models trained on old data become stale.
- Debugging a model that performs poorly in production is much harder than debugging a software bug, because the cause may lie in the data rather than in the code.
- Data pipelines, training pipelines, and serving infrastructure all need to be maintained.

**What MLOps solves:**

1. **Reproducibility:** Anyone can reproduce any experiment with the same data and code.
2. **Deployment:** Models go from training to production reliably through automated pipelines.
3. **Monitoring:** Automatic detection of model degradation, data drift, and prediction quality issues.
4. **Retraining:** Automated pipelines that retrain and redeploy models when needed.
5. **Governance:** Audit trails, approval workflows, and compliance for production models.

**The ML lifecycle (what MLOps covers):** Data preparation → Feature engineering → Model training → Model evaluation → Model registry → Model deployment → Model monitoring → Retraining trigger → repeat.

Q: What is the difference between a batch prediction pipeline and a real-time inference API?

These are the two primary model serving patterns, each suited to different use cases.

**Batch Prediction Pipeline:**

- Runs periodically (hourly, daily, weekly) on a large dataset.
- Loads the model, processes all records, writes predictions to a database or file, and shuts down.
- High throughput, lower infrastructure cost.
- Predictions are not immediately available: a stored prediction can be stale by up to one batch interval.

**Use cases:** Recommendation emails sent at midnight, fraud risk scores computed nightly for all accounts, demand forecasting for next week's inventory.

**Implementation:** Apache Spark, AWS SageMaker Batch Transform, dbt + model inference UDFs, Airflow DAG.

**Real-time Inference API:**

- A persistent web service that responds to individual prediction requests within milliseconds.
- Keeps the model loaded in memory for fast responses.
- Higher infrastructure cost (always-on), requires autoscaling for traffic spikes.
- Predictions are available immediately.

**Use cases:** Real-time fraud detection at payment time, chatbot responses, image classification during upload, product recommendations while the user is browsing.

**Implementation:** FastAPI + model loaded in memory, BentoML, Seldon Core, KServe on Kubernetes, AWS SageMaker Real-Time Endpoints.

**Hybrid (near-real-time / streaming):** A third pattern: predictions are computed asynchronously as events arrive, stored in a feature store or cache, and served to the application with a fast lookup. This balances latency and cost. Example: computing fraud scores for card transactions as they arrive via Kafka, so the score is already cached when the application asks for it.
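To make the contrast between the two serving patterns concrete, here is a minimal sketch using only the Python standard library. The "model" is a stand-in linear function with made-up weights, and `score`, `run_batch_job`, and `handle_request` are hypothetical names chosen for illustration. In a real system the batch function would run inside a scheduler such as Airflow and the request handler would sit behind a web framework.

```python
from typing import Dict, List

# Stand-in for a trained model: hypothetical weights, for illustration only.
WEIGHTS = {"amount": 0.002, "num_prior_chargebacks": 0.3}


def score(features: Dict[str, float]) -> float:
    """Return a risk score clipped to [0, 1] from a toy linear model."""
    raw = sum(WEIGHTS[name] * features.get(name, 0.0) for name in WEIGHTS)
    return max(0.0, min(1.0, raw))


def run_batch_job(records: List[Dict], prediction_table: Dict[str, float]) -> int:
    """Batch pattern: score every record in one pass and persist the results.

    `prediction_table` stands in for the database or file the job writes to.
    """
    for record in records:
        prediction_table[record["account_id"]] = score(record["features"])
    return len(records)


def handle_request(payload: Dict) -> Dict:
    """Real-time pattern: score one request and return the answer immediately."""
    return {"account_id": payload["account_id"], "risk": score(payload["features"])}


if __name__ == "__main__":
    accounts = [
        {"account_id": "a1", "features": {"amount": 100.0, "num_prior_chargebacks": 0}},
        {"account_id": "a2", "features": {"amount": 50.0, "num_prior_chargebacks": 2}},
    ]

    table: Dict[str, float] = {}
    run_batch_job(accounts, table)       # would be triggered on a schedule
    print(table)

    print(handle_request(accounts[1]))   # would be called once per API request
```

Both paths call the same `score` function. Sharing the scoring code this way is one simple guard against the batch and online paths drifting apart.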
Intermediate

Q: What is a feature store and why is it needed?

A feature store is a centralized repository for storing, sharing, and serving ML features. It addresses training-serving skew, one of the most common and costly issues in production ML systems.

**The problem it solves:** Without a feature store, the same feature (e.g., "user's average order value in the last 30 days") might be computed differently in:

- The training pipeline (offline, using a SQL query over historical data)
- The serving API (online, using a different query or logic under time pressure)

Even small inconsistencies between training and serving features can significantly degrade model performance. This training-serving skew is often invisible and notoriously hard to debug.

**What a feature store provides:**

- **Offline store:** Historical feature data for training. Enables point-in-time correct feature retrieval: you always retrieve the feature value as it existed at the time of the label, not today's value (which would cause data leakage).
- **Online store:** Low-latency feature retrieval for real-time inference, typically on the order of milliseconds. Features are pre-computed and stored in Redis, DynamoDB, or a similar key-value store.
- **Feature registry:** Catalog of available features with documentation, owners, lineage, and usage statistics.
- **Feature pipelines:** Automated computation of features from raw data, keeping both stores in sync.

**Popular feature stores:**

- **Feast (open source):** An LF AI & Data Foundation project. Supports multiple offline stores (Snowflake, BigQuery, Parquet) and online stores (Redis, DynamoDB).
- **Tecton:** Enterprise SaaS feature platform. Supports real-time, near-real-time, and batch features.
- **Hopsworks:** Open-core feature store with strong streaming support.
- **Databricks Feature Store:** Native Databricks integration, easy if you already use Databricks.
- **AWS SageMaker Feature Store:** Managed AWS option, tight SageMaker integration.

**When you need a feature store:** When you have multiple models using the same features, when training-serving skew is causing unexplained model degradation, or when your ML team is duplicating feature computation across different notebooks and pipelines.
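Point-in-time correctness is the part of the offline store that candidates are most often asked to explain. The sketch below shows the idea with the standard library only. The `FeatureHistory` layout and the function name `point_in_time_value` are assumptions made for illustration and are not the API of any particular feature store.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple

# Hypothetical layout: per entity, a list of (event_timestamp, value) pairs
# sorted by timestamp in ascending order.
FeatureHistory = Dict[str, List[Tuple[int, float]]]


def point_in_time_value(
    history: FeatureHistory, entity_id: str, label_ts: int
) -> Optional[float]:
    """Return the latest feature value recorded at or before `label_ts`.

    Values written after the label timestamp are ignored, which is what
    prevents future information from leaking into the training set.
    """
    rows = history.get(entity_id, [])
    timestamps = [ts for ts, _ in rows]
    idx = bisect_right(timestamps, label_ts)  # number of rows with ts <= label_ts
    if idx == 0:
        return None  # no feature value existed yet at label time
    return rows[idx - 1][1]


if __name__ == "__main__":
    avg_order_value: FeatureHistory = {
        "u1": [(10, 20.0), (20, 35.0), (30, 50.0)],
    }
    # A label observed at t=25 must see the value written at t=20, not t=30.
    print(point_in_time_value(avg_order_value, "u1", label_ts=25))
```

Whether a value written at exactly the label timestamp should count is a design choice. This sketch includes it, while a stricter pipeline might require the feature to be strictly older than the label.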

Q: What is data drift and model drift, and how do you detect and handle them?

Drift is the phenomenon where a model's performance degrades over time because the real-world data it sees in production no longer resembles the data it was trained on. It is one of the most pervasive problems in production ML.

**Types of drift:**

- **Data drift (feature drift / covariate shift):** The distribution of input features changes in production compared to training. Example: a customer churn model trained on pre-pandemic behavior now receives post-pandemic user patterns it has never seen.
- **Concept drift:** The relationship between features and labels changes. The same input features now map to different outputs than during training. Example: a spam filter trained on 2023 email patterns faces entirely new spam tactics in 2026. Label drift (target shift) is a related but distinct case in which the distribution of the labels themselves changes.
- **Prediction drift:** The distribution of model outputs changes: the model is predicting different classes or values than it did historically, even if the inputs look the same.
- **Label quality degradation:** Ground truth labels become unavailable, delayed, or unreliable.

**Detecting drift:**

**Statistical tests for feature drift:**

- **Population Stability Index (PSI):** Compares a feature's distribution between training and production windows. A common rule of thumb treats PSI > 0.2 as significant drift (some teams use 0.25).
- **Kolmogorov-Smirnov test:** For continuous features. Tests whether two samples come from significantly different distributions.
- **Chi-squared test:** For categorical features.
- **Jensen-Shannon divergence:** Symmetric measure of distribution similarity.

**Monitoring model performance:**

- Compare predictions to ground truth, accounting for label latency. For fraud, confirmed labels can take days or weeks to arrive; for product recommendations, the label (click/purchase) is available within hours.
- Track business metrics (conversion rate, revenue per recommendation) as proxies for model performance when labels are unavailable.

**Tools:**

- **Evidently AI (open source):** Generates drift reports and dashboards from pandas DataFrames.
- **WhyLabs:** Enterprise ML observability platform.
- **Arize:** Production ML monitoring and debugging.
- **Fiddler:** Explainable AI and model monitoring.

**Handling drift:**

1. **Alert and investigate:** Not all drift requires immediate retraining. Investigate whether the drift is meaningful (affecting predictions) or benign.
2. **Retrain with recent data:** The most common response. Schedule regular retraining (weekly, monthly) to keep the model fresh.
3. **Automated retraining pipelines:** Trigger retraining when drift metrics exceed thresholds.
4. **Feature engineering adjustment:** If drift is traced to a specific feature, revisit how it is computed upstream in the data pipeline, or add a feature that captures the new distributional pattern.
5. **Segment-specific models:** If drift is concentrated in a specific user segment or geography, train segment-specific models.
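Interviewers sometimes ask how PSI is actually computed. The following is a minimal standard-library sketch, not the implementation used by any of the tools listed above. It assumes quantile bins derived from the training (reference) sample and a small floor `eps` for empty bins; both are common choices, but bin count and flooring vary between implementations and affect the resulting value.

```python
import math
from bisect import bisect_right
from typing import List, Sequence


def _bin_fractions(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Fraction of `values` falling into each bin defined by sorted `edges`."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    return [c / len(values) for c in counts]


def psi(
    expected: Sequence[float],
    actual: Sequence[float],
    n_bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """Population Stability Index between a reference and a production sample.

    PSI = sum over bins of (actual_frac - expected_frac) * ln(actual_frac / expected_frac).
    """
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")

    # Bin edges are taken at the quantiles of the reference sample.
    ref = sorted(expected)
    edges = [ref[(len(ref) * i) // n_bins] for i in range(1, n_bins)]

    total = 0.0
    for e, a in zip(_bin_fractions(expected, edges), _bin_fractions(actual, edges)):
        e, a = max(e, eps), max(a, eps)  # floor empty bins to avoid log(0)
        total += (a - e) * math.log(a / e)
    return total


if __name__ == "__main__":
    training = [float(x) for x in range(1000)]
    production = [x + 500.0 for x in training]  # synthetic shifted sample
    print(f"PSI = {psi(training, production):.3f}")
```

A value computed this way could feed the alerting and automated retraining steps described above, with the threshold tuned per feature rather than fixed globally.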
Advanced

Q: How do you design a CI/CD pipeline for machine learning models?

ML CI/CD extends traditional software CI/CD with additional concerns: data validation, model training reproducibility, model evaluation gates, and shadow/canary model deployment.

**Stages of a production ML CI/CD pipeline:**

**1. Data Validation (CI):** Before any training, validate that the incoming data meets expectations:

- Schema validation (correct columns, types)
- Distribution checks (no sudden shifts in key features)
- Missing value thresholds
- Row count sanity checks

Tools: Great Expectations, TensorFlow Data Validation (TFDV), Evidently AI. Fail the pipeline if data quality checks fail, to prevent garbage in, garbage out.

**2. Feature Engineering (CI):**

- Run the feature computation pipeline (dbt, Spark, Airflow).
- Validate computed features against the feature store schema.
- Store versioned feature snapshots for reproducibility.

**3. Model Training (CI):**

- Train the model with the validated data.
- Log all hyperparameters, data version, code version, and environment to the experiment tracker (MLflow, Weights & Biases).
- Make training as reproducible as possible (fixed random seeds, deterministic GPU ops where available).
- Output: a model artifact registered in the model registry.

**4. Model Evaluation (CI Gate):** The most important stage, because it determines whether the new model is better than the current production model:

- Run evaluation on a held-out test set.
- Run fairness and bias checks (e.g., performance parity across demographic groups).
- Run a performance comparison against the current champion model.
- **Gate:** Only promote if the new model beats the champion by a meaningful margin (e.g., >2% improvement in F1 score).
- Human review required for high-stakes deployments.

**5. Model Registry Promotion (CD):**

```
model registry stages: Staging → Production
```

A merge to main automatically promotes the model to "Staging" in the registry. A human approval (or automated performance gate) promotes it to "Production."

**6. Deployment (CD):**

- Shadow deployment: run the new model alongside production and compare predictions without serving them to users.
- Canary: route 5% of traffic to the new model and compare real metrics (conversion, engagement).
- Full rollout if the canary is healthy.

**7. Post-deployment Monitoring:**

- Feature drift detection (continuous).
- Prediction quality tracking.
- Business KPI monitoring.
- Automated rollback if performance degrades.

**Tools for a complete pipeline:** Kubeflow Pipelines or Vertex AI Pipelines (orchestration), MLflow (tracking + registry), Argo CD (Kubernetes deployment), Evidently AI (monitoring), Great Expectations (data validation).
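Two of the gates in this pipeline, data validation (stage 1) and the champion/challenger comparison (stage 4), can be sketched in a few lines of standard-library Python. The function names `validate_batch` and `should_promote`, the thresholds, and the reading of ">2% improvement" as a relative gain are all assumptions made for illustration. In practice a tool such as Great Expectations would replace the hand-written checks.

```python
from typing import Dict, List, Sequence, Tuple, Type, Union

ColumnType = Union[Type, Tuple[Type, ...]]


def validate_batch(
    rows: Sequence[Dict],
    schema: Dict[str, ColumnType],
    max_missing_frac: float = 0.05,
    min_rows: int = 1,
) -> List[str]:
    """Return a list of failure messages; an empty list means the batch passes."""
    if len(rows) < max(min_rows, 1):
        return [f"row count {len(rows)} is below the minimum of {min_rows}"]

    failures: List[str] = []
    for column, expected_type in schema.items():
        values = [row.get(column) for row in rows]
        missing = sum(1 for v in values if v is None)
        wrong_type = sum(
            1 for v in values if v is not None and not isinstance(v, expected_type)
        )
        if wrong_type:
            failures.append(f"{column}: {wrong_type} value(s) with unexpected type")
        if missing / len(rows) > max_missing_frac:
            failures.append(f"{column}: {missing}/{len(rows)} values missing")
    return failures


def should_promote(
    challenger_metrics: Dict[str, float],
    champion_metrics: Dict[str, float],
    metric: str = "f1",
    min_relative_gain: float = 0.02,
) -> bool:
    """Gate: promote only if the challenger beats the champion by the margin.

    Assumption: the margin is relative, e.g. 0.02 means 2% better than champion.
    """
    champion = champion_metrics[metric]
    challenger = challenger_metrics[metric]
    if champion <= 0:
        return challenger > champion
    return (challenger - champion) / champion > min_relative_gain


if __name__ == "__main__":
    schema = {"amount": (int, float), "country": str}
    batch = [{"amount": 12.5, "country": "DE"}, {"amount": 3, "country": None}]
    print(validate_batch(batch, schema))  # a non-empty list should fail the CI job
    print(should_promote({"f1": 0.83}, {"f1": 0.80}))
```

In a CI job, a non-empty result from `validate_batch` or a `False` from `should_promote` would translate into a non-zero exit code so that the pipeline stops before promotion.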
