# MLOps Production Patterns

Production ML infrastructure patterns for model deployment, monitoring, and lifecycle management.

---

## Table of Contents

- [Model Deployment Pipeline](#model-deployment-pipeline)
- [Feature Store Architecture](#feature-store-architecture)
- [Model Monitoring](#model-monitoring)
- [A/B Testing Infrastructure](#ab-testing-infrastructure)
- [Automated Retraining](#automated-retraining)

---

## Model Deployment Pipeline

### Deployment Workflow

1. Export trained model to standardized format (ONNX, TorchScript, SavedModel)
2. Package model with dependencies in Docker container
3. Deploy to staging environment
4. Run integration tests against staging
5. Deploy canary (5% traffic) to production
6. Monitor latency and error rates for 1 hour
7. Promote to full production if metrics pass
8. **Validation:** p95 latency < 100ms, error rate < 0.1%
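
The promotion gate in steps 6–8 reduces to a threshold check against the validation criteria above. A minimal sketch (function and parameter names are illustrative, not from any specific library):

```python
def canary_passes(p95_latency_ms: float, error_rate: float,
                  latency_budget_ms: float = 100.0,
                  error_budget: float = 0.001) -> bool:
    """Return True if canary metrics meet the promotion criteria
    (p95 latency < 100ms, error rate < 0.1%)."""
    return p95_latency_ms < latency_budget_ms and error_rate < error_budget
```

In practice these metrics would be pulled from the monitoring system over the one-hour canary window before calling a check like this.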

### Container Structure

```dockerfile
FROM python:3.11-slim

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy model artifacts
COPY model/ /app/model/
COPY src/ /app/src/

# Health check endpoint (python:3.11-slim ships without curl, so use the stdlib)
HEALTHCHECK CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8080/health')" || exit 1

EXPOSE 8080
CMD ["uvicorn", "src.server:app", "--host", "0.0.0.0", "--port", "8080"]
```

### Model Serving Options

| Option | Latency | Throughput | Use Case |
|--------|---------|------------|----------|
| FastAPI + Uvicorn | Low | Medium | REST APIs, small models |
| Triton Inference Server | Very Low | Very High | GPU inference, batching |
| TensorFlow Serving | Low | High | TensorFlow models |
| TorchServe | Low | High | PyTorch models |
| Ray Serve | Medium | High | Complex pipelines, multi-model |

### Kubernetes Deployment

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: model-serving
  template:
    metadata:
      labels:
        app: model-serving
    spec:
      containers:
      - name: model
        image: model:v1.0.0
        resources:
          requests:
            memory: "2Gi"
            cpu: "1"
          limits:
            memory: "4Gi"
            cpu: "2"
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
```

---

## Feature Store Architecture

### Feature Store Components

| Component | Purpose | Tools |
|-----------|---------|-------|
| Offline Store | Training data, batch features | BigQuery, Snowflake, S3 |
| Online Store | Low-latency serving | Redis, DynamoDB, Feast |
| Feature Registry | Metadata, lineage | Feast, Tecton, Hopsworks |
| Transformation | Feature engineering | Spark, Flink, dbt |

### Feature Pipeline Workflow

1. Define feature schema in registry
2. Implement transformation logic (SQL or Python)
3. Backfill historical features to offline store
4. Schedule incremental updates
5. Materialize to online store for serving
6. Monitor feature freshness and quality
7. **Validation:** Feature values within expected ranges, no nulls in required fields
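
The validation in step 7 can be sketched as a plain range-and-null check over feature rows (names and signature are illustrative; production systems would typically use a data-quality tool instead):

```python
def validate_features(rows: list[dict], required: list[str],
                      ranges: dict[str, tuple[float, float]]) -> list[str]:
    """Check required fields are non-null and numeric values fall in range."""
    errors = []
    for i, row in enumerate(rows):
        # Nulls in required fields
        for field in required:
            if row.get(field) is None:
                errors.append(f"row {i}: {field} is null")
        # Values outside expected ranges
        for field, (lo, hi) in ranges.items():
            value = row.get(field)
            if value is not None and not lo <= value <= hi:
                errors.append(f"row {i}: {field}={value} outside [{lo}, {hi}]")
    return errors
```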

### Feature Definition Example

```python
from datetime import timedelta

# Older Feast API (Feature/ValueType); newer releases use Field and schema=
from feast import Entity, Feature, FeatureView, FileSource, ValueType

user = Entity(name="user_id", value_type=ValueType.INT64)

user_features = FeatureView(
    name="user_features",
    entities=["user_id"],
    ttl=timedelta(days=1),
    features=[
        Feature(name="purchase_count_30d", dtype=ValueType.INT64),
        Feature(name="avg_order_value", dtype=ValueType.FLOAT),
        Feature(name="days_since_last_purchase", dtype=ValueType.INT64),
    ],
    online=True,
    source=FileSource(path="data/user_features.parquet"),
)
```

---

## Model Monitoring

### Monitoring Dimensions

| Dimension | Metrics | Alert Threshold |
|-----------|---------|-----------------|
| Latency | p50, p95, p99 | p95 > 100ms |
| Throughput | requests/sec | < 80% of baseline |
| Errors | error rate, 5xx count | > 0.1% |
| Data Drift | PSI, KS statistic | PSI > 0.2 |
| Model Drift | accuracy, AUC decay | > 5% drop |
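
The PSI threshold in the table can be computed with a short NumPy routine; this is one common formulation (quantile bins from the reference window, with clipping to avoid empty-bin division), not the only one:

```python
import numpy as np

def population_stability_index(reference, current, bins: int = 10) -> float:
    """PSI between two samples, with bins set by reference quantiles.
    PSI > 0.2 is the drift alert threshold used above."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Interior bin edges from reference quantiles; outermost bins are open-ended
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))[1:-1]
    ref_idx = np.searchsorted(edges, reference, side="right")
    cur_idx = np.searchsorted(edges, current, side="right")
    ref_pct = np.bincount(ref_idx, minlength=bins) / len(reference)
    cur_pct = np.bincount(cur_idx, minlength=bins) / len(current)
    eps = 1e-6  # avoid log(0) and division by zero in empty bins
    ref_pct = np.clip(ref_pct, eps, None)
    cur_pct = np.clip(cur_pct, eps, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))
```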

### Data Drift Detection

```python
from scipy.stats import ks_2samp
import numpy as np

def detect_drift(reference: np.ndarray, current: np.ndarray, threshold: float = 0.05):
    """Detect distribution drift using Kolmogorov-Smirnov test."""
    statistic, p_value = ks_2samp(reference, current)

    drift_detected = p_value < threshold

    return {
        "drift_detected": drift_detected,
        "ks_statistic": statistic,
        "p_value": p_value,
        "threshold": threshold
    }
```

### Monitoring Dashboard Metrics

**Infrastructure:**
- Request latency (p50, p95, p99)
- Requests per second
- Error rate by type
- CPU/memory utilization
- GPU utilization (if applicable)

**Model Performance:**
- Prediction distribution
- Feature value distributions
- Model output confidence
- Ground truth vs predictions (when available)

---

## A/B Testing Infrastructure

### Experiment Workflow

1. Define experiment hypothesis and success metrics
2. Calculate required sample size for statistical power
3. Configure traffic split (control vs treatment)
4. Deploy treatment model alongside control
5. Route traffic based on user/session hash
6. Collect metrics for both variants
7. Run statistical significance test
8. **Validation:** p-value < 0.05, minimum sample size reached
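
The sample-size calculation in step 2 can be sketched with the standard two-proportion power formula (the defaults of 5% significance and 80% power are illustrative assumptions):

```python
import math

from scipy.stats import norm

def sample_size_per_arm(p_base: float, mde: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate per-variant sample size to detect an absolute lift of
    `mde` over baseline conversion rate `p_base` with a two-sided test."""
    p_new = p_base + mde
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value for significance level
    z_beta = norm.ppf(power)           # critical value for desired power
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    n = (z_alpha + z_beta) ** 2 * variance / mde ** 2
    return math.ceil(n)
```

For example, detecting a 2-point absolute lift over a 10% baseline needs a few thousand users per arm.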

### Traffic Splitting

```python
import hashlib

def get_variant(user_id: str, experiment: str, control_pct: float = 0.5) -> str:
    """Deterministic traffic splitting based on user ID."""
    hash_input = f"{user_id}:{experiment}"
    hash_value = int(hashlib.md5(hash_input.encode()).hexdigest(), 16)
    bucket = (hash_value % 100) / 100.0

    return "control" if bucket < control_pct else "treatment"
```

### Metrics Collection

| Metric Type | Examples | Collection Method |
|-------------|----------|-------------------|
| Primary | Conversion rate, revenue | Event logging |
| Secondary | Latency, engagement | Request logs |
| Guardrail | Error rate, crashes | Monitoring system |
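
The significance test in step 7 of the workflow, for a primary metric like conversion rate, can be sketched as a pooled two-proportion z-test (one common choice; a chi-squared or sequential test would also fit):

```python
import math

from scipy.stats import norm

def two_proportion_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for a difference in conversion rates between
    control (a) and treatment (b)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under the null
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - norm.cdf(abs(z)))
```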

---

## Automated Retraining

### Retraining Triggers

| Trigger | Detection Method | Action |
|---------|------------------|--------|
| Scheduled | Cron (weekly/monthly) | Full retrain |
| Performance drop | Accuracy < threshold | Immediate retrain |
| Data drift | PSI > 0.2 | Evaluate, then retrain |
| New data volume | X new samples | Incremental update |

### Retraining Pipeline

1. Trigger detection (schedule, drift, performance)
2. Fetch latest training data from feature store
3. Run training job with hyperparameter config
4. Evaluate model on holdout set
5. Compare against production model
6. If improved: register new model version
7. Deploy to staging for validation
8. Promote to production via canary
9. **Validation:** New model outperforms baseline on key metrics
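
The comparison in steps 5–6 can be sketched as a gate that only registers the candidate when it beats the production model on every key metric (metric names and the margin parameter are illustrative):

```python
def should_promote(candidate: dict, production: dict,
                   metrics: tuple = ("auc", "accuracy"),
                   min_improvement: float = 0.0) -> bool:
    """Promote only if the candidate beats production on all key metrics,
    all of which are assumed to be higher-is-better."""
    return all(
        candidate[m] - production[m] > min_improvement
        for m in metrics
    )
```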

### MLflow Model Registry Integration

```python
import mlflow

def register_model(model, metrics: dict, model_name: str):
    """Register trained model with MLflow."""
    with mlflow.start_run():
        # Log metrics
        for name, value in metrics.items():
            mlflow.log_metric(name, value)

        # Log model
        mlflow.sklearn.log_model(model, "model")

        # Register in model registry
        model_uri = f"runs:/{mlflow.active_run().info.run_id}/model"
        mlflow.register_model(model_uri, model_name)
```
