RAG (Retrieval-Augmented Generation) Production Architecture

System Overview

End-to-end RAG system with vector database, caching, monitoring, and scaling capabilities for production workloads.

📊 Architecture Diagram: User Request → API Gateway → Cache Layer → Vector DB Query → LLM Inference → Response

API Gateway

FastAPI with rate limiting, authentication, and request validation

Cache Layer

Redis for semantic caching and frequent query results
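
A minimal semantic-cache sketch in Python, assuming an embed() helper, a flat key scan, and a similarity cutoff that are illustrative rather than part of this architecture; a production setup would typically use a Redis vector index rather than scanning keys.

# Semantic cache sketch: return a cached answer when a similar query was seen before.
import hashlib
import json
import numpy as np
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
SIM_THRESHOLD = 0.92  # assumed cutoff; tune against real traffic

def cosine(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cached_answer(query, embed):
    """Return a cached answer if a semantically similar query exists, else None."""
    q_vec = embed(query)                       # embed() is whatever embedding model you use
    for key in r.scan_iter("semcache:*"):
        entry = json.loads(r.get(key))
        if cosine(q_vec, entry["embedding"]) >= SIM_THRESHOLD:
            return entry["answer"]             # cache hit: skip the LLM call
    return None

def store_answer(query, answer, embed, ttl=3600):
    """Cache the LLM answer, keyed by a hash of the query text, with a TTL."""
    key = "semcache:" + hashlib.md5(query.encode()).hexdigest()
    entry = {"embedding": list(map(float, embed(query))), "answer": answer}
    r.set(key, json.dumps(entry), ex=ttl)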

Vector Database

Pinecone or Weaviate for embedding storage and similarity search

LLM Inference

vLLM or TensorRT-LLM for high-throughput generation

Monitoring

Prometheus + Grafana for metrics, Langfuse for LLM observability

Load Balancer

NGINX or AWS ALB for traffic distribution

Best Practices

  • Implement semantic caching to reduce LLM calls by 40-60%
  • Use hybrid search (vector + keyword) for better retrieval accuracy (a score-fusion sketch follows this list)
  • Set up query rewriting to improve vector search relevance
  • Monitor retrieval quality with precision and recall metrics
  • Implement circuit breakers for LLM API failures
  • Use async processing for non-critical requests
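
One common way to combine the vector and keyword result lists from the hybrid-search bullet above is reciprocal rank fusion; the sketch below assumes each retriever returns document IDs ordered best-first.

# Reciprocal rank fusion (RRF): merge vector and keyword result lists by rank.
def rrf_merge(vector_hits, keyword_hits, k=60):
    """vector_hits / keyword_hits: lists of doc IDs, best match first."""
    scores = {}
    for hits in (vector_hits, keyword_hits):
        for rank, doc_id in enumerate(hits):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Usage: merged = rrf_merge(ids_from_vector_db, ids_from_bm25_index)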

Blue-Green Deployment for ML Models

Standard Operating Procedure

1. Prepare Green Environment

Deploy the new model version to the green environment, kept separate from the production (blue) environment.

kubectl apply -f model-deployment-green.yaml
kubectl wait --for=condition=ready pod -l version=green
2. Run Validation Tests

Execute automated tests against the green environment: performance, accuracy, and integration tests.

python run_validation.py --env=green --threshold=0.95
# Check latency, throughput, and model accuracy
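
The validation script itself is not shown here; as a rough sketch of what such a gate might check, the following assumes a green-environment HTTP endpoint, a labeled eval set, and accuracy/latency thresholds that are illustrative, not the actual run_validation.py.

# Hypothetical validation gate: fail the deployment if accuracy or latency regress.
import time
import requests

GREEN_URL = "http://model-green.internal/predict"   # assumed internal endpoint

def validate(eval_set, accuracy_threshold=0.95, p95_budget_ms=300):
    """eval_set: list of {"input": ..., "label": ...}; endpoint assumed to return {"label": ...}."""
    latencies, correct = [], 0
    for example in eval_set:
        start = time.perf_counter()
        pred = requests.post(GREEN_URL, json={"input": example["input"]}, timeout=10).json()
        latencies.append((time.perf_counter() - start) * 1000)
        correct += int(pred.get("label") == example["label"])
    accuracy = correct / len(eval_set)
    p95 = sorted(latencies)[int(0.95 * len(latencies)) - 1]
    assert accuracy >= accuracy_threshold, f"accuracy {accuracy:.3f} below threshold"
    assert p95 <= p95_budget_ms, f"p95 latency {p95:.0f}ms over budget"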
3. Canary Testing

Route 5-10% of traffic to the green environment and monitor for anomalies.

# Update routing to send ~10% of traffic to green.
# Note: a plain Kubernetes Service has no weight field, so percentage-based splits
# require a service mesh or ingress-level traffic splitting; the patch below is illustrative.
kubectl patch service model-service --type=merge \
  -p '{"spec":{"selector":{"version":"green"},"weight":10}}'
4. Monitor Metrics

Track error rates, latency p95/p99, and business metrics for 15-30 minutes (a Prometheus-based check sketch follows this list):

  • Error rate should be < 0.5%
  • P95 latency within 10% of blue
  • No alerts or anomalies detected
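
A minimal sketch of automating these checks against the Prometheus HTTP API; the metric names, label values, and thresholds are assumptions, not part of this SOP.

# Canary gate queried from Prometheus; URL and metric names are assumed for this sketch.
import requests

PROM = "http://prometheus.internal:9090/api/v1/query"

def prom_value(query):
    """Return the first scalar result of an instant PromQL query (0.0 if empty)."""
    result = requests.get(PROM, params={"query": query}, timeout=5).json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def canary_healthy():
    error_rate = prom_value('sum(rate(http_requests_total{version="green",code=~"5.."}[5m]))'
                            ' / sum(rate(http_requests_total{version="green"}[5m]))')
    p95_green = prom_value('histogram_quantile(0.95, sum(rate('
                           'request_latency_seconds_bucket{version="green"}[5m])) by (le))')
    p95_blue = prom_value('histogram_quantile(0.95, sum(rate('
                          'request_latency_seconds_bucket{version="blue"}[5m])) by (le))')
    # Mirrors the thresholds above: error rate < 0.5%, green p95 within 10% of blue.
    return error_rate < 0.005 and p95_green <= 1.10 * p95_blue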
5. Full Traffic Switch

If the canary is successful, route 100% of traffic to green; otherwise roll back to blue.

# Success: switch all traffic to green
kubectl patch service model-service --type=merge \
  -p '{"spec":{"selector":{"version":"green"}}}'
# Rollback if issues are detected
kubectl patch service model-service --type=merge \
  -p '{"spec":{"selector":{"version":"blue"}}}'
6. Decommission Old Environment

After 24-48 hours of stable operation, scale down the blue environment.

kubectl scale deployment model-blue --replicas=0
# Keep the blue deployment for 7 days before full deletion to allow emergency rollback

Rollback Procedure

  • Immediately switch traffic back to blue if error rate exceeds threshold
  • Maintain blue environment for minimum 48 hours post-deployment
  • Document all deployment changes and rollback triggers
  • Set up automated rollback on critical metric degradation (a trigger sketch follows this list)
  • Test rollback procedure monthly to ensure readiness
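
A sketch of the automated-rollback trigger mentioned above, assuming a health-check callable (such as the canary check sketched earlier) and kubectl invoked via subprocess; the polling interval and window are illustrative.

# Automated rollback: patch the service selector back to blue when the health check fails.
import subprocess
import time

ROLLBACK_PATCH = '{"spec":{"selector":{"version":"blue"}}}'

def watch_and_rollback(check_fn, interval_s=60, window_s=1800):
    """Poll a health check (e.g. canary_healthy above) and roll back on the first failure."""
    deadline = time.time() + window_s
    while time.time() < deadline:
        if not check_fn():
            subprocess.run(
                ["kubectl", "patch", "service", "model-service", "--type=merge", "-p", ROLLBACK_PATCH],
                check=True,
            )
            return "rolled-back"
        time.sleep(interval_s)
    return "stable"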

Data Quality Monitoring Architecture

Continuous Data Quality Framework

Automated monitoring and alerting for data pipelines with anomaly detection and root cause analysis.

Schema Validation

Great Expectations for schema enforcement and data type validation

Freshness Checks

Monitor data arrival times and detect delays or missing data

Volume Anomalies

Statistical process control for row count variations
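
A minimal sketch of the control-limit idea, assuming a recent history of daily row counts is available; the 3-sigma limit and the length of the history window are illustrative defaults.

# Statistical process control on row counts: flag loads outside 3-sigma control limits.
import statistics

def volume_anomaly(daily_row_counts, todays_count, sigmas=3.0):
    """daily_row_counts: recent history (e.g. the last 30 daily loads) for the same table."""
    mean = statistics.mean(daily_row_counts)
    std = statistics.pstdev(daily_row_counts)
    lower, upper = mean - sigmas * std, mean + sigmas * std
    return not (lower <= todays_count <= upper)

# Usage: if volume_anomaly(history, count_from_todays_load): page the on-call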

Distribution Drift

Kolmogorov-Smirnov tests for detecting distribution changes
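
A minimal two-sample KS check using scipy; the significance level and the choice of column to compare are assumptions.

# Two-sample Kolmogorov-Smirnov test between a reference window and the latest batch.
from scipy.stats import ks_2samp

def drifted(reference_values, current_values, alpha=0.05):
    """Flag drift when the KS test rejects 'same distribution' at the chosen alpha."""
    statistic, p_value = ks_2samp(reference_values, current_values)
    return p_value < alpha

# Usage: drifted(train_df["amount"], todays_df["amount"]) -> True means investigate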

Null Rate Tracking

Monitor null percentages across critical columns

Alert Routing

PagerDuty or Slack integration with severity-based escalation
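
A sketch of severity-based routing, assuming a Slack incoming webhook and the PagerDuty Events API v2; the webhook URL and routing key are placeholders, and the severity mapping is illustrative.

# Severity-based routing: low-severity issues go to Slack, critical ones to PagerDuty.
import requests

SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"    # placeholder URL
PAGERDUTY_EVENTS = "https://events.pagerduty.com/v2/enqueue"      # PagerDuty Events API v2

def send_alert(check_name, severity, details, routing_key="REPLACE_ME"):
    if severity == "critical":
        requests.post(PAGERDUTY_EVENTS, json={
            "routing_key": routing_key,
            "event_action": "trigger",
            "payload": {"summary": f"{check_name}: {details}",
                        "severity": "critical", "source": "dq-monitor"},
        }, timeout=5)
    else:
        requests.post(SLACK_WEBHOOK, json={"text": f"[{severity}] {check_name}: {details}"}, timeout=5)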

Example Quality Checks

# Great Expectations suite example (JSON-style config; "mostly" is GE's tolerance parameter)
expectation_suite = {
    "expectations": [
        {
            "expectation_type": "expect_column_values_to_not_be_null",
            "kwargs": {"column": "user_id", "mostly": 0.99},
        },
        {
            "expectation_type": "expect_table_row_count_to_be_between",
            "kwargs": {"min_value": 900000, "max_value": 1100000},
        },
        {
            "expectation_type": "expect_column_values_to_be_in_set",
            "kwargs": {"column": "status", "value_set": ["active", "inactive", "pending"]},
        },
    ],
}

Multi-Region ML Deployment

Global Low-Latency Inference Architecture

Distributed model serving across multiple regions with geo-routing and data synchronization.

🌍 Multi-Region Setup: Users → CloudFront/CloudFlare → Regional API Gateways → Regional Model Deployments → Shared Vector DB

Key Considerations

  • Deploy models in regions closest to user base (US-East, EU-West, AP-Southeast)
  • Use CDN for caching responses and static assets
  • Replicate vector databases across regions with eventual consistency
  • Implement geo-routing at DNS level for automatic region selection
  • Monitor cross-region latency and failover health checks (see the probe sketch after this list)
  • Use global load balancer for automatic failover between regions
  • Sync model versions across regions to ensure consistency
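
A client-side sketch of the health/latency probing described above, with assumed regional health endpoints; in production the routing decision normally sits in DNS or the global load balancer rather than in the client.

# Probe regional /health endpoints and pick the fastest healthy region (illustrative only).
import requests

REGIONS = {                                   # assumed regional endpoints
    "us-east-1": "https://us-east.api.example.com/health",
    "eu-west-1": "https://eu-west.api.example.com/health",
    "ap-southeast-1": "https://ap-southeast.api.example.com/health",
}

def fastest_healthy_region():
    timings = {}
    for region, url in REGIONS.items():
        try:
            resp = requests.get(url, timeout=2)
            if resp.ok:
                timings[region] = resp.elapsed.total_seconds()
        except requests.RequestException:
            continue                          # region unhealthy or unreachable: skip it
    return min(timings, key=timings.get) if timings else None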

Cost Optimization

Multi-region deployments typically increase infrastructure costs by 2-3x. Start with a single region, then expand based on user distribution and latency requirements. Use spot instances for batch inference workloads to reduce compute costs by 60-70%.