
Deploying LLM Applications: From Prototype to Production

2025-01-20 · 45 min read

The complete guide to deploying your LLM application - from a $5 VM to enterprise Kubernetes

What You'll Learn

In this comprehensive guide, we'll cover every deployment option for your LLM applications, progressing from simple to complex:

  1. Quick Prototyping with Streamlit Cloud (5 minutes)
  2. VM Deployment with SSL certificates (2 hours) - The hidden gem for MVPs
  3. Production APIs with FastAPI/Flask on managed platforms (1 day)
  4. Serverless deployment for variable workloads (1 week)
  5. Containerized deployment with Docker (2 weeks)
  6. Enterprise scale with Kubernetes (1 month)

By the end, you'll know exactly which deployment strategy fits your needs, budget, and timeline.

Prerequisites

From our previous lessons, you should already have:

✅ Observability setup (Langfuse)
✅ Response caching implemented
✅ Model selection strategy
✅ Cost optimization techniques

Now let's deploy these optimized applications to production!

Part 1: The Deployment Decision Tree

Before diving into specifics, here's how to choose your deployment strategy:

def choose_deployment_strategy(requirements):
    """
    Find the right deployment approach for your needs
    """

    # Just testing or showing a demo?
    if requirements["users"] < 100 and requirements["internal_only"]:
        return "Streamlit Cloud (Free)"

    # Building an MVP or side project?
    elif requirements["budget"] < 20 and requirements["users"] < 1000:
        return "VM with Nginx/Caddy ($5-20/mo)"

    # Need a production API?
    elif requirements["need_api"] and requirements["users"] < 10000:
        return "FastAPI on Railway/Render ($20-50/mo)"

    # Variable or unpredictable traffic?
    elif requirements["variable_traffic"] and requirements["budget_conscious"]:
        return "Serverless - AWS Lambda/Vercel (Pay per use)"

    # Complex dependencies or need GPUs?
    elif requirements["need_gpu"] or requirements["complex_dependencies"]:
        return "Docker on Cloud Run/ECS ($50-200/mo)"

    # Enterprise with high availability needs?
    elif requirements["enterprise"] and requirements["high_availability"]:
        return "Kubernetes ($200+/mo)"

    else:
        return "Start with a VM, scale later"

Quick Cost Comparison

Deployment Type      Monthly Cost   Setup Time   Best For
Streamlit Cloud      $0             5 min        Demos, POCs
VM + Nginx           $5-20          2 hours      MVPs, startups
Managed Platforms    $20-50         30 min       Growing apps
Serverless           $0-100*        1 hour       Variable traffic
Docker + Cloud       $50-200        4 hours      Complex apps
Kubernetes           $200+          Days         Enterprise

*Depends on usage

Part 2: Quick Start - Streamlit Cloud Deployment

Let's start with the simplest option - perfect for demos and internal tools.

When to Use Streamlit Cloud

Perfect for:

  • Proof of concepts
  • Internal dashboards
  • Data science demos
  • Sharing with stakeholders

Not suitable for:

  • Production APIs
  • Mobile apps
  • High-traffic applications
  • Complex authentication needs

Building Your Streamlit App

# app.py
import streamlit as st
import litellm
from dotenv import load_dotenv
import pandas as pd
import plotly.express as px
from datetime import datetime

# Load environment variables
load_dotenv()

# Enable observability from previous lessons
litellm.success_callback = ["langfuse"]
litellm.cache = litellm.Cache()  # Enable caching

# Page config
st.set_page_config(
    page_title="LLM Demo App",
    page_icon="🤖",
    layout="wide"
)

# Sidebar for configuration
with st.sidebar:
    st.header("Configuration")
    model = st.selectbox(
        "Select Model",
        ["gpt-4o-mini", "claude-3-haiku-20240307", "gpt-3.5-turbo"]
    )
    temperature = st.slider("Temperature", 0.0, 2.0, 0.7)
    max_tokens = st.number_input("Max Tokens", 50, 2000, 500)

    # Cost tracking from previous lessons
    if 'total_cost' not in st.session_state:
        st.session_state.total_cost = 0

    st.metric("Session Cost", f"${st.session_state.total_cost:.4f}")

# Main app
st.title("🤖 Production-Ready LLM Demo")
st.markdown("Deployed with all optimizations from our course!")

# Chat interface
if 'messages' not in st.session_state:
    st.session_state.messages = []

# Display chat history
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

# User input
if prompt := st.chat_input("Ask me anything..."):
    # Add to chat history
    st.session_state.messages.append({"role": "user", "content": prompt})

    with st.chat_message("user"):
        st.markdown(prompt)

    # Generate response
    with st.chat_message("assistant"):
        with st.spinner("Thinking..."):
            try:
                # Use optimizations from previous lessons
                response = litellm.completion(
                    model=model,
                    messages=st.session_state.messages,
                    temperature=temperature,
                    max_tokens=max_tokens,
                    caching=True,  # Enable caching
                    metadata={
                        "app": "streamlit_demo",
                        "user": "demo_user"
                    }
                )

                # Extract response
                answer = response.choices[0].message.content
                st.markdown(answer)

                # Calculate cost
                cost = litellm.completion_cost(completion_response=response)
                st.session_state.total_cost += cost

                # Show metrics
                col1, col2, col3 = st.columns(3)
                with col1:
                    st.metric("Tokens Used", response.usage.total_tokens)
                with col2:
                    st.metric("Response Cost", f"${cost:.4f}")
                with col3:
                    st.metric("Cache Hit", "✅" if response._hidden_params.get("cache_hit") else "❌")

                # Add to history
                st.session_state.messages.append({"role": "assistant", "content": answer})

            except Exception as e:
                st.error(f"Error: {str(e)}")

# Additional features
with st.expander("📊 Usage Analytics"):
    # Create mock analytics data
    df = pd.DataFrame({
        'Time': pd.date_range(start='1/1/2024', periods=24, freq='h'),
        'Requests': [10, 15, 20, 25, 30, 28, 25, 20, 15, 12,
                    10, 8, 5, 3, 5, 8, 12, 18, 25, 30,
                    35, 32, 28, 20],
        'Cost': [0.05, 0.08, 0.10, 0.12, 0.15, 0.14, 0.12, 0.10, 0.08, 0.06,
                0.05, 0.04, 0.03, 0.02, 0.03, 0.04, 0.06, 0.09, 0.12, 0.15,
                0.18, 0.16, 0.14, 0.10]
    })

    fig = px.line(df, x='Time', y=['Requests', 'Cost'],
                  title='Usage Over Time',
                  labels={'value': 'Count', 'variable': 'Metric'})
    st.plotly_chart(fig, use_container_width=True)

# Footer
st.markdown("---")
st.markdown("Built with lessons from LLM Optimization Course")

Deploying to Streamlit Cloud

1. Prepare your repository:

# requirements.txt
streamlit==1.28.0
litellm==1.0.0
langfuse==2.0.0
python-dotenv==1.0.0
pandas==2.0.0
plotly==5.17.0

2. Push to GitHub:

git init
git add .
git commit -m "Initial LLM app"
git remote add origin https://github.com/yourusername/llm-demo
git push -u origin main

3. Deploy on Streamlit Cloud:

  1. Go to share.streamlit.io
  2. Click "New app"
  3. Connect your GitHub repository
  4. Add secrets in the dashboard (see the note after these steps):
    • OPENAI_API_KEY = "sk-..."
    • LANGFUSE_PUBLIC_KEY = "pk-..."
    • LANGFUSE_SECRET_KEY = "sk-..."
  5. Click "Deploy"!

Result: Your app is live at https://yourapp.streamlit.app in under 5 minutes!
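
A note on the secrets you add in step 4: on Streamlit Cloud they live in st.secrets rather than a local .env file. Here is a minimal sketch of a helper that works both locally and on Streamlit Cloud; the get_secret name is just an illustration, not part of any library:

# secrets_helper.py - hypothetical helper, assuming the key names from step 4
import os
import streamlit as st

def get_secret(name: str):
    """Prefer a real environment variable, fall back to Streamlit's secrets store."""
    value = os.getenv(name)
    if value:
        return value
    try:
        return st.secrets[name]
    except (KeyError, FileNotFoundError):
        return None

# At startup, make the key visible to libraries (like LiteLLM) that read os.environ:
os.environ.setdefault("OPENAI_API_KEY", get_secret("OPENAI_API_KEY") or "")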

Part 3: VM Deployment - The Hidden Gem for MVPs

This is the most underrated deployment option. For $5-20/month, you can serve thousands of users with full control.

Why VMs Are Perfect for MVPs

Most tutorials jump straight from Streamlit to Kubernetes, missing the sweet spot: a simple VM with Nginx/Caddy. Here's why this is often the best choice:

Advantages:

  • 💰 Fixed cost: $5-20/month regardless of traffic
  • 🚀 Fast deployment: Production-ready in 2 hours
  • 🎛️ Full control: SSH access, custom configurations
  • 📈 Easy scaling: Just upgrade the VM
  • 🔒 Free SSL: Let's Encrypt certificates
  • 🐛 Simple debugging: Just check the logs

Perfect for:

  • MVPs and early-stage startups
  • Side projects that might grow
  • Internal tools for small teams
  • Learning production deployment

Step 1: Choose Your VM Provider

# Quick provider comparison
providers = {
    "DigitalOcean": {
        "price": "$6/mo",
        "specs": "1GB RAM, 25GB SSD, 1TB transfer",
        "pros": "Best tutorials, $200 credit",
        "setup_time": "55 seconds"
    },
    "Hetzner": {
        "price": "€4.51/mo",
        "specs": "2GB RAM, 20GB SSD, 20TB transfer",
        "pros": "Best value in Europe",
        "setup_time": "10 seconds"
    },
    "Linode": {
        "price": "$5/mo",
        "specs": "1GB RAM, 25GB SSD, 1TB transfer",
        "pros": "Cheapest option, $100 credit",
        "setup_time": "60 seconds"
    },
    "Oracle Cloud": {
        "price": "FREE",
        "specs": "1GB RAM, 2 AMD cores, Always free",
        "pros": "Actually free forever",
        "setup_time": "5 minutes"
    }
}

For this guide, we'll use DigitalOcean for its excellent documentation.

Step 2: Initial Server Setup

# Create a new droplet (DigitalOcean's VM)
# Choose: Ubuntu 22.04, Basic, $6/mo, your nearest region

# SSH into your server
ssh root@your_server_ip

# Create a non-root user
adduser deploy
usermod -aG sudo deploy

# Set up basic firewall
ufw allow OpenSSH
ufw allow 80
ufw allow 443
ufw --force enable

# Update system
apt update && apt upgrade -y

# Install essentials
apt install -y python3-pip python3-venv nginx certbot python3-certbot-nginx git htop

# Optional but recommended: fail2ban for security
apt install -y fail2ban
systemctl enable fail2ban
systemctl start fail2ban

Step 3: Deploy Your FastAPI Application

# Switch to deploy user
su - deploy

# Clone your repository
git clone https://github.com/yourusername/llm-api.git
cd llm-api

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Create .env file
cat > .env << 'EOF'
OPENAI_API_KEY=sk-...
LANGFUSE_PUBLIC_KEY=pk-...
LANGFUSE_SECRET_KEY=sk-...
REDIS_URL=redis://localhost:6379
EOF

# Install and setup Redis for caching
sudo apt install -y redis-server
sudo systemctl enable redis-server
sudo systemctl start redis-server
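
Before relying on the cache, confirm Redis is actually answering:

# Should reply with PONG
redis-cli ping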

Step 4: Create the FastAPI Application

# main.py - Production-ready FastAPI app
from fastapi import FastAPI, HTTPException, Request
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import JSONResponse
from pydantic import BaseModel
from typing import List, Optional
import litellm
import uvicorn
from datetime import datetime
import os
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Initialize FastAPI
app = FastAPI(
    title="LLM API",
    description="Production LLM API with all optimizations",
    version="1.0.0"
)

# CORS configuration
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # Configure for your domain in production
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# Enable optimizations from previous lessons
litellm.success_callback = ["langfuse"]
litellm.cache = litellm.Cache(type="redis")

# Request/Response models
class ChatRequest(BaseModel):
    messages: List[dict]
    model: Optional[str] = "gpt-4o-mini"
    temperature: Optional[float] = 0.7
    max_tokens: Optional[int] = 500

class ChatResponse(BaseModel):
    response: str
    model: str
    tokens_used: int
    cost: float
    cached: bool
    timestamp: str

# Health check endpoint
@app.get("/")
async def root():
    return {
        "status": "healthy",
        "timestamp": datetime.now().isoformat(),
        "version": "1.0.0"
    }

@app.get("/health")
async def health_check():
    return {"status": "healthy"}

# Main chat endpoint
@app.post("/chat", response_model=ChatResponse)
async def chat(request: ChatRequest):
    try:
        # Call LLM with optimizations
        response = await litellm.acompletion(
            model=request.model,
            messages=request.messages,
            temperature=request.temperature,
            max_tokens=request.max_tokens,
            caching=True
        )

        # Calculate cost
        cost = litellm.completion_cost(completion_response=response)

        return ChatResponse(
            response=response.choices[0].message.content,
            model=response.model,
            tokens_used=response.usage.total_tokens,
            cost=cost,
            cached=response._hidden_params.get("cache_hit", False),
            timestamp=datetime.now().isoformat()
        )

    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

# Error handler
@app.exception_handler(Exception)
async def global_exception_handler(request: Request, exc: Exception):
    return JSONResponse(
        status_code=500,
        content={
            "error": "Internal server error",
            "detail": str(exc) if os.getenv("DEBUG") else "An error occurred"
        }
    )

if __name__ == "__main__":
    uvicorn.run(
        "main:app",
        host="0.0.0.0",
        port=8000,
        reload=False,  # Disable auto-reload in production
        workers=4
    )

Step 5: Set Up Systemd Service

# Create systemd service file
sudo nano /etc/systemd/system/llm-api.service
[Unit]
Description=LLM API Service
After=network.target

[Service]
Type=simple
User=deploy
WorkingDirectory=/home/deploy/llm-api
Environment="PATH=/home/deploy/llm-api/venv/bin"
ExecStart=/home/deploy/llm-api/venv/bin/uvicorn main:app --host 0.0.0.0 --port 8000 --workers 4
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
# Enable and start the service
sudo systemctl daemon-reload
sudo systemctl enable llm-api
sudo systemctl start llm-api

# Check status
sudo systemctl status llm-api

# View logs
sudo journalctl -u llm-api -f
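
Before putting Nginx in front of the service, run a quick smoke test against it directly on the VM:

# Health check should return {"status":"healthy"}
curl http://127.0.0.1:8000/health

# And a real completion request
curl -X POST http://127.0.0.1:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'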

Step 6: Configure Nginx as Reverse Proxy

# sudo nano /etc/nginx/sites-available/llm-api

# The rate-limit zone must be declared at the http level, not inside server {}.
# Files in sites-available are included within the http block, so the top of this file works.
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;

server {
    listen 80;
    server_name your-domain.com www.your-domain.com;

    # API rate limiting
    limit_req zone=api_limit burst=20 nodelay;

    # Proxy settings
    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection 'upgrade';
        proxy_set_header Host $host;
        proxy_cache_bypass $http_upgrade;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Timeouts for LLM responses
        proxy_connect_timeout 60s;
        proxy_send_timeout 60s;
        proxy_read_timeout 60s;
    }

    # Security headers
    add_header X-Frame-Options "SAMEORIGIN" always;
    add_header X-Content-Type-Options "nosniff" always;
    add_header X-XSS-Protection "1; mode=block" always;
}
# Enable the site
sudo ln -s /etc/nginx/sites-available/llm-api /etc/nginx/sites-enabled/
sudo nginx -t
sudo systemctl reload nginx

Step 7: SSL with Certbot (Free HTTPS)

# Get SSL certificate
sudo certbot --nginx -d your-domain.com -d www.your-domain.com

# Auto-renewal is already set up, but verify:
sudo systemctl status certbot.timer

# Test renewal
sudo certbot renew --dry-run

Your Nginx config is automatically updated with SSL!

Alternative: Using Caddy (Even Simpler)

Caddy is an alternative to Nginx that handles SSL automatically:

# Install Caddy
sudo apt install -y debian-keyring debian-archive-keyring apt-transport-https
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/gpg.key' | sudo gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/debian.deb.txt' | sudo tee /etc/apt/sources.list.d/caddy-stable.list
sudo apt update
sudo apt install caddy

# Configure Caddy
sudo nano /etc/caddy/Caddyfile
your-domain.com {
    reverse_proxy localhost:8000

    # Automatic HTTPS!
    tls your-email@example.com

    # Compression
    encode gzip

    # Rate limiting (requires a Caddy build that includes the caddy-ratelimit plugin)
    rate_limit {
        zone api {
            key {remote_host}
            events 10
            window 1s
        }
    }

    # Security headers
    header {
        X-Frame-Options "SAMEORIGIN"
        X-Content-Type-Options "nosniff"
        X-XSS-Protection "1; mode=block"
        -Server
    }
}
# Restart Caddy
sudo systemctl reload caddy

That's it! Caddy automatically gets and renews SSL certificates.

Part 4: Production APIs with Managed Platforms

Once your MVP gains traction, you might want the convenience of managed platforms.

FastAPI on Railway

Railway combines the simplicity of Heroku with modern features:

# railway.toml
[build]
builder = "NIXPACKS"

[deploy]
healthcheckPath = "/health"
healthcheckTimeout = 100
restartPolicyType = "ON_FAILURE"
restartPolicyMaxRetries = 10
# Deploy to Railway
railway login
railway link
railway up

# Add environment variables
railway variables set OPENAI_API_KEY=sk-...
railway variables set LANGFUSE_PUBLIC_KEY=pk-...

# Get your URL
railway open
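
Nixpacks usually detects a Python app on its own, but it doesn't always know how to start it. One common approach (a sketch, assuming the FastAPI app from Part 3 lives in main.py) is to add a Procfile at the repo root; Railway injects the PORT variable at runtime:

web: uvicorn main:app --host 0.0.0.0 --port $PORT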

Comparison of Managed Platforms

Platform   Pros                     Cons                    Best For
Railway    Great DX, fair pricing   Newer platform          Full-stack apps
Render     Simple, reliable         Can be slow             Simple APIs
Fly.io     Global deployment        Complex for beginners   Global apps
Heroku     Mature, stable           Expensive               Enterprise

Part 5: Serverless Deployment

For variable traffic or cost optimization, serverless can be ideal.

AWS Lambda Deployment

# lambda_function.py
import json
import litellm
from mangum import Mangum
from fastapi import FastAPI
import os

# Initialize outside handler for connection reuse
litellm.success_callback = ["langfuse"]

# For using FastAPI with Lambda
app = FastAPI()

@app.get("/health")
def health_check():
    return {"status": "healthy"}

@app.post("/chat")
async def chat(request: dict):
    try:
        messages = request.get("messages", [])
        model = request.get("model", "gpt-4o-mini")

        response = await litellm.acompletion(
            model=model,
            messages=messages,
            timeout=25,  # Lambda timeout buffer
            caching=True
        )

        return {
            "response": response.choices[0].message.content,
            "model": model,
            "cost": litellm.completion_cost(completion_response=response)
        }
    except Exception as e:
        return {"error": str(e)}

# Lambda handler
handler = Mangum(app)

Serverless Framework Deployment

# serverless.yml
service: llm-api-serverless

provider:
  name: aws
  runtime: python3.11
  region: us-east-1
  timeout: 30
  memorySize: 1024
  environment:
    OPENAI_API_KEY: ${ssm:/llm-api/openai-key}
    LANGFUSE_PUBLIC_KEY: ${ssm:/llm-api/langfuse-public}

functions:
  api:
    handler: lambda_function.handler
    events:
      - httpApi:
          path: /{proxy+}
          method: ANY
      - httpApi:
          path: /
          method: ANY
    reservedConcurrency: 10  # Control costs

plugins:
  - serverless-python-requirements
  - serverless-plugin-warmup

custom:
  pythonRequirements:
    dockerizePip: true
    layer: true  # Create a layer for dependencies
  warmup:
    enabled: true
    schedule: rate(5 minutes)  # Keep warm to avoid cold starts
# Deploy
npm install -g serverless
serverless deploy

# Check logs
serverless logs -f api -t

# Remove
serverless remove
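
The serverless-python-requirements plugin bundles whatever is listed in requirements.txt next to serverless.yml. A minimal set for the handler above might look like this (unpinned here for brevity; pin versions in a real project):

# requirements.txt
fastapi
mangum
litellm
langfuse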

Part 6: Docker Deployment

For consistency and portability, Docker is essential.

Multi-Stage Dockerfile for Production

# Dockerfile
# Stage 1: Builder
FROM python:3.11-slim as builder

WORKDIR /app

# Install build dependencies
RUN apt-get update && apt-get install -y \
    gcc \
    g++ \
    && rm -rf /var/lib/apt/lists/*

# Copy requirements and install dependencies
COPY requirements.txt .
RUN pip install --user --no-cache-dir -r requirements.txt

# Stage 2: Runtime
FROM python:3.11-slim

WORKDIR /app

# Create non-root user
RUN useradd -m -u 1000 appuser && chown -R appuser:appuser /app

# Copy Python packages from builder
COPY --from=builder --chown=appuser:appuser /root/.local /home/appuser/.local

# Copy application code
COPY --chown=appuser:appuser . .

# Switch to non-root user
USER appuser

# Update PATH
ENV PATH=/home/appuser/.local/bin:$PATH

# Health check (uses the standard library, so no extra packages are required in the image)
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1

# Expose port
EXPOSE 8000

# Run application
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "4"]

Deploy to Cloud Run (Google Cloud)

# Build and push to Google Container Registry
gcloud builds submit --tag gcr.io/PROJECT_ID/llm-api

# Deploy to Cloud Run
gcloud run deploy llm-api \
  --image gcr.io/PROJECT_ID/llm-api \
  --platform managed \
  --region us-central1 \
  --allow-unauthenticated \
  --set-env-vars="OPENAI_API_KEY=sk-..." \
  --min-instances=1 \
  --max-instances=10 \
  --memory=2Gi \
  --cpu=2 \
  --timeout=60
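
Passing the API key with --set-env-vars works, but it lands in your shell history and in the service's metadata. A safer sketch, assuming Secret Manager is enabled and the Cloud Run service account has the Secret Manager Secret Accessor role:

# Store the key in Secret Manager (value is read from stdin)
echo -n "sk-..." | gcloud secrets create openai-key --data-file=-

# Reference the secret at deploy time instead of --set-env-vars
gcloud run deploy llm-api \
  --image gcr.io/PROJECT_ID/llm-api \
  --region us-central1 \
  --set-secrets="OPENAI_API_KEY=openai-key:latest"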

Part 7: Kubernetes Deployment (Advanced)

For enterprise-scale deployments with high availability requirements.

Kubernetes Manifests

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-api
  namespace: llm-apps
  labels:
    app: llm-api
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: llm-api
  template:
    metadata:
      labels:
        app: llm-api
    spec:
      containers:
      - name: llm-api
        image: your-registry/llm-api:latest
        ports:
        - containerPort: 8000
          name: http
        env:
        - name: OPENAI_API_KEY
          valueFrom:
            secretKeyRef:
              name: llm-secrets
              key: openai-api-key
        resources:
          requests:
            memory: "512Mi"
            cpu: "250m"
          limits:
            memory: "1Gi"
            cpu: "500m"
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 5
          periodSeconds: 5
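
The Deployment above expects a Secret called llm-secrets and still needs a Service in front of it. A minimal sketch of both (the names match the manifest; everything else is up to your cluster's conventions):

# Create the namespace and the secret the Deployment references
kubectl create namespace llm-apps
kubectl create secret generic llm-secrets \
  --namespace llm-apps \
  --from-literal=openai-api-key=sk-...

# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: llm-api
  namespace: llm-apps
spec:
  selector:
    app: llm-api
  ports:
  - port: 80
    targetPort: 8000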

Part 8: Production Best Practices

Security Checklist

# security_middleware.py
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address
import time

# Rate limiting
limiter = Limiter(key_func=get_remote_address)

def setup_security(app: FastAPI):
    """Add security layers to FastAPI app"""

    # Rate limiting: attach the limiter and register its error handler
    app.state.limiter = limiter
    app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

    # Security headers middleware
    @app.middleware("http")
    async def add_security_headers(request: Request, call_next):
        response = await call_next(request)
        response.headers["X-Content-Type-Options"] = "nosniff"
        response.headers["X-Frame-Options"] = "DENY"
        response.headers["X-XSS-Protection"] = "1; mode=block"
        response.headers["Strict-Transport-Security"] = "max-age=31536000; includeSubDomains"
        return response

    # Request validation
    @app.middleware("http")
    async def validate_request(request: Request, call_next):
        # Check content type for POST requests
        if request.method == "POST":
            content_type = request.headers.get("content-type")
            if not content_type or "application/json" not in content_type:
                return JSONResponse(
                    status_code=400,
                    content={"error": "Content-Type must be application/json"}
                )

        # Add request ID for tracing
        request_id = request.headers.get("X-Request-ID") or str(time.time())
        response = await call_next(request)
        response.headers["X-Request-ID"] = request_id

        return response
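
With the limiter attached, individual endpoints opt in via a decorator. A self-contained sketch of the documented slowapi pattern (slowapi needs the Request object as a parameter of the decorated endpoint):

# rate_limited_app.py - minimal slowapi wiring example
from fastapi import FastAPI, Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
app = FastAPI()
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.get("/ping")
@limiter.limit("10/minute")  # 10 requests per minute per client IP
async def ping(request: Request):
    # slowapi reads the client address from this Request parameter
    return {"status": "ok"}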

Monitoring Setup

# monitoring.py
from prometheus_client import Counter, Histogram, Gauge, generate_latest
from fastapi import FastAPI
from fastapi.responses import PlainTextResponse
import psutil

# Metrics
request_count = Counter(
    'llm_api_requests_total',
    'Total number of requests',
    ['method', 'endpoint', 'status']
)

request_duration = Histogram(
    'llm_api_request_duration_seconds',
    'Request duration in seconds',
    ['method', 'endpoint']
)

active_requests = Gauge(
    'llm_api_active_requests',
    'Number of active requests'
)

def setup_monitoring(app: FastAPI):
    """Add monitoring endpoints and middleware"""

    @app.get("/metrics", response_class=PlainTextResponse)
    async def metrics():
        """Prometheus metrics endpoint"""
        return generate_latest()

    @app.get("/health/detail")
    async def health_detail():
        """Detailed health check"""

        # Check system resources
        cpu_percent = psutil.cpu_percent(interval=1)
        memory = psutil.virtual_memory()
        disk = psutil.disk_usage('/')

        health_status = {
            "status": "healthy",
            "checks": {
                "cpu_usage": f"{cpu_percent}%",
                "memory_usage": f"{memory.percent}%",
                "disk_usage": f"{disk.percent}%"
            }
        }

        return health_status
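
Defining the metrics is only half the job; something has to update them. A minimal sketch of a middleware that feeds the three collectors above (it can live in the same monitoring.py and be registered alongside setup_monitoring):

# monitoring.py (continued) - record request metrics
import time
from fastapi import Request

def setup_metrics_middleware(app: FastAPI):
    """Update the Prometheus collectors defined above on every request."""

    @app.middleware("http")
    async def record_metrics(request: Request, call_next):
        active_requests.inc()
        start = time.time()
        status = 500  # assume failure until a response comes back
        try:
            response = await call_next(request)
            status = response.status_code
            return response
        finally:
            active_requests.dec()
            request_count.labels(
                method=request.method,
                endpoint=request.url.path,
                status=str(status)
            ).inc()
            request_duration.labels(
                method=request.method,
                endpoint=request.url.path
            ).observe(time.time() - start)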

Part 9: Choosing the Right Path

Decision Matrix

def deployment_advisor():
    """Interactive deployment advisor"""

    print("🚀 LLM Deployment Advisor")
    print("-" * 40)

    # Ask questions
    users = int(input("Expected daily users: "))
    budget = int(input("Monthly budget ($): "))
    team_size = int(input("Team size: "))
    has_devops = input("Dedicated DevOps? (y/n): ").lower() == 'y'
    need_gpu = input("Need GPU? (y/n): ").lower() == 'y'

    # Recommendations
    print("\n🎯 Recommendations:")
    print("-" * 40)

    if users < 100 and budget < 10:
        print("✅ Start with Streamlit Cloud (Free)")
        print("   - Deploy in 5 minutes")
        print("   - Perfect for demos")

    elif users < 1000 and budget < 30:
        print("✅ Use a $10 VM with Nginx")
        print("   - Best value for money")
        print("   - Full control")
        print("   - Can handle 1000+ users easily")

    elif users < 10000 and budget < 100:
        print("✅ Use Railway or Render")
        print("   - Managed platform")
        print("   - Auto-scaling available")
        print("   - Good developer experience")

    elif need_gpu:
        print("✅ Use Docker + Cloud GPU")
        print("   - AWS EC2 with GPU")
        print("   - Or Paperspace/Lambda Labs")

    elif users > 10000 or has_devops:
        print("✅ Consider Kubernetes")
        print("   - Use managed K8s (EKS/GKE)")
        print("   - High availability")
        print("   - Complex but powerful")

    else:
        print("✅ Use Serverless (AWS Lambda)")
        print("   - Pay per request")
        print("   - Auto-scaling")
        print("   - No server management")

deployment_advisor()

Migration Path

Streamlit POC → VM + Nginx MVP → Managed Platform → Docker + Cloud → Kubernetes
              ↘ Serverless → Docker + Cloud

Final Recommendations

  1. Start Simple: Don't over-engineer. A $10 VM can handle most MVPs.
  2. VM First: Before considering Kubernetes, try a VM. You'll be surprised how far it can take you.
  3. Monitor Everything: Use the Langfuse integration from day one.
  4. Cache Aggressively: Use the caching strategies from previous lessons.
  5. Security First: Never expose API keys, always use HTTPS.
  6. Automate Deployment: Even for VMs, automate with simple bash scripts (see the sketch below).
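
For the VM setup from Part 3, that script can be very small. A hedged sketch, reusing the paths and service name from earlier (it assumes the deploy user can restart the service without a password prompt, e.g. via a sudoers rule):

#!/usr/bin/env bash
# deploy.sh - pull the latest code and restart the API service
set -euo pipefail

ssh deploy@your_server_ip << 'EOF'
  cd ~/llm-api
  git pull origin main
  source venv/bin/activate
  pip install -r requirements.txt
  sudo systemctl restart llm-api
  sudo systemctl status llm-api --no-pager
EOF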

Summary

We've covered the complete deployment spectrum:

Stage        Solution            Cost         Complexity   When to Use
Prototype    Streamlit           $0           ⭐            Day 1-7
MVP          VM + Nginx          $10/mo       ⭐⭐           Week 2-Month 3
Growth       Railway/Render      $20-50/mo    ⭐⭐           Month 3-6
Scale        Serverless/Docker   $50-200/mo   ⭐⭐⭐          Month 6-12
Enterprise   Kubernetes          $200+/mo     ⭐⭐⭐⭐⭐        Year 2+

Key Takeaways:

  • VMs are underrated - A simple VM with Nginx can handle thousands of users for $10/month
  • Start with Streamlit for demos, but quickly move to FastAPI for production
  • Most projects never need Kubernetes - Don't add complexity prematurely
  • Serverless isn't always cheaper - At scale, VMs or containers can be more cost-effective
  • Security can't be an afterthought - SSL certificates, API keys, and rate limiting from day one
  • Monitoring pays for itself - You'll save more in optimization than you spend on observability

The best deployment is the one that gets to production fastest, fits your budget, and your team can maintain. Don't let perfect be the enemy of good. A simple VM serving real users beats a perfect Kubernetes setup with no users every time.

Your assignment: Deploy something this week. Start with Streamlit if you must, but get it live. Real users will teach you more than any tutorial.

Good luck, and happy deploying! 🚀

P.S. - If you're reading this and thinking "but what about [complex scenario]?" - you probably don't need it yet. Ship first, optimize later.