Deploying LLM Applications: From Prototype to Production
The complete guide to deploying your LLM application - from a $5 VM to enterprise Kubernetes
What You'll Learn
In this comprehensive guide, we'll cover every deployment option for your LLM applications, progressing from simple to complex:
- Quick Prototyping with Streamlit Cloud (5 minutes)
- VM Deployment with SSL certificates (2 hours) - The hidden gem for MVPs
- Production APIs with FastAPI/Flask on managed platforms (1 day)
- Serverless deployment for variable workloads (1 week)
- Containerized deployment with Docker (2 weeks)
- Enterprise scale with Kubernetes (1 month)
By the end, you'll know exactly which deployment strategy fits your needs, budget, and timeline.
Prerequisites
From our previous lessons, you should already have:
- ✅ Observability setup (Langfuse)
- ✅ Response caching implemented
- ✅ Model selection strategy
- ✅ Cost optimization techniques
Now let's deploy these optimized applications to production!
Part 1: The Deployment Decision Tree
Before diving into specifics, here's how to choose your deployment strategy:
```python
def choose_deployment_strategy(requirements):
    """
    Find the right deployment approach for your needs
    """
    # Just testing or showing a demo?
    if requirements["users"] < 100 and requirements["internal_only"]:
        return "Streamlit Cloud (Free)"

    # Building an MVP or side project?
    elif requirements["budget"] < 20 and requirements["users"] < 1000:
        return "VM with Nginx/Caddy ($5-20/mo)"

    # Need a production API?
    elif requirements["need_api"] and requirements["users"] < 10000:
        return "FastAPI on Railway/Render ($20-50/mo)"

    # Variable or unpredictable traffic?
    elif requirements["variable_traffic"] and requirements["budget_conscious"]:
        return "Serverless - AWS Lambda/Vercel (Pay per use)"

    # Complex dependencies or need GPUs?
    elif requirements["need_gpu"] or requirements["complex_dependencies"]:
        return "Docker on Cloud Run/ECS ($50-200/mo)"

    # Enterprise with high availability needs?
    elif requirements["enterprise"] and requirements["high_availability"]:
        return "Kubernetes ($200+/mo)"

    else:
        return "Start with a VM, scale later"
```
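For example, a side project expecting a few hundred daily users on a roughly $15/month budget lands squarely on the VM option:

```python
# Example input for the advisor above - a low-budget side project
requirements = {
    "users": 500,             # expected daily users
    "internal_only": False,
    "budget": 15,             # monthly budget in $
    "need_api": False,
    "variable_traffic": False,
    "budget_conscious": True,
    "need_gpu": False,
    "complex_dependencies": False,
    "enterprise": False,
    "high_availability": False,
}

print(choose_deployment_strategy(requirements))
# -> "VM with Nginx/Caddy ($5-20/mo)"
```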
Quick Cost Comparison
| Deployment Type | Monthly Cost | Setup Time | Best For |
|---|---|---|---|
| Streamlit Cloud | $0-2 | 5 min | Demos, POCs |
| VM + Nginx | $5-20 | 2 hours | MVPs, startups |
| Managed Platforms | $20-50 | 30 min | Growing apps |
| Serverless | $0-100* | 1 hour | Variable traffic |
| Docker + Cloud | $50-200 | 4 hours | Complex apps |
| Kubernetes | $200+ | Days | Enterprise |
*Depends on usage
Part 2: Quick Start - Streamlit Cloud Deployment
Let's start with the simplest option - perfect for demos and internal tools.
When to Use Streamlit Cloud
✅ Perfect for:
- Proof of concepts
- Internal dashboards
- Data science demos
- Sharing with stakeholders
❌ Not suitable for:
- Production APIs
- Mobile apps
- High-traffic applications
- Complex authentication needs
Building Your Streamlit App
```python
# app.py
import streamlit as st
import litellm
from dotenv import load_dotenv
import pandas as pd
import plotly.express as px
from datetime import datetime

# Load environment variables
load_dotenv()

# Enable observability from previous lessons
litellm.success_callback = ["langfuse"]
litellm.cache = litellm.Cache()  # Enable caching

# Page config
st.set_page_config(
    page_title="LLM Demo App",
    page_icon="🤖",
    layout="wide"
)

# Sidebar for configuration
with st.sidebar:
    st.header("Configuration")

    model = st.selectbox(
        "Select Model",
        ["gpt-4o-mini", "claude-3-haiku-20240307", "gpt-3.5-turbo"]
    )

    temperature = st.slider("Temperature", 0.0, 2.0, 0.7)
    max_tokens = st.number_input("Max Tokens", 50, 2000, 500)

    # Cost tracking from previous lessons
    if 'total_cost' not in st.session_state:
        st.session_state.total_cost = 0
    st.metric("Session Cost", f"${st.session_state.total_cost:.4f}")

# Main app
st.title("🤖 Production-Ready LLM Demo")
st.markdown("Deployed with all optimizations from our course!")

# Chat interface
if 'messages' not in st.session_state:
    st.session_state.messages = []

# Display chat history
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

# User input
if prompt := st.chat_input("Ask me anything..."):
    # Add to chat history
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)

    # Generate response
    with st.chat_message("assistant"):
        with st.spinner("Thinking..."):
            try:
                # Use optimizations from previous lessons
                response = litellm.completion(
                    model=model,
                    messages=st.session_state.messages,
                    temperature=temperature,
                    max_tokens=max_tokens,
                    caching=True,  # Enable caching
                    metadata={
                        "app": "streamlit_demo",
                        "user": "demo_user"
                    }
                )

                # Extract response
                answer = response.choices[0].message.content
                st.markdown(answer)

                # Calculate cost
                cost = litellm.completion_cost(completion_response=response)
                st.session_state.total_cost += cost

                # Show metrics
                col1, col2, col3 = st.columns(3)
                with col1:
                    st.metric("Tokens Used", response.usage.total_tokens)
                with col2:
                    st.metric("Response Cost", f"${cost:.4f}")
                with col3:
                    st.metric("Cache Hit", "✅" if response._hidden_params.get("cache_hit") else "❌")

                # Add to history
                st.session_state.messages.append({"role": "assistant", "content": answer})

            except Exception as e:
                st.error(f"Error: {str(e)}")

# Additional features
with st.expander("📊 Usage Analytics"):
    # Create mock analytics data
    df = pd.DataFrame({
        'Time': pd.date_range(start='1/1/2024', periods=24, freq='h'),
        'Requests': [10, 15, 20, 25, 30, 28, 25, 20, 15, 12, 10, 8,
                     5, 3, 5, 8, 12, 18, 25, 30, 35, 32, 28, 20],
        'Cost': [0.05, 0.08, 0.10, 0.12, 0.15, 0.14, 0.12, 0.10, 0.08, 0.06, 0.05, 0.04,
                 0.03, 0.02, 0.03, 0.04, 0.06, 0.09, 0.12, 0.15, 0.18, 0.16, 0.14, 0.10]
    })

    fig = px.line(df, x='Time', y=['Requests', 'Cost'],
                  title='Usage Over Time',
                  labels={'value': 'Count', 'variable': 'Metric'})
    st.plotly_chart(fig, use_container_width=True)

# Footer
st.markdown("---")
st.markdown("Built with lessons from LLM Optimization Course")
```
Deploying to Streamlit Cloud
1. Prepare your repository:
```text
# requirements.txt
streamlit==1.28.0
litellm==1.0.0
langfuse==2.0.0
python-dotenv==1.0.0
pandas==2.0.0
plotly==5.17.0
```
2. Push to GitHub:
```bash
git init
git add .
git commit -m "Initial LLM app"
git remote add origin https://github.com/yourusername/llm-demo
git push -u origin main
```
3. Deploy on Streamlit Cloud:
- Go to share.streamlit.io
- Click "New app"
- Connect your GitHub repository
- Add secrets in the dashboard:
OPENAI_API_KEY = "sk-..."
LANGFUSE_PUBLIC_KEY = "pk-..."
LANGFUSE_SECRET_KEY = "sk-..."
- Click "Deploy"!
Result: Your app is live at https://yourapp.streamlit.app in under 5 minutes!
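One note on secrets: Streamlit Cloud exposes dashboard secrets through `st.secrets`, while our app reads them via environment variables. If the two don't line up on your deployment, a small bridge at the top of `app.py` keeps litellm and Langfuse happy. A minimal sketch, assuming the key names shown above:

```python
# At the top of app.py, after the imports - only needed if the
# dashboard secrets aren't already visible as environment variables.
import os
import streamlit as st

for key in ("OPENAI_API_KEY", "LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY"):
    if key in st.secrets and key not in os.environ:
        os.environ[key] = st.secrets[key]
```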
Part 3: VM Deployment - The Hidden Gem for MVPs
This is the most underrated deployment option. For $5-20/month, you can serve thousands of users with full control.
Why VMs Are Perfect for MVPs
Most tutorials jump straight from Streamlit to Kubernetes, missing the sweet spot: a simple VM with Nginx/Caddy. Here's why this is often the best choice:
Advantages:
- 💰 Fixed cost: $5-20/month regardless of traffic
- 🚀 Fast deployment: Production-ready in 2 hours
- 🎛️ Full control: SSH access, custom configurations
- 📈 Easy scaling: Just upgrade the VM
- 🔒 Free SSL: Let's Encrypt certificates
- 🐛 Simple debugging: Just check the logs
Perfect for:
- MVPs and early-stage startups
- Side projects that might grow
- Internal tools for small teams
- Learning production deployment
Step 1: Choose Your VM Provider
```python
# Quick provider comparison
providers = {
    "DigitalOcean": {
        "price": "$6/mo",
        "specs": "1GB RAM, 25GB SSD, 1TB transfer",
        "pros": "Best tutorials, $200 credit",
        "setup_time": "55 seconds"
    },
    "Hetzner": {
        "price": "€4.51/mo",
        "specs": "2GB RAM, 20GB SSD, 20TB transfer",
        "pros": "Best value in Europe",
        "setup_time": "10 seconds"
    },
    "Linode": {
        "price": "$5/mo",
        "specs": "1GB RAM, 25GB SSD, 1TB transfer",
        "pros": "Cheapest option, $100 credit",
        "setup_time": "60 seconds"
    },
    "Oracle Cloud": {
        "price": "FREE",
        "specs": "1GB RAM, 2 AMD cores, Always free",
        "pros": "Actually free forever",
        "setup_time": "5 minutes"
    }
}
```
For this guide, we'll use DigitalOcean for its excellent documentation.
Step 2: Initial Server Setup
```bash
# Create a new droplet (DigitalOcean's VM)
# Choose: Ubuntu 22.04, Basic, $6/mo, your nearest region

# SSH into your server
ssh root@your_server_ip

# Create a non-root user
adduser deploy
usermod -aG sudo deploy

# Set up basic firewall
ufw allow OpenSSH
ufw allow 80
ufw allow 443
ufw --force enable

# Update system
apt update && apt upgrade -y

# Install essentials
apt install -y python3-pip python3-venv nginx certbot python3-certbot-nginx git htop

# Optional but recommended: fail2ban for security
apt install -y fail2ban
systemctl enable fail2ban
systemctl start fail2ban
```
Step 3: Deploy Your FastAPI Application
```bash
# Switch to deploy user
su - deploy

# Clone your repository
git clone https://github.com/yourusername/llm-api.git
cd llm-api

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Create .env file
cat > .env << 'EOF'
OPENAI_API_KEY=sk-...
LANGFUSE_PUBLIC_KEY=pk-...
LANGFUSE_SECRET_KEY=sk-...
REDIS_URL=redis://localhost:6379
EOF

# Install and set up Redis for caching
sudo apt install -y redis-server
sudo systemctl enable redis-server
sudo systemctl start redis-server
```
Step 4: Create the FastAPI Application
```python
# main.py - Production-ready FastAPI app
from fastapi import FastAPI, HTTPException, Request
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import JSONResponse
from pydantic import BaseModel
from typing import List, Optional
import litellm
import uvicorn
from datetime import datetime
import os
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Initialize FastAPI
app = FastAPI(
    title="LLM API",
    description="Production LLM API with all optimizations",
    version="1.0.0"
)

# CORS configuration
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # Configure for your domain in production
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# Enable optimizations from previous lessons
litellm.success_callback = ["langfuse"]
litellm.cache = litellm.Cache(type="redis")

# Request/Response models
class ChatRequest(BaseModel):
    messages: List[dict]
    model: Optional[str] = "gpt-4o-mini"
    temperature: Optional[float] = 0.7
    max_tokens: Optional[int] = 500

class ChatResponse(BaseModel):
    response: str
    model: str
    tokens_used: int
    cost: float
    cached: bool
    timestamp: str

# Health check endpoint
@app.get("/")
async def root():
    return {
        "status": "healthy",
        "timestamp": datetime.now().isoformat(),
        "version": "1.0.0"
    }

@app.get("/health")
async def health_check():
    return {"status": "healthy"}

# Main chat endpoint
@app.post("/chat", response_model=ChatResponse)
async def chat(request: ChatRequest):
    try:
        # Call LLM with optimizations
        response = await litellm.acompletion(
            model=request.model,
            messages=request.messages,
            temperature=request.temperature,
            max_tokens=request.max_tokens,
            caching=True
        )

        # Calculate cost
        cost = litellm.completion_cost(completion_response=response)

        return ChatResponse(
            response=response.choices[0].message.content,
            model=response.model,
            tokens_used=response.usage.total_tokens,
            cost=cost,
            cached=response._hidden_params.get("cache_hit", False),
            timestamp=datetime.now().isoformat()
        )

    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

# Error handler
@app.exception_handler(Exception)
async def global_exception_handler(request: Request, exc: Exception):
    return JSONResponse(
        status_code=500,
        content={
            "error": "Internal server error",
            "detail": str(exc) if os.getenv("DEBUG") else "An error occurred"
        }
    )

if __name__ == "__main__":
    uvicorn.run(
        "main:app",
        host="0.0.0.0",
        port=8000,
        reload=False,  # Keep False in production
        workers=4
    )
```
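With the service running (directly via `python main.py` for now, or under systemd in the next step), a few lines of Python make a handy smoke test. This is a sketch, assuming the API listens on localhost:8000 and the `requests` package is installed:

```python
# client_test.py - quick smoke test against the /chat endpoint
import requests

BASE_URL = "http://localhost:8000"  # swap for https://your-domain.com after Nginx/SSL

# Health check
print(requests.get(f"{BASE_URL}/health").json())

# Chat request
payload = {
    "messages": [{"role": "user", "content": "Explain reverse proxies in one sentence."}],
    "model": "gpt-4o-mini",
    "temperature": 0.3,
    "max_tokens": 100,
}
resp = requests.post(f"{BASE_URL}/chat", json=payload, timeout=60)
resp.raise_for_status()

data = resp.json()
print(data["response"])
print(f"Tokens: {data['tokens_used']}, cost: ${data['cost']:.4f}, cached: {data['cached']}")
```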
Step 5: Set Up Systemd Service
```bash
# Create systemd service file
sudo nano /etc/systemd/system/llm-api.service
```
```ini
[Unit]
Description=LLM API Service
After=network.target

[Service]
Type=simple
User=deploy
WorkingDirectory=/home/deploy/llm-api
Environment="PATH=/home/deploy/llm-api/venv/bin"
ExecStart=/home/deploy/llm-api/venv/bin/uvicorn main:app --host 0.0.0.0 --port 8000 --workers 4
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
```
```bash
# Enable and start the service
sudo systemctl daemon-reload
sudo systemctl enable llm-api
sudo systemctl start llm-api

# Check status
sudo systemctl status llm-api

# View logs
sudo journalctl -u llm-api -f
```
Step 6: Configure Nginx as Reverse Proxy
```nginx
# sudo nano /etc/nginx/sites-available/llm-api

# API rate limiting - limit_req_zone must live in the http context,
# so put this line inside the http block of /etc/nginx/nginx.conf
# (or a file under /etc/nginx/conf.d/):
#   limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;

server {
    listen 80;
    server_name your-domain.com www.your-domain.com;

    # Apply the rate limit defined above
    limit_req zone=api_limit burst=20 nodelay;

    # Proxy settings
    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection 'upgrade';
        proxy_set_header Host $host;
        proxy_cache_bypass $http_upgrade;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Timeouts for LLM responses
        proxy_connect_timeout 60s;
        proxy_send_timeout 60s;
        proxy_read_timeout 60s;
    }

    # Security headers
    add_header X-Frame-Options "SAMEORIGIN" always;
    add_header X-Content-Type-Options "nosniff" always;
    add_header X-XSS-Protection "1; mode=block" always;
}
```
```bash
# Enable the site
sudo ln -s /etc/nginx/sites-available/llm-api /etc/nginx/sites-enabled/
sudo nginx -t
sudo systemctl reload nginx
```
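At this point you can sanity-check both the proxy and the rate limit from your own machine. A rough sketch using Python `requests`; the URL is a placeholder for your actual domain or server IP:

```python
# rate_limit_check.py - rough check that the Nginx rate limit kicks in
import requests

URL = "http://your-domain.com/health"  # or http://<server-ip>/health before DNS is set up

statuses = []
for _ in range(50):  # well above 10 req/s plus the burst of 20
    try:
        statuses.append(requests.get(URL, timeout=5).status_code)
    except requests.RequestException as exc:
        statuses.append(str(exc))

ok = statuses.count(200)
limited = statuses.count(503)  # Nginx rejects limit_req overflow with 503 by default
print(f"200 OK: {ok}, rate-limited (503): {limited}, other: {len(statuses) - ok - limited}")
```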
Step 7: SSL with Certbot (Free HTTPS)
```bash
# Get SSL certificate
sudo certbot --nginx -d your-domain.com -d www.your-domain.com

# Auto-renewal is already set up, but verify:
sudo systemctl status certbot.timer

# Test renewal
sudo certbot renew --dry-run
```
Your Nginx config is automatically updated with SSL!
Alternative: Using Caddy (Even Simpler)
Caddy is an alternative to Nginx that handles SSL automatically:
```bash
# Install Caddy
sudo apt install -y debian-keyring debian-archive-keyring apt-transport-https
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/gpg.key' | sudo gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/debian.deb.txt' | sudo tee /etc/apt/sources.list.d/caddy-stable.list
sudo apt update
sudo apt install caddy

# Configure Caddy
sudo nano /etc/caddy/Caddyfile
```
```caddyfile
your-domain.com {
    reverse_proxy localhost:8000

    # Automatic HTTPS! The email is used for the Let's Encrypt account
    tls your-email@example.com

    # Compression
    encode gzip

    # Rate limiting - note: this directive comes from a third-party plugin
    # (not included in the stock apt package; build with xcaddy to add it)
    rate_limit {
        zone api {
            key {remote_host}
            events 10
            window 1s
        }
    }

    # Security headers
    header {
        X-Frame-Options "SAMEORIGIN"
        X-Content-Type-Options "nosniff"
        X-XSS-Protection "1; mode=block"
        -Server
    }
}
```
```bash
# Restart Caddy
sudo systemctl reload caddy
```
That's it! Caddy automatically gets and renews SSL certificates.
Part 4: Production APIs with Managed Platforms
Once your MVP gains traction, you might want the convenience of managed platforms.
FastAPI on Railway
Railway combines the simplicity of Heroku with modern features:
```toml
# railway.toml
[build]
builder = "NIXPACKS"

[deploy]
healthcheckPath = "/health"
healthcheckTimeout = 100
restartPolicyType = "ON_FAILURE"
restartPolicyMaxRetries = 10
```
```bash
# Deploy to Railway
railway login
railway link
railway up

# Add environment variables
railway variables set OPENAI_API_KEY=sk-...
railway variables set LANGFUSE_PUBLIC_KEY=pk-...

# Get your URL
railway open
```
Comparison of Managed Platforms
| Platform | Pros | Cons | Best For |
|---|---|---|---|
| Railway | Great DX, fair pricing | Newer platform | Full-stack apps |
| Render | Simple, reliable | Can be slow | Simple APIs |
| Fly.io | Global deployment | Complex for beginners | Global apps |
| Heroku | Mature, stable | Expensive | Enterprise |
Part 5: Serverless Deployment
For variable traffic or cost optimization, serverless can be ideal.
AWS Lambda Deployment
```python
# lambda_function.py
import json
import litellm
from mangum import Mangum
from fastapi import FastAPI
import os

# Initialize outside handler for connection reuse
litellm.success_callback = ["langfuse"]

# For using FastAPI with Lambda
app = FastAPI()

@app.get("/health")
def health_check():
    return {"status": "healthy"}

@app.post("/chat")
async def chat(request: dict):
    try:
        messages = request.get("messages", [])
        model = request.get("model", "gpt-4o-mini")

        response = await litellm.acompletion(
            model=model,
            messages=messages,
            timeout=25,  # Lambda timeout buffer
            caching=True
        )

        return {
            "response": response.choices[0].message.content,
            "model": model,
            "cost": litellm.completion_cost(completion_response=response)
        }
    except Exception as e:
        return {"error": str(e)}

# Lambda handler
handler = Mangum(app)
```
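Before wiring up Serverless, you can exercise this app locally with FastAPI's `TestClient`; it calls the routes in-process, so no Lambda emulator is needed. A small sketch, assuming `mangum` and `httpx` are installed locally and your API keys are set in the shell:

```python
# test_local.py - exercise the Lambda's FastAPI app without deploying
from fastapi.testclient import TestClient
from lambda_function import app

client = TestClient(app)

# Health endpoint
assert client.get("/health").json() == {"status": "healthy"}

# Chat endpoint (makes a real LLM call, so it needs valid API keys)
resp = client.post("/chat", json={
    "messages": [{"role": "user", "content": "Say hello in five words."}],
    "model": "gpt-4o-mini",
})
print(resp.json())
```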
Serverless Framework Deployment
```yaml
# serverless.yml
service: llm-api-serverless

provider:
  name: aws
  runtime: python3.11
  region: us-east-1
  timeout: 30
  memorySize: 1024
  environment:
    OPENAI_API_KEY: ${ssm:/llm-api/openai-key}
    LANGFUSE_PUBLIC_KEY: ${ssm:/llm-api/langfuse-public}

functions:
  api:
    handler: lambda_function.handler
    events:
      - httpApi:
          path: /{proxy+}
          method: ANY
      - httpApi:
          path: /
          method: ANY
    reservedConcurrency: 10  # Control costs

plugins:
  - serverless-python-requirements
  - serverless-plugin-warmup

custom:
  pythonRequirements:
    dockerizePip: true
    layer: true  # Create a layer for dependencies
  warmup:
    enabled: true
    schedule: rate(5 minutes)  # Keep warm to avoid cold starts
```
```bash
# Deploy
npm install -g serverless
serverless deploy

# Check logs
serverless logs -f api -t

# Remove
serverless remove
```
Part 6: Docker Deployment
For consistency and portability, Docker is essential.
Multi-Stage Dockerfile for Production
```dockerfile
# Dockerfile
# Stage 1: Builder
FROM python:3.11-slim as builder

WORKDIR /app

# Install build dependencies
RUN apt-get update && apt-get install -y \
    gcc \
    g++ \
    && rm -rf /var/lib/apt/lists/*

# Copy requirements and install dependencies
COPY requirements.txt .
RUN pip install --user --no-cache-dir -r requirements.txt

# Stage 2: Runtime
FROM python:3.11-slim

WORKDIR /app

# Create non-root user
RUN useradd -m -u 1000 appuser && chown -R appuser:appuser /app

# Copy Python packages from the builder stage
COPY --from=builder --chown=appuser:appuser /root/.local /home/appuser/.local

# Copy application code
COPY --chown=appuser:appuser . .

# Switch to non-root user
USER appuser

# Update PATH
ENV PATH=/home/appuser/.local/bin:$PATH

# Health check (relies on the requests package from requirements.txt)
HEALTHCHECK \
    CMD python -c "import requests; requests.get('http://localhost:8000/health')" || exit 1

# Expose port
EXPOSE 8000

# Run application
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "4"]
```
Deploy to Cloud Run (Google Cloud)
```bash
# Build and push to Google Container Registry
gcloud builds submit --tag gcr.io/PROJECT_ID/llm-api

# Deploy to Cloud Run
gcloud run deploy llm-api \
  --image gcr.io/PROJECT_ID/llm-api \
  --platform managed \
  --region us-central1 \
  --allow-unauthenticated \
  --set-env-vars="OPENAI_API_KEY=sk-..." \
  --min-instances=1 \
  --max-instances=10 \
  --memory=2Gi \
  --cpu=2 \
  --timeout=60
```
Part 7: Kubernetes Deployment (Advanced)
Kubernetes is the right tool for enterprise-scale deployments with high-availability requirements.
Kubernetes Manifests
```yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-api
  namespace: llm-apps
  labels:
    app: llm-api
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: llm-api
  template:
    metadata:
      labels:
        app: llm-api
    spec:
      containers:
        - name: llm-api
          image: your-registry/llm-api:latest
          ports:
            - containerPort: 8000
              name: http
          env:
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: llm-secrets
                  key: openai-api-key
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "1Gi"
              cpu: "500m"
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 5
```
Part 8: Production Best Practices
Security Checklist
```python
# security_middleware.py
from fastapi import FastAPI, Request, HTTPException
from fastapi.responses import JSONResponse
from fastapi.security import HTTPBearer
from slowapi import Limiter
from slowapi.util import get_remote_address
import hashlib
import time

# Rate limiting
limiter = Limiter(key_func=get_remote_address)

def setup_security(app: FastAPI):
    """Add security layers to FastAPI app"""

    # Rate limiting
    app.state.limiter = limiter

    # Security headers middleware
    @app.middleware("http")
    async def add_security_headers(request: Request, call_next):
        response = await call_next(request)
        response.headers["X-Content-Type-Options"] = "nosniff"
        response.headers["X-Frame-Options"] = "DENY"
        response.headers["X-XSS-Protection"] = "1; mode=block"
        response.headers["Strict-Transport-Security"] = "max-age=31536000; includeSubDomains"
        return response

    # Request validation
    @app.middleware("http")
    async def validate_request(request: Request, call_next):
        # Check content type for POST requests
        if request.method == "POST":
            content_type = request.headers.get("content-type")
            if not content_type or "application/json" not in content_type:
                return JSONResponse(
                    status_code=400,
                    content={"error": "Content-Type must be application/json"}
                )

        # Add request ID for tracing
        request_id = request.headers.get("X-Request-ID") or str(time.time())
        response = await call_next(request)
        response.headers["X-Request-ID"] = request_id
        return response
```
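Here's a sketch of how this module could be wired into `main.py`, including slowapi's standard handler so rate-limit violations return 429s; the `10/minute` limit is an illustrative value, not something mandated earlier in the lesson:

```python
# main.py (excerpt) - wiring in the security module
from fastapi import FastAPI, Request
from slowapi import _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded

from security_middleware import setup_security, limiter

app = FastAPI(title="LLM API")

# Headers, content-type validation, request IDs
setup_security(app)

# Return HTTP 429 when a client exceeds its limit
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/chat")
@limiter.limit("10/minute")  # illustrative per-client limit
async def chat(request: Request, payload: dict):
    # slowapi needs the raw Request in the signature; the JSON body arrives in `payload`
    ...
```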
Monitoring Setup
```python
# monitoring.py
from prometheus_client import Counter, Histogram, Gauge, generate_latest
from fastapi import FastAPI
from fastapi.responses import PlainTextResponse
import psutil

# Metrics
request_count = Counter(
    'llm_api_requests_total',
    'Total number of requests',
    ['method', 'endpoint', 'status']
)

request_duration = Histogram(
    'llm_api_request_duration_seconds',
    'Request duration in seconds',
    ['method', 'endpoint']
)

active_requests = Gauge(
    'llm_api_active_requests',
    'Number of active requests'
)

def setup_monitoring(app: FastAPI):
    """Add monitoring endpoints and middleware"""

    @app.get("/metrics", response_class=PlainTextResponse)
    async def metrics():
        """Prometheus metrics endpoint"""
        return generate_latest()

    @app.get("/health/detail")
    async def health_detail():
        """Detailed health check"""
        # Check system resources
        cpu_percent = psutil.cpu_percent(interval=1)
        memory = psutil.virtual_memory()
        disk = psutil.disk_usage('/')

        health_status = {
            "status": "healthy",
            "checks": {
                "cpu_usage": f"{cpu_percent}%",
                "memory_usage": f"{memory.percent}%",
                "disk_usage": f"{disk.percent}%"
            }
        }

        return health_status
```
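The counters above are declared, but nothing increments them yet. One way to close that loop is a small middleware in the same module; this is a sketch of my own, not part of the original setup:

```python
# monitoring.py (continued) - middleware that feeds the metrics defined above
import time
from fastapi import Request

def setup_metrics_middleware(app: FastAPI):
    """Attach a middleware that updates the Prometheus metrics per request."""

    @app.middleware("http")
    async def record_metrics(request: Request, call_next):
        active_requests.inc()
        start = time.perf_counter()
        status = "500"  # assume failure unless the handler returns normally
        try:
            response = await call_next(request)
            status = str(response.status_code)
            return response
        finally:
            active_requests.dec()
            elapsed = time.perf_counter() - start
            request_count.labels(
                method=request.method,
                endpoint=request.url.path,
                status=status,
            ).inc()
            request_duration.labels(
                method=request.method,
                endpoint=request.url.path,
            ).observe(elapsed)
```

Call `setup_metrics_middleware(app)` right after `setup_monitoring(app)`, and Prometheus can scrape request counts, latencies, and in-flight requests from `/metrics`.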
Part 9: Choosing the Right Path
Decision Matrix
```python
def deployment_advisor():
    """Interactive deployment advisor"""

    print("🚀 LLM Deployment Advisor")
    print("-" * 40)

    # Ask questions
    users = int(input("Expected daily users: "))
    budget = int(input("Monthly budget ($): "))
    team_size = int(input("Team size: "))
    has_devops = input("Dedicated DevOps? (y/n): ").lower() == 'y'
    need_gpu = input("Need GPU? (y/n): ").lower() == 'y'

    # Recommendations
    print("\n🎯 Recommendations:")
    print("-" * 40)

    if users < 100 and budget < 10:
        print("✅ Start with Streamlit Cloud (Free)")
        print("   - Deploy in 5 minutes")
        print("   - Perfect for demos")

    elif users < 1000 and budget < 30:
        print("✅ Use a $10 VM with Nginx")
        print("   - Best value for money")
        print("   - Full control")
        print("   - Can handle 1000+ users easily")

    elif users < 10000 and budget < 100:
        print("✅ Use Railway or Render")
        print("   - Managed platform")
        print("   - Auto-scaling available")
        print("   - Good developer experience")

    elif need_gpu:
        print("✅ Use Docker + Cloud GPU")
        print("   - AWS EC2 with GPU")
        print("   - Or Paperspace/Lambda Labs")

    elif users > 10000 or has_devops:
        print("✅ Consider Kubernetes")
        print("   - Use managed K8s (EKS/GKE)")
        print("   - High availability")
        print("   - Complex but powerful")

    else:
        print("✅ Use Serverless (AWS Lambda)")
        print("   - Pay per request")
        print("   - Auto-scaling")
        print("   - No server management")

deployment_advisor()
```
Migration Path
```text
Streamlit POC → VM + Nginx MVP → Managed Platform → Docker + Cloud → Kubernetes
                              ↘ Serverless → Docker + Cloud
```
Final Recommendations
- Start Simple: Don't over-engineer. A $10 VM can handle most MVPs.
- VM First: Before considering Kubernetes, try a VM. You'll be surprised how far it can take you.
- Monitor Everything: Use the Langfuse integration from day one.
- Cache Aggressively: Use the caching strategies from previous lessons.
- Security First: Never expose API keys, always use HTTPS.
- Automate Deployment: Even for VMs, automate with simple bash scripts.
Summary
We've covered the complete deployment spectrum:
| Stage | Solution | Cost | Complexity | When to Use |
|---|---|---|---|---|
| Prototype | Streamlit | $0 | ⭐ | Day 1-7 |
| MVP | VM + Nginx | $10/mo | ⭐⭐ | Week 2-Month 3 |
| Growth | Railway/Render | $20-50/mo | ⭐⭐ | Month 3-6 |
| Scale | Serverless/Docker | $50-200/mo | ⭐⭐⭐ | Month 6-12 |
| Enterprise | Kubernetes | $200+/mo | ⭐⭐⭐⭐⭐ | Year 2+ |
Key Takeaways:
- VMs are underrated - A simple VM with Nginx can handle thousands of users for $10/month
- Start with Streamlit for demos, but quickly move to FastAPI for production
- Most projects never need Kubernetes - Don't add complexity prematurely
- Serverless isn't always cheaper - At scale, VMs or containers can be more cost-effective
- Security can't be an afterthought - SSL certificates, API keys, and rate limiting from day one
- Monitoring pays for itself - You'll save more in optimization than you spend on observability
The best deployment is the one that gets you to production fastest, fits your budget, and is one your team can actually maintain. Don't let perfect be the enemy of good: a simple VM serving real users beats a perfect Kubernetes setup with no users every time.
Your assignment: Deploy something this week. Start with Streamlit if you must, but get it live. Real users will teach you more than any tutorial.
Good luck, and happy deploying! 🚀
P.S. - If you're reading this and thinking "but what about [complex scenario]?" - you probably don't need it yet. Ship first, optimize later.