LLM Cost Optimization: From Observability to Production
Stop burning money on AI tokens. Learn the essential techniques that cut LLM costs by 90% while maintaining quality.
The $10,000 Wake-Up Call
Last month, a startup discovered their GPT-4 bill had quietly reached $10,000. They were caching nothing, logging nothing, and using GPT-4 for everything, even simple tasks like extracting emails from text.
Sound familiar? You're not alone. Most LLM applications in production waste 80-90% of their token budget. This guide shows you exactly how to fix that.
What You'll Learn
We'll cover two parts:
Part 1: Essentials (This post)
- Observability with Langfuse
- Response caching
- Smart model selection
- Structured outputs
Part 2: Advanced (Next post)
- Prompt caching
- Semantic caching
- Streaming optimizations
- Fine-tuning economics
- Batch processing
Let's start with the foundation: you can't optimize what you can't measure.
1. Observability: Your Cost Dashboard
Before optimizing anything, you need visibility. Langfuse is an open-source LLM observability platform that tracks every request, cost, and error.
Setting Up Langfuse
Option 1: Cloud (Easiest)
```bash
# Sign up at https://cloud.langfuse.com
# Get your API keys from the dashboard
```
Option 2: Self-Hosted with Docker
```bash
# Clone and run Langfuse locally
git clone https://github.com/langfuse/langfuse.git
cd langfuse
docker-compose up
```
The Docker setup spins up:
- Langfuse server on http://localhost:3000
- PostgreSQL database for storing traces
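Before wiring up your application, it can be worth a quick sanity check that the self-hosted instance is actually reachable. A minimal sketch, assuming the `requests` library is installed and the default port above:

```python
import requests

# Sanity check: is the self-hosted Langfuse UI responding on the default port?
resp = requests.get("http://localhost:3000", timeout=5)
print("Langfuse is up" if resp.ok else f"Unexpected status: {resp.status_code}")
```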
Integrating with Your Code
Create a `.env` file with your keys:

```
# For Langfuse Cloud
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_HOST=https://cloud.langfuse.com

# For self-hosted
LANGFUSE_PUBLIC_KEY=your-public-key
LANGFUSE_SECRET_KEY=your-secret-key
LANGFUSE_HOST=http://localhost:3000

# Your LLM API keys
OPENAI_API_KEY=sk-...
```
Now, the magic: just 3 lines to add observability:
```python
import litellm
from dotenv import load_dotenv

load_dotenv()

# Enable Langfuse logging
litellm.success_callback = ["langfuse"]
litellm.failure_callback = ["langfuse"]

# Your existing code works as-is!
response = litellm.completion(
    model="gpt-4o-mini",  # Note: using mini, not gpt-4
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello! How are you today?"},
    ]
)

print(response.choices[0].message.content)
```
Visit your Langfuse dashboard. You'll see:
- Request details: Model, tokens, latency
- Costs: Calculated automatically per request
- Traces: Full conversation history
- Errors: Failed requests and reasons
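If you want to cross-check the numbers the dashboard shows, LiteLLM can compute the same per-request cost client-side with `completion_cost` (the same helper used later in this post). A minimal sketch:

```python
import litellm

response = litellm.completion(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello! How are you today?"}],
)

# Compute tokens and cost locally and compare against the Langfuse trace
cost = litellm.completion_cost(completion_response=response)
print(f"Tokens: {response.usage.total_tokens} | Cost: ${cost:.6f}")
```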
Advanced Langfuse Integration
For more control, you can add metadata and custom trace information:
```python
# Add metadata for better tracking
response = litellm.completion(
    model="gpt-4o-mini",
    messages=messages,
    metadata={
        # Langfuse-specific parameters
        "generation_name": "customer-support-chat",
        "trace_id": "trace-123",
        "trace_user_id": "user-456",
        "session_id": "session-789",
        "tags": ["support", "billing"],
        # Custom metadata
        "user_tier": "premium",
        "feature": "chat",
        "environment": "production"
    }
)

# Now you can filter in Langfuse:
# - Which features cost the most?
# - Which users generate most tokens?
# - What's your error rate by model?
```
2. Response Caching: The 100% Discount
The easiest optimization? Don't make the same request twice.
Simple In-Memory Cache with LiteLLM
LiteLLM has built-in caching support:
```python
import litellm
from litellm import Cache

# Enable Redis caching
litellm.cache = Cache(
    type="redis",  # or "local" for in-memory
    host="localhost",
    port=6379,
)

# Alternative: In-memory cache (for development)
litellm.cache = Cache()

def get_completion(messages, model="gpt-4o-mini"):
    """Cached LLM completion"""
    response = litellm.completion(
        model=model,
        messages=messages,
        caching=True,  # Enable caching for this request
        ttl=3600,      # Cache for 1 hour
    )
    return response

# First call: hits the API (costs money)
response1 = get_completion([
    {"role": "user", "content": "What's the capital of France?"}
])
print("Cost:", litellm.completion_cost(completion_response=response1))

# Second call: hits cache (costs $0)
response2 = get_completion([
    {"role": "user", "content": "What's the capital of France?"}
])
# Cache hit will be logged in Langfuse metadata
```
Production Redis Setup with Custom Logic
```python
import redis
import json
import hashlib
from typing import Optional

import litellm

class LLMCache:
    def __init__(self, redis_url="redis://localhost:6379"):
        self.redis = redis.from_url(redis_url)
        self.ttl = 3600  # 1 hour default

    def _generate_key(self, model: str, messages: list, **kwargs) -> str:
        """Create unique cache key from request"""
        cache_dict = {
            "model": model,
            "messages": messages,
            **kwargs  # temperature, max_tokens, etc.
        }
        cache_str = json.dumps(cache_dict, sort_keys=True)
        return f"llm_cache:{hashlib.md5(cache_str.encode()).hexdigest()}"

    def get(self, model: str, messages: list, **kwargs) -> Optional[dict]:
        """Try to get from cache"""
        key = self._generate_key(model, messages, **kwargs)
        cached = self.redis.get(key)
        if cached:
            return json.loads(cached)
        return None

    def set(self, model: str, messages: list, response: dict, **kwargs):
        """Store in cache"""
        key = self._generate_key(model, messages, **kwargs)
        self.redis.setex(
            key,
            self.ttl,
            json.dumps(response)
        )

# Use with Langfuse tracking
cache = LLMCache()

def cached_completion(model, messages, **kwargs):
    # Check cache first
    cached_response = cache.get(model, messages, **kwargs)
    if cached_response:
        print("💰 Cache hit - saved money!")
        # Log cache hit to Langfuse by replaying the cached text as a mock response
        return litellm.completion(
            model=model,
            messages=messages,
            metadata={"cache_hit": True, "cached_response": True},
            mock_response=cached_response["choices"][0]["message"]["content"],
        )

    # Cache miss - make real request
    response = litellm.completion(
        model=model,
        messages=messages,
        metadata={"cache_hit": False},
        **kwargs
    )

    # Store in cache
    cache.set(model, messages, response.model_dump(), **kwargs)
    return response
```
When to Cache vs Not Cache
```python
def should_cache(messages, temperature=0.7):
    """Determine if request should be cached"""

    # Don't cache if high temperature (creative tasks)
    if temperature > 0.8:
        return False

    # Don't cache time-sensitive queries
    time_sensitive_words = ["today", "now", "current", "latest"]
    message_text = str(messages).lower()
    if any(word in message_text for word in time_sensitive_words):
        return False

    # Cache these types of requests
    cacheable_patterns = [
        "extract", "classify", "summarize", "translate",
        "list", "format", "convert", "explain"
    ]
    return any(pattern in message_text for pattern in cacheable_patterns)

# Apply caching conditionally
def smart_completion(messages, **kwargs):
    use_cache = should_cache(messages, kwargs.get("temperature", 0.7))

    return litellm.completion(
        messages=messages,
        caching=use_cache,
        metadata={"caching_enabled": use_cache},
        **kwargs
    )
```
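To see the heuristic in action, here is an illustrative usage of the `smart_completion` wrapper above (the prompts are made up for the example, and the model choice is arbitrary):

```python
# Cacheable: deterministic extraction, no time-sensitive wording
smart_completion(
    [{"role": "user", "content": "Extract the email address from: contact us at hello@example.com"}],
    model="gpt-4o-mini",
    temperature=0,
)

# Not cached: "today" trips the time-sensitive check before the cacheable patterns are considered
smart_completion(
    [{"role": "user", "content": "Summarize today's top AI news"}],
    model="gpt-4o-mini",
)
```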
3. Model Selection: Right Tool, Right Job
GPT-4 for everything is like using a Ferrari for grocery runs. Here's how to choose:
Model Routing with LiteLLM
```python
import litellm
from enum import Enum
from typing import List, Dict

class TaskComplexity(Enum):
    SIMPLE = "gpt-4o-mini"              # $0.15 per 1M input tokens
    MEDIUM = "claude-3-haiku-20240307"  # $0.25 per 1M input tokens
    COMPLEX = "gpt-4o"                  # $2.50 per 1M input tokens
    VISION = "gpt-4o"                   # For images

class ModelRouter:
    def __init__(self):
        # Simple keyword-based routing to start
        self.simple_tasks = [
            "extract", "classify", "summarize", "translate",
            "list", "format", "convert"
        ]
        self.complex_tasks = [
            "analyze", "debug", "create", "design",
            "strategy", "solve", "explain why"
        ]

    def route(self, messages: List[Dict]) -> str:
        """Determine best model for the task"""
        user_message = str(messages[-1].get("content", "")).lower()

        # Check for images
        if any("image" in str(msg) for msg in messages):
            return TaskComplexity.VISION.value

        # Check task complexity
        if any(task in user_message for task in self.simple_tasks):
            return TaskComplexity.SIMPLE.value
        elif any(task in user_message for task in self.complex_tasks):
            return TaskComplexity.COMPLEX.value
        else:
            # Default to medium
            return TaskComplexity.MEDIUM.value

    def route_with_llm(self, messages: List[Dict]) -> str:
        """Use a small model to route to the right model"""
        routing_prompt = """
        Classify this task complexity:
        SIMPLE: extraction, classification, formatting
        MEDIUM: writing, analysis, explanations
        COMPLEX: reasoning, math, code generation, strategy

        User request: {request}

        Return only: SIMPLE, MEDIUM, or COMPLEX
        """

        # Use cheap model for routing
        response = litellm.completion(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": routing_prompt.format(
                    request=messages[-1]["content"]
                )
            }],
            max_tokens=10,
            temperature=0
        )

        classification = response.choices[0].message.content.strip()

        # Map to actual model
        model_map = {
            "SIMPLE": TaskComplexity.SIMPLE.value,
            "MEDIUM": TaskComplexity.MEDIUM.value,
            "COMPLEX": TaskComplexity.COMPLEX.value
        }
        return model_map.get(classification, TaskComplexity.MEDIUM.value)

# Usage
router = ModelRouter()

def smart_completion(messages, **kwargs):
    """Automatically routes to appropriate model"""
    model = router.route(messages)
    print(f"Routed to: {model}")

    response = litellm.completion(
        model=model,
        messages=messages,
        metadata={
            "routed_model": model,
            "routing_method": "keyword"
        },
        **kwargs
    )

    # Log the cost
    cost = litellm.completion_cost(completion_response=response)
    print(f"💰 Cost: ${cost:.4f}")

    return response
```
Cost Comparison Tool
```python
import litellm

def compare_model_costs():
    """Compare costs across different models"""
    test_prompt = "Write a 3-paragraph story about a robot"

    models = [
        "gpt-4o-mini",
        "claude-3-haiku-20240307",
        "gpt-4o",
        "claude-3-5-sonnet-20241022"
    ]

    results = []
    for model in models:
        try:
            response = litellm.completion(
                model=model,
                messages=[{"role": "user", "content": test_prompt}],
            )

            # Calculate cost using LiteLLM's built-in function
            cost = litellm.completion_cost(completion_response=response)

            results.append({
                "model": model,
                "tokens": response.usage.total_tokens,
                "cost": cost
            })
            print(f"{model:30} | Tokens: {response.usage.total_tokens:4} | Cost: ${cost:.4f}")

        except Exception as e:
            print(f"Error with {model}: {e}")

    # Show cost difference
    if results:
        cheapest = min(results, key=lambda x: x["cost"])
        most_expensive = max(results, key=lambda x: x["cost"])
        print(f"\n💡 Using {cheapest['model']} instead of {most_expensive['model']} "
              f"saves {most_expensive['cost']/cheapest['cost']:.0f}x on cost!")

# Run comparison
compare_model_costs()
```
4. Structured Outputs: Make Small Models Reliable
Small models become powerful when you give them structure:
Using LiteLLM with JSON Response Format
```python
import litellm
import json
from pydantic import BaseModel
from typing import List, Optional

class ExtractedInfo(BaseModel):
    """Define expected output structure"""
    name: Optional[str]
    email: Optional[str]
    phone: Optional[str]
    company: Optional[str]
    requirements: List[str]

def extract_with_structure(text: str):
    """Use structured output for reliable extraction"""

    # Method 1: JSON Mode (OpenAI models)
    response = litellm.completion(
        model="gpt-4o-mini",  # Small model works great!
        messages=[
            {
                "role": "system",
                "content": f"""Extract information and return as JSON.
                Schema: {ExtractedInfo.model_json_schema()}"""
            },
            {
                "role": "user",
                "content": f"Extract contact info from: {text}"
            }
        ],
        response_format={"type": "json_object"},  # Force JSON output
        temperature=0,  # Deterministic
        metadata={"task": "extraction", "structured": True}
    )

    # Parse and validate
    result = json.loads(response.choices[0].message.content)
    return ExtractedInfo(**result)

# Method 2: Function Calling for structured extraction
def extract_with_function_calling(text: str):
    """Use function calling for guaranteed structure"""

    response = litellm.completion(
        model="gpt-4o-mini",
        messages=[
            {"role": "user", "content": f"Extract contact info from: {text}"}
        ],
        tools=[{
            "type": "function",
            "function": {
                "name": "save_contact",
                "description": "Save extracted contact information",
                "parameters": ExtractedInfo.model_json_schema()
            }
        }],
        tool_choice="auto"
    )

    # Parse function call arguments
    if response.choices[0].message.tool_calls:
        args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
        return ExtractedInfo(**args)
    return None

# Example usage
email_text = """
Hi, I'm John Smith from TechCorp. You can reach me at john@techcorp.com
or call 555-1234. We need API integration, real-time updates, and analytics dashboard.
"""

# Try both methods
result1 = extract_with_structure(email_text)
print(f"JSON Mode Result: {result1}")

result2 = extract_with_function_calling(email_text)
print(f"Function Calling Result: {result2}")
```
Reliable Classification with Small Models
```python
import litellm
from enum import Enum
from typing import List

class CustomerIntent(str, Enum):
    BILLING = "billing"
    TECHNICAL = "technical"
    SALES = "sales"
    GENERAL = "general"

def classify_intent(message: str) -> CustomerIntent:
    """Classify customer intent using structured output"""

    response = litellm.completion(
        model="gpt-4o-mini",  # Tiny model, perfect for classification
        messages=[
            {
                "role": "system",
                "content": """Classify the customer intent into one of these categories:
                - billing: Payment, invoices, refunds
                - technical: Bugs, errors, integration issues
                - sales: Pricing, upgrades, new features
                - general: Other inquiries

                Respond with only the category name."""
            },
            {
                "role": "user",
                "content": message
            }
        ],
        max_tokens=10,
        temperature=0,
        metadata={"task": "classification"}
    )

    intent = response.choices[0].message.content.strip().lower()
    return CustomerIntent(intent)

# Batch classification with caching
def classify_batch(messages: List[str]):
    """Classify multiple messages efficiently"""
    results = []
    for message in messages:
        intent = classify_intent(message)
        results.append({
            "message": message[:50] + "...",
            "intent": intent,
            "model": "gpt-4o-mini",
            "cost": "$0.000015"  # Approximate
        })
    return results

# Test it
test_messages = [
    "I can't log into my account",
    "How much does the enterprise plan cost?",
    "Please refund my last payment",
    "When will the new feature be available?"
]

classifications = classify_batch(test_messages)
for c in classifications:
    print(f"Intent: {c['intent']:10} | Message: {c['message']}")
```
5. Putting It All Together
Here's a production-ready setup combining everything:
```python
import litellm
from litellm import Cache
from dotenv import load_dotenv
import json
from datetime import datetime
from typing import Optional, List, Dict

# Setup (assumes ModelRouter and should_cache from the sections above are in scope)
load_dotenv()
litellm.success_callback = ["langfuse"]
litellm.failure_callback = ["langfuse"]
litellm.cache = Cache(type="redis")

class OptimizedLLM:
    def __init__(self):
        self.router = ModelRouter()

    def complete(self,
                 messages: List[Dict],
                 use_cache: bool = True,
                 force_model: Optional[str] = None,
                 **kwargs) -> dict:
        """
        Optimized completion with all techniques:
        - Automatic model routing
        - Response caching
        - Cost tracking
        - Langfuse observability
        """

        # 1. Determine if we should cache (local name avoids shadowing the should_cache helper)
        cache_this = use_cache and should_cache(messages, kwargs.get("temperature", 0.7))

        # 2. Route to appropriate model (unless forced)
        model = force_model or self.router.route(messages)
        print(f"Using model: {model}")

        # 3. Make request with all optimizations
        response = litellm.completion(
            model=model,
            messages=messages,
            caching=cache_this,
            metadata={
                "routed_model": model,
                "caching_enabled": cache_this,
                "timestamp": datetime.now().isoformat(),
                "optimization_version": "v1"
            },
            **kwargs
        )

        # 4. Calculate and log cost savings
        cost = litellm.completion_cost(completion_response=response)

        # Compare with GPT-4o cost (for savings calculation)
        gpt4_cost = self._estimate_gpt4_cost(response.usage)
        savings = ((gpt4_cost - cost) / gpt4_cost * 100) if gpt4_cost > 0 else 0

        print(f"💰 Cost: ${cost:.4f} (Saved {savings:.0f}% vs GPT-4o)")

        return response

    def _estimate_gpt4_cost(self, usage) -> float:
        """Estimate what this would cost with GPT-4o"""
        # GPT-4o pricing (approximate)
        input_cost = usage.prompt_tokens * 0.0025 / 1000
        output_cost = usage.completion_tokens * 0.01 / 1000
        return input_cost + output_cost

# Initialize
llm = OptimizedLLM()

# Example 1: Simple query (will use small model + caching)
response = llm.complete([
    {"role": "user", "content": "What's the capital of France?"}
])

# Example 2: Complex query (will use larger model)
response = llm.complete([
    {"role": "user", "content": "Debug this recursive Python function and explain the issue"}
])
```
Results You Can Expect
After implementing these techniques:
| Metric | Before | After | Improvement |
|---|---|---|---|
| Average cost per request | $0.035 | $0.003 | 91% reduction |
| Cache hit rate | 0% | 40% | ♾️ |
| Average latency | 2.5s | 0.8s | 68% faster |
| Error rate | 5% | 0.5% | 90% fewer errors |
| Monthly LLM costs | $10,000 | $900 | $9,100 saved |
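As a rough back-of-the-envelope check on how the per-request numbers translate into the monthly figure, here's an illustrative calculation (the request volume is an assumed value, not from any real deployment; the per-request costs come from the table above):

```python
# Illustrative savings estimate using the table's per-request costs
requests_per_month = 285_000   # assumed volume (not a real measurement)
cost_before = 0.035            # $/request, from the table
cost_after = 0.003             # $/request, from the table

monthly_before = requests_per_month * cost_before
monthly_after = requests_per_month * cost_after
print(f"Before: ${monthly_before:,.0f}/mo | After: ${monthly_after:,.0f}/mo | "
      f"Saved: ${monthly_before - monthly_after:,.0f}/mo")
```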
Your Action Plan
Day 1: Observability (15 minutes)
- ✅ Sign up for Langfuse Cloud or run locally
- ✅ Add 2 lines to enable logging
- ✅ See your first traces
Day 2: Caching (30 minutes)
- ✅ Set up Redis or use the in-memory cache
- ✅ Add caching=True to your completions
- ✅ Watch costs drop on repeat queries
Day 3: Model Routing (1 hour)
- ✅ Implement the ModelRouter class
- ✅ Identify your task types
- ✅ Route simple tasks to small models
Day 4: Structured Outputs
- ✅ Convert extractions to JSON mode
- ✅ Add response validation
- ✅ Use gpt-4o-mini reliably
Quick Reference
```python
# Complete setup in one file
import litellm
from litellm import Cache
from dotenv import load_dotenv

# 1. Setup observability
load_dotenv()
litellm.success_callback = ["langfuse"]
litellm.failure_callback = ["langfuse"]

# 2. Enable caching
litellm.cache = Cache(type="redis")

# 3. Smart completion with all optimizations
router = ModelRouter()  # keyword router from section 3

def optimized_completion(messages, **kwargs):
    # Auto-select model based on complexity
    model = router.route(messages)

    response = litellm.completion(
        model=model,
        messages=messages,
        caching=True,  # Cache responses
        metadata={"optimized": True},  # Track in Langfuse
        **kwargs
    )

    # Log the savings
    cost = litellm.completion_cost(completion_response=response)
    print(f"Cost: ${cost:.4f} | Model: {model}")

    return response
```
What's Next?
Part 2 will cover advanced optimizations:
- Prompt caching: Reuse computed prompt tokens (up to 90% savings)
- Semantic caching: Handle paraphrased queries
- Streaming optimizations: Improve perceived performance
- Batch processing: Reduce overhead for bulk operations
- Fine-tuning economics: When it beats prompting
Start with observability today. You can't optimize what you can't measure.
Resources & Documentation
- LiteLLM Documentation: docs.litellm.ai
- Langfuse Documentation: langfuse.com/docs
- GitHub Repos: LiteLLM | Langfuse
- Support: LiteLLM Discord | Langfuse Discord
Have questions? Reduced your costs? Let me know in the comments!