LLM Cost Optimization: From Observability to Production

2025-01-20 · 25 min read

Stop burning money on AI tokens. Learn the essential techniques that cut LLM costs by 90% while maintaining quality.

The $10,000 Wake-Up Call

Last month, a startup discovered their GPT-4 bill had quietly reached $10,000. They were caching nothing, logging nothing, and using GPT-4 for everything, even simple tasks like extracting emails from text.

Sound familiar? You're not alone. Most LLM applications in production waste 80-90% of their token budget. This guide shows you exactly how to fix that.

What You'll Learn

We'll cover two parts:

Part 1: Essentials (This post)

  • Observability with Langfuse
  • Response caching
  • Smart model selection
  • Structured outputs

Part 2: Advanced (Next post)

  • Prompt caching
  • Semantic caching
  • Streaming optimizations
  • Fine-tuning economics
  • Batch Calling

Let's start with the foundation: you can't optimize what you can't measure.

1. Observability: Your Cost Dashboard

Before optimizing anything, you need visibility. Langfuse is an open-source LLM observability platform that tracks every request, cost, and error.

Setting Up Langfuse

Option 1: Cloud (Easiest)

# Sign up at https://cloud.langfuse.com
# Get your API keys from the dashboard

Option 2: Self-Hosted with Docker

# Clone and run Langfuse locally
git clone https://github.com/langfuse/langfuse.git
cd langfuse
docker-compose up

The Docker setup spins up the Langfuse web server (on http://localhost:3000 by default) along with its backing database and supporting services.

Integrating with Your Code

Create a .env file:

# For Langfuse Cloud
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_HOST=https://cloud.langfuse.com

# For self-hosted
LANGFUSE_PUBLIC_KEY=your-public-key
LANGFUSE_SECRET_KEY=your-secret-key
LANGFUSE_HOST=http://localhost:3000

# Your LLM API keys
OPENAI_API_KEY=sk-...
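
Before wiring anything into your app, it's worth confirming the credentials actually work. Here's a minimal sketch, assuming you've installed the Langfuse Python SDK (pip install langfuse), whose recent versions expose an auth_check() helper:

from dotenv import load_dotenv
from langfuse import Langfuse

load_dotenv()  # Reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_HOST

# auth_check() pings the configured host with your keys and
# returns True when the credentials are accepted.
langfuse = Langfuse()
if langfuse.auth_check():
    print("Langfuse credentials OK")
else:
    print("Langfuse rejected the credentials - check your .env values")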

Now for the magic: it takes just two lines to add observability:

import litellm
from dotenv import load_dotenv

load_dotenv()

# Enable Langfuse logging
litellm.success_callback = ["langfuse"]
litellm.failure_callback = ["langfuse"]

# Your existing code works as-is!
response = litellm.completion(
    model="gpt-4o-mini",  # Note: using mini, not gpt-4
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello! How are you today?"},
    ]
)
print(response.choices[0].message.content)

Visit your Langfuse dashboard. You'll see:

  • Request details: Model, tokens, latency
  • Costs: Calculated automatically per request
  • Traces: Full conversation history
  • Errors: Failed requests and reasons

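You can also cross-check those numbers locally. LiteLLM's completion_cost() helper (used throughout this post) computes the dollar cost of a response from its reported token usage:

import litellm

response = litellm.completion(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello! How are you today?"}],
)

# completion_cost() looks up per-token pricing for the model and
# multiplies it by the usage reported in the response.
cost = litellm.completion_cost(completion_response=response)
print(f"Tokens: {response.usage.total_tokens} | Cost: ${cost:.6f}")
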
Advanced Langfuse Integration

For more control, you can add metadata and custom trace information:

# Add metadata for better tracking
response = litellm.completion(
    model="gpt-4o-mini",
    messages=messages,
    metadata={
        # Langfuse-specific parameters
        "generation_name": "customer-support-chat",
        "trace_id": "trace-123",
        "trace_user_id": "user-456",
        "session_id": "session-789",
        "tags": ["support", "billing"],

        # Custom metadata
        "user_tier": "premium",
        "feature": "chat",
        "environment": "production"
    }
)

# Now you can filter in Langfuse:
# - Which features cost the most?
# - Which users generate most tokens?
# - What's your error rate by model?

2. Response Caching: The 100% Discount

The easiest optimization? Don't make the same request twice.

Simple In-Memory Cache with LiteLLM

LiteLLM has built-in caching support:

import litellm
from litellm import Cache

# Enable Redis caching
litellm.cache = Cache(
    type="redis",  # or "local" for in-memory
    host="localhost",
    port=6379,
)

# Alternative: in-memory cache (fine for development)
# litellm.cache = Cache()

def get_completion(messages, model="gpt-4o-mini"):
    """Cached LLM completion"""
    response = litellm.completion(
        model=model,
        messages=messages,
        caching=True,  # Enable caching for this request
        ttl=3600,      # Cache for 1 hour
    )
    return response

# First call: hits the API (costs money)
response1 = get_completion([
    {"role": "user", "content": "What's the capital of France?"}
])
print("Cost:", litellm.completion_cost(completion_response=response1))

# Second call: hits cache (costs $0)
response2 = get_completion([
    {"role": "user", "content": "What's the capital of France?"}
])
# Cache hit will be logged in Langfuse metadata
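
A quick way to confirm the cache is working is to time the two calls; a small sketch reusing get_completion from above:

import time

question = [{"role": "user", "content": "What's the capital of France?"}]

# First call goes to the API, second should be served from the cache.
start = time.perf_counter()
get_completion(question)
first = time.perf_counter() - start

start = time.perf_counter()
get_completion(question)
second = time.perf_counter() - start

print(f"API call: {first:.2f}s | Cached call: {second:.3f}s")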

Production Redis Setup with Custom Logic

import redis
import json
import hashlib
from typing import Optional
import litellm

class LLMCache:
    def __init__(self, redis_url="redis://localhost:6379"):
        self.redis = redis.from_url(redis_url)
        self.ttl = 3600  # 1 hour default

    def _generate_key(self, model: str, messages: list, **kwargs) -> str:
        """Create unique cache key from request"""
        cache_dict = {
            "model": model,
            "messages": messages,
            **kwargs  # temperature, max_tokens, etc.
        }
        cache_str = json.dumps(cache_dict, sort_keys=True)
        return f"llm_cache:{hashlib.md5(cache_str.encode()).hexdigest()}"

    def get(self, model: str, messages: list, **kwargs) -> Optional[dict]:
        """Try to get from cache"""
        key = self._generate_key(model, messages, **kwargs)
        cached = self.redis.get(key)
        if cached:
            return json.loads(cached)
        return None

    def set(self, model: str, messages: list, response: dict, **kwargs):
        """Store in cache"""
        key = self._generate_key(model, messages, **kwargs)
        self.redis.setex(
            key,
            self.ttl,
            json.dumps(response)
        )

# Use with Langfuse tracking
cache = LLMCache()

def cached_completion(model, messages, **kwargs):
    # Check cache first
    cached_response = cache.get(model, messages, **kwargs)
    if cached_response:
        print("๐Ÿ’ฐ Cache hit - saved money!")
        # Log cache hit to Langfuse
        return litellm.completion(
            model=model,
            messages=messages,
            metadata={"cache_hit": True, "cached_response": True},
            mock_response=cached_response  # Use cached response
        )

    # Cache miss - make real request
    response = litellm.completion(
        model=model,
        messages=messages,
        metadata={"cache_hit": False},
        **kwargs
    )

    # Store in cache
    cache.set(model, messages, response.model_dump(), **kwargs)
    return response
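
Usage looks like a normal completion call; the second identical request (same model, messages, and kwargs) resolves to the same cache key and is served from Redis:

messages = [{"role": "user", "content": "Summarize the benefits of caching in one sentence."}]

# Cache miss: goes to the API and stores the result in Redis
first = cached_completion("gpt-4o-mini", messages, temperature=0)

# Cache hit: same model, messages, and kwargs -> same cache key
second = cached_completion("gpt-4o-mini", messages, temperature=0)
print(second.choices[0].message.content)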

When to Cache vs Not Cache

def should_cache(messages, temperature=0.7):
    """Determine if request should be cached"""

    # Don't cache if high temperature (creative tasks)
    if temperature > 0.8:
        return False

    # Don't cache time-sensitive queries
    time_sensitive_words = ["today", "now", "current", "latest"]
    message_text = str(messages).lower()
    if any(word in message_text for word in time_sensitive_words):
        return False

    # Cache these types of requests
    cacheable_patterns = [
        "extract", "classify", "summarize", "translate",
        "list", "format", "convert", "explain"
    ]

    return any(pattern in message_text for pattern in cacheable_patterns)

# Apply caching conditionally
def smart_completion(messages, model="gpt-4o-mini", **kwargs):
    use_cache = should_cache(messages, kwargs.get("temperature", 0.7))

    return litellm.completion(
        model=model,
        messages=messages,
        caching=use_cache,
        metadata={"caching_enabled": use_cache},
        **kwargs
    )
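
A couple of spot checks make the heuristic concrete (no API calls involved):

print(should_cache([{"role": "user", "content": "Summarize this article about Redis"}]))
# True  - "summarize" matches a cacheable pattern

print(should_cache([{"role": "user", "content": "What's the latest news today?"}]))
# False - time-sensitive words disable caching

print(should_cache([{"role": "user", "content": "Write a poem"}], temperature=0.9))
# False - high temperature implies a creative, non-deterministic task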

3. Model Selection: Right Tool, Right Job

GPT-4 for everything is like using a Ferrari for grocery runs. Here's how to choose:

Model Routing with LiteLLM

import litellm
from enum import Enum
from typing import List, Dict

class TaskComplexity(Enum):
    SIMPLE = "gpt-4o-mini"           # $0.15 per 1M tokens
    MEDIUM = "claude-3-haiku-20240307"  # $0.25 per 1M tokens
    COMPLEX = "gpt-4o"               # $2.50 per 1M tokens
    VISION = "gpt-4o"                # For images

class ModelRouter:
    def __init__(self):
        # Simple keyword-based routing to start
        self.simple_tasks = [
            "extract", "classify", "summarize", "translate",
            "list", "format", "convert"
        ]
        self.complex_tasks = [
            "analyze", "debug", "create", "design",
            "strategy", "solve", "explain why"
        ]

    def route(self, messages: List[Dict]) -> str:
        """Determine best model for the task"""
        user_message = str(messages[-1].get("content", "")).lower()

        # Check for images
        if any("image" in str(msg) for msg in messages):
            return TaskComplexity.VISION.value

        # Check task complexity
        if any(task in user_message for task in self.simple_tasks):
            return TaskComplexity.SIMPLE.value
        elif any(task in user_message for task in self.complex_tasks):
            return TaskComplexity.COMPLEX.value
        else:
            # Default to medium
            return TaskComplexity.MEDIUM.value

    def route_with_llm(self, messages: List[Dict]) -> str:
        """Use a small model to route to the right model"""
        routing_prompt = """
        Classify this task complexity:
        SIMPLE: extraction, classification, formatting
        MEDIUM: writing, analysis, explanations
        COMPLEX: reasoning, math, code generation, strategy

        User request: {request}

        Return only: SIMPLE, MEDIUM, or COMPLEX
        """

        # Use cheap model for routing
        response = litellm.completion(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": routing_prompt.format(
                    request=messages[-1]["content"]
                )
            }],
            max_tokens=10,
            temperature=0
        )

        classification = response.choices[0].message.content.strip()

        # Map to actual model
        model_map = {
            "SIMPLE": TaskComplexity.SIMPLE.value,
            "MEDIUM": TaskComplexity.MEDIUM.value,
            "COMPLEX": TaskComplexity.COMPLEX.value
        }

        return model_map.get(classification, TaskComplexity.MEDIUM.value)

# Usage
router = ModelRouter()

def smart_completion(messages, **kwargs):
    """Automatically routes to appropriate model"""
    model = router.route(messages)

    print(f"๐Ÿ“Š Routed to: {model}")

    response = litellm.completion(
        model=model,
        messages=messages,
        metadata={
            "routed_model": model,
            "routing_method": "keyword"
        },
        **kwargs
    )

    # Log the cost
    cost = litellm.completion_cost(completion_response=response)
    print(f"๐Ÿ’ฐ Cost: ${cost:.4f}")

    return response
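
Routing decisions are cheap to inspect on their own, without making any completion calls:

print(router.route([{"role": "user", "content": "Extract all email addresses from this text"}]))
# gpt-4o-mini - "extract" matches a simple-task keyword

print(router.route([{"role": "user", "content": "Design a caching strategy for our API"}]))
# gpt-4o - "design" matches a complex-task keyword

print(router.route([{"role": "user", "content": "Tell me about Paris"}]))
# claude-3-haiku-20240307 - no keyword match, defaults to medium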

Cost Comparison Tool

import litellm

def compare_model_costs():
    """Compare costs across different models"""

    test_prompt = "Write a 3-paragraph story about a robot"

    models = [
        "gpt-4o-mini",
        "claude-3-haiku-20240307",
        "gpt-4o",
        "claude-3-5-sonnet-20241022"
    ]

    results = []
    for model in models:
        try:
            response = litellm.completion(
                model=model,
                messages=[{"role": "user", "content": test_prompt}],
            )

            # Calculate cost using LiteLLM's built-in function
            cost = litellm.completion_cost(completion_response=response)

            results.append({
                "model": model,
                "tokens": response.usage.total_tokens,
                "cost": cost
            })

            print(f"{model:30} | Tokens: {response.usage.total_tokens:4} | Cost: ${cost:.4f}")
        except Exception as e:
            print(f"Error with {model}: {e}")

    # Show cost difference
    if results:
        cheapest = min(results, key=lambda x: x["cost"])
        most_expensive = max(results, key=lambda x: x["cost"])

        print(f"\n๐Ÿ’ก Using {cheapest['model']} instead of {most_expensive['model']} saves {most_expensive['cost']/cheapest['cost']:.0f}x on cost!")

# Run comparison
compare_model_costs()
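
If you'd rather not spend tokens on the comparison at all, LiteLLM also bundles a pricing table in litellm.model_cost (which, as far as I know, exposes input_cost_per_token and output_cost_per_token per model) that you can query directly; a rough sketch:

import litellm

def estimate_cost(model: str, prompt_tokens: int = 1000, completion_tokens: int = 500) -> float:
    """Estimate cost from LiteLLM's bundled pricing table (no API call)."""
    pricing = litellm.model_cost.get(model, {})
    return (
        prompt_tokens * pricing.get("input_cost_per_token", 0)
        + completion_tokens * pricing.get("output_cost_per_token", 0)
    )

for model in ["gpt-4o-mini", "gpt-4o", "claude-3-haiku-20240307"]:
    print(f"{model:30} | ~${estimate_cost(model):.4f} per 1k in / 500 out tokens")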

4. Structured Outputs: Make Small Models Reliable

Small models become powerful when you give them structure:

Using LiteLLM with JSON Response Format

import litellm
import json
from pydantic import BaseModel
from typing import List, Optional

class ExtractedInfo(BaseModel):
    """Define expected output structure"""
    name: Optional[str] = None
    email: Optional[str] = None
    phone: Optional[str] = None
    company: Optional[str] = None
    requirements: List[str] = []

def extract_with_structure(text: str):
    """Use structured output for reliable extraction"""

    # Method 1: JSON Mode (OpenAI models)
    response = litellm.completion(
        model="gpt-4o-mini",  # Small model works great!
        messages=[
            {
                "role": "system",
                "content": f"""Extract information and return as JSON.
                Schema: {ExtractedInfo.model_json_schema()}"""
            },
            {
                "role": "user",
                "content": f"Extract contact info from: {text}"
            }
        ],
        response_format={"type": "json_object"},  # Force JSON output
        temperature=0,  # Deterministic
        metadata={"task": "extraction", "structured": True}
    )

    # Parse and validate
    result = json.loads(response.choices[0].message.content)
    return ExtractedInfo(**result)

# Method 2: Function Calling for structured extraction
def extract_with_function_calling(text: str):
    """Use function calling for guaranteed structure"""

    response = litellm.completion(
        model="gpt-4o-mini",
        messages=[
            {"role": "user", "content": f"Extract contact info from: {text}"}
        ],
        tools=[{
            "type": "function",
            "function": {
                "name": "save_contact",
                "description": "Save extracted contact information",
                "parameters": ExtractedInfo.model_json_schema()
            }
        }],
        tool_choice="auto"
    )

    # Parse function call arguments
    if response.choices[0].message.tool_calls:
        args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
        return ExtractedInfo(**args)

    return None

# Example usage
email_text = """
Hi, I'm John Smith from TechCorp. You can reach me at
john@techcorp.com or call 555-1234. We need API integration,
real-time updates, and analytics dashboard.
"""

# Try both methods
result1 = extract_with_structure(email_text)
print(f"JSON Mode Result: {result1}")

result2 = extract_with_function_calling(email_text)
print(f"Function Calling Result: {result2}")

Reliable Classification with Small Models

import litellm
from enum import Enum
from typing import List

class CustomerIntent(str, Enum):
    BILLING = "billing"
    TECHNICAL = "technical"
    SALES = "sales"
    GENERAL = "general"

def classify_intent(message: str) -> CustomerIntent:
    """Classify customer intent using structured output"""

    response = litellm.completion(
        model="gpt-4o-mini",  # Tiny model, perfect for classification
        messages=[
            {
                "role": "system",
                "content": """Classify the customer intent into one of these categories:
                - billing: Payment, invoices, refunds
                - technical: Bugs, errors, integration issues
                - sales: Pricing, upgrades, new features
                - general: Other inquiries

                Respond with only the category name."""
            },
            {
                "role": "user",
                "content": message
            }
        ],
        max_tokens=10,
        temperature=0,
        caching=True,  # Identical messages hit the cache instead of the API
        metadata={"task": "classification"}
    )

    intent = response.choices[0].message.content.strip().lower()
    try:
        return CustomerIntent(intent)
    except ValueError:
        # Fall back if the model returns an unexpected label
        return CustomerIntent.GENERAL

# Batch classification with caching
def classify_batch(messages: List[str]):
    """Classify multiple messages efficiently"""

    results = []
    for message in messages:
        intent = classify_intent(message)
        results.append({
            "message": message[:50] + "...",
            "intent": intent,
            "model": "gpt-4o-mini",
            "cost": "$0.000015"  # Approximate
        })

    return results

# Test it
test_messages = [
    "I can't log into my account",
    "How much does the enterprise plan cost?",
    "Please refund my last payment",
    "When will the new feature be available?"
]

classifications = classify_batch(test_messages)
for c in classifications:
    print(f"Intent: {c['intent']:10} | Message: {c['message']}")

5. Putting It All Together

Here's a production-ready setup combining everything:

import litellm
from litellm import Cache
from dotenv import load_dotenv
import json
from datetime import datetime
from typing import Optional, List, Dict

# Setup
load_dotenv()
litellm.success_callback = ["langfuse"]
litellm.failure_callback = ["langfuse"]
litellm.cache = Cache(type="redis")

class OptimizedLLM:
    def __init__(self):
        self.router = ModelRouter()

    def complete(self,
                messages: List[Dict],
                use_cache: bool = True,
                force_model: Optional[str] = None,
                **kwargs) -> dict:
        """
        Optimized completion with all techniques:
        - Automatic model routing
        - Response caching
        - Cost tracking
        - Langfuse observability
        """

        # 1. Decide whether to cache this request
        cache_this = use_cache and should_cache(messages, kwargs.get("temperature", 0.7))

        # 2. Route to appropriate model (unless forced)
        model = force_model or self.router.route(messages)
        print(f"📊 Using model: {model}")

        # 3. Make request with all optimizations
        response = litellm.completion(
            model=model,
            messages=messages,
            caching=cache_this,
            metadata={
                "routed_model": model,
                "caching_enabled": cache_this,
                "timestamp": datetime.now().isoformat(),
                "optimization_version": "v1"
            },
            **kwargs
        )

        # 4. Calculate and log cost savings
        cost = litellm.completion_cost(completion_response=response)

        # Compare with GPT-4o cost (for savings calculation)
        gpt4_cost = self._estimate_gpt4_cost(response.usage)
        savings = ((gpt4_cost - cost) / gpt4_cost * 100) if gpt4_cost > 0 else 0

        print(f"💰 Cost: ${cost:.4f} (saved {savings:.0f}% vs GPT-4o)")

        return response

    def _estimate_gpt4_cost(self, usage) -> float:
        """Estimate what this would cost with GPT-4"""
        # GPT-4o pricing (approximate)
        input_cost = usage.prompt_tokens * 0.0025 / 1000
        output_cost = usage.completion_tokens * 0.01 / 1000
        return input_cost + output_cost

# Initialize
llm = OptimizedLLM()

# Example 1: Simple query (will use small model + caching)
response = llm.complete([
    {"role": "user", "content": "What's the capital of France?"}
])

# Example 2: Complex query (will use larger model)
response = llm.complete([
    {"role": "user", "content": "Debug this recursive Python function and explain the issue"}
])
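
In production you'll care about aggregates more than individual calls. Here's a small sketch (with illustrative prompts) that runs a batch through the wrapper and totals actual spend against the GPT-4o baseline, reusing litellm.completion_cost and the class's _estimate_gpt4_cost helper:

prompts = [
    "Extract the email from: contact us at sales@example.com",
    "Classify this ticket: my invoice is wrong",
    "Explain why this recursive function overflows the stack",
]

total_cost = 0.0
total_baseline = 0.0
for prompt in prompts:
    response = llm.complete([{"role": "user", "content": prompt}])
    total_cost += litellm.completion_cost(completion_response=response)
    total_baseline += llm._estimate_gpt4_cost(response.usage)

print(f"Actual spend: ${total_cost:.4f} | GPT-4o baseline: ${total_baseline:.4f}")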

Results You Can Expect

After implementing these techniques:

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| Average cost per request | $0.035 | $0.003 | 91% reduction |
| Cache hit rate | 0% | 40% | ♾️ |
| Average latency | 2.5s | 0.8s | 68% faster |
| Error rate | 5% | 0.5% | 90% fewer errors |
| Monthly LLM costs | $10,000 | $900 | $9,100 saved |

Your Action Plan

Day 1: Observability (15 minutes)

  • ✓ Sign up for Langfuse Cloud or run locally
  • ✓ Add 2 lines to enable logging
  • ✓ See your first traces

Day 2: Caching (30 minutes)

  • ✓ Set up Redis or use in-memory cache
  • ✓ Add caching=True to your completions
  • ✓ Watch costs drop on repeat queries

Day 3: Model Routing (1 hour)

  • ✓ Implement the ModelRouter class
  • ✓ Identify your task types
  • ✓ Route simple tasks to small models

Day 4: Structured Outputs

  • ✓ Convert extractions to JSON mode
  • ✓ Add response validation
  • ✓ Use gpt-4o-mini reliably

Quick Reference

# Complete setup in one file
import litellm
from litellm import Cache
from dotenv import load_dotenv

# 1. Setup observability
load_dotenv()
litellm.success_callback = ["langfuse"]
litellm.failure_callback = ["langfuse"]

# 2. Enable caching
litellm.cache = Cache(type="redis")

# 3. Smart completion with all optimizations
def optimized_completion(messages, **kwargs):
    # Auto-select model based on complexity (ModelRouter from section 3)
    model = ModelRouter().route(messages)

    response = litellm.completion(
        model=model,
        messages=messages,
        caching=True,  # Cache responses
        metadata={"optimized": True},  # Track in Langfuse
        **kwargs
    )

    # Log the savings
    cost = litellm.completion_cost(completion_response=response)
    print(f"Cost: ${cost:.4f} | Model: {model}")

    return response

What's Next?

Part 2 will cover advanced optimizations:

  • Prompt caching: Reuse computed tokens (90% savings)
  • Semantic caching: Handle paraphrased queries
  • Streaming optimizations: Improve perceived performance
  • Batch processing: Reduce overhead for bulk operations
  • Fine-tuning economics: When it beats prompting

Start with observability today. You can't optimize what you can't measure.

Resources & Documentation

Have questions? Reduced your costs? Let me know in the comments!