Token optimization makes your LLM apps cheaper, faster, and more scalable. Two practical patterns from the course:
- Cache – reuse responses instead of recomputing them.
- Thin system prompt – keep the base prompt small, expand it only when needed.
1️⃣ Cache with LiteLLM
A cache stores responses for identical requests. If the same prompt comes in again, you get an instant response at zero extra token cost.
```python
import litellm
from litellm import completion
from litellm.caching import Cache

# Enable in-memory cache (good for local dev, notebooks, small apps)
litellm.cache = Cache()

messages = [
    {"role": "user", "content": "Give me a one-paragraph summary of transformers."},
]

response = completion(model="openai/gpt-4.1", messages=messages)
print(response.choices[0].message.content)
```
For production, you usually plug in Redis to share the cache across workers and instances.
```python
import litellm
from litellm.caching import Cache

# All LiteLLM calls below will now read/write a Redis-backed cache
litellm.cache = Cache(type="redis", host="localhost", port=6379)
```
This is especially useful for:
- Chatbots answering the same questions repeatedly.
- RAG systems where users re-query similar documents.
- Background jobs and batch processing pipelines.
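Under the hood, response caching boils down to a lookup keyed on the full request. Here is a minimal, self-contained sketch of the idea (not LiteLLM's actual implementation; the class and a stand-in `fake_llm` are illustrative):

```python
import hashlib
import json

class ResponseCache:
    """Minimal in-memory response cache keyed on (model, messages)."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def _key(self, model, messages):
        # Serialize deterministically so identical requests hash identically
        payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def get_or_compute(self, model, messages, compute):
        key = self._key(model, messages)
        if key in self._store:
            self.hits += 1
            return self._store[key]  # instant response, zero extra tokens
        self.misses += 1
        result = compute()  # only pay for tokens on a cache miss
        self._store[key] = result
        return result

# Usage: the second identical request never reaches the model
cache = ResponseCache()
msgs = [{"role": "user", "content": "What is a transformer?"}]
fake_llm = lambda: "A transformer is a neural network architecture."
first = cache.get_or_compute("gpt-4.1", msgs, fake_llm)
second = cache.get_or_compute("gpt-4.1", msgs, fake_llm)
```

A Redis-backed cache follows the same pattern, just with the dict swapped for a shared store.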
2️⃣ Thin System Prompt (Dynamic Expansion)
Instead of sending one huge system prompt every time, use a small base prompt and expand it only when the user query needs extra rules or domain knowledge.
This reduces tokens per request and keeps the model focused.
```python
from openai import OpenAI

MODEL = "gpt-4.1"
client = OpenAI()

system_message = (
    "You are a helpful retail assistant that answers questions about store products."
)

def chat(message, history):
    # Normalize history into OpenAI message format
    history = [{"role": h["role"], "content": h["content"]} for h in history]

    # Start with a thin base system message
    relevant_system_message = system_message

    # Add extra constraints only when needed
    if "belt" in message.lower():
        relevant_system_message += (
            " The store does not sell belts; if you are asked for belts, "
            "be sure to point out other items that are currently on sale instead."
        )

    messages = (
        [{"role": "system", "content": relevant_system_message}]
        + history
        + [{"role": "user", "content": message}]
    )

    # Stream the response back to the client
    stream = client.chat.completions.create(
        model=MODEL,
        messages=messages,
        stream=True,
    )
    response = ""
    for chunk in stream:
        response += chunk.choices[0].delta.content or ""
        yield response
```
Pattern:
- Base system prompt = generic behavior.
- Inspect the user message for keywords, intent, or domain.
- Append only the relevant extra rules.
- Send the final, compact system prompt with the conversation.
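The single keyword check generalizes naturally to a small rules table, so new domain rules become data instead of branches. A sketch of that idea (the rules and names here are illustrative, not from the course code):

```python
# Map trigger keywords to extra system-prompt rules (illustrative examples)
EXTRA_RULES = {
    "belt": "The store does not sell belts; suggest items currently on sale instead.",
    "refund": "Refunds are only possible within 30 days with a receipt.",
}

BASE_PROMPT = "You are a helpful retail assistant."

def build_system_prompt(user_message):
    """Return the base prompt plus only the rules the message triggers."""
    prompt = BASE_PROMPT
    lowered = user_message.lower()
    for keyword, rule in EXTRA_RULES.items():
        if keyword in lowered:
            prompt += " " + rule
    return prompt
```

Most requests then ship only `BASE_PROMPT`; the extra tokens are spent exactly when a query needs them.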
Combined with caching, this gives you:
- Fewer tokens per call.
- Lower latency and cost.
- Cleaner prompts that are easier to maintain and reason about.
More LLM Engineering articles
- Building a Private RAG System with LangChain, Chroma, and Local LLMs – private, enterprise-ready RAG and vector database pipeline.
- LLM Engineering | Running local LLMs and APIs – Ollama, OpenAI, Anthropic, OpenRouter, LangChain, and LiteLLM.