Token optimization makes your LLM apps cheaper, faster, and more scalable. Two practical patterns from the course:
- Cache – reuse responses instead of recomputing them.
- Thin system prompt – keep the base prompt small, expand it only when needed.
1️⃣ Cache with LiteLLM
A cache stores responses for identical requests. If the same prompt comes in again, you get an instant response at zero extra token cost.
```python
import litellm
from litellm import completion
from litellm.caching import Cache

# Enable in-memory cache (good for local dev, notebooks, small apps)
litellm.cache = Cache()

messages = [
    {"role": "user", "content": "Give me a one-paragraph summary of transformers."},
]

response = completion(model="openai/gpt-4.1", messages=messages)
print(response.choices[0].message.content)
```
For production, you usually plug in Redis to share the cache across workers and instances.
```python
import litellm
from litellm.caching import Cache

# All LiteLLM calls below will now read/write a Redis-backed cache
litellm.cache = Cache(type="redis", host="localhost", port=6379)
```
This is especially useful for:
- Chatbots answering the same questions repeatedly.
- RAG systems where users re-query similar documents.
- Background jobs and batch processing pipelines.
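Under the hood, response caching boils down to a lookup keyed on the full request. Here is a minimal, self-contained sketch of the idea (not LiteLLM's actual implementation; the class and a stand-in `fake_llm` are illustrative):

```python
import hashlib
import json

class ResponseCache:
    """Minimal in-memory response cache keyed on (model, messages)."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def _key(self, model, messages):
        # Serialize deterministically so identical requests hash identically
        payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def get_or_compute(self, model, messages, compute):
        key = self._key(model, messages)
        if key in self._store:
            self.hits += 1
            return self._store[key]  # instant response, zero extra tokens
        self.misses += 1
        result = compute()  # only pay for tokens on a cache miss
        self._store[key] = result
        return result

# Usage: the second identical request never reaches the model
cache = ResponseCache()
msgs = [{"role": "user", "content": "What is a transformer?"}]
fake_llm = lambda: "A transformer is a neural network architecture."
first = cache.get_or_compute("gpt-4.1", msgs, fake_llm)
second = cache.get_or_compute("gpt-4.1", msgs, fake_llm)
```

A Redis-backed cache follows the same pattern, just with the dict swapped for a shared store.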
2️⃣ Thin System Prompt (Dynamic Expansion)
Instead of sending one huge system prompt every time, use a small base prompt and expand it only when the user query needs extra rules or domain knowledge.
This reduces tokens per request and keeps the model focused.
```python
from openai import OpenAI

MODEL = "gpt-4.1"
client = OpenAI()

system_message = (
    "You are a helpful retail assistant that answers questions about store products."
)

def chat(message, history):
    # Normalize history into OpenAI message format
    history = [{"role": h["role"], "content": h["content"]} for h in history]

    # Start with a thin base system message
    relevant_system_message = system_message

    # Add extra constraints only when needed
    if "belt" in message.lower():
        relevant_system_message += (
            " The store does not sell belts; if you are asked for belts, "
            "be sure to point out other items that are currently on sale instead."
        )

    messages = (
        [{"role": "system", "content": relevant_system_message}]
        + history
        + [{"role": "user", "content": message}]
    )

    # Stream the response back to the client
    stream = client.chat.completions.create(
        model=MODEL,
        messages=messages,
        stream=True,
    )
    response = ""
    for chunk in stream:
        response += chunk.choices[0].delta.content or ""
        yield response
```
Pattern:
- Base system prompt = generic behavior.
- Inspect the user message for keywords, intent, or domain.
- Append only the relevant extra rules.
- Send the final, compact system prompt with the conversation.
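The single keyword check generalizes naturally to a small rules table, so new domain rules become data instead of branches. A sketch of that idea (the rules and names here are illustrative, not from the course code):

```python
# Map trigger keywords to extra system-prompt rules (illustrative examples)
EXTRA_RULES = {
    "belt": "The store does not sell belts; suggest items currently on sale instead.",
    "refund": "Refunds are only possible within 30 days with a receipt.",
}

BASE_PROMPT = "You are a helpful retail assistant."

def build_system_prompt(user_message):
    """Return the base prompt plus only the rules the message triggers."""
    prompt = BASE_PROMPT
    lowered = user_message.lower()
    for keyword, rule in EXTRA_RULES.items():
        if keyword in lowered:
            prompt += " " + rule
    return prompt
```

Most requests then ship only `BASE_PROMPT`; the extra tokens are spent exactly when a query needs them.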
Combined with caching, this gives you:
- Fewer tokens per call.
- Lower latency and cost.
- Cleaner prompts that are easier to maintain and reason about.
More LLM Engineering articles
- Building a Private RAG System with LangChain, Chroma, and Local LLMs – private, enterprise-ready RAG and vector database pipeline.
- LLM Engineering | Running local LLMs and APIs – Ollama, OpenAI, Anthropic, OpenRouter, LangChain, and LiteLLM.