Series Index
- Prerequisites
- Populate the Vector Database
- Vector Retriever
- RAG Implementation
- Chat UI
- Evaluation
- Performance Improvements
4. RAG Implementation
In addition to the retriever, we need an auto-regressive (conversational) LLM:

```python
from langchain_openai import ChatOpenAI

# MODEL is the model name configured in the Prerequisites post.
llm = ChatOpenAI(temperature=0, model_name=MODEL)
```
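The rest of the snippets also rely on a few shared LangChain imports, collected here so each block stays focused:

```python
from langchain_core.documents import Document
from langchain_core.messages import HumanMessage, SystemMessage, convert_to_messages
```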
Together with the retriever we defined in the previous post, we have everything we need to build the RAG pipeline:
```python
SYSTEM_PROMPT_TEMPLATE = """
You are a knowledgeable, friendly assistant representing Desert Leaves.
You are chatting internally with a technician from Desert Leaves.
If relevant, use the given context to answer any question.
If you don't know the answer, say so.
Context:
{context}
"""
```
```python
def fetch_context(question: str) -> list[Document]:
    """Retrieve relevant context documents for a question."""
    # RETRIEVAL_K caps how many documents the retriever returns per query.
    return retriever.invoke(question, k=RETRIEVAL_K)
```
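As a quick recap, the retriever came out of the vector store we populated earlier. A minimal sketch, assuming a Chroma store with OpenAI embeddings (your store, path, and k may differ):

```python
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

RETRIEVAL_K = 5  # illustrative value; tune for your corpus

vectorstore = Chroma(
    persist_directory="vector_db",  # hypothetical path from the earlier post
    embedding_function=OpenAIEmbeddings(),
)
retriever = vectorstore.as_retriever(search_kwargs={"k": RETRIEVAL_K})
```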
```python
def combined_question(question: str, history: list[dict] | None = None) -> str:
    """Combine all the user's messages into a single retrieval query."""
    history = history or []
    prior = [m["content"] for m in history if m["role"] == "user"]
    return "\n".join([*prior, question])
```
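For example, given a short history (values are illustrative), the retrieval query stitches the user turns together so follow-up questions keep their context:

```python
history = [
    {"role": "user", "content": "Which sensors do we install?"},
    {"role": "assistant", "content": "Soil-moisture and temperature sensors."},
]
combined_question("How often do they report?", history)
# -> 'Which sensors do we install?\nHow often do they report?'
```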
```python
def answer_question(question: str, history: list[dict] | None = None) -> tuple[str, list[Document]]:
    """Answer the given question with RAG; return the answer and the context documents."""
    history = history or []
    # Retrieve using the whole conversation, not just the latest message.
    combined = combined_question(question, history)
    docs = fetch_context(combined)
    # Inject the retrieved documents into the system prompt.
    context = "\n\n".join(doc.page_content for doc in docs)
    system_prompt = SYSTEM_PROMPT_TEMPLATE.format(context=context)
    messages = [SystemMessage(content=system_prompt)]
    messages.extend(convert_to_messages(history))
    messages.append(HumanMessage(content=question))
    response = llm.invoke(messages)
    return response.content, docs
```
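Before wiring up a UI, we can sanity-check the pipeline directly (the question below is illustrative):

```python
answer, sources = answer_question("What maintenance schedule do we recommend?")
print(answer)
for doc in sources:
    print(doc.metadata)  # see which documents grounded the answer
```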
5. UI with Gradio
gr.ChatInterface expects the chat function to take (message, history) and return the reply text, but answer_question also returns the source documents, so a small wrapper drops them for display:

```python
import gradio as gr

def chat(message: str, history: list[dict]) -> str:
    answer, _docs = answer_question(message, history)
    return answer

# type="messages" gives history as role/content dicts, matching our functions.
gr.ChatInterface(chat, type="messages").launch()
```

As simple as that!
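By default, launch() serves the app on localhost only. If technicians elsewhere on the internal network need access, Gradio's launch parameters let you bind to other interfaces (values here are illustrative):

```python
gr.ChatInterface(chat, type="messages").launch(server_name="0.0.0.0", server_port=7860)
```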
Nice work! We now have our own private RAG system.
A little tune-up for production follows in the next sections: Part 6: Evaluation (available soon).
More LLM Engineering articles
- LLM Engineering | Token optimization – caching, thin system prompts, and cost-optimized production usage.
- LLM Engineering | Running local LLMs and APIs – Ollama, OpenAI, Anthropic, OpenRouter, LangChain, and LiteLLM.