Building a Private RAG System with Ollama, LangChain and Chroma

Part 5: Evaluation

Posted by Kike Bodí in January 2026

Series Index

  1. Prerequisites
  2. Populate the Vector Database
  3. Vector Retriever
  4. RAG Implementation
  5. Chat UI
  6. Evaluation
  7. Performance improvements

Part 5: Evaluation

5.1 Curate a Test Set

Generate a test set in JSONL format (one JSON-formatted question per line). Check the synthetic data generator post (not available yet).

We will use a Pydantic Model to define the structure of each test question:

from pydantic import BaseModel, Field


class TestQuestion(BaseModel):
    """A test question with expected keywords and reference answer."""

    question: str = Field(description="The question to ask the RAG system")
    keywords: list[str] = Field(description="Keywords that must appear in retrieved context")
    reference_answer: str = Field(description="The reference answer for this question")
    category: str = Field(description="Question category (e.g., direct_fact, spanning, temporal)")
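For illustration, the test set file holds one such object per line. The sample questions below are made up; in the real pipeline each line would go through `TestQuestion.model_validate_json`, but the sketch uses plain `json.loads` so it runs standalone:

```python
import json

# Two sample JSONL lines; the questions and answers are made up.
sample_jsonl = """\
{"question": "When was the company founded?", "keywords": ["founded", "1998"], "reference_answer": "The company was founded in 1998.", "category": "direct_fact"}
{"question": "Which offices opened after 2015?", "keywords": ["office", "2015"], "reference_answer": "The Berlin and Tokyo offices opened after 2015.", "category": "temporal"}
"""

# One JSON object per line; the real code would call
# TestQuestion.model_validate_json(line) on each line instead.
tests = [json.loads(line) for line in sample_jsonl.splitlines() if line.strip()]
print(len(tests), tests[0]["category"])
```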

5.2 Evaluate Retrieval

In this step we will evaluate the Vector Store Retriever and check how well it picks relevant chunks from the vector database.

We will evaluate this on several metrics:

  • MRR (Mean Reciprocal Rank)
  • nDCG (Normalized Discounted Cumulative Gain)
  • Keyword coverage (percentage of keywords found in the top-k results)

And here is the code:

import math


def calculate_mrr(keyword: str, retrieved_docs: list) -> float:
    """Calculate reciprocal rank for a single keyword (case-insensitive)."""
    keyword_lower = keyword.lower()
    for rank, doc in enumerate(retrieved_docs, start=1):
        if keyword_lower in doc.page_content.lower():
            return 1.0 / rank
    return 0.0


def calculate_dcg(relevances: list[int], k: int) -> float:
    """Calculate Discounted Cumulative Gain."""
    dcg = 0.0
    for i in range(min(k, len(relevances))):
        dcg += relevances[i] / math.log2(i + 2)  # i+2 because rank starts at 1
    return dcg


def calculate_ndcg(keyword: str, retrieved_docs: list, k: int) -> float:
    """Calculate nDCG for a single keyword using binary relevance."""
    keyword_lower = keyword.lower()
    relevances = [
        1 if keyword_lower in doc.page_content.lower() else 0
        for doc in retrieved_docs[:k]
    ]
    dcg = calculate_dcg(relevances, k)
    idcg = calculate_dcg(sorted(relevances, reverse=True), k)
    return dcg / idcg if idcg > 0 else 0.0


def calculate_keyword_coverage(mrr_scores: list[float], keywords: list[str]) -> float:
    """Return keyword coverage percentage."""
    keywords_found = sum(1 for score in mrr_scores if score > 0)
    total_keywords = len(keywords)
    if total_keywords == 0:
        return 0.0
    return (keywords_found / total_keywords) * 100
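To sanity-check the math, here is a tiny worked example. `SimpleNamespace` stands in for a retrieved LangChain document (only the `page_content` attribute matters), and the document texts are made up:

```python
import math
from types import SimpleNamespace

# Stand-ins for retrieved documents; only .page_content is used.
docs = [SimpleNamespace(page_content=t) for t in
        ["about pricing", "founded in 1998", "office locations"]]

# Reciprocal rank of "1998": first match is at rank 2, so RR = 1/2.
keyword = "1998"
rr = next((1.0 / rank for rank, d in enumerate(docs, start=1)
           if keyword in d.page_content.lower()), 0.0)

# Binary relevance over the top 3 docs: [0, 1, 0].
relevances = [1 if keyword in d.page_content.lower() else 0 for d in docs]
dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))
idcg = sum(rel / math.log2(i + 2)
           for i, rel in enumerate(sorted(relevances, reverse=True)))
ndcg = dcg / idcg  # (1/log2(3)) / (1/log2(2))
print(rr, round(ndcg, 3))  # 0.5 0.631
```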

Putting it all together looks like this:

# First create the model for the answer format
class RetrievalEval(BaseModel):
    """Evaluation metrics for retrieval performance."""

    mrr: float = Field(description="Mean Reciprocal Rank - average across all keywords")
    ndcg: float = Field(description="Normalized Discounted Cumulative Gain (binary relevance)")
    keywords_found: int = Field(description="Number of keywords found in top-k results")
    total_keywords: int = Field(description="Total number of keywords to find")
    keyword_coverage: float = Field(description="Percentage of keywords found")


def evaluate_retrieval(test: TestQuestion, k: int = 10) -> RetrievalEval:
    """
    Evaluate retrieval performance for a test question.

    Args:
        test: TestQuestion object containing question and keywords
        k: Number of top documents to retrieve (default 10)

    Returns:
        RetrievalEval object with MRR, nDCG, and keyword coverage metrics
    """
    # Retrieve documents using shared answer module
    retrieved_docs = fetch_context(test.question)

    # Calculate MRR (average across all keywords)
    mrr_scores = [calculate_mrr(keyword, retrieved_docs) for keyword in test.keywords]
    avg_mrr = sum(mrr_scores) / len(mrr_scores) if mrr_scores else 0.0

    # Calculate nDCG (average across all keywords)
    ndcg_scores = [calculate_ndcg(keyword, retrieved_docs, k) for keyword in test.keywords]
    avg_ndcg = sum(ndcg_scores) / len(ndcg_scores) if ndcg_scores else 0.0

    # Calculate keyword coverage
    keywords_found = sum(1 for score in mrr_scores if score > 0)
    total_keywords = len(test.keywords)
    keyword_coverage = calculate_keyword_coverage(mrr_scores, test.keywords)

    return RetrievalEval(
        mrr=avg_mrr,
        ndcg=avg_ndcg,
        keywords_found=keywords_found,
        total_keywords=total_keywords,
        keyword_coverage=keyword_coverage,
    )
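evaluate_retrieval scores a single question; suite-level numbers are just the averages of the per-question metrics. A minimal sketch with stand-in values (plain dicts instead of RetrievalEval objects, so it runs standalone):

```python
# Stand-in per-question results; real values come from evaluate_retrieval.
results = [
    {"mrr": 0.9, "ndcg": 1.0, "keyword_coverage": 100.0},
    {"mrr": 0.5, "ndcg": 0.63, "keyword_coverage": 50.0},
    {"mrr": 0.7, "ndcg": 0.8, "keyword_coverage": 100.0},
]

def average(metric: str) -> float:
    """Mean of one metric across the whole test set."""
    return sum(r[metric] for r in results) / len(results)

print(round(average("mrr"), 2), round(average("keyword_coverage"), 1))
```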

5.3 Evaluate Answers

Here we need the help of an LLM, so we will use the LLM-as-a-judge pattern. We will evaluate the answers on the following dimensions:

  • Accuracy
  • Completeness
  • Relevance

As with the retrieval evaluation, we first create a Pydantic model:

class AnswerEval(BaseModel):
    """LLM-as-a-judge evaluation of answer quality."""

    feedback: str = Field(
        description="Concise feedback on the answer quality, comparing it to the reference answer and evaluating based on the retrieved context"
    )
    accuracy: float = Field(
        description="How factually correct is the answer compared to the reference answer? 1 (wrong; any factually incorrect answer must score 1) to 5 (ideal, perfectly accurate). An acceptable answer would score 3."
    )
    completeness: float = Field(
        description="How complete is the answer in addressing all aspects of the question? 1 (very poor - missing key information) to 5 (ideal - all the information from the reference answer is provided completely). Only answer 5 if ALL information from the reference answer is included."
    )
    relevance: float = Field(
        description="How relevant is the answer to the specific question asked? 1 (very poor - off-topic) to 5 (ideal - directly addresses question and gives no additional information). Only answer 5 if the answer is completely relevant to the question and gives no additional information."
    )

Answer evaluation method:

def evaluate_answer(test: TestQuestion) -> tuple[AnswerEval, str, list]:
    """
    Evaluate answer quality using LLM-as-a-judge.

    Args:
        test: TestQuestion object containing question and reference answer

    Returns:
        Tuple of (AnswerEval object, generated_answer string, retrieved_docs list)
    """
    # Get RAG response using shared answer module
    generated_answer, retrieved_docs = answer_question(test.question)

    # LLM judge prompt
    judge_messages = [
        {
            "role": "system",
            "content": "You are an expert evaluator assessing the quality of answers. Evaluate the generated answer by comparing it to the reference answer. Only give 5/5 scores for perfect answers.",
        },
        {
            "role": "user",
            "content": f"""Question:
                        {test.question}

                        Generated Answer:
                        {generated_answer}

                        Reference Answer:
                        {test.reference_answer}

                        Please evaluate the generated answer on three dimensions:
                        1. Accuracy: How factually correct is it compared to the reference answer? Only give 5/5 scores for perfect answers.
                        2. Completeness: How thoroughly does it address all aspects of the question, covering all the information from the reference answer?
                        3. Relevance: How well does it directly answer the specific question asked, giving no additional information?

                        Provide detailed feedback and scores from 1 (very poor) to 5 (ideal) for each dimension. If the answer is wrong, then the accuracy score must be 1.""",
        },
    ]

    # Call the LLM judge with structured outputs
    judge_response = completion(model=MODEL, messages=judge_messages, response_format=AnswerEval)

    answer_eval = AnswerEval.model_validate_json(judge_response.choices[0].message.content)

    return answer_eval, generated_answer, retrieved_docs
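The judge replies with a JSON string matching AnswerEval. For illustration, here is how a made-up response parses, and one way the per-dimension scores could be averaged into a single number (the real code uses `AnswerEval.model_validate_json` instead of `json.loads`):

```python
import json

# A hypothetical judge response, for illustration only.
sample = ('{"feedback": "Close to the reference but misses one detail.", '
          '"accuracy": 4, "completeness": 3.5, "relevance": 5}')

scores = json.loads(sample)
overall = (scores["accuracy"] + scores["completeness"] + scores["relevance"]) / 3
print(round(overall, 2))  # 4.17
```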

Done. At this point we have a baseline to beat. We can improve the numbers by experimenting with other chunk sizes or other text splitters (https://api.python.langchain.com/en/latest/text_splitters/index.html), or by trying other embedding models such as OpenAI’s text-embedding-3-large.

Here is the baseline:

  • MRR: 0.76
  • nDCG: 0.81
  • Keyword coverage: 88%
  • Accuracy: 3.8/5
  • Completeness: 3.5/5
  • Relevance: 4.2/5

5.4 Optimize

Next, we will tune chunking strategies, retriever settings, and embedding models to push these baseline numbers up.
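For example, a chunk-size sweep re-chunks the corpus, re-indexes it, and re-runs the retrieval evaluation for each setting. The loop below sketches that shape with a toy character splitter; the real pipeline would use a LangChain text splitter and average evaluate_retrieval over the test set:

```python
def split(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Toy fixed-size character splitter with overlap."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "x" * 1000  # stand-in corpus
counts = {}
for chunk_size in (200, 400, 800):
    chunks = split(text, chunk_size, overlap=50)
    counts[chunk_size] = len(chunks)
    # Here you would re-index the chunks and re-run the retrieval
    # evaluation to compare settings.
print(counts)
```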

