AI Agent Technical Deep Dive

# RAG Evaluation: How Do You Measure RAG Effectiveness?

Published: 2026-02-28 · Category: AI Agent · Estimated reading time: 8 minutes · Author: 吴长龙

Whether a RAG system is any good cannot be judged by feel alone. This article introduces the core metrics, methodology, and practical tools for RAG evaluation, so your optimization work is grounded in evidence.

"I think this RAG works pretty well" is exactly the kind of subjective judgment that tends to be unreliable.

Rigorous evaluation is the foundation of RAG optimization. This article walks through a complete methodology for RAG evaluation, covering retrieval quality, generation quality, end-to-end metrics, and practical evaluation tools.

## 1. The Two Layers of RAG Evaluation

### 1.1 Retrieval Evaluation

Assess whether the retrieved documents are relevant:

  • Were the relevant documents retrieved?
  • Are the retrieved documents all relevant?
  • Are the relevant documents ranked near the top?

### 1.2 Generation Evaluation

Assess whether the final answer is correct:

  • Is the answer grounded in the retrieved documents?
  • Does the answer accurately address the question?
  • Does the answer include source citations?

## 2. Retrieval Evaluation Metrics

### 2.1 Basic Metrics

| Metric | Meaning | Formula |
|--------|---------|---------|
| Precision | Fraction of retrieved results that are relevant | TP / (TP + FP) |
| Recall | Fraction of relevant documents that get retrieved | TP / (TP + FN) |
| F1 Score | Harmonic mean of precision and recall | 2 × P × R / (P + R) |
```python
def calculate_precision_recall(retrieved: list, relevant: list) -> dict:
    """Compute precision, recall, and F1 for a single query."""
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)

    tp = len(retrieved_set & relevant_set)   # relevant documents that were retrieved
    fp = len(retrieved_set - relevant_set)   # retrieved but not relevant
    fn = len(relevant_set - retrieved_set)   # relevant but missed

    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0

    return {"precision": precision, "recall": recall, "f1": f1}
```

### 2.2 Ranking Metrics

| Metric | Full name | Meaning | Formula |
|--------|-----------|---------|---------|
| MRR | Mean Reciprocal Rank | Mean reciprocal rank of the first relevant result | Σ(1/rank_i) / N |
| MAP | Mean Average Precision | Mean of per-query average precision | Σ(AP_i) / N |
| NDCG | Normalized DCG | Relevance score that accounts for ranking position | DCG / IDCG |
```python
import numpy as np

def calculate_mrr(retrieved_rankings: list[list[int]], relevant_docs: set[int]) -> float:
    """Compute MRR: the mean reciprocal rank of the first relevant result per query."""
    reciprocal_ranks = []

    for rankings in retrieved_rankings:
        for rank, doc_id in enumerate(rankings, 1):
            if doc_id in relevant_docs:  # first relevant document found
                reciprocal_ranks.append(1 / rank)
                break
        else:
            reciprocal_ranks.append(0)  # no relevant document retrieved

    return float(np.mean(reciprocal_ranks))

def calculate_ndcg(retrieved: list, relevant: list, k: int = 10) -> float:
    """Compute NDCG@k with binary relevance."""
    # DCG: discounted gain of relevant documents in the retrieved ranking
    dcg = 0.0
    for i, doc in enumerate(retrieved[:k], 1):
        if doc in relevant:
            dcg += 1 / np.log2(i + 1)

    # IDCG: DCG of the ideal ranking (all relevant documents first)
    idcg = sum(1 / np.log2(i + 1) for i in range(1, min(len(relevant), k) + 1))

    return dcg / idcg if idcg > 0 else 0.0
```
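
The ranking-metrics table lists MAP, but the snippet only covers MRR and NDCG. Here is a minimal sketch of MAP under the same binary-relevance assumption (the helper names `average_precision` and `calculate_map` are mine):

```python
import numpy as np

def average_precision(retrieved: list, relevant: set) -> float:
    """Average precision for one query: mean of precision@k at each relevant hit."""
    hits = 0
    precisions = []
    for k, doc in enumerate(retrieved, 1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)  # precision@k at the position of this hit
    return float(np.mean(precisions)) if precisions else 0.0

def calculate_map(all_retrieved: list[list], all_relevant: list[set]) -> float:
    """MAP: mean of per-query average precision, i.e. Σ(AP_i) / N."""
    return float(np.mean([
        average_precision(ret, rel)
        for ret, rel in zip(all_retrieved, all_relevant)
    ]))
```

Unlike MRR, MAP rewards placing *every* relevant document high, not just the first one.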

### 2.3 Hit Rate

```python
def calculate_hit_rate(retrieved: list[list], relevant: list[list]) -> float:
    """Hit rate: fraction of queries that retrieve at least one relevant document."""
    hits = 0

    for ret, rel in zip(retrieved, relevant):
        if set(ret) & set(rel):  # non-empty intersection means a hit
            hits += 1

    return hits / len(retrieved) if retrieved else 0
```

## 3. Generation Evaluation Metrics

### 3.1 Answer Quality Metrics

| Metric | Description |
|--------|-------------|
| Faithfulness | Is the answer grounded in the retrieved documents (no hallucination)? |
| Answer Relevance | Is the answer relevant to the question? |
| Context Precision | Is the retrieved content useful for the answer? |
| Context Recall | Does the retrieved content cover the information the answer needs? |
```python
# Evaluation with LangChain's criteria evaluators (the faithfulness criterion
# wording below is custom, since it is not a built-in criterion)
from langchain.evaluation import load_evaluator

# Faithfulness: "labeled_criteria" grades against a reference,
# here the retrieved context
faithfulness_eval = load_evaluator(
    "labeled_criteria",
    criteria={"faithfulness": "Is the answer grounded in the provided reference context?"}
)

result = faithfulness_eval.evaluate_strings(
    prediction="the generated answer",
    input="the question",
    reference="the retrieved context"
)

# Answer relevance: "relevance" is one of LangChain's built-in criteria
relevance_eval = load_evaluator("criteria", criteria="relevance")

result = relevance_eval.evaluate_strings(
    prediction="the generated answer",
    input="the question"
)
```

### 3.2 LLM-as-Judge

Use an LLM to grade answer quality:

```python
from langchain_openai import ChatOpenAI
from langchain.prompts import PromptTemplate

llm = ChatOpenAI(model="gpt-4o")

judge_prompt = PromptTemplate.from_template("""
You are a professional AI evaluator. Grade the answer against the criteria below.

Question: {question}
Retrieved context: {context}
Generated answer: {answer}

Criteria:
1. Accuracy: does the answer correctly address the question?
2. Completeness: does the answer cover all aspects of the question?
3. Faithfulness: is the answer grounded in the provided context?
4. Clarity: is the answer clearly expressed?

Give an overall score from 1 to 10 and explain your reasoning.

Score:""")

def evaluate_with_llm(question: str, context: str, answer: str) -> dict:
    prompt = judge_prompt.format(
        question=question,
        context=context,
        answer=answer
    )

    result = llm.invoke(prompt)

    # Parse the score from result.content
    # ... (can be processed further)

    return {
        "evaluation": result.content,
        "question": question,
        "answer": answer
    }
```

### 3.3 Citation Accuracy

```python
def evaluate_citation(answer: str, contexts: list[str]) -> dict:
    """Check whether the answer's claims can be traced back to the retrieved contexts."""
    # extract_claims (splitting the answer into factual claims) must be supplied elsewhere
    claims = extract_claims(answer)
    citations_found = 0

    for claim in claims:
        # A claim counts as cited if it appears in any retrieved context
        if any(claim in ctx for ctx in contexts):
            citations_found += 1

    citation_rate = citations_found / len(claims) if claims else 0

    return {
        "citation_rate": citation_rate,
        "claims_with_citation": citations_found,
        "total_claims": len(claims)
    }
```

## 4. End-to-End Evaluation

### 4.1 RAGAS Metrics

RAGAS (RAG Assessment) is a metric suite designed specifically for evaluating RAG:

```python
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall
)
from datasets import Dataset

# Prepare the test data
test_data = {
    "question": ["question 1", "question 2"],
    "answer": ["answer 1", "answer 2"],
    "contexts": [["context 1"], ["context 2"]],
    "ground_truth": ["reference answer 1", "reference answer 2"]
}

dataset = Dataset.from_dict(test_data)

# Run the evaluation
results = evaluate(
    dataset=dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall
    ]
)

# Inspect the results
print(results)
```

### 4.2 A Manual Evaluation Pipeline

```python
import numpy as np

class RAGEvaluator:
    def __init__(self, rag_system, test_cases: list):
        self.rag = rag_system
        self.test_cases = test_cases

    def evaluate_retrieval(self) -> dict:
        """Evaluate retrieval quality."""
        results = []

        for case in self.test_cases:
            retrieved = self.rag.retrieve(case["question"])
            relevant = case["relevant_docs"]

            # Per-query metrics (calculate_precision_recall from section 2.1)
            metrics = calculate_precision_recall(retrieved, relevant)
            metrics["question"] = case["question"]
            results.append(metrics)

        return {
            "avg_precision": np.mean([r["precision"] for r in results]),
            "avg_recall": np.mean([r["recall"] for r in results]),
            "avg_f1": np.mean([r["f1"] for r in results]),
            "details": results
        }

    def evaluate_generation(self) -> dict:
        """Evaluate generation quality."""
        results = []

        for case in self.test_cases:
            answer = self.rag.answer(case["question"])
            context = self.rag.retrieve(case["question"])

            # LLM-as-Judge scoring (evaluate_with_llm from section 3.2)
            eval_result = evaluate_with_llm(
                case["question"],
                "\n".join(context),
                answer
            )

            results.append({
                "question": case["question"],
                "answer": answer,
                "evaluation": eval_result["evaluation"]
            })

        return {"details": results}

    def evaluate_end_to_end(self) -> dict:
        """End-to-end evaluation."""
        results = []

        for case in self.test_cases:
            answer = self.rag.answer(case["question"])

            # Compare against the reference answer (compute_similarity supplied elsewhere)
            similarity = compute_similarity(answer, case["ground_truth"])

            results.append({
                "question": case["question"],
                "answer": answer,
                "ground_truth": case["ground_truth"],
                "similarity": similarity
            })

        return {
            "avg_similarity": np.mean([r["similarity"] for r in results]),
            "details": results
        }
```
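
`compute_similarity` in `evaluate_end_to_end` is left undefined. In practice, cosine similarity over sentence embeddings is a common choice; as a dependency-free stand-in, here is a crude token-overlap (Jaccard) sketch, purely illustrative:

```python
def compute_similarity(answer: str, ground_truth: str) -> float:
    """Crude token-overlap (Jaccard) similarity between answer and reference.

    A stand-in for embedding cosine similarity: it catches exact wording overlap
    but not paraphrases, so treat it as a lower bound on semantic similarity.
    """
    a = set(answer.lower().split())
    b = set(ground_truth.lower().split())
    if not a and not b:
        return 1.0  # two empty strings are trivially identical
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)  # intersection over union of token sets
```

Swapping this for an embedding-based similarity changes only this one function; the evaluator class is unaffected.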

## 5. Evaluation Tools

### 5.1 RAGAS

An evaluation framework built specifically for RAG:

```bash
pip install ragas
```

```python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# Evaluate the RAG system (test_dataset prepared as in section 4.1)
results = evaluate(
    dataset=test_dataset,
    metrics=[faithfulness, answer_relevancy]
)
```

### 5.2 LangSmith

LangChain's evaluation platform:

```python
# Illustrative sketch; the LangSmith SDK surface evolves quickly,
# so check the current docs for the exact call signature
from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()  # reads LANGCHAIN_API_KEY from the environment

# Run the RAG system against a dataset stored in LangSmith
experiment_results = evaluate(
    lambda inputs: {"answer": rag.answer(inputs["question"])},  # target under test
    data="rag-optimization",          # name of a dataset created in LangSmith
    experiment_prefix="rag-eval"
)
```

### 5.3 Building Your Own Evaluation Harness

```python
# A minimal evaluation dashboard built with Streamlit
import pandas as pd
import streamlit as st

def evaluation_dashboard(evaluator: RAGEvaluator):
    st.title("RAG Evaluation Dashboard")

    # Retrieval metrics
    retrieval_metrics = evaluator.evaluate_retrieval()
    st.metric("Average precision", f"{retrieval_metrics['avg_precision']:.2%}")
    st.metric("Average recall", f"{retrieval_metrics['avg_recall']:.2%}")

    # Generation metrics
    generation_metrics = evaluator.evaluate_generation()

    # Per-question detail table
    st.dataframe(pd.DataFrame(retrieval_metrics["details"]))
```

## 6. Designing Test Cases

### 6.1 Types of Test Cases

| Type | Description | Example |
|------|-------------|---------|
| Simple fact | Answer is found directly in a document | "What year was the company founded?" |
| Reasoning | Requires simple inference | "According to the documents, how large is the AI market?" |
| Multi-hop | Requires combining multiple documents | "How does company A's product compare with company B's..." |
| Boundary | Deliberately has no answer | "Information that is not in the documents" |

### 6.2 Building the Test Set

```python
test_cases = [
    {
        "question": "What is the company's main line of business?",
        "relevant_docs": ["doc_1", "doc_2"],
        "ground_truth": "The company mainly provides AI assistants and data..."
    },
    {
        "question": "What was revenue in 2024?",
        "relevant_docs": ["doc_3"],
        "ground_truth": "2024 revenue was 100 million RMB"
    },
    # ... more test cases
]
```

## 7. Continuous Monitoring

### 7.1 Production Monitoring

```python
import logging
from datetime import datetime

class RAGMonitor:
    def __init__(self):
        self.logger = logging.getLogger("rag_monitor")
        self.metrics = []

    def log_request(self, question: str, answer: str, contexts: list, latency: float):
        """Record one request."""
        self.metrics.append({
            "timestamp": datetime.now(),
            "question": question,
            "answer_length": len(answer),
            "num_contexts": len(contexts),
            "latency": latency
        })

    def get_stats(self) -> dict:
        """Aggregate statistics over the logged requests."""
        import pandas as pd

        df = pd.DataFrame(self.metrics)

        return {
            "total_requests": len(df),
            "avg_latency": df["latency"].mean(),
            "avg_answer_length": df["answer_length"].mean(),
            "avg_contexts": df["num_contexts"].mean()
        }
```

### 7.2 Alerting

```python
class RAGAlert:
    def __init__(self, recall_threshold: float = 0.6):
        self.recall_threshold = recall_threshold

    def check(self, metrics: dict):
        # send_alert (email, chat webhook, ...) must be supplied elsewhere
        if metrics["avg_recall"] < self.recall_threshold:
            send_alert(f"Retrieval recall fell below threshold: {metrics['avg_recall']:.2%}")
```

## 8. Summary

| Layer | Key metrics | Tools |
|-------|-------------|-------|
| Retrieval | Precision, Recall, MRR, NDCG | Custom scripts |
| Generation | Faithfulness, Relevance | RAGAS, LLM-as-Judge |
| End-to-end | Answer similarity, user satisfaction | LangSmith |
| Monitoring | Latency, recall, hit rate | Custom dashboard |

Best practices:

  • Establish basic evaluation metrics first
  • Supplement with LLM-as-Judge for subjective quality
  • Build a test set that covers different scenarios
  • Monitor continuously to catch problems early

RAG evaluation is an ongoing process. A good evaluation setup makes your optimization direction clear, so you no longer tune by feel.

In the next article we move into the territory of multimodal agents: teaching AI to "see" the world.