AI Agent · Technical Deep Dive

Advanced RAG: Optimizing Chunking and Embedding

Published: 2026/02/22
Category: AI Agent
Estimated reading time: 8 minutes
Author: 吴长龙

A basic RAG pipeline works, but often not well. This article takes a deep dive into chunking strategies and embedding model optimizations to push retrieval quality up a level.


# Advanced RAG: Optimizing Chunking and Embedding

In the previous post we built a minimal working RAG pipeline. In real applications, though, people often run into these problems:

- Relevant content isn't retrieved
- Recalled content is incomplete
- Answer quality is poor

The root cause usually lies in two stages: **Chunking** (how documents are split) and **Embedding** (how text is vectorized). This article digs into optimization techniques for both.

## 1. Chunking Optimization

### 1.1 Why Does Chunking Matter?

Chunk size determines the **granularity of retrieval** (a quick sketch follows this list):

- Chunks too small → context is lost and the LLM struggles to understand them
- Chunks too large → noise creeps in, retrieval loses precision, and you risk exceeding the LLM's context window
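
A minimal sketch to make the tradeoff concrete: split the same text at two chunk sizes and compare the results. The sample text and sizes below are illustrative, not recommendations.

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = (
    "RAG systems retrieve document chunks and feed them to an LLM. "
    "If chunks are too small, each one loses its surrounding context. "
    "If chunks are too large, one chunk mixes several topics, so its "
    "embedding becomes a blurry average and retrieval precision drops."
)

for size in (50, 500):
    splitter = RecursiveCharacterTextSplitter(chunk_size=size, chunk_overlap=0)
    chunks = splitter.split_text(text)
    print(f"chunk_size={size}: {len(chunks)} chunks, first = {chunks[0][:60]!r}")
```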

### 1.2 Fixed-Size Chunking (Improved)

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Basic version
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)

# Improved version: preserve paragraph and sentence structure
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", "。", "!", "?", " ", ""],
    keep_separator=True
)
```

### 1.3 Semantic Chunking

```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# Split based on embedding similarity between sentences
semantic_splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="gradient",
    breakpoint_threshold_amount=0.5
)

# Splits automatically at semantic turning points
chunks = semantic_splitter.split_text(long_text)
```

### 1.4 Markdown and Code Chunking

```python
from langchain.text_splitter import (
    Language,
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

# Markdown chunking: split on headers, which are kept as chunk metadata
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[
        ("#", "title"),
        ("##", "section"),
        ("###", "subsection")
    ]
)

# Code chunking: split along Python syntax boundaries
code_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=200,
    chunk_overlap=30
)
```
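
A quick usage sketch (the sample markdown is made up): `split_text` returns `Document` objects whose metadata records the headers each chunk sits under, keyed by the names declared above.

```python
md = "# Guide\n## Install\npip install foo\n## Usage\nimport foo"

for doc in markdown_splitter.split_text(md):
    # e.g. {'title': 'Guide', 'section': 'Install'} -> 'pip install foo'
    print(doc.metadata, "->", doc.page_content)
```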

### 1.5 Parent Document Retrieval

```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Large chunks (parent): what gets handed to the LLM
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)

# Small chunks (child): what gets embedded and matched
child_splitter = RecursiveCharacterTextSplitter(chunk_size=300)

# Vector store for the child chunks; Chroma is used here because it
# can start empty, while FAISS requires documents at construction time
vectorstore = Chroma(
    collection_name="parent_demo",
    embedding_function=OpenAIEmbeddings()
)

# Parent Document retriever: match on child chunks, return parent chunks
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=InMemoryStore(),
    parent_splitter=parent_splitter,
    child_splitter=child_splitter
)

# Index the source documents (docs: list[Document])
retriever.add_documents(docs)
```
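
At query time the match happens on a small child chunk, but what comes back is the larger parent chunk containing it. A minimal usage sketch (the query string is illustrative):

```python
results = retriever.get_relevant_documents("如何配置向量数据库?")
for doc in results:
    print(len(doc.page_content))  # parent-sized chunks, up to ~2000 chars
```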

## 2. Embedding Optimization

### 2.1 Choosing the Right Embedding Model

| Model | Dimensions | Characteristics | Recommended for |
| --- | --- | --- | --- |
| text-embedding-3-small | 1536 | Cheap, fast | Cost-sensitive workloads |
| text-embedding-3-large | 3072 | Best quality | Precision-first workloads |
| bge-large-zh | 1024 | Optimized for Chinese | Chinese text |
| bge-m3 | 1024 | Multilingual | Cross-lingual retrieval |

```python
# Recommended for Chinese text: BGE
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-large-zh-v1.5",
    model_kwargs={'device': 'cpu'},
    encode_kwargs={'normalize_embeddings': True}  # recommended for BGE models
)

# Vectorize a query
vector = embeddings.embed_query("中文测试")
```

### 2.2 Matryoshka Representation Learning

Matryoshka-trained models concentrate the most important information in the leading dimensions, so vectors can be truncated with little quality loss. One option is OpenAI's text-embedding-3 family, which supports this natively through the `dimensions` parameter:

```python
import numpy as np
from langchain_openai import OpenAIEmbeddings

# text-embedding-3 models are trained with Matryoshka representation learning
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

# Full 3072-dimensional vector
full_vector = embeddings.embed_query("text")

# Keep only the first 256 dimensions; re-normalize before cosine similarity
reduced = np.array(full_vector[:256])
reduced_vector = (reduced / np.linalg.norm(reduced)).tolist()

# Alternatively, let the API truncate and normalize for you
embeddings_256 = OpenAIEmbeddings(model="text-embedding-3-large", dimensions=256)
```

### 2.3 Dynamic Embeddings

Use different embedding strategies for different content types. Note that vectors from different models live in different spaces, so code chunks and text chunks must be stored in separate indexes, with queries routed by the same heuristic (see the sketch after the class):

```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_openai import OpenAIEmbeddings

class DynamicEmbedding:
    def __init__(self):
        self.code_embedding = HuggingFaceEmbeddings(
            model_name="microsoft/codebert-base"
        )
        self.text_embedding = OpenAIEmbeddings(
            model="text-embedding-3-small"
        )

    def embed(self, text: str) -> list[float]:
        if self.is_code(text):
            return self.code_embedding.embed_query(text)
        return self.text_embedding.embed_query(text)

    def is_code(self, text: str) -> bool:
        # Crude heuristic: look for code-like tokens
        code_indicators = ['def ', 'class ', 'import ', 'function ', '=>']
        return any(ind in text for ind in code_indicators)
```
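
A minimal routing sketch on top of the class, assuming `code_docs` and `text_docs` are your pre-sorted document lists (both names are assumptions for illustration):

```python
from langchain_community.vectorstores import FAISS

embedder = DynamicEmbedding()

# One index per embedding space (code_docs / text_docs are assumed inputs)
code_store = FAISS.from_documents(code_docs, embedder.code_embedding)
text_store = FAISS.from_documents(text_docs, embedder.text_embedding)

def search(query: str, k: int = 3):
    # Route the query to the index whose model will embed it
    store = code_store if embedder.is_code(query) else text_store
    return store.similarity_search(query, k=k)
```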

## 3. Index Optimization

### 3.1 Hierarchical Indexing

```python
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS

# Build a two-layer index

# 1. Summary layer (coarse filter): large chunks over all documents
summary_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
summary_docs = summary_splitter.split_documents(docs)
summary_vectorstore = FAISS.from_documents(summary_docs, embeddings)

# 2. Detail layer (fine filter): small chunks, built per query below
detail_splitter = RecursiveCharacterTextSplitter(chunk_size=300)

# Hierarchical retrieval
def hierarchical_search(query: str, k: int = 3) -> list[Document]:
    # Layer 1: find the relevant coarse chunks
    summary_results = summary_vectorstore.similarity_search(query, k=k * 2)

    # Layer 2: re-chunk only those hits and search within them
    relevant_texts = [doc.page_content for doc in summary_results]
    detail_chunks = detail_splitter.create_documents(relevant_texts)

    # Throwaway index per query, so chunks don't accumulate across calls
    detail_vectorstore = FAISS.from_documents(detail_chunks, embeddings)
    return detail_vectorstore.similarity_search(query, k=k)
```

### 3.2 Keyword Enhancement

```python
import jieba

def keyword_enhanced_search(query: str, k: int = 3) -> list[Document]:
    # Extract keywords
    keywords = extract_keywords(query)

    # Expand the query with them
    expanded_query = f"{query} {' '.join(keywords)}"

    # Retrieve
    return vectorstore.similarity_search(expanded_query, k=k)

def extract_keywords(text: str) -> list[str]:
    """Naive keyword extraction (an NLP library or an LLM also works)"""
    stopwords = {"的", "了", "是", "在", "和", "有"}
    words = jieba.cut(text)
    return [w for w in words if w not in stopwords and len(w) > 1]
```

### 3.3 Metadata Filtering

```python
from langchain.schema import Document

# Attach metadata to each document
docs_with_metadata = [
    Document(
        page_content="document text",
        metadata={
            "source": "user_manual",
            "category": "faq",
            "year": 2024,
            "author": "team"
        }
    )
]

# Build a vector store that keeps the metadata
vectorstore = FAISS.from_documents(docs_with_metadata, embeddings)

# Filter at retrieval time (operator support such as $gte varies by vector store)
results = vectorstore.similarity_search(
    query="some question",
    k=3,
    filter={"category": "faq", "year": {"$gte": 2023}}
)
```

## 4. Retrieval Optimization

### 4.1 Hybrid Retrieval

```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# Vector retrieval
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# Keyword retrieval
keyword_retriever = BM25Retriever.from_documents(docs)
keyword_retriever.k = 3

# Combine: results are fused with weighted Reciprocal Rank Fusion
ensemble = EnsembleRetriever(
    retrievers=[vector_retriever, keyword_retriever],
    weights=[0.6, 0.4]  # weight vector retrieval higher
)

# Use it
results = ensemble.get_relevant_documents("some query")
```

### 4.2 Reranking

```python
from langchain.schema import Document
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

# Reranking model
cross_encoder = HuggingFaceCrossEncoder(
    model_name="BAAI/bge-reranker-base"
)

def rerank_documents(query: str, documents: list[Document], top_k: int = 3) -> list[Document]:
    # Score each (query, document) pair for relevance
    pairs = [(query, doc.page_content) for doc in documents]
    scores = cross_encoder.score(pairs)

    # Sort by score, highest first
    scored = zip(documents, scores)
    sorted_docs = sorted(scored, key=lambda x: x[1], reverse=True)

    return [doc for doc, _ in sorted_docs[:top_k]]

# Typical flow: over-retrieve, then rerank down to the final top_k
initial_docs = vectorstore.similarity_search(query, k=10)
reranked_docs = rerank_documents(query, initial_docs, top_k=3)
```

### 4.3 Query Expansion and Rewriting

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")  # example model; any chat model works

def rewrite_query(query: str) -> str:
    """Rewrite the query with an LLM"""
    prompt = f"""Rewrite the user's question into a form better suited for retrieval.
    Add synonyms and related terminology; remove colloquial phrasing.

    Original question: {query}
    Rewritten:"""

    return llm.invoke(prompt).content

def expand_query_with_context(query: str) -> list[str]:
    """Generate several query variants"""
    prompt = f"""Based on the user's question, generate 3 different retrieval queries
    covering different angles and synonyms.

    Question: {query}

    Output (one per line):"""

    result = llm.invoke(prompt).content
    queries = result.strip().split("\n")
    return queries[:3]

# Usage
queries = expand_query_with_context("你们公司怎么收费?")
all_docs = []
for q in queries:
    docs = vectorstore.similarity_search(q, k=3)
    all_docs.extend(docs)

# Deduplicate by content
unique_docs = list({doc.page_content: doc for doc in all_docs}.values())
```

## 5. Evaluation and Iteration

### 5.1 Evaluating Retrieval Quality

```python
def evaluate_retrieval(retriever, test_cases: list[dict]) -> dict:
    """Evaluate retrieval quality"""
    results = []

    for case in test_cases:
        query = case["query"]
        relevant_docs = case["relevant_docs"]

        retrieved = retriever.get_relevant_documents(query)

        # A relevant snippet counts as a hit if it appears in any retrieved chunk
        retrieved_contents = [doc.page_content for doc in retrieved]
        hits = sum(1 for rel in relevant_docs if any(rel in r for r in retrieved_contents))

        results.append({
            "query": query,
            "hits": hits,
            "total_relevant": len(relevant_docs),
            "recall": hits / len(relevant_docs) if relevant_docs else 0
        })

    # Aggregate
    avg_recall = sum(r["recall"] for r in results) / len(results)
    return {"avg_recall": avg_recall, "details": results}
```
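
A minimal sketch of what `test_cases` could look like: each case pairs a query with snippets that must appear in the retrieved chunks (the sample queries and snippets are made up):

```python
test_cases = [
    {
        "query": "如何申请退款?",
        "relevant_docs": ["退款流程", "7 个工作日内退回"]
    },
    {
        "query": "支持哪些支付方式?",
        "relevant_docs": ["微信支付", "支付宝"]
    }
]

metrics = evaluate_retrieval(retriever, test_cases)
print(f"Average recall: {metrics['avg_recall']:.2%}")
```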

### 5.2 A/B Testing

```python
# Compare different configurations (build_rag is your own pipeline factory)
configs = [
    {"chunk_size": 300, "embedding": "text-embedding-3-small"},
    {"chunk_size": 500, "embedding": "bge-large-zh"},
    {"chunk_size": 300, "embedding": "bge-large-zh", "use_rerank": True}
]

for config in configs:
    # Build the RAG pipeline
    rag = build_rag(config)

    # Evaluate it
    metrics = evaluate_retrieval(rag.retriever, test_cases)

    print(f"Config: {config}")
    print(f"Recall: {metrics['avg_recall']:.2%}")
```

## 6. Summary

| Area | Technique | Effect |
| --- | --- | --- |
| Chunking | Semantic chunking | More sensible granularity |
| Chunking | Parent Document retrieval | Precise matching with full context |
| Embedding | Choosing the right model | Better vector quality |
| Embedding | Dynamic embeddings | Adapts to different content types |
| Indexing | Hierarchical index | Coarse filter + fine filter |
| Retrieval | Hybrid retrieval | Balances precision and recall |
| Retrieval | Reranking | Higher relevance |
| Retrieval | Query rewriting | Retrieval-friendly phrasing |

RAG optimization is an iterative process. A practical sequence:

- Get a basic version working first
- Evaluate it on real data
- Optimize the weakest stage
- Keep monitoring and iterating

In the next post we'll discuss RAG evaluation: how do you measure whether a RAG system actually works? We'll build a systematic evaluation framework.