AI Agent Technical Deep Dive
RAG Advanced: Chunking and Embedding Optimization
Published: 2026/02/22
Category: AI Agent
Estimated reading time: 8 minutes
Author: Wu Changlong
A basic RAG works, but it does not necessarily work well. This article takes a deep dive into chunking strategies and Embedding model optimization to move retrieval quality up a level.
# RAG Advanced: Chunking and Embedding Optimization
In the previous article we built a minimal viable RAG. In real applications, though, many people run into these problems:

- Relevant content is not retrieved
- Recalled content is incomplete
- Answer quality is poor

The root cause usually lies in two stages: Chunking and Embedding. This article goes deep on optimization techniques for both.
## 1. Chunking Optimization

### 1.1 Why Does Chunking Matter?

Chunking determines the granularity of retrieval:

- Chunks too small → context is lost and the LLM struggles to understand
- Chunks too large → noise creeps in, retrieval loses precision, and the LLM's context window may overflow
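The trade-off is easy to see with a bare-bones fixed-size splitter. This is a pure-Python illustration, not the LangChain implementation (the function name is ours): overlap repeats the tail of one chunk at the head of the next, so text cut at a boundary still appears whole in at least one chunk.

```python
def fixed_size_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Naive fixed-size splitter with overlap (illustration only)."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "".join(str(i % 10) for i in range(100))  # 100-char toy document
chunks = fixed_size_chunks(text, chunk_size=40, overlap=10)
# The last 10 chars of each chunk repeat as the first 10 of the next
```

Smaller `chunk_size` gives more precise matches but less context per hit; larger gives the opposite, which is exactly the tension the techniques below try to resolve.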
### 1.2 Fixed-Size Chunking (Improved)
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Basic version
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)

# Improved version: preserve paragraph structure
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", "。", "!", "?", " ", ""],
    keep_separator=True
)
```

### 1.3 Semantic Chunking
```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# Semantic splitting
semantic_splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="gradient",
    breakpoint_threshold_amount=0.5
)

# Splits automatically at semantic turning points
chunks = semantic_splitter.split_text(long_text)
```

### 1.4 Markdown / Code Chunking
```python
from langchain.text_splitter import (
    Language,
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

# Markdown chunking: split on headers and keep them as metadata
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[
        ("#", "title"),
        ("##", "section"),
        ("###", "subsection")
    ]
)

# Code chunking: split along language syntax boundaries
code_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=200,
    chunk_overlap=30
)
```

### 1.5 Parent Document Retrieval
```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# Large chunks (parent): returned to the LLM for full context
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
# Small chunks (child): embedded and searched
child_splitter = RecursiveCharacterTextSplitter(chunk_size=300)

# Vector store holds the child chunks
vectorstore = FAISS.from_documents(docs, OpenAIEmbeddings())

# Parent Document retriever: search small, return big
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=InMemoryStore(),
    parent_splitter=parent_splitter,
    child_splitter=child_splitter
)
retriever.add_documents(docs)
```

## 2. Embedding Optimization
### 2.1 Choosing the Right Embedding Model
| Model | Dimensions | Characteristics | Recommended Scenario |
|---|---|---|---|
| text-embedding-3-small | 1536 | Cheap and fast | Cost-sensitive |
| text-embedding-3-large | 3072 | Best quality | Accuracy-first |
| bge-large-zh | 1024 | Optimized for Chinese | Chinese content |
| bge-m3 | 1024 | Multilingual | Cross-lingual |
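Whichever model you choose, retrieval boils down to comparing the query vector with document vectors, most often by cosine similarity. A minimal pure-Python sketch of that comparison (real vector stores use optimized approximate-nearest-neighbor indexes instead of a linear scan):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-d vectors standing in for real embeddings
query_vec = [0.1, 0.3, 0.6]
doc_vecs = {"doc_a": [0.1, 0.3, 0.6], "doc_b": [0.9, 0.1, 0.0]}
ranked = sorted(doc_vecs, key=lambda d: cosine_similarity(query_vec, doc_vecs[d]), reverse=True)
# doc_a points in the same direction as the query, so it ranks first
```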
```python
# Recommended for Chinese content: BGE
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-large-zh-v1.5",
    model_kwargs={'device': 'cpu'}
)

# Embed a query
vector = embeddings.embed_query("中文测试")
```

### 2.2 Matryoshka Representation Learning
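Matryoshka-style training concentrates the most informative features in the leading dimensions, which is what makes truncation viable. One caveat worth knowing: a truncated vector is no longer unit-length, so re-normalize it before any cosine or dot-product comparison. A pure-Python sketch (the helper is our illustration, not a library API):

```python
import math

def truncate_and_renormalize(vec: list[float], dim: int) -> list[float]:
    """Keep the first `dim` dimensions, then rescale to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

vec = [0.5, 0.5, 0.5, 0.5]                # a unit-length toy embedding
short = truncate_and_renormalize(vec, 2)  # back to unit length in 2 dims
```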
Get strong results from fewer dimensions:
```python
from langchain_community.embeddings import HuggingFaceEmbeddings

# Matryoshka embeddings let you adjust the dimensionality dynamically
embeddings = HuggingFaceEmbeddings(
    model_name="dwzhu/e5-base-4m",
    model_kwargs={'device': 'cpu'}
)

# Full vector
full_vector = embeddings.embed_query("text")

# Keep only the first 256 dimensions (quality remains close to full)
reduced_vector = full_vector[:256]
```

### 2.3 Dynamic Embedding
Use different strategies for different content types:
```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_openai import OpenAIEmbeddings

class DynamicEmbedding:
    def __init__(self):
        self.code_embedding = HuggingFaceEmbeddings(
            model_name="microsoft/codebert-base"
        )
        self.text_embedding = OpenAIEmbeddings(
            model="text-embedding-3-small"
        )

    def embed(self, text: str) -> list[float]:
        if self.is_code(text):
            return self.code_embedding.embed_query(text)
        return self.text_embedding.embed_query(text)

    def is_code(self, text: str) -> bool:
        # Naive heuristic: look for code-like tokens
        code_indicators = ['def ', 'class ', 'import ', 'function ', '=>']
        return any(ind in text for ind in code_indicators)
```

## 3. Index Optimization
### 3.1 Hierarchical Indexing
```python
# Build a two-level index
# 1. Summary level (coarse filter)
summary_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
summary_docs = summary_splitter.create_documents(docs)
summary_vectorstore = FAISS.from_documents(summary_docs, embeddings)

# 2. Detail level (fine filter)
detail_splitter = RecursiveCharacterTextSplitter(chunk_size=300)
detail_docs = detail_splitter.create_documents(docs)
detail_vectorstore = FAISS.from_documents(detail_docs, embeddings)

# Hierarchical search
def hierarchical_search(query: str, k: int = 3) -> list[Document]:
    # Level 1: find the relevant documents
    summary_results = summary_vectorstore.similarity_search(query, k=k*2)
    # Level 2: fine-grained search within the relevant documents only
    relevant_texts = [doc.page_content for doc in summary_results]
    combined_text = "\n".join(relevant_texts)
    # Re-chunk at finer granularity and search
    # (caveat: add_documents grows the store on every call;
    # in production, build a fresh temporary index per query)
    detail_chunks = detail_splitter.create_documents([combined_text])
    detail_vectorstore.add_documents(detail_chunks)
    return detail_vectorstore.similarity_search(query, k=k)
```

### 3.2 Keyword Enhancement
```python
import jieba

def keyword_enhanced_search(query: str, k: int = 3) -> list[Document]:
    # Extract keywords
    keywords = extract_keywords(query)
    # Expand the query
    expanded_query = f"{query} {' '.join(keywords)}"
    # Retrieve
    return vectorstore.similarity_search(expanded_query, k=k)

def extract_keywords(text: str) -> list[str]:
    """Naive keyword extraction"""
    # An NLP library or an LLM also works here
    stopwords = {"的", "了", "是", "在", "和", "有"}
    words = jieba.cut(text)
    return [w for w in words if w not in stopwords and len(w) > 1]
```

### 3.3 Metadata Filtering
```python
from langchain.schema import Document

# Attach metadata to each document
docs_with_metadata = [
    Document(
        page_content="Document content",
        metadata={
            "source": "user_manual",
            "category": "faq",
            "year": 2024,
            "author": "team"
        }
    )
]

# Build the vector store with metadata included
vectorstore = FAISS.from_documents(docs_with_metadata, embeddings)

# Filter at query time
results = vectorstore.similarity_search(
    query="question",
    k=3,
    filter={"category": "faq", "year": {"$gte": 2023}}
)
```

## 4. Retrieval Optimization
### 4.1 Hybrid Retrieval
```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# Vector retrieval
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# Keyword retrieval
keyword_retriever = BM25Retriever.from_documents(docs)
keyword_retriever.k = 3

# Combine the two
ensemble = EnsembleRetriever(
    retrievers=[vector_retriever, keyword_retriever],
    weights=[0.6, 0.4]  # weight the vector side higher
)

# Usage
results = ensemble.get_relevant_documents("query")
```

### 4.2 Reranking
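Before reaching for the cross-encoder below, note that the EnsembleRetriever above already performs a lightweight rank-based fusion internally: Reciprocal Rank Fusion (RRF). A pure-Python sketch of unweighted RRF (the function is our illustration, not a library call):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of doc IDs; larger k flattens the rank bonus."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["d1", "d2", "d3"]     # ranking from the vector retriever
keyword_hits = ["d3", "d1", "d4"]    # ranking from BM25
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
# d1 and d3 appear high in both lists, so they rise to the top
```

RRF only needs ranks, not raw scores, which is why it can merge a BM25 list with a cosine-similarity list without any score calibration.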
```python
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

# Reranking model
cross_encoder = HuggingFaceCrossEncoder(
    model_name="BAAI/bge-reranker-base"
)

def rerank_documents(query: str, documents: list[Document], top_k: int = 3) -> list[Document]:
    # Score each (query, document) pair for relevance
    doc_texts = [doc.page_content for doc in documents]
    pairs = [(query, text) for text in doc_texts]
    scores = cross_encoder.score(pairs)
    # Sort by score, descending
    scored = zip(documents, scores)
    sorted_docs = sorted(scored, key=lambda x: x[1], reverse=True)
    return [doc for doc, score in sorted_docs[:top_k]]

# Typical flow: over-retrieve first, then rerank down to top_k
initial_docs = vectorstore.similarity_search(query, k=10)
reranked_docs = rerank_documents(query, initial_docs, top_k=3)
```

### 4.3 Query Expansion and Rewriting
```python
def rewrite_query(query: str) -> str:
    """Rewrite the query with an LLM"""
    prompt = f"""Rewrite the user's question into a form better suited for retrieval.
Add synonyms and related terms; strip colloquial phrasing.

Original question: {query}
Rewritten:"""
    return llm.invoke(prompt).content

def expand_query_with_context(query: str) -> list[str]:
    """Generate several query variants"""
    prompt = f"""Based on the user's question, generate 3 different retrieval queries
that cover different angles and synonyms.

Question: {query}
Output (one per line):"""
    result = llm.invoke(prompt).content
    queries = result.strip().split("\n")
    return queries[:3]

# Usage
queries = expand_query_with_context("How does your company charge?")
all_docs = []
for q in queries:
    docs = vectorstore.similarity_search(q, k=3)
    all_docs.extend(docs)

# Deduplicate by content
unique_docs = list({doc.page_content: doc for doc in all_docs}.values())
```

## 5. Evaluation and Iteration
### 5.1 Evaluating Retrieval Quality
```python
def evaluate_retrieval(retriever, test_cases: list[dict]) -> dict:
    """Evaluate retrieval quality"""
    results = []
    for case in test_cases:
        query = case["query"]
        relevant_docs = case["relevant_docs"]
        retrieved = retriever.get_relevant_documents(query)
        # Compute metrics
        retrieved_contents = [doc.page_content for doc in retrieved]
        hits = sum(1 for rel in relevant_docs if any(rel in r for r in retrieved_contents))
        results.append({
            "query": query,
            "hits": hits,
            "total_relevant": len(relevant_docs),
            "recall": hits / len(relevant_docs) if relevant_docs else 0
        })
    # Aggregate
    avg_recall = sum(r["recall"] for r in results) / len(results)
    return {"avg_recall": avg_recall, "details": results}
```

### 5.2 A/B Testing
```python
# Compare different configurations
configs = [
    {"chunk_size": 300, "embedding": "text-embedding-3-small"},
    {"chunk_size": 500, "embedding": "bge-large-zh"},
    {"chunk_size": 300, "embedding": "bge-large-zh", "use_rerank": True}
]

for config in configs:
    # Build the RAG pipeline
    rag = build_rag(config)
    # Evaluate it
    metrics = evaluate_retrieval(rag.retriever, test_cases)
    print(f"Config: {config}")
    print(f"Recall: {metrics['avg_recall']:.2%}")
```

## 6. Summary
| Optimization Area | Technique | Effect |
|---|---|---|
| Chunking | Semantic chunking | More sensible granularity |
| Chunking | Parent Document | Precision plus context |
| Embedding | Pick the right model | Better vector quality |
| Embedding | Dynamic embedding | Adapts to content types |
| Indexing | Hierarchical index | Coarse filter + fine filter |
| Retrieval | Hybrid retrieval | Balances precision and recall |
| Retrieval | Reranking | Higher relevance |
| Retrieval | Query rewriting | Phrasing that suits retrieval |
RAG optimization is an iterative process. Suggested approach:

- Get a basic version working first
- Evaluate it on real data
- Optimize the weakest links
- Keep monitoring and iterating
In the next article we will cover RAG evaluation: how do you measure whether RAG actually works, and how do you build a rigorous evaluation framework?