Advanced RAG: Chunking Strategies and Retrieval Optimization Trade-offs
Most RAG pipelines fail at retrieval, not generation. This article covers five chunking strategies, hybrid search, reranking pipelines, and a production-ready decision framework.
You built a RAG pipeline. You chunked your documents, generated embeddings, stored them in a vector database, and wired up a top-k retriever to feed context into your LLM. The problem: answer quality is inconsistent. Sometimes spot-on, sometimes completely off-topic. You swapped embedding models, upgraded to a larger LLM, and the improvement was marginal.
The problem is almost certainly not at the generation step. It's at retrieval. And retrieval quality is capped the moment you choose how to split your documents.
Why Retrieval Is the Bottleneck
RAG quality follows a dependency chain: chunking -> embedding -> retrieval -> reranking -> generation. If any link in this chain is weak, every downstream step is compensating for damage that's already done. In practice, the large majority of retrieval quality issues trace back to the chunking strategy and retrieval method you chose.
Here's a concrete example. Say your knowledge base contains API documentation where a single passage describes a function's parameters and return values. With fixed 512-token chunking, the parameter description and return value description might end up in separate chunks. A user asks "what does this function return?" The embedding retrieval hits the parameter chunk but misses the return value information.
This isn't an embedding model problem. It isn't a vector database problem. It's a chunking problem -- the semantic completeness of the original text was destroyed.
Five Chunking Strategies, Compared
1. Fixed-Size with Overlap
The simplest approach: split by character count or token count, with overlap between adjacent chunks.
from langchain.text_splitter import CharacterTextSplitter
splitter = CharacterTextSplitter(
separator="\n\n",
chunk_size=1000,
chunk_overlap=200,
length_function=len,
)
chunks = splitter.split_text(your_document)
print(f"Generated {len(chunks)} chunks, avg length {sum(len(c) for c in chunks) / len(chunks):.0f}")
Good for: Plain text corpora, log files, documents with no discernible structure. Use as a baseline to verify your pipeline works end-to-end.
Bad for: Structured documents (API docs, legal contracts, technical specs), documents with tables or code blocks. Fixed splitting will mercilessly cut through table rows, code blocks, and logical paragraph boundaries.
2. Recursive Character Splitting
The recommended default in LangChain, and the same idea behind LlamaIndex's default SentenceSplitter: split recursively by a prioritized list of separators -- double newlines first, then single newlines, then sentence-ending punctuation, and so on.
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
separators=["\n\n", "\n", ". ", "! ", "? ", " ", ""],
chunk_size=1000,
chunk_overlap=100,
)
chunks = splitter.split_text(your_document)
Why it's better than fixed-size: It respects natural language boundaries (paragraphs > sentences > words). In most cases, it won't split a sentence in half.
Why it's still limited: It doesn't understand document structure. In a Markdown file, content under ## API Reference and content under ## Getting Started might end up in the same chunk -- which is noise for retrieval purposes.
3. Semantic Chunking
Instead of splitting by size, split by semantic similarity: compute embeddings for adjacent sentences, and cut when similarity drops below a threshold.
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
splitter = SemanticChunker(
embeddings=embeddings,
breakpoint_threshold_type="percentile",
breakpoint_threshold_amount=75,
)
chunks = splitter.split_text(your_document)
for i, chunk in enumerate(chunks):
print(f"Chunk {i}: {len(chunk)} characters")
Strength: Each chunk is semantically coherent. Retrieved chunks are more likely to fully answer the user's question.
Cost: You need to embed every sentence, making this 5-10x slower. Suitable for offline preprocessing, not for real-time splitting. Threshold selection significantly affects results -- too low and chunks are fragmented, too high and they're unwieldy.
In practice, semantic chunking works best for long-form literature (papers, reports) and Q&A datasets. For structured documents, there are better options.
4. Structure-Aware Splitting
Uses the document's own structural markers (Markdown headings, HTML tags, code block boundaries, table rows) as split points. This is the approach most often recommended for production use today.
from langchain.text_splitter import MarkdownHeaderTextSplitter
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Step 1: Split by Markdown heading hierarchy
headers_to_split_on = [
("#", "h1"),
("##", "h2"),
("###", "h3"),
]
md_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_chunks = md_splitter.split_text(markdown_document)
# Step 2: For oversized sections, recursively split further
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=800,
chunk_overlap=100,
)
final_chunks = text_splitter.split_documents(md_chunks)
# Each chunk automatically carries heading metadata
for chunk in final_chunks[:3]:
print(f"Metadata: {chunk.metadata}")
print(f"Content: {chunk.page_content[:100]}...")
print("---")
For code documentation, use a language-aware splitter:
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter
python_splitter = RecursiveCharacterTextSplitter.from_language(
language=Language.PYTHON,
chunk_size=1500,
chunk_overlap=200,
)
code_chunks = python_splitter.split_text(python_source_code)
Key advantage: Chunks carry structural metadata (which section, which function) that can be used for filtering and boosting during retrieval. None of the other splitting methods above provides this out of the box.
Haystack excels here -- its PreProcessor component natively supports splitting at the paragraph or sentence level while automatically preserving metadata.
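As a rough sketch, assuming the Haystack 1.x PreProcessor API (in Haystack 2.x the equivalent component is DocumentSplitter, with different parameter names); the splitting values here are illustrative:
from haystack import Document
from haystack.nodes import PreProcessor

# Sketch only: parameters follow the Haystack 1.x PreProcessor; the values are assumptions
preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    split_by="word",
    split_length=200,
    split_overlap=20,
    split_respect_sentence_boundary=True,
)
# markdown_document is the same source text used in the splitter examples above
processed = preprocessor.process(
    [Document(content=markdown_document, meta={"source": "api-docs"})]
)
print(f"Haystack produced {len(processed)} chunks, metadata preserved: {processed[0].meta}")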
5. Late Chunking
An emerging pattern from 2024: run the entire document through a long-context embedding model first (Jina Embeddings v3, for example, accepts 8192 tokens of input), then pool the resulting token embeddings into chunk embeddings. Each chunk vector is therefore computed with the whole document in context.
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np
tokenizer = AutoTokenizer.from_pretrained("jinaai/jina-embeddings-v3")
model = AutoModel.from_pretrained("jinaai/jina-embeddings-v3", trust_remote_code=True)
def late_chunking(document: str, chunk_size: int = 500) -> list[dict]:
"""Embed the full document first, then chunk in token embedding space"""
inputs = tokenizer(document, return_tensors="pt", truncation=True, max_length=8192)
with torch.no_grad():
outputs = model(**inputs)
# Per-token embeddings
token_embeddings = outputs.last_hidden_state.squeeze(0) # [seq_len, hidden_dim]
# Group and mean-pool in token embedding space
tokens = inputs["input_ids"].squeeze(0)
chunk_embeddings = []
for i in range(0, len(tokens), chunk_size):
chunk_emb = token_embeddings[i : i + chunk_size].mean(dim=0)
chunk_text = tokenizer.decode(tokens[i : i + chunk_size], skip_special_tokens=True)
chunk_embeddings.append({
"text": chunk_text,
"embedding": chunk_emb.numpy(),
})
return chunk_embeddings
chunks = late_chunking(your_document, chunk_size=500)
print(f"Generated {len(chunks)} late-chunked embeddings")
Theoretical advantage: Chunk embeddings retain full-document context, solving the fundamental "context loss after splitting" problem.
Practical limitations: Requires a long-context embedding model, higher GPU memory, and higher inference latency. Currently best suited for small-scale, high-precision use cases. The EmbedAnything project provides efficient embedding pipelines that can serve as infrastructure for this approach.
Retrieval Optimization: Hybrid Search and Reranking
Choosing the right chunking strategy solves "what to retrieve." But "how to retrieve" matters equally. Pure vector similarity search sometimes loses to simple keyword matching.
Hybrid Search: Dense + Sparse
Hybrid search combines semantic retrieval (dense) with keyword retrieval (sparse / BM25), merging results via Reciprocal Rank Fusion (RRF).
from rank_bm25 import BM25Okapi
import numpy as np
def hybrid_search(
query: str,
chunks: list[str],
dense_embeddings: np.ndarray, # pre-computed chunk embeddings
bm25: BM25Okapi,
embedding_model,
top_k: int = 10,
alpha: float = 0.5, # weight for dense, 1-alpha for sparse
) -> list[tuple[int, float]]:
"""Hybrid search: semantic + keyword, fused with RRF"""
# Dense retrieval
query_emb = np.array(embedding_model.embed_query(query)).reshape(1, -1)
dense_scores = np.dot(dense_embeddings, query_emb.T).flatten()
dense_ranks = np.argsort(-dense_scores)
# Sparse retrieval (BM25)
tokenized_query = query.lower().split()
sparse_scores = bm25.get_scores(tokenized_query)
sparse_ranks = np.argsort(-sparse_scores)
# Reciprocal Rank Fusion
rrf_scores = {}
for rank, idx in enumerate(dense_ranks):
rrf_scores[idx] = rrf_scores.get(idx, 0) + alpha / (1 + rank)
for rank, idx in enumerate(sparse_ranks):
rrf_scores[idx] = rrf_scores.get(idx, 0) + (1 - alpha) / (1 + rank)
sorted_results = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)
return sorted_results[:top_k]
# Preprocessing
tokenized_corpus = [chunk.lower().split() for chunk in chunks]
bm25_index = BM25Okapi(tokenized_corpus)
When hybrid search pays off the most: Queries containing exact terminology, product names, or error codes. BM25 crushes semantic search on exact matches, while semantic search handles fuzzy queries and synonyms better. They complement each other.
LlamaIndex supports hybrid search natively -- configure a VectorIndex + BM25 retriever and set RRF fusion parameters. Lantern, as a PostgreSQL vector extension, also supports dense + sparse joint queries at the SQL level.
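A minimal sketch of the LlamaIndex side, assuming a 0.10+ install with the llama-index-retrievers-bm25 package and an OpenAI key for the default embedding model; class names and defaults can shift between releases:
from llama_index.core import Document, VectorStoreIndex
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.retrievers.bm25 import BM25Retriever

# Build a vector index; with the default in-memory store, nodes also land in the docstore
documents = [Document(text=t) for t in ["chunk one about authentication ...", "chunk two about rate limits ..."]]
index = VectorStoreIndex.from_documents(documents)

vector_retriever = index.as_retriever(similarity_top_k=10)
bm25_retriever = BM25Retriever.from_defaults(docstore=index.docstore, similarity_top_k=10)

hybrid_retriever = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    mode="reciprocal_rerank",  # RRF fusion of the two rankings
    similarity_top_k=10,
    num_queries=1,             # disable LlamaIndex's built-in query expansion
)
nodes = hybrid_retriever.retrieve("error code 429 rate limit")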
Reranking Pipeline
After the initial retrieval stage (pure vector or hybrid) returns the top-k candidates (say, 20), a cross-encoder model re-scores them and keeps the most relevant top-n (say, 5).
from sentence_transformers import CrossEncoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
def rerank(query: str, chunks: list[str], top_n: int = 5) -> list[dict]:
"""Re-rank retrieval results with a cross-encoder"""
pairs = [(query, chunk) for chunk in chunks]
scores = reranker.predict(pairs)
ranked = sorted(
[{"text": chunk, "score": float(score)} for chunk, score in zip(chunks, scores)],
key=lambda x: x["score"],
reverse=True,
)
return ranked[:top_n]
initial_results = ["chunk1 text...", "chunk2 text...", "chunk3 text..."]
reranked = rerank("user query here", initial_results, top_n=3)
for r in reranked:
print(f"Score: {r['score']:.4f} | {r['text'][:80]}")
Key insight: Reranking is the single highest-ROI retrieval optimization. A small cross-encoder (MiniLM-level) typically delivers 10-20% retrieval precision improvement at the cost of 10-50ms additional latency.
Query Expansion
User queries are often too short or too vague. Use an LLM to generate sub-queries from different angles, retrieve for each, and merge results.
from openai import OpenAI
client = OpenAI()
def expand_query(original_query: str, n_sub_queries: int = 3) -> list[str]:
"""Generate sub-queries from different angles for better retrieval coverage"""
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{
"role": "system",
"content": (
"You are a query expansion assistant. Given the user's original query, "
f"generate {n_sub_queries} sub-queries from different angles to improve "
"retrieval coverage. One sub-query per line, no numbering, no explanation."
),
},
{"role": "user", "content": original_query},
],
temperature=0.3,
max_tokens=200,
)
sub_queries = response.choices[0].message.content.strip().split("\n")
return [original_query] + [q.strip() for q in sub_queries if q.strip()]
expanded = expand_query("how to optimize RAG retrieval")
for q in expanded:
print(f" -> {q}")
Query expansion works best when users tend to ask short, vague questions. But be aware: each sub-query triggers a separate retrieval, so latency and cost multiply by N. In production, typically limit to 3-5 sub-queries.
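To close the loop, here is a minimal sketch of the merge step. It assumes the expand_query function above and any retrieval function that returns a ranked list of (chunk index, score) pairs, such as the hybrid_search function from earlier:
def multi_query_retrieve(
    query: str,
    retrieve_fn,              # callable: query -> ranked list of (chunk_idx, score), best first
    top_k: int = 10,
    rrf_k: int = 60,
) -> list[tuple[int, float]]:
    """Retrieve once per expanded sub-query, then fuse the rankings with RRF."""
    fused: dict[int, float] = {}
    for sub_query in expand_query(query):
        for rank, (idx, _score) in enumerate(retrieve_fn(sub_query)):
            fused[idx] = fused.get(idx, 0.0) + 1.0 / (rrf_k + rank)
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)[:top_k]

# Example wiring with the earlier hybrid_search pieces (chunks, dense embeddings, BM25 index)
# results = multi_query_retrieve(
#     "how to optimize RAG retrieval",
#     retrieve_fn=lambda q: hybrid_search(q, chunks, dense_embeddings, bm25_index, embeddings),
# )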
Production Pipeline: Putting It All Together
Here is a complete, runnable chunking + retrieval pipeline combining structure-aware splitting, hybrid search, and reranking.
"""
Production-grade RAG retrieval pipeline
Dependencies: pip install langchain langchain-openai rank-bm25 sentence-transformers
"""
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder
from langchain.text_splitter import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
# --- Configuration ---
CHUNK_SIZE = 800
CHUNK_OVERLAP = 100
INITIAL_RETRIEVAL_K = 20
FINAL_TOP_N = 5
RERANK_MODEL = "cross-encoder/ms-marco-MiniLM-L-6-v2"
# --- Step 1: Structure-aware chunking ---
def chunk_documents(documents: list[str]) -> list[dict]:
"""Split by Markdown structure, preserve metadata"""
header_splitter = MarkdownHeaderTextSplitter(
headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
)
sub_splitter = RecursiveCharacterTextSplitter(
chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP
)
all_chunks = []
for doc in documents:
md_chunks = header_splitter.split_text(doc)
for chunk in md_chunks:
if len(chunk.page_content) > CHUNK_SIZE:
sub_chunks = sub_splitter.split_documents([chunk])
all_chunks.extend(sub_chunks)
else:
all_chunks.append(chunk)
return [
{"text": c.page_content, "metadata": c.metadata}
for c in all_chunks
]
# --- Step 2: Build Embedding + BM25 index ---
def build_index(chunks: list[dict]):
"""Build dense and sparse indexes simultaneously"""
texts = [c["text"] for c in chunks]
embedder = OpenAIEmbeddings(model="text-embedding-3-small")
dense_embeddings = np.array(embedder.embed_documents(texts))
tokenized = [text.lower().split() for text in texts]
bm25 = BM25Okapi(tokenized)
return dense_embeddings, bm25, embedder
# --- Step 3: Hybrid search + Reranking ---
def retrieve(
query: str,
chunks: list[dict],
dense_embeddings: np.ndarray,
bm25: BM25Okapi,
embedder: OpenAIEmbeddings,
) -> list[dict]:
"""Full pipeline: hybrid search -> reranking -> return top-n"""
texts = [c["text"] for c in chunks]
# Dense retrieval
query_emb = np.array(embedder.embed_query(query)).reshape(1, -1)
dense_scores = np.dot(dense_embeddings, query_emb.T).flatten()
dense_ranks = np.argsort(-dense_scores)
# Sparse retrieval
sparse_scores = bm25.get_scores(query.lower().split())
sparse_ranks = np.argsort(-sparse_scores)
# RRF fusion
rrf_scores = {}
for rank, idx in enumerate(dense_ranks[:INITIAL_RETRIEVAL_K]):
rrf_scores[idx] = rrf_scores.get(idx, 0) + 0.5 / (1 + rank)
for rank, idx in enumerate(sparse_ranks[:INITIAL_RETRIEVAL_K]):
rrf_scores[idx] = rrf_scores.get(idx, 0) + 0.5 / (1 + rank)
# Take top-k after fusion
candidates = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)
candidate_texts = [texts[idx] for idx, _ in candidates[:INITIAL_RETRIEVAL_K]]
# Cross-encoder reranking
reranker = CrossEncoder(RERANK_MODEL)
pairs = [(query, text) for text in candidate_texts]
rerank_scores = reranker.predict(pairs)
results = []
for (idx, _), score in zip(candidates[:INITIAL_RETRIEVAL_K], rerank_scores):
results.append((idx, float(score)))
results.sort(key=lambda x: x[1], reverse=True)
return [
{**chunks[idx], "relevance_score": score}
for idx, score in results[:FINAL_TOP_N]
]
# --- Run ---
if __name__ == "__main__":
sample_docs = [
"# RAG Optimization Guide\n\n## Overview\n\nRAG combines information retrieval with text generation.\n\n## Chunking Strategies\n\nChoosing the right chunking strategy is critical for RAG quality.\n\n### Fixed-Size Splitting\n\nThe simplest approach, splitting by fixed token count.\n\n### Semantic Chunking\n\nSplit by semantic similarity to ensure coherent chunks.",
"# Vector Database Comparison\n\n## Overview\n\nComparing mainstream vector databases on performance and features.\n\n## Pinecone\n\nFully managed vector database service.\n\n## Milvus\n\nOpen-source high-performance vector database.",
]
chunks = chunk_documents(sample_docs)
print(f"Split into {len(chunks)} chunks")
dense_emb, bm25_idx, embedder = build_index(chunks)
print("Index built")
results = retrieve("how to split documents for RAG", chunks, dense_emb, bm25_idx, embedder)
for r in results:
print(f"\n[Score: {r['relevance_score']:.4f}] Metadata: {r['metadata']}")
print(f" {r['text'][:120]}...")
Decision Framework: Which Strategy for Which Document Type
| Document Type | Recommended Chunking | Why | Retrieval Method |
|---|---|---|---|
| API docs / technical manuals | Structure-aware (headings + code block boundaries) | Naturally splits by function/class/module, metadata enables filtering | Hybrid + Reranking |
| Legal contracts / clause documents | Recursive character + clause number regex | Clauses are logically independent but vary in length | Dense + Reranking |
| Research papers / reports | Semantic chunking | Argumentation is coherent; semantic boundaries are more accurate than structural ones | Hybrid |
| FAQ / Q&A pairs | Split per Q&A pair, never split a pair | Each Q&A is naturally one retrieval unit | BM25 alone is fine |
| Logs / unstructured text | Fixed-size + overlap | No structure to exploit; fixed splitting is the only option | BM25 + keyword filtering |
| Tabular data | Split by row, keep headers as metadata | Table rows are natural split points; headers provide semantics | Dense + metadata filtering |
| Mixed-format documents | Structure-aware + special handling for tables/code | Different sections need different strategies | Hybrid + Reranking |
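As a concrete instance of the FAQ row above, a minimal sketch that keeps every Q&A pair intact -- it assumes a plain-text FAQ where each question line starts with a "Q:" prefix and the answer follows on "A:" lines:
import re

def split_faq(faq_text: str) -> list[dict]:
    """Split a plain-text FAQ so each chunk is exactly one question-answer pair."""
    # Split immediately before every line that starts a new question
    pairs = re.split(r"\n(?=Q:)", faq_text.strip())
    chunks = []
    for pair in pairs:
        question = pair.splitlines()[0].removeprefix("Q:").strip()
        chunks.append({"text": pair.strip(), "metadata": {"question": question}})
    return chunks

faq = "Q: How do I reset my password?\nA: Use the reset link on the login page.\n\nQ: How do I delete my account?\nA: Contact support from your registered email."
for c in split_faq(faq):
    print(f"{c['metadata']['question']} -> {len(c['text'])} chars")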
General advice: If you're unsure about your document type, start with recursive character splitting + hybrid search. This is the safest baseline. Then analyze retrieval quality on failing cases and optimize incrementally.
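Analyzing those failing cases is much easier with a small evaluation harness. A minimal sketch, assuming you have a handful of labeled (query, relevant chunk id) pairs and a retriever that returns ranked chunk ids:
def evaluate_retrieval(
    labeled_queries: list[tuple[str, int]],   # (query, id of the chunk that answers it)
    retrieve_fn,                              # callable: query -> ranked list of chunk ids
    k: int = 5,
) -> dict[str, float]:
    """Compute hit rate@k and MRR@k over a small labeled set."""
    hits = 0
    reciprocal_ranks = []
    for query, relevant_id in labeled_queries:
        ranked_ids = list(retrieve_fn(query))[:k]
        if relevant_id in ranked_ids:
            hits += 1
            reciprocal_ranks.append(1.0 / (ranked_ids.index(relevant_id) + 1))
        else:
            reciprocal_ranks.append(0.0)
    return {
        "hit_rate@k": hits / len(labeled_queries),
        "mrr@k": sum(reciprocal_ranks) / len(labeled_queries),
    }

# Re-run this after every chunking or retrieval change to see whether the numbers actually move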
The AutoRAG project offers automated RAG parameter search, helping you test different chunking parameters and retrieval strategy combinations automatically -- saving manual tuning time.
Three Production Pitfalls
Pitfall 1: Ignoring Chunk Metadata
Metadata attached to each chunk after splitting (source document, section title, page number) is the most overlooked retrieval enhancement.
# Wrong: only store text + embedding
vector_store.add(text=chunk, embedding=emb)
# Right: store and leverage metadata
vector_store.add(
text=chunk,
embedding=emb,
metadata={
"source": "api-docs",
"section": "authentication",
"doc_type": "api_reference",
}
)
# Use metadata filtering during retrieval
results = vector_store.search(
query="how to authenticate",
filter={"doc_type": "api_reference"}, # only search API docs
top_k=10,
)
Metadata filtering can dramatically narrow the search scope before vector similarity is even computed, improving both precision and latency. Pathway's llm-app framework handles this well -- its data pipeline natively supports attaching rich metadata at index time.
Pitfall 2: Reranker Becomes a Latency Bottleneck
Cross-encoder reranking scores every (query, candidate) pair. Run one pair at a time, 20 candidates x 10ms each adds 200ms of latency. In systems targeting P95 < 500ms, that's nearly half your budget.
Solutions:
- Reduce initial retrieval top-k (from 30 to 15)
- Use a smaller reranker model (MiniLM instead of large)
- Batch inference for the reranker (see the sketch after this list)
- Consider lightweight LLM reranking (output rank only, no text generation)
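For the batching option, CrossEncoder.predict in sentence-transformers takes a batch_size argument, so scoring runs in a few forward passes instead of one per pair. A minimal sketch with illustrative inputs:
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", max_length=512)

query = "user query here"
candidate_chunks = [f"candidate chunk {i} ..." for i in range(20)]

# One batched call instead of 20 sequential ones; tune batch_size to your hardware
pairs = [(query, chunk) for chunk in candidate_chunks]
scores = reranker.predict(pairs, batch_size=32, show_progress_bar=False)
print(scores[:5])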
Pitfall 3: Embedding Model Mismatch with Query Distribution
You indexed documents with a multilingual embedding model, but user queries mix languages (for example, English terms like "RAG" or "chunking" embedded in an otherwise non-English sentence). Many embedding models degrade on mixed-language input.
Solutions:
- Generate bilingual chunks at index time (LLM-translate, then embed each separately)
- Generate bilingual sub-queries during query expansion
- Choose embedding models that explicitly support cross-lingual retrieval (e.g., multilingual-e5, BGE-M3); see the sketch below
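A minimal sketch of the last option using multilingual-e5 via sentence-transformers; the model name and the "query: " / "passage: " prefix convention are the assumptions here:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-small")

# multilingual-e5 models are trained with instruction prefixes on both sides
query_emb = model.encode(["query: RAG chunking strategy"], normalize_embeddings=True)
passage_embs = model.encode(
    [
        "passage: Semantic chunking splits text where sentence similarity drops.",
        "passage: BM25 scores documents by term frequency and inverse document frequency.",
    ],
    normalize_embeddings=True,
)
scores = (query_emb @ passage_embs.T)[0]  # cosine similarity, since embeddings are normalized
print(scores)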
Summary
- Chunking caps your retrieval quality. Time spent choosing the right chunking strategy has higher ROI than time spent swapping embedding models. Start with structure-aware splitting and adjust from there.
- Hybrid search is the production standard. Pure vector search loses to BM25 on exact matches. Pure BM25 loses to vectors on semantic understanding. Fusing both is the most robust approach.
- Reranking is the highest-ROI single optimization. Adding a cross-encoder reranker typically improves retrieval precision by 10-20% at the cost of 10-50ms latency.
- Metadata is not optional. Attach source, section, and type metadata at index time. Use metadata filtering during retrieval. The improvement is immediate and measurable.
- There is no universal chunking strategy. API docs, research papers, FAQs, and legal contracts require different splitting approaches. Choose based on your document type, not based on what some tutorial recommends.
Projects in this article
Haystack (25.2k ⭐) -- an enterprise-grade framework for RAG and search applications, covering document processing, retrieval, generation, and evaluation end to end.
LlamaIndex (49.3k ⭐) -- a data framework that provides the data connection layer for LLM applications, with strong RAG capabilities across diverse data sources and vector databases.
EmbedAnything (1.2k ⭐) -- a highly performant, modular, and memory-safe embedding inference and indexing framework built in Rust, providing production-ready RAG ingestion and indexing pipelines for local and cloud deployment.
AutoRAG (4.8k ⭐) -- an open-source RAG evaluation and optimization framework using AutoML-style automation to help developers automatically find the best RAG pipeline configurations and benchmark them.
Lantern (881 ⭐) -- a PostgreSQL vector database extension for building AI applications, adding high-performance vector search capabilities to PostgreSQL with support for generating and indexing embeddings directly in the database.
Pathway LLM App (59.8k ⭐) -- ready-to-run cloud templates for RAG, AI pipelines, and enterprise search with live data, always in sync with SharePoint, Google Drive, S3, Kafka, and more.