Advanced RAG: Chunking Strategies and Retrieval Optimization Trade-offs
Most RAG pipelines fail at retrieval, not generation. This article covers five chunking strategies, hybrid search, reranking pipelines, and a production-ready decision framework.
You built a RAG pipeline. You chunked your documents, generated embeddings, stored them in a vector database, and wired up a top-k retriever to feed context into your LLM. The problem: answer quality is inconsistent. Sometimes spot-on, sometimes completely off-topic. You swapped embedding models, upgraded to a larger LLM, and the improvement was marginal.
The problem is almost certainly not at the generation step. It's at retrieval. And retrieval quality is capped the moment you choose how to split your documents.
Why Retrieval Is the Bottleneck
RAG quality follows a dependency chain: chunking -> embedding -> retrieval -> reranking -> generation. If any link in this chain is weak, every downstream step is compensating for damage that's already done. In practice, the large majority of retrieval quality issues trace back to the chunking strategy and retrieval method you chose.
Here's a concrete example. Say your knowledge base contains API documentation where a single passage describes a function's parameters and return values. With fixed 512-token chunking, the parameter description and return value description might end up in separate chunks. A user asks "what does this function return?" The embedding retrieval hits the parameter chunk but misses the return value information.
This isn't an embedding model problem. It isn't a vector database problem. It's a chunking problem -- the semantic completeness of the original text was destroyed.
Five Chunking Strategies, Compared
1. Fixed-Size with Overlap
The simplest approach: split by character count or token count, with overlap between adjacent chunks.
from langchain.text_splitter import CharacterTextSplitter
splitter = CharacterTextSplitter(
separator="\n\n",
chunk_size=1000,
chunk_overlap=200,
length_function=len,
)
chunks = splitter.split_text(your_document)
print(f"Generated {len(chunks)} chunks, avg length {sum(len(c) for c in chunks) / len(chunks):.0f}")
Good for: Plain text corpora, log files, documents with no discernible structure. Use as a baseline to verify your pipeline works end-to-end.
Bad for: Structured documents (API docs, legal contracts, technical specs), documents with tables or code blocks. Fixed splitting will mercilessly cut through table rows, code blocks, and logical paragraph boundaries.
2. Recursive Character Splitting
The recommended default in LangChain, and the same idea behind LlamaIndex's default SentenceSplitter: split recursively by a prioritized list of separators -- double newlines first, then single newlines, then sentence-ending punctuation, and so on.
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
separators=["\n\n", "\n", ". ", "! ", "? ", " ", ""],
chunk_size=1000,
chunk_overlap=100,
)
chunks = splitter.split_text(your_document)
Why it's better than fixed-size: It respects natural language boundaries (paragraphs > sentences > words). In most cases, it won't split a sentence in half.
Why it's still limited: It doesn't understand document structure. In a Markdown file, content under ## API Reference and content under ## Getting Started might end up in the same chunk -- which is noise for retrieval purposes.
3. Semantic Chunking
Instead of splitting by size, split by semantic similarity: compute embeddings for adjacent sentences, and cut when similarity drops below a threshold.
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
splitter = SemanticChunker(
embeddings=embeddings,
breakpoint_threshold_type="percentile",
breakpoint_threshold_amount=75,
)
chunks = splitter.split_text(your_document)
for i, chunk in enumerate(chunks):
print(f"Chunk {i}: {len(chunk)} characters")
Strength: Each chunk is semantically coherent. Retrieved chunks are more likely to fully answer the user's question.
Cost: You need to embed every sentence, making this 5-10x slower. Suitable for offline preprocessing, not for real-time splitting. Threshold selection significantly affects results -- too low and chunks are fragmented, too high and they're unwieldy.
In practice, semantic chunking works best for long-form literature (papers, reports) and Q&A datasets. For structured documents, there are better options.
4. Structure-Aware Splitting
Uses the document's own structural markers (Markdown headings, HTML tags, code block boundaries, table rows) as split points. This is the approach most often recommended for production use today.
from langchain.text_splitter import MarkdownHeaderTextSplitter
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Step 1: Split by Markdown heading hierarchy
headers_to_split_on = [
("#", "h1"),
("##", "h2"),
("###", "h3"),
]
md_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_chunks = md_splitter.split_text(markdown_document)
# Step 2: For oversized sections, recursively split further
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=800,
chunk_overlap=100,
)
final_chunks = text_splitter.split_documents(md_chunks)
# Each chunk automatically carries heading metadata
for chunk in final_chunks[:3]:
print(f"Metadata: {chunk.metadata}")
print(f"Content: {chunk.page_content[:100]}...")
print("---")
For code documentation, use a language-aware splitter:
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter
python_splitter = RecursiveCharacterTextSplitter.from_language(
language=Language.PYTHON,
chunk_size=1500,
chunk_overlap=200,
)
code_chunks = python_splitter.split_text(python_source_code)
Key advantage: Chunks carry structural metadata (which section, which function) that can be used for filtering and boosting during retrieval. None of the other splitting methods above provides this out of the box.
Haystack excels here -- its PreProcessor component natively supports splitting at the paragraph or sentence level while automatically preserving metadata.
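As a rough sketch, assuming the Haystack 1.x PreProcessor API (in Haystack 2.x the equivalent component is DocumentSplitter, with different parameter names); the splitting values here are illustrative:
from haystack import Document
from haystack.nodes import PreProcessor

# Sketch only: parameters follow the Haystack 1.x PreProcessor; the values are assumptions
preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    split_by="word",
    split_length=200,
    split_overlap=20,
    split_respect_sentence_boundary=True,
)
# markdown_document is the same source text used in the splitter examples above
processed = preprocessor.process(
    [Document(content=markdown_document, meta={"source": "api-docs"})]
)
print(f"Haystack produced {len(processed)} chunks, metadata preserved: {processed[0].meta}")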
5. Late Chunking
An emerging pattern from 2024: run the entire document through a long-context embedding model first (Jina Embeddings v3, for example, accepts 8192 tokens of input), then pool the resulting token embeddings into chunk embeddings. Each chunk vector is therefore computed with the whole document in context.
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np
tokenizer = AutoTokenizer.from_pretrained("jinaai/jina-embeddings-v3")
model = AutoModel.from_pretrained("jinaai/jina-embeddings-v3", trust_remote_code=True)
def late_chunking(document: str, chunk_size: int = 500) -> list[dict]:
"""Embed the full document first, then chunk in token embedding space"""
inputs = tokenizer(document, return_tensors="pt", truncation=True, max_length=8192)
with torch.no_grad():
outputs = model(**inputs)
# Per-token embeddings
token_embeddings = outputs.last_hidden_state.squeeze(0) # [seq_len, hidden_dim]
# Group and mean-pool in token embedding space
tokens = inputs["input_ids"].squeeze(0)
chunk_embeddings = []
for i in range(0, len(tokens), chunk_size):
chunk_emb = token_embeddings[i : i + chunk_size].mean(dim=0)
chunk_text = tokenizer.decode(tokens[i : i + chunk_size], skip_special_tokens=True)
chunk_embeddings.append({
"text": chunk_text,
"embedding": chunk_emb.numpy(),
})
return chunk_embeddings
chunks = late_chunking(your_document, chunk_size=500)
print(f"Generated {len(chunks)} late-chunked embeddings")
Theoretical advantage: Chunk embeddings retain full-document context, solving the fundamental "context loss after splitting" problem.
Practical limitations: Requires a long-context embedding model, higher GPU memory, and higher inference latency. Currently best suited for small-scale, high-precision use cases. The EmbedAnything project provides efficient embedding pipelines that can serve as infrastructure for this approach.
Retrieval Optimization: Hybrid Search and Reranking
Choosing the right chunking strategy solves "what to retrieve." But "how to retrieve" matters equally. Pure vector similarity search sometimes loses to simple keyword matching.
Hybrid Search: Dense + Sparse
Hybrid search combines semantic retrieval (dense) with keyword retrieval (sparse / BM25), merging results via Reciprocal Rank Fusion (RRF).
from rank_bm25 import BM25Okapi
import numpy as np
def hybrid_search(
query: str,
chunks: list[str],
dense_embeddings: np.ndarray, # pre-computed chunk embeddings
bm25: BM25Okapi,
embedding_model,
top_k: int = 10,
alpha: float = 0.5, # weight for dense, 1-alpha for sparse
) -> list[tuple[int, float]]:
"""Hybrid search: semantic + keyword, fused with RRF"""
# Dense retrieval
query_emb = np.array(embedding_model.embed_query(query)).reshape(1, -1)
dense_scores = np.dot(dense_embeddings, query_emb.T).flatten()
dense_ranks = np.argsort(-dense_scores)
# Sparse retrieval (BM25)
tokenized_query = query.lower().split()
sparse_scores = bm25.get_scores(tokenized_query)
sparse_ranks = np.argsort(-sparse_scores)
# Reciprocal Rank Fusion
rrf_scores = {}
for rank, idx in enumerate(dense_ranks):
rrf_scores[idx] = rrf_scores.get(idx, 0) + alpha / (1 + rank)
for rank, idx in enumerate(sparse_ranks):
rrf_scores[idx] = rrf_scores.get(idx, 0) + (1 - alpha) / (1 + rank)
sorted_results = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)
return sorted_results[:top_k]
# Preprocessing
tokenized_corpus = [chunk.lower().split() for chunk in chunks]
bm25_index = BM25Okapi(tokenized_corpus)
When hybrid search pays off the most: Queries containing exact terminology, product names, or error codes. BM25 crushes semantic search on exact matches, while semantic search handles fuzzy queries and synonyms better. They complement each other.
LlamaIndex supports hybrid search natively -- configure a VectorIndex + BM25 retriever and set RRF fusion parameters. Lantern, as a PostgreSQL vector extension, also supports dense + sparse joint queries at the SQL level.
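A minimal sketch of the LlamaIndex side, assuming a 0.10+ install with the llama-index-retrievers-bm25 package and an OpenAI key for the default embedding model; class names and defaults can shift between releases:
from llama_index.core import Document, VectorStoreIndex
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.retrievers.bm25 import BM25Retriever

# Build a vector index; with the default in-memory store, nodes also land in the docstore
documents = [Document(text=t) for t in ["chunk one about authentication ...", "chunk two about rate limits ..."]]
index = VectorStoreIndex.from_documents(documents)

vector_retriever = index.as_retriever(similarity_top_k=10)
bm25_retriever = BM25Retriever.from_defaults(docstore=index.docstore, similarity_top_k=10)

hybrid_retriever = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    mode="reciprocal_rerank",  # RRF fusion of the two rankings
    similarity_top_k=10,
    num_queries=1,             # disable LlamaIndex's built-in query expansion
)
nodes = hybrid_retriever.retrieve("error code 429 rate limit")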
Reranking Pipeline
After the initial retrieval stage (pure vector or hybrid) returns the top-k candidates (say, 20), a cross-encoder model re-scores them and keeps the most relevant top-n (say, 5).
from sentence_transformers import CrossEncoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
def rerank(query: str, chunks: list[str], top_n: int = 5) -> list[dict]:
"""Re-rank retrieval results with a cross-encoder"""
pairs = [(query, chunk) for chunk in chunks]
scores = reranker.predict(pairs)
ranked = sorted(
[{"text": chunk, "score": float(score)} for chunk, score in zip(chunks, scores)],
key=lambda x: x["score"],
reverse=True,
)
return ranked[:top_n]
initial_results = ["chunk1 text...", "chunk2 text...", "chunk3 text..."]
reranked = rerank("user query here", initial_results, top_n=3)
for r in reranked:
print(f"Score: {r['score']:.4f} | {r['text'][:80]}")
Key insight: Reranking is the single highest-ROI retrieval optimization. A small cross-encoder (MiniLM-level) typically delivers 10-20% retrieval precision improvement at the cost of 10-50ms additional latency.
Query Expansion
User queries are often too short or too vague. Use an LLM to generate sub-queries from different angles, retrieve for each, and merge results.
from openai import OpenAI
client = OpenAI()
def expand_query(original_query: str, n_sub_queries: int = 3) -> list[str]:
"""Generate sub-queries from different angles for better retrieval coverage"""
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{
"role": "system",
"content": (
"You are a query expansion assistant. Given the user's original query, "
f"generate {n_sub_queries} sub-queries from different angles to improve "
"retrieval coverage. One sub-query per line, no numbering, no explanation."
),
},
{"role": "user", "content": original_query},
],
temperature=0.3,
max_tokens=200,
)
sub_queries = response.choices[0].message.content.strip().split("\n")
return [original_query] + [q.strip() for q in sub_queries if q.strip()]
expanded = expand_query("how to optimize RAG retrieval")
for q in expanded:
print(f" -> {q}")
Query expansion works best when users tend to ask short, vague questions. But be aware: each sub-query triggers a separate retrieval, so latency and cost multiply by N. In production, typically limit to 3-5 sub-queries.
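To close the loop, here is a minimal sketch of the merge step. It assumes the expand_query function above and any retrieval function that returns a ranked list of (chunk index, score) pairs, such as the hybrid_search function from earlier:
def multi_query_retrieve(
    query: str,
    retrieve_fn,              # callable: query -> ranked list of (chunk_idx, score), best first
    top_k: int = 10,
    rrf_k: int = 60,
) -> list[tuple[int, float]]:
    """Retrieve once per expanded sub-query, then fuse the rankings with RRF."""
    fused: dict[int, float] = {}
    for sub_query in expand_query(query):
        for rank, (idx, _score) in enumerate(retrieve_fn(sub_query)):
            fused[idx] = fused.get(idx, 0.0) + 1.0 / (rrf_k + rank)
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)[:top_k]

# Example wiring with the earlier hybrid_search pieces (chunks, dense embeddings, BM25 index)
# results = multi_query_retrieve(
#     "how to optimize RAG retrieval",
#     retrieve_fn=lambda q: hybrid_search(q, chunks, dense_embeddings, bm25_index, embeddings),
# )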
Production Pipeline: Putting It All Together
Here is a complete, runnable chunking + retrieval pipeline combining structure-aware splitting, hybrid search, and reranking.
"""
Production-grade RAG retrieval pipeline
Dependencies: pip install langchain langchain-openai rank-bm25 sentence-transformers
"""
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder
from langchain.text_splitter import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
# --- Configuration ---
CHUNK_SIZE = 800
CHUNK_OVERLAP = 100
INITIAL_RETRIEVAL_K = 20
FINAL_TOP_N = 5
RERANK_MODEL = "cross-encoder/ms-marco-MiniLM-L-6-v2"
# --- Step 1: Structure-aware chunking ---
def chunk_documents(documents: list[str]) -> list[dict]:
"""Split by Markdown structure, preserve metadata"""
header_splitter = MarkdownHeaderTextSplitter(
headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
)
sub_splitter = RecursiveCharacterTextSplitter(
chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP
)
all_chunks = []
for doc in documents:
md_chunks = header_splitter.split_text(doc)
for chunk in md_chunks:
if len(chunk.page_content) > CHUNK_SIZE:
sub_chunks = sub_splitter.split_documents([chunk])
all_chunks.extend(sub_chunks)
else:
all_chunks.append(chunk)
return [
{"text": c.page_content, "metadata": c.metadata}
for c in all_chunks
]
# --- Step 2: Build Embedding + BM25 index ---
def build_index(chunks: list[dict]):
"""Build dense and sparse indexes simultaneously"""
texts = [c["text"] for c in chunks]
embedder = OpenAIEmbeddings(model="text-embedding-3-small")
dense_embeddings = np.array(embedder.embed_documents(texts))
tokenized = [text.lower().split() for text in texts]
bm25 = BM25Okapi(tokenized)
return dense_embeddings, bm25, embedder
# --- Step 3: Hybrid search + Reranking ---
def retrieve(
query: str,
chunks: list[dict],
dense_embeddings: np.ndarray,
bm25: BM25Okapi,
embedder: OpenAIEmbeddings,
) -> list[dict]:
"""Full pipeline: hybrid search -> reranking -> return top-n"""
texts = [c["text"] for c in chunks]
# Dense retrieval
query_emb = np.array(embedder.embed_query(query)).reshape(1, -1)
dense_scores = np.dot(dense_embeddings, query_emb.T).flatten()
dense_ranks = np.argsort(-dense_scores)
# Sparse retrieval
sparse_scores = bm25.get_scores(query.lower().split())
sparse_ranks = np.argsort(-sparse_scores)
# RRF fusion
rrf_scores = {}
for rank, idx in enumerate(dense_ranks[:INITIAL_RETRIEVAL_K]):
rrf_scores[idx] = rrf_scores.get(idx, 0) + 0.5 / (1 + rank)
for rank, idx in enumerate(sparse_ranks[:INITIAL_RETRIEVAL_K]):
rrf_scores[idx] = rrf_scores.get(idx, 0) + 0.5 / (1 + rank)
# Take top-k after fusion
candidates = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)
candidate_texts = [texts[idx] for idx, _ in candidates[:INITIAL_RETRIEVAL_K]]
# Cross-encoder reranking
reranker = CrossEncoder(RERANK_MODEL)
pairs = [(query, text) for text in candidate_texts]
rerank_scores = reranker.predict(pairs)
results = []
for (idx, _), score in zip(candidates[:INITIAL_RETRIEVAL_K], rerank_scores):
results.append((idx, float(score)))
results.sort(key=lambda x: x[1], reverse=True)
return [
{**chunks[idx], "relevance_score": score}
for idx, score in results[:FINAL_TOP_N]
]
# --- Run ---
if __name__ == "__main__":
sample_docs = [
"# RAG Optimization Guide\n\n## Overview\n\nRAG combines information retrieval with text generation.\n\n## Chunking Strategies\n\nChoosing the right chunking strategy is critical for RAG quality.\n\n### Fixed-Size Splitting\n\nThe simplest approach, splitting by fixed token count.\n\n### Semantic Chunking\n\nSplit by semantic similarity to ensure coherent chunks.",
"# Vector Database Comparison\n\n## Overview\n\nComparing mainstream vector databases on performance and features.\n\n## Pinecone\n\nFully managed vector database service.\n\n## Milvus\n\nOpen-source high-performance vector database.",
]
chunks = chunk_documents(sample_docs)
print(f"Split into {len(chunks)} chunks")
dense_emb, bm25_idx, embedder = build_index(chunks)
print("Index built")
results = retrieve("how to split documents for RAG", chunks, dense_emb, bm25_idx, embedder)
for r in results:
print(f"\n[Score: {r['relevance_score']:.4f}] Metadata: {r['metadata']}")
print(f" {r['text'][:120]}...")
Decision Framework: Which Strategy for Which Document Type
| Document Type | Recommended Chunking | Why | Retrieval Method |
|---|---|---|---|
| API docs / technical manuals | Structure-aware (headings + code block boundaries) | Naturally splits by function/class/module, metadata enables filtering | Hybrid + Reranking |
| Legal contracts / clause documents | Recursive character + clause number regex | Clauses are logically independent but vary in length | Dense + Reranking |
| Research papers / reports | Semantic chunking | Argumentation is coherent; semantic boundaries are more accurate than structural ones | Hybrid |
| FAQ / Q&A pairs | Split per Q&A pair, never split a pair | Each Q&A is naturally one retrieval unit | BM25 alone is fine |
| Logs / unstructured text | Fixed-size + overlap | No structure to exploit; fixed splitting is the only option | BM25 + keyword filtering |
| Tabular data | Split by row, keep headers as metadata | Table rows are natural split points; headers provide semantics | Dense + metadata filtering |
| Mixed-format documents | Structure-aware + special handling for tables/code | Different sections need different strategies | Hybrid + Reranking |
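As a concrete instance of the FAQ row above, a minimal sketch that keeps every Q&A pair intact -- it assumes a plain-text FAQ where each question line starts with a "Q:" prefix and the answer follows on "A:" lines:
import re

def split_faq(faq_text: str) -> list[dict]:
    """Split a plain-text FAQ so each chunk is exactly one question-answer pair."""
    # Split immediately before every line that starts a new question
    pairs = re.split(r"\n(?=Q:)", faq_text.strip())
    chunks = []
    for pair in pairs:
        question = pair.splitlines()[0].removeprefix("Q:").strip()
        chunks.append({"text": pair.strip(), "metadata": {"question": question}})
    return chunks

faq = "Q: How do I reset my password?\nA: Use the reset link on the login page.\n\nQ: How do I delete my account?\nA: Contact support from your registered email."
for c in split_faq(faq):
    print(f"{c['metadata']['question']} -> {len(c['text'])} chars")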
General advice: If you're unsure about your document type, start with recursive character splitting + hybrid search. This is the safest baseline. Then analyze retrieval quality on failing cases and optimize incrementally.
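Analyzing those failing cases is much easier with a small evaluation harness. A minimal sketch, assuming you have a handful of labeled (query, relevant chunk id) pairs and a retriever that returns ranked chunk ids:
def evaluate_retrieval(
    labeled_queries: list[tuple[str, int]],   # (query, id of the chunk that answers it)
    retrieve_fn,                              # callable: query -> ranked list of chunk ids
    k: int = 5,
) -> dict[str, float]:
    """Compute hit rate@k and MRR@k over a small labeled set."""
    hits = 0
    reciprocal_ranks = []
    for query, relevant_id in labeled_queries:
        ranked_ids = list(retrieve_fn(query))[:k]
        if relevant_id in ranked_ids:
            hits += 1
            reciprocal_ranks.append(1.0 / (ranked_ids.index(relevant_id) + 1))
        else:
            reciprocal_ranks.append(0.0)
    return {
        "hit_rate@k": hits / len(labeled_queries),
        "mrr@k": sum(reciprocal_ranks) / len(labeled_queries),
    }

# Re-run this after every chunking or retrieval change to see whether the numbers actually move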
The AutoRAG project offers automated RAG parameter search, helping you test different chunking parameters and retrieval strategy combinations automatically -- saving manual tuning time.
Three Production Pitfalls
Pitfall 1: Ignoring Chunk Metadata
Metadata attached to each chunk after splitting (source document, section title, page number) is the most overlooked retrieval enhancement.
# Wrong: only store text + embedding
vector_store.add(text=chunk, embedding=emb)
# Right: store and leverage metadata
vector_store.add(
text=chunk,
embedding=emb,
metadata={
"source": "api-docs",
"section": "authentication",
"doc_type": "api_reference",
}
)
# Use metadata filtering during retrieval
results = vector_store.search(
query="how to authenticate",
filter={"doc_type": "api_reference"}, # only search API docs
top_k=10,
)
Metadata filtering can dramatically narrow the search scope before vector similarity is even computed, improving both precision and latency. Pathway's llm-app framework handles this well -- its data pipeline natively supports attaching rich metadata at index time.
Pitfall 2: Reranker Becomes a Latency Bottleneck
Cross-encoder reranking scores every (query, candidate) pair. Run one pair at a time, 20 candidates x 10ms each adds 200ms of latency. In systems targeting P95 < 500ms, that's nearly half your budget.
Solutions:
- Reduce initial retrieval top-k (from 30 to 15)
- Use a smaller reranker model (MiniLM instead of large)
- Batch inference for the reranker (see the sketch after this list)
- Consider lightweight LLM reranking (output rank only, no text generation)
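For the batching option, CrossEncoder.predict in sentence-transformers takes a batch_size argument, so scoring runs in a few forward passes instead of one per pair. A minimal sketch with illustrative inputs:
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", max_length=512)

query = "user query here"
candidate_chunks = [f"candidate chunk {i} ..." for i in range(20)]

# One batched call instead of 20 sequential ones; tune batch_size to your hardware
pairs = [(query, chunk) for chunk in candidate_chunks]
scores = reranker.predict(pairs, batch_size=32, show_progress_bar=False)
print(scores[:5])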
Pitfall 3: Embedding Model Mismatch with Query Distribution
You indexed documents with a multilingual embedding model, but user queries mix languages (for example, English terms like "RAG" or "chunking" embedded in an otherwise non-English sentence). Many embedding models degrade on mixed-language input.
Solutions:
- Generate bilingual chunks at index time (LLM-translate, then embed each separately)
- Generate bilingual sub-queries during query expansion
- Choose embedding models that explicitly support cross-lingual retrieval (e.g., multilingual-e5, BGE-M3); see the sketch below
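A minimal sketch of the last option using multilingual-e5 via sentence-transformers; the model name and the "query: " / "passage: " prefix convention are the assumptions here:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-small")

# multilingual-e5 models are trained with instruction prefixes on both sides
query_emb = model.encode(["query: RAG chunking strategy"], normalize_embeddings=True)
passage_embs = model.encode(
    [
        "passage: Semantic chunking splits text where sentence similarity drops.",
        "passage: BM25 scores documents by term frequency and inverse document frequency.",
    ],
    normalize_embeddings=True,
)
scores = (query_emb @ passage_embs.T)[0]  # cosine similarity, since embeddings are normalized
print(scores)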
Summary
- Chunking caps your retrieval quality. Time spent choosing the right chunking strategy has higher ROI than time spent swapping embedding models. Start with structure-aware splitting and adjust from there.
- Hybrid search is the production standard. Pure vector search loses to BM25 on exact matches. Pure BM25 loses to vectors on semantic understanding. Fusing both is the most robust approach.
- Reranking is the highest-ROI single optimization. Adding a cross-encoder reranker typically improves retrieval precision by 10-20% at the cost of 10-50ms latency.
- Metadata is not optional. Attach source, section, and type metadata at index time. Use metadata filtering during retrieval. The improvement is immediate and measurable.
- There is no universal chunking strategy. API docs, research papers, FAQs, and legal contracts require different splitting approaches. Choose based on your document type, not based on what some tutorial recommends.
Projects in this article
Haystack (25.2k ⭐) -- an enterprise-grade framework for RAG and search applications, covering document processing, retrieval, generation, and evaluation end to end.
LlamaIndex (49.3k ⭐) -- a data framework that provides the data connection layer for LLM applications, with strong RAG capabilities across diverse data sources and vector databases.
EmbedAnything (1.2k ⭐) -- a highly performant, modular, and memory-safe embedding inference and indexing framework built in Rust, providing production-ready RAG ingestion and indexing pipelines for local and cloud deployment.
AutoRAG (4.8k ⭐) -- an open-source RAG evaluation and optimization framework using AutoML-style automation to help developers automatically find the best RAG pipeline configurations and benchmark them.
Lantern (881 ⭐) -- a PostgreSQL vector database extension for building AI applications, adding high-performance vector search capabilities to PostgreSQL with support for generating and indexing embeddings directly in the database.
Pathway LLM App (59.8k ⭐) -- ready-to-run cloud templates for RAG, AI pipelines, and enterprise search with live data, always in sync with SharePoint, Google Drive, S3, Kafka, and more.