Chunking Strategies
Chunking is the most under-appreciated decision in RAG engineering. Too large: the retrieved chunk buries the answer in irrelevant context and wastes tokens. Too small: you lose surrounding context and the chunk is meaningless in isolation. The right chunk size depends on the embedding model's optimal input length, the structure of your documents, and your retrieval recall@k target.
Chunking Strategy Comparison
| Strategy | Chunk size control | Preserves sentences | Best for |
|---|---|---|---|
| Fixed character size | Exact character count | No | Fast baseline, unstructured text |
| Fixed token size | Exact token count | No | LLM context budgeting |
| Sentence boundary | Varies | Yes | Prose, articles, reports |
| Recursive splitting | Attempts hierarchy | Partial | Mixed structured/unstructured |
| Semantic chunking | Varies | Yes | High-precision retrieval |
Fixed-Size Chunking
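Fixed-size chunking cuts the text every N characters or tokens, regardless of content. Overlap between consecutive chunks reduces the chance that an answer is split across a chunk boundary. A typical overlap is 10–15% of the chunk size.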
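A minimal sketch of fixed-size character chunking with overlap, using only the standard library. The function name `chunk_fixed` and its default sizes are illustrative assumptions, not values from a particular library; 50 characters of overlap on a 500-character chunk falls in the 10–15% range mentioned above.

```python
def chunk_fixed(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into character windows of chunk_size that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the window has reached the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(text):
            break
    return chunks


print(chunk_fixed("abcdefghij", chunk_size=4, overlap=1))  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, so a short answer that straddles a boundary still appears whole in at least one chunk, provided it is no longer than the overlap.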
Sentence-Boundary Chunking
Sentence chunking produces semantically complete units. Avoid splitting mid-sentence where you can: embedding models trained on whole sentences generalise poorly to sentence fragments.
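One way to sketch this is to split on sentence-ending punctuation and then pack whole sentences into chunks up to a size limit. The regex below is a deliberately naive assumption: it treats any `.`, `!`, or `?` followed by whitespace as a sentence end, so abbreviations such as "Dr." will be split incorrectly. A dedicated sentence tokenizer would be more robust. The name `chunk_by_sentence` and the `max_chars` parameter are hypothetical.

```python
import re

# Naive boundary: whitespace that follows '.', '!' or '?'.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def chunk_by_sentence(text: str, max_chars: int = 500) -> list[str]:
    """Group whole sentences into chunks of at most max_chars where possible."""
    sentences = [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "One two. Three four! Five six? Seven."
print(chunk_by_sentence(text, max_chars=20))
# ['One two. Three four!', 'Five six? Seven.']
```

A single sentence longer than `max_chars` is kept whole rather than cut, which trades strict size control for intact sentences.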
Recursive Splitting with LangChain
The recursive splitter tries `\n\n` first (paragraph breaks), then `\n` (line breaks), then `. ` (sentence ends), falling back to the next, finer separator only when a piece is still larger than the chunk size. This works well for markdown and HTML-stripped documents.
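The fallback order described above can be illustrated with a small standard-library sketch. This is not LangChain's implementation of `RecursiveCharacterTextSplitter`; it is a simplified approximation of the same idea, and the function name `recursive_split` and its defaults are assumptions. Unlike a production splitter, it has no overlap, and it drops the separator wherever a chunk boundary falls.

```python
def recursive_split(text, chunk_size=200, separators=("\n\n", "\n", ". ")):
    """Split on the coarsest separator first; recurse with finer ones for oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard character cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces together
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too large: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


doc = "Alpha beta. Gamma delta.\n\nEpsilon."
print(recursive_split(doc, chunk_size=15))
# ['Alpha beta', 'Gamma delta.', 'Epsilon.']
```

In the example, the first paragraph is too long for one chunk, so it is split again at the sentence boundary, while the short second paragraph survives intact.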
Metadata Injection
Retrieval precision improves when chunks carry metadata that can be used in filtered queries.
Metadata enables filtered retrieval: "only retrieve chunks from documents published after 2023-01-01" or "only from the legal subdirectory." Filtering before or during the similarity search is generally more efficient than post-retrieval filtering, because top-k slots are not spent on chunks that would be discarded afterwards.
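A sketch of attaching metadata at chunk creation time and filtering on it before any similarity search. The `Chunk` class, the field names (`source`, `published`, `chunk_index`), and both helper functions are hypothetical. Most vector stores expose their own filter syntax, and this in-memory version only illustrates the idea.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def attach_metadata(texts, source, published):
    """Wrap raw chunk texts with document-level metadata plus a position index."""
    return [
        Chunk(text=t, metadata={"source": source, "published": published, "chunk_index": i})
        for i, t in enumerate(texts)
    ]


def filter_chunks(chunks, published_after=None, source_prefix=None):
    """Keep only chunks that satisfy every filter that was supplied."""
    selected = []
    for chunk in chunks:
        meta = chunk.metadata
        if published_after is not None and meta["published"] <= published_after:
            continue
        if source_prefix is not None and not meta["source"].startswith(source_prefix):
            continue
        selected.append(chunk)
    return selected


corpus = (
    attach_metadata(["clause one", "clause two"], "legal/contract.md", date(2023, 6, 1))
    + attach_metadata(["old post"], "blog/post.md", date(2022, 1, 1))
)
recent = filter_chunks(corpus, published_after=date(2023, 1, 1))
print([c.text for c in recent])  # ['clause one', 'clause two']
```

Only the chunks that pass the filter would then be scored against the query, which keeps the candidate set small.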
Choosing Chunk Size
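Since the right size depends on your documents and your recall@k target, one practical approach is to sweep several chunk sizes against a small set of labelled queries and compare recall@k. The sketch below is a toy harness under stated assumptions: `overlap_score` is a word-overlap stand-in for embedding similarity, a query counts as a hit when any of the top-k chunks contains the expected answer string, and all function names are hypothetical. In a real evaluation you would swap in your embedding model and vector store.

```python
def fixed_chunks(text, size):
    """Non-overlapping character chunks, used here only to vary the chunk size."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def overlap_score(query, chunk):
    """Toy relevance score: number of lowercase words shared by query and chunk."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def recall_at_k(chunks, labelled_queries, k):
    """Fraction of queries whose answer string appears in one of the top-k chunks."""
    hits = 0
    for query, answer in labelled_queries:
        ranked = sorted(chunks, key=lambda ch: overlap_score(query, ch), reverse=True)
        if any(answer in ch for ch in ranked[:k]):
            hits += 1
    return hits / len(labelled_queries)


def sweep_chunk_sizes(text, labelled_queries, sizes, k=3):
    """Return {chunk_size: recall@k} so sizes can be compared side by side."""
    return {
        size: recall_at_k(fixed_chunks(text, size), labelled_queries, k)
        for size in sizes
    }


text = "The capital of France is Paris. The capital of Japan is Tokyo."
queries = [("capital of France", "Paris"), ("capital of Japan", "Tokyo")]
for size, recall in sweep_chunk_sizes(text, queries, sizes=[3, 32, len(text)], k=1).items():
    print(f"chunk_size={size:>3}  recall@1={recall:.2f}")
```

The extremes show the trade-off from the introduction: 3-character chunks can never contain a 5-character answer, while a single chunk covering the whole document always does but gives the retriever nothing to discriminate between.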
Summary
- Chunk size is one of the highest-leverage RAG parameters; benchmark recall@k at multiple chunk sizes before committing.
- Fixed-size chunking is fast but crosses sentence boundaries, reducing embedding quality.
- Sentence-boundary chunking produces semantically coherent chunks and improves embedding alignment.
- Recursive splitting degrades gracefully through paragraph → line → sentence separators, which suits mixed documents.
- Inject document metadata at chunk creation time to enable filtered retrieval and reduce irrelevant results.