GadaaLabs
RAG Engineering
Lesson 2

Chunking Strategies

14 min

Chunking is the most under-appreciated decision in RAG engineering. Too large: the retrieved chunk buries the answer in irrelevant context and wastes tokens. Too small: you lose surrounding context and the chunk is meaningless in isolation. The right chunk size depends on the embedding model's optimal input length, the structure of your documents, and your retrieval recall@k target.

Chunking Strategy Comparison

| Strategy | Chunk size control | Preserves sentences | Best for |
|---|---|---|---|
| Fixed character size | Exact character count | No | Fast baseline, unstructured text |
| Fixed token size | Exact token count | No | LLM context budgeting |
| Sentence boundary | Varies | Yes | Prose, articles, reports |
| Recursive splitting | Attempts hierarchy | Partial | Mixed structured/unstructured |
| Semantic chunking | Varies | Yes | High-precision retrieval |

Fixed-Size Chunking

```python
def fixed_chunk(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")  # guards against an infinite loop
    chunks = []
    start = 0
    while start < len(text):
        end = start + size
        chunks.append(text[start:end])
        start += size - overlap   # slide with overlap
    return chunks
```

Overlap prevents answers from being split across chunk boundaries. A typical overlap is 10–15% of the chunk size.
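The overlap invariant can be checked directly: with a step of `size - overlap`, the tail of each chunk reappears at the head of the next. A self-contained demonstration (re-declaring `fixed_chunk` so the snippet runs standalone):

```python
def fixed_chunk(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap   # slide with overlap
    return chunks

text = "abcdefghij" * 50                        # 500 characters of dummy text
chunks = fixed_chunk(text, size=100, overlap=20)

# Each chunk starts 80 characters after the previous one,
# so consecutive chunks share a 20-character overlap.
assert chunks[1][:20] == chunks[0][-20:]
```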

Sentence-Boundary Chunking

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def sentence_chunk(text: str, max_tokens: int = 256) -> list[str]:
    doc = nlp(text)
    chunks, current, current_len = [], [], 0

    for sent in doc.sents:
        sent_len = len(sent)
        # Flush the current chunk before the next sentence would exceed the budget.
        if current_len + sent_len > max_tokens and current:
            chunks.append(" ".join(t.text for t in current))
            current, current_len = [], 0
        current.extend(sent)
        current_len += sent_len

    if current:
        chunks.append(" ".join(t.text for t in current))
    return chunks
```

Sentence chunking produces semantically complete units. Never split mid-sentence: embedding models trained on whole sentences generalise poorly to sentence fragments.
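When spaCy is unavailable, the same greedy sentence-packing idea can be approximated with a regex splitter and word counts as a cheap token proxy. This is a sketch, not production code: `sentence_chunk_regex` is a hypothetical helper, and punctuation-based splitting misfires on abbreviations like "Dr.".

```python
import re

def sentence_chunk_regex(text: str, max_words: int = 256) -> list[str]:
    """Greedily pack regex-split sentences into chunks of at most max_words words."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, current_len = [], [], 0
    for sent in sentences:
        n = len(sent.split())                  # word count as a token proxy
        if current_len + n > max_words and current:
            chunks.append(" ".join(current))   # flush before exceeding the budget
            current, current_len = [], 0
        current.append(sent)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```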

Recursive Splitting with LangChain

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=600,
    chunk_overlap=80,
    separators=["\n\n", "\n", ". ", " ", ""],  # try each separator in order
    length_function=len,
)

chunks = splitter.split_text(long_document)
```

The recursive splitter tries `\n\n` first (paragraph breaks), then `\n` (line breaks), then `. ` (sentence ends), and finally falls back to single spaces and raw characters, degrading gracefully. This works well for markdown and HTML-stripped documents.
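To make the mechanism concrete, here is a minimal sketch of the recursive idea. It is not LangChain's actual implementation (which also handles chunk overlap and merging); it only shows the "split at the coarsest separator, recurse on oversized pieces" core.

```python
def recursive_split(text: str, chunk_size: int = 600,
                    separators: tuple = ("\n\n", "\n", ". ", " ", "")) -> list[str]:
    """Split at the coarsest separator; recurse on pieces still over chunk_size."""
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, *rest = separators
    if sep == "":
        # Last resort: hard character split.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    pieces = [p for p in text.split(sep) if p]
    chunks, buf = [], ""
    for piece in pieces:
        candidate = buf + sep + piece if buf else piece
        if len(candidate) <= chunk_size:
            buf = candidate                    # keep packing into the current chunk
        else:
            if buf:
                chunks.append(buf)
            if len(piece) > chunk_size:
                # This piece alone is too big: retry with a finer separator.
                chunks.extend(recursive_split(piece, chunk_size, tuple(rest)))
                buf = ""
            else:
                buf = piece
    if buf:
        chunks.append(buf)
    return chunks
```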

Metadata Injection

Retrieval precision improves when chunks carry metadata that can be used in filtered queries:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text:       str
    doc_id:     str
    source_url: str
    page_num:   int
    section:    str
    chunk_idx:  int

def build_chunks_with_metadata(pages: list[dict]) -> list[Chunk]:
    chunks = []
    for page in pages:
        raw_chunks = sentence_chunk(page["text"])
        for i, text in enumerate(raw_chunks):
            chunks.append(Chunk(
                text       = text,
                doc_id     = page["doc_id"],
                source_url = page["url"],
                page_num   = page["page"],
                section    = page["section"],
                chunk_idx  = i,
            ))
    return chunks
```

Metadata enables filtered retrieval: "only retrieve chunks from documents published after 2023-01-01" or "only from the legal subdirectory." This is significantly more efficient than post-retrieval filtering.
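In a local pipeline, the same idea can be sketched as a predicate filter applied before vector search. `filter_chunks` and the simplified `Chunk` below are illustrative, not a particular vector store's API:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    section: str
    published: str   # ISO date, e.g. "2024-03-01"

def filter_chunks(chunks: list[Chunk], **criteria) -> list[Chunk]:
    """Keep only chunks whose metadata matches every criterion exactly."""
    return [c for c in chunks
            if all(getattr(c, k) == v for k, v in criteria.items())]

corpus = [
    Chunk("NDA boilerplate...", section="legal", published="2024-03-01"),
    Chunk("Q3 revenue grew...", section="finance", published="2023-11-15"),
]

legal_recent = [c for c in filter_chunks(corpus, section="legal")
                if c.published >= "2023-01-01"]   # ISO dates sort lexicographically
```

Production vector databases expose the same capability natively (e.g. a `where`-style filter on the query), which avoids scoring chunks that could never be returned.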

Choosing Chunk Size

```python
# Rule of thumb: target 60-80% of the embedding model's optimal input length
EMBEDDING_MAX_TOKENS = {
    "all-MiniLM-L6-v2":       256,
    "text-embedding-3-small": 8191,
    "bge-large-en-v1.5":      512,
}

model = "bge-large-en-v1.5"
target_chunk_tokens = int(EMBEDDING_MAX_TOKENS[model] * 0.7)  # → 358 tokens
```
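Because character-based splitters measure length in characters, the token target has to be converted. A rough heuristic for English text is about 4 characters per token; `CHARS_PER_TOKEN` and `target_chunk_chars` below are illustrative assumptions, and you should measure with your model's actual tokenizer before committing.

```python
CHARS_PER_TOKEN = 4  # rough average for English; varies by tokenizer and language

def target_chunk_chars(model_max_tokens: int, fraction: float = 0.7) -> int:
    """Translate a token budget into a character budget for char-based splitters."""
    return int(model_max_tokens * fraction) * CHARS_PER_TOKEN

# bge-large-en-v1.5: 512-token limit → 358-token target → ~1432 characters
assert target_chunk_chars(512) == 1432
```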

Summary

  • Chunk size is one of the highest-leverage RAG parameters; benchmark recall@k at multiple chunk sizes before committing.
  • Fixed-size chunking is fast but crosses sentence boundaries, reducing embedding quality.
  • Sentence-boundary chunking produces semantically coherent chunks and improves embedding alignment.
  • Recursive splitting gracefully degrades through paragraph → sentence → word separators, suitable for mixed documents.
  • Inject document metadata at chunk creation time to enable filtered retrieval and reduce irrelevant results.