Tokens, Embeddings & Vector Space
Before a model can process text, the text must become numbers. This happens in two stages: tokenisation → embedding. Understanding both is essential for writing efficient prompts and debugging unexpected model behaviour.
Tokenisation
A tokeniser splits text into chunks called tokens. Most modern LLMs use Byte Pair Encoding (BPE), which builds a vocabulary of common subword units by iteratively merging the most frequent pairs.
Key observations:
- Common English words → 1 token
- Rare words, names, and code identifiers → multiple tokens
- Numbers are often split into arbitrary chunks: "1234" → ["12", "34"]
- Whitespace is included in tokens: " hello" ≠ "hello"
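The merge loop behind BPE can be sketched in a few lines. This is a toy trainer on an invented four-word corpus, not any production tokeniser; real vocabularies are built from gigabytes of text.

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent pair."""
    # Each word starts as a sequence of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges, corpus

merges, corpus = bpe_merges(["lower", "lowest", "low", "low"], 3)
print(merges)  # → [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

Note how "low", the most frequent substring, becomes a single symbol after two merges — exactly why common words end up as one token while rare words stay fragmented.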
Why Tokenisation Matters for You
Counting costs: API pricing is per token, not per character. A 10,000-word document is roughly 13,000–15,000 tokens.
Arithmetic errors: 9.9 > 9.11 confuses some models because 9.11 is tokenised as ["9", ".", "1", "1"] — four separate tokens without inherent numeric meaning.
Code vs prose: Code is token-inefficient. function getUserById(id: string) takes more tokens than its meaning suggests.
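For rough budgeting before you reach for a real tokeniser, a rule of thumb suffices: English prose runs about 1.3–1.5 tokens per word. The ratio below is an illustrative assumption, not a measurement of any particular model, and code or non-English text will skew higher.

```python
def estimate_tokens(text, tokens_per_word=1.4):
    """Crude token estimate for English prose; real counts vary by tokeniser."""
    return round(len(text.split()) * tokens_per_word)

# A 10,000-word document lands mid-way in the 13,000-15,000 token range.
print(estimate_tokens("word " * 10_000))  # → 14000
```

Use an estimate like this for sanity checks only; for billing-accurate counts, run the actual tokeniser for your model.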
Embeddings
After tokenisation, each token ID is looked up in an embedding table — a matrix of learned float vectors. A token ID becomes a dense vector of, say, 4096 numbers.
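The lookup itself is just indexing into a matrix. A minimal sketch with an invented three-token vocabulary and 4-dimensional vectors standing in for 4096:

```python
import random

random.seed(0)
VOCAB = {"the": 0, " cat": 1, " sat": 2}  # token → ID (toy vocabulary)
DIM = 4                                   # real models use e.g. 4096

# The embedding table: one vector per token ID.
# Random here; in a trained model these values are learned.
table = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in VOCAB]

token_ids = [VOCAB[t] for t in ["the", " cat", " sat"]]
vectors = [table[i] for i in token_ids]   # the "lookup" is plain indexing
print(len(vectors), len(vectors[0]))      # → 3 4  (3 tokens, 4 dims each)
```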
These vectors encode meaning. Similar concepts land near each other in this high-dimensional space: the vectors for "cat" and "dog" sit far closer to each other than either does to "carburettor".
This is not a toy property — it's how the model stores and retrieves factual associations.
The Attention Layer and Contextual Embeddings
The raw embedding table gives each token a fixed vector regardless of context. The transformer's attention layers transform these into contextual embeddings — vectors that encode not just the token's identity but its meaning in the current sequence.
Consider "she deposited the cheque at the bank" versus "they picnicked on the river bank". After attention, the vector for "bank" in these two sentences will point in different directions in the embedding space — the model has resolved the ambiguity.
Practical Implications
- Token budgeting: Use a tokeniser before sending large inputs. Don't guess.
- Semantic search: Store document embeddings; query with an embedding of the question. Cosine similarity finds relevant chunks. This is the foundation of RAG.
- Unexpected failures: If a model fails on a specific word or name, check how it tokenises — fragmented tokenisation often explains unusual failures.
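The semantic-search bullet reduces to a cosine similarity and an argmax once you have embeddings. Here the "embeddings" are hand-made 3-dimensional vectors standing in for real model output, and the chunk names are invented:

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Pretend these vectors came from an embedding model (values are invented).
chunks = {
    "refund policy":   [0.9, 0.1, 0.0],
    "shipping times":  [0.1, 0.9, 0.1],
    "api rate limits": [0.0, 0.1, 0.9],
}
query = [0.8, 0.2, 0.1]  # embedding of "how do I get my money back?"

best = max(chunks, key=lambda name: cosine(query, chunks[name]))
print(best)  # → refund policy
```

A real RAG pipeline adds chunking, an embedding model, and a vector index, but the retrieval step is exactly this comparison at scale.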
In the next lesson, we'll put this knowledge to use and make our first real API call.