NLP & Transformers — From Tokenization to Fine-Tuning
38 min
The Text Pipeline
Every NLP system transforms raw human language into numbers a model can reason about. The canonical pipeline:
Raw text → Normalize → Tokenize → Vectorize → Model → Decode → Output
Each stage makes tradeoffs. Normalization loses information (case, punctuation) but reduces noise. Tokenization determines vocabulary size and OOV handling. Vectorization decides whether words carry semantic meaning. Understanding these tradeoffs is the difference between engineers who copy-paste transformers and engineers who can debug why their model underperforms on domain-specific text.
Part 1: Text Preprocessing Pipeline
Normalization, Tokenization, Stop Words
The first stage of any NLP pipeline cleans and standardizes raw text. These transforms run in Pyodide — no external libraries needed.
Normalization, Tokenization, Stop Words
Click Run to execute — Python runs in your browser via WebAssembly
Stemming, Lemmatization, and N-grams
Stemming aggressively chops suffixes. Lemmatization uses morphological rules to return dictionary base forms. N-grams capture local word context.
Stemming, Lemmatization, N-grams
Click Run to execute — Python runs in your browser via WebAssembly
Vocabulary Building and Text Vectorization from Scratch
Vocabulary Builder + TF-IDF from Scratch
Click Run to execute — Python runs in your browser via WebAssembly
Part 2: Classical NLP
Word Frequency Analysis and Zipf's Law
Zipf's Law is one of the most surprising facts in language: word frequency is inversely proportional to rank. The most common word appears roughly twice as often as the second, three times as often as the third.
Word Frequency Analysis and Zipf's Law
Click Run to execute — Python runs in your browser via WebAssembly
Named Entity Recognition and Sentiment Analysis from Scratch
Rule-Based NER + Lexicon Sentiment
Click Run to execute — Python runs in your browser via WebAssembly
Naive Bayes Text Classifier from Scratch
Naive Bayes is the workhorse of classical text classification. "Naive" because it assumes feature independence — which is false for text — but works surprisingly well anyway.
The math: given a document d and class c:
P(c|d) ∝ P(c) × Π P(w|c) for each word w in d
In log space (to avoid underflow):
log P(c|d) ∝ log P(c) + Σ log P(w|c)
Naive Bayes Text Classifier from Scratch
Click Run to execute — Python runs in your browser via WebAssembly
Levenshtein Edit Distance for Spelling Correction
Edit Distance and Spell Correction
Click Run to execute — Python runs in your browser via WebAssembly
Part 3: Word Embeddings
Before transformers, dense vector representations solved the fundamental problem with bag-of-words: no semantic meaning. In BoW, "cat" and "feline" are completely unrelated — they share zero vocabulary overlap.
Why BoW Fails
python
# Two semantically identical sentences get very different BoW vectors:doc1 = "The automobile is fast"doc2 = "The car is rapid"# BoW with these 6 unique words:# [automobile, fast, car, rapid, the, is]# doc1: [1, 1, 0, 0, 1, 1]# doc2: [0, 0, 1, 1, 1, 1]# Cosine similarity ≈ 0.5 (not even close to 1.0)
Word embeddings fix this by placing semantically similar words near each other in a continuous vector space.
Word2Vec: CBOW and Skip-gram
Word2Vec (Mikolov et al., 2013) trains on a simple self-supervised task using a sliding context window over text:
Skip-gram: Given center word, predict context words.
The embeddings are never directly supervised — they emerge as a byproduct of the prediction task. After training billions of sentences, semantic structure emerges geometrically.
python
# Using gensim (requires pip install gensim)from gensim.models import Word2Vecimport gensim.downloader as api# Train from scratch on custom textsentences = [ ["machine", "learning", "models", "learn", "from", "data"], ["deep", "learning", "neural", "networks", "transform", "features"], ["python", "code", "runs", "machine", "learning", "algorithms"],]model = Word2Vec( sentences, vector_size=100, # Embedding dimension window=5, # Context window size min_count=1, # Minimum word frequency workers=4, # Parallel training threads sg=1, # 1=Skip-gram, 0=CBOW epochs=100,)# Word similarityprint(model.wv.most_similar("machine", topn=5))# [('learning', 0.98), ('deep', 0.91), ('algorithms', 0.88), ...]# Word analogy: king - man + woman = queenresult = model.wv.most_similar( positive=["king", "woman"], negative=["man"], topn=1,)print(result) # [('queen', 0.89)]# Load pre-trained Google News embeddings (300-dim, 3M words)wv = api.load("word2vec-google-news-300")print(wv.most_similar("transformer", topn=5))
GloVe and fastText
GloVe (Pennington et al., 2014) factorizes the global word co-occurrence matrix. The loss function directly optimizes for the log-ratio of co-occurrence probabilities — encoding meaning as a consequence of corpus statistics rather than local context windows.
fastText (Bojanowski et al., 2017) represents words as sums of their character n-gram embeddings. This solves the OOV (out-of-vocabulary) problem:
python
from gensim.models import FastTextmodel = FastText(sentences, vector_size=100, window=5, min_count=1)# fastText can embed words it never saw in training:# "transformerized" → sum of character n-grams including "transform"vec = model.wv["transformerized"] # Works even though unseen!
Part 4: The Transformer Architecture
Attention Mechanism from Scratch
The transformer (Vaswani et al., 2017) replaces recurrence with attention. Every token can directly attend to every other token — no information bottleneck.
Scaled dot-product attention:
Attention(Q, K, V) = softmax(QKᵀ / √d_k) × V
Q (queries): what this token is looking for
K (keys): what each token offers
V (values): what each token actually contains
/ √d_k: prevents softmax saturation in high dimensions
Attention Mechanism + Positional Encoding from Scratch
Click Run to execute — Python runs in your browser via WebAssembly
Residual connection (x + sublayer(x)): prevents vanishing gradients, allows gradients to flow directly from output to early layers
Layer normalization: normalizes across the feature dimension (not batch), stabilizes training regardless of sequence length
Multi-head attention runs H attention heads in parallel, each with different learned projections. Each head can specialize — one might track syntactic dependencies, another coreference, another semantic similarity. Outputs are concatenated and projected:
| Property | BERT (Encoder-only) | GPT (Decoder-only) |
|---|---|---|
| Attention | Bidirectional (full) | Causal (masked) |
| Training | Masked Language Model | Next-token prediction |
| Best for | Classification, NER, QA | Text generation |
| Example | distilbert-base-uncased | gpt2, llama-3 |
Part 5: HuggingFace Transformers
The transformers library provides a consistent API over thousands of pre-trained models. The AutoClass pattern loads the right architecture from a model name:
AutoTokenizer and Input Encoding
python
from transformers import AutoTokenizertokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")# Single sentencetext = "The model didn't converge during training."encoding = tokenizer( text, return_tensors="pt", # PyTorch tensors padding=True, truncation=True, max_length=128,)print(encoding.keys())# dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])# input_ids: integer token IDs# attention_mask: 1 for real tokens, 0 for padding# token_type_ids: 0 for sentence A, 1 for sentence B (BERT sentence pairs)ids = encoding["input_ids"][0].tolist()tokens = tokenizer.convert_ids_to_tokens(ids)print(tokens)# ['[CLS]', 'the', 'model', 'didn', "'", 't', 'converge', 'during', 'training', '.', '[SEP]']# Batch encoding (handles padding automatically)batch = tokenizer( ["Short text.", "A longer piece of text that needs padding."], return_tensors="pt", padding=True, # Pad shorter sequences truncation=True, max_length=32,)print(batch["input_ids"].shape) # torch.Size([2, 12])# BPE subword splitting (handles OOV words)rare = "Supercalifragilisticexpialidocious and transformerization"toks = tokenizer.tokenize(rare)print(toks)# ['super', '##cali', '##fra', '##gil', '##istic', '##ex', '##pia', ...]
Pipeline API — Inference in One Line
python
from transformers import pipeline# Sentiment analysisclassifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")results = classifier([ "This movie was absolutely fantastic!", "I hated every minute of this terrible film.",])# [{'label': 'POSITIVE', 'score': 0.9998}, {'label': 'NEGATIVE', 'score': 0.9994}]# Named Entity Recognitionner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")entities = ner("Elon Musk founded SpaceX in California.")# [{'entity_group': 'PER', 'word': 'Elon Musk', 'score': 0.998},# {'entity_group': 'ORG', 'word': 'SpaceX', 'score': 0.993}, ...]# Zero-shot classification (no fine-tuning needed)zero_shot = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")result = zero_shot( "The interest rate hike will slow economic growth.", candidate_labels=["finance", "sports", "technology", "politics"],)# {'labels': ['finance', 'politics', 'technology', 'sports'],# 'scores': [0.89, 0.07, 0.02, 0.01]}# Summarizationsummarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")text = "..." * 500 # Long article textsummary = summarizer(text, max_length=150, min_length=40, do_sample=False)print(summary[0]["summary_text"])# Question answeringqa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")result = qa( question="What year did BERT come out?", context="BERT was published in 2018 by Google. It transformed NLP tasks.")# {'answer': '2018', 'score': 0.94, 'start': 25, 'end': 29}# Text generation (GPT-style)generator = pipeline("text-generation", model="gpt2")output = generator("Deep learning is", max_new_tokens=50, num_return_sequences=2)# Fill-mask (BERT-style)fill_mask = pipeline("fill-mask", model="bert-base-uncased")result = fill_mask("Paris is the [MASK] of France.")# [{'sequence': 'Paris is the capital of France.', 'score': 0.97, ...}]# Translationtranslator = pipeline("translation_en_to_fr", model="Helsinki-NLP/opus-mt-en-fr")result = translator("Machine learning transforms everything.")print(result[0]["translation_text"]) # "L'apprentissage automatique transforme tout."
Part 6: Fine-Tuning with HuggingFace
When to Fine-tune vs Prompt Engineer
| Approach | When to use | Data needed | Cost |
|---|---|---|---|
| Zero-shot prompting | Flexible task, LLM available | None | API cost |
| Few-shot prompting | Need format control | 5-20 examples | API cost |
| Fine-tuning small model | High volume, low latency | 1K-100K examples | GPU time |
| Fine-tuning large model | Specialized domain | 10K+ examples | Significant |
Rule of thumb: Try prompting first. Only fine-tune when:
You need consistent output format (prompting is unreliable)
You're making millions of API calls (cost prohibitive)
Your domain is highly specialized (medical, legal, scientific)
You need sub-100ms latency
Complete Fine-tuning Pipeline
python
from datasets import load_datasetfrom transformers import ( AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer, DataCollatorWithPadding,)import numpy as npfrom sklearn.metrics import accuracy_score, f1_score# ── 1. Load dataset ────────────────────────────────────────────────────────────dataset = load_dataset("imdb") # 50K movie reviews, binary sentiment# ── 2. Tokenize ────────────────────────────────────────────────────────────────checkpoint = "distilbert-base-uncased" # 66M params, 40% smaller than BERTtokenizer = AutoTokenizer.from_pretrained(checkpoint)def tokenize_fn(batch): return tokenizer( batch["text"], truncation=True, max_length=256, # Truncate at 256 tokens to save memory )tokenized = dataset.map(tokenize_fn, batched=True, batch_size=1000, remove_columns=["text"])tokenized = tokenized.rename_column("label", "labels")# ── 3. Data collator (dynamic padding per batch, more efficient) ───────────────data_collator = DataCollatorWithPadding(tokenizer=tokenizer)# ── 4. Load model ──────────────────────────────────────────────────────────────model = AutoModelForSequenceClassification.from_pretrained( checkpoint, num_labels=2, id2label={0: "negative", 1: "positive"}, label2id={"negative": 0, "positive": 1},)# ── 5. Metrics ─────────────────────────────────────────────────────────────────def compute_metrics(eval_pred): logits, labels = eval_pred preds = np.argmax(logits, axis=-1) return { "accuracy": accuracy_score(labels, preds), "f1": f1_score(labels, preds, average="binary"), }# ── 6. Training arguments ──────────────────────────────────────────────────────args = TrainingArguments( output_dir="imdb-distilbert", num_train_epochs=3, per_device_train_batch_size=32, per_device_eval_batch_size=64, learning_rate=2e-5, # Standard for fine-tuning: 1e-5 to 5e-5 warmup_ratio=0.1, # Warm up for 10% of steps weight_decay=0.01, # L2 regularization evaluation_strategy="epoch", save_strategy="epoch", load_best_model_at_end=True, metric_for_best_model="f1", fp16=True, # Mixed precision: 2x faster on modern GPUs dataloader_num_workers=4, report_to="none", # Disable wandb/tensorboard for demo)# ── 7. Trainer ─────────────────────────────────────────────────────────────────trainer = Trainer( model=model, args=args, train_dataset=tokenized["train"].select(range(10000)), # Subset for demo eval_dataset=tokenized["test"].select(range(2000)), tokenizer=tokenizer, data_collator=data_collator, compute_metrics=compute_metrics,)trainer.train()# Epoch 1/3: loss=0.28, accuracy=0.898, f1=0.897# Epoch 2/3: loss=0.19, accuracy=0.921, f1=0.920# Epoch 3/3: loss=0.14, accuracy=0.931, f1=0.930# ── 8. Save and push to hub ────────────────────────────────────────────────────trainer.save_model("imdb-distilbert-final")tokenizer.save_pretrained("imdb-distilbert-final")# Push to HuggingFace Hub# model.push_to_hub("your-username/imdb-distilbert")# tokenizer.push_to_hub("your-username/imdb-distilbert")
Parameter-Efficient Fine-Tuning (PEFT/LoRA)
For large models (7B+ parameters), full fine-tuning is impractical. LoRA (Low-Rank Adaptation) freezes the base model and trains only tiny adapter matrices:
python
from peft import LoraConfig, get_peft_model, TaskTypeconfig = LoraConfig( task_type=TaskType.SEQ_CLS, r=16, # Rank of adapter matrices (lower = fewer params) lora_alpha=32, # Scaling factor lora_dropout=0.05, target_modules=["q_lin", "k_lin", "v_lin"], # Which layers to adapt)peft_model = get_peft_model(model, config)peft_model.print_trainable_parameters()# trainable params: 147,456 || all params: 67,091,458 || trainable%: 0.22# Train with standard Trainer — same API as full fine-tuning# LoRA adds ~0.22% parameters but recovers most of full fine-tuning performance
Part 7: Working with LLM APIs
Groq and OpenAI-Compatible APIs
python
from groq import Groqimport jsonimport timeclient = Groq() # Reads GROQ_API_KEY from env# ── Basic completion ───────────────────────────────────────────────────────────response = client.chat.completions.create( model="llama-3.3-70b-versatile", messages=[ {"role": "system", "content": "You are an expert NLP engineer."}, {"role": "user", "content": "Explain BPE tokenization in 3 bullet points."}, ], temperature=0.3, # Low temperature for factual/technical content max_tokens=300,)text = response.choices[0].message.contentcost_in = response.usage.prompt_tokenscost_out = response.usage.completion_tokensprint(f"Response ({cost_in}+{cost_out} tokens):\n{text}")# ── Prompt engineering patterns ────────────────────────────────────────────────# 1. Zero-shotzero_shot = [{"role": "user", "content": "Classify: 'Shipment delayed by storm.' → category:"}]# 2. Few-shot (3 examples train the format)few_shot = [ {"role": "system", "content": "Classify customer support tickets."}, {"role": "user", "content": "My login stopped working yesterday."}, {"role": "assistant", "content": "AUTH_ISSUE"}, {"role": "user", "content": "Invoice shows wrong amount charged."}, {"role": "assistant", "content": "BILLING_ISSUE"}, {"role": "user", "content": "The dashboard loads slowly."}, {"role": "assistant", "content": "PERFORMANCE_ISSUE"}, {"role": "user", "content": "I can't export my report to PDF."}, # New query]# 3. Chain-of-thought (improves complex reasoning by 30-50%)cot = """<task>Classify the sentiment of this review.</task><review>The product works well overall but the documentation is confusing.</review><thinking>Think step by step:1. Positive signals: "works well", "overall"2. Negative signals: "documentation is confusing"3. The positive aspects describe core functionality, negative is secondary4. Overall tone: mixed, slightly positive</thinking><output>{"sentiment": "mixed", "positive_score": 0.6, "aspects": ["functionality+", "docs-"]}</output>"""# ── JSON mode (structured output) ─────────────────────────────────────────────struct_response = client.chat.completions.create( model="llama-3.3-70b-versatile", messages=[ {"role": "system", "content": "Extract entities. Return JSON only."}, {"role": "user", "content": "Elon Musk's SpaceX launched Falcon 9 from Cape Canaveral."}, ], response_format={"type": "json_object"},)data = json.loads(struct_response.choices[0].message.content)# {"persons": ["Elon Musk"], "organizations": ["SpaceX"], "rockets": ["Falcon 9"], ...}# ── Retry with exponential backoff ────────────────────────────────────────────def call_with_retry(client, messages, model="llama-3.3-70b-versatile", max_retries=3): for attempt in range(max_retries): try: return client.chat.completions.create(model=model, messages=messages) except Exception as e: if attempt == max_retries - 1: raise wait = 2 ** attempt # 1s, 2s, 4s print(f"Attempt {attempt+1} failed: {e}. Retrying in {wait}s...") time.sleep(wait)# ── Embeddings API ─────────────────────────────────────────────────────────────# (Using OpenAI for embeddings — Groq doesn't offer embeddings API)from openai import OpenAIopenai_client = OpenAI()def embed(texts, model="text-embedding-3-small"): response = openai_client.embeddings.create(input=texts, model=model) return [item.embedding for item in response.data]doc_embeddings = embed(["Python is great", "I love coding", "The market crashed"])# Each embedding: list of 1536 floats# Semantic search: cosine_similarity(query_embedding, doc_embeddings)
Part 8: Production NLP Patterns
Batched Inference and Quantization
python
# ── Batched inference: process many texts efficiently ─────────────────────────from transformers import pipelineimport torch# Use batch_size parameter for throughputclassifier = pipeline( "text-classification", model="distilbert-base-uncased-finetuned-sst-2-english", device=0 if torch.cuda.is_available() else -1, batch_size=32, # Process 32 texts per forward pass)texts = ["text1", "text2", ...] * 1000 # 1000 textsresults = classifier(texts) # ~8x faster than batch_size=1# ── Quantization with bitsandbytes ────────────────────────────────────────────from transformers import AutoModelForCausalLM, BitsAndBytesConfig# INT8 quantization: ~2x memory reduction, minimal quality lossquantization_config = BitsAndBytesConfig(load_in_8bit=True)# INT4 quantization: ~4x memory reductionquantization_config_4bit = BitsAndBytesConfig( load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16, bnb_4bit_use_double_quant=True, # Nested quantization bnb_4bit_quant_type="nf4", # NormalFloat4 — better than uniform INT4)model = AutoModelForCausalLM.from_pretrained( "meta-llama/Llama-3.2-3B", quantization_config=quantization_config_4bit, device_map="auto",)# 3B model: 12GB FP16 → 3GB INT4# ── Embedding caching ─────────────────────────────────────────────────────────import hashlibimport jsonimport osclass EmbeddingCache: def __init__(self, cache_dir=".embedding_cache"): self.cache_dir = cache_dir os.makedirs(cache_dir, exist_ok=True) def _key(self, text, model): return hashlib.md5(f"{model}:{text}".encode()).hexdigest() def get(self, text, model): path = os.path.join(self.cache_dir, self._key(text, model) + ".json") if os.path.exists(path): with open(path) as f: return json.load(f) return None def set(self, text, model, embedding): path = os.path.join(self.cache_dir, self._key(text, model) + ".json") with open(path, "w") as f: json.dump(embedding, f)# ── Long document handling: chunking strategies ───────────────────────────────def chunk_fixed(text, chunk_size=512, overlap=50): """Fixed-size chunking with overlap.""" words = text.split() chunks = [] for i in range(0, len(words), chunk_size - overlap): chunk = " ".join(words[i:i + chunk_size]) chunks.append(chunk) return chunksdef chunk_sentences(text, max_tokens=200): """Sentence-boundary chunking.""" import re sentences = re.split(r'(?<=[.!?])\s+', text) chunks = [] current = [] current_len = 0 for sent in sentences: sent_len = len(sent.split()) if current_len + sent_len > max_tokens and current: chunks.append(" ".join(current)) current = [sent] current_len = sent_len else: current.append(sent) current_len += sent_len if current: chunks.append(" ".join(current)) return chunks
PROJECT: Complete Text Classification System
Build a full text classification system from scratch — no sklearn, no transformers. Pure Python + math.
PROJECT: Complete Text Classification System
Click Run to execute — Python runs in your browser via WebAssembly
Exercises
Exercise 1 (Pyodide): Extend the NaiveBayesClassifier above to output a probability distribution across classes using the softmax of log-probabilities, not just the argmax. Test it on the 4-class dataset.
Exercise 2 (Pyodide): Implement a bigram language model. Given a training corpus, compute P(word | previous_word) for all word pairs. Add Laplace smoothing. Use it to compute the perplexity of a test sentence.
Exercise 3 (Pyodide): Build a simple keyword extraction system using TF-IDF scores. For a given document, extract the top-N most distinctive words/phrases by comparing the document's TF-IDF against a background corpus.
Exercise 4 (Pyodide): Implement the Jaccard similarity between two documents (using token sets). Compare it to cosine similarity on TF-IDF vectors. When do they agree/disagree?
Exercise 5 (Full library): Load bert-base-uncased with HuggingFace transformers. Tokenize 100 sentences from a dataset of your choice. Visualize attention weights from the last layer using matplotlib. Which token pairs have the highest attention?
Exercise 6 (Full library): Fine-tune distilbert-base-uncased on the AG News dataset (4 topics: World, Sports, Business, Sci/Tech). Target >90% accuracy. Report per-class F1, confusion matrix, and 5 examples the model gets wrong.
Exercise 7 (Full library): Use the sentence-transformers library with model all-MiniLM-L6-v2 to build a semantic search engine over a Wikipedia paragraph dump. Given a query, find the top-5 most semantically similar paragraphs.
Exercise 8 (Full library): Use Groq's API to build a zero-shot text classifier that categorizes news headlines into 6 categories without any training data. Compare its accuracy to your Naive Bayes classifier from Exercise 1 on the same test set.
Key Takeaways
Build a preprocessing pipeline once and reuse it — normalization, tokenization, stop-word removal, and stemming are always the first steps regardless of the final model
TF-IDF + Naive Bayes achieves 85–95% accuracy on well-separated topics; always benchmark it before using transformers
Attention computes softmax(QKᵀ/√d_k)V — this is the core operation of every modern NLP model; understanding its shape transformations is non-negotiable
BERT is bidirectional (reads full context), GPT is autoregressive (reads left-to-right only) — choose BERT for classification, GPT for generation
Subword tokenization (BPE/WordPiece) makes OOV words impossible — every byte sequence can be encoded
Fine-tuning requires only 2–3 epochs on domain data; more epochs causes catastrophic forgetting
LoRA fine-tunes 0.1–1% of parameters with ~90% of full fine-tuning quality — use it for models > 1B parameters
Cosine similarity on L2-normalized embeddings reduces to a dot product — precompute normalization once for fast batch search