Python Mastery — From Zero to AI Engineering

Lesson 16

NLP & Transformers — From Tokenization to Fine-Tuning

38 min

The Text Pipeline

Every NLP system transforms raw human language into numbers a model can reason about. The canonical pipeline:

Raw text → Normalize → Tokenize → Vectorize → Model → Decode → Output

Each stage makes tradeoffs. Normalization loses information (case, punctuation) but reduces noise. Tokenization determines vocabulary size and how out-of-vocabulary (OOV) words are handled. Vectorization decides whether the representation carries semantic meaning. Understanding these tradeoffs is the difference between engineers who copy-paste transformers and engineers who can debug why their model underperforms on domain-specific text.

Part 1: Text Preprocessing Pipeline

Normalization, Tokenization, Stop Words

The first stage of any NLP pipeline cleans and standardizes raw text. These transforms run in Pyodide — no external libraries needed.
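
As a minimal sketch of these three steps using only the standard library: the stop-word list and the sample sentence below are illustrative assumptions, not a recommended configuration.

```python
import re
import string

# A tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "on"}

def normalize(text):
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text):
    """Split normalized text on whitespace."""
    return text.split()

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop-word set."""
    return [t for t in tokens if t not in stop_words]

raw = "The model's accuracy is HIGH, and the loss is low!"
tokens = tokenize(normalize(raw))
content_tokens = remove_stop_words(tokens)
print(tokens)
print(content_tokens)
```

Note that "model's" becomes "models" once punctuation is stripped, a small example of the information loss that normalization trades for reduced noise.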

Stemming, Lemmatization, and N-grams

Stemming aggressively chops suffixes. Lemmatization uses morphological rules to return dictionary base forms. N-grams capture local word context.
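
The contrast can be sketched in a few lines. The suffix rules and the lemma table here are toy assumptions for illustration; a real stemmer (such as Porter's) and a real lemmatizer use far richer rules and dictionaries.

```python
def naive_stem(word):
    """Crude suffix stripping; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical hand-written lemma table; real lemmatizers use full dictionaries.
LEMMAS = {"ran": "run", "running": "run", "better": "good",
          "mice": "mouse", "studies": "study"}

def lemmatize(word):
    """Look up the dictionary base form, falling back to the word itself."""
    return LEMMAS.get(word, word)

def ngrams(tokens, n):
    """All contiguous runs of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = ["running", "studies", "jumped", "mice"]
stems = [naive_stem(w) for w in words]
lemmas = [lemmatize(w) for w in words]
bigrams = ngrams(["new", "york", "city", "subway"], 2)
print(stems)
print(lemmas)
print(bigrams)
```

The stems ("runn", "studi") are not dictionary words, while the lemmas are; the price is that the lemmatizer only knows what its table contains.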

Vocabulary Building and Text Vectorization from Scratch
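
A minimal sketch of vocabulary building and TF-IDF follows. It assumes the plain variant TF = count / document length and IDF = log(N / df); libraries often use smoothed variants, and the three-document corpus is made up for illustration.

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs are pets".split(),
]

# Vocabulary: every distinct token gets an integer index.
vocab = {tok: i for i, tok in enumerate(sorted({t for d in docs for t in d}))}

# Document frequency: number of documents containing each token.
df = Counter(tok for d in docs for tok in set(d))

def tfidf_vector(doc):
    """Dense TF-IDF vector for one tokenized document."""
    counts = Counter(doc)
    vec = [0.0] * len(vocab)
    for tok, c in counts.items():
        vec[vocab[tok]] = (c / len(doc)) * math.log(len(docs) / df[tok])
    return vec

vectors = [tfidf_vector(d) for d in docs]
print(vectors[0][vocab["cat"]], vectors[0][vocab["the"]])
```

In the first document "the" occurs twice and "cat" once, yet "cat" receives the higher weight because it appears in only one document.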

Part 2: Classical NLP

Word Frequency Analysis and Zipf's Law

Zipf's Law is one of the most surprising facts in language: word frequency is inversely proportional to rank. The most common word appears roughly twice as often as the second, three times as often as the third.
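
One way to check this on your own text is to rank words by frequency and look at rank × frequency, which Zipf's law predicts should stay roughly constant. The sample text below is a made-up assumption and far too small for a good fit; the sketch only shows the procedure.

```python
from collections import Counter

text = (
    "the quick brown fox jumps over the lazy dog and the dog barks at the fox "
    "while the cat watches the dog and the fox from the wall"
)
counts = Counter(text.split())
ranked = counts.most_common()   # (word, frequency) pairs, most frequent first

# Under Zipf's law, rank * frequency stays roughly constant across ranks.
for rank, (word, freq) in enumerate(ranked[:5], start=1):
    print(f"{rank:>2}  {word:<6} freq={freq}  rank*freq={rank * freq}")
```

On a large corpus, plotting log(frequency) against log(rank) gives an approximately straight line with slope near -1.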

Named Entity Recognition and Sentiment Analysis from Scratch
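
Both techniques can be approximated with rules alone. The sketch below assumes a tiny hand-written polarity lexicon and two regular expressions (capitalized word runs and four-digit years); these are illustrative choices, not a production design.

```python
import re

# Hypothetical mini-lexicon; real sentiment lexicons hold thousands of scored words.
LEXICON = {"great": 1, "love": 1, "excellent": 1,
           "bad": -1, "terrible": -1, "hate": -1}
NEGATIONS = {"not", "never", "no"}

def lexicon_sentiment(text):
    """Sum word polarities, flipping the sign of a word that follows a negation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, tok in enumerate(tokens):
        polarity = LEXICON.get(tok, 0)
        if i > 0 and tokens[i - 1] in NEGATIONS:
            polarity = -polarity
        score += polarity
    return score

def rule_based_ner(text):
    """Tag runs of capitalized words as candidate names, plus 4-digit years."""
    names = re.finditer(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)*\b", text)
    years = re.finditer(r"\b(?:19|20)\d{2}\b", text)
    return ([(m.group(), "NAME") for m in names]
            + [(m.group(), "YEAR") for m in years])

entities = rule_based_ner("We met Ada Lovelace in London in 2019.")
print(entities)
print(lexicon_sentiment("The plot was not great but I love the cast"))
```

The sentence-initial "We" is tagged as a name, a typical false positive of capitalization rules and one reason learned NER models took over.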

Naive Bayes Text Classifier from Scratch

Naive Bayes is the workhorse of classical text classification. "Naive" because it assumes feature independence — which is false for text — but works surprisingly well anyway.

The math: given a document d and class c:

P(c|d) ∝ P(c) × Π P(w|c)   for each word w in d

In log space (to avoid underflow):

log P(c|d) ∝ log P(c) + Σ log P(w|c)
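
The log-space formula maps directly onto code. This is a minimal sketch of a multinomial Naive Bayes classifier; Laplace (add-one) smoothing and the four-sentence training set are assumptions added here so that unseen word/class pairs do not produce log(0).

```python
import math
from collections import Counter, defaultdict

class NaiveBayesClassifier:
    """Multinomial Naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_scores(self, tokens):
        """Unnormalized log P(c|d) for every class c."""
        total_docs = sum(self.class_counts.values())
        scores = {}
        for c in self.class_counts:
            score = math.log(self.class_counts[c] / total_docs)   # log P(c)
            denom = sum(self.word_counts[c].values()) + len(self.vocab)
            for w in tokens:
                if w in self.vocab:   # skip words never seen in training
                    score += math.log((self.word_counts[c][w] + 1) / denom)  # log P(w|c)
            scores[c] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)

train_docs = [
    "goal scored in the final match".split(),
    "team wins the league title".split(),
    "stocks fall as markets react".split(),
    "bank raises interest rates".split(),
]
train_labels = ["sports", "sports", "finance", "finance"]

clf = NaiveBayesClassifier().fit(train_docs, train_labels)
print(clf.predict("the team scored".split()))
print(clf.predict("markets and interest rates".split()))
```

Because only the argmax matters for prediction, the scores are left unnormalized; applying a softmax to them would turn them into class probabilities.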

Levenshtein Edit Distance for Spelling Correction
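
Levenshtein distance counts the fewest single-character insertions, deletions, and substitutions needed to turn one string into another. A minimal dynamic-programming sketch, with a hypothetical four-word dictionary for the correction step:

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))          # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def correct(word, dictionary):
    """Return the dictionary word with the smallest edit distance to `word`."""
    return min(dictionary, key=lambda w: levenshtein(word, w))

DICTIONARY = ["tokenize", "transformer", "attention", "embedding"]  # illustrative word list
print(levenshtein("kitten", "sitting"))
print(correct("atention", DICTIONARY))
```

Keeping only two rows of the table brings memory down from O(len(a) × len(b)) to O(len(b)); a practical spell corrector would also weight candidates by word frequency.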

Part 3: Word Embeddings

Before transformers, dense vector representations solved the fundamental problem with bag-of-words: no semantic meaning. In BoW, "cat" and "feline" are completely unrelated — they share zero vocabulary overlap.

Why BoW Fails

python

# Two semantically identical sentences get very different BoW vectors:
doc1 = "The automobile is fast"
doc2 = "The car is rapid"

# BoW with these 6 unique words:
# [automobile, fast, car, rapid, the, is]
# doc1: [1, 1, 0, 0, 1, 1]
# doc2: [0, 0, 1, 1, 1, 1]
# Cosine similarity ≈ 0.5  (not even close to 1.0)
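
The similarity figure can be checked directly. This sketch builds the two count vectors in the vocabulary order listed above and computes their cosine; the helper names are assumptions.

```python
import math

def bow_vector(tokens, vocab):
    """Count how often each vocabulary word occurs in the token list."""
    return [tokens.count(word) for word in vocab]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vocab = ["automobile", "fast", "car", "rapid", "the", "is"]
v1 = bow_vector("the automobile is fast".split(), vocab)
v2 = bow_vector("the car is rapid".split(), vocab)
similarity = cosine(v1, v2)
print(v1, v2, similarity)
```

All of the overlap comes from "the" and "is"; the content words contribute nothing, which is exactly the failure described here.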

Word embeddings fix this by placing semantically similar words near each other in a continuous vector space.

Word2Vec: CBOW and Skip-gram

Word2Vec (Mikolov et al., 2013) trains on a simple self-supervised task using a sliding context window over text:

Skip-gram: Given center word, predict context words.

Input: "bank" → predict: ["river", "flows", "near", "the"]
Good for rare words

CBOW (Continuous Bag of Words): Given context, predict center word.

Input: ["river", "flows", "near", "the"] → predict: "bank"
Faster, good for frequent words
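
The difference between the two objectives is easiest to see in the training examples each one generates from the same sliding window. A minimal sketch, assuming a window of two tokens on each side and a made-up sentence:

```python
def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: one training example per context word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs += [(center, tokens[j]) for j in range(lo, hi) if j != i]
    return pairs

def cbow_examples(tokens, window=2):
    """(context_words, center) examples: one training example per position."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        examples.append(([tokens[j] for j in range(lo, hi) if j != i], center))
    return examples

tokens = "the river flows near the bank".split()
print(skipgram_pairs(tokens)[:4])
print(cbow_examples(tokens)[2])
```

Skip-gram produces several examples per position, so each word (including rare ones) gets more individual updates; CBOW averages the context into one example per position, which is why it trains faster.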

The embeddings are never directly supervised; they emerge as a byproduct of the prediction task. After training on billions of words, semantic structure emerges geometrically: related words end up close together in the vector space.

python

# Using gensim (requires pip install gensim)
from gensim.models import Word2Vec
import gensim.downloader as api

# Train from scratch on custom text (a toy corpus, far too small for meaningful vectors)
sentences = [
    ["machine", "learning", "models", "learn", "from", "data"],
    ["deep", "learning", "neural", "networks", "transform", "features"],
    ["python", "code", "runs", "machine", "learning", "algorithms"],
]

model = Word2Vec(
    sentences,
    vector_size=100,   # Embedding dimension
    window=5,          # Context window size
    min_count=1,       # Minimum word frequency
    workers=4,         # Parallel training threads
    sg=1,              # 1=Skip-gram, 0=CBOW
    epochs=100,
)

# Word similarity (only words seen in training can be queried)
print(model.wv.most_similar("machine", topn=5))
# Returns (word, cosine similarity) pairs; with three sentences the ranking is mostly noise

# Load pre-trained Google News embeddings (300-dim, 3M words and phrases, ~1.6 GB download)
wv = api.load("word2vec-google-news-300")

# Word analogy: king - man + woman ≈ queen
# Use the pre-trained vectors here: "king" is not in the toy vocabulary above,
# so model.wv.most_similar would raise a KeyError.
result = wv.most_similar(
    positive=["king", "woman"],
    negative=["man"],
    topn=1,
)
print(result)  # Top hit: 'queen'

print(wv.most_similar("transformer", topn=5))

GloVe and fastText

GloVe (Pennington et al., 2014) learns vectors from the global word co-occurrence matrix. Its weighted least-squares loss fits the dot product of two word vectors to the logarithm of their co-occurrence count, so that ratios of co-occurrence probabilities show up as vector differences. Meaning is encoded from corpus-wide statistics rather than from one local context window at a time.

fastText (Bojanowski et al., 2017) represents words as sums of their character n-gram embeddings. This solves the OOV (out-of-vocabulary) problem:

python

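from gensim.models import FastText

# Toy corpus: a list of token lists, as in the Word2Vec example
sentences = [
    ["machine", "learning", "models", "learn", "from", "data"],
    ["deep", "learning", "neural", "networks", "transform", "features"],
]

model = FastText(sentences, vector_size=100, window=5, min_count=1)

# fastText can embed words it never saw in training:
# "transformerized" shares character n-grams (3 to 6 characters by default,
# e.g. "tra", "trans", "nsform") with the training word "transform"
vec = model.wv["transformerized"]   # Works even though unseen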

Part 4: The Transformer Architecture

Attention Mechanism from Scratch

The transformer (Vaswani et al., 2017) replaces recurrence with attention. Every token can directly attend to every other token — no information bottleneck.

Scaled dot-product attention:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) × V

Q (queries): what this token is looking for
K (keys): what each token offers
V (values): what each token actually contains
/ √d_k: prevents softmax saturation in high dimensions
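
The formula translates almost line for line into NumPy. This is a minimal single-head sketch; the optional boolean `mask` argument and the random inputs are assumptions added for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (n_q, n_k) similarity of queries to keys
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # masked positions get ~zero weight
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 16))
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.sum(axis=-1))
```

Each output row is a weighted average of the rows of V, with weights given by how well that token's query matches every key.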

The Full Transformer Block

┌─────────────────────────────────────────────┐
│              Transformer Block              │
│                                             │
│  Input x ──────────────────────────────┐    │
│      ↓                                 │    │
│  MultiHead Attention (x, x, x)         │    │
│      ↓                                 │    │
│  Add & LayerNorm ←─────────────────────┘    │
│      ↓                                      │
│  h = first Add & Norm output ──────────┐    │
│      ↓                                 │    │
│  Feed-Forward Network                  │    │
│   (Linear → ReLU → Linear)             │    │
│      ↓                                 │    │
│  Add & LayerNorm ←─────────────────────┘    │
│      ↓                                      │
│  Output (same shape as Input)               │
└─────────────────────────────────────────────┘

Why Add & Norm?

Residual connection (x + sublayer(x)): prevents vanishing gradients, allows gradients to flow directly from output to early layers
Layer normalization: normalizes across the feature dimension (not batch), stabilizes training regardless of sequence length
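
A minimal NumPy sketch of the Add & Norm step, in the post-norm arrangement drawn above. It omits the learned gain and bias that real LayerNorm layers carry, and the ReLU projection is only a stand-in sublayer.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token vector (last axis) to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer):
    """Post-norm residual step: LayerNorm(x + sublayer(x))."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))     # 5 tokens, 8 features each
W = rng.normal(size=(8, 8))
y = add_and_norm(x, lambda h: np.maximum(h @ W, 0.0))   # stand-in sublayer
print(y.shape, y.mean(axis=-1).round(6))
```

Because the statistics are computed per token over the feature axis, the result does not depend on batch size or sequence length.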

Multi-head attention runs H attention heads in parallel, each with different learned projections. Each head can specialize — one might track syntactic dependencies, another coreference, another semantic similarity. Outputs are concatenated and projected:

MultiHead(Q, K, V) = Concat(head_1, ..., head_H) × W^O
where head_i = Attention(Q × W_i^Q, K × W_i^K, V × W_i^V)
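
The per-head projections are usually packed into one matrix per role and split by reshaping, which is equivalent to keeping H separate matrices. A minimal self-attention sketch under that assumption; the weights are random and all shapes are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads):
    """Self-attention over x of shape (n, d_model); every weight matrix is (d_model, d_model)."""
    n, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(t):
        # (n, d_model) -> (num_heads, n, d_head)
        return t.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(x @ W_q), split_heads(x @ W_k), split_heads(x @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)     # (num_heads, n, n)
    heads = softmax(scores) @ V                             # (num_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)   # concatenate the heads
    return concat @ W_o                                     # final output projection

rng = np.random.default_rng(0)
d_model, num_heads, n = 16, 4, 6
x = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
out = multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads)
print(out.shape)
```

Each head works in a d_model / H dimensional subspace, so the total cost is close to that of a single full-width head.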

BERT vs GPT architecture:

| Property | BERT (Encoder-only) | GPT (Decoder-only) |
|---|---|---|
| Attention | Bidirectional (full) | Causal (masked) |
| Training | Masked Language Model | Next-token prediction |
| Best for | Classification, NER, QA | Text generation |
| Example | distilbert-base-uncased | gpt2, llama-3 |
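
The attention row of this comparison comes down to which mask is applied before the softmax. A small sketch with uniform scores, chosen purely for illustration so the effect of the mask is easy to read:

```python
import numpy as np

n = 4
bidirectional_mask = np.ones((n, n), dtype=bool)     # BERT-style: every position is visible
causal_mask = np.tril(np.ones((n, n), dtype=bool))   # GPT-style: position i sees only j <= i

scores = np.zeros((n, n))                            # uniform scores for illustration
masked = np.where(causal_mask, scores, -np.inf)      # hide future positions
weights = np.exp(masked) / np.exp(masked).sum(axis=-1, keepdims=True)
print(weights)
```

With the causal mask, the first token attends only to itself and the last token spreads its weight evenly over all four positions; with the bidirectional mask every row would be uniform.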

Part 5: HuggingFace Transformers

The transformers library provides a consistent API over thousands of pre-trained models. The AutoClass pattern loads the right architecture from a model name:

AutoTokenizer and Input Encoding

python

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Single sentence
text = "The model didn't converge during training."
encoding = tokenizer(
    text,
    return_tensors="pt",      # PyTorch tensors
    padding=True,
    truncation=True,
    max_length=128,
)
print(encoding.keys())
# dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

# input_ids: integer token IDs
# attention_mask: 1 for real tokens, 0 for padding
# token_type_ids: 0 for sentence A, 1 for sentence B (BERT sentence pairs)

ids = encoding["input_ids"][0].tolist()
tokens = tokenizer.convert_ids_to_tokens(ids)
print(tokens)
# ['[CLS]', 'the', 'model', 'didn', "'", 't', 'converge', 'during', 'training', '.', '[SEP]']

# Batch encoding (handles padding automatically)
batch = tokenizer(
    ["Short text.", "A longer piece of text that needs padding."],
    return_tensors="pt",
    padding=True,       # Pad shorter sequences
    truncation=True,
    max_length=32,
)
print(batch["input_ids"].shape)   # torch.Size([2, 12])

# WordPiece subword splitting (BERT's tokenizer; rare words break into known pieces)
rare = "Supercalifragilisticexpialidocious and transformerization"
toks = tokenizer.tokenize(rare)
print(toks)
# ['super', '##cali', '##fra', '##gil', '##istic', '##ex', '##pia', ...]

Pipeline API — Inference in One Line

python

from transformers import pipeline

# Sentiment analysis
classifier = pipeline("sentiment-analysis",
                      model="distilbert-base-uncased-finetuned-sst-2-english")
results = classifier([
    "This movie was absolutely fantastic!",
    "I hated every minute of this terrible film.",
])
# [{'label': 'POSITIVE', 'score': 0.9998}, {'label': 'NEGATIVE', 'score': 0.9994}]

# Named Entity Recognition
ner = pipeline("ner",
               model="dslim/bert-base-NER",
               aggregation_strategy="simple")
entities = ner("Elon Musk founded SpaceX in California.")
# [{'entity_group': 'PER', 'word': 'Elon Musk', 'score': 0.998},
#  {'entity_group': 'ORG', 'word': 'SpaceX', 'score': 0.993}, ...]

# Zero-shot classification (no fine-tuning needed)
zero_shot = pipeline("zero-shot-classification",
                     model="facebook/bart-large-mnli")
result = zero_shot(
    "The interest rate hike will slow economic growth.",
    candidate_labels=["finance", "sports", "technology", "politics"],
)
# {'labels': ['finance', 'politics', 'technology', 'sports'],
#  'scores': [0.89, 0.07, 0.02, 0.01]}

# Summarization
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
text = "..." * 500   # Placeholder; substitute a real long article here
summary = summarizer(text, max_length=150, min_length=40, do_sample=False, truncation=True)
print(summary[0]["summary_text"])

# Question answering
qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad")
result = qa(
    question="What year did BERT come out?",
    context="BERT was published in 2018 by Google. It transformed NLP tasks."
)
# {'answer': '2018', 'score': 0.94, 'start': 22, 'end': 26}

# Text generation (GPT-style)
generator = pipeline("text-generation", model="gpt2")
output = generator("Deep learning is", max_new_tokens=50, num_return_sequences=2)

# Fill-mask (BERT-style)
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
result = fill_mask("Paris is the [MASK] of France.")
# [{'sequence': 'Paris is the capital of France.', 'score': 0.97, ...}]

# Translation
translator = pipeline("translation_en_to_fr",
                       model="Helsinki-NLP/opus-mt-en-fr")
result = translator("Machine learning transforms everything.")
print(result[0]["translation_text"])  # "L'apprentissage automatique transforme tout."

Part 6: Fine-Tuning with HuggingFace

When to Fine-tune vs Prompt Engineer

| Approach | When to use | Data needed | Cost |
|---|---|---|---|
| Zero-shot prompting | Flexible task, LLM available | None | API cost |
| Few-shot prompting | Need format control | 5-20 examples | API cost |
| Fine-tuning small model | High volume, low latency | 1K-100K examples | GPU time |
| Fine-tuning large model | Specialized domain | 10K+ examples | Significant |

Rule of thumb: Try prompting first. Only fine-tune when:

You need consistent output format (prompting is unreliable)
You're making millions of API calls (cost prohibitive)
Your domain is highly specialized (medical, legal, scientific)
You need sub-100ms latency

Complete Fine-tuning Pipeline

python

from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding,
)
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# ── 1. Load dataset ────────────────────────────────────────────────────────────
dataset = load_dataset("imdb")  # 50K movie reviews, binary sentiment

# ── 2. Tokenize ────────────────────────────────────────────────────────────────
checkpoint = "distilbert-base-uncased"   # 66M params, 40% smaller than BERT
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_fn(batch):
    return tokenizer(
        batch["text"],
        truncation=True,
        max_length=256,    # Truncate at 256 tokens to save memory
    )

tokenized = dataset.map(tokenize_fn, batched=True, batch_size=1000,
                        remove_columns=["text"])
tokenized = tokenized.rename_column("label", "labels")

# ── 3. Data collator (dynamic padding per batch, more efficient) ───────────────
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# ── 4. Load model ──────────────────────────────────────────────────────────────
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=2,
    id2label={0: "negative", 1: "positive"},
    label2id={"negative": 0, "positive": 1},
)

# ── 5. Metrics ─────────────────────────────────────────────────────────────────
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1":       f1_score(labels, preds, average="binary"),
    }

# ── 6. Training arguments ──────────────────────────────────────────────────────
args = TrainingArguments(
    output_dir="imdb-distilbert",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,              # Standard for fine-tuning: 1e-5 to 5e-5
    warmup_ratio=0.1,               # Warm up for 10% of steps
    weight_decay=0.01,              # Decoupled weight decay (AdamW)
    eval_strategy="epoch",          # Named evaluation_strategy before transformers 4.41
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    fp16=True,                      # Mixed precision; requires a CUDA GPU
    dataloader_num_workers=4,
    report_to="none",               # Disable wandb/tensorboard for demo
)

# ── 7. Trainer ─────────────────────────────────────────────────────────────────
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"].select(range(10000)),  # Subset for demo
    eval_dataset=tokenized["test"].select(range(2000)),
    processing_class=tokenizer,     # Named tokenizer= before transformers 4.46
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
# Illustrative log (exact numbers vary by run, seed, and hardware):
# Epoch 1/3: loss=0.28, accuracy=0.898, f1=0.897
# Epoch 2/3: loss=0.19, accuracy=0.921, f1=0.920
# Epoch 3/3: loss=0.14, accuracy=0.931, f1=0.930

# ── 8. Save and push to hub ────────────────────────────────────────────────────
trainer.save_model("imdb-distilbert-final")
tokenizer.save_pretrained("imdb-distilbert-final")

# Push to HuggingFace Hub
# model.push_to_hub("your-username/imdb-distilbert")
# tokenizer.push_to_hub("your-username/imdb-distilbert")

Parameter-Efficient Fine-Tuning (PEFT/LoRA)

For large models (7B+ parameters), full fine-tuning is often impractical because weights, gradients, and optimizer state must all fit in GPU memory. LoRA (Low-Rank Adaptation) freezes the base model and trains only small low-rank adapter matrices:

python

from peft import LoraConfig, get_peft_model, TaskType

config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=16,               # Rank of adapter matrices (lower = fewer params)
    lora_alpha=32,      # Scaling factor
    lora_dropout=0.05,
    target_modules=["q_lin", "k_lin", "v_lin"],   # Which layers to adapt
)

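# `model` is the DistilBERT classifier loaded in the fine-tuning pipeline above
peft_model = get_peft_model(model, config)
peft_model.print_trainable_parameters()
# Prints trainable vs. total parameter counts. The LoRA matrices alone add
# 6 layers × 3 modules × 2 × (768 × 16) = 442,368 weights, under 1% of the ~67M total;
# for SEQ_CLS, PEFT also keeps the classification head trainable.

# Train with standard Trainer (same API as full fine-tuning).
# LoRA trains a small fraction of the parameters and often recovers most of
# full fine-tuning performance.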
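
What the adapter matrices do can be sketched without any library. The example below assumes a single 768 × 768 weight and the same r and alpha as the config above; the initialization scheme (small random A, zero B) follows the common convention and is stated here as an assumption.

```python
import numpy as np

d_in, d_out, r, alpha = 768, 768, 16, 32
rng = np.random.default_rng(0)

W = rng.normal(size=(d_out, d_in))        # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01     # trainable, small random init
B = np.zeros((d_out, r))                  # trainable, zero init so the update starts at 0

def lora_forward(x):
    """y = W x + (alpha / r) * B A x; only A and B would receive gradients."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
full_params = W.size
lora_params = A.size + B.size
print(full_params, lora_params, f"{100 * lora_params / full_params:.2f}%")
```

For this one matrix the adapter holds about 4% as many values as W; the fraction of the whole model is much smaller because most weights (embeddings, feed-forward layers) receive no adapter at all.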

Part 7: Working with LLM APIs

Groq and OpenAI-Compatible APIs

python

from groq import Groq
import json
import time

client = Groq()  # Reads GROQ_API_KEY from env

# ── Basic completion ───────────────────────────────────────────────────────────
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "system", "content": "You are an expert NLP engineer."},
        {"role": "user",   "content": "Explain BPE tokenization in 3 bullet points."},
    ],
    temperature=0.3,    # Low temperature for factual/technical content
    max_tokens=300,
)
text = response.choices[0].message.content
cost_in = response.usage.prompt_tokens
cost_out = response.usage.completion_tokens
print(f"Response ({cost_in}+{cost_out} tokens):\n{text}")

# ── Prompt engineering patterns ────────────────────────────────────────────────
# 1. Zero-shot
zero_shot = [{"role": "user", "content": "Classify: 'Shipment delayed by storm.' → category:"}]

# 2. Few-shot (3 in-context examples demonstrate the label format; no training happens)
few_shot = [
    {"role": "system",    "content": "Classify customer support tickets."},
    {"role": "user",      "content": "My login stopped working yesterday."},
    {"role": "assistant", "content": "AUTH_ISSUE"},
    {"role": "user",      "content": "Invoice shows wrong amount charged."},
    {"role": "assistant", "content": "BILLING_ISSUE"},
    {"role": "user",      "content": "The dashboard loads slowly."},
    {"role": "assistant", "content": "PERFORMANCE_ISSUE"},
    {"role": "user",      "content": "I can't export my report to PDF."},  # New query
]

# 3. Chain-of-thought (often improves multi-step reasoning; gains vary by model and task)
cot = """
<task>Classify the sentiment of this review.</task>
<review>The product works well overall but the documentation is confusing.</review>
<thinking>
Think step by step:
1. Positive signals: "works well", "overall"
2. Negative signals: "documentation is confusing"
3. The positive aspects describe core functionality, negative is secondary
4. Overall tone: mixed, slightly positive
</thinking>
<output>{"sentiment": "mixed", "positive_score": 0.6, "aspects": ["functionality+", "docs-"]}</output>
"""

# ── JSON mode (structured output) ─────────────────────────────────────────────
struct_response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "system", "content": "Extract entities. Return JSON only."},
        {"role": "user",   "content": "Elon Musk's SpaceX launched Falcon 9 from Cape Canaveral."},
    ],
    response_format={"type": "json_object"},
)
data = json.loads(struct_response.choices[0].message.content)
# {"persons": ["Elon Musk"], "organizations": ["SpaceX"], "rockets": ["Falcon 9"], ...}

# ── Retry with exponential backoff ────────────────────────────────────────────
def call_with_retry(client, messages, model="llama-3.3-70b-versatile",
                    max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except Exception as e:
            if attempt == max_retries - 1:
                raise
            wait = 2 ** attempt   # 1s, 2s, 4s
            print(f"Attempt {attempt+1} failed: {e}. Retrying in {wait}s...")
            time.sleep(wait)

# ── Embeddings API ─────────────────────────────────────────────────────────────
# (Using OpenAI for embeddings; Groq did not offer an embeddings endpoint when this lesson was written)
from openai import OpenAI
openai_client = OpenAI()

def embed(texts, model="text-embedding-3-small"):
    response = openai_client.embeddings.create(input=texts, model=model)
    return [item.embedding for item in response.data]

doc_embeddings = embed(["Python is great", "I love coding", "The market crashed"])
# Each embedding: list of 1536 floats
# Semantic search: cosine_similarity(query_embedding, doc_embeddings)
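
The semantic search step mentioned in the last comment can be sketched with NumPy. The three-dimensional vectors below are toy stand-ins for real 1536-dimensional embeddings, and the function names are assumptions.

```python
import numpy as np

def normalize_rows(m):
    """L2-normalize each row so that dot products equal cosine similarities."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)

def top_k(query_vec, doc_matrix_normed, k=2):
    """Indices and similarities of the k documents closest to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    sims = doc_matrix_normed @ q              # one dot product per document
    order = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in order]

# Toy vectors standing in for real embeddings (assumption for illustration).
docs = np.array([[0.9, 0.1, 0.0],
                 [0.8, 0.3, 0.1],
                 [0.0, 0.2, 0.9]])
docs_normed = normalize_rows(docs)            # precompute once, reuse for every query
hits = top_k(np.array([1.0, 0.2, 0.0]), docs_normed)
print(hits)
```

Normalizing the document matrix once means each query costs a single matrix-vector product, which is what makes brute-force search practical for collections of moderate size.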

Part 8: Production NLP Patterns

Batched Inference and Quantization

python

# ── Batched inference: process many texts efficiently ─────────────────────────
from transformers import pipeline
import torch

# Use batch_size parameter for throughput
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=0 if torch.cuda.is_available() else -1,
    batch_size=32,    # Process 32 texts per forward pass
)

texts = ["text1", "text2"] * 500   # 1000 texts
results = classifier(texts)   # Batching usually raises GPU throughput; measure on your hardware

# ── Quantization with bitsandbytes ────────────────────────────────────────────
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# INT8 quantization: ~2x memory reduction, minimal quality loss
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

# INT4 quantization: ~4x memory reduction
quantization_config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,   # Nested quantization
    bnb_4bit_quant_type="nf4",        # NormalFloat4 — better than uniform INT4
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    quantization_config=quantization_config_4bit,
    device_map="auto",
)
# ~3B parameters: about 6 GB of weights in FP16 (2 bytes each) → about 2 GB in 4-bit

# ── Embedding caching ─────────────────────────────────────────────────────────
import hashlib
import json
import os

class EmbeddingCache:
    def __init__(self, cache_dir=".embedding_cache"):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _key(self, text, model):
        return hashlib.md5(f"{model}:{text}".encode()).hexdigest()

    def get(self, text, model):
        path = os.path.join(self.cache_dir, self._key(text, model) + ".json")
        if os.path.exists(path):
            with open(path) as f:
                return json.load(f)
        return None

    def set(self, text, model, embedding):
        path = os.path.join(self.cache_dir, self._key(text, model) + ".json")
        with open(path, "w") as f:
            json.dump(embedding, f)

# ── Long document handling: chunking strategies ───────────────────────────────
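def chunk_fixed(text, chunk_size=512, overlap=50):
    """Fixed-size chunking with overlap; sizes are in whitespace-separated words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunks.append(" ".join(words[i:i + chunk_size]))
        if i + chunk_size >= len(words):
            break   # Last window reached the end; avoid a redundant tail chunk
    return chunks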

def chunk_sentences(text, max_tokens=200):
    """Sentence-boundary chunking; max_tokens is counted in whitespace-separated words."""
    import re
    sentences = re.split(r'(?<=[.!?])\s+', text)
    chunks = []
    current = []
    current_len = 0
    for sent in sentences:
        sent_len = len(sent.split())
        if current_len + sent_len > max_tokens and current:
            chunks.append(" ".join(current))
            current = [sent]
            current_len = sent_len
        else:
            current.append(sent)
            current_len += sent_len
    if current:
        chunks.append(" ".join(current))
    return chunks

PROJECT: Complete Text Classification System

Build a full text classification system from scratch — no sklearn, no transformers. Pure Python + math.

Exercises

Exercise 1 (Pyodide): Extend the NaiveBayesClassifier above to output a probability distribution across classes using the softmax of log-probabilities, not just the argmax. Test it on the 4-class dataset.

Exercise 2 (Pyodide): Implement a bigram language model. Given a training corpus, compute P(word | previous_word) for all word pairs. Add Laplace smoothing. Use it to compute the perplexity of a test sentence.

Exercise 3 (Pyodide): Build a simple keyword extraction system using TF-IDF scores. For a given document, extract the top-N most distinctive words/phrases by comparing the document's TF-IDF against a background corpus.

Exercise 4 (Pyodide): Implement the Jaccard similarity between two documents (using token sets). Compare it to cosine similarity on TF-IDF vectors. When do they agree/disagree?

Exercise 5 (Full library): Load bert-base-uncased with HuggingFace transformers. Tokenize 100 sentences from a dataset of your choice. Visualize attention weights from the last layer using matplotlib. Which token pairs have the highest attention?

Exercise 6 (Full library): Fine-tune distilbert-base-uncased on the AG News dataset (4 topics: World, Sports, Business, Sci/Tech). Target >90% accuracy. Report per-class F1, confusion matrix, and 5 examples the model gets wrong.

Exercise 7 (Full library): Use the sentence-transformers library with model all-MiniLM-L6-v2 to build a semantic search engine over a Wikipedia paragraph dump. Given a query, find the top-5 most semantically similar paragraphs.

Exercise 8 (Full library): Use Groq's API to build a zero-shot text classifier that categorizes news headlines into 6 categories without any training data. Compare its accuracy to your Naive Bayes classifier from Exercise 1 on the same test set.

Key Takeaways

Build a preprocessing pipeline once and reuse it: normalization, tokenization, stop-word removal, and stemming are the usual first steps for classical models, while transformer models expect raw text passed through their own tokenizer
TF-IDF + Naive Bayes often reaches 85–95% accuracy on well-separated topics; always benchmark it before using transformers
Attention computes softmax(QKᵀ/√d_k)V, the core operation of every transformer-based NLP model; understanding its shape transformations is non-negotiable
BERT is bidirectional (reads full context), GPT is autoregressive (reads left-to-right only): choose BERT-style encoders for classification, GPT-style decoders for generation
Subword tokenization (BPE/WordPiece) largely removes the OOV problem by splitting unseen words into known pieces; only byte-level BPE can encode every input, while WordPiece still falls back to [UNK] for characters outside its vocabulary
Fine-tuning typically needs only 2–3 epochs on domain data; training much longer risks overfitting and eroding pretrained knowledge (catastrophic forgetting)
LoRA trains roughly 0.1–1% of the parameters and often comes close to full fine-tuning quality; consider it for models above about 1B parameters
Cosine similarity on L2-normalized embeddings reduces to a dot product — precompute normalization once for fast batch search
