# Python for Data Work — the Essential Toolkit
Python became the dominant language for data work not because it is the fastest language at runtime, but because it sits at the intersection of ease of use, a massive ecosystem, and deep integration with optimised compiled libraries. Understanding why that ecosystem works the way it does will make every subsequent lesson click faster.
## The Core Data Stack
The Python data ecosystem is built in layers. At the bottom sits NumPy, which provides a fast, contiguous-memory array type and the mathematical operations to work with it. Above that sits Pandas, which wraps NumPy arrays into labelled tables called DataFrames. Visualisation libraries like Matplotlib and Seaborn consume Pandas structures directly. Machine learning frameworks like scikit-learn and XGBoost speak the same array interface.
| Library | Primary Role | Typical Import |
|---|---|---|
| NumPy | N-dimensional arrays, linear algebra | `import numpy as np` |
| Pandas | Labelled tables, time series, I/O | `import pandas as pd` |
| Matplotlib | Low-level plotting | `import matplotlib.pyplot as plt` |
| Seaborn | Statistical visualisation | `import seaborn as sns` |
| SciPy | Scientific computing, stats | `import scipy.stats as stats` |
| scikit-learn | Machine learning | `from sklearn import ...` |
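A minimal sketch of how the layers fit together (the values and labels here are illustrative): NumPy supplies the raw array, Pandas wraps it with labels, and the same buffer remains accessible to any library that speaks the array interface.

```python
import numpy as np
import pandas as pd

# NumPy provides the contiguous-memory array
values = np.random.default_rng(0).normal(size=5)

# Pandas wraps the same buffer in a labelled structure
s = pd.Series(values, index=list("abcde"), name="sample")

# Downstream libraries can recover the underlying NumPy array directly
arr = s.to_numpy()
print(type(arr))  # <class 'numpy.ndarray'>
```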
## Why Vectorised Operations Beat Loops
Python loops are slow for numerical work because each iteration involves Python object overhead — type checking, reference counting, and memory allocation. NumPy and Pandas operations are implemented in C and operate on entire arrays in one call, bypassing that overhead entirely.
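A minimal comparison illustrating the point (the array size is arbitrary; actual speedups depend on your machine):

```python
import numpy as np

a = np.arange(100_000, dtype=np.float64)

# Pure-Python loop: each iteration boxes a float object and
# dispatches the multiply and add dynamically
def loop_square_sum(arr):
    total = 0.0
    for x in arr:
        total += x * x
    return total

# Vectorised: a single call that runs a compiled loop over the buffer
def numpy_square_sum(arr):
    return float(np.sum(arr * arr))

# Same result; in a notebook, compare speeds with
# %timeit loop_square_sum(a)  versus  %timeit numpy_square_sum(a)
print(loop_square_sum(a) == numpy_square_sum(a))  # True
```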
On a typical machine the NumPy version is 20–100x faster. The difference grows with array size because NumPy also benefits from CPU SIMD instructions and cache-friendly memory access patterns.
The same principle applies to Pandas. Avoid iterating rows with `for _, row in df.iterrows()` whenever a vectorised method exists:
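A small sketch of the two approaches (the column names and values are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# Row-by-row: slow, because each `row` is a freshly constructed Series
totals_loop = [row["price"] * row["qty"] for _, row in df.iterrows()]

# Vectorised: one column-wise multiplication in compiled code
df["total"] = df["price"] * df["qty"]

print(df["total"].tolist())  # [10.0, 40.0, 90.0]
```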
## Setting Up a Reproducible Environment
Reproducibility is a professional requirement. Other analysts — and future you — must be able to run your notebook on a different machine and get identical results. Use a virtual environment to isolate dependencies.
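A typical `venv` workflow looks like this (the package list is illustrative; the activation command shown is for macOS/Linux — on Windows use `.venv\Scripts\activate`):

```shell
# Create and activate an isolated environment
python -m venv .venv
source .venv/bin/activate

# Install dependencies, then pin the exact versions that were resolved
pip install numpy pandas matplotlib
pip freeze > requirements.txt

# On another machine, reproduce the environment from the pinned file
pip install -r requirements.txt
```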
For package management at scale, many teams use conda or mamba instead of pip; these tools resolve binary dependencies more reliably:
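A sketch of the equivalent conda workflow (the environment name and package list are illustrative):

```shell
# Create and activate a named environment with pinned Python and packages
conda create -n analysis python=3.11 numpy pandas
conda activate analysis

# Export the resolved environment so teammates can recreate it
conda env export > environment.yml

# mamba is a faster drop-in replacement for the same commands
mamba create -n analysis python=3.11 numpy pandas
```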
Inside a Jupyter notebook, always declare your imports and any global settings at the top of the first cell:
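A first cell might look like this (the specific display-option values are illustrative preferences, not requirements):

```python
# First cell: all imports and global settings in one place
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Seed randomness so reruns produce identical results
rng = np.random.default_rng(42)

# Display settings applied once, up front
pd.set_option("display.max_columns", 50)
pd.set_option("display.float_format", "{:.3f}".format)
```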
## The Jupyter Workflow
JupyterLab is the standard interactive environment for data work. It lets you mix code, output, and prose in a single document. Key habits to build from day one:
- Restart & Run All before sharing — ensures cells execute top-to-bottom without hidden state.
- Use Markdown cells to annotate findings as you go; analysis without explanation is just computation.
- Keep notebooks short and focused. One analysis question per notebook.
## Summary
- Python's data ecosystem is layered: NumPy at the base, Pandas on top, visualisation and ML libraries consuming both.
- Vectorised operations over arrays are 20–100x faster than equivalent Python loops because they delegate to compiled C code.
- A virtual environment (`venv` or `conda`) combined with a pinned `requirements.txt` guarantees reproducible results across machines.
- Always seed random number generators and set display options at the top of every notebook.