Lesson 2

NumPy Fundamentals

NumPy's ndarray is the foundation of almost everything in the Python data stack. Pandas Series and DataFrames wrap ndarrays internally. scikit-learn's transformers consume and produce ndarrays. Understanding how the ndarray works — how memory is laid out, how indexing maps to that memory, how shapes transform during operations — will make every library you use on top of it less mysterious.

Creating Arrays

There are many construction paths, each suited to a different situation:

python

import numpy as np

# From a Python list
a = np.array([1, 2, 3, 4, 5])

# Evenly spaced values
linspace = np.linspace(0, 1, 6)      # [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
arange  = np.arange(0, 10, 2)        # [0, 2, 4, 6, 8]

# Special matrices
zeros = np.zeros((3, 4))             # 3x4 of 0.0
ones  = np.ones((2, 3), dtype=int)   # 2x3 of 1
eye   = np.eye(3)                    # 3x3 identity
rand  = np.random.default_rng(42).standard_normal((100, 5))  # 100x5 normal

The dtype parameter controls how each element is stored in memory. Choosing a narrower dtype can halve memory usage or better: float32 takes half the bytes of float64, and uint8 takes one-eighth.

| dtype | Bytes | Use case |
|---|---|---|
| float64 | 8 | Default float for general use |
| float32 | 4 | Neural network weights, image data |
| int64 | 8 | Large integer counts |
| int32 | 4 | Moderate integer ranges |
| bool | 1 | Masks and filters |
| uint8 | 1 | Pixel values (0–255) |

python

img = np.zeros((1080, 1920, 3), dtype=np.uint8)   # ~6 MB; the float64 default would take ~50 MB
print(img.nbytes)   # 6_220_800
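
The same effect can be checked on any array. The following is a minimal sketch, assuming a hypothetical array of one million values; the variable names are illustrative only. It also shows what a narrower dtype gives up in exchange for the smaller footprint.

```python
import numpy as np

# Hypothetical data: one million values stored at three different widths.
n = 1_000_000
as_f64 = np.zeros(n, dtype=np.float64)
as_f32 = as_f64.astype(np.float32)   # astype returns a converted copy
as_u8 = np.zeros(n, dtype=np.uint8)

print(as_f64.nbytes)   # 8000000
print(as_f32.nbytes)   # 4000000
print(as_u8.nbytes)    # 1000000

# Narrow dtypes trade precision and range for space.
# float32 keeps roughly 7 significant decimal digits, so 0.1 is stored differently.
print(np.float32(0.1) == np.float64(0.1))   # False

# uint8 array arithmetic wraps around modulo 256 instead of raising.
wrapped = np.array([250], dtype=np.uint8) + np.array([10], dtype=np.uint8)
print(wrapped)   # [4]
```

A reasonable habit is to pick the narrow dtype only when the value range and precision are known in advance, as with 0–255 pixel data.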

Indexing and Slicing

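NumPy supports three indexing styles (basic slicing, integer array indexing, and Boolean indexing), plus np.where for vectorised conditional selection: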

python

a = np.arange(12).reshape(3, 4)
# array([[ 0,  1,  2,  3],
#        [ 4,  5,  6,  7],
#        [ 8,  9, 10, 11]])

# Basic slicing — returns a VIEW (no copy)
row1 = a[1, :]      # array([4, 5, 6, 7])
col2 = a[:, 2]      # array([2, 6, 10])
sub  = a[0:2, 1:3]  # 2x2 submatrix

# Integer array indexing — returns a COPY
rows = a[[0, 2], :]           # first and third rows
fancy = a[[0, 1], [2, 3]]     # a[0,2] and a[1,3] → [2, 7]

# Boolean indexing — returns a COPY
mask  = a > 5
above = a[mask]     # array([6, 7, 8, 9, 10, 11])

# np.where — vectorised conditional
result = np.where(a % 2 == 0, a, -1)   # even → keep, odd → -1

The view vs copy distinction matters for memory and for side effects. Mutating a slice mutates the original array:

python

b = a[0, :]
b[0] = 999
print(a[0, 0])   # 999 — a was modified too!

Use .copy() explicitly when you need an independent array.
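
Whether two arrays share a buffer can be tested rather than guessed. This is a small sketch using np.shares_memory on the same 3x4 array as above; the names view, indep, and picked are illustrative.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

view = a[0, :]            # basic slice: shares a's buffer
indep = a[0, :].copy()    # explicit copy: owns its own buffer
picked = a[[0, 2], :]     # integer array indexing: a copy

print(np.shares_memory(a, view))     # True
print(np.shares_memory(a, indep))    # False
print(np.shares_memory(a, picked))   # False

indep[0] = 999            # does not touch a
print(a[0, 0])            # 0

# Assigning THROUGH a mask still writes into the original array,
# because a[mask] = value is an in-place assignment, not a read.
a[a > 9] = -1
print(a[2])               # [ 8  9 -1 -1]
```

The copy rule applies to reading with a fancy or Boolean index. Using the same index on the left-hand side of an assignment modifies the original in place.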

Broadcasting

Broadcasting is the mechanism that lets NumPy apply element-wise operations between arrays of different shapes without first building replicated copies of the smaller array. The shapes are compared dimension by dimension, starting from the right:

If the arrays have different numbers of dimensions, the shape of the one with fewer dimensions is padded with 1s on the left.
Dimensions of size 1 are stretched to match the other array's size in that dimension.
If two corresponding dimensions differ and neither is 1, the shapes are incompatible and NumPy raises a ValueError.

python

# Subtract a 1-D vector of column means from each row of a 2-D matrix
matrix = np.random.rand(100, 4)    # shape (100, 4)
means  = matrix.mean(axis=0)       # shape (4,): one mean per column

centered = matrix - means          # means is broadcast to (100, 4)
print(centered.mean(axis=0))       # ~[0, 0, 0, 0]

# Outer product via broadcast
row = np.array([[1, 2, 3]])        # shape (1, 3)
col = np.array([[10], [20], [30]]) # shape (3, 1)
outer = row * col
# array([[10, 20, 30],
#        [20, 40, 60],
#        [30, 60, 90]])
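
The broadcasting rules can be checked on shapes alone. The sketch below assumes NumPy 1.20 or later (for np.broadcast_shapes) and uses small made-up arrays. It also shows the failure case and the common fix for row-wise operations.

```python
import numpy as np

# np.broadcast_shapes applies the rules to shapes, without any data.
print(np.broadcast_shapes((100, 4), (4,)))      # (100, 4)
print(np.broadcast_shapes((1, 3), (3, 1)))      # (3, 3)
print(np.broadcast_shapes((8, 1, 6), (7, 1)))   # (8, 7, 6)

# A broadcast operand is not replicated in memory: the stretched axis
# gets stride 0, so every row reuses the same 32 bytes.
stretched = np.broadcast_to(np.array([1.0, 2.0, 3.0, 4.0]), (100, 4))
print(stretched.strides)                        # (0, 8)

# Trailing dimensions 4 and 100 differ and neither is 1: incompatible.
try:
    np.ones((100, 4)) - np.ones(100)
except ValueError as exc:
    print("ValueError:", exc)

# Subtracting per-row means needs a trailing axis of length 1.
matrix = np.arange(12.0).reshape(3, 4)
row_means = matrix.mean(axis=1)                 # shape (3,)
centered = matrix - row_means[:, np.newaxis]    # (3, 4) - (3, 1)
print(centered.mean(axis=1))                    # [0. 0. 0.]
```

The result of a broadcast operation is still a newly allocated array. What broadcasting avoids is the intermediate tiled copy of the smaller operand.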

Memory Layout: C vs Fortran Order

Arrays can be stored row-major (C order, the default) or column-major (Fortran order). Row-major means elements of a row are contiguous in memory; iterating row-by-row is cache-friendly. Column-major arrays are contiguous column-by-column, which makes column operations faster.

python

c_arr = np.array([[1, 2], [3, 4]], dtype=np.int64, order='C')
f_arr = np.array([[1, 2], [3, 4]], dtype=np.int64, order='F')

print(c_arr.flags['C_CONTIGUOUS'])   # True
print(f_arr.flags['F_CONTIGUOUS'])   # True

# Strides give the bytes to step along each axis (8-byte int64 elements here)
print(c_arr.strides)   # (16, 8): 16 bytes to the next row, 8 to the next column
print(f_arr.strides)   # (8, 16): 8 bytes to the next row, 16 to the next column

For most data analysis work you will never need to specify order explicitly. It becomes important when passing arrays to Fortran-based BLAS/LAPACK routines (used by np.linalg) or when writing high-performance extensions.
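
When a particular layout is required, it can be inspected and converted explicitly. The following sketch uses a small made-up int64 array and shows that transposing changes only the strides, while np.asfortranarray produces a column-major array with the same values.

```python
import numpy as np

c_arr = np.arange(6, dtype=np.int64).reshape(2, 3)   # C order by default

# Transposing swaps the strides instead of moving data, so the
# transpose of a C-contiguous array is an F-contiguous view.
t = c_arr.T
print(np.shares_memory(c_arr, t))                         # True
print(t.flags['C_CONTIGUOUS'], t.flags['F_CONTIGUOUS'])   # False True
print(c_arr.strides, t.strides)                           # (24, 8) (8, 24)

# np.asfortranarray returns a column-major array with the same values.
f_arr = np.asfortranarray(c_arr)
print(f_arr.flags['F_CONTIGUOUS'])     # True
print(np.array_equal(c_arr, f_arr))    # True

# ravel(order='K') reads elements in memory order, exposing the layout.
print(c_arr.ravel(order='K'))   # [0 1 2 3 4 5]
print(f_arr.ravel(order='K'))   # [0 3 1 4 2 5]
```

Indexing gives identical results for both layouts. Only the order of elements in the underlying buffer, and therefore the access cost, differs.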

Key Aggregation Functions

python

data = np.random.default_rng(0).normal(loc=5, scale=2, size=(1000, 3))

print(data.mean(axis=0))      # column means
print(data.std(axis=0))       # column standard deviations
print(data.min(), data.max()) # global min/max
print(np.percentile(data, [25, 50, 75], axis=0))  # quartiles per column

The axis parameter works the same way across aggregation functions: axis=0 collapses the row dimension and returns one result per column, while axis=1 collapses the column dimension and returns one result per row.
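
A quick way to confirm which axis was reduced is to look at the output shape. This is a small sketch on a made-up 3x4 array; it also uses keepdims, a standard argument of NumPy reductions, to keep the result compatible with the original array for broadcasting.

```python
import numpy as np

data = np.arange(12.0).reshape(3, 4)   # 3 rows, 4 columns

print(data.sum(axis=0).shape)   # (4,): one value per column
print(data.sum(axis=1).shape)   # (3,): one value per row
print(data.sum().shape)         # (): no axis given, everything is reduced

# keepdims=True keeps the reduced axis with length 1, so the result
# broadcasts back against the original array.
row_sums = data.sum(axis=1, keepdims=True)   # shape (3, 1)
shares = data / row_sums                     # each row divided by its own sum
print(shares.sum(axis=1))                    # each row sums to 1, up to rounding
```

The reduced axis is the one that disappears from the shape: reducing a (3, 4) array over axis=0 leaves (4,), and over axis=1 leaves (3,).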

Summary

ndarray is a fixed-type array backed by a single block of memory; views onto it may be strided rather than contiguous. Choosing the right dtype can cut memory consumption by half or more.
Slicing returns views (no memory copy); fancy and Boolean indexing return copies.
Broadcasting allows element-wise operations between arrays of compatible shapes without explicit replication.
Memory layout (C vs Fortran order) affects cache performance; the default C order is efficient for row-wise access patterns typical in data analysis.
Aggregation functions accept an axis argument: axis=0 reduces over rows (one result per column), axis=1 reduces over columns (one result per row).
