GadaaLabs
Python Mastery — From Zero to AI Engineering
Lesson 9

NumPy — Arrays, Broadcasting & Linear Algebra

28 min

Why NumPy Exists: The Python Performance Problem

Python lists are flexible — they can hold any object, any size, any type. But that flexibility has a cost. A Python list is an array of pointers, each pointing to a separate heap-allocated Python object. When you sum a list of integers, Python must:

  1. Follow a pointer to each integer object
  2. Unbox the C integer from the Python wrapper
  3. Add it
  4. Box the result into a new Python wrapper
  5. Manage garbage collection

NumPy sidesteps all of this. A NumPy array stores values as a contiguous block of raw C memory — 64-bit floats packed one after another, no pointers, no Python object overhead. Operations on this memory:

  • Are implemented in C (and Fortran for linear algebra)
  • Use SIMD (Single Instruction, Multiple Data) CPU instructions to process multiple elements simultaneously
  • Avoid the Python GIL for computation (only the coordination layer is Python)

The result: NumPy operations are typically 50–500x faster than equivalent Python loops.

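The gap is easy to see with a quick timing sketch. Exact numbers depend on your machine, but summing the same million integers both ways usually shows the contiguous-memory advantage clearly:

```python
import time
import numpy as np

n = 1_000_000
py_list = list(range(n))
np_arr = np.arange(n, dtype=np.int64)

# Python sum: follows a pointer per element, unboxes, adds, re-boxes
start = time.perf_counter()
total_py = sum(py_list)
py_time = time.perf_counter() - start

# NumPy sum: one C loop over a contiguous block of int64 values
start = time.perf_counter()
total_np = int(np_arr.sum())
np_time = time.perf_counter() - start

print(f"Python sum: {py_time * 1000:.2f} ms")
print(f"NumPy  sum: {np_time * 1000:.2f} ms")
print("results match:", total_py == total_np)
```

Note that converting between lists and arrays costs a full copy plus boxing/unboxing, which is why you should convert once at the boundary and stay in NumPy for the computation.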

Creating Arrays

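A sketch of the most common constructors — from Python lists, filled arrays, ranges, and the modern random `Generator` API:

```python
import numpy as np

a = np.array([1, 2, 3])               # from a Python list; dtype inferred (int)
z = np.zeros((2, 3))                  # 2x3 array of 0.0 (float64 by default)
o = np.ones(4, dtype=np.int32)        # explicit dtype
r = np.arange(0, 10, 2)               # like range(): [0 2 4 6 8]
l = np.linspace(0.0, 1.0, 5)          # 5 evenly spaced points, endpoints included
e = np.eye(3)                         # 3x3 identity matrix
rng = np.random.default_rng(seed=42)  # seeded random Generator
u = rng.random((2, 2))                # uniform floats in [0, 1)

print(a.dtype, z.shape, r, l)
```

Every array has a fixed `dtype`; mixing ints and floats in the input list silently promotes everything to float, so check `.dtype` when results look off.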

Indexing, Slicing, and Fancy Indexing

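A sketch of the three indexing styles on one small 2-D array. The key trap: slices are views that share memory, while fancy (integer-array) indexing returns copies:

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)

# Basic indexing and slicing (slices are VIEWS, not copies)
row0 = arr[0]           # first row -> [0 1 2 3]
col1 = arr[:, 1]        # second column -> [1 5 9]
view = arr[0:2, 0:2]    # 2x2 view into the top-left corner
view[0, 0] = 99         # writes through to arr!

# Fancy indexing with integer arrays (returns a COPY)
picked = arr[[0, 2], [1, 3]]   # elements at (0, 1) and (2, 3)

# Boolean masking
evens = arr[arr % 2 == 0]

print(arr[0, 0], picked, evens)
```

Use `arr[0:2, 0:2].copy()` when you need an independent sub-array.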

Universal Functions and Axis Operations

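A sketch of ufuncs (element-wise functions) and axis reductions on a (2, 3) array — note which dimension each `axis` argument collapses:

```python
import numpy as np

x = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

# Ufuncs apply element-wise and return a new array
print(np.sqrt(x))
print(np.exp(x))

# Reductions: axis=N collapses dimension N
print(x.sum())          # 21.0 — reduce over ALL elements
print(x.sum(axis=0))    # collapse rows    -> shape (3,): [5. 7. 9.]
print(x.sum(axis=1))    # collapse columns -> shape (2,): [ 6. 15.]
print(x.mean(axis=0))   # column means
print(x.argmax(axis=1)) # index of the max in each row
```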

Broadcasting: The Most Powerful (and Confusing) Feature

Broadcasting is the mechanism that lets NumPy operate on arrays of different shapes. Understanding it is essential for writing concise, loop-free NumPy code.

The Broadcasting Rules:

  1. If the arrays have different numbers of dimensions, pad the smaller shape on the left with 1s
  2. Dimensions of size 1 are stretched to match the other array's size in that dimension
  3. If shapes still don't match after stretching, raise an error
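The three rules can be traced through a small sketch — a (3, 4) array combined with a row vector, a column vector, and one incompatible shape:

```python
import numpy as np

a = np.arange(12).reshape(3, 4)        # shape (3, 4)
row = np.array([10, 20, 30, 40])       # shape (4,)
col = np.array([[100], [200], [300]])  # shape (3, 1)

# Rule 1: row's shape (4,) pads on the left to (1, 4)
# Rule 2: the size-1 dimension stretches to 3 -> effective (3, 4)
b = a + row

# (3, 1) against (3, 4): the size-1 dimension stretches to 4
c = a + col

# Rule 3: (3, 4) vs (3,) -> (3,) pads to (1, 3); 4 != 3, so it errors
try:
    a + np.array([1, 2, 3])
except ValueError as err:
    print("broadcast error:", err)

print(b.shape, c.shape)
```

A common gotcha follows from rule 1: a shape-(3,) vector broadcasts against (3, 4) as a *row*, not a column — reshape it to (3, 1) (or use `[:, np.newaxis]`) to broadcast down the columns.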

Reshaping and Stacking

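A sketch of the main reshape and stacking operations — `reshape` returns a view when the memory layout allows, and `-1` asks NumPy to infer that dimension:

```python
import numpy as np

a = np.arange(6)

m = a.reshape(2, 3)       # view with shape (2, 3)
m2 = a.reshape(-1, 2)     # -1 means "infer": (3, 2)
flat = m.ravel()          # back to 1-D (a view when contiguous)
t = m.T                   # transpose -> shape (3, 2)

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
v = np.vstack([x, y])           # stack as rows        -> (2, 3)
h = np.hstack([x, y])           # concatenate          -> (6,)
s = np.stack([x, y], axis=1)    # stack along new axis -> (3, 2)

print(m.shape, m2.shape, t.shape, v.shape, h.shape, s.shape)
```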

Linear Algebra

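A sketch of the core `np.linalg` routines, solving the small system 3x + y = 9, x + 2y = 8 directly rather than via the inverse:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

# Solve Ax = b directly — faster and more stable than inv(A) @ b
x = np.linalg.solve(A, b)
print(x)            # [2. 3.]
print(A @ x)        # matrix-vector product recovers b

det = np.linalg.det(A)              # determinant (5.0 here)
eigvals, eigvecs = np.linalg.eig(A) # eigenvalues and eigenvectors
nrm = np.linalg.norm(b)             # Euclidean norm of b

print(det, eigvals)
```

`@` (or `np.matmul`) is matrix multiplication; `*` on the same arrays would multiply element-wise — a different operation entirely.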

Performance: Vectorized vs Loop Benchmark

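A minimal benchmark sketch: Euclidean distance between two large vectors, once with a Python loop and once vectorized. The measured speedup varies by machine, but the two results agree to floating-point precision:

```python
import time
import numpy as np

def loop_distance(a, b):
    # Pure-Python loop: boxes every element into a Python float
    total = 0.0
    for x, y in zip(a, b):
        total += (x - y) ** 2
    return total ** 0.5

def vec_distance(a, b):
    # Vectorized: one subtraction, one square, one reduction — all in C
    return float(np.sqrt(np.sum((a - b) ** 2)))

n = 500_000
rng = np.random.default_rng(0)
a = rng.random(n)
b = rng.random(n)

t0 = time.perf_counter(); d1 = loop_distance(a, b); loop_t = time.perf_counter() - t0
t0 = time.perf_counter(); d2 = vec_distance(a, b); vec_t = time.perf_counter() - t0

print(f"loop: {loop_t * 1000:.1f} ms   vectorized: {vec_t * 1000:.1f} ms")
print(f"speedup: ~{loop_t / vec_t:.0f}x")
```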

PROJECT: Neural Network Forward Pass from Scratch

A neural network is just a sequence of matrix multiplications and non-linear functions. NumPy is all you need to implement one:

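A minimal sketch of a forward pass: a 2-layer network (the layer sizes 4 → 8 → 3 are arbitrary choices for illustration) built from `@`, broadcasting for the bias, and two ufunc-based activations:

```python
import numpy as np

rng = np.random.default_rng(42)

def relu(x):
    # Element-wise max(0, x) — a ufunc, so it broadcasts over any shape
    return np.maximum(0.0, x)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Weights and biases for 4 inputs -> 8 hidden units -> 3 output classes
W1 = rng.normal(0.0, 0.1, (4, 8)); b1 = np.zeros(8)
W2 = rng.normal(0.0, 0.1, (8, 3)); b2 = np.zeros(3)

def forward(X):
    h = relu(X @ W1 + b1)         # hidden layer: matmul + broadcast bias + ReLU
    return softmax(h @ W2 + b2)   # output layer: class probabilities per row

X = rng.random((5, 4))            # a batch of 5 samples, 4 features each
probs = forward(X)
print(probs.shape)                # (5, 3)
print(probs.sum(axis=1))          # each row sums to 1
```

The whole batch goes through in two matrix multiplies — no loop over samples — which is exactly the vectorization mindset the lesson builds toward.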

Key Takeaways

  • NumPy's speed comes from memory layout: contiguous C arrays plus SIMD CPU instructions, not Python-level tricks — this is why copying to a list and back is expensive
  • Slicing returns views, not copies: arr[0:5] shares memory with arr — modifying it modifies the original; use .copy() when you need independence
  • Boolean indexing is the most practical pattern: arr[arr > 0] is cleaner and faster than any loop filter, and arr[mask] = value is the idiomatic way to conditionally set values
  • Broadcasting follows three strict rules: pad left with 1s, stretch size-1 dimensions, error on incompatible sizes — once internalized, it replaces most explicit loops
  • axis=0 collapses the rows, axis=1 collapses the columns: a (3, 4) array summed on axis=0 gives shape (4,); summed on axis=1 gives shape (3,) — think "collapse along this axis," not "operate on this axis"
  • @ is matrix multiply, * is element-wise: confusing them is the most common NumPy bug — always check shapes before and after
  • np.linalg.solve(A, b) beats inv(A) @ b: computing the inverse is slower and numerically less stable than solving directly; use solve for systems of equations
  • Vectorization is a mindset shift: instead of asking "how do I loop over elements?", ask "what array operation produces the result?" — the answer is almost always faster