NumPy Fundamentals
NumPy's ndarray is the foundation of almost everything in the Python data stack. Pandas Series and DataFrames wrap ndarrays internally. scikit-learn's transformers consume and produce ndarrays. Understanding how the ndarray works — how memory is laid out, how indexing maps to that memory, how shapes transform during operations — will make every library you use on top of it less mysterious.
Creating Arrays
There are many construction paths, each suited to a different situation:
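As a minimal sketch of the most common construction paths (the variable names and the seed below are illustrative choices, not taken from the text):

```python
import numpy as np

# From an existing Python sequence
from_list = np.array([1, 2, 3])

# Pre-filled arrays of a given shape
zeros = np.zeros((2, 3))           # 2x3 array of 0.0 (float64 by default)
ones = np.ones(4)                  # length-4 array of 1.0
filled = np.full((2, 2), 7)        # 2x2 array where every element is 7

# Evenly spaced values
stepped = np.arange(0, 10, 2)      # start, stop (exclusive), step
spaced = np.linspace(0.0, 1.0, 5)  # 5 points from 0 to 1, endpoints included

# Random values from a seeded generator (seed chosen arbitrarily)
rng = np.random.default_rng(0)
noise = rng.random((2, 3))         # uniform samples in [0, 1)
```

Roughly speaking, np.array suits data that already exists in Python, the pre-filled constructors suit buffers you will overwrite later, and arange/linspace suit grids and index ranges.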
The dtype parameter controls how values are stored in memory. Choosing the right dtype can halve memory usage, for example by storing float32 instead of float64:
| dtype | Bytes | Use case |
|---|---|---|
| float64 | 8 | Default float — general use |
| float32 | 4 | Neural network weights, image data |
| int64 | 8 | Large integer counts |
| int32 | 4 | Moderate integer ranges |
| bool | 1 | Masks and filters |
| uint8 | 1 | Pixel values (0–255) |
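The byte counts in the table can be checked directly through the itemsize and nbytes attributes. The following sketch uses an arbitrary array length of one million elements:

```python
import numpy as np

n = 1_000_000
as_f64 = np.zeros(n, dtype=np.float64)
as_f32 = as_f64.astype(np.float32)     # astype returns a converted copy

print(as_f64.itemsize, as_f64.nbytes)  # 8 bytes per element, 8000000 total
print(as_f32.itemsize, as_f32.nbytes)  # 4 bytes per element, 4000000 total

# Narrow integer types have a limited range, so check it before downcasting.
uint8_max = np.iinfo(np.uint8).max     # 255
```

The saving comes at the cost of precision or range, so downcasting is only safe when the data fits the narrower type.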
Indexing and Slicing
NumPy supports four distinct indexing styles:
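The sketch below assumes the four styles meant here are basic integer indexing, slicing, fancy (integer-array) indexing, and Boolean masking. The example array is illustrative:

```python
import numpy as np

a = np.arange(12).reshape(3, 4)
# a = [[ 0,  1,  2,  3],
#      [ 4,  5,  6,  7],
#      [ 8,  9, 10, 11]]

element = a[1, 2]           # basic integer indexing -> 6
block = a[0:2, 1:3]         # slicing -> [[1, 2], [5, 6]]
picked = a[[0, 2], [1, 3]]  # fancy indexing pairs (0, 1) and (2, 3) -> [1, 11]
evens = a[a % 2 == 0]       # Boolean mask -> 1-D array of the even values
```

Note that fancy indexing with two index lists selects individual (row, column) pairs rather than a rectangular block.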
The view vs copy distinction matters for memory and for side effects. Mutating a slice mutates the original array:
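A small illustration of that side effect, with a copy shown for contrast (the array contents are arbitrary):

```python
import numpy as np

original = np.arange(5)      # [0, 1, 2, 3, 4]
view = original[1:4]         # slice: shares memory with original
view[0] = 99                 # writes through to original[1]

independent = original[1:4].copy()
independent[1] = -1          # does not affect original

shares = np.shares_memory(original, view)           # True
separate = np.shares_memory(original, independent)  # False
```

After this runs, original is [0, 99, 2, 3, 4]: the write through the view is visible, the write to the copy is not.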
Use .copy() explicitly when you need an independent array.
Broadcasting
Broadcasting is the mechanism that lets NumPy apply element-wise operations between arrays of different shapes without first materializing expanded copies of the smaller array. Shapes are compared dimension by dimension, starting from the right:
- If arrays have different numbers of dimensions, the shape of the smaller one is padded with 1s on the left.
- Dimensions of size 1 are stretched to match the other array's size.
- If two corresponding dimensions differ and neither is 1, the shapes are incompatible and NumPy raises a ValueError.
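The rules above can be traced on a few small shapes. This is a sketch with arbitrarily chosen arrays, including one deliberately incompatible pair:

```python
import numpy as np

matrix = np.ones((3, 4))
row = np.arange(4)                # shape (4,) -> padded to (1, 4) -> stretched to (3, 4)
col = np.arange(3).reshape(3, 1)  # shape (3, 1) -> stretched to (3, 4)

row_sum = matrix + row            # adds row to every row of matrix
col_sum = matrix + col            # adds col to every column of matrix
outer = col + row                 # (3, 1) with (4,) -> (3, 4)

try:
    matrix + np.arange(3)         # (3, 4) vs (3,): trailing sizes 4 and 3 conflict
except ValueError as exc:
    error_message = str(exc)
```

Reshaping to (3, 1) is what makes the length-3 array line up with the rows instead of the columns.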
Memory Layout: C vs Fortran Order
Arrays can be stored row-major (C order, the default) or column-major (Fortran order). Row-major means elements of a row are contiguous in memory; iterating row-by-row is cache-friendly. Column-major arrays are contiguous column-by-column, which makes column operations faster.
For most data analysis work you will never need to specify order explicitly. It becomes important when passing arrays to Fortran-based BLAS/LAPACK routines (used by np.linalg) or when writing high-performance extensions.
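One way to see the difference between the two layouts is to inspect an array's flags and strides. The sketch below fixes the dtype to int64 so the stride values are predictable:

```python
import numpy as np

c_arr = np.arange(6, dtype=np.int64).reshape(2, 3)  # C order by default
f_arr = np.asfortranarray(c_arr)                    # same values, column-major copy

c_is_c = c_arr.flags['C_CONTIGUOUS']  # True
f_is_f = f_arr.flags['F_CONTIGUOUS']  # True

# Strides are the byte steps needed to move one position along each axis.
c_strides = c_arr.strides  # (24, 8): next row is 3 elements away, next column 1
f_strides = f_arr.strides  # (8, 16): next row is 1 element away, next column 2
```

Both arrays compare equal element by element; only the order of the values in memory differs.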
Key Aggregation Functions
The axis parameter behaves the same way across aggregation functions: axis=0 collapses rows (operates down columns, giving one result per column), and axis=1 collapses columns (operates across rows, giving one result per row).
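A short sketch of a few common aggregations on a 2x3 array. The choice of functions here is illustrative, not an exhaustive list:

```python
import numpy as np

data = np.array([[1, 2, 3],
                 [4, 5, 6]])

total = data.sum()                # 21 (no axis: reduce over everything)
col_sums = data.sum(axis=0)       # [5, 7, 9]  one value per column
row_sums = data.sum(axis=1)       # [6, 15]    one value per row
col_means = data.mean(axis=0)     # [2.5, 3.5, 4.5]
row_max = data.max(axis=1)        # [3, 6]
row_argmax = data.argmax(axis=1)  # [2, 2] position of each row's maximum
```

A quick check on the result shape helps here: reducing a (2, 3) array along axis=0 leaves shape (3,), and along axis=1 leaves shape (2,).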
Summary
- ndarray is a fixed-type, contiguous-memory array. Choosing the right dtype can significantly reduce memory consumption.
- Slicing returns views (no memory copy); fancy and Boolean indexing return copies.
- Broadcasting allows element-wise operations between arrays of compatible shapes without explicit replication.
- Memory layout (C vs Fortran order) affects cache performance; the default C order is efficient for row-wise access patterns typical in data analysis.
- Aggregation functions accept an axis argument: axis=0 reduces over rows (one result per column), axis=1 reduces over columns (one result per row).