GadaaLabs
Python Mastery — From Zero to AI Engineering
Lesson 6

File I/O, JSON, CSV & the pathlib API

24 min

The Context Manager: Always Use with

When you open a file, the OS gives you a file descriptor, which is a limited system resource. If an exception is raised before f.close() runs, or you forget to call it, the descriptor stays open until the file object is garbage-collected or the process exits. The with statement guarantees the file is closed when the block ends, even if an exception is raised.

```python
# Wrong: close() may never be called if an exception occurs
f = open("data.txt")
data = f.read()
f.close()

# Correct: __exit__ is called no matter what
with open("data.txt") as f:
    data = f.read()
```

The with statement works with any context manager — objects implementing __enter__ and __exit__. Files, database connections, locks, and temporary directories are all context managers.
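
To make the protocol concrete, here is a minimal sketch of a hand-written context manager. The class name `Tracker` and its `events` list are hypothetical, chosen only for illustration; the point is that `__exit__` still runs when the block raises.

```python
class Tracker:
    """Hypothetical context manager that records when it is entered and exited."""

    def __init__(self):
        self.events = []

    def __enter__(self):
        self.events.append("enter")
        return self  # this value is bound to the name after "as"

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs on normal exit and also when the block raises.
        self.events.append("exit")
        return False  # returning False lets the exception propagate


tracker = Tracker()
try:
    with tracker as t:
        t.events.append("working")
        raise ValueError("simulated failure")
except ValueError:
    pass

print(tracker.events)  # ['enter', 'working', 'exit']
```

File objects follow the same pattern: their `__exit__` calls `close()`.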

File Modes

| Mode | Meaning | Creates file? | Truncates? |
|------|---------|---------------|------------|
| r | Read text (default) | No | No |
| w | Write text | Yes | Yes |
| a | Append text | Yes | No |
| r+ | Read + write text | No | No |
| rb | Read binary | No | No |
| wb | Write binary | Yes | Yes |
| x | Exclusive create | Fails if exists | N/A |

Use binary mode (rb, wb) for images, PDFs, archives, or any file that is not plain text. For text files, pass encoding="utf-8" explicitly; otherwise the default encoding can depend on the platform's locale settings.

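The modes above can be tried out with a short sketch like the following. It is not the lesson's original interactive example; the file name `sample.txt` is an assumption, and the file lives in a temporary directory so nothing real is touched.

```python
import tempfile
from pathlib import Path

# Work in a throwaway directory so the sketch does not touch real files.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"  # hypothetical file name
    sample.write_text("alpha\nbeta\ngamma\n", encoding="utf-8")

    # 1. read(): the whole file as one string (fine for small files)
    with open(sample, encoding="utf-8") as f:
        whole = f.read()

    # 2. Iterate over the file object: it yields one line at a time
    with open(sample, encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f]

    # 3. readline(): pull a single line on demand
    with open(sample, encoding="utf-8") as f:
        first = f.readline().rstrip("\n")

    # 4. Binary mode returns bytes, not str
    with open(sample, "rb") as f:
        raw = f.read(5)

print(lines)  # ['alpha', 'beta', 'gamma']
print(first)  # alpha
print(raw)    # b'alpha'
```

For large files, prefer strategy 2 with a plain `for` loop, so only the current line needs to be in memory.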

Writing Files

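As a rough illustration of the write modes from the table earlier (w, a, and x), the sketch below writes to a temporary file. The file name `log.txt` and the variable names are assumptions made for this example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "log.txt"  # hypothetical file name

    # "w" creates the file, or truncates it if it already exists
    with open(log, "w", encoding="utf-8") as f:
        f.write("first line\n")

    # "a" appends to the end without truncating
    with open(log, "a", encoding="utf-8") as f:
        # writelines() does not add newlines for you
        f.writelines(["second line\n", "third line\n"])

    # "x" refuses to overwrite: it raises FileExistsError if the file exists
    try:
        with open(log, "x", encoding="utf-8") as f:
            f.write("never written\n")
        exclusive_failed = False
    except FileExistsError:
        exclusive_failed = True

    written = log.read_text(encoding="utf-8").splitlines()

print(written)           # ['first line', 'second line', 'third line']
print(exclusive_failed)  # True
```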

The pathlib API

pathlib.Path is the modern, object-oriented way to handle file system paths. It is cross-platform (handles Windows vs POSIX separators), composable with /, and replaces os.path for most use cases.

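A minimal sketch of the most common Path operations follows. The directory layout (`reports/2024/summary.txt`) is invented for the example and is created inside a temporary directory.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)

    # Compose paths with "/" instead of string concatenation
    report = base / "reports" / "2024" / "summary.txt"  # hypothetical layout

    # Create intermediate directories; exist_ok avoids an error on reruns
    report.parent.mkdir(parents=True, exist_ok=True)
    report.write_text("total=42\n", encoding="utf-8")

    # Inspect the path's components
    name = report.name        # 'summary.txt'
    stem = report.stem        # 'summary'
    suffix = report.suffix    # '.txt'
    exists = report.exists()  # True

    # Derive a sibling path with a different extension (no file is created)
    backup = report.with_suffix(".bak")

    # Recursively find every .txt file under base
    found = sorted(p.relative_to(base).as_posix() for p in base.rglob("*.txt"))

print(found)  # ['reports/2024/summary.txt']
```

`as_posix()` is used only so the printed result looks the same on Windows and POSIX systems.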

JSON: Encoding and Decoding

JSON is the lingua franca of web APIs and configuration files. Python's json module converts between JSON strings and Python objects.

| JSON | Python |
|------|--------|
| object | dict |
| array | list |
| string | str |
| number | int or float |
| true/false | True/False |
| null | None |

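The type mapping can be seen in a short round trip. This is a sketch rather than the lesson's original example: the sample data and the helper name `encode_extra` are assumptions, while `json.dumps`, `json.loads`, `json.dump`, `json.load`, and the `default=` parameter are standard library features.

```python
import io
import json
from datetime import datetime

record = {"name": "Ada", "scores": [9.5, 8], "active": True, "manager": None}

# dumps: Python object -> JSON string
text = json.dumps(record, indent=2, sort_keys=True)

# loads: JSON string -> Python object
restored = json.loads(text)

# dump/load do the same against a file-like object (StringIO stands in for a file)
buffer = io.StringIO()
json.dump(record, buffer)
buffer.seek(0)
from_file = json.load(buffer)


def encode_extra(obj):
    """Hypothetical fallback, called only for objects json cannot serialize."""
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError(f"Cannot serialize {type(obj).__name__}")


event = {"event": "login", "at": datetime(2024, 1, 15, 9, 30)}
event_json = json.dumps(event, default=encode_extra)

print(restored == record)  # True
print(event_json)          # {"event": "login", "at": "2024-01-15T09:30:00"}
```

Note that decoding does not reverse the custom step: the timestamp comes back as a plain string unless you convert it yourself.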

CSV: Reading and Writing Tabular Data

CSV (comma-separated values) is everywhere — databases, spreadsheets, data pipelines. Python's csv module handles quoting, escaping, and different dialects correctly.

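The following sketch shows the writer and reader classes working against an in-memory buffer. The sample rows are invented; one value deliberately contains a comma to show the automatic quoting.

```python
import csv
import io

rows = [
    {"name": "Ada", "city": "London, UK", "age": "36"},
    {"name": "Linus", "city": "Helsinki", "age": "28"},
]

# DictWriter: write rows keyed by column name.
# newline="" lets the csv module control line endings itself.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["name", "city", "age"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# DictReader: each row becomes a dict keyed by the header line
parsed = list(csv.DictReader(io.StringIO(csv_text, newline="")))

# Plain reader: each row is a list, the header is just the first row
plain = list(csv.reader(io.StringIO(csv_text, newline="")))

print(parsed[0]["city"])  # London, UK
print(plain[0])           # ['name', 'city', 'age']
```

Every value read back is a string, so numeric columns such as `age` need an explicit conversion.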

configparser and tempfile

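A small sketch of the two modules working together is shown below. The section name `server`, its settings, and the file name `app.ini` are assumptions made for illustration.

```python
import configparser
import tempfile
from pathlib import Path

config = configparser.ConfigParser()
# Hypothetical settings; configparser stores every value as a string
config["server"] = {"host": "localhost", "port": "8080", "debug": "true"}

# TemporaryDirectory removes the directory and its contents on exit
with tempfile.TemporaryDirectory() as tmp:
    ini_path = Path(tmp) / "app.ini"

    with open(ini_path, "w", encoding="utf-8") as f:
        config.write(f)

    loaded = configparser.ConfigParser()
    loaded.read(ini_path, encoding="utf-8")

    host = loaded["server"]["host"]
    port = loaded.getint("server", "port")        # converts "8080" to 8080
    debug = loaded.getboolean("server", "debug")  # accepts true/false, yes/no, on/off, 1/0
    timeout = loaded.getfloat("server", "timeout", fallback=30.0)  # missing key

tmp_gone = not Path(tmp).exists()

# NamedTemporaryFile: a scratch file that is deleted when it is closed
with tempfile.NamedTemporaryFile("w+", encoding="utf-8", suffix=".txt") as scratch_file:
    scratch_file.write("scratch data")
    scratch_file.seek(0)
    scratch = scratch_file.read()

print(host, port, debug, timeout)  # localhost 8080 True 30.0
```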

PROJECT: Configuration Manager

A production-quality config manager with JSON storage, defaults, nested dot-notation access, and validation.

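The sketch below is one possible shape for such a class, not the lesson's original project code. The method names (`get`, `set`, `validate`, `save`), the schema format, and the sample keys are all assumptions.

```python
import json
import tempfile
from pathlib import Path


class ConfigManager:
    """Hypothetical sketch: JSON-backed config with defaults and dotted keys."""

    def __init__(self, path, defaults=None):
        self.path = Path(path)
        # Round-trip through JSON to get a deep copy of the defaults
        self._data = json.loads(json.dumps(defaults or {}))
        if self.path.exists():
            with open(self.path, encoding="utf-8") as f:
                self._overlay(self._data, json.load(f))

    @staticmethod
    def _overlay(base, override):
        """Recursively copy override into base; stored values beat defaults."""
        for key, value in override.items():
            if isinstance(value, dict) and isinstance(base.get(key), dict):
                ConfigManager._overlay(base[key], value)
            else:
                base[key] = value

    def get(self, dotted_key, default=None):
        node = self._data
        for part in dotted_key.split("."):
            if not isinstance(node, dict) or part not in node:
                return default
            node = node[part]
        return node

    def set(self, dotted_key, value):
        *parents, last = dotted_key.split(".")
        node = self._data
        for part in parents:
            child = node.get(part)
            if not isinstance(child, dict):
                child = node[part] = {}  # replaces a non-dict value on the way down
            node = child
        node[last] = value

    def validate(self, schema):
        """schema maps dotted keys to expected types; returns a list of errors."""
        errors = []
        for key, expected in schema.items():
            value = self.get(key)
            if value is None:
                errors.append(f"{key}: missing")
            elif not isinstance(value, expected):
                errors.append(f"{key}: expected {expected.__name__}")
        return errors

    def save(self):
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self._data, f, indent=2)


DEFAULTS = {"db": {"host": "localhost", "port": 5432}}

with tempfile.TemporaryDirectory() as tmp:
    cfg = ConfigManager(Path(tmp) / "config.json", defaults=DEFAULTS)
    cfg.set("db.port", 6543)
    cfg.set("app.debug", True)
    cfg.save()

    reloaded = ConfigManager(cfg.path, defaults=DEFAULTS)
    port = reloaded.get("db.port")     # stored value overrides the default
    host = reloaded.get("db.host")     # falls back to the default
    debug = reloaded.get("app.debug")
    errors = reloaded.validate({"db.port": int, "db.user": str})

print(port, host, debug)  # 6543 localhost True
print(errors)             # ['db.user: missing']
```

Validation here only checks presence and type; range checks or allowed-value lists would be a natural extension.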

PROJECT: CSV Data Pipeline

A data pipeline that validates, transforms, and cleans CSV data — a pattern that appears in virtually every data engineering task.

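A compact sketch of the validate, transform, and clean pattern follows. It is not the lesson's original project code: the column names, validation rules, sample data, and function names are assumptions chosen to keep the example short.

```python
import csv
import io

FIELDS = ["name", "email", "age"]  # hypothetical schema

# Invented sample input with stray whitespace, a missing email, and a bad age
RAW = """name,email,age
 Ada Lovelace ,ADA@EXAMPLE.COM,36
Linus,,28
Grace Hopper,grace@example.com,not-a-number
Alan Turing,alan@example.com,41
"""


def clean_row(row):
    """Transform step: strip whitespace everywhere, lowercase the email."""
    cleaned = {key: (row.get(key) or "").strip() for key in FIELDS}
    cleaned["email"] = cleaned["email"].lower()
    return cleaned


def validate_row(row):
    """Validation step: return a list of problems (empty means valid)."""
    problems = []
    if not row["name"]:
        problems.append("missing name")
    if "@" not in row["email"]:
        problems.append("invalid email")
    if not row["age"].isdigit():
        problems.append("age is not a whole number")
    return problems


def run_pipeline(source, sink):
    """Stream rows from source to sink; return (written_count, rejected)."""
    reader = csv.DictReader(source)
    writer = csv.DictWriter(sink, fieldnames=FIELDS)
    writer.writeheader()
    written, rejected = 0, []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        cleaned = clean_row(row)
        problems = validate_row(cleaned)
        if problems:
            rejected.append((line_no, problems))
            continue
        writer.writerow(cleaned)
        written += 1
    return written, rejected


output = io.StringIO(newline="")
written, rejected = run_pipeline(io.StringIO(RAW, newline=""), output)
cleaned_lines = output.getvalue().splitlines()

print(written)        # 2
print(rejected)       # [(3, ['invalid email']), (4, ['age is not a whole number'])]
print(cleaned_lines)
```

Because rows are processed one at a time, the same function works unchanged on a real file handle of any size.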

Challenge

Push further with these exercises:

  1. Recursive directory scanner — write a function scan_directory(path, extensions=None) using pathlib.Path.rglob() that returns a dict mapping each extension to a list of matching file paths. Exclude hidden files (starting with .).

  2. JSON diff — write json_diff(a_str, b_str) that compares two JSON objects and returns a dict describing changes: added keys, removed keys, changed values (with old and new). Handle nested dicts recursively.

  3. CSV streaming aggregator — write a function that reads a very large CSV (simulate with a 10,000-row StringIO) without loading it all into memory. Compute: row count, sum and mean of a numeric column, value counts for a categorical column. Use a single pass.

  4. Config merger — extend ConfigManager with a merge(other: ConfigManager) method that deep-merges two configs, with the other config winning on conflicts. Write tests that verify nested dict merging.

  5. NDJSON writer/reader — implement an NDJSONWriter and an NDJSONReader class for newline-delimited JSON (one JSON object per line, used in log streaming). The writer should accept a datetime encoder. The reader should yield decoded objects lazily.

Key Takeaways

  • Always use the with statement for file I/O — it guarantees cleanup via __exit__ even when exceptions occur
  • Text mode with encoding="utf-8" is the safe default; only use binary mode when dealing with non-text data
  • pathlib.Path replaces os.path for most use cases: it is composable with /, self-documenting, and cross-platform
  • json.dumps/loads (for strings) and json.dump/load (for file objects) are the two patterns you will use most of the time; custom encoders handle non-serializable types like datetime
  • csv.DictReader and csv.DictWriter are almost always better than the plain reader/writer — named fields prevent index bugs
  • Iterate over large files line by line (or row by row with DictReader) — never load a multi-GB file with read() or readlines()
  • io.StringIO and io.BytesIO are your best friends for testing file I/O code without touching the real filesystem
  • configparser is perfect for .ini-style human-editable configs; use JSON for programmatically generated or deeply nested configs