Probability Fundamentals
Probability is the mathematical language of uncertainty. Every machine learning model is, at its core, a probability machine — it estimates the likelihood that an input belongs to a class, or the distribution of possible output values. Understanding probability distributions and Bayesian reasoning will make every modelling decision you make more principled.
Conditional Probability and Independence
Conditional probability answers: "What is the probability of A, given that B has already occurred?"
P(A | B) = P(A ∩ B) / P(B)
Two events are independent if P(A | B) = P(A) — knowing B occurred gives no information about A. The naive Bayes classifier assumes all features are independent given the class label, which is rarely true but works surprisingly well in practice.
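The definitions above can be checked mechanically on a small joint distribution. A minimal sketch, where the joint probabilities (rain vs. carrying an umbrella) are made up for illustration:

```python
# Conditional probability from a small joint distribution.
# The joint probabilities below are illustrative, not real data.
p_joint = {
    ("rain", "umbrella"): 0.20,
    ("rain", "no_umbrella"): 0.10,
    ("no_rain", "umbrella"): 0.05,
    ("no_rain", "no_umbrella"): 0.65,
}

def marginal(event, index):
    """P(event) by summing the joint over the other variable."""
    return sum(p for key, p in p_joint.items() if key[index] == event)

def conditional(a, b):
    """P(A = a | B = b) = P(A = a, B = b) / P(B = b)."""
    return p_joint[(a, b)] / marginal(b, 1)

p_rain = marginal("rain", 0)                             # P(rain) = 0.30
p_rain_given_umbrella = conditional("rain", "umbrella")  # 0.20 / 0.25 = 0.80
# P(rain | umbrella) != P(rain), so the two events are dependent
# in this toy distribution.
```

If the conditional had come out equal to the marginal, the events would be independent by definition.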
The Normal (Gaussian) Distribution
The normal distribution is the most important distribution in statistics. The Central Limit Theorem guarantees that the mean of many independent random variables tends toward a normal distribution, regardless of the underlying distribution — which is why means and regression errors are often approximately normal. For a normal distribution, about 68% of values fall within one standard deviation of the mean, 95% within two, and 99.7% within three (the 68-95-99.7 rule).
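The Central Limit Theorem can be seen directly by simulation. A minimal sketch using the standard library, sampling from a decidedly non-normal distribution (uniform on [0, 1]) and averaging:

```python
import random
import statistics

# CLT sketch: means of samples from a uniform distribution cluster
# around the true mean 0.5, with standard deviation sigma / sqrt(n),
# where sigma = sqrt(1/12) for Uniform(0, 1).
random.seed(0)
n, trials = 50, 2000
sample_means = [
    statistics.fmean(random.random() for _ in range(n))
    for _ in range(trials)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
# Theory predicts: mean 0.5, sd = (1/12) ** 0.5 / 50 ** 0.5 ≈ 0.0408
```

A histogram of `sample_means` would look bell-shaped even though each individual draw is uniform, which is the CLT at work.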
Distribution properties at a glance:
| Distribution | Shape | Parameters | Common use case |
|---|---|---|---|
| Normal | Symmetric bell | μ (mean), σ (std) | Measurement errors, heights, test scores |
| Poisson | Right-skewed, discrete | λ (rate) | Event counts per unit time/area |
| Binomial | Discrete, bounded | n (trials), p (success prob) | Binary outcomes over repeated trials |
| Exponential | Right-skewed, continuous | λ (rate) | Time between events |
Poisson Distribution
The Poisson distribution models the number of events occurring in a fixed interval when events happen independently at a known average rate λ:

P(X = k) = (λ^k · e^(−λ)) / k!,  for k = 0, 1, 2, …
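The Poisson PMF is short enough to write from scratch. A minimal sketch; the help-desk rate of 3 calls per hour is an illustrative choice:

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for a Poisson random variable with rate lam."""
    return lam**k * math.exp(-lam) / math.factorial(k)

# Example: a help desk averaging 3 calls per hour (lam = 3).
# Probability of exactly 0, 3, or 8 calls in an hour.
probs = [poisson_pmf(k, 3.0) for k in (0, 3, 8)]

# Sanity check: the PMF sums to 1 over all k (truncated at 50,
# where the remaining tail mass is negligible).
total = sum(poisson_pmf(k, 3.0) for k in range(50))
```

Note the right skew visible in `probs`: zero calls is more likely than eight, even though both are "far" from the rate of 3.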
Binomial Distribution
The binomial distribution models the number of successes in n independent Bernoulli trials, each with probability p of success:

P(X = k) = C(n, k) · p^k · (1 − p)^(n − k),  for k = 0, 1, …, n
Bayesian Intuition
Bayesian thinking updates beliefs as evidence accumulates:
- Prior: What you believe before seeing data. P(hypothesis).
- Likelihood: How probable the observed data is under the hypothesis. P(data | hypothesis).
- Posterior: Updated belief after seeing data. P(hypothesis | data) ∝ P(data | hypothesis) × P(hypothesis).
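The prior-likelihood-posterior update above can be made concrete with a disease-testing calculation. A minimal sketch; the prevalence and accuracy figures are illustrative choices:

```python
# Bayes' theorem applied to a disease test.
# Illustrative numbers: 0.2% prevalence, 99% sensitivity, 99% specificity.
prior = 0.002            # P(disease)
sensitivity = 0.99       # P(positive | disease)
false_positive = 0.01    # P(positive | no disease) = 1 - specificity

# P(positive) via the law of total probability.
p_positive = sensitivity * prior + false_positive * (1 - prior)

# Posterior: P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
posterior = sensitivity * prior / p_positive
# ≈ 0.166: even with a "99% accurate" test, most positives are false
# because healthy people vastly outnumber sick ones.
```

The tiny prior dominates: the small false-positive rate, applied to the huge healthy population, generates far more positives than the disease itself.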
The numbers surprise most people. With a disease at 0.2% prevalence and a test that is 99% sensitive and 99% specific, roughly 83% of positive results are false: a test that sounds "99% accurate" is wrong for most positives when the disease is rare. This is the base rate fallacy: ignoring the prior probability leads to drastically overestimating posterior probabilities.
Summary
- Conditional probability P(A|B) describes the probability of A given that B is already known; Bayes' theorem inverts this to update beliefs after observing evidence.
- The normal distribution (symmetric bell) appears wherever many small independent factors add together; the 68-95-99.7 rule provides quick intuition for standard deviation ranges.
- The Poisson distribution models event counts per interval at a fixed rate; the binomial models success counts over repeated binary trials.
- Bayesian reasoning quantifies belief update: posterior is proportional to likelihood times prior. Low base rates (small priors) can make even accurate tests unreliable — always account for prevalence.