Maximum Likelihood Estimation is the quiet engine underneath almost everything in ML, DL, and RL.

Most people learn it once in a stats class and move on. They shouldn't. It's not just a parameter-fitting trick — it's the philosophical foundation of why our models work at all.

The core idea is almost trivial.

You have data. You have a model with parameters θ. Ask: what value of θ makes the data you actually observed as probable as possible?

Formally: maximise log p(data | θ) over θ.

That's it. Everything else is an implementation detail.
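A minimal sketch of that idea, assuming a coin-flip (Bernoulli) model and a handful of made-up observations. The data, the grid resolution, and the function name are illustrative choices, not part of any library. It searches for the θ that makes the observed flips most probable and compares it with the sample mean, which is the known closed-form answer for this model.

```python
import math

def bernoulli_log_likelihood(theta, data):
    """log p(data | theta) for i.i.d. coin flips (1 = heads, 0 = tails)."""
    return sum(math.log(theta) if x == 1 else math.log(1.0 - theta) for x in data)

# Made-up observations: 7 heads out of 10 flips.
data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]

# Brute-force search over candidate values of theta strictly inside (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
theta_hat = max(grid, key=lambda t: bernoulli_log_likelihood(t, data))

# For a Bernoulli model the closed-form MLE is the sample mean.
sample_mean = sum(data) / len(data)
print(theta_hat, sample_mean)
```

In practice nobody grid-searches; you follow the gradient of the log-likelihood instead. The objective is the same.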

In classical ML, MLE gives you linear regression for free. Assume Gaussian noise on your outputs → maximizing likelihood gives you least squares. Assume Bernoulli outputs → you get logistic regression. The loss functions you memorized aren't arbitrary. They're MLE in disguise.
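To see the Gaussian case concretely, here is a small sketch assuming a one-parameter model y = w·x with fixed noise σ = 1 and made-up data (all of these are illustrative assumptions). The Gaussian negative log-likelihood is the sum of squared errors scaled by 1/(2σ²) plus a constant that does not depend on w, so both objectives pick the same w.

```python
import math

# Made-up data, roughly y = 2x with small noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 1.9, 4.2, 5.8, 8.1]
sigma = 1.0  # assumed, fixed noise standard deviation

def sse(w):
    """Sum of squared errors for the model y = w * x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys))

def gaussian_nll(w):
    """Negative log-likelihood of ys under y ~ Normal(w * x, sigma^2)."""
    n = len(xs)
    return 0.5 * n * math.log(2 * math.pi * sigma ** 2) + sse(w) / (2 * sigma ** 2)

# Closed-form least-squares slope for a line through the origin.
w_least_squares = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Minimise the Gaussian NLL by brute force over a grid of slopes.
grid = [i / 1000 for i in range(0, 4001)]
w_mle = min(grid, key=gaussian_nll)

print(w_least_squares, w_mle)
```

The two estimates agree up to the grid spacing of 0.001.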

In deep learning, cross-entropy loss is MLE. When you train a classifier, you're maximizing the log-likelihood of the true labels under your model's output distribution. Every gradient step is the model asking: how do I make the observed data more probable?
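A quick numerical check of that claim, as a sketch with made-up logits for a 3-class problem. With a one-hot target, the cross-entropy sum collapses to a single term: the negative log-probability the model assigns to the true label.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, probs))

logits = [2.0, 0.5, -1.0]  # made-up classifier outputs
true_class = 0

probs = softmax(logits)
one_hot = [1.0 if i == true_class else 0.0 for i in range(len(logits))]

ce = cross_entropy(probs, one_hot)
nll = -math.log(probs[true_class])  # negative log-likelihood of the true label

print(ce, nll)
```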

Extend this to generative models:

→ VAEs maximise a lower bound on the log-likelihood (the ELBO)

→ Diffusion models maximise a reweighted variational lower bound on the log-likelihood across denoising steps

→ Autoregressive LLMs do pure MLE — next-token prediction is maximising p(xₜ | x₁,...,xₜ₋₁) over billions of tokens

The entire pre-training recipe of GPT is MLE on text.
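The autoregressive case can be sketched with a toy stand-in: a bigram model that conditions only on the previous token rather than the full prefix (a deliberate simplification; real LLMs use a neural network over the whole context). The corpus and function names are made up. For this model the MLE is just normalised counts, and a sequence's log-likelihood is the sum of its next-token log-probabilities.

```python
import math
from collections import Counter, defaultdict

# Tiny made-up corpus.
corpus = "the cat sat on the mat the cat ran".split()

# Count how often each token follows each previous token.
pair_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    pair_counts[prev][nxt] += 1

def cond_prob(nxt, prev):
    """MLE of p(next | prev): count(prev, next) / count(prev, anything).

    Unseen contexts are not handled in this sketch.
    """
    total = sum(pair_counts[prev].values())
    return pair_counts[prev][nxt] / total

def sequence_log_likelihood(tokens):
    """Chain rule: sum of log p(x_t | x_{t-1}), treating the first token as given."""
    return sum(math.log(cond_prob(n, p)) for p, n in zip(tokens, tokens[1:]))

print(cond_prob("cat", "the"))
print(sequence_log_likelihood(["the", "cat", "sat"]))
```

Swap the count table for a transformer and the counting for gradient descent, and the objective is unchanged.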

In reinforcement learning, MLE shows up in a less obvious place.

Policy gradient methods like REINFORCE compute:

∇θ log π(a|s) · R

That score function, ∇θ log π, is the gradient of the log-likelihood of the action your policy took. You're doing MLE, but weighted by reward. Actions that led to high reward get their likelihood pushed up. Actions that led to negative reward (or, with a baseline, below-average reward) get pushed down.
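Here is a minimal REINFORCE sketch on a hypothetical two-armed bandit, with no states, a softmax policy over two logits, and deterministic rewards. The bandit, learning rate, and step count are all illustrative assumptions. Each update is the log-likelihood gradient of the sampled action, scaled by the reward it earned.

```python
import math
import random

random.seed(0)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

rewards = [0.0, 1.0]  # hypothetical bandit: arm 1 always pays 1, arm 0 pays 0
logits = [0.0, 0.0]   # policy parameters theta, starting from a uniform policy
lr = 0.5              # assumed learning rate

for _ in range(500):
    probs = softmax(logits)
    action = random.choices([0, 1], weights=probs)[0]
    R = rewards[action]
    # For a softmax policy, d/d logit_k of log pi(action) = 1[k == action] - probs[k].
    for k in range(2):
        grad_log_pi = (1.0 if k == action else 0.0) - probs[k]
        logits[k] += lr * R * grad_log_pi  # reward-weighted likelihood ascent

final_probs = softmax(logits)
print(final_probs)
```

Set R to 1 for every action and this is ordinary maximum likelihood on whatever the policy happened to sample. The reward is what turns it into learning.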

Even modern RLHF (the technique behind ChatGPT's fine-tuning) is built on this: fit a reward model to human preference comparisons via MLE, then optimise the policy against that reward signal with a policy-gradient method such as PPO.
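The reward-model step can be sketched as follows, assuming a Bradley-Terry preference model, where p(a preferred over b) = sigmoid(r_a − r_b). That is a common choice but not something the text above specifies. The three responses, the comparison data, and the one-scalar-per-response "reward model" are all made up for illustration; a real reward model is a neural network scoring text.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Made-up preference data as (winner, loser) pairs over three hypothetical responses.
comparisons = [("A", "B")] * 3 + [("B", "A")] + [("B", "C")] * 3 + [("C", "B")]

reward = {"A": 0.0, "B": 0.0, "C": 0.0}  # one learnable scalar per response
lr = 0.1

for _ in range(500):
    grads = {name: 0.0 for name in reward}
    for winner, loser in comparisons:
        # Gradient of log sigmoid(r_w - r_l) with respect to r_w; negated for r_l.
        g = 1.0 - sigmoid(reward[winner] - reward[loser])
        grads[winner] += g
        grads[loser] -= g
    for name in reward:
        reward[name] += lr * grads[name]  # full-batch gradient ascent on the log-likelihood

print(reward)
```

Since A beats B three times out of four, the fitted gap reward["A"] − reward["B"] settles near log 3, the value at which the model's predicted win rate matches the observed 75%.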

The deeper point:

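MLE has a dual identity. Maximising the average log-likelihood is equivalent to minimising the KL divergence from the (empirical) data distribution to your model distribution. The two objectives differ only by the entropy of the data, which doesn't depend on θ.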

MLE ≡ minimise KL(p_data ‖ p_model)
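A numerical check of the identity on a made-up three-outcome distribution (both distributions are illustrative). The expected negative log-likelihood under the data distribution decomposes into the data entropy plus the KL term, so minimising one over the model minimises the other.

```python
import math

p_data = [0.5, 0.3, 0.2]   # made-up "true" distribution
p_model = [0.4, 0.4, 0.2]  # made-up model distribution

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p)

def cross_entropy(p, q):
    """Expected negative log-likelihood of q when the data follow p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

expected_nll = cross_entropy(p_data, p_model)
decomposed = entropy(p_data) + kl(p_data, p_model)

print(expected_nll, decomposed)
```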

This is why it generalises everywhere. You're always doing the same thing: pulling your model's distribution toward the data-generating process, as efficiently as information theory allows.

The limitations are real: MLE can overfit, it treats all errors equally, and it breaks under distribution shift. That's why we have regularisation (MAP estimation), robust losses, and distributional RL.
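The overfitting point and the MAP fix can be sketched with the coin again. This assumes a Beta(2, 2) prior, an illustrative choice that mildly favours a fair coin; the function names are made up. With three heads in three flips, MLE concludes the coin never lands tails, while MAP pulls the estimate back toward the prior.

```python
def bernoulli_mle(heads, n):
    """Maximum likelihood estimate of p(heads): the raw frequency."""
    return heads / n

def bernoulli_map(heads, n, a=2.0, b=2.0):
    """MAP estimate under an assumed Beta(a, b) prior.

    This is the mode of the Beta(heads + a, n - heads + b) posterior,
    valid when both posterior parameters exceed 1.
    """
    return (heads + a - 1.0) / (n + a + b - 2.0)

mle_estimate = bernoulli_mle(3, 3)  # three heads out of three flips
map_estimate = bernoulli_map(3, 3)

print(mle_estimate, map_estimate)
```

As n grows the prior's pull fades and the two estimates converge, which is the usual behaviour of regularisation.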

Image Source: share.google/AbqFE6BjUl…

May 21 at 6:43 AM