
Start from the compression objective: make representations compact within each class and spread apart between classes. Optimize it iteratively, and each iteration step naturally produces two operations. The first looks like multi-head self-attention: it compresses the representation toward shared structure. The second looks like a feed-forward MLP: it sparsifies the representation.

Stack these iterations into layers, and what emerges is a transformer. Not because someone designed it that way, but because the math demands it.
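The iteration described above can be sketched in code. This is a minimal illustrative sketch, not the actual derivation: `compress` is a simplified single-head, shared-projection attention update, and `sparsify` is one ISTA step (gradient step plus soft-thresholding) on a sparse-coding objective. All names, matrix shapes, and step sizes here are assumptions chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(S):
    E = np.exp(S - S.max(axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)

def compress(Z, U, step=0.5):
    """Attention-like compression: each token moves toward a
    similarity-weighted average of tokens, shrinking within-class spread."""
    Q = Z @ U                                    # shared projection (one head)
    A = softmax(Q @ Q.T / np.sqrt(Q.shape[-1]))  # token-token similarity
    return Z + step * (A @ Z - Z)

def soft_threshold(V, t):
    """Shrink entries toward zero; entries smaller than t become exactly 0."""
    return np.sign(V) * np.maximum(np.abs(V) - t, 0.0)

def sparsify(Z, D, lam=0.5, eta=0.5):
    """One ISTA step on min_X ||Z - X D||^2 + lam ||X||_1, initialized at
    X = Z: a gradient step followed by soft-thresholding (the MLP-like op)."""
    grad = (Z @ D - Z) @ D.T
    return soft_threshold(Z - eta * grad, eta * lam)

def layer(Z, U, D):
    """One unrolled iteration of the objective = one transformer-like layer."""
    return sparsify(compress(Z, U), D)

n, d = 8, 16                                 # 8 tokens, 16-dim representations
Z = rng.normal(size=(n, d))
U = rng.normal(size=(d, d)) / np.sqrt(d)     # hypothetical projection weights
D = rng.normal(size=(d, d)) / np.sqrt(d)     # hypothetical dictionary

out = layer(Z, U, D)
print(out.shape)                  # (8, 16): shape is preserved across a layer
print(np.mean(out == 0.0))        # fraction of exactly-zero entries after sparsification
```

Stacking `layer` repeatedly, each with its own `U` and `D`, gives the depth-wise structure the post describes: every layer is one more step of the same compress-then-sparsify iteration.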

Intelligence Is Compression, Part 1: One Principle
Apr 10 at 12:33 PM