If you start from the compression objective (make representations compact within each class and spread apart between classes) and optimize it iteratively, each iteration step naturally splits into two operations. The first looks like multi-head self-attention: it compresses the representation. The second looks like a feed-forward MLP: it sparsifies it.
Stack these iterations into layers, and what emerges is a transformer. Not because someone designed it that way, but because the math demands it.
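One such iteration can be sketched in a few lines of numpy. This is an illustrative toy, not an actual white-box transformer implementation: the subspace `U`, the dictionary `D`, the softmax affinity weighting, and all step sizes below are assumptions chosen to make the two operations concrete, with compression as an attention-like pull toward a subspace and sparsification as one ISTA (proximal gradient) step.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def soft_threshold(x, t):
    # Proximal operator of the l1 norm: shrink each entry toward zero by t.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def compression_step(Z, U, eta=0.5):
    # Attention-like operation: project the n tokens in Z (n x d) onto the
    # subspace spanned by the columns of U (d x p), mix them with softmax
    # affinities, and move Z toward the mixed, projected result.
    P = Z @ U                                   # (n, p) projections
    A = softmax(P @ P.T / np.sqrt(U.shape[1]))  # (n, n) token affinities
    return (1 - eta) * Z + eta * (A @ P) @ U.T  # compressed tokens

def sparsification_step(Z, D, eta=0.1, lam=0.5):
    # MLP-like operation: one ISTA step toward a sparse code of Z in a
    # square (d x d) dictionary D, i.e. one proximal-gradient iteration
    # on  min_X 0.5 * ||X @ D - Z||^2 + lam * ||X||_1.
    X = Z @ D.T                   # initial code
    grad = (X @ D - Z) @ D.T      # gradient of the quadratic term
    return soft_threshold(X - eta * grad, eta * lam)

rng = np.random.default_rng(0)
n, d, p = 6, 8, 3
Z = rng.standard_normal((n, d))
U = np.linalg.qr(rng.standard_normal((d, p)))[0]  # orthonormal subspace basis
D = rng.standard_normal((d, d)) / np.sqrt(d)      # illustrative dictionary

Z1 = sparsification_step(compression_step(Z, U), D)  # one "layer"
print(Z1.shape)  # (6, 8): token dimensions preserved, so layers can stack
```

Because each step maps `(n, d)` tokens back to `(n, d)` tokens, the composed operation can be applied repeatedly, which is exactly the layer-stacking move described above.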