If you start from the compression objective (make representations compact within each class and spread apart between classes) and optimize it iteratively, each iteration step naturally splits into two operations. The first looks like multi-head self-attention: it compresses the representation. The second looks like a feed-forward MLP: it sparsifies it.
Stack these iterations into layers, and what emerges is a transformer. Not because someone designed it that way, but because the math demands it.
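One such iteration can be sketched in a few lines of numpy. This is an illustrative toy, not an actual white-box transformer implementation: the subspace `U`, the dictionary `D`, the softmax affinity weighting, and all step sizes below are assumptions chosen to make the two operations concrete, with compression as an attention-like pull toward a subspace and sparsification as one ISTA (proximal gradient) step.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def soft_threshold(x, t):
    # Proximal operator of the l1 norm: shrink each entry toward zero by t.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def compression_step(Z, U, eta=0.5):
    # Attention-like operation: project the n tokens in Z (n x d) onto the
    # subspace spanned by the columns of U (d x p), mix them with softmax
    # affinities, and move Z toward the mixed, projected result.
    P = Z @ U                                   # (n, p) projections
    A = softmax(P @ P.T / np.sqrt(U.shape[1]))  # (n, n) token affinities
    return (1 - eta) * Z + eta * (A @ P) @ U.T  # compressed tokens

def sparsification_step(Z, D, eta=0.1, lam=0.5):
    # MLP-like operation: one ISTA step toward a sparse code of Z in a
    # square (d x d) dictionary D, i.e. one proximal-gradient iteration
    # on  min_X 0.5 * ||X @ D - Z||^2 + lam * ||X||_1.
    X = Z @ D.T                   # initial code
    grad = (X @ D - Z) @ D.T      # gradient of the quadratic term
    return soft_threshold(X - eta * grad, eta * lam)

rng = np.random.default_rng(0)
n, d, p = 6, 8, 3
Z = rng.standard_normal((n, d))
U = np.linalg.qr(rng.standard_normal((d, p)))[0]  # orthonormal subspace basis
D = rng.standard_normal((d, d)) / np.sqrt(d)      # illustrative dictionary

Z1 = sparsification_step(compression_step(Z, U), D)  # one "layer"
print(Z1.shape)  # (6, 8): token dimensions preserved, so layers can stack
```

Because each step maps `(n, d)` tokens back to `(n, d)` tokens, the composed operation can be applied repeatedly, which is exactly the layer-stacking move described above.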