Efficiency and performance tweaks in the transformer architecture have usually focused on the normalization, attention, and FFN modules.
Well, here is a New Year’s gift from DeepSeek (arxiv.org/abs/2512.24880). Finally, some improvements to the residual path as well.