Efficiency and performance tweaks to the transformer architecture have usually focused on the normalization, attention, and FFN modules.

Well, here is a New Year’s gift from DeepSeek (arxiv.org/abs/2512.24880): finally, some improvements to the residual path as well.

Jan 1 at 4:43 PM