Interesting: OpenAI added a learnable per-head bias to the attention logits before softmax, then dropped the bias's probability mass after softmax. I think this lets each head shed attention onto the bias instead of forcing huge logits onto real tokens, so you can pre-train your LLMs without large activation outliers, which I think is the secret behind the easy quantization they pulled off with the gpt-oss models. A minimal sketch of how I read the mechanism is below.
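Here is a minimal PyTorch sketch of that reading, not OpenAI's actual code: a learnable "sink" logit per head is appended before softmax so it competes for probability mass, then its column is deleted afterwards. All names (`SinkSoftmax`, `sink`) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SinkSoftmax(nn.Module):
    """Softmax over attention logits with a learnable per-head sink bias.

    The bias acts as an extra logit that takes part in the softmax but is
    discarded afterwards, so each query's attention weights can sum to < 1.
    This is a sketch of the idea, not OpenAI's implementation.
    """
    def __init__(self, num_heads: int):
        super().__init__()
        # One learnable sink logit per attention head (assumption).
        self.sink = nn.Parameter(torch.zeros(num_heads))

    def forward(self, logits: torch.Tensor) -> torch.Tensor:
        # logits: (batch, heads, q_len, k_len)
        b, h, q, k = logits.shape
        sink = self.sink.view(1, h, 1, 1).expand(b, h, q, 1)
        # Append the sink as an extra "key" column before softmax...
        probs = F.softmax(torch.cat([logits, sink], dim=-1), dim=-1)
        # ...then delete it, leaving weights that sum to <= 1 per query.
        return probs[..., :k]
```

The point of dropping the sink column is that the head never has to place near-1.0 weight on some arbitrary token (the classic attention-sink behavior) to "do nothing", which is one plausible way extreme activation outliers get avoided during pre-training.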
Aug 5 at 9:06 PM