1. Big Data: Foundation model training data is big data, and big data lives in distributed systems, where the CAP theorem applies: you can guarantee at most two of consistency, availability, and partition tolerance.
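A toy sketch of the CAP trade-off (my illustration, hypothetical classes, not any real datastore): during a network partition, a replicated store must either refuse writes (CP: consistent but unavailable) or accept them locally (AP: available but divergent).

```python
class Replica:
    def __init__(self):
        self.data = {}

class CPStore:
    """Consistent + partition-tolerant: refuses writes during a partition."""
    def __init__(self):
        self.replicas = [Replica(), Replica()]
        self.partitioned = False
    def write(self, key, value):
        if self.partitioned:
            return False  # stay consistent, sacrifice availability
        for r in self.replicas:
            r.data[key] = value
        return True

class APStore:
    """Available + partition-tolerant: replicas may diverge during a partition."""
    def __init__(self):
        self.replicas = [Replica(), Replica()]
        self.partitioned = False
    def write(self, key, value, replica=0):
        self.replicas[replica].data[key] = value  # accept locally no matter what
        if not self.partitioned:
            for r in self.replicas:
                r.data[key] = value
        return True

cp, ap = CPStore(), APStore()
cp.partitioned = ap.partitioned = True
print(cp.write("k", 1))              # False: CP gives up availability
ap.write("k", 1, replica=0)
print(ap.replicas[1].data.get("k"))  # None: AP gives up consistency
```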
2. Batch Size: Yann LeCun quipped back in 2018, well before foundation models: "Friends don't let friends use minibatches larger than 32." There is some wisdom in it if you really understand derivatives: small batches give noisier gradient estimates, and that noise can help escape sharp minima. From a fixed-point-theory perspective, current backpropagation is way too primitive.
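A minimal sketch of the batch-size point (my illustration, not LeCun's own argument): the standard deviation of a mini-batch gradient estimate shrinks roughly as 1/sqrt(batch_size), so a batch of 32 injects far more gradient noise than a batch of 4096.

```python
import numpy as np

rng = np.random.default_rng(0)
# Per-example "gradients": noisy observations of a true gradient of 1.0.
per_example_grads = 1.0 + rng.normal(0.0, 1.0, size=100_000)

def batch_grad_std(batch_size, n_batches=2000):
    """Std-dev of the mini-batch gradient estimate at a given batch size."""
    idx = rng.integers(0, len(per_example_grads), size=(n_batches, batch_size))
    batch_means = per_example_grads[idx].mean(axis=1)
    return batch_means.std()

small, large = batch_grad_std(32), batch_grad_std(4096)
print(f"batch 32 noise: {small:.4f}, batch 4096 noise: {large:.4f}")
# Theory predicts roughly 1/sqrt(32) ~ 0.18 versus 1/sqrt(4096) ~ 0.016.
```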
3. Token Embedding Size: DeepSeek uses 7,168; OpenAI's GPT-OSS-120B uses 8,192. Why is such a large dimension needed? Because the transformer (dense or MoE) is way too rigid; recently, mathematicians discovered that 128 dimensions suffice for even the most complex manifolds. Something is very wrong with the transformer architecture.
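One way to build intuition for "low dimensions can suffice" (my sketch, via the Johnson-Lindenstrauss lemma rather than the manifold result the post alludes to): a random projection from 7,168 dimensions down to 128 approximately preserves pairwise distances among a modest set of points.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_high, d_low = 200, 7168, 128
X = rng.normal(size=(n, d_high))                        # points in 7168-dim space
P = rng.normal(size=(d_high, d_low)) / np.sqrt(d_low)   # random JL projection
Y = X @ P

def pairwise_dists(Z):
    """All pairwise Euclidean distances via the Gram-matrix trick."""
    sq = (Z ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (Z @ Z.T)
    return np.sqrt(np.maximum(d2, 0.0))

iu = np.triu_indices(n, k=1)  # each unordered pair once, no diagonal
ratios = pairwise_dists(Y)[iu] / pairwise_dists(X)[iu]
print(f"distance ratios after 7168 -> 128: "
      f"min={ratios.min():.2f}, mean={ratios.mean():.2f}, max={ratios.max():.2f}")
```

The ratios cluster tightly around 1.0, i.e. 128 dimensions already preserve the geometry of these points quite well.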
4. Knowledge Distillation: It is really knowledge normalization, and it has research value. But can you normalize the entire world? And what about the bitter lesson?
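For readers unfamiliar with the mechanics being critiqued, here is a minimal distillation sketch (illustrative only): the student is trained to match the teacher's temperature-softened output distribution via KL divergence, in the style of Hinton et al.'s soft-target formulation.

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as in the soft-target formulation."""
    p = softmax(teacher_logits, T)  # soft teacher targets
    q = softmax(student_logits, T)
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean() * T * T)

teacher = np.array([[4.0, 1.0, -2.0]])
aligned = np.array([[4.1, 0.9, -2.0]])     # student close to the teacher
misaligned = np.array([[-2.0, 1.0, 4.0]])  # student with reversed preferences
print(distill_loss(aligned, teacher) < distill_loss(misaligned, teacher))  # True
```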
Oct 17 at 9:58 AM