1. Big Data: Foundation model training data is big data, and big data lives in distributed systems, where the CAP theorem applies: you can guarantee at most two of consistency, availability, and partition tolerance.
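A toy sketch of the CAP trade-off (my illustration, hypothetical classes, not any real datastore): during a network partition, a replicated store must either refuse writes (CP: consistent but unavailable) or accept them locally (AP: available but divergent).

```python
class Replica:
    def __init__(self):
        self.data = {}

class CPStore:
    """Consistent + partition-tolerant: refuses writes during a partition."""
    def __init__(self):
        self.replicas = [Replica(), Replica()]
        self.partitioned = False
    def write(self, key, value):
        if self.partitioned:
            return False  # stay consistent, sacrifice availability
        for r in self.replicas:
            r.data[key] = value
        return True

class APStore:
    """Available + partition-tolerant: replicas may diverge during a partition."""
    def __init__(self):
        self.replicas = [Replica(), Replica()]
        self.partitioned = False
    def write(self, key, value, replica=0):
        self.replicas[replica].data[key] = value  # accept locally no matter what
        if not self.partitioned:
            for r in self.replicas:
                r.data[key] = value
        return True

cp, ap = CPStore(), APStore()
cp.partitioned = ap.partitioned = True
print(cp.write("k", 1))              # False: CP gives up availability
ap.write("k", 1, replica=0)
print(ap.replicas[1].data.get("k"))  # None: AP gives up consistency
```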
2. Batch Size: Yann LeCun quipped back in 2018, well before foundation models: "Friends don't let friends use minibatches larger than 32." There is some wisdom in it if you really understand derivatives: small batches give noisier gradient estimates, and that noise can help escape sharp minima. From a fixed-point-theory perspective, current backpropagation is way too primitive.
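A minimal sketch of the batch-size point (my illustration, not LeCun's own argument): the standard deviation of a mini-batch gradient estimate shrinks roughly as 1/sqrt(batch_size), so a batch of 32 injects far more gradient noise than a batch of 4096.

```python
import numpy as np

rng = np.random.default_rng(0)
# Per-example "gradients": noisy observations of a true gradient of 1.0.
per_example_grads = 1.0 + rng.normal(0.0, 1.0, size=100_000)

def batch_grad_std(batch_size, n_batches=2000):
    """Std-dev of the mini-batch gradient estimate at a given batch size."""
    idx = rng.integers(0, len(per_example_grads), size=(n_batches, batch_size))
    batch_means = per_example_grads[idx].mean(axis=1)
    return batch_means.std()

small, large = batch_grad_std(32), batch_grad_std(4096)
print(f"batch 32 noise: {small:.4f}, batch 4096 noise: {large:.4f}")
# Theory predicts roughly 1/sqrt(32) ~ 0.18 versus 1/sqrt(4096) ~ 0.016.
```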
3. Token Embedding Size: DeepSeek uses 7,168; OpenAI's GPT-OSS-120B uses 8,192. Why is such a large dimension needed? Because the transformer (dense or MoE) is way too rigid; recently, mathematicians discovered that 128 dimensions suffice for even the most complex manifolds. Something is very wrong with the transformer architecture.
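One way to build intuition for "low dimensions can suffice" (my sketch, via the Johnson-Lindenstrauss lemma rather than the manifold result the post alludes to): a random projection from 7,168 dimensions down to 128 approximately preserves pairwise distances among a modest set of points.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_high, d_low = 200, 7168, 128
X = rng.normal(size=(n, d_high))                        # points in 7168-dim space
P = rng.normal(size=(d_high, d_low)) / np.sqrt(d_low)   # random JL projection
Y = X @ P

def pairwise_dists(Z):
    """All pairwise Euclidean distances via the Gram-matrix trick."""
    sq = (Z ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (Z @ Z.T)
    return np.sqrt(np.maximum(d2, 0.0))

iu = np.triu_indices(n, k=1)  # each unordered pair once, no diagonal
ratios = pairwise_dists(Y)[iu] / pairwise_dists(X)[iu]
print(f"distance ratios after 7168 -> 128: "
      f"min={ratios.min():.2f}, mean={ratios.mean():.2f}, max={ratios.max():.2f}")
```

The ratios cluster tightly around 1.0, i.e. 128 dimensions already preserve the geometry of these points quite well.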
4. Knowledge Distillation: It is really knowledge normalization, and it has research value. But can you normalize the entire world? And what about the bitter lesson?
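For readers unfamiliar with the mechanics being critiqued, here is a minimal distillation sketch (illustrative only): the student is trained to match the teacher's temperature-softened output distribution via KL divergence, in the style of Hinton et al.'s soft-target formulation.

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as in the soft-target formulation."""
    p = softmax(teacher_logits, T)  # soft teacher targets
    q = softmax(student_logits, T)
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean() * T * T)

teacher = np.array([[4.0, 1.0, -2.0]])
aligned = np.array([[4.1, 0.9, -2.0]])     # student close to the teacher
misaligned = np.array([[-2.0, 1.0, 4.0]])  # student with reversed preferences
print(distill_loss(aligned, teacher) < distill_loss(misaligned, teacher))  # True
```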
Oct 17 at 9:58 AM