The way we represent embeddings has a structural flaw that limits how good embeddings can be.
How much training data do you need to learn a classifier? The answer depends on how many classes the model has to distinguish between simultaneously. This is called sample complexity, and it scales differently for flat versus hierarchical approaches.
Say your taxonomy has b choices at each level (the branching factor) and L levels deep. A product catalog might have b=10 categories at each level across L=5 levels — that’s 10⁵ = 100,000 leaf products. Also let d be your embedding dimension, and ε be how much error you’ll tolerate
A flat classifier sees all 100,000 leaves at once. Its sample complexity:
Flat: n_flat = O(d · L · log(b) / ε²) (the ε² in the denominator just means: tighter accuracy demands quadratically more data, which is standard in learning theory and where the scaling laws come from).
A hierarchical classifier decomposes the problem — at each level, it only distinguishes between b children, not the entire leaf space:
Hierarchical: n_hier = O(d · log(b) / ε²)
The difference is the factor of L. The flat version pays for all 5 levels of depth simultaneously. The hierarchical version pays for one level at a time. The same total information reaches the leaf, but one brute-forces the combinatorial space while the other walks the tree.
This, amongst many other reasons, is why I think the future of AI will come from exploring geometries and representing information better, not through scaling RL and the other paradigms that exist currently.
One example where this works very well is our use of fractals to change embeddings to represent hierarchies better. Read more about that here—