ImageNet has 1000 categories. A typical network maps each image to a 512-dimensional feature vector.
Each category is not a single point in that space. It is a low-dimensional subspace. Golden retrievers vary along perhaps 5 directions: angle, lighting, fur shade. Sports cars vary along 8: color, angle, model, background. Each class occupies a thin slab.
The rate reduction objective must arrange 1000 subspaces of different shapes and dimensions in 512-dimensional space so each is internally compact and all are mutually spread apart.
This is not a problem intuition can solve. It is a problem an objective function was built for. Two clusters is straightforward. A thousand requires principled optimization.