Oh, I see your point though. I think the move toward smaller, fine-grained experts largely goes back to the DeepSeekMoE paper (arxiv.org/pdf/2401.06066), which found them to be beneficial.
But to your point, a larger shared expert might not be a bad idea.
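For a rough sense of why finer-grained experts can help, here is a quick sketch of the combinatorics. The function name and the `split` parameter are my own, purely for illustration. It assumes each expert is split into `split` smaller ones and that the number of routed experts per token is scaled up by the same factor, so active parameters stay roughly constant.

```python
from math import comb

def routing_combinations(num_experts: int, top_k: int, split: int = 1) -> int:
    """Count the distinct expert subsets a token could be routed to.

    Hypothetical helper: each expert is split into `split` smaller experts,
    and top_k is multiplied by `split` so the active compute stays about the same.
    """
    return comb(num_experts * split, top_k * split)

# 16 experts with top-2 routing, versus the same budget split 4 ways
# (64 smaller experts with top-8 routing).
coarse = routing_combinations(16, 2)
fine = routing_combinations(16, 2, split=4)
print(coarse, fine)
```

A shared expert is always active, so it doesn't enter this count; making it larger trades some of the routed capacity for capacity that every token uses.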
Feb 12 at 3:44 PM