Today’s paper introduces STEP3-VL-10B, a lightweight open-source foundation model designed to address the trade-off between computational efficiency and advanced multimodal intelligence. Current frontier models often rely on massive scaling that hinders practical deployment, while smaller models typically lack sophisticated reasoning capabilities. This work presents a 10-billion-parameter model that uses specific architectural and training strategies to rival the performance of systems ten to twenty times its size.