Sebastian Raschka, PhD (@rasbt)

Aug 2, 2025

I (finally) implemented Qwen3 Mixture-of-Experts from Scratch in pure PyTorch.

The Qwen3 model suite recently added new open-weight models (Instruct, Thinking, and Coder), and I couldn't resist...

As I mentioned a few weeks ago, I've been using the older (and smaller) Qwen3 dense models as a starting point for experimentation and research due to their really good performance.

After last week's release, I finally sat down and coded the MoE variants as well.

The purpose of this re-implementation is to have human-readable code to better understand the internals and make it easier to adapt or modify for downstream tasks.

If you are curious, you can check out the notebook here: github.com/rasbt/LLMs-f…

TL;DR:

- Qwen3 Coder Flash (30B-A3B) architecture

- MoE setup with 128 experts, 8 active per token

- Pure PyTorch in a standalone Jupyter notebook

- Roughly 60 GB of memory required (runs on a single 80 GB A100 or newer)
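To make the "128 experts, 8 active per token" point concrete, here is a minimal sketch of top-k expert routing in PyTorch. It is not the code from the notebook. The class name `ToyMoE`, the tiny dimensions, and the plain two-layer experts are all assumptions chosen for illustration, and the actual model's expert design and routing normalization may differ.

```python
import torch
import torch.nn as nn


class ToyMoE(nn.Module):
    """Illustrative top-k mixture-of-experts feed-forward layer (hypothetical, toy-sized)."""

    def __init__(self, emb_dim=32, hidden_dim=16, num_experts=128, top_k=8):
        super().__init__()
        self.top_k = top_k
        # Router: produces one logit per expert for every token
        self.router = nn.Linear(emb_dim, num_experts, bias=False)
        # Each expert is a small feed-forward network (a simplification for this sketch)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(emb_dim, hidden_dim, bias=False),
                nn.SiLU(),
                nn.Linear(hidden_dim, emb_dim, bias=False),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x):
        batch, seq_len, emb_dim = x.shape
        tokens = x.reshape(-1, emb_dim)               # (num_tokens, emb_dim)
        logits = self.router(tokens)                  # (num_tokens, num_experts)
        # Keep only the top_k highest-scoring experts for each token
        top_logits, top_idx = torch.topk(logits, self.top_k, dim=-1)
        # Mixing weights: softmax over the selected experts only
        weights = torch.softmax(top_logits, dim=-1)   # (num_tokens, top_k)

        out = torch.zeros_like(tokens)
        # Run each selected expert once, on just the tokens routed to it
        for e in top_idx.unique().tolist():
            token_ids, slot = torch.where(top_idx == e)
            expert_out = self.experts[e](tokens[token_ids])
            scale = weights[token_ids, slot].unsqueeze(-1)
            out.index_add_(0, token_ids, expert_out * scale)
        return out.reshape(batch, seq_len, emb_dim), top_idx


torch.manual_seed(0)
moe = ToyMoE()
x = torch.randn(2, 4, 32)
with torch.no_grad():
    y, chosen = moe(x)

# All experts are stored, but only top_k of them run for any given token
expert_params = sum(p.numel() for p in moe.experts.parameters())
active_params = expert_params * moe.top_k // len(moe.experts)
print(y.shape, chosen.shape)
print(f"expert params stored: {expert_params}, used per token: {active_params}")
```

Every expert has to stay loaded even though each token only passes through 8 of them. That is why the memory footprint follows the total parameter count (the "30B" in 30B-A3B) rather than the active one (the "A3B"). Assuming 2 bytes per weight, 30B parameters come to about 60 GB.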

Happy tinkering!

Aug 2

1:28 PM
