Welcome Gemma 4

In the past few days, DeepMind released the Gemma 4 family of models.

With it, they're reclaiming the crown on the performance-efficiency Pareto frontier for small models, and showing multimodality doesn't have to come at a cost.

Here's the lineup:

💎 Gemma 4 E2B: a dense model with 2B effective parameters, supporting text, vision, and audio

💎 Gemma 4 E4B: same architecture, scaled up to 4B effective parameters

💎 Gemma 4 26B A4B: a MoE model with 26B total parameters and only 4B active at inference, accepting text and vision

💎 Gemma 4 31B: a full 31B dense model, also for text and vision

If you want to go beyond the headlines and understand how frontier multimodal models are built, I genuinely can't recommend this visual guide by Maarten Grootendorst enough!

It goes through the LLM architecture, how vision and audio encoders plug into it, and explains practical tips and tricks that are useful beyond the specific multimodal application.

For example, maintaining aspect ratio when processing images is something you can try today on your ViT with 2D RoPE (e.g., DINOv3).
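To make the aspect-ratio tip concrete, here is a minimal sketch of one way to pick a resize target that keeps an image's proportions while staying within a patch budget. It is not Gemma 4's or DINOv3's actual preprocessing: the function names, the patch size of 16, and the budget of 256 patches are all assumptions chosen for illustration.

```python
import math


def aspect_preserving_size(height, width, patch_size=16, max_patches=256):
    """Pick a resize target (new_h, new_w) that roughly keeps the aspect ratio.

    Both sides are multiples of patch_size, and the patch grid never exceeds
    max_patches. patch_size and max_patches are illustrative assumptions.
    """
    if height <= 0 or width <= 0:
        raise ValueError("height and width must be positive")
    # Uniform scale so the resized area matches the patch budget.
    scale = math.sqrt(max_patches * patch_size**2 / (height * width))
    # Round down to whole patches; the epsilon guards against float error.
    rows = max(1, math.floor(height * scale / patch_size + 1e-9))
    cols = max(1, math.floor(width * scale / patch_size + 1e-9))
    # Very elongated images can hit the max(1, ...) clamp, so re-check the budget.
    rows = min(rows, max_patches)
    cols = min(cols, max_patches // rows)
    return rows * patch_size, cols * patch_size


def patch_positions(new_h, new_w, patch_size=16):
    """Return (row, col) coordinates per patch, the kind of input 2D RoPE uses."""
    rows, cols = new_h // patch_size, new_w // patch_size
    return [(r, c) for r in range(rows) for c in range(cols)]


new_h, new_w = aspect_preserving_size(480, 640)
positions = patch_positions(new_h, new_w)
print(new_h, new_w, len(positions))
```

Because 2D RoPE encodes each patch by its row and column rather than by an index into a fixed square grid, a non-square grid like this one needs no learned position table of matching shape, which is why the trick transfers easily to such ViTs.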

Enjoy!

A Visual Guide to Gemma 4
Apr 12 at 2:37 PM