
Started running Qwen on my MacBook a few weeks ago.

Then moved to a Mac Mini M4.

Then found a way to run a 35B model on 16GB of RAM.

Yesterday Google dropped Gemma 4 and I swapped the brain by evening. Local models turned out to be useful for much more than I expected. Not just message classification.

Context compression, nightshift signal preparation, memory consolidation, email preprocessing. A full preprocessing layer that touches nearly everything my agent does.

Three tiers of local inference on a $600 machine:

- Fast (2.3B): classify every message in under 2 seconds

- Primary (4.5B): compress context before Claude sees it

- Heavy (35B): complex analysis via SSD weight streaming. Yes, on 16GB.

The 35B trick: mmap flash-paging via llama.cpp. The model is a Mixture of Experts, with only 3B parameters active per token. The OS pages the rest in from SSD on demand. 17 tokens/sec, 81% memory free, zero swap.
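To make the paging idea concrete, here is a minimal sketch in Python. It is not llama.cpp's code. It memory-maps a stand-in "weights" file and reads only two "expert" blocks from it; the block layout, sizes, and the names `read_expert` and `demo` are assumptions made up for illustration. The point is that a mapped file can be far larger than what you actually touch, because the OS only faults in the pages you read.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE  # OS page size, typically 4 KiB or 16 KiB


def read_expert(mm, index, expert_bytes):
    """Copy one hypothetical 'expert' block out of the mapping.

    Slicing the mmap touches only the pages that back this block,
    so the rest of the file does not need to be resident in RAM.
    """
    start = index * expert_bytes
    return mm[start:start + expert_bytes]


def demo(num_experts=64, expert_bytes=4 * PAGE, active=(3, 17)):
    # Build a stand-in weights file: block i is filled with the byte value i.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        for i in range(num_experts):
            f.write(bytes([i]) * expert_bytes)
        path = f.name
    try:
        with open(path, "rb") as f:
            # Length 0 maps the whole file; ACCESS_READ keeps it read-only.
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                blocks = [read_expert(mm, i, expert_bytes) for i in active]
                touched = sum(len(b) for b in blocks)
                return len(mm), touched, [b[0] for b in blocks]
    finally:
        os.unlink(path)


if __name__ == "__main__":
    total, touched, first_bytes = demo()
    print(f"mapped {total} bytes, read {touched} bytes from experts {first_bytes}")
```

In a real MoE model the router picks which experts are active for each token, so the set of touched pages shifts over time and the OS page cache decides what stays in memory.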

Apple published a paper about this approach back in 2023.

Then Gemma 4 dropped. Classification went from 8.5s to 1.9s.

The swap touched 5 files. Result: ~30-40% fewer cloud API sessions. Plus a resilience chain that keeps the agent running when Claude is down.
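A resilience chain like this can be sketched as an ordered list of backends that are tried in turn. This is an illustration under assumptions, not the actual implementation: `run_with_fallback`, `cloud_down`, and `local_primary` are hypothetical names, and the backends are stub functions rather than real API or llama.cpp calls.

```python
from typing import Callable, List, Sequence, Tuple

# A backend is a (name, callable) pair; the callable takes a prompt and returns text.
Backend = Tuple[str, Callable[[str], str]]


def run_with_fallback(prompt: str, backends: Sequence[Backend]) -> Tuple[str, str]:
    """Try each backend in order and return (backend_name, reply) from the first that works."""
    errors: List[str] = []
    for name, call in backends:
        try:
            return name, call(prompt)
        except Exception as exc:  # any failure moves on to the next backend
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))


def cloud_down(prompt: str) -> str:
    # Stub that simulates the cloud API being unavailable.
    raise ConnectionError("cloud API unavailable")


def local_primary(prompt: str) -> str:
    # Stub standing in for a local model call.
    return f"[local] {prompt[:40]}"


if __name__ == "__main__":
    chain = [("cloud", cloud_down), ("local", local_primary)]
    print(run_with_fallback("summarize today's inbox", chain))
```

Putting the cloud model first and the local tiers after it keeps quality high when the API is up and degrades gracefully when it is not.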

My $600 Mac Mini Runs a 35B AI Model. Yesterday I Swapped Its Brain.
Apr 3 at 9:52 AM