Started running Qwen on my MacBook a few weeks ago.
Then moved to a Mac Mini M4.
Then found a way to run a 35B model on 16GB of RAM.
Yesterday Google dropped Gemma 4 and I swapped the brain by evening. Local models turned out to be useful for much more than I expected. Not just message classification.
Context compression, nightshift signal preparation, memory consolidation, email preprocessing. A full preprocessing layer that touches nearly everything my agent does.
Three tiers of local inference on a $600 machine:
- Fast (2.3B): classify every message in under 2 seconds
- Primary (4.5B): compress context before Claude sees it
- Heavy (35B): complex analysis via SSD weight streaming
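A minimal sketch of how a tiered dispatcher like this could be wired. The model names, task labels, and routing order are my assumptions for illustration, not the exact setup described above:

```python
# Hypothetical three-tier router: send each task to the cheapest
# local model that can handle it. Model names and task sets are
# illustrative assumptions, not the author's actual config.
TIERS = [
    ("fast",    "qwen-2.3b",  {"classify"}),
    ("primary", "gemma-4.5b", {"compress", "summarize"}),
    ("heavy",   "moe-35b",    {"analyze"}),
]

def pick_model(task: str) -> str:
    """Return the cheapest tier's model that covers this task."""
    for _name, model, tasks in TIERS:
        if task in tasks:
            return model
    # Unknown tasks fall through to the heaviest tier.
    return TIERS[-1][1]
```

The point of checking tiers in cost order is that the fast 2.3B model absorbs the high-volume work (classification), so the heavy tier only wakes up when it's actually needed.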
Yes, on 16GB. The 35B trick: mmap flash-paging via llama.cpp. The model is a Mixture of Experts with only 3B parameters active per token; the OS pages the rest in from SSD on demand. 17 tokens/sec, 81% memory free, zero swap.
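A toy demonstration of why mmap makes this work, using Python's `mmap` module rather than llama.cpp itself: map a large file without loading it, then touch only a small slice. The OS faults in just the pages you read, which is the same mechanism that lets inactive expert weights stay on SSD:

```python
# Map a large (sparse) stand-in for a model file, then read only a
# tiny slice. Only the touched pages get loaded into RAM; the rest
# of the file never leaves disk. llama.cpp's default mmap loading
# relies on this same on-demand paging.
import mmap, os, tempfile

path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    f.truncate(256 * 1024 * 1024)  # sparse 256 MB file, no real I/O yet

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Read 4 KB from deep inside the file -- only these pages fault in.
    offset = 100 * 1024 * 1024
    expert_slice = mm[offset : offset + 4096]
    mm.close()

print(len(expert_slice))  # 4096
```

Same principle at model scale: per token, only the active experts' pages are resident, so the working set is far smaller than the full 35B weights.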
Apple published a paper about this approach back in 2023. Then Gemma 4 dropped, and classification went from 8.5s to 1.9s.
The swap touched 5 files. Result: ~30-40% fewer cloud API sessions. Plus a resilience chain that keeps the agent running when Claude is down.
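The resilience chain could look something like this. A hedged sketch: the backend names and the order (cloud first, then local tiers) are my assumptions about how such a fallback might be arranged:

```python
# Sketch of a resilience chain: try backends in priority order and
# return the first answer that succeeds. Backend names are illustrative.
def ask_with_fallback(prompt, backends):
    """backends: list of (name, callable). Falls through on any error."""
    errors = []
    for name, call in backends:
        try:
            return call(prompt)
        except Exception as e:  # outage, rate limit, network error...
            errors.append(f"{name}: {e}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))

# Usage: cloud API first, local model as the safety net.
def cloud_claude(prompt):
    raise ConnectionError("Claude is down")

def local_gemma(prompt):
    return f"[local] {prompt}"

answer = ask_with_fallback("triage inbox", [
    ("claude", cloud_claude),
    ("gemma-local", local_gemma),
])
print(answer)  # [local] triage inbox
```

The agent keeps running on local inference whenever the cloud backend errors out, which is the behavior described above.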