I used to overthink which LLM to self-host.
Llama. Mistral. Gemma. Qwen. DeepSeek.
New models every week. Benchmarks everywhere. Analysis paralysis.
Here's the decision tree I use now:
→ Consumer GPU (16GB)? Mistral 7B or Llama 3 8B
→ Single 4090 (24GB)? Gemma 2 27B quantized
→ Coding focus? DeepSeek Coder or Qwen2.5-Coder
→ Multilingual? Qwen2.5
→ Apache 2.0 required? Mistral 7B or Qwen2.5 (check the size; a few Qwen2.5 variants use a different license)
→ Maximum performance? Llama 3.1 70B on multi-GPU
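If it helps to see the hardware and task branches as code, here is a minimal Python sketch. The function name `pick_model`, its parameters, and the order the branches are checked in are assumptions made for illustration, not an official tool. Licensing is left out on purpose: that one is a reading exercise, not a sizing one.

```python
def pick_model(
    vram_gb: float,
    coding: bool = False,
    multilingual: bool = False,
    num_gpus: int = 1,
) -> str:
    """Return a starting-point model suggestion from the checklist above.

    Branch order is an assumption: task focus is checked first,
    then hardware size.
    """
    if coding:
        return "DeepSeek Coder or Qwen2.5-Coder"
    if multilingual:
        return "Qwen2.5"
    if num_gpus > 1:
        # Maximum performance path: needs more than one GPU.
        return "Llama 3.1 70B"
    if vram_gb >= 24:
        # Single 24GB card, e.g. a 4090.
        return "Gemma 2 27B (quantized)"
    # Typical 16GB consumer GPU.
    return "Mistral 7B or Llama 3 8B"


if __name__ == "__main__":
    print(pick_model(vram_gb=16))
    print(pick_model(vram_gb=24))
    print(pick_model(vram_gb=24, coding=True))
```

Treat the output as a first guess to test, not a verdict.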
The nuance:
1️⃣ 7-8B models are the sweet spot for most
Fast inference. Fits on gaming GPUs. Good enough for 80% of use cases.
2️⃣ Quantization is mandatory
4-bit GGUF or AWQ. Around 70% less memory for weights versus FP16. Small quality loss for most tasks.
If you're not quantizing, you're wasting VRAM.
3️⃣ Coding models are different beasts
DeepSeek Coder V2 and Qwen2.5-Coder beat similarly sized general models on code tasks.
Use the right tool.
4️⃣ License matters for production
Llama ships under Meta's community license, with restrictions. Gemma has its own terms of use, not Apache 2.0. Mistral 7B is Apache 2.0. Read the fine print.
5️⃣ Qwen2.5 is slept on
Best multilingual support. Strong reasoning. Alibaba keeps shipping.
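To put numbers on point 2️⃣, here is a rough Python sketch of weight memory at different precisions. It counts weights only (no KV cache or runtime overhead) and treats 1 GB as 10^9 bytes. The 4.5 bits-per-weight figure for a 4-bit file with its scales is an assumption; real formats vary. `weight_memory_gb` and `reduction_pct` are hypothetical helper names.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def reduction_pct(bits_per_weight: float, baseline_bits: float = 16.0) -> float:
    """Percent memory saved versus a baseline precision (FP16 by default)."""
    return 100.0 * (1.0 - bits_per_weight / baseline_bits)


if __name__ == "__main__":
    # Assumed effective bits per weight, not measured values.
    for label, bits in [("FP16", 16.0), ("8-bit", 8.0), ("4-bit + scales", 4.5)]:
        size = weight_memory_gb(8, bits)  # an 8B-parameter model
        saved = reduction_pct(bits)
        print(f"{label}: {size:.1f} GB weights, {saved:.0f}% saved vs FP16")
```

Under these assumptions an 8B model drops from about 16 GB at FP16 to about 4.5 GB, which is why it fits comfortably on a 16GB card with room left for context.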
The real advice:
Start with Llama 3 8B quantized.
Only upgrade when you hit a wall.
Most self-hosting failures come from overcomplicating the stack.
What model are you running locally? 👇
💾 Bookmark for your next self-hosting decision
♻️ Repost for someone still paying API costs for simple tasks