Paolo Perrone (@paoloap):

I used to overthink which LLM to self-host.

Llama. Mistral. Gemma. Qwen. DeepSeek.

New models every week. Benchmarks everywhere. Analysis paralysis.

Here's the decision tree I use now:

→ Consumer GPU (16GB)? Mistral 7B or Llama 3 8B

→ Single 4090 (24GB)? Gemma 2 27B quantized

→ Coding focus? DeepSeek Coder or Qwen2.5-Coder

→ Multilingual? Qwen2.5

→ Apache 2.0 required? Mistral 7B or Qwen2.5 (most sizes)

→ Maximum performance? Llama 3.1 70B on multi-GPU
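
If it helps to see the hardware and task branches as code, here is a minimal sketch. The function name `pick_model`, its parameters, and the order in which branches are checked are my assumptions for illustration; the tree above doesn't specify a priority, and licensing is left out because it works better as a filter on top of whatever this returns.

```python
def pick_model(vram_gb: float, gpus: int = 1, coding: bool = False,
               multilingual: bool = False, max_performance: bool = False) -> str:
    """Return a starting-point model for a self-hosting setup (illustrative only)."""
    # Assumption: task-specific needs are checked before hardware tiers.
    if coding:
        return "DeepSeek Coder or Qwen2.5-Coder"
    if multilingual:
        return "Qwen2.5"
    # The largest option only makes sense with more than one GPU.
    if max_performance and gpus > 1:
        return "Llama 3.1 70B"
    # A single 24GB card has room for a quantized 27B model.
    if vram_gb >= 24:
        return "Gemma 2 27B (quantized)"
    # Default tier: 16GB-class consumer GPUs.
    return "Mistral 7B or Llama 3 8B"


if __name__ == "__main__":
    print(pick_model(vram_gb=16))
    print(pick_model(vram_gb=24))
    print(pick_model(vram_gb=24, coding=True))
```

Treat the output as a first candidate to benchmark on your own workload, not a final answer.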

The nuance:

1️⃣ 7-8B models are the sweet spot for most

Fast inference. Fits on gaming GPUs. Good enough for 80% of use cases.

2️⃣ Quantization is mandatory

4-bit GGUF or AWQ. Roughly 70% less weight memory than FP16 (8-bit gets you about 50%). Minimal quality loss for most tasks.

If you're not quantizing, you're wasting VRAM.
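
A back-of-the-envelope way to check the memory math, counting weights only. The helper `weight_memory_gb` is hypothetical, and the 4.5 effective bits per weight for a 4-bit format is my assumption (the extra half bit stands in for scales and metadata). KV cache and runtime overhead come on top of these numbers.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    # params_billion * 1e9 weights * (bits / 8) bytes each, divided by 1e9 bytes per GB.
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    ASSUMED_4BIT_BITS = 4.5  # assumption: 4-bit weights plus per-block scales
    for name, size_b in [("Llama 3 8B", 8), ("Gemma 2 27B", 27), ("Llama 3.1 70B", 70)]:
        fp16 = weight_memory_gb(size_b, 16)
        q4 = weight_memory_gb(size_b, ASSUMED_4BIT_BITS)
        saved = 100 * (1 - q4 / fp16)
        print(f"{name}: FP16 ~{fp16:.1f} GB, 4-bit ~{q4:.1f} GB ({saved:.0f}% smaller)")
```

Under those assumptions an 8B model drops from about 16 GB to about 4.5 GB of weights, and a 27B model lands near 15 GB, which is why it can fit on a 24GB card.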

3️⃣ Coding models are different beasts

DeepSeek Coder V2 and Qwen2.5-Coder beat general models of similar size on code tasks.

Use the right tool.

4️⃣ License matters for production

Llama and Gemma ship under custom licenses with usage restrictions. Mistral 7B and most Qwen2.5 sizes are Apache 2.0. Read the fine print.

5️⃣ Qwen2.5 is slept on

Best multilingual support. Strong reasoning. Alibaba keeps shipping.

The real advice:

Start with Llama 3 8B quantized.

Only upgrade when you hit a wall.

Most self-hosting failures come from overcomplicating the stack.

What model are you running locally? 👇

💾 Bookmark for your next self-hosting decision

♻️ Repost for someone still paying API costs for simple tasks

May 8

2:14 PM