4-bit Llama 3.3 70B: Same accuracy, 3.5x smaller

Using AutoRound, quantization takes about 1.5 hours on an H100 SXM. Including a zero-shot MMLU evaluation to check that the model isn’t broken, the total cost is under $10 on RunPod.

The resulting 4-bit model matches the accuracy of the original, at least according to MMLU, while being 3.5x smaller. However, I strongly recommend conducting a more extensive evaluation before deploying the model. Generative task evaluations (e.g., IFEval) can offer deeper insights into the model's performance, though they are significantly more expensive to run.
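
As a rough sanity check on the "3.5x smaller" figure, here is a back-of-the-envelope size estimate. It is only a sketch under assumptions that are not stated in the post: about 70.6B total parameters, a vocabulary of 128,256 and a hidden size of 8,192, the embedding and output head kept in 16-bit, a group size of 128, and one 16-bit scale plus one 4-bit zero point stored per group. The actual checkpoint may use different settings.

```python
def quantized_size_gb(total_params, unquantized_params, bits=4,
                      group_size=128, scale_bits=16, zero_bits=4):
    """Estimate checkpoint size in GB for group-wise weight quantization.

    All defaults are illustrative assumptions, not the settings
    confirmed for the model in the post.
    """
    quantized_params = total_params - unquantized_params
    # Each group of weights shares one scale and one zero point,
    # which adds a small per-weight overhead on top of `bits`.
    bits_per_weight = bits + (scale_bits + zero_bits) / group_size
    quantized_bytes = quantized_params * bits_per_weight / 8
    unquantized_bytes = unquantized_params * 2  # kept in 16-bit
    return (quantized_bytes + unquantized_bytes) / 1e9


TOTAL_PARAMS = 70.6e9              # assumed parameter count
EMBED_AND_HEAD = 2 * 128256 * 8192  # assumed embedding + output head, left in 16-bit

fp16_gb = TOTAL_PARAMS * 2 / 1e9
int4_gb = quantized_size_gb(TOTAL_PARAMS, EMBED_AND_HEAD)
ratio = fp16_gb / int4_gb

print(f"16-bit: {fp16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB, ratio: {ratio:.2f}x")
```

Under these assumptions the ratio comes out close to 3.5x rather than the naive 4x, because the per-group metadata and the unquantized layers both add to the 4-bit checkpoint.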

I also tried many different configurations for 2-bit and 3-bit quantization, but they all broke the model. I'll try other algorithms later this week.

HQQ followed by a short fine-tuning run could work for these lower bit widths.

You can download the 4-bit Llama 3.3 here:

huggingface.co/kaitchup…

More details here:

Quantize and Run Llama 3.3 70B Instruct on Your GPU
Dec 9, 2024 at 4:54 PM