Using AutoRound, the quantization process takes about 1.5 hours on an H100 SXM. Including a zero-shot MMLU evaluation to ensure the model isn’t broken, the total cost is under $10 when using RunPod.
The resulting 4-bit model matches the accuracy of the original, at least according to MMLU, while being 3.5x smaller. However, I strongly recommend conducting a more extensive evaluation before deploying the model. Generative task evaluations (e.g., IFEval) can offer deeper insights into the model's performance, though they are significantly more expensive to run.
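A back-of-the-envelope sketch shows why a 4-bit model usually ends up somewhat less than 4x smaller than a 16-bit one: each quantization group stores extra metadata, and some tensors (typically embeddings and the output head) stay in 16-bit. Everything here is an assumption for illustration: a 16-bit original checkpoint, a group size of 128, one 16-bit scale and one 4-bit zero point per group, and hypothetical parameter counts in the example call. None of these values are measured from the model discussed above.

```python
def quantized_size_ratio(quantized_params, fp16_params, bits=4,
                         group_size=128, scale_bits=16, zero_bits=4):
    """Return how many times smaller the quantized checkpoint is.

    quantized_params: number of weights stored at `bits` precision
    fp16_params: number of weights left in 16-bit (assumed, e.g. embeddings)
    """
    # Size of the original model, assuming every weight is stored in 16 bits.
    original_bits = 16 * (quantized_params + fp16_params)
    # Each group of weights shares one scale and one zero point,
    # so the metadata cost is spread over `group_size` weights.
    bits_per_weight = bits + (scale_bits + zero_bits) / group_size
    quantized_bits = bits_per_weight * quantized_params + 16 * fp16_params
    return original_bits / quantized_bits


if __name__ == "__main__":
    # Hypothetical split: 7.5B quantized weights, 0.5B kept in 16-bit.
    print(f"{quantized_size_ratio(7.5e9, 0.5e9):.2f}x smaller")
```

The more parameters are left unquantized, the further the ratio drops below the ideal 4x, which is one reason to check the size of the exported checkpoint rather than rely on the nominal bit width.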
I also tried many different configurations for 2-bit and 3-bit quantization, but they all broke the model. I'll try other algorithms later this week.
HQQ followed by some short fine-tuning could work.