Ask HN: Most efficient way to fine-tune an LLM in 2024?
114 points by holomorphiclabs 41 days ago | 48 comments
In April 2024, what is the most efficient way to fine-tune an LLM?

In particular we are trying to understand performance vs. cost trade-offs. We don't have a budget to train from scratch.

We are working with a proprietary data set on the order of 100M tokens and are looking to fine-tune a general-purpose language model and also create task-specific models based on the same corpus.

Any help would be appreciated!




QLoRA + axolotl + a good foundation model (Llama/Mistral/etc., usually instruction fine-tuned) + RunPod works great.

A single A100 or H100 with 80GB of VRAM can fine-tune 70B open models (and obviously scaling out to many nodes/GPUs is faster, or you can use much cheaper GPUs for fine-tuning smaller models).
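For concreteness, the core of the QLoRA setup with transformers + peft + bitsandbytes looks roughly like this (just a sketch; the model id and hyperparameters are placeholders, and axolotl wraps these same pieces behind a YAML config):

  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
  from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

  base = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder; any Llama/Mistral-style model
  bnb = BitsAndBytesConfig(
      load_in_4bit=True,
      bnb_4bit_quant_type="nf4",
      bnb_4bit_compute_dtype=torch.bfloat16,
  )
  model = AutoModelForCausalLM.from_pretrained(base, quantization_config=bnb, device_map="auto")
  tokenizer = AutoTokenizer.from_pretrained(base)

  model = prepare_model_for_kbit_training(model)
  lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
                    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
  model = get_peft_model(model, lora)  # only the small LoRA adapters get gradients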

The localllama Reddit sub at https://www.reddit.com/r/LocalLLaMA/ is also an awesome community for the GPU poor :)


Can consumer systems like a single RTX 3090, or 4x RTX 3090, achieve something here?

Have you seen any benchmarks?


Thank you! And yes, huge fan of r/LocalLLaMA :)


You probably want to build a retrieval-augmented generation (RAG) pipeline.

If you do end up wanting to fine-tune, use QLoRA with axolotl or unsloth to prove your hypothesis on a smaller model, and then evaluate whether you want the marginal gains you'd get from full-precision training.

After you fine-tune it on the 100M-token dataset, use DPO to polish it off. You need to create a DPO dataset for that, but it can be relatively small and still give some great gains.
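To give a feel for the DPO step, here is a rough sketch with trl (the preference rows are made up, model/tokenizer are the fine-tuned model from the previous step, and exact kwargs vary between trl versions):

  from datasets import Dataset
  from transformers import TrainingArguments
  from trl import DPOTrainer

  prefs = Dataset.from_dict({            # tiny illustrative preference set
      "prompt":   ["Summarise this support ticket: ..."],
      "chosen":   ["A concise, factually correct summary."],
      "rejected": ["A vague summary that invents details."],
  })
  trainer = DPOTrainer(
      model=model,                       # your QLoRA fine-tuned model
      ref_model=None,                    # with a PEFT model, the frozen base is reused as reference
      beta=0.1,
      args=TrainingArguments(output_dir="dpo-out", per_device_train_batch_size=1,
                             num_train_epochs=1),
      train_dataset=prefs,
      tokenizer=tokenizer,
  )
  trainer.train()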

After that, look at applying grammars during inference if you are expecting structured results like JSON.
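For the grammar part, one way to do it is llama.cpp's GBNF grammars via llama-cpp-python; something like this (assuming the json.gbnf grammar file that ships with llama.cpp, and a placeholder GGUF path):

  from llama_cpp import Llama, LlamaGrammar

  grammar = LlamaGrammar.from_file("json.gbnf")        # grammar file from the llama.cpp repo
  llm = Llama(model_path="my-finetune.q4_k_m.gguf")    # placeholder path to your quantised model
  out = llm("Extract the order details as JSON: ...", grammar=grammar, max_tokens=256)
  print(out["choices"][0]["text"])                     # sampling is constrained to valid JSON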

You should be able to run the experiments on 4090s from vast.ai, RunPod, or a similar service.

It can cost less than $100 depending on your requirements.


This is great advice!

I'd like to add that if you don't have pairwise preference data (A > B) but do have binary data (A is good for x_1, B is good for x_2, etc.), then Kahneman-Tversky Optimization (KTO) might be a better fit. Despite learning from a weaker signal, it works as well as or better than DPO in practice.
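For a feel of what the binary-feedback data looks like, trl also ships a KTOTrainer; a rough sketch (the rows and kwargs are illustrative, model/tokenizer are your fine-tuned model, and the API has shifted a bit between versions):

  from datasets import Dataset
  from trl import KTOConfig, KTOTrainer

  data = Dataset.from_dict({               # unpaired feedback: just good/bad labels
      "prompt":     ["Classify this ticket: ...", "Classify this ticket: ..."],
      "completion": ["billing",                   "the weather is nice"],
      "label":      [True,                        False],
  })
  trainer = KTOTrainer(
      model=model,
      ref_model=None,
      args=KTOConfig(output_dir="kto-out", per_device_train_batch_size=1),
      train_dataset=data,
      tokenizer=tokenizer,
  )
  trainer.train()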


Do you have any tutorials to achieve all this? Thanks.


I know this is maybe not the answer you want, but if you are just interested in getting the job done, there are companies that are experts at this, for example:

https://fortune.com/2024/03/11/adaptive-startup-funding-falc...


Also interested in this. Does this task really require such specialized knowledge?


The first thing that is required is to define what they are trying to do. In other words, list some question and answer examples. It's amazing how many people are unwilling or unable to do this and just jump to "we need to train a custom model". To do what exactly, or answer what kinds of questions? I have actually had multiple clients refuse to do that.


Very good point. I totally agree with you.


For my ChillTranslator project I spent maybe a few dollars fine-tuning Phi-2 to generate less spicy variations of inflammatory Hacker News comments, with very little data (especially compared to your 100M tokens), to see how well it worked. I'll improve it when I have time. I mostly followed the Brev fine-tune tutorial, but I wanted a 2 GB GGUF-quantised model I could run on any device with a specific JSON grammar. It uses Transformers PEFT and QLoRA. I haven't tried Axolotl or OpenPipe yet, but I hope to. Actual compute time is probably much less than the time I spent: I wasted time dealing with drivers, figuring out how to merge the fine-tuned weights, serialising to old-fashioned pickle rather than safetensors, and working out how to convert to GGUF, quantise it, and rsync it.
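In case it saves someone else the merge headache, the peft side of it is roughly this (a sketch; paths are placeholders, and the llama.cpp conversion script name varies by version):

  from transformers import AutoModelForCausalLM, AutoTokenizer
  from peft import PeftModel

  base = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype="auto")
  merged = PeftModel.from_pretrained(base, "path/to/lora-adapter").merge_and_unload()
  merged.save_pretrained("phi2-merged")                # merged full weights, ready for conversion
  AutoTokenizer.from_pretrained("microsoft/phi-2").save_pretrained("phi2-merged")
  # then convert and quantise with llama.cpp, e.g.:
  #   python convert-hf-to-gguf.py phi2-merged && ./quantize ...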


A bit late, but Unsloth makes LoRA / QLoRA finetuning 2x faster and reduces VRAM by 80% with 0% degradation in accuracy! (no approximations are done!)

Mistral 7b is 2x faster than HuggingFace + Flash Attention 2. Gemma 7b is 2.4x faster than HF + FA2.

Check out https://github.com/unslothai/unsloth for full benchmarks!
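For anyone who wants to see the shape of the API, a rough usage sketch (the model name and LoRA settings follow the quickstart examples in the repo; see the README for the full training loop):

  from unsloth import FastLanguageModel

  model, tokenizer = FastLanguageModel.from_pretrained(
      model_name="unsloth/mistral-7b-bnb-4bit",
      max_seq_length=2048,
      load_in_4bit=True,
  )
  model = FastLanguageModel.get_peft_model(
      model,
      r=16,
      lora_alpha=16,
      target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
  )
  # then train as usual, e.g. with trl's SFTTrainer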


The approach I see used is axolotl with QLoRA on cloud GPUs, which can be quite cheap.

https://github.com/OpenAccess-AI-Collective/axolotl

Someone from one of the cloud GPU vendors wrote a guide: https://brev.dev/blog/fine-tuning-mistral


I recommend reviewing Stanford's dspy library - great examples of few-shot learning; it works by generating and tuning prompts for LLMs and even distilling instruction-following tasks to smaller models like T5. Second, as others mentioned, use QLoRA for supervised fine-tuning followed by DPO/KTO for preference optimization; this strategy placed Hugging Face's Zephyr and IBM's Neural Chat on the leaderboards for 7B-parameter models. I also recommend reviewing the Unsloth library, which has excellent accelerated examples of using these methods, along with the axolotl library. Lastly, SkyPilot and Modal both have excellent examples that showcase using axolotl to efficiently fine-tune models on cloud GPUs.

[1] https://github.com/stanfordnlp/dspy

[2] https://github.com/unslothai/unsloth

[3] https://github.com/OpenAccess-AI-Collective/axolotl

[4] https://github.com/skypilot-org/skypilot

[5] https://github.com/modal-labs/llm-finetuning


I looked at dspy last week and was trying to wrap my head around how it would be useful for a "fine-tune" style use case - where I would want to give the base model more context, vs. using a vector DB and having the model put together a result.

Could you give a high-level way to think about how to use dspy for something like this?


I think of dspy as a programmatic way to guide LLMs with information, whether from context based on retrieval or from input and output pairs, rather than traditional low-rank fine-tuning. Their readme has a high-level introduction to using RAG with a user defined way to pass relevant context. I also found their link to Weaviate's notebooks, where dspy is used with a vector DB, helpful in understanding an end-to-end workflow: [1] https://github.com/weaviate/recipes/tree/main/integrations/d...
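To make that concrete, a minimal dspy RAG module looks roughly like this (it assumes you have already called dspy.settings.configure with an LM and a retriever backed by your vector DB):

  import dspy

  class RAG(dspy.Module):
      def __init__(self, num_passages=3):
          super().__init__()
          self.retrieve = dspy.Retrieve(k=num_passages)            # pulls passages from the configured retriever
          self.generate = dspy.ChainOfThought("context, question -> answer")

      def forward(self, question):
          context = self.retrieve(question).passages
          return self.generate(context=context, question=question)

  # rag = RAG(); print(rag(question="...").answer)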


A possible alternative to fine-tuning is in-context learning, especially if you are using a model with long context where you can provide a lot of examples. Models can do one/few-shot learning, but in-context learning improves the more examples you give. You could experiment cheaply with Claude Haiku to see if this works for you.
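A cheap way to test that is to pack a bunch of your curated examples straight into the prompt. A sketch with the anthropic SDK (the model id was current as of early 2024; labelled_pairs and new_input are hypothetical placeholders for your data):

  import anthropic

  client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
  labelled_pairs = [("example input 1", "example output 1"),
                    ("example input 2", "example output 2")]   # your curated examples
  new_input = "the case you actually want answered"

  shots = "\n\n".join(f"Input: {x}\nOutput: {y}" for x, y in labelled_pairs)
  msg = client.messages.create(
      model="claude-3-haiku-20240307",
      max_tokens=512,
      messages=[{"role": "user", "content": f"{shots}\n\nInput: {new_input}\nOutput:"}],
  )
  print(msg.content[0].text)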


Fine-tuning a LoRA-based adapter using a tool like predibase.com does this really fast. If you want to go fully open source and have your own hardware, you can do the same thing yourself using a Ludwig + LoRAX stack.


What's your measure of performance?

There's no one-size-fits-all answer yet, but if you just want to test it out there are many commercial offerings on which you should be able to get some results for under $10k.


Are there any that are recommended? Honestly, we would rather not share data with any third-party vendors. It's been a painstaking process to curate it.


Apologies if this is off topic, but could anyone please point me to a resource on best practices for implementing RAG with proprietary LLMs like GPT?


I don't know if these are best practices, but I found two tutorials:

* Example 1 (Claude and MongoDB's vector database): https://www.mongodb.com/developer/products/atlas/rag_with_cl...

* Example 2 (Mistral and the Faiss vector database or other embedding frameworks): https://docs.mistral.ai/guides/basic-RAG/


I understand the methods to address the fine-tuning and RAG issues, but I lack the time and possibly the technical skills to implement the solution. Fine-tuning can potentially dumb down a perfectly good model, and RAG has context limitations and may not cover all the content. My thinking: we should vectorize the text and embed these vectors into all layers of the model at inference time. This approach would bypass the context-size limitations and the resource wastage associated with fine-tuning, since vectorization is fast. I believe this vectorization-and-embedding strategy is the solution.


What LLM are you hoping to use? Have you considered using HelixML? If I am reading you right, the primary concern is compute costs, not human-time costs?


We are finding there is a trade-off between model performance and hosting costs post-training. The optimal outcome is a model that performs well on next-token prediction (and some other in-house tasks we've defined) and that we can ultimately host on the lowest-cost hosting provider rather than being locked in. I think we'd only go the proprietary-model route if the model really was that much better. We're just trying to save ourselves weeks/months of benchmarking time/costs if there is already an established option in this space.


That said, I think dvt's comment about RAG likely being what you need rather than fine-tuning is helpful, but I wanted to offer something in case you know fine-tuning is what you need.



Thank you, we have been exploring this.


I think you may be misunderstanding what fine-tuning does. It does not teach the model new knowledge. In fact, Meta has a paper out that argues you only need a data set of about 1,000 examples[1] to achieve pretty good alignment (fine-tuning) results. (100M is way overkill.) For knowledge retrieval, you need RAG (usually using the context window).

[1] https://arxiv.org/pdf/2305.11206.pdf


This is not correct. Fine-tuning can absolutely add new knowledge to a model. It's been repeatedly demonstrated at this point.

LIMA demonstrated that instruction-tuning and output formatting could be trained with a limited number of samples, not that finetuning was incapable of adding new information to the model.

It may be suboptimal compared to RAG in most cases, but it does work.


Do you have any good links to support the idea that this has been repeatedly demonstrated?

I've had trouble finding high quality sources of information about successful applications of fine-tuning to add knowledge to a model.


Here is a recent HN discussion of an article that talks about this. https://news.ycombinator.com/item?id=39748537

Anecdotally, I literally "added knowledge" to a model via fine-tuning earlier today.

Fine-tuning can do extremely well: given a specific question and answer, the tuned model "knows" how to answer that question much more accurately.

I gave it a specific question and a good answer as the fine-tuning input. (Literally two data points as the input: two question/answer sets.)

I asked it that question, and the tuned model blows the base model away, for answering that specific question.


> I asked it that question, and the tuned model blows the base model away, for answering that specific question.

Validating on training data...What could possibly go wrong?


This thread reminds me of a competition I once joined where we were supposed to fine-tune an LLM to fill out trivia answers, and we were expressly disallowed from training on the validation set.

However: we were allowed to pick any base model in a given repo. All of the teams that “won” did so for the same reason: they had all picked the same base model (whereas a majority of teams picked the given default), presumably the one that had at some point been trained on the most favorable data for this particular challenge.

It was quite silly. Had everyone had the same base model, we'd have had a bit more of an interesting problem (more around NLP and alignment than picking the 'best' model).


Well, in this case we're literally asking whether the model can remember new facts, not generalize, so it seems like a legitimate first-level test; a second level might be: can it answer a broader question that incorporates that specific knowledge?


Our findings are that RAG does not generalize well when critical understanding is shared over a large corpus of information. We do not think it is a question of either context length or retrieval. In our case it is very clearly about capturing understanding within the model itself.


Does that mean you tested on specific questions? Take 1-5 typical queries and test them with a properly configured LlamaIndex.
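Something like this is enough for that sanity check (a sketch assuming the post-0.10 llama-index package layout; the corpus directory and queries are placeholders):

  from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

  docs = SimpleDirectoryReader("corpus/").load_data()        # your proprietary documents
  index = VectorStoreIndex.from_documents(docs)
  engine = index.as_query_engine(similarity_top_k=5)

  for q in ["typical query 1", "typical query 2"]:           # your 1-5 real queries
      print(q, "->", engine.query(q))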

If your documents repeat the same information in several different ways, then you actually might get something out of LoRA on raw documents. But you need a way to measure it, and you have to verify with real tests first that RAG won't work.

To do effective training with LoRA, though, and expect it to pick up most of the information reliably, you need to cover the knowledge and skills with multiple question-answer pairs for each item you expect it to learn. You can then use those QA pairs to validate that it learned those things.

But it's a lot of QA pair generation.


Depending on the application, you would do continued pretraining over new tokens to gain new knowledge. 100M tokens is applicable here.

You would fine-tune, certainly, for domain-specific tasks, and would curate a subset of the 100M tokens. The total token count in the referenced alignment studies is around 1,000,000.

RAG is a hacky way to interpolate new knowledge with a base model. It is not always reliable, nor is it easy to integrate into task-specific workflows.


When I first played with RAG I thought "wow, this is so cool". Now I'm starting to think it's kind of useless, in the sense that the critical bit is the initial search, and that doesn't use the LLM's power; at most the LLM is used to capture the user intent and reformulate the query.

We're building some "smart search" functionality for some teams, and I'm starting to wonder if a traditional search results list (i.e. sans the LLM, or with it used only to rewrite the user query) with the document chunks wouldn't be better than blindly taking the top N and feeding them to the LLM to produce some response.

E.g. we have some docs about specific supermarket chains, but the word "supermarket" might not appear at all in them, while the user query might be "show me what we have about supermarkets". Now the embeddings will hopefully place the word "supermarket" close to, say, "Costco", but they might also place it closer to "shopping center", and we might have docs about shopping centers that rank higher. So we might take the top 5 docs and send them to the LLM, but the docs the user was after might have been in 7th and 9th position, nowhere to be seen by the LLM or the user.


I've worked in scaled enterprise search, both with lexical search engines (Lucene-based, e.g. Elasticsearch) and semantic search engines (vector retrieval).

Vector retrieval that isn't contextualized in the domain is usually bad (RAG solutions call this "naive RAG"... and make up for it with funky chunking and retrieval ensembles). Training custom retrievers and rerankers is often key, but it's quite an effort and still hard to generalize in a domain with broad knowledge.

Lexical searching provides nice guarantees and deterministic control over results (depending on how you index). Advanced querying capability is certainly useful here. Constructing/enriching queries with transformers is cool.

Reranking is often a nice ensemble addition, and it can be done with smaller models.


> We’re building some “smart search” functionality for some teams and I start to wonder if a traditional search results list (i.e. sans the LLM, or used only to rewrite the user query) with the document chunks wouldn’t be better than blindly taking the top N and feeding them to the LLM to produce some response.

Yep, it's a pretty common pattern: query -> embeddings -> vector db -> records -> context -> LLM -> result.


Yes that’s basically the RAG pattern, but I’ve edited my comment to elaborate a bit. I’m questioning what the LLM brings to the table vs just showing the search results (a long list not limited by context length) to the user.

The LLM doesn’t even get the full docs most of the time, just chunks. It has a very narrow view so its full power is not used.


Another approach is to take the user query, have the LLM guess the answer and use that guessed answer for the RAG step.


Question: RAG by definition offloads the retrieval to a vector similarity search via an embeddings DB (FAISS, kNN, et al.).

What is the preferred way to feed documents/knowledge into a model so that the primary retrieval is done by the LLM, and perhaps use a vector DB only for information enhancement (a la onebox)?


If I understand the problem correctly, you'd like to feed xMM documents directly into an LLM so that it uses this context to "reason" out answers to questions, vs. offloading the retrieval to a vector DB and merely assembling results into an "answer"?

And since your dataset is large, even the longest context windows are insufficient.


Single-GPU, optimal efficiency: Unsloth + QLoRA + Mistral-7B on RunPod/Vast/Lambda.

Blazing fast compared to out-of-the-box Transformers; also make sure to use FlashAttention if you have A100s or better and context length >= 2k.

Add FAISS (https://github.com/facebookresearch/faiss) if you need fast local RAG
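For reference, the FAISS side of a local RAG loop is only a few lines. A sketch assuming sentence-transformers for embeddings (the chunks and query are placeholders; the final LLM call is whatever you serve the fine-tuned model with):

  import faiss
  from sentence_transformers import SentenceTransformer

  embedder = SentenceTransformer("all-MiniLM-L6-v2")         # small local embedding model
  chunks = ["chunk one of your corpus", "chunk two", "..."]  # placeholder document chunks
  vecs = embedder.encode(chunks, normalize_embeddings=True)

  index = faiss.IndexFlatIP(vecs.shape[1])                   # inner product == cosine on normalized vectors
  index.add(vecs)

  q = embedder.encode(["user question"], normalize_embeddings=True)
  scores, ids = index.search(q, 3)
  context = "\n".join(chunks[i] for i in ids[0])             # feed this as context to the LLM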


Interested


I was just gonna ask this question and saw this at the top of Ask. Interested.



