A Brief Refresher on Information Retrieval (IR) - Filtering, Ranking
Here is a short note on the IR pipeline. These days, large language models (LLMs, GPTs) need IR more and more (and the other way round too). I hope readers find this refresher useful; feedback welcome!
Problem statement: Given a user query, the goal of an IR engine is to retrieve, at blazing speeds, a short set of documents that appear relevant to the user.
Documents could be web pages, e-commerce products, annual reports, insurance policies, etc.
Broadly, the IR pipeline consists of two steps:
Candidate Generation - select (or filter) a set of candidate docs that appear relevant to the query, out of the ocean of docs. We need relatively lightweight methods here, because, you know, the ocean is big.
Ranking - rank the selected candidate docs (a smaller pile) so that the 'right' doc is top-most. It is ok to try hard here and use computationally heavier methods.
How do you implement this pipeline?
For text-only documents, here is how we implement these phases. If there is structured metadata per doc, these phases become more complex.
Candidate generation
Embed-and-match. Embed both query and docs. Match using efficient index databases.
(dense) vector embed-match (Faiss, Vector DBs)
sparse embed-match (BM25)
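To make the sparse side concrete, here is a minimal sketch using the rank_bm25 package (one of several BM25 implementations); the corpus and query below are toy placeholders.

```python
# Minimal sparse-match sketch with BM25 (rank_bm25 package).
# The corpus and query are made-up toy examples.
from rank_bm25 import BM25Okapi

corpus = [
    "annual report for fiscal year 2023",
    "home insurance policy terms and conditions",
    "wireless noise cancelling headphones",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]  # naive whitespace tokenization

bm25 = BM25Okapi(tokenized_corpus)

query = "insurance policy"
scores = bm25.get_scores(query.lower().split())  # one BM25 score per doc

# Take the top-2 docs by BM25 score as candidates
top_k = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)[:2]
print([corpus[i] for i in top_k])
```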
How do you embed text?
Chunk the docs into sentences or paragraphs. Use a neural encoder, e.g., sentence transformers. Usually, the query and the docs are embedded separately and matched via cosine similarity at run-time, which enables scalability over the ocean of docs.
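Here is a minimal dense embed-and-match sketch, assuming sentence-transformers and faiss are installed; the model name and the toy docs are just illustrative choices, not requirements.

```python
# Minimal dense embed-and-match sketch: embed docs offline, embed the query at
# run-time, match via cosine similarity in a Faiss index. Model name and docs
# are illustrative.
import faiss
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "annual report for fiscal year 2023",
    "home insurance policy terms and conditions",
    "wireless noise cancelling headphones",
]

# Embed docs once, offline; normalize so inner product == cosine similarity.
doc_vecs = encoder.encode(docs, normalize_embeddings=True).astype("float32")

index = faiss.IndexFlatIP(doc_vecs.shape[1])  # exact inner-product index
index.add(doc_vecs)

# Embed the query at run-time and fetch the top-2 nearest docs.
query_vec = encoder.encode(["insurance policy"], normalize_embeddings=True).astype("float32")
scores, ids = index.search(query_vec, 2)
print([(docs[i], float(s)) for i, s in zip(ids[0], scores[0])])
```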
Hybrid ways?
Each type of index database brings up a different set of document candidates. So, to ensure maximum recall, use a combination of them. A hybrid way of candidate generation is to combine vector and sparse matches, because vector match can be too semantic and sparse match too syntactic. See a piece I wrote earlier about this: medium.com/@ekshakhs/ha…
The term hybrid means different things to different people in the IR community. So, tread carefully.
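One simple, commonly used way to fuse the candidate lists coming from different indexes is Reciprocal Rank Fusion (RRF). It is just one option among many; the sketch below assumes you already have the two ranked lists of doc ids from the dense and sparse sketches above.

```python
# Reciprocal Rank Fusion (RRF): merge candidates from a dense index and a
# sparse index by summing 1/(k + rank) across lists. k=60 is the commonly
# used smoothing constant; the doc ids below are made up.
from collections import defaultdict

def rrf_merge(ranked_lists, k=60):
    """ranked_lists: list of lists of doc ids, each sorted best-first."""
    fused = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

dense_hits = ["doc_7", "doc_2", "doc_9"]   # from the vector index
sparse_hits = ["doc_2", "doc_5", "doc_7"]  # from BM25
print(rrf_merge([dense_hits, sparse_hits]))
```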
Some follow-up questions if you'd like to learn more:
How does search over vector embeddings work? How is it optimized?
How do vector databases work? What differentiates them?
Implementing ranking
We have a smaller set of retrieved candidate docs at this stage to match with the query q. So, we can try harder.
Simple ranking: build a q-doc scorer that, given a query and a doc, returns a relevance score. Sort docs by score.
Scoring q-doc pairs individually doesn't compare docs against each other. Do pairwise (q-doc1-doc2) or listwise scoring (q-[docs]) to compute a list of scores. Sort to rank.
Because it is ok to try harder here, most methods use ML.
earlier, decision trees (Gradient-Boosted Decision Trees)
now, more complex neural cross-attention encoders
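To make the neural option concrete, here is a minimal pointwise re-ranking sketch using a cross-encoder from sentence-transformers; the model name and candidate docs are illustrative.

```python
# Pointwise re-ranking with a neural cross-encoder: score each (query, doc)
# pair jointly, then sort by score. Model name and candidates are illustrative.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "insurance policy"
candidates = [
    "home insurance policy terms and conditions",
    "annual report for fiscal year 2023",
    "wireless noise cancelling headphones",
]

scores = reranker.predict([(query, doc) for doc in candidates])
ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
for doc, score in ranked:
    print(f"{score:.3f}  {doc}")
```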
Note that, to make the user happy, the ranking stage is your best bet. So try really hard — have multiple stages of (cascaded) ranking.
How good is an IR engine?
To measure the effectiveness of an IR engine, we use two key notions: Recall and Precision.
Recall: Did the IR engine retrieve all the relevant documents? Missing relevant docs => lower recall.
Precision: Of the docs that the engine retrieved, how many were actually relevant? Retrieving fewer irrelevant docs => higher precision.
In other words, retrieve all the relevant documents, and just the relevant ones.
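A toy example in Python, with made-up doc ids, to pin down the two definitions:

```python
# Toy precision/recall calculation over doc ids; the sets below are made up.
retrieved = {"doc_2", "doc_5", "doc_7", "doc_9"}
relevant = {"doc_2", "doc_3", "doc_7"}

true_positives = retrieved & relevant
precision = len(true_positives) / len(retrieved)  # 2/4 = 0.50
recall = len(true_positives) / len(relevant)      # 2/3 ≈ 0.67
print(f"precision={precision:.2f}, recall={recall:.2f}")
```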
We’re only getting started
There are many more aspects to IR on real-life docs: document metadata, matching with (nested) fields, ensuring diversity in query results, balancing relevance with doc popularity, etc. Query understanding/rewriting and document similarity can help with those.
Finally, remember that the true goal of this fancy pipeline is to make the end user, the one who queries, happy. Happiness is a very subjective emotion. In the above description, we’ve assumed that we can always delight the user by using linguistic matching methods (syntactic, semantic, frequency based). That is not true. The user intent, as they say, might be elsewhere.
The user might be looking for popular items, which may not be so relevant to the query terms.
The user might ask short, partial, or ambiguously worded queries. The engine may have no way to resolve this ambiguity by matching alone. However, if we collect historical user behavior data, there is hope.
and more …
In such cases, we have to bring in a fresh set of tactics. Thankfully, many of them can be fused into the (re-)ranking phase. Will cover those in another article.