🚨 It’s 2025 and your RAG is burning time and money
What if you could make it 30× faster and cheaper?
❌ Where RAG breaks down
↳ Retrieved passages are often irrelevant, yet every token still costs compute
↳ Longer contexts mean quadratic attention cost and slower first tokens
↳ KV-cache memory explodes as context grows
—
That’s where ReFrag comes in.
It rethinks how context is handled without sacrificing accuracy.
—
🛠️ How ReFrag works
↳ Compresses retrieved text into compact chunk-level embeddings
↳ Precomputes embeddings once per chunk → reused across every query
↳ Expands chunks back into full tokens only when the query needs them
↳ Keeps the model's normal decoding intact
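The steps above can be sketched in a few lines. This is a simplified toy, not the paper's implementation: mean pooling stands in for the lightweight chunk encoder, and cosine similarity stands in for the learned policy that decides which chunks to expand. All names, sizes, and the tiny vocabulary are illustrative assumptions.

```python
import numpy as np

# Toy token-embedding table (stand-in for the LLM's embedding matrix).
rng = np.random.default_rng(0)
VOCAB = {w: rng.standard_normal(8) for w in
         ["retrieval", "augmented", "generation", "is", "fast", "cheap"]}

def chunk_embedding(tokens):
    """Compress a chunk into ONE vector (mean pooling is an assumption;
    REFRAG uses a trained lightweight encoder)."""
    return np.mean([VOCAB[t] for t in tokens], axis=0)

# Precompute once per chunk → reuse across every query.
chunks = [["retrieval", "augmented"], ["generation", "is"], ["fast", "cheap"]]
cache = {i: chunk_embedding(c) for i, c in enumerate(chunks)}

def build_context(query_vec, expand_budget=1):
    """Expand only the most query-relevant chunks back into full tokens;
    the rest enter the decoder as single compressed vectors.
    (Cosine similarity replaces the paper's learned expansion policy.)"""
    scores = {i: float(np.dot(query_vec, v)
                       / (np.linalg.norm(query_vec) * np.linalg.norm(v)))
              for i, v in cache.items()}
    expand = set(sorted(scores, key=scores.get, reverse=True)[:expand_budget])
    context = []
    for i, c in enumerate(chunks):
        if i in expand:
            context.extend(VOCAB[t] for t in c)  # full token embeddings
        else:
            context.append(cache[i])             # one vector per chunk
    return context

ctx = build_context(VOCAB["fast"], expand_budget=1)
print(len(ctx))  # 6 token positions shrink to 4: 2 expanded + 2 compressed
```

The win: the decoder's input sequence shrinks, so time-to-first-token and KV-cache memory drop, while the chunks that actually matter still arrive as real tokens.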
—
⚡ Performance gains
↳ 30.8× faster time-to-first-token (TTFT)
↳ 16× longer context with no accuracy loss
↳ 3.75× faster than prior SOTA methods
↳ Accuracy holds across RAG, summarization & chat
—
💡 Why it matters
↳ Faster responses → smoother user experience
↳ Smaller memory use → reduced infra costs
↳ Cached embeddings → simpler deployments
—
♻️ Restack to help someone learn AI the right way