Give me 60 seconds, and I'll teach you how Cursor indexes code (no BS):
It took 12 months for Cursor to reach $100M ARR.
Cursor IDE is the best I've ever seen.
Here’s how Merkle trees make it possible:
0. Merkle trees 101:
↳ Hash trees that fingerprint data blocks hierarchically
↳ Leaf nodes = hash of code chunks
↳ Parent nodes = hash of child hashes
↳ Root hash = single fingerprint for the entire codebase
Key benefit: Matching root hashes mean nothing changed; a mismatch is traced down the tree to the exact chunks that did.
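To make the leaf / parent / root idea concrete, here is a minimal sketch in Python. It assumes a simple binary tree over an ordered list of chunks; the function names are illustrative, and Cursor's actual tree layout is not public in this level of detail.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def merkle_root(chunks: list[str]) -> str:
    """Return the root hash for an ordered list of code chunks."""
    if not chunks:
        return sha256_hex(b"")
    # Leaf nodes: one hash per code chunk.
    level = [sha256_hex(chunk.encode("utf-8")) for chunk in chunks]
    # Parent nodes: hash of the concatenated child hashes, until one remains.
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            pair = level[i:i + 2]  # the last node may have no sibling
            next_level.append(sha256_hex("".join(pair).encode("ascii")))
        level = next_level
    return level[0]
```

Editing a single chunk changes its leaf hash, every parent above it, and therefore the root, which is why one root comparison is enough to know whether anything changed.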
1. Code chunking strategies:
↳ AST-based splitting: Uses tree-sitter to parse code into logical blocks (functions, classes)
↳ Token limits: Merge sibling AST nodes without exceeding model token caps (e.g., 8k for OpenAI)
↳ Semantic boundaries: Avoid mid-function splits for better embeddings
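A rough sketch of the chunking idea, with two loud assumptions: it uses Python's built-in ast module as a stand-in for tree-sitter (so it only handles Python source), and it counts whitespace-separated words instead of real model tokens. The names chunk_source and estimate_tokens are made up for illustration.

```python
import ast


def estimate_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: count whitespace-separated words.
    return len(text.split())


def chunk_source(source: str, max_tokens: int) -> list[str]:
    """Split source at top-level AST nodes, merging siblings under max_tokens."""
    tree = ast.parse(source)
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for node in tree.body:
        segment = ast.get_source_segment(source, node)
        if segment is None:
            continue
        size = estimate_tokens(segment)
        # Start a new chunk when adding this sibling would exceed the cap.
        if current and current_tokens + size > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(segment)
        current_tokens += size
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because splits only happen between top-level nodes, a function or class is never cut in half. The trade-off in this sketch: a single node larger than the cap becomes its own oversized chunk, and comments between nodes are dropped.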
2. Merkle tree construction:
↳ Local hashing: Compute SHA-256 hashes for all code chunks
↳ Tree sync: Compare root hash with server to identify changed files
↳ Incremental uploads: Only modified chunks get re-embedded
Result: 90% fewer uploads vs. full re-indexing.
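The sync step can be sketched like this, assuming a flattened two-level tree (file hashes under one root) and a server that can return its own path-to-hash map. A real implementation would recurse through directory nodes instead of comparing every file; all names here are hypothetical.

```python
import hashlib


def file_hash(content: str) -> str:
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def root_hash(file_hashes: dict[str, str]) -> str:
    # Sort paths so the root is independent of dictionary order.
    joined = "".join(
        f"{path}:{digest}\n" for path, digest in sorted(file_hashes.items())
    )
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def changed_paths(local: dict[str, str], remote: dict[str, str]) -> list[str]:
    """Return paths whose content must be re-uploaded, given path -> hash maps."""
    # Matching roots mean nothing changed, so skip the per-file comparison.
    if root_hash(local) == root_hash(remote):
        return []
    return sorted(
        path for path, digest in local.items() if remote.get(path) != digest
    )
```

Only the paths returned by changed_paths would need re-chunking and re-embedding; everything else is already indexed on the server side.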
3. Embedding and privacy:
↳ Uses OpenAI’s text-embedding-3-small or custom code-specific models
↳ Obfuscates file paths with client-side encryption (e.g., src/utils.py → a1b2/c3d4/e5f6)
↳ No raw code stored, embeddings purged after request
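A sketch of per-segment path masking. Important caveat: this uses a keyed hash (HMAC-SHA256) from the standard library as a stand-in, which is one-way and not the reversible client-side encryption described above. The function name, key handling, and digest length are assumptions for illustration only.

```python
import hashlib
import hmac


def obfuscate_path(path: str, key: bytes, length: int = 8) -> str:
    """Replace each path segment with a short keyed digest, keeping '/' structure."""
    masked = []
    for segment in path.split("/"):
        digest = hmac.new(key, segment.encode("utf-8"), hashlib.sha256).hexdigest()
        # Truncated for readability; short digests can collide in practice.
        masked.append(digest[:length])
    return "/".join(masked)
```

Masking segment by segment keeps the directory hierarchy visible to the server (files in the same folder share a prefix) without revealing the actual names.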
4. RAG for code generation:
Here's what happens when you ask about your codebase:
↳ Embed the question and query the vector DB (Turbopuffer) for relevant chunks
↳ Inject top matches into LLM context
↳ Generate answers using GPT-4 + codebase context
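The retrieve-then-inject flow, sketched with a toy in-memory index and cosine similarity. This is not Turbopuffer's API; the vectors, chunk ids, and prompt template are invented for illustration, and the final LLM call is left out.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], index: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k chunks whose vectors are closest to the query."""
    ranked = sorted(index, key=lambda cid: cosine(query, index[cid]), reverse=True)
    return ranked[:k]


def build_prompt(question: str, chunks: list[str]) -> str:
    # Inject the retrieved chunks ahead of the question as context.
    context = "\n\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"
```

In practice the query vector comes from the same embedding model used at indexing time, and the prompt returned by build_prompt is what gets sent to the LLM.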
5. Why Merkle trees?
↳ Bandwidth savings: Sync only delta changes (Git-like)
↳ Cache optimization: Hash-indexed embeddings enable instant re-indexing
↳ Data integrity: Tamper-evident codebase fingerprints
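The cache point can be sketched as a lookup table keyed by chunk hash. This is a minimal in-memory version; the class name and the embed callback are hypothetical, and a real system would keep the cache server-side.

```python
import hashlib
from typing import Callable


class EmbeddingCache:
    """Cache embeddings by the SHA-256 of the chunk text."""

    def __init__(self, embed: Callable[[str], list[float]]) -> None:
        self._embed = embed
        self._store: dict[str, list[float]] = {}
        self.misses = 0

    def get(self, chunk: str) -> list[float]:
        key = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if key not in self._store:
            # Only unseen content reaches the (expensive) embedding call.
            self.misses += 1
            self._store[key] = self._embed(chunk)
        return self._store[key]
```

Since the key depends only on content, an unchanged chunk hits the cache even if it was indexed earlier from a different copy of the same codebase.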
6. Technical Challenges:
⚠️ Network overhead: Retries under heavy server load cause traffic spikes
⚠️ AST parsing edge cases: Language-specific syntax quirks
⚠️ Embedding inversion risks: Theoretical code leaks from vectors (mitigated by short TTLs)
I find Cursor best with clear guidelines, task definition, and small to mid-size projects.
With OpenAI buying Windsurf for $3Bn, I'm excited to see where Cursor will go.
Are you using Cursor?