How would you design a search engine like Elasticsearch?

Answer

Designing a full-text search system requires understanding inverted indexes and distributed search.

Core data structure: the inverted index. It maps each term to the list of documents that contain it, for example "mongodb" → [doc1, doc4, doc7].

Elasticsearch architecture:
Index: a logical namespace (like a database).
Shard: a self-contained Lucene index. Each Elasticsearch index is divided into primary shards (for scale), and each primary can have replica shards (for redundancy).
Node: a single server in the cluster.
Cluster: a collection of nodes.

Indexing flow: document → tokenization (split into terms) → normalization (lowercasing, stemming, stop-word removal) → add to the inverted index → store in a Lucene segment.

Search flow: query → parse → route to all relevant shards → each shard searches its local inverted index → merge and rank results (BM25, the default in current versions, or classic TF-IDF).

Distributed search: the query goes to a coordinator node → the coordinator forwards it to one copy (primary or replica) of every relevant shard → each shard returns its top-N local results → the coordinator merges and re-ranks them → returns the top-K global results.

Near real-time: Lucene writes new documents to an in-memory buffer → a refresh (every 1s by default) turns the buffer into a new searchable segment → new docs become visible after that refresh. Committing segments durably to disk is a separate, less frequent flush.

Relevance scoring: BM25 considers term frequency, inverse document frequency, and field-length normalization.

Design decisions: the number of primary shards is set at index creation and cannot be changed in place later (changing it means reindexing or using the split/shrink APIs); a common rule of thumb is 20-40 GB per shard; the replica count is chosen for read scaling and high availability.
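The indexing, scoring, and scatter-gather steps described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how Elasticsearch or Lucene is implemented: the names `analyze`, `Shard`, `route`, and `coordinate` are hypothetical, the stop-word list is illustrative, stemming is omitted, document IDs are integers routed with a plain modulo (a simplification of hashing a routing value), and each shard computes BM25 from its own local statistics.

```python
import heapq
import math
import re
from collections import Counter, defaultdict

# Illustrative stop-word list (an assumption, not Lucene's actual list).
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}


def analyze(text):
    """Tokenize and normalize: lowercase, split on non-alphanumerics, drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


class Shard:
    """Toy stand-in for one shard: a local inverted index scored with BM25."""

    def __init__(self, k1=1.2, b=0.75):
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        self.doc_len = {}                  # doc_id -> number of terms
        self.k1, self.b = k1, b

    def index(self, doc_id, text):
        terms = analyze(text)
        self.doc_len[doc_id] = len(terms)
        for term, tf in Counter(terms).items():
            self.postings[term][doc_id] = tf

    def search(self, query, n):
        """Return this shard's top-n hits as (score, doc_id) pairs."""
        if not self.doc_len:
            return []
        total_docs = len(self.doc_len)
        avg_len = (sum(self.doc_len.values()) / total_docs) or 1.0
        scores = defaultdict(float)
        for term in analyze(query):
            docs = self.postings.get(term, {})
            if not docs:
                continue
            # Inverse document frequency, from shard-local counts only.
            idf = math.log(1 + (total_docs - len(docs) + 0.5) / (len(docs) + 0.5))
            for doc_id, tf in docs.items():
                # Field-length normalization: long documents are penalized.
                norm = tf + self.k1 * (1 - self.b + self.b * self.doc_len[doc_id] / avg_len)
                scores[doc_id] += idf * tf * (self.k1 + 1) / norm
        return heapq.nlargest(n, ((s, d) for d, s in scores.items()))


def route(shards, doc_id):
    """Pick the shard for a document (simplified: integer ID modulo shard count)."""
    return shards[doc_id % len(shards)]


def coordinate(shards, query, k):
    """Scatter the query to every shard, then merge local top-k lists into a global top-k."""
    local_hits = [hit for shard in shards for hit in shard.search(query, k)]
    return [(doc_id, score) for score, doc_id in heapq.nlargest(k, local_hits)]


shards = [Shard(), Shard()]
docs = {
    1: "MongoDB is a document database",
    2: "Elasticsearch is a distributed search engine built on Lucene",
    3: "Lucene stores an inverted index in segments",
    4: "MongoDB and Elasticsearch are often used together",
}
for doc_id, text in docs.items():
    route(shards, doc_id).index(doc_id, text)

for doc_id, score in coordinate(shards, "mongodb", k=10):
    print(doc_id, round(score, 3))
```

Because the shard is chosen as `doc_id % number_of_shards`, changing the shard count would send existing documents to different shards, which is one way to see why the primary shard count is fixed once the index is created.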

