How would you design a web crawler?

Answer

A web crawler systematically browses the web to collect and index content.

Requirements: crawl billions of pages; be polite (rate limit requests per domain); avoid revisiting duplicate URLs; prioritize important pages; handle redirects and errors; scale horizontally across many machines.

Core components:
(1) URL Frontier: the queue of URLs to visit. It is a priority queue ordered by estimated importance (for example, a PageRank estimate) and partitioned by domain so that per-domain rate limits can be enforced.
(2) Fetcher: downloads page content over HTTP. It respects robots.txt, obeys rate limits, and follows redirects. Many fetcher workers run in parallel.
(3) Parser: extracts links from HTML, then cleans and canonicalizes each URL (removes tracking parameters, normalizes encoding).
(4) Duplicate detection: before adding a URL to the frontier, check whether it has already been visited. A distributed Bloom filter (a space-efficient probabilistic set) answers most checks in memory. A Bloom filter never returns a false negative, so a miss means the URL is new; a hit may be a false positive, so it is confirmed against the canonical URL hash stored in a database.
(5) Content deduplication: the same page content is often served from different URLs. Use SimHash (a locality-sensitive hash) to detect near-duplicates, and store the content fingerprints.
(6) Storage: raw pages go to S3 or HDFS; extracted content goes to Elasticsearch; URL state goes to a distributed database (Cassandra).

Politeness: parse robots.txt, apply a per-domain crawl delay, and partition the distributed URL frontier by domain so that each worker owns a domain partition and can enforce the delay locally.

Distributed coordination: run multiple crawler workers and use consistent hashing to assign domains to workers, so that adding or removing a worker reassigns only a small share of domains.

DNS caching: DNS lookups are slow, so cache DNS results aggressively.
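
The canonicalization step in the parser can be sketched roughly as follows. This is a minimal illustration using Python's standard urllib.parse, not a complete normalizer. The function name canonicalize and the TRACKING_PARAMS list are assumptions made for the example, and sorting query parameters is an extra normalization that is usually safe but not guaranteed for every site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed, illustrative list of tracking parameters to strip.
TRACKING_PARAMS = {
    "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
    "fbclid", "gclid",
}

def canonicalize(url: str) -> str:
    """Return a normalized form of url so equivalent URLs compare equal."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()

    # Keep the port only when it is not the default for the scheme.
    port = parts.port
    default_port = {"http": 80, "https": 443}.get(scheme)
    if port is not None and port != default_port:
        host = f"{host}:{port}"

    path = parts.path or "/"

    # Drop tracking parameters and sort the rest for a stable ordering.
    pairs = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    ]
    query = urlencode(sorted(pairs))

    # The fragment is never sent to the server, so it is dropped.
    return urlunsplit((scheme, host, path, query, ""))
```

For example, "HTTP://Example.COM:80/a?utm_source=x&b=2&a=1#frag" and "http://example.com/a?a=1&b=2" both canonicalize to the second form, so they count as one URL in the frontier.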
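
The two-stage URL check (Bloom filter first, database second) might look like the sketch below. It is a single-process illustration under stated assumptions: BloomFilter and SeenUrls are hypothetical names, the bit-array size and hash count are arbitrary, and a Python set stands in for the database of canonical URL hashes.

```python
import hashlib

class BloomFilter:
    """Simple Bloom filter using double hashing over a SHA-256 digest."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )

class SeenUrls:
    """Bloom filter in front of a definitive store of URL hashes."""

    def __init__(self):
        self.bloom = BloomFilter()
        self.store = set()  # stand-in for the database (assumption)

    def mark_if_new(self, canonical_url: str) -> bool:
        """Return True and record the URL if it has not been seen before."""
        key = hashlib.sha256(canonical_url.encode("utf-8")).hexdigest()
        # A Bloom miss is definitive; only a hit needs the slower store lookup.
        if key in self.bloom and key in self.store:
            return False
        self.bloom.add(key)
        self.store.add(key)
        return True
```

The point of the ordering is that most URLs are new or clearly old, so the expensive database lookup runs only when the filter reports a possible match.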
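
For content deduplication, a SimHash fingerprint can be computed along these lines. This is a simplified sketch: it hashes single lowercase words with equal weight, whereas production systems often use weighted features or shingles. The function names and the max_distance threshold of 3 bits are assumptions for illustration.

```python
import hashlib
import re

def simhash(text: str, bits: int = 64) -> int:
    """Compute a SimHash fingerprint (up to 64 bits) from word tokens."""
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

def is_near_duplicate(text_a: str, text_b: str, max_distance: int = 3) -> bool:
    """Treat pages as near-duplicates when their fingerprints are close."""
    return hamming_distance(simhash(text_a), simhash(text_b)) <= max_distance
```

Because similar documents share most tokens, most bit votes agree and the fingerprints differ in only a few positions, which is what makes a Hamming-distance threshold workable.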
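
A frontier that combines priority ordering with a per-domain crawl delay could be sketched as below. PoliteFrontier is a hypothetical class, the single fixed crawl_delay is a simplification (a real crawler would also honor per-site robots.txt rules), and the linear scan over domains is chosen for readability rather than speed.

```python
import heapq
import time
from collections import defaultdict
from urllib.parse import urlsplit

class PoliteFrontier:
    """Per-domain priority queues with a minimum delay between fetches."""

    def __init__(self, crawl_delay: float = 1.0, clock=time.monotonic):
        self.crawl_delay = crawl_delay
        self.clock = clock                # injectable for testing
        self.queues = defaultdict(list)   # domain -> heap of (-priority, seq, url)
        self.ready_at = {}                # domain -> earliest next fetch time
        self.seq = 0                      # tie-breaker keeps insertion order

    def add(self, url: str, priority: float = 0.0) -> None:
        domain = urlsplit(url).hostname or ""
        heapq.heappush(self.queues[domain], (-priority, self.seq, url))
        self.seq += 1

    def next_url(self):
        """Return the highest-priority URL whose domain is ready, else None."""
        now = self.clock()
        best, best_domain = None, None
        for domain, heap in self.queues.items():
            if not heap or self.ready_at.get(domain, float("-inf")) > now:
                continue  # empty queue, or domain is still cooling down
            if best is None or heap[0] < best:
                best, best_domain = heap[0], domain
        if best_domain is None:
            return None
        _, _, url = heapq.heappop(self.queues[best_domain])
        self.ready_at[best_domain] = now + self.crawl_delay
        return url
```

In this sketch a high-priority URL can be passed over when its domain was fetched recently, which is the intended trade-off: politeness takes precedence over importance.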
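
Assigning domains to workers with consistent hashing can be illustrated with a small hash ring. HashRing, the worker names, and the choice of 100 virtual nodes per worker are assumptions for the example. A real deployment would also need membership tracking and a handoff of queued URLs when ownership changes.

```python
import bisect
import hashlib

def _point(key: str) -> int:
    """Map a string to a position on the ring."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

class HashRing:
    """Consistent-hash ring with virtual nodes for smoother balance."""

    def __init__(self, workers, vnodes: int = 100):
        self.vnodes = vnodes
        self.ring = []    # sorted list of (point, worker)
        self.points = []  # the points alone, for bisect lookups
        for worker in workers:
            self.add_worker(worker)

    def _rebuild(self) -> None:
        self.ring.sort()
        self.points = [point for point, _ in self.ring]

    def add_worker(self, worker: str) -> None:
        self.ring.extend(
            (_point(f"{worker}#{i}"), worker) for i in range(self.vnodes)
        )
        self._rebuild()

    def remove_worker(self, worker: str) -> None:
        self.ring = [(p, w) for p, w in self.ring if w != worker]
        self._rebuild()

    def worker_for(self, domain: str) -> str:
        """Return the worker owning the first ring point after the domain."""
        if not self.ring:
            raise LookupError("no workers on the ring")
        idx = bisect.bisect_right(self.points, _point(domain)) % len(self.points)
        return self.ring[idx][1]
```

Removing a worker only reassigns the domains that worker owned; every other domain keeps its owner, so per-domain rate-limit state stays where it is.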
