Difficulty: Hard

Design a Web Crawler

Design a distributed web crawler that can index billions of web pages efficiently while being polite to websites.

Estimated Time: 45 minutes
Tags: Crawling, Distributed Queue, Politeness, Deduplication
Solution Overview

Use a URL frontier backed by a priority queue. Enforce politeness policies (robots.txt compliance, per-domain rate limiting). Detect duplicate URLs and duplicate content to avoid redundant crawling.
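
A minimal sketch of a priority-queue frontier, assuming an in-memory heap, a lower-score-wins convention, and a caller-supplied score (all assumptions; a production frontier would be a distributed queue partitioned per domain):

```python
import heapq
import itertools

class URLFrontier:
    """Minimal in-memory URL frontier sketch (hypothetical interface).

    Assumptions: lower score = higher priority, and the caller supplies
    the score (e.g. from page importance or refresh interval).
    """

    def __init__(self):
        self._heap = []                # entries: (score, tie, url)
        self._tie = itertools.count()  # tie-breaker keeps heap entries comparable
        self._enqueued = set()         # don't enqueue the same URL twice

    def push(self, url, score):
        if url not in self._enqueued:
            self._enqueued.add(url)
            heapq.heappush(self._heap, (score, next(self._tie), url))

    def pop(self):
        """Return the highest-priority URL, or None if the frontier is empty."""
        if not self._heap:
            return None
        _, _, url = heapq.heappop(self._heap)
        return url

frontier = URLFrontier()
frontier.push("https://example.com/", score=0.1)      # important seed
frontier.push("https://example.com/about", score=0.5)
print(frontier.pop())  # -> https://example.com/
```

At scale the single heap is typically split into priority "front" queues feeding per-domain "back" queues, so prioritization and politeness compose.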

Hints to Get Started
1. How do you prioritize URLs in the frontier?
2. How do you handle dynamic content (JavaScript rendering)?
3. How do you normalize URLs so that equivalent addresses map to a single entry? (See the sketch after this list.)

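For hint 3, a sketch using only the standard library; the exact rule set here (lowercasing, dropping default ports and fragments, sorting query parameters, stripping trailing slashes) is an assumption, since which rewrites are safe varies by site:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize_url(url: str) -> str:
    """Canonicalize a URL so syntactic variants dedupe to one key."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.hostname.lower() if parts.hostname else ""
    # Keep only non-default ports (80 for http, 443 for https).
    port = parts.port
    if port and not ((scheme == "http" and port == 80) or
                     (scheme == "https" and port == 443)):
        host = f"{host}:{port}"
    path = parts.path.rstrip("/") or "/"
    query = urlencode(sorted(parse_qsl(parts.query)))
    return urlunsplit((scheme, host, path, query, ""))  # "" drops the fragment

assert normalize_url("HTTP://Example.com:80/a/?b=2&a=1#frag") == \
       "http://example.com/a?a=1&b=2"
```
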
Components
  • URL Frontier
  • Fetcher
  • Parser
  • Duplicate Detector
  • Storage
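
A sketch of one iteration through these components; every name here (crawl_once, the injected fetch/parse/normalize callables) is a hypothetical stand-in for a real distributed component, and the content hash is one simple deduplication choice among several:

```python
import hashlib

def crawl_once(frontier, fetch, parse, normalize, storage, seen_hashes):
    """One pass of the Fetcher -> Parser -> Duplicate Detector -> Storage loop.

    All components are injected (hypothetical interfaces): frontier has
    pop() and push(url, score), fetch(url) returns the body or None,
    parse(body) yields outlinks, normalize(url) canonicalizes a URL.
    Returns False once the frontier is empty.
    """
    url = frontier.pop()
    if url is None:
        return False
    body = fetch(url)
    if body is None:
        return True  # fetch failed; a real crawler would retry with backoff
    # Duplicate Detector: a content hash works at small scale; big crawls
    # would use simhash for near-duplicates or a Bloom filter for seen URLs.
    digest = hashlib.sha256(body.encode()).hexdigest()
    if digest not in seen_hashes:
        seen_hashes.add(digest)
        storage[url] = body                      # Storage: persist the page
        for link in parse(body):
            frontier.push(normalize(link), 1.0)  # placeholder uniform score
    return True
```

The uniform score of 1.0 is a placeholder; in practice the parser would feed a scoring function that drives the frontier's priority queue.
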
Politeness
  • robots.txt compliance
  • Rate limiting per domain
  • Backoff on errors
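
The three policies above might combine into a per-domain gate like this sketch; the class name, user agent, one-second base delay, and backoff cap are illustrative assumptions:

```python
import time
import urllib.robotparser

class PolitenessGate:
    """Sketch of per-domain politeness (hypothetical class)."""

    def __init__(self, user_agent="example-crawler", base_delay=1.0):
        self.user_agent = user_agent
        self.base_delay = base_delay  # seconds between hits to one domain
        self.robots = {}              # domain -> cached RobotFileParser
        self.next_ok = {}             # domain -> earliest next-fetch time
        self.errors = {}              # domain -> consecutive error count

    def can_fetch(self, domain, url):
        # robots.txt compliance: fetch and cache one parser per domain.
        if domain not in self.robots:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(f"https://{domain}/robots.txt")
            try:
                rp.read()
            except OSError:
                # Unreachable robots.txt: the parser then denies all
                # fetches, erring on the side of politeness.
                pass
            self.robots[domain] = rp
        return self.robots[domain].can_fetch(self.user_agent, url)

    def wait_turn(self, domain):
        # Rate limiting per domain, widened exponentially after errors.
        delay = self.base_delay * (2 ** self.errors.get(domain, 0))
        now = time.monotonic()
        wake = max(now, self.next_ok.get(domain, now))
        time.sleep(wake - now)
        self.next_ok[domain] = wake + delay

    def record(self, domain, ok):
        # Backoff on errors: reset on success, double the delay (capped) on failure.
        self.errors[domain] = 0 if ok else min(self.errors.get(domain, 0) + 1, 6)
```

Note that wait_turn blocks the calling worker; an async crawler would reschedule the URL instead of sleeping.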