HARD

Design a Web Crawler

Design a distributed web crawler that can index billions of web pages efficiently while being polite to websites.

Estimated Time: 45 minutes
#Crawling #Distributed Queue #Politeness #Deduplication

Solution Overview

Use a URL frontier backed by a priority queue. Enforce politeness policies (robots.txt compliance, per-domain rate limiting). Detect and skip duplicate content.
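The URL frontier above can be sketched as an in-memory priority queue with enqueue-time deduplication. The class name, priority convention (lower score fetched first), and `seen`-set dedup are illustrative assumptions, not part of a prescribed solution; a production frontier would be distributed and persistent.

```python
import heapq
import time

class URLFrontier:
    """Minimal in-memory URL frontier: a priority queue where lower
    scores are fetched first, with a seen-set so each URL is enqueued
    at most once. Illustrative sketch only."""

    def __init__(self):
        self._heap = []     # (priority, enqueue_time, url)
        self._seen = set()  # URLs already enqueued

    def add(self, url, priority=1.0):
        """Enqueue a URL; returns False if it was seen before."""
        if url in self._seen:
            return False
        self._seen.add(url)
        heapq.heappush(self._heap, (priority, time.time(), url))
        return True

    def next_url(self):
        """Pop the highest-priority (lowest-score) URL, or None."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]

frontier = URLFrontier()
frontier.add("https://example.com/a", priority=0.5)
frontier.add("https://example.com/b", priority=2.0)
frontier.add("https://example.com/a")  # duplicate, ignored
```

In an interview you would note that a real frontier is sharded by domain so politeness can be enforced per queue.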

Hints to Get Started
1. How to prioritize URLs?
2. Handling dynamic content (JavaScript rendering)
3. URL normalization
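The URL normalization hint can be made concrete with Python's standard `urllib.parse`. The specific rules chosen here (lowercase scheme and host, strip default ports and trailing slashes, drop fragments) are common but illustrative assumptions; real crawlers tune these per site.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    """Canonicalize a URL so syntactic variants dedupe to one key.
    Rules (illustrative): lowercase scheme/host, drop the fragment,
    strip default ports, and remove trailing slashes."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    # Strip default ports (hostname is already lowercased by urlsplit)
    if (scheme, parts.port) in (("http", 80), ("https", 443)):
        netloc = parts.hostname
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((scheme, netloc, path, parts.query, ""))

# Both variants map to the same canonical form:
normalize_url("HTTP://Example.COM:80/about/")
normalize_url("http://example.com/about#team")
```

Without normalization, the duplicate detector would treat these variants as distinct pages and re-crawl them.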

Components
  • URL Frontier
  • Fetcher
  • Parser
  • Duplicate Detector
  • Storage
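The Duplicate Detector component can be sketched as an exact-match detector over content hashes. This is a deliberate simplification: at billions of pages, you would use a near-duplicate scheme such as SimHash and a distributed store for the hash set rather than an in-process set.

```python
import hashlib

class DuplicateDetector:
    """Exact-duplicate detection via SHA-256 content hashes.
    Sketch only: large-scale crawlers use near-duplicate hashing
    (e.g. SimHash) and an external store instead of a local set."""

    def __init__(self):
        self._seen = set()

    def is_duplicate(self, content: bytes) -> bool:
        """Return True if this exact content was seen before."""
        digest = hashlib.sha256(content).digest()
        if digest in self._seen:
            return True
        self._seen.add(digest)
        return False

detector = DuplicateDetector()
detector.is_duplicate(b"<html>hello</html>")  # first sighting
```

Storing 32-byte digests instead of full pages keeps the dedup index small relative to the corpus.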
Politeness
  • robots.txt compliance
  • Rate limiting per domain
  • Backoff on errors
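The rate-limiting and backoff rules above can be sketched as a per-domain scheduler; robots.txt compliance would sit alongside it (Python's standard `urllib.robotparser` handles parsing). The fixed `crawl_delay`, the doubling backoff, and the class interface here are illustrative assumptions.

```python
import time
from urllib.parse import urlsplit

class PolitenessManager:
    """Per-domain rate limiting with exponential backoff on errors.
    Sketch only: delay values and the backoff factor are assumptions;
    real crawlers also honor robots.txt Crawl-delay directives."""

    def __init__(self, crawl_delay=1.0):
        self.crawl_delay = crawl_delay
        self._next_allowed = {}  # domain -> earliest next-fetch time
        self._errors = {}        # domain -> consecutive error count

    def wait_time(self, url, now=None):
        """Seconds to wait before this URL's domain may be fetched."""
        now = time.monotonic() if now is None else now
        domain = urlsplit(url).netloc
        return max(0.0, self._next_allowed.get(domain, 0.0) - now)

    def record_fetch(self, url, ok=True, now=None):
        """Record a fetch result; failures double the domain's delay."""
        now = time.monotonic() if now is None else now
        domain = urlsplit(url).netloc
        if ok:
            self._errors[domain] = 0
            delay = self.crawl_delay
        else:
            self._errors[domain] = self._errors.get(domain, 0) + 1
            delay = self.crawl_delay * (2 ** self._errors[domain])
        self._next_allowed[domain] = now + delay

pm = PolitenessManager(crawl_delay=1.0)
pm.record_fetch("https://example.com/a", ok=True, now=100.0)
```

Keying on the domain (not the full URL) is what makes this "polite": all workers hitting one site share a single delay budget.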