HARD
Design a Web Crawler
Design a distributed web crawler that can index billions of web pages efficiently while being polite to websites.
Estimated Time: 45 minutes
#Crawling #Distributed Queue #Politeness #Deduplication
Solution Overview
Use a URL frontier backed by a priority queue. Enforce politeness policies (robots.txt compliance, per-domain rate limiting). Detect and handle duplicate content.
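A minimal single-process sketch of the frontier using Python's `heapq`; the class name `URLFrontier` and the numeric score (lower = fetched sooner) are illustrative assumptions, and a production frontier would also shard by host and persist its state:

```python
import heapq
import itertools

class URLFrontier:
    """Priority-queue frontier: lower score means fetched sooner.
    How the score is computed (e.g., PageRank, freshness, crawl
    depth) is left to the caller; this class only orders by it."""

    def __init__(self) -> None:
        self._heap: list[tuple[float, int, str]] = []
        self._counter = itertools.count()  # tie-breaker for equal scores
        self._enqueued: set[str] = set()   # skip URLs already queued

    def push(self, url: str, score: float) -> None:
        if url not in self._enqueued:
            self._enqueued.add(url)
            heapq.heappush(self._heap, (score, next(self._counter), url))

    def pop(self) -> str | None:
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

At billions of URLs, the in-memory set and heap would be replaced by a sharded, disk-backed queue (e.g., one sub-queue per host) so per-domain politeness can be enforced at dequeue time.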
Hints to Get Started
1. How to prioritize URLs?
2. Handling dynamic content (JavaScript rendering)
3. URL normalization (see the sketch after this list)
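For hint 3, a rough normalization pass using only the standard library; the exact rule set here (lowercase host, drop default ports and fragments, sort query parameters) is a common convention, not a fixed standard:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize_url(url: str) -> str:
    """Canonicalize a URL so syntactic variants map to one key."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.hostname.lower() if parts.hostname else ""
    # Keep the port only if it is not the scheme's default.
    port = parts.port
    if port and (scheme, port) not in (("http", 80), ("https", 443)):
        host = f"{host}:{port}"
    path = parts.path or "/"
    # Sort query parameters so ?a=1&b=2 and ?b=2&a=1 match.
    query = urlencode(sorted(parse_qsl(parts.query)))
    return urlunsplit((scheme, host, path, query, ""))  # fragment dropped
```

For example, `normalize_url("HTTP://Example.com:80/b?z=1&a=2#top")` yields `http://example.com/b?a=2&z=1`, so both spellings dedupe to the same frontier key.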
Components
- URL Frontier
- Fetcher
- Parser
- Duplicate Detector (sketched after this list)
- Storage
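A sketch of exact-duplicate detection by hashing fetched content; the class name is illustrative, and the in-memory set stands in for what would be a distributed store or Bloom filter at billions of pages (near-duplicate detection, e.g. SimHash, would sit alongside this):

```python
import hashlib

class DuplicateDetector:
    """Exact-duplicate check via SHA-256 content fingerprints."""

    def __init__(self) -> None:
        self._fingerprints: set[str] = set()

    def is_duplicate(self, body: bytes) -> bool:
        digest = hashlib.sha256(body).hexdigest()
        if digest in self._fingerprints:
            return True
        self._fingerprints.add(digest)
        return False
```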
Politeness
- robots.txt compliance
- Rate limiting per domain (both sketched after this list)
- Backoff on errors
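One way the first two policies might be combined in a single worker, using the standard library's `urllib.robotparser`; `PolitenessGate`, the 1-second default delay, and the in-memory per-host state are all simplifying assumptions, and backoff on errors would extend `wait_turn` with an exponentially growing delay:

```python
import time
import urllib.robotparser
from urllib.parse import urlsplit

class PolitenessGate:
    """Per-domain gate: checks robots.txt and enforces a minimum
    delay between successive requests to the same host."""

    def __init__(self, user_agent: str = "example-crawler",
                 delay: float = 1.0) -> None:
        self.user_agent = user_agent
        self.delay = delay
        self._robots: dict[str, urllib.robotparser.RobotFileParser] = {}
        self._next_ok: dict[str, float] = {}  # host -> earliest fetch time

    def allowed(self, url: str) -> bool:
        host = urlsplit(url).netloc
        if host not in self._robots:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(f"https://{host}/robots.txt")
            rp.read()  # network call; result is cached per host
            self._robots[host] = rp
        return self._robots[host].can_fetch(self.user_agent, url)

    def wait_turn(self, url: str) -> None:
        host = urlsplit(url).netloc
        wait = self._next_ok.get(host, 0.0) - time.monotonic()
        if wait > 0:
            time.sleep(wait)
        self._next_ok[host] = time.monotonic() + self.delay
```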