Context
Meta wants a large-scale crawling pipeline to continuously fetch public web pages, extract structured content and metadata, and publish cleaned datasets for downstream search, integrity, and ranking systems. The current process relies on ad hoc batch jobs and manual recrawls, which causes stale content, duplicate fetches, and poor observability across the ingestion and transformation stages.
You are asked to design a production-grade pipeline using Meta-preferred infrastructure patterns, with strong support for incremental crawling, deduplication, replay, and data quality enforcement.
Scale Requirements
- Seed URLs: 5B known URLs, growing by 50M/day
- Fetch rate: 2M URLs/min sustained, 5M URLs/min peak
- Average page size: 800KB compressed, 2.5MB uncompressed
- Daily ingest: ~4PB compressed raw content on days with peak crawl windows (see the back-of-envelope check after this list)
- Freshness target: high-priority domains recrawled within 15 minutes; long-tail domains within 7 days
- Serving SLA: extracted metadata queryable within 10 minutes of successful fetch
- Retention: raw fetches 30 days, parsed content 180 days, aggregates 2 years
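
A quick sanity check on these figures, using only the fetch rates and compressed page size listed above (decimal units: 1 TB = 1e6 MB, 1 PB = 1e3 TB):

```python
# Back-of-envelope check on fetch bandwidth, with constants copied
# from the scale requirements above.
SUSTAINED_URLS_PER_MIN = 2_000_000
PEAK_URLS_PER_MIN = 5_000_000
AVG_COMPRESSED_MB = 0.8  # 800KB compressed per page

for label, rate in (("sustained", SUSTAINED_URLS_PER_MIN),
                    ("peak", PEAK_URLS_PER_MIN)):
    tb_per_min = rate * AVG_COMPRESSED_MB / 1_000_000
    pb_per_day = tb_per_min * 1440 / 1_000
    print(f"{label}: {tb_per_min:.1f} TB/min, {pb_per_day:.1f} PB/day")

# sustained: 1.6 TB/min, 2.3 PB/day
# peak:      4.0 TB/min, 5.8 PB/day
# A day mixing sustained load with peak windows lands near the ~4PB figure.
```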
Requirements
- Build a pipeline for URL discovery, prioritization, fetch scheduling, parsing, enrichment, and storage.
- Support both streaming recrawl for high-priority domains and batch backfills for historical reprocessing (a tier-scheduling sketch follows this list).
- Enforce robots.txt, per-domain politeness limits, and duplicate-content suppression (see the politeness and dedup sketch below).
- Guarantee idempotent processing for retries, replay, and backfills (an idempotency-key sketch follows).
- Produce analytics-ready datasets for crawl coverage, fetch success, content freshness, and extraction quality (a sample coverage query follows).
- Design monitoring, alerting, and recovery for fetcher failures, parser regressions, and downstream lag.
- Explain partitioning, checkpointing, and schema evolution choices (sketched at the end of this list).
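
One way to reconcile the 15-minute and 7-day freshness targets in a single scheduler is to normalize staleness by each tier's target interval. A minimal sketch; the tier names and the middle tier's interval are illustrative assumptions, not a prescribed taxonomy:

```python
import time
from dataclasses import dataclass

# Hypothetical freshness tiers matching the targets above.
TIER_RECRAWL_SECS = {
    "high": 15 * 60,             # high-priority domains: 15 minutes
    "mid": 24 * 3600,            # assumed middle tier: 1 day
    "long_tail": 7 * 24 * 3600,  # long-tail domains: 7 days
}

@dataclass
class UrlState:
    url: str
    tier: str
    last_fetch_ts: float  # epoch seconds of the last successful fetch

def recrawl_priority(state: UrlState, now: float | None = None) -> float:
    """Staleness normalized by the tier's target interval.

    A score >= 1.0 means the URL is due; larger means more overdue, so a
    single priority queue can order high-priority and long-tail URLs
    together without starving either.
    """
    now = time.time() if now is None else now
    return (now - state.last_fetch_ts) / TIER_RECRAWL_SECS[state.tier]
```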
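For politeness and duplicate suppression, a sketch of a per-domain rate gate plus an exact-duplicate fingerprint; robots.txt parsing, Crawl-delay handling, and near-duplicate detection are assumed to sit on top of this:

```python
import hashlib
import time
from collections import defaultdict

class DomainPoliteness:
    """Minimal per-domain gate: at most one fetch per min_interval_secs
    per domain. A production crawler would also honor robots.txt
    Crawl-delay and back off on 429/503 responses."""

    def __init__(self, min_interval_secs: float = 1.0):
        self.min_interval = min_interval_secs
        self.next_allowed = defaultdict(float)  # domain -> earliest next fetch

    def try_acquire(self, domain: str, now: float | None = None) -> bool:
        now = time.time() if now is None else now
        if now >= self.next_allowed[domain]:
            self.next_allowed[domain] = now + self.min_interval
            return True
        return False

def content_fingerprint(body: bytes) -> str:
    """Exact-duplicate suppression key over the fetched body; near-duplicate
    detection (e.g. SimHash over shingles) would sit alongside this."""
    return hashlib.sha256(body).hexdigest()
```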
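Idempotency can be grounded in a deterministic key per logical fetch, so every retry or replay writes the same row. A minimal sketch, where the delimiter format is an arbitrary assumption:

```python
import hashlib

def processing_key(url: str, fetch_ts_ms: int, content_sha256: str) -> str:
    """Deterministic key for one logical fetch result.

    Keying every downstream write (parsed rows, enrichment output) by this
    value turns retries, replays, and backfills into upserts instead of
    duplicate rows.
    """
    return hashlib.sha256(
        f"{url}|{fetch_ts_ms}|{content_sha256}".encode()
    ).hexdigest()
```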
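As a sketch of the analytics layer, a Presto query for per-domain fetch success; the crawl.fetch_results table, its columns, and the ds partition convention are hypothetical, not a real schema:

```python
# Hypothetical Presto query; table and column names are illustrative.
FETCH_SUCCESS_BY_DOMAIN = """
SELECT
  domain,
  count(*) AS attempts,
  count_if(http_status BETWEEN 200 AND 299) AS successes,
  count_if(http_status BETWEEN 200 AND 299) * 1.0 / count(*) AS success_rate
FROM crawl.fetch_results
WHERE ds = '2024-01-01'  -- daily partition, bound by the orchestrator
GROUP BY domain
ORDER BY attempts DESC
LIMIT 100
"""
```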
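For partitioning and checkpointing, one plausible shape (the bucket count, hash choice, and checkpoint fields are assumptions, not a mandated design): partition output by event date plus a stable domain-hash bucket so a hot domain cannot skew a partition, and commit consumer checkpoints only after durable writes so replays stay safe under the idempotency key above. Schema evolution would then favor additive, versioned fields over in-place rewrites.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

N_BUCKETS = 4096  # assumed bucket count for domain sharding

def partition_for(domain: str, ds: str) -> tuple[str, int]:
    """Partition key: event date plus a stable domain-hash bucket, so
    backfills can target a bounded (ds, bucket) range."""
    bucket = int(hashlib.md5(domain.encode()).hexdigest(), 16) % N_BUCKETS
    return ds, bucket

@dataclass
class Checkpoint:
    """Stream-consumer position, committed only after the corresponding
    output write is durable: restarts then replay at-least-once, and the
    idempotency key above collapses any duplicates."""
    stream: str
    shard: str
    offset: int
    committed_ts_ms: int

    def to_json(self) -> str:
        return json.dumps(asdict(self))
```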
Constraints
- Prefer Meta technologies where applicable: Scribe for log transport, Scuba for real-time slice-and-dice analytics, Presto for interactive SQL, and internal workflow systems for Airflow-like orchestration, rather than generic placeholders.
- Assume a mixed streaming + batch architecture across multiple regions.
- Budget is constrained: avoid recrawling the full corpus more than once every 14 days.
- Must support legal takedown and deletion requests within 24 hours across raw and derived stores (a tracking sketch follows this list).
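
To make the 24-hour deletion SLA auditable, takedowns can be tracked as fan-out records that close only when every raw and derived store confirms deletion. A minimal sketch with hypothetical store names:

```python
import time
from dataclasses import dataclass, field

# Hypothetical store names; a real system would enumerate every raw and
# derived dataset that can hold the content.
STORES = frozenset({"raw_fetches", "parsed_content", "aggregates", "search_index"})

@dataclass
class TakedownRequest:
    url_hash: str  # key shared by raw and derived rows
    received_ts: float = field(default_factory=time.time)
    deleted_from: set[str] = field(default_factory=set)

    def pending(self) -> set[str]:
        return set(STORES) - self.deleted_from

    def overdue(self, sla_secs: float = 24 * 3600) -> bool:
        """True when any store still holds the content past the 24h SLA."""
        return bool(self.pending()) and time.time() - self.received_ts > sla_secs
```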