Context
Meta wants a large-scale crawling pipeline to continuously fetch public web pages, extract structured content and metadata, and publish cleaned datasets for downstream search, integrity, and ranking systems. The current process relies on ad hoc batch jobs and manual recrawls, which causes stale content, duplicate fetches, and poor observability across the ingestion and transformation stages.
You are asked to design a production-grade pipeline using Meta-preferred infrastructure patterns, with strong support for incremental crawling, deduplication, replay, and data quality enforcement.
Scale Requirements
- Seed URLs: 5B known URLs, growing by 50M/day
- Fetch rate: 2M URLs/min sustained, 5M URLs/min peak
- Average page size: 800KB compressed, 2.5MB uncompressed
- Daily ingest: ~4PB compressed raw content on days with peak crawl windows (see the back-of-envelope check after this list)
- Freshness target: high-priority domains recrawled within 15 minutes; long-tail domains within 7 days
- Serving SLA: extracted metadata queryable within 10 minutes of successful fetch
- Retention: raw fetches 30 days, parsed content 180 days, aggregates 2 years
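
A quick sanity check on these figures, using only the fetch rates and compressed page size listed above (decimal units: 1 TB = 1e6 MB, 1 PB = 1e3 TB):

```python
# Back-of-envelope check on fetch bandwidth, with constants copied
# from the scale requirements above.
SUSTAINED_URLS_PER_MIN = 2_000_000
PEAK_URLS_PER_MIN = 5_000_000
AVG_COMPRESSED_MB = 0.8  # 800KB compressed per page

for label, rate in (("sustained", SUSTAINED_URLS_PER_MIN),
                    ("peak", PEAK_URLS_PER_MIN)):
    tb_per_min = rate * AVG_COMPRESSED_MB / 1_000_000
    pb_per_day = tb_per_min * 1440 / 1_000
    print(f"{label}: {tb_per_min:.1f} TB/min, {pb_per_day:.1f} PB/day")

# sustained: 1.6 TB/min, 2.3 PB/day
# peak:      4.0 TB/min, 5.8 PB/day
# A day mixing sustained load with peak windows lands near the ~4PB figure.
```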
Requirements
- Build a pipeline for URL discovery, prioritization, fetch scheduling, parsing, enrichment, and storage.
- Support both streaming recrawl for high-priority domains and batch backfills for historical reprocessing (a tier-scheduling sketch follows this list).
- Enforce robots.txt, per-domain politeness limits, and duplicate-content suppression (see the politeness and dedup sketch below).
- Guarantee idempotent processing for retries, replay, and backfills (an idempotency-key sketch follows).
- Produce analytics-ready datasets for crawl coverage, fetch success, content freshness, and extraction quality (a sample coverage query follows).
- Design monitoring, alerting, and recovery for fetcher failures, parser regressions, and downstream lag.
- Explain partitioning, checkpointing, and schema evolution choices (sketched at the end of this list).
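
One way to reconcile the 15-minute and 7-day freshness targets in a single scheduler is to normalize staleness by each tier's target interval. A minimal sketch; the tier names and the middle tier's interval are illustrative assumptions, not a prescribed taxonomy:

```python
import time
from dataclasses import dataclass

# Hypothetical freshness tiers matching the targets above.
TIER_RECRAWL_SECS = {
    "high": 15 * 60,             # high-priority domains: 15 minutes
    "mid": 24 * 3600,            # assumed middle tier: 1 day
    "long_tail": 7 * 24 * 3600,  # long-tail domains: 7 days
}

@dataclass
class UrlState:
    url: str
    tier: str
    last_fetch_ts: float  # epoch seconds of the last successful fetch

def recrawl_priority(state: UrlState, now: float | None = None) -> float:
    """Staleness normalized by the tier's target interval.

    A score >= 1.0 means the URL is due; larger means more overdue, so a
    single priority queue can order high-priority and long-tail URLs
    together without starving either.
    """
    now = time.time() if now is None else now
    return (now - state.last_fetch_ts) / TIER_RECRAWL_SECS[state.tier]
```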
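For politeness and duplicate suppression, a sketch of a per-domain rate gate plus an exact-duplicate fingerprint; robots.txt parsing, Crawl-delay handling, and near-duplicate detection are assumed to sit on top of this:

```python
import hashlib
import time
from collections import defaultdict

class DomainPoliteness:
    """Minimal per-domain gate: at most one fetch per min_interval_secs
    per domain. A production crawler would also honor robots.txt
    Crawl-delay and back off on 429/503 responses."""

    def __init__(self, min_interval_secs: float = 1.0):
        self.min_interval = min_interval_secs
        self.next_allowed = defaultdict(float)  # domain -> earliest next fetch

    def try_acquire(self, domain: str, now: float | None = None) -> bool:
        now = time.time() if now is None else now
        if now >= self.next_allowed[domain]:
            self.next_allowed[domain] = now + self.min_interval
            return True
        return False

def content_fingerprint(body: bytes) -> str:
    """Exact-duplicate suppression key over the fetched body; near-duplicate
    detection (e.g. SimHash over shingles) would sit alongside this."""
    return hashlib.sha256(body).hexdigest()
```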
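Idempotency can be grounded in a deterministic key per logical fetch, so every retry or replay writes the same row. A minimal sketch, where the delimiter format is an arbitrary assumption:

```python
import hashlib

def processing_key(url: str, fetch_ts_ms: int, content_sha256: str) -> str:
    """Deterministic key for one logical fetch result.

    Keying every downstream write (parsed rows, enrichment output) by this
    value turns retries, replays, and backfills into upserts instead of
    duplicate rows.
    """
    return hashlib.sha256(
        f"{url}|{fetch_ts_ms}|{content_sha256}".encode()
    ).hexdigest()
```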
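As a sketch of the analytics layer, a Presto query for per-domain fetch success; the crawl.fetch_results table, its columns, and the ds partition convention are hypothetical, not a real schema:

```python
# Hypothetical Presto query; table and column names are illustrative.
FETCH_SUCCESS_BY_DOMAIN = """
SELECT
  domain,
  count(*) AS attempts,
  count_if(http_status BETWEEN 200 AND 299) AS successes,
  count_if(http_status BETWEEN 200 AND 299) * 1.0 / count(*) AS success_rate
FROM crawl.fetch_results
WHERE ds = '2024-01-01'  -- daily partition, bound by the orchestrator
GROUP BY domain
ORDER BY attempts DESC
LIMIT 100
"""
```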
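For partitioning and checkpointing, one plausible shape (the bucket count, hash choice, and checkpoint fields are assumptions, not a mandated design): partition output by event date plus a stable domain-hash bucket so a hot domain cannot skew a partition, and commit consumer checkpoints only after durable writes so replays stay safe under the idempotency key above. Schema evolution would then favor additive, versioned fields over in-place rewrites.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

N_BUCKETS = 4096  # assumed bucket count for domain sharding

def partition_for(domain: str, ds: str) -> tuple[str, int]:
    """Partition key: event date plus a stable domain-hash bucket, so
    backfills can target a bounded (ds, bucket) range."""
    bucket = int(hashlib.md5(domain.encode()).hexdigest(), 16) % N_BUCKETS
    return ds, bucket

@dataclass
class Checkpoint:
    """Stream-consumer position, committed only after the corresponding
    output write is durable: restarts then replay at-least-once, and the
    idempotency key above collapses any duplicates."""
    stream: str
    shard: str
    offset: int
    committed_ts_ms: int

    def to_json(self) -> str:
        return json.dumps(asdict(self))
```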
Constraints
- Prefer Meta technologies where applicable: Scribe for log transport, Scuba for real-time slice-and-dice analytics, Presto for interactive SQL, and internal workflow systems for Airflow-like orchestration, rather than generic placeholders.
- Assume a mixed streaming + batch architecture across multiple regions.
- Budget is constrained: avoid recrawling the full corpus more than once every 14 days.
- Must support legal takedown and deletion requests within 24 hours across raw and derived stores (a tracking sketch follows this list).
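
To make the 24-hour deletion SLA auditable, takedowns can be tracked as fan-out records that close only when every raw and derived store confirms deletion. A minimal sketch with hypothetical store names:

```python
import time
from dataclasses import dataclass, field

# Hypothetical store names; a real system would enumerate every raw and
# derived dataset that can hold the content.
STORES = frozenset({"raw_fetches", "parsed_content", "aggregates", "search_index"})

@dataclass
class TakedownRequest:
    url_hash: str  # key shared by raw and derived rows
    received_ts: float = field(default_factory=time.time)
    deleted_from: set[str] = field(default_factory=set)

    def pending(self) -> set[str]:
        return set(STORES) - self.deleted_from

    def overdue(self, sla_secs: float = 24 * 3600) -> bool:
        """True when any store still holds the content past the 24h SLA."""
        return bool(self.pending()) and time.time() - self.received_ts > sla_secs
```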