Design Biodiversity Observation Data Pipeline

Scenario

You are designing a new biodiversity data platform for a conservation organization that aggregates species observations from field surveys, camera traps, acoustic sensors, satellite-derived habitat layers, and external research partners. Today, analysts manually reconcile CSVs, geospatial files, and API exports, which leads to duplicate sightings, inconsistent taxonomy, and reports that disagree across teams. The trigger for the rebuild is an executive escalation after habitat and species trend dashboards showed conflicting counts for the same protected areas. You need a pipeline that supports both scheduled bulk ingestion and low-latency updates for operational monitoring while preserving lineage and auditability.

Current State

Component	Status
Field data collection	ArcGIS Survey123 forms and mobile CSV uploads
Sensor ingestion	Camera trap and acoustic files land in object storage
External partner feeds	REST APIs, SFTP drops, and periodic Darwin Core archives
Processing	Python scripts and ad hoc PostGIS jobs
Storage	PostgreSQL/PostGIS plus raw files in cloud object storage
Orchestration	Cron jobs and manual reruns

Scale: ~25M historical observations; 1.5M new records/day in peak season; 8-12 TB/year of images and audio metadata; 200 GB/month of geospatial layers; partner feeds arriving hourly to weekly. Dashboard freshness targets: under 15 minutes for sensor-derived detections and under 6 hours for bulk partner loads.
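
As a rough sanity check on the figures above, the short Python sketch below converts the stated volumes into per-second and per-day rates. It is only back-of-envelope arithmetic: the variable names are illustrative, decimal units (1 TB = 1000 GB) are assumed, and the 10x burst factor is a hypothetical assumption that does not come from the scenario.

```python
# Back-of-envelope sizing from the stated scale figures.
SECONDS_PER_DAY = 24 * 60 * 60

# Figures taken from the scenario.
historical_observations = 25_000_000
peak_records_per_day = 1_500_000
media_tb_per_year = (8, 12)   # low and high estimate
geo_gb_per_month = 200

# ASSUMPTION (not in the scenario): short bursts run at 10x the daily average.
ASSUMED_BURST_FACTOR = 10

# Average ingest rate if peak-season records arrived evenly over the day.
avg_records_per_sec = peak_records_per_day / SECONDS_PER_DAY
burst_records_per_sec = avg_records_per_sec * ASSUMED_BURST_FACTOR

# Days of peak-season ingest needed to equal the whole historical backlog.
days_to_match_history = historical_observations / peak_records_per_day

# Storage growth, using decimal units (1 TB = 1000 GB).
geo_tb_per_year = geo_gb_per_month * 12 / 1000
media_gb_per_day = tuple(tb * 1000 / 365 for tb in media_tb_per_year)

print(f"Average ingest rate: {avg_records_per_sec:.1f} records/s")
print(f"Assumed burst rate:  {burst_records_per_sec:.0f} records/s")
print(f"Peak days to equal history: {days_to_match_history:.1f}")
print(f"Geospatial layers: {geo_tb_per_year:.1f} TB/year")
print(f"Media: {media_gb_per_day[0]:.0f} to {media_gb_per_day[1]:.0f} GB/day")
```

Under these assumptions the average peak-season rate comes to roughly 17 records per second, and about 17 days of peak ingest would match the entire historical backlog. That suggests the record count itself is modest, and that the harder constraints are more likely the 15-minute freshness target, feed heterogeneity, and data quality.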

Question

How would you design this end-to-end biodiversity data pipeline so that heterogeneous ecological data can be ingested, standardized, validated, deduplicated, and served reliably for analytics and operational conservation workflows? Explain the architecture you would choose, how you would handle geospatial and taxonomy-specific quality issues, and how you would make the system observable and resilient at this scale.
