You are designing a new biodiversity data platform for a conservation organization that aggregates species observations from field surveys, camera traps, acoustic sensors, satellite-derived habitat layers, and external research partners. Today, analysts manually reconcile CSVs, geospatial files, and API exports, which leads to duplicate sightings, inconsistent taxonomy, and reports that disagree across teams. The trigger for the rebuild is an executive escalation after habitat and species trend dashboards showed conflicting counts for the same protected areas. You need a pipeline that supports both scheduled bulk ingestion and low-latency updates for operational monitoring while preserving lineage and auditability.

| Component | Current state |
|---|---|
| Field data collection | ArcGIS Survey123 forms and mobile CSV uploads |
| Sensor ingestion | Camera trap and acoustic files land in object storage |
| External partner feeds | REST APIs, SFTP drops, and periodic Darwin Core archives |
| Processing | Python scripts and ad hoc PostGIS jobs |
| Storage | PostgreSQL/PostGIS plus raw files in cloud object storage |
| Orchestration | Cron jobs and manual reruns |

Scale:

- ~25M historical observations
- 1.5M new records/day in peak season
- 8-12 TB/year of images and audio metadata
- 200 GB/month of geospatial layers
- Partner feeds arriving hourly to weekly

Dashboard freshness targets: under 15 minutes for sensor-derived detections and under 6 hours for bulk partner loads.

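
For a rough sense of what these figures imply, the back-of-envelope sketch below converts them into average rates. It assumes uniform arrival over the day, decimal units (1 TB = 1000 GB), and a 365-day year; none of these assumptions come from the scenario, and real sensor traffic is likely to be bursty, so treat the results as lower bounds for capacity planning rather than as targets.

```python
# Back-of-envelope sizing from the stated scale figures.
# Assumptions (not from the scenario): uniform arrival, 1 TB = 1000 GB, 365-day year.

SECONDS_PER_DAY = 24 * 60 * 60

HISTORICAL_OBSERVATIONS = 25_000_000
PEAK_RECORDS_PER_DAY = 1_500_000
MEDIA_TB_PER_YEAR = (8, 12)  # stated low and high bounds


def avg_records_per_second(records_per_day: int) -> float:
    """Average ingest rate if records arrived evenly across the day."""
    return records_per_day / SECONDS_PER_DAY


def tb_per_year_to_gb_per_day(tb_per_year: float, days_per_year: int = 365) -> float:
    """Convert a yearly volume in TB to an average daily volume in GB."""
    return tb_per_year * 1000 / days_per_year


avg_rps = avg_records_per_second(PEAK_RECORDS_PER_DAY)
media_gb_per_day = tuple(tb_per_year_to_gb_per_day(tb) for tb in MEDIA_TB_PER_YEAR)
# Days of peak-season ingest needed to match the entire historical backlog.
days_to_match_history = HISTORICAL_OBSERVATIONS / PEAK_RECORDS_PER_DAY

print(f"average peak-season ingest: {avg_rps:.1f} records/s")
print(f"media volume: {media_gb_per_day[0]:.1f} to {media_gb_per_day[1]:.1f} GB/day")
print(f"peak ingest matches the historical backlog in {days_to_match_history:.1f} days")
```

Under these assumptions the average record rate is modest (under 20 records/s), while a little over two weeks of peak-season ingest equals the whole historical backlog, which suggests that deduplication and taxonomy reconciliation have to keep pace with incoming data rather than run as occasional backfills.
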
How would you design this end-to-end biodiversity data pipeline so that heterogeneous ecological data can be ingested, standardized, validated, deduplicated, and served reliably for analytics and operational conservation workflows? Explain the architecture you would choose, how you would handle geospatial and taxonomy-specific quality issues, and how you would make the system observable and resilient at this scale.