Context
ChatOrbit operates multiple conversational AI products across web chat, mobile apps, Slack bots, and voice assistants. Today, each platform emits different event schemas into separate PostgreSQL and S3 stores, making it difficult to build a unified interaction model for analytics, billing, and model-quality reporting.
You need to design a pipeline and data model that standardizes user interactions across platforms while preserving platform-specific metadata. The system must support both near-real-time operational dashboards and batch analytics for retention, latency, token usage, and conversation outcomes.
Scale Requirements
- Traffic: 120K events/second peak, 35K events/second average
- Daily volume: ~2.5B events/day
- Event size: 1-4 KB JSON payloads
- Latency target: < 3 minutes from event generation to warehouse availability
- Retention: Raw events for 180 days, curated analytics tables for 3 years
- Query patterns: session reconstruction, cross-platform user journeys, prompt/response metrics, cost attribution
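The scale figures above can be cross-checked with a quick back-of-envelope calculation. The sketch below is illustrative only: it assumes uncompressed JSON, binary units (1 KB = 1024 bytes), and no replication or indexing overhead, and all variable names are made up for this example. Note that 35K events/second sustained for a full day works out to about 3.0B events, slightly above the stated ~2.5B/day, which itself implies an average closer to 29K events/second. Sizing against the higher number is the safer assumption.

```python
# Back-of-envelope sizing from the stated scale requirements.
# Assumptions (not from the source): uncompressed JSON, 1 KB = 1024 bytes,
# no replication, compression, or index overhead.
SECONDS_PER_DAY = 24 * 60 * 60

peak_eps = 120_000               # events/second at peak
avg_eps = 35_000                 # events/second on average
events_per_day = 2_500_000_000   # stated "~2.5B events/day"
event_bytes = (1 * 1024, 4 * 1024)  # 1-4 KB payloads (min, max)
raw_retention_days = 180

# Consistency check between the average rate and the daily volume.
events_per_day_at_avg = avg_eps * SECONDS_PER_DAY
implied_avg_eps = events_per_day / SECONDS_PER_DAY

# Peak ingest bandwidth in MiB/s for the smallest and largest payloads.
peak_mib_per_s = tuple(peak_eps * b / 2**20 for b in event_bytes)

# Raw daily volume and 180-day raw footprint in TiB.
daily_raw_tib = tuple(events_per_day * b / 2**40 for b in event_bytes)
retained_raw_tib = tuple(d * raw_retention_days for d in daily_raw_tib)

print(f"events/day at 35K avg: {events_per_day_at_avg:,}")
print(f"avg events/s implied by 2.5B/day: {implied_avg_eps:,.0f}")
print(f"peak ingest: {peak_mib_per_s[0]:.1f} to {peak_mib_per_s[1]:.1f} MiB/s")
print(f"raw per day: {daily_raw_tib[0]:.2f} to {daily_raw_tib[1]:.2f} TiB")
print(f"raw at 180 days: {retained_raw_tib[0]:.0f} to {retained_raw_tib[1]:.0f} TiB")
```

Even at the low end, the uncompressed 180-day raw footprint runs to hundreds of TiB, so compression and storage tiering are likely to matter under the budget cap.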
Requirements
- Design a canonical event model for user, session, conversation, message, tool-call, and feedback events across all conversational AI platforms.
- Build ingestion for heterogeneous sources including mobile/web SDKs, backend APIs, and third-party platform webhooks.
- Support schema evolution without breaking downstream consumers.
- Deduplicate retries and out-of-order events while preserving event lineage.
- Produce analytics-ready tables for conversation sessions, platform engagement, token/cost usage, and user feedback.
- Enable backfills and replay for historical reprocessing when the canonical schema changes.
- Define monitoring, data quality checks, and failure recovery for streaming and batch layers.
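One possible shape for the canonical event model in the first requirement is a shared envelope with a small set of typed core fields plus a free-form bag for platform-specific metadata. The sketch below is an assumption rather than a prescribed design: `CanonicalEvent`, `make_event_id`, `normalize_webhook_message`, and the raw payload keys (`id`, `ts`, `user`, `thread`, `tokens`) are all hypothetical and do not reflect any real platform's webhook format.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Any, Dict, Optional


class EventType(str, Enum):
    USER = "user"
    SESSION = "session"
    CONVERSATION = "conversation"
    MESSAGE = "message"
    TOOL_CALL = "tool_call"
    FEEDBACK = "feedback"


class Platform(str, Enum):
    WEB = "web"
    MOBILE = "mobile"
    SLACK = "slack"
    VOICE = "voice"


@dataclass(frozen=True)
class CanonicalEvent:
    event_id: str                 # deterministic, so retries map to the same id
    event_type: EventType
    platform: Platform
    occurred_at: datetime         # event time in UTC
    schema_version: int           # bumped when the canonical schema changes
    source_event_id: str          # original id, kept for lineage
    user_key: str                 # platform-scoped id; resolution happens later
    session_id: Optional[str] = None
    conversation_id: Optional[str] = None
    attributes: Dict[str, Any] = field(default_factory=dict)
    platform_metadata: Dict[str, Any] = field(default_factory=dict)


def make_event_id(platform: Platform, source_event_id: str) -> str:
    """Derive a stable id from the platform and the source's own event id."""
    raw = f"{platform.value}:{source_event_id}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


def normalize_webhook_message(raw: Dict[str, Any]) -> CanonicalEvent:
    """Map a hypothetical chat-webhook payload onto the canonical envelope."""
    mapped = {"id", "ts", "user", "thread", "tokens"}
    return CanonicalEvent(
        event_id=make_event_id(Platform.SLACK, raw["id"]),
        event_type=EventType.MESSAGE,
        platform=Platform.SLACK,
        occurred_at=datetime.fromtimestamp(float(raw["ts"]), tz=timezone.utc),
        schema_version=1,
        source_event_id=raw["id"],
        user_key=f"slack:{raw['user']}",
        conversation_id=raw.get("thread"),
        attributes={"token_count": raw.get("tokens")},
        # Anything not mapped to a core field is preserved, not dropped.
        platform_metadata={k: v for k, v in raw.items() if k not in mapped},
    )
```

Keeping unmapped keys in `platform_metadata` lets new platform fields flow through without a schema change, while `schema_version` gives downstream consumers something explicit to branch on during replays.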
Constraints
- AWS is the required cloud; existing warehouse is Snowflake.
- Team size is 5 data engineers; operational complexity should be moderate.
- GDPR/CCPA deletion requests must be fulfilled within 72 hours.
- Incremental infrastructure budget is capped at $30K/month.
- Some platforms provide only at-least-once delivery and inconsistent user identifiers.
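Because some sources deliver at least once, the same logical event can arrive several times and out of order. A minimal sketch of idempotent deduplication that still preserves lineage is shown below. It assumes each delivery carries a stable `event_id`, a per-delivery `delivery_id`, and an `occurred_at` timestamp. These field names are hypothetical, and a production streaming job would also need bounded state (for example, a time-to-live on seen keys) rather than the in-memory dictionary used here.

```python
from typing import Any, Dict, Iterable, List


def dedupe_events(deliveries: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep the first delivery per event_id and record every delivery seen.

    Returns events ordered by event time (occurred_at), with event_id as a
    tiebreaker so the output is deterministic.
    """
    kept: Dict[str, Dict[str, Any]] = {}
    for delivery in deliveries:
        key = delivery["event_id"]
        if key not in kept:
            # First arrival wins; copy so the input is not mutated.
            kept[key] = {**delivery, "delivery_ids": [delivery["delivery_id"]]}
        else:
            # A retry: drop the payload but keep its id for lineage.
            kept[key]["delivery_ids"].append(delivery["delivery_id"])
    return sorted(kept.values(), key=lambda e: (e["occurred_at"], e["event_id"]))


deliveries = [
    {"event_id": "b", "delivery_id": "d1", "occurred_at": 20},
    {"event_id": "a", "delivery_id": "d2", "occurred_at": 10},
    {"event_id": "b", "delivery_id": "d3", "occurred_at": 20},  # retry of "b"
]
result = dedupe_events(deliveries)
print([(e["event_id"], e["delivery_ids"]) for e in result])
# prints [('a', ['d2']), ('b', ['d1', 'd3'])]
```

Recording `delivery_ids` instead of silently discarding retries keeps an audit trail of how often each source redelivers, which can feed the data quality checks called for in the requirements.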