Context
You’re interviewing with the Supply Chain Data Platform team at OmniMart, a big-box retailer with 2,400 stores in the US, a fast-growing e-commerce channel, and a marketplace business. OmniMart sells ~1.2M active SKUs, with high seasonality (holiday, back-to-school) and frequent promotions. Stockouts and overstocks are both expensive: stockouts drive lost revenue and customer churn, while overstocks increase markdowns and carrying costs.
Today, OmniMart’s inventory reporting is stitched together from multiple systems: store POS, warehouse management, vendor EDI feeds, and e-commerce order events. The current “inventory snapshot” is produced by a nightly batch ETL that lands data into Snowflake around 9am local time, meaning planners and automated replenishment systems operate on stale data. Worse, late-arriving updates (returns, cancellations, delayed EDI acknowledgements) cause the snapshot to be wrong for hours or days, and different teams disagree on “available to promise.” Leadership wants a single, trusted inventory fact that is fresh enough for operational decisions and reliable enough for financial reconciliation.
Your task is to design the data needed and a complete pipeline to support inventory optimization (replenishment recommendations, safety stock tuning, and store-to-store transfers), while also serving analytics and near-real-time operational dashboards.
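To ground "safety stock tuning" before diving in: one common textbook formulation (not necessarily what OmniMart would adopt) sizes safety stock from demand variability and lead time. A minimal sketch, with illustrative parameter values:

```python
from math import sqrt

# One common textbook safety-stock formula (illustrative, not OmniMart's actual model):
#   safety_stock = z * sigma_demand * sqrt(lead_time)
# where z is the service-level factor (e.g., ~1.65 for a 95% service level),
# sigma_demand is the standard deviation of daily demand in units,
# and lead_time is the vendor lead time in days.
def safety_stock(z: float, sigma_daily_demand: float, lead_time_days: float) -> float:
    return z * sigma_daily_demand * sqrt(lead_time_days)

# Example: 95% service level, daily demand std dev of 12 units, 7-day lead time
print(round(safety_stock(1.65, 12.0, 7.0)))  # -> 52 units
```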
Scale Requirements
- Stores: 2,400; DCs: 28; vendors: 9,000
- SKUs: 1.2M active; ~250M SKU-store combinations (sparse)
- Event throughput (peak):
  - POS sales: 150K line-items/sec (holiday peak)
  - E-commerce orders/updates: 25K events/sec
  - Inventory adjustments (cycle counts, damages, shrink): 8K events/sec
- Latency targets:
  - Operational “available inventory” dashboard: P95 end-to-end latency (event to dashboard) < 5 minutes
  - Replenishment features for models: hourly aggregates; daily backtests
- Storage:
  - Raw immutable events retained 13 months (~3–5 PB in object storage)
  - Curated warehouse tables retained 3+ years
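A quick back-of-envelope check on those storage numbers; the average event size and the peak-to-average ratio below are assumptions, not givens:

```python
# Back-of-envelope check on the raw-storage figure. The 1,500-byte average
# raw JSON event size and the 3x peak-to-average ratio are assumptions.
PEAK_EVENTS_PER_SEC = 150_000 + 25_000 + 8_000   # POS + e-comm + adjustments
AVG_EVENTS_PER_SEC = PEAK_EVENTS_PER_SEC / 3      # assumed peak:avg ratio
BYTES_PER_EVENT = 1_500                           # assumed raw JSON payload size
RETENTION_DAYS = 13 * 30                          # ~13 months

events_per_day = AVG_EVENTS_PER_SEC * 86_400
bytes_total = events_per_day * BYTES_PER_EVENT * RETENTION_DAYS
print(f"{events_per_day / 1e9:.1f}B events/day, ~{bytes_total / 1e15:.1f} PB raw")
# -> roughly 5.3B events/day and ~3.1 PB over the retention window, so the
#    stated 3-5 PB is plausible once copies, formats, and seasonal peaks
#    are accounted for.
```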
Data Characteristics (What data you need)
You should assume you need to ingest and model at least the following domains:
- Demand signals
  - Store POS transactions (line-item level)
  - E-commerce orders, cancellations, substitutions, partial shipments
  - Promotions/pricing calendar (start/end, eligibility, discount depth)
  - Forecast inputs: weather, holidays, local events (optional but valuable)
- Supply signals
  - Purchase orders, ASNs (advance ship notices), inbound appointments
  - Vendor lead times, fill rates, MOQ/pack sizes
  - DC inventory, store inventory, in-transit inventory
- Inventory movements / state changes
  - Receipts at DC/store
  - Transfers (store↔store, DC↔store)
  - Returns (customer returns, RTV)
  - Adjustments: shrink, damages, cycle counts
- Master/reference data
  - SKU attributes (category, size, perishability, shelf life)
  - Store attributes (region, format, capacity constraints)
  - Supplier/vendor master
Example event schema (inventory movement)
| field | type | notes |
|---|---|---|
| event_id | string | globally unique; used for dedupe |
| event_ts | timestamp | business event time |
| ingest_ts | timestamp | ingestion time |
| sku_id | string | normalized SKU |
| location_id | string | store/DC |
| movement_type | string | SALE, RECEIPT, TRANSFER_OUT, ADJUSTMENT, RETURN |
| qty_delta | number | signed quantity change |
| source_system | string | POS, WMS, OMS, EDI |
| reference_id | string | order_id, po_id, transfer_id |
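For concreteness, a sample event conforming to this schema (all values are illustrative):

```python
# A sample movement event matching the schema above (values illustrative):
event = {
    "event_id": "pos-7f3a9c01-4b2e",       # unique per source emission; dedupe key
    "event_ts": "2024-11-29T14:03:22Z",    # when the movement happened
    "ingest_ts": "2024-11-29T14:03:25Z",   # when the platform received it
    "sku_id": "SKU-0049283",
    "location_id": "STORE-1127",
    "movement_type": "SALE",
    "qty_delta": -2,                       # sales are negative deltas
    "source_system": "POS",
    "reference_id": "order-883920117",
}
```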
Common issues: duplicates (retries), out-of-order events, late-arriving EDI, schema drift, and inconsistent identifiers across systems.
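To make the duplicate and out-of-order concerns concrete, here is a minimal PySpark Structured Streaming sketch of a bronze landing job using the Kafka (MSK) and Spark stack noted in the Constraints section. Broker, topic, and S3 paths are placeholders; the 48-hour watermark is an assumed lateness bound:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("inventory-movements-bronze").getOrCreate()

# Read raw movement events from Kafka (broker and topic are placeholders).
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "msk-broker:9092")
    .option("subscribe", "inventory.movements.raw")
    .load()
)

events = raw.selectExpr("CAST(value AS STRING) AS json").select(
    F.from_json(
        "json",
        "event_id STRING, event_ts TIMESTAMP, ingest_ts TIMESTAMP, "
        "sku_id STRING, location_id STRING, movement_type STRING, "
        "qty_delta DOUBLE, source_system STRING, reference_id STRING",
    ).alias("e")
).select("e.*")

# Drop producer-retry duplicates. Retries replay the same payload, so
# event_ts is identical across duplicates; including it in the dedupe key
# lets the watermark bound the dedupe state. Events arriving later than the
# watermark fall through to a batch reconciliation job instead.
deduped = (
    events
    .withWatermark("event_ts", "48 hours")
    .dropDuplicates(["event_id", "event_ts"])
)

# Land the deduplicated stream as immutable bronze files in object storage.
(
    deduped.writeStream
    .format("parquet")
    .option("path", "s3://omnimart-lake/bronze/inventory_movements/")       # placeholder
    .option("checkpointLocation", "s3://omnimart-lake/_checkpoints/movements/")
    .trigger(processingTime="1 minute")
    .start()
)
```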
Requirements
Functional requirements
- Near-real-time inventory position per SKU-location, including on-hand, reserved, in-transit, and available-to-promise (see the position sketch after this list).
- Support late-arriving and corrected events (returns posted days later; EDI updates) without corrupting downstream aggregates.
- Provide hourly and daily demand aggregates (units sold, lost sales proxy, promo lift) for inventory optimization models.
- Build a reconciliation-friendly data model: ability to tie inventory changes back to source documents (PO, receipt, sale, adjustment).
- Expose curated tables for:
  - BI dashboards (store ops)
  - Data science feature store / model training
  - Finance and audit reporting
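One simplified reading of these two requirements together: treat the movement events as an immutable ledger and derive positions by folding over it, so late or corrected events are handled by appending and re-deriving rather than by mutating aggregates. The RESERVATION event type and the ATP formula below are illustrative assumptions, not part of the given schema; the real definition must be agreed with finance:

```python
from collections import defaultdict

# Simplified position fold over the immutable movement ledger.
def positions(events):
    on_hand = defaultdict(float)
    reserved = defaultdict(float)
    for e in events:
        key = (e["sku_id"], e["location_id"])
        if e["movement_type"] == "RESERVATION":   # hypothetical event type
            reserved[key] += e["qty_delta"]
        else:                                     # SALE, RECEIPT, TRANSFER_OUT, ...
            on_hand[key] += e["qty_delta"]
    # One common baseline: available-to-promise = on-hand minus reservations.
    return {
        k: {"on_hand": v, "reserved": reserved[k], "atp": v - reserved[k]}
        for k, v in on_hand.items()
    }
```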
Non-functional requirements
- Idempotent processing with exactly-once effects (no double-counting), even when delivery from sources is at-least-once.
- Data quality framework: schema validation, referential integrity, freshness SLAs, anomaly detection (a freshness-check sketch follows this list).
- Observability: end-to-end lineage, backfill tooling, and clear on-call runbooks.
- Security/compliance: least privilege, encryption, and audit logs. (Assume no PII in inventory events, but orders may contain customer identifiers—must be segregated.)
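A freshness check can be as simple as comparing the newest ingest_ts per source against its SLA. A minimal sketch; the per-source thresholds are placeholder assumptions:

```python
from datetime import datetime, timedelta, timezone

# Minimal freshness-SLA check: flag sources whose newest event is older
# than its target. The SLA values below are placeholders.
FRESHNESS_SLAS = {
    "POS": timedelta(minutes=5),
    "OMS": timedelta(minutes=5),
    "WMS": timedelta(minutes=15),
    "EDI": timedelta(hours=4),   # EDI acks are legitimately slower
}

def check_freshness(latest_ingest_ts: dict[str, datetime]) -> list[str]:
    """Return the sources currently violating their freshness SLA."""
    now = datetime.now(timezone.utc)
    never = datetime.min.replace(tzinfo=timezone.utc)
    return [
        source
        for source, sla in FRESHNESS_SLAS.items()
        if now - latest_ingest_ts.get(source, never) > sla
    ]
```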
Constraints
- Cloud: AWS. Data warehouse is Snowflake (already heavily used).
- Existing: Kafka is available (MSK), Airflow is used for orchestration, and the team knows Spark.
- Budget: incremental platform spend should stay under $80K/month.
- Org: 6 data engineers; 2 analytics engineers (dbt); 3 data scientists.
Your task (what you must design)
Explain:
- What data you would ingest (minimum viable set vs phase 2), and why.
- How you would structure the pipeline (stream + batch), including:
  - ingestion patterns
  - data modeling layers (raw/bronze, clean/silver, curated/gold)
  - handling late data and corrections
  - orchestration and backfills (see the Airflow sketch at the end)
- How you would ensure data quality, and how you would monitor and operate this in production.
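Since Airflow is already in place, backfills can lean on its native catchup, provided each run is idempotent per logical hour. A minimal sketch, assuming Airflow 2.x; the DAG id and the aggregation callable are placeholders:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hourly demand-aggregate DAG. Each run (re)builds exactly one logical hour
# from the immutable bronze events and overwrites the target partition, so
# re-running or back-filling a date range is safe.
def build_hourly_demand_aggregates(data_interval_start=None, **context):
    # Placeholder: recompute this hour's aggregates from bronze and
    # overwrite the matching silver/gold partition (same input hour ->
    # same output partition, hence idempotent).
    ...

with DAG(
    dag_id="hourly_demand_aggregates",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=True,                  # lets `airflow dags backfill` replay ranges
    max_active_runs=4,             # bound concurrency during large backfills
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    PythonOperator(
        task_id="build_aggregates",
        python_callable=build_hourly_demand_aggregates,
    )
```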