Retail Inventory Optimization Data Pipeline

Medium

Pipelines

Asked at 1 company1ETLBatch ProcessingOrchestrationDependenciesData Wrangling

Also asked at

Problem

Context

You’re interviewing with the Supply Chain Data Platform team at OmniMart, a big-box retailer with 2,400 stores in the US, a fast-growing e-commerce channel, and a marketplace business. OmniMart sells ~1.2M active SKUs, with high seasonality (holiday, back-to-school) and frequent promotions. Stockouts and overstocks are both expensive: stockouts drive lost revenue and customer churn, while overstocks increase markdowns and carrying costs.

Today, OmniMart’s inventory reporting is stitched together from multiple systems: store POS, warehouse management, vendor EDI feeds, and e-commerce order events. The current “inventory snapshot” is produced by a nightly batch ETL that lands data into Snowflake around 9am local time, meaning planners and automated replenishment systems operate on stale data. Worse, late-arriving updates (returns, cancellations, delayed EDI acknowledgements) cause the snapshot to be wrong for hours or days, and different teams disagree on “available to promise.” Leadership wants a single, trusted inventory fact that is fresh enough for operational decisions and reliable enough for financial reconciliation.

Your task is to design the data needed and a complete pipeline to support inventory optimization (replenishment recommendations, safety stock tuning, and store-to-store transfers), while also serving analytics and near-real-time operational dashboards.

Scale Requirements

Stores: 2,400; DCs: 28; vendors: 9,000
SKUs: 1.2M active; ~250M SKU-store combinations (sparse)
Event throughput (peak):
- POS sales: 150K line-items/sec (holiday peak)
- E-commerce orders/updates: 25K events/sec
- Inventory adjustments (cycle counts, damages, shrink): 8K events/sec
Latency targets:
- Operational “available inventory” dashboard: P95 < 5 minutes
- Replenishment features for models: hourly aggregates; daily backtests
Storage:
- Raw immutable events retained 13 months (~3–5 PB in object storage)
- Curated warehouse tables retained 3+ years

Data Characteristics (What data you need)

You should assume you need to ingest and model at least the following domains:

Demand signals
- Store POS transactions (line-item level)
- E-commerce orders, cancellations, substitutions, partial shipments
- Promotions/pricing calendar (start/end, eligibility, discount depth)
- Forecast inputs: weather, holidays, local events (optional but valuable)
Supply signals
- Purchase orders, ASN (advance ship notices), inbound appointments
- Vendor lead times, fill rates, MOQ/pack sizes
- DC inventory, store inventory, in-transit inventory
Inventory movements / state changes
- Receipts at DC/store
- Transfers (store↔store, DC↔store)
- Returns (customer returns, RTV)
- Adjustments: shrink, damages, cycle counts
Master/reference data
- SKU attributes (category, size, perishability, shelf life)
- Store attributes (region, format, capacity constraints)
- Supplier/vendor master

Example event schema (inventory movement)

field	type	notes
event_id	string	globally unique; used for dedupe
event_ts	timestamp	business event time
ingest_ts	timestamp	ingestion time
sku_id	string	normalized SKU
location_id	string	store/DC
movement_type	string	SALE, RECEIPT, TRANSFER_OUT, ADJUSTMENT, RETURN
qty_delta	number	signed quantity change
source_system	string	POS, WMS, OMS, EDI
reference_id	string	order_id, po_id, transfer_id

Common issues: duplicates (retries), out-of-order events, late-arriving EDI, schema drift, and inconsistent identifiers across systems.

Requirements

Functional requirements

Near-real-time inventory position per SKU-location, including on-hand, reserved, in-transit, and available-to-promise.
Support late-arriving and corrected events (returns posted days later; EDI updates) without corrupting downstream aggregates.
Provide hourly and daily demand aggregates (units sold, lost sales proxy, promo lift) for inventory optimization models.
Build a reconciliation-friendly data model: ability to tie inventory changes back to source documents (PO, receipt, sale, adjustment).
Expose curated tables for:
- BI dashboards (store ops)
- Data science feature store / model training
- Finance and audit reporting

Non-functional requirements

Idempotent processing and exactly-once effective outcomes (no double-counting).
Data quality framework: schema validation, referential integrity, freshness SLAs, anomaly detection.
Observability: end-to-end lineage, backfill tooling, and clear on-call runbooks.
Security/compliance: least privilege, encryption, and audit logs. (Assume no PII in inventory events, but orders may contain customer identifiers—must be segregated.)

Constraints

Cloud: AWS. Data warehouse is Snowflake (already heavily used).
Existing: Kafka is available (MSK), Airflow is used for orchestration, and the team knows Spark.
Budget: incremental platform spend should stay under $80K/month.
Org: 6 data engineers; 2 analytics engineers (dbt); 3 data scientists.

Your task (what you must design)

Explain:

What data you would ingest (minimum viable set vs phase 2), and why.
How you would structure the pipeline (stream + batch), including:
- ingestion patterns
- data modeling layers (raw/bronze, clean/silver, curated/gold)
- handling late data and corrections
- orchestration and backfills
How you would ensure data quality, and how you would monitor and operate this in production.

Problem

Context

Scale Requirements

Stores: 2,400; DCs: 28; vendors: 9,000
SKUs: 1.2M active; ~250M SKU-store combinations (sparse)
Event throughput (peak):
- POS sales: 150K line-items/sec (holiday peak)
- E-commerce orders/updates: 25K events/sec
- Inventory adjustments (cycle counts, damages, shrink): 8K events/sec
Latency targets:
- Operational “available inventory” dashboard: P95 < 5 minutes
- Replenishment features for models: hourly aggregates; daily backtests
Storage:
- Raw immutable events retained 13 months (~3–5 PB in object storage)
- Curated warehouse tables retained 3+ years

Data Characteristics (What data you need)

You should assume you need to ingest and model at least the following domains:

Demand signals
- Store POS transactions (line-item level)
- E-commerce orders, cancellations, substitutions, partial shipments
- Promotions/pricing calendar (start/end, eligibility, discount depth)
- Forecast inputs: weather, holidays, local events (optional but valuable)
Supply signals
- Purchase orders, ASN (advance ship notices), inbound appointments
- Vendor lead times, fill rates, MOQ/pack sizes
- DC inventory, store inventory, in-transit inventory
Inventory movements / state changes
- Receipts at DC/store
- Transfers (store↔store, DC↔store)
- Returns (customer returns, RTV)
- Adjustments: shrink, damages, cycle counts
Master/reference data
- SKU attributes (category, size, perishability, shelf life)
- Store attributes (region, format, capacity constraints)
- Supplier/vendor master

Example event schema (inventory movement)

field	type	notes
event_id	string	globally unique; used for dedupe
event_ts	timestamp	business event time
ingest_ts	timestamp	ingestion time
sku_id	string	normalized SKU
location_id	string	store/DC
movement_type	string	SALE, RECEIPT, TRANSFER_OUT, ADJUSTMENT, RETURN
qty_delta	number	signed quantity change
source_system	string	POS, WMS, OMS, EDI
reference_id	string	order_id, po_id, transfer_id

Common issues: duplicates (retries), out-of-order events, late-arriving EDI, schema drift, and inconsistent identifiers across systems.

Requirements

Functional requirements

Near-real-time inventory position per SKU-location, including on-hand, reserved, in-transit, and available-to-promise.
Support late-arriving and corrected events (returns posted days later; EDI updates) without corrupting downstream aggregates.
Provide hourly and daily demand aggregates (units sold, lost sales proxy, promo lift) for inventory optimization models.
Build a reconciliation-friendly data model: ability to tie inventory changes back to source documents (PO, receipt, sale, adjustment).
Expose curated tables for:
- BI dashboards (store ops)
- Data science feature store / model training
- Finance and audit reporting

Non-functional requirements

Idempotent processing and exactly-once effective outcomes (no double-counting).
Data quality framework: schema validation, referential integrity, freshness SLAs, anomaly detection.
Observability: end-to-end lineage, backfill tooling, and clear on-call runbooks.
Security/compliance: least privilege, encryption, and audit logs. (Assume no PII in inventory events, but orders may contain customer identifiers—must be segregated.)

Constraints

Cloud: AWS. Data warehouse is Snowflake (already heavily used).
Existing: Kafka is available (MSK), Airflow is used for orchestration, and the team knows Spark.
Budget: incremental platform spend should stay under $80K/month.
Org: 6 data engineers; 2 analytics engineers (dbt); 3 data scientists.

Your task (what you must design)

Explain:

What data you would ingest (minimum viable set vs phase 2), and why.
How you would structure the pipeline (stream + batch), including:
- ingestion patterns
- data modeling layers (raw/bronze, clean/silver, curated/gold)
- handling late data and corrections
- orchestration and backfills
How you would ensure data quality, and how you would monitor and operate this in production.

Your answer

Try one AI text evaluation on us

Get structured feedback, scored against a 4-axis rubric. Premium unlocks unlimited.

0 wordstarget ~200

Up next

Real-Time Flash Sale Inventory PipelineHard AModel Retail Operations Analytics PipelineMedium AOptimize Read-Heavy Reference PipelineHard

Next question

Context

Scale Requirements

Stores: 2,400; DCs: 28; vendors: 9,000
SKUs: 1.2M active; ~250M SKU-store combinations (sparse)
Event throughput (peak):
- POS sales: 150K line-items/sec (holiday peak)
- E-commerce orders/updates: 25K events/sec
- Inventory adjustments (cycle counts, damages, shrink): 8K events/sec
Latency targets:
- Operational “available inventory” dashboard: P95 < 5 minutes
- Replenishment features for models: hourly aggregates; daily backtests
Storage:
- Raw immutable events retained 13 months (~3–5 PB in object storage)
- Curated warehouse tables retained 3+ years

Data Characteristics (What data you need)

You should assume you need to ingest and model at least the following domains:

Demand signals
- Store POS transactions (line-item level)
- E-commerce orders, cancellations, substitutions, partial shipments
- Promotions/pricing calendar (start/end, eligibility, discount depth)
- Forecast inputs: weather, holidays, local events (optional but valuable)
Supply signals
- Purchase orders, ASN (advance ship notices), inbound appointments
- Vendor lead times, fill rates, MOQ/pack sizes
- DC inventory, store inventory, in-transit inventory
Inventory movements / state changes
- Receipts at DC/store
- Transfers (store↔store, DC↔store)
- Returns (customer returns, RTV)
- Adjustments: shrink, damages, cycle counts
Master/reference data
- SKU attributes (category, size, perishability, shelf life)
- Store attributes (region, format, capacity constraints)
- Supplier/vendor master

Example event schema (inventory movement)

field	type	notes
event_id	string	globally unique; used for dedupe
event_ts	timestamp	business event time
ingest_ts	timestamp	ingestion time
sku_id	string	normalized SKU
location_id	string	store/DC
movement_type	string	SALE, RECEIPT, TRANSFER_OUT, ADJUSTMENT, RETURN
qty_delta	number	signed quantity change
source_system	string	POS, WMS, OMS, EDI
reference_id	string	order_id, po_id, transfer_id

Common issues: duplicates (retries), out-of-order events, late-arriving EDI, schema drift, and inconsistent identifiers across systems.

Requirements

Functional requirements

Near-real-time inventory position per SKU-location, including on-hand, reserved, in-transit, and available-to-promise.
Support late-arriving and corrected events (returns posted days later; EDI updates) without corrupting downstream aggregates.
Provide hourly and daily demand aggregates (units sold, lost sales proxy, promo lift) for inventory optimization models.
Build a reconciliation-friendly data model: ability to tie inventory changes back to source documents (PO, receipt, sale, adjustment).
Expose curated tables for:
- BI dashboards (store ops)
- Data science feature store / model training
- Finance and audit reporting

Non-functional requirements

Idempotent processing and exactly-once effective outcomes (no double-counting).
Data quality framework: schema validation, referential integrity, freshness SLAs, anomaly detection.
Observability: end-to-end lineage, backfill tooling, and clear on-call runbooks.
Security/compliance: least privilege, encryption, and audit logs. (Assume no PII in inventory events, but orders may contain customer identifiers—must be segregated.)

Constraints

Cloud: AWS. Data warehouse is Snowflake (already heavily used).
Existing: Kafka is available (MSK), Airflow is used for orchestration, and the team knows Spark.
Budget: incremental platform spend should stay under $80K/month.
Org: 6 data engineers; 2 analytics engineers (dbt); 3 data scientists.

Your task (what you must design)

Explain:

What data you would ingest (minimum viable set vs phase 2), and why.
How you would structure the pipeline (stream + batch), including:
- ingestion patterns
- data modeling layers (raw/bronze, clean/silver, curated/gold)
- handling late data and corrections
- orchestration and backfills
How you would ensure data quality, and how you would monitor and operate this in production.