Dataford
Interview Guides
Upgrade
All questions/Pipelines/Retail Inventory Optimization Data Pipeline

Retail Inventory Optimization Data Pipeline

Medium
Pipelines
Asked at 1 company1ETLBatch ProcessingOrchestrationDependenciesData Wrangling
Also asked at
McKinsey &

Problem

Context

You’re interviewing with the Supply Chain Data Platform team at OmniMart, a big-box retailer with 2,400 stores in the US, a fast-growing e-commerce channel, and a marketplace business. OmniMart sells ~1.2M active SKUs, with high seasonality (holiday, back-to-school) and frequent promotions. Stockouts and overstocks are both expensive: stockouts drive lost revenue and customer churn, while overstocks increase markdowns and carrying costs.

Today, OmniMart’s inventory reporting is stitched together from multiple systems: store POS, warehouse management, vendor EDI feeds, and e-commerce order events. The current “inventory snapshot” is produced by a nightly batch ETL that lands data into Snowflake around 9am local time, meaning planners and automated replenishment systems operate on stale data. Worse, late-arriving updates (returns, cancellations, delayed EDI acknowledgements) cause the snapshot to be wrong for hours or days, and different teams disagree on “available to promise.” Leadership wants a single, trusted inventory fact that is fresh enough for operational decisions and reliable enough for financial reconciliation.

Your task is to design the data needed and a complete pipeline to support inventory optimization (replenishment recommendations, safety stock tuning, and store-to-store transfers), while also serving analytics and near-real-time operational dashboards.

Scale Requirements

  • Stores: 2,400; DCs: 28; vendors: 9,000
  • SKUs: 1.2M active; ~250M SKU-store combinations (sparse)
  • Event throughput (peak):
    • POS sales: 150K line-items/sec (holiday peak)
    • E-commerce orders/updates: 25K events/sec
    • Inventory adjustments (cycle counts, damages, shrink): 8K events/sec
  • Latency targets:
    • Operational “available inventory” dashboard: P95 < 5 minutes
    • Replenishment features for models: hourly aggregates; daily backtests
  • Storage:
    • Raw immutable events retained 13 months (~3–5 PB in object storage)
    • Curated warehouse tables retained 3+ years

Data Characteristics (What data you need)

You should assume you need to ingest and model at least the following domains:

  1. Demand signals

    • Store POS transactions (line-item level)
    • E-commerce orders, cancellations, substitutions, partial shipments
    • Promotions/pricing calendar (start/end, eligibility, discount depth)
    • Forecast inputs: weather, holidays, local events (optional but valuable)
  2. Supply signals

    • Purchase orders, ASN (advance ship notices), inbound appointments
    • Vendor lead times, fill rates, MOQ/pack sizes
    • DC inventory, store inventory, in-transit inventory
  3. Inventory movements / state changes

    • Receipts at DC/store
    • Transfers (store↔store, DC↔store)
    • Returns (customer returns, RTV)
    • Adjustments: shrink, damages, cycle counts
  4. Master/reference data

    • SKU attributes (category, size, perishability, shelf life)
    • Store attributes (region, format, capacity constraints)
    • Supplier/vendor master

Example event schema (inventory movement)

fieldtypenotes
event_idstringglobally unique; used for dedupe
event_tstimestampbusiness event time
ingest_tstimestampingestion time
sku_idstringnormalized SKU
location_idstringstore/DC
movement_typestringSALE, RECEIPT, TRANSFER_OUT, ADJUSTMENT, RETURN
qty_deltanumbersigned quantity change
source_systemstringPOS, WMS, OMS, EDI
reference_idstringorder_id, po_id, transfer_id

Common issues: duplicates (retries), out-of-order events, late-arriving EDI, schema drift, and inconsistent identifiers across systems.

Requirements

Functional requirements

  1. Near-real-time inventory position per SKU-location, including on-hand, reserved, in-transit, and available-to-promise.
  2. Support late-arriving and corrected events (returns posted days later; EDI updates) without corrupting downstream aggregates.
  3. Provide hourly and daily demand aggregates (units sold, lost sales proxy, promo lift) for inventory optimization models.
  4. Build a reconciliation-friendly data model: ability to tie inventory changes back to source documents (PO, receipt, sale, adjustment).
  5. Expose curated tables for:
    • BI dashboards (store ops)
    • Data science feature store / model training
    • Finance and audit reporting

Non-functional requirements

  1. Idempotent processing and exactly-once effective outcomes (no double-counting).
  2. Data quality framework: schema validation, referential integrity, freshness SLAs, anomaly detection.
  3. Observability: end-to-end lineage, backfill tooling, and clear on-call runbooks.
  4. Security/compliance: least privilege, encryption, and audit logs. (Assume no PII in inventory events, but orders may contain customer identifiers—must be segregated.)

Constraints

  • Cloud: AWS. Data warehouse is Snowflake (already heavily used).
  • Existing: Kafka is available (MSK), Airflow is used for orchestration, and the team knows Spark.
  • Budget: incremental platform spend should stay under $80K/month.
  • Org: 6 data engineers; 2 analytics engineers (dbt); 3 data scientists.

Your task (what you must design)

Explain:

  1. What data you would ingest (minimum viable set vs phase 2), and why.
  2. How you would structure the pipeline (stream + batch), including:
    • ingestion patterns
    • data modeling layers (raw/bronze, clean/silver, curated/gold)
    • handling late data and corrections
    • orchestration and backfills
  3. How you would ensure data quality, and how you would monitor and operate this in production.

Problem

Context

You’re interviewing with the Supply Chain Data Platform team at OmniMart, a big-box retailer with 2,400 stores in the US, a fast-growing e-commerce channel, and a marketplace business. OmniMart sells ~1.2M active SKUs, with high seasonality (holiday, back-to-school) and frequent promotions. Stockouts and overstocks are both expensive: stockouts drive lost revenue and customer churn, while overstocks increase markdowns and carrying costs.

Today, OmniMart’s inventory reporting is stitched together from multiple systems: store POS, warehouse management, vendor EDI feeds, and e-commerce order events. The current “inventory snapshot” is produced by a nightly batch ETL that lands data into Snowflake around 9am local time, meaning planners and automated replenishment systems operate on stale data. Worse, late-arriving updates (returns, cancellations, delayed EDI acknowledgements) cause the snapshot to be wrong for hours or days, and different teams disagree on “available to promise.” Leadership wants a single, trusted inventory fact that is fresh enough for operational decisions and reliable enough for financial reconciliation.

Your task is to design the data needed and a complete pipeline to support inventory optimization (replenishment recommendations, safety stock tuning, and store-to-store transfers), while also serving analytics and near-real-time operational dashboards.

Scale Requirements

  • Stores: 2,400; DCs: 28; vendors: 9,000
  • SKUs: 1.2M active; ~250M SKU-store combinations (sparse)
  • Event throughput (peak):
    • POS sales: 150K line-items/sec (holiday peak)
    • E-commerce orders/updates: 25K events/sec
    • Inventory adjustments (cycle counts, damages, shrink): 8K events/sec
  • Latency targets:
    • Operational “available inventory” dashboard: P95 < 5 minutes
    • Replenishment features for models: hourly aggregates; daily backtests
  • Storage:
    • Raw immutable events retained 13 months (~3–5 PB in object storage)
    • Curated warehouse tables retained 3+ years

Data Characteristics (What data you need)

You should assume you need to ingest and model at least the following domains:

  1. Demand signals

    • Store POS transactions (line-item level)
    • E-commerce orders, cancellations, substitutions, partial shipments
    • Promotions/pricing calendar (start/end, eligibility, discount depth)
    • Forecast inputs: weather, holidays, local events (optional but valuable)
  2. Supply signals

    • Purchase orders, ASN (advance ship notices), inbound appointments
    • Vendor lead times, fill rates, MOQ/pack sizes
    • DC inventory, store inventory, in-transit inventory
  3. Inventory movements / state changes

    • Receipts at DC/store
    • Transfers (store↔store, DC↔store)
    • Returns (customer returns, RTV)
    • Adjustments: shrink, damages, cycle counts
  4. Master/reference data

    • SKU attributes (category, size, perishability, shelf life)
    • Store attributes (region, format, capacity constraints)
    • Supplier/vendor master

Example event schema (inventory movement)

fieldtypenotes
event_idstringglobally unique; used for dedupe
event_tstimestampbusiness event time
ingest_tstimestampingestion time
sku_idstringnormalized SKU
location_idstringstore/DC
movement_typestringSALE, RECEIPT, TRANSFER_OUT, ADJUSTMENT, RETURN
qty_deltanumbersigned quantity change
source_systemstringPOS, WMS, OMS, EDI
reference_idstringorder_id, po_id, transfer_id

Common issues: duplicates (retries), out-of-order events, late-arriving EDI, schema drift, and inconsistent identifiers across systems.

Requirements

Functional requirements

  1. Near-real-time inventory position per SKU-location, including on-hand, reserved, in-transit, and available-to-promise.
  2. Support late-arriving and corrected events (returns posted days later; EDI updates) without corrupting downstream aggregates.
  3. Provide hourly and daily demand aggregates (units sold, lost sales proxy, promo lift) for inventory optimization models.
  4. Build a reconciliation-friendly data model: ability to tie inventory changes back to source documents (PO, receipt, sale, adjustment).
  5. Expose curated tables for:
    • BI dashboards (store ops)
    • Data science feature store / model training
    • Finance and audit reporting

Non-functional requirements

  1. Idempotent processing and exactly-once effective outcomes (no double-counting).
  2. Data quality framework: schema validation, referential integrity, freshness SLAs, anomaly detection.
  3. Observability: end-to-end lineage, backfill tooling, and clear on-call runbooks.
  4. Security/compliance: least privilege, encryption, and audit logs. (Assume no PII in inventory events, but orders may contain customer identifiers—must be segregated.)

Constraints

  • Cloud: AWS. Data warehouse is Snowflake (already heavily used).
  • Existing: Kafka is available (MSK), Airflow is used for orchestration, and the team knows Spark.
  • Budget: incremental platform spend should stay under $80K/month.
  • Org: 6 data engineers; 2 analytics engineers (dbt); 3 data scientists.

Your task (what you must design)

Explain:

  1. What data you would ingest (minimum viable set vs phase 2), and why.
  2. How you would structure the pipeline (stream + batch), including:
    • ingestion patterns
    • data modeling layers (raw/bronze, clean/silver, curated/gold)
    • handling late data and corrections
    • orchestration and backfills
  3. How you would ensure data quality, and how you would monitor and operate this in production.
Your answer
Try one AI text evaluation on us
Get structured feedback, scored against a 4-axis rubric. Premium unlocks unlimited.
0 wordstarget ~200
Up next
Real-Time Flash Sale Inventory PipelineHardAModel Retail Operations Analytics PipelineMediumAOptimize Read-Heavy Reference PipelineHard
Next question