You’re joining the data platform team at a global e-commerce marketplace processing 10M+ orders/day across web and mobile. Finance and Sales Ops rely on a curated sales reporting layer for daily revenue, discounts, returns, and margin reporting by product category, region, and sales channel. The current operational schema is highly normalized and optimized for writes, but analysts are struggling with slow queries and inconsistent definitions (e.g., “net sales” vs “gross sales”).
Design a data model (tables + keys + relationships) for a sales reporting system that supports common analytics queries such as:
In your answer, explain:
A star schema places measurable events in fact tables (e.g., sales line items) and descriptive attributes in dimension tables (e.g., product, customer, date). This reduces join complexity for analytics and improves performance in columnar warehouses.
SELECT d.calendar_date, p.category, SUM(f.net_sales_amount) AS net_sales
FROM fact_sales_line f
JOIN dim_date d ON f.date_key = d.date_key
JOIN dim_product p ON f.product_key = p.product_key
GROUP BY 1,2;
The grain defines what a single row represents (e.g., one order line item). Correct grain ensures metrics are additive and prevents double counting when joining to dimensions like product or customer.
/* Grain example: one row per order_id + line_number */
-- fact_sales_line(order_id, line_number, product_key, quantity, net_sales_amount)
Surrogate keys (warehouse-generated integers) provide stable joins even when source identifiers change or collide across systems. Natural keys (e.g., SKU, customer_id) should typically be enforced via unique constraints and stored for lineage, but not always used as the primary join key in the warehouse.
CREATE TABLE dim_product (
product_key BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
sku VARCHAR(64) NOT NULL,
...,
UNIQUE (sku)
);
SCD Type 2 preserves history by versioning dimension rows with effective start/end timestamps and a current flag. This enables historically accurate reporting (e.g., revenue by the product category at the time of sale).
SELECT f.order_id, p.category
FROM fact_sales_line f
JOIN dim_product p
ON f.product_key = p.product_key
WHERE p.is_current = TRUE;
Returns and refunds can be modeled as negative facts, separate fact tables, or as adjustments linked to the original sale line. Discount allocation must be defined (order-level prorated to lines vs line-level discounts) to ensure net sales reconcile with finance systems.
SELECT order_id,
SUM(gross_sales_amount) AS gross,
SUM(discount_amount) AS discounts,
SUM(refund_amount) AS refunds,
SUM(net_sales_amount) AS net
FROM fact_sales_line
GROUP BY 1;