Designing Metric Databases for SLAs

Context

You’re joining a fintech payments company that processes 25M card authorizations/day across multiple regions. Reliability and performance are business-critical: a 200ms latency regression can drop authorization conversion and trigger partner escalations. The company wants a database-backed metrics system to track API latency, error rates, and throughput per endpoint, region, and client tier. These metrics power dashboards, alerting (SLO/SLA), and incident retrospectives.

Unlike ad-hoc analytics tables, performance metrics have unique characteristics: they are high-volume, time-series, often require rollups (minute → hour → day), and must support both real-time-ish queries (last 15 minutes) and long-range trend analysis (last 90 days). The system also needs to be maintainable: schema evolution, backfills, retention policies, and data quality guarantees matter as much as query correctness.

Core Question

Explain how you would approach building and maintaining databases for capturing performance metrics.

In your answer, cover:

Data modeling: What raw events you would store vs what aggregated tables you would maintain (e.g., minute-level rollups). How you choose dimensions (endpoint, region, status_code_family) and avoid cardinality explosions.
Ingestion & idempotency: How metrics are written (batch vs streaming), how you handle duplicates/out-of-order arrivals, and what keys/constraints you’d use.
Query patterns: How you would support common questions like p95 latency by endpoint over time, error rate by region, and “top regressions vs last week.” Mention SQL constructs you’d rely on (aggregations, window functions, CTEs).
Performance & maintenance: Partitioning strategy (typically by time), indexing, rollup jobs, retention/TTL, and how you’d handle late data and backfills.
Data quality & governance: Validation checks (missing minutes, impossible values), auditability, and how you’d make the system safe for on-call usage.

Scope Guidance (what the interviewer expects)

Speak concretely: propose 2–3 tables (raw + rollup), primary keys, and partitioning.
Discuss trade-offs (storage vs query latency; normalization vs denormalization; flexibility vs cost).
Mention operational concerns: schema changes, reprocessing, and how you keep dashboards stable during migrations.

Context

Core Question

Explain how you would approach building and maintaining databases for capturing performance metrics.

In your answer, cover:

Data modeling: What raw events you would store vs what aggregated tables you would maintain (e.g., minute-level rollups). How you choose dimensions (endpoint, region, status_code_family) and avoid cardinality explosions.
Ingestion & idempotency: How metrics are written (batch vs streaming), how you handle duplicates/out-of-order arrivals, and what keys/constraints you’d use.
Query patterns: How you would support common questions like p95 latency by endpoint over time, error rate by region, and “top regressions vs last week.” Mention SQL constructs you’d rely on (aggregations, window functions, CTEs).
Performance & maintenance: Partitioning strategy (typically by time), indexing, rollup jobs, retention/TTL, and how you’d handle late data and backfills.
Data quality & governance: Validation checks (missing minutes, impossible values), auditability, and how you’d make the system safe for on-call usage.

Scope Guidance (what the interviewer expects)

Speak concretely: propose 2–3 tables (raw + rollup), primary keys, and partitioning.
Discuss trade-offs (storage vs query latency; normalization vs denormalization; flexibility vs cost).
Mention operational concerns: schema changes, reprocessing, and how you keep dashboards stable during migrations.

Context

Core Question

Explain how you would approach building and maintaining databases for capturing performance metrics.

In your answer, cover:

Data modeling: What raw events you would store vs what aggregated tables you would maintain (e.g., minute-level rollups). How you choose dimensions (endpoint, region, status_code_family) and avoid cardinality explosions.
Ingestion & idempotency: How metrics are written (batch vs streaming), how you handle duplicates/out-of-order arrivals, and what keys/constraints you’d use.
Query patterns: How you would support common questions like p95 latency by endpoint over time, error rate by region, and “top regressions vs last week.” Mention SQL constructs you’d rely on (aggregations, window functions, CTEs).
Performance & maintenance: Partitioning strategy (typically by time), indexing, rollup jobs, retention/TTL, and how you’d handle late data and backfills.
Data quality & governance: Validation checks (missing minutes, impossible values), auditability, and how you’d make the system safe for on-call usage.

Scope Guidance (what the interviewer expects)

Speak concretely: propose 2–3 tables (raw + rollup), primary keys, and partitioning.
Discuss trade-offs (storage vs query latency; normalization vs denormalization; flexibility vs cost).
Mention operational concerns: schema changes, reprocessing, and how you keep dashboards stable during migrations.

Context

Core Question

Explain how you would approach building and maintaining databases for capturing performance metrics.

In your answer, cover:

Data modeling: What raw events you would store vs what aggregated tables you would maintain (e.g., minute-level rollups). How you choose dimensions (endpoint, region, status_code_family) and avoid cardinality explosions.
Ingestion & idempotency: How metrics are written (batch vs streaming), how you handle duplicates/out-of-order arrivals, and what keys/constraints you’d use.
Query patterns: How you would support common questions like p95 latency by endpoint over time, error rate by region, and “top regressions vs last week.” Mention SQL constructs you’d rely on (aggregations, window functions, CTEs).
Performance & maintenance: Partitioning strategy (typically by time), indexing, rollup jobs, retention/TTL, and how you’d handle late data and backfills.
Data quality & governance: Validation checks (missing minutes, impossible values), auditability, and how you’d make the system safe for on-call usage.

Scope Guidance (what the interviewer expects)

Speak concretely: propose 2–3 tables (raw + rollup), primary keys, and partitioning.
Discuss trade-offs (storage vs query latency; normalization vs denormalization; flexibility vs cost).
Mention operational concerns: schema changes, reprocessing, and how you keep dashboards stable during migrations.

Interview Guides

Context

Core Question

Scope Guidance (what the interviewer expects)

Designing Metric Databases for SLAs

Context

Core Question

Scope Guidance (what the interviewer expects)

Your Answer

Designing Metric Databases for SLAs

Context

Core Question

Scope Guidance (what the interviewer expects)

Designing Metric Databases for SLAs

Context

Core Question

Scope Guidance (what the interviewer expects)

Your Answer