Wrangling Messy Financial Data | Dataford Interview Questions

Problem

Context

Financial datasets often contain missing values, inconsistent date formats, duplicate transactions, and invalid amounts. In analytics and reporting, poor data wrangling leads directly to incorrect revenue, expense, and cash flow numbers.

Question

Explain how you would perform data wrangling on a messy financial dataset using SQL in PostgreSQL. Your answer should cover:

How to identify and handle missing or invalid values
How to standardize dates, categories, and numeric fields
How to detect duplicates and suspicious records
How to prepare the cleaned data for downstream reporting or aggregation

Scope Guidance

The interviewer is looking for a practical SQL-oriented explanation, not a full ETL architecture. Focus on common cleaning patterns, the order of operations, and how you would make the dataset reliable enough for financial analysis.

Missing and Invalid Value Handling

Financial data often includes NULLs, blank strings, negative values where they should not exist, or placeholder values like 0 for unknown amounts. In PostgreSQL, you typically normalize blanks to NULL, use COALESCE carefully, and apply validation rules so bad records are either corrected, flagged, or excluded.

SELECT
  transaction_id,
  NULLIF(TRIM(category), '') AS cleaned_category,
  amount,
  CASE
    WHEN amount IS NULL THEN 'missing_amount'
    WHEN amount < 0 AND transaction_type = 'deposit' THEN 'invalid_amount'
    ELSE 'valid'
  END AS quality_flag
FROM financial_transactions;

Standardization of Formats

Messy datasets often represent the same concept in multiple ways, such as 'ACH', 'ach', and 'Bank Transfer'. Standardization means converting these variants into a consistent format for grouping, filtering, and reporting.

SELECT
  UPPER(TRIM(currency_code)) AS currency_code,
  CASE
    WHEN LOWER(TRIM(payment_method)) IN ('ach', 'bank transfer') THEN 'BANK_TRANSFER'
    WHEN LOWER(TRIM(payment_method)) IN ('card', 'credit card') THEN 'CARD'
    ELSE 'OTHER'
  END AS standardized_payment_method
FROM financial_transactions;

Duplicate Detection

Financial systems may contain duplicate rows caused by reprocessing, retries, or upstream ingestion issues. A common approach is to group by business keys such as account, date, amount, and reference number to find records that appear more than once.

SELECT
  account_id,
  transaction_date,
  amount,
  reference_number,
  COUNT(*) AS duplicate_count
FROM financial_transactions
GROUP BY account_id, transaction_date, amount, reference_number
HAVING COUNT(*) > 1;

Data Quality Flags

Instead of deleting every problematic record, it is often better to create flags that classify issues such as missing category, invalid date, or outlier amount. This preserves auditability and lets downstream users decide whether to exclude or review those records.

SELECT
  transaction_id,
  CASE
    WHEN posted_date IS NULL THEN 'missing_date'
    WHEN amount = 0 THEN 'zero_amount'
    WHEN category IS NULL THEN 'missing_category'
    ELSE 'clean'
  END AS data_quality_status
FROM financial_transactions;

Preparing Clean Reporting Data

After cleaning, the goal is to produce a reliable dataset with consistent fields and clear business rules. This usually means selecting validated columns, standardized categories, cleaned dates, and only approved records for aggregation.

SELECT
  transaction_id,
  account_id,
  posted_date,
  standardized_category,
  amount
FROM cleaned_financial_transactions
WHERE data_quality_status = 'clean';