

Financial datasets often contain missing values, inconsistent date formats, duplicate transactions, and invalid amounts. In analytics and reporting, poor data wrangling leads directly to incorrect revenue, expense, and cash flow numbers.
Explain how you would perform data wrangling on a messy financial dataset using SQL in PostgreSQL. Your answer should cover:
The interviewer is looking for a practical SQL-oriented explanation, not a full ETL architecture. Focus on common cleaning patterns, the order of operations, and how you would make the dataset reliable enough for financial analysis.
Financial data often includes NULLs, blank strings, negative values where they should not exist, or placeholder values like 0 for unknown amounts. In PostgreSQL, you typically normalize blanks to NULL, use COALESCE carefully, and apply validation rules so bad records are either corrected, flagged, or excluded.
SELECT
transaction_id,
NULLIF(TRIM(category), '') AS cleaned_category,
amount,
CASE
WHEN amount IS NULL THEN 'missing_amount'
WHEN amount < 0 AND transaction_type = 'deposit' THEN 'invalid_amount'
ELSE 'valid'
END AS quality_flag
FROM financial_transactions;
Messy datasets often represent the same concept in multiple ways, such as 'ACH', 'ach', and 'Bank Transfer'. Standardization means converting these variants into a consistent format for grouping, filtering, and reporting.
SELECT
UPPER(TRIM(currency_code)) AS currency_code,
CASE
WHEN LOWER(TRIM(payment_method)) IN ('ach', 'bank transfer') THEN 'BANK_TRANSFER'
WHEN LOWER(TRIM(payment_method)) IN ('card', 'credit card') THEN 'CARD'
ELSE 'OTHER'
END AS standardized_payment_method
FROM financial_transactions;
Financial systems may contain duplicate rows caused by reprocessing, retries, or upstream ingestion issues. A common approach is to group by business keys such as account, date, amount, and reference number to find records that appear more than once.
SELECT
account_id,
transaction_date,
amount,
reference_number,
COUNT(*) AS duplicate_count
FROM financial_transactions
GROUP BY account_id, transaction_date, amount, reference_number
HAVING COUNT(*) > 1;
Instead of deleting every problematic record, it is often better to create flags that classify issues such as missing category, invalid date, or outlier amount. This preserves auditability and lets downstream users decide whether to exclude or review those records.
SELECT
transaction_id,
CASE
WHEN posted_date IS NULL THEN 'missing_date'
WHEN amount = 0 THEN 'zero_amount'
WHEN category IS NULL THEN 'missing_category'
ELSE 'clean'
END AS data_quality_status
FROM financial_transactions;
After cleaning, the goal is to produce a reliable dataset with consistent fields and clear business rules. This usually means selecting validated columns, standardized categories, cleaned dates, and only approved records for aggregation.
SELECT
transaction_id,
account_id,
posted_date,
standardized_category,
amount
FROM cleaned_financial_transactions
WHERE data_quality_status = 'clean';