Clean and Deduplicate CSV Rows | Dataford Interview Questions - Dataford - Ace your Interview

Clean and Deduplicate CSV Rows

Medium

Coding

Data Wrangling

Problem

At Stripe, a batch pipeline receives CSV rows as strings. You must parse, clean null-like values, and deduplicate records before downstream loading.

Formal Specification

Implement clean_and_dedup_csv(lines, key_columns, null_tokens).

Input:
- lines: list[str]: CSV content where lines[0] is the header (comma-separated, no quoted commas).
- key_columns: list[str]: column names that form a composite key.
- null_tokens: set[str]: tokens considered null (case-insensitive), e.g., { "", "null", "na" }.
Cleaning rules:
1. Trim whitespace around each field.
2. If a field (after trimming) matches any null_tokens case-insensitively, replace it with the empty string "".
3. A row is invalid if any key column is "" after cleaning; invalid rows are dropped.
4. Deduplicate by composite key: keep the last valid occurrence.
Output:
- Return a list of dictionaries (one per surviving row) mapping each header name to its cleaned value, sorted lexicographically by composite key (the key-column values compared as strings, in key_columns order).
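
Cleaning rules 1 to 3 can be sketched as two small helpers. The names clean_field and is_valid_row are hypothetical and not part of the problem statement. The sketch also assumes that the entries of null_tokens should be lowercased before comparison, which is one reading of "case-insensitive".

```python
def clean_field(raw: str, null_tokens: set[str]) -> str:
    """Rules 1 and 2: trim the field, then map null-like tokens to ""."""
    # Assumption: lowercase the tokens as well, so {"NULL"} and {"null"} behave alike.
    lowered_tokens = {token.lower() for token in null_tokens}
    value = raw.strip()
    return "" if value.lower() in lowered_tokens else value


def is_valid_row(row: dict[str, str], key_columns: list[str]) -> bool:
    """Rule 3: every key column must be non-empty after cleaning."""
    return all(row[column] != "" for column in key_columns)


tokens = {"", "null", "na"}
cleaned = [clean_field(field, tokens) for field in "1, NULL ,20".split(",")]
print(cleaned)  # ['1', '', '20']
```

In a full solution the lowercased token set would be built once rather than per field; it is rebuilt here only to keep the helper self-contained.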

Examples

lines=["id,email,age","1,a@x.com,","1, NULL ,20"], key_columns=["id"], null_tokens={"","null"}

Output: [{"id":"1","email":"","age":"20"}] (last row wins; NULL becomes empty)

lines=["id,email"," ,a@x.com","2,b@x.com"], key_columns=["id"], null_tokens={""}

Output: [{"id":"2","email":"b@x.com"}] (first row dropped due to empty key)

Constraints

1 <= len(lines) <= 2 * 10^5
1 <= number of columns <= 50
Each data row has exactly the same number of comma-separated fields as the header
No quoted commas in fields
1 <= len(key_columns) <= number of columns
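
Because fields never contain quoted commas, splitting each line on "," should be enough to tokenize it under these constraints. The short check below illustrates that Python's str.split keeps a trailing empty field, so a row such as "1,a@x.com," still lines up with a three-column header.

```python
header = "id,email,age".split(",")
row = "1,a@x.com,".split(",")

print(row)  # ['1', 'a@x.com', '']

# The trailing comma yields an empty third field, matching the header width.
assert len(row) == len(header)
```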

Examples

Example 1

Input: lines = ["id,email,age", "1,a@x.com,", "1, NULL ,20"], key_columns = ["id"], null_tokens = {"", "null"}

Output: [{"id": "1", "email": "", "age": "20"}]

Why: Both rows have key id=1; after cleaning, the second row overwrites the first. "NULL" is treated as null and becomes "".

Example 2

Input: lines = ["id,email", " ,a@x.com", "2,b@x.com"], key_columns = ["id"], null_tokens = {""}

Output: [{"id": "2", "email": "b@x.com"}]

Why: The first row's id becomes empty after trimming, so it is dropped. The second row is valid and returned.

Function Signature

def clean_and_dedup_csv(lines: list[str], key_columns: list[str], null_tokens: set[str]) -> list[dict[str, str]]:
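
One possible implementation of this signature is sketched below. It is not an official reference solution. It assumes that header names are trimmed like data fields, that null_tokens entries are lowercased before matching, and that "lexicographically" means comparing the tuple of key values as strings.

```python
def clean_and_dedup_csv(
    lines: list[str], key_columns: list[str], null_tokens: set[str]
) -> list[dict[str, str]]:
    # Assumption: header names are trimmed the same way as data fields.
    header = [name.strip() for name in lines[0].split(",")]
    # Assumption: tokens are lowercased once so matching is case-insensitive.
    lowered_tokens = {token.lower() for token in null_tokens}

    latest: dict[tuple[str, ...], dict[str, str]] = {}
    for line in lines[1:]:
        # Rule 1: trim each field. Rule 2: map null-like tokens to "".
        fields = [field.strip() for field in line.split(",")]
        cleaned = ["" if f.lower() in lowered_tokens else f for f in fields]
        row = dict(zip(header, cleaned))

        # Rule 3: drop the row if any key column is empty after cleaning.
        key = tuple(row[column] for column in key_columns)
        if "" in key:
            continue

        # Rule 4: a later valid row with the same key overwrites the earlier one.
        latest[key] = row

    # Tuples of strings compare element by element, giving the lexicographic order.
    return [latest[key] for key in sorted(latest)]


result = clean_and_dedup_csv(
    ["id,email,age", "1,a@x.com,", "1, NULL ,20"], ["id"], {"", "null"}
)
print(result)  # [{'id': '1', 'email': '', 'age': '20'}]
```

Under these assumptions the pass over the rows is linear in the total number of fields, and the final sort costs O(k log k) for k distinct keys. Since keys are compared as strings, an id of "10" sorts before "2".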
