Clean and Deduplicate CSV Rows

Problem

At Stripe, a batch pipeline receives CSV rows as strings. You must parse, clean null-like values, and deduplicate records before downstream loading.

Formal Specification

Implement clean_and_dedup_csv(lines, key_columns, null_tokens).

Input:
- lines: list[str]: CSV content where lines[0] is the header (comma-separated, no quoted commas).
- key_columns: list[str]: column names that form a composite key.
- null_tokens: set[str]: tokens considered null (case-insensitive), e.g., { "", "null", "na" }.

Cleaning rules:
1. Trim whitespace around each field.
2. If a field (after trimming) matches any token in null_tokens, compared case-insensitively, replace it with the empty string "".
3. A row is invalid if any key column is "" after cleaning; invalid rows are dropped.
4. Deduplicate by composite key: keep the last valid occurrence.

Output:
- Return a list of dictionaries (one per surviving row) mapping header -> cleaned_value, sorted in ascending lexicographic order of the composite key (the key-column values taken in key_columns order).
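
The rules above can be sketched in Python as follows. This is one possible reading of the specification, not a reference solution. It assumes that header names are trimmed the same way as data fields, that the composite key is compared as a tuple of strings, and that every name in key_columns appears in the header.

```python
def clean_and_dedup_csv(lines, key_columns, null_tokens):
    # Lowercase the null tokens once so each field check is a set lookup.
    nulls = {token.lower() for token in null_tokens}

    # Assumption: header names are trimmed like data fields.
    header = [name.strip() for name in lines[0].split(",")]

    latest = {}  # composite key (tuple of str) -> cleaned row dict
    for line in lines[1:]:
        # Rule 1: trim whitespace around each field.
        fields = [field.strip() for field in line.split(",")]
        # Rule 2: null-like tokens become the empty string.
        cleaned = ["" if field.lower() in nulls else field for field in fields]
        row = dict(zip(header, cleaned))

        key = tuple(row[column] for column in key_columns)
        # Rule 3: drop the row if any key column is empty after cleaning.
        if "" in key:
            continue
        # Rule 4: a later valid row overwrites an earlier one with the same key.
        latest[key] = row

    # Tuples of strings compare lexicographically, element by element.
    return [latest[key] for key in sorted(latest)]
```

Each row is processed once and only the surviving keys are sorted, so the work is roughly proportional to the number of fields plus the cost of the final sort, which should be comfortable for the stated input sizes.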

Examples

lines=["id,email,age","1,a@x.com,","1, NULL ,20"], key_columns=["id"], null_tokens={"","null"}

Output: [{"id":"1","email":"","age":"20"}] (last row wins; NULL becomes empty)

lines=["id,email"," ,a@x.com","2,b@x.com"], key_columns=["id"], null_tokens={""}

Output: [{"id":"2","email":"b@x.com"}] (first row dropped due to empty key)
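
Both examples use a single key column. The short sketch below illustrates how a two-column composite key might behave, assuming the key is compared as a tuple of strings in key_columns order. The rows and column names (region, id, amount) are hypothetical and are not part of the problem statement.

```python
# Hypothetical rows that have already been cleaned.
rows = [
    {"region": "us", "id": "10", "amount": "5"},
    {"region": "eu", "id": "2", "amount": "7"},
    {"region": "us", "id": "2", "amount": "9"},
    {"region": "us", "id": "10", "amount": "6"},  # same key as the first row
]
key_columns = ["region", "id"]

latest = {}
for row in rows:
    key = tuple(row[column] for column in key_columns)
    latest[key] = row  # a later row replaces an earlier one with the same key

# Keys are tuples of strings, so "10" sorts before "2" (string comparison).
ordered = [latest[key] for key in sorted(latest)]
# ordered keys: ("eu", "2"), ("us", "10"), ("us", "2")
```

Because the cleaned values are strings, the ordering is textual rather than numeric. If numeric ordering were wanted, the key values would need to be converted first, which the specification does not ask for.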

Constraints

- 1 <= len(lines) <= 2 * 10^5
- 1 <= number of columns <= 50
- Each data row has exactly the same number of comma-separated fields as the header
- No quoted commas in fields
- 1 <= len(key_columns) <= number of columns

