At Stripe, a batch pipeline receives CSV rows as strings. You must parse, clean null-like values, and deduplicate records before downstream loading.
Implement clean_and_dedup_csv(lines, key_columns, null_tokens).
Parameters:
- lines: list[str]: CSV content where lines[0] is the header (comma-separated, no quoted commas).
- key_columns: list[str]: column names that form a composite key.
- null_tokens: set[str]: tokens considered null (case-insensitive), e.g., { "", "null", "na" }.

Rules:
- Trim leading and trailing whitespace from each field. If the trimmed value matches any of null_tokens case-insensitively, replace it with the empty string "".
- A row is invalid if any key column is "" after cleaning; invalid rows are dropped.
- If several valid rows share the same composite key, the last row wins.
- Return a list of dicts mapping header -> cleaned_value, sorted by composite key lexicographically.

Example 1
Input: lines = ["id,email,age", "1,a@x.com,", "1, NULL ,20"], key_columns = ["id"], null_tokens = {"", "null"}
Output: [{"id": "1", "email": "", "age": "20"}]
Why: Both rows have key id=1; after cleaning, the second row overwrites the first. "NULL" is treated as null and becomes "".

Example 2
Input: lines = ["id,email", " ,a@x.com", "2,b@x.com"], key_columns = ["id"], null_tokens = {""}
Output: [{"id": "2", "email": "b@x.com"}]
Why: The first row’s id becomes empty after trimming, so it is dropped. The second row is valid and returned.

Constraints:
- 1 <= len(lines) <= 2 * 10^5
- 1 <= number of columns <= 50
- 1 <= len(key_columns) <= number of columns
- Each data row has exactly the same number of comma-separated fields as the header
- No quoted commas in fields

Signature:
def clean_and_dedup_csv(lines: list[str], key_columns: list[str], null_tokens: set[str]) -> list[dict[str, str]]:
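
The following is a minimal sketch of one way the cleaning and deduplication could be implemented, not a reference solution. It rests on several assumptions the statement does not spell out: header names are trimmed like data fields, the cleaned value of a non-null field is its trimmed form, a later row replaces an earlier row with the same key entirely (no field-level merging), and rows whose field count differs from the header are skipped defensively even though the constraints rule them out.

```python
def clean_and_dedup_csv(
    lines: list[str], key_columns: list[str], null_tokens: set[str]
) -> list[dict[str, str]]:
    if not lines:
        return []

    # Assumption: header names are trimmed the same way as data fields.
    header = [name.strip() for name in lines[0].split(",")]

    # Normalize null tokens once so each field needs a single set lookup.
    nulls = {token.strip().lower() for token in null_tokens}

    # Maps composite key -> most recent cleaned row (last row wins).
    latest: dict[tuple[str, ...], dict[str, str]] = {}

    for line in lines[1:]:
        fields = line.split(",")
        if len(fields) != len(header):
            # Assumption: malformed rows are skipped rather than raising.
            continue

        row: dict[str, str] = {}
        for name, raw in zip(header, fields):
            value = raw.strip()
            row[name] = "" if value.lower() in nulls else value

        key = tuple(row[col] for col in key_columns)
        if any(part == "" for part in key):
            # A key column is empty after cleaning, so the row is invalid.
            continue

        latest[key] = row

    # Tuples of strings compare element by element, which gives the
    # lexicographic order on the composite key.
    return [latest[key] for key in sorted(latest)]
```

Each line is split and cleaned once, so the work is linear in the total input size plus a sort over the distinct keys. Note that keys are compared as strings, so an id of "10" sorts before "2".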