At Stripe, a batch pipeline receives CSV rows as strings. You must parse the rows, replace null-like values, and deduplicate records before downstream loading.
Implement clean_and_dedup_csv(lines, key_columns, null_tokens).
Parameters:
- lines: list[str]: CSV content where lines[0] is the header (comma-separated, no quoted commas).
- key_columns: list[str]: column names that form a composite key.
- null_tokens: set[str]: tokens considered null (case-insensitive), e.g., { "", "null", "na" }.

Rules:
- Trim surrounding whitespace from each value. If the trimmed value matches any of null_tokens case-insensitively, replace it with the empty string "".
- A row is valid only if none of its key columns is "" after cleaning; invalid rows are dropped.
- If several valid rows share the same composite key, the last one wins.
- Return a list of dicts mapping header -> cleaned_value, sorted by composite key lexicographically.

Example 1:
Input: lines=["id,email,age","1,a@x.com,","1, NULL ,20"], key_columns=["id"], null_tokens={"","null"}
Output: [{"id":"1","email":"","age":"20"}] (last row wins; NULL becomes empty)

Example 2:
Input: lines=["id,email"," ,a@x.com","2,b@x.com"], key_columns=["id"], null_tokens={""}
Output: [{"id":"2","email":"b@x.com"}] (first row dropped due to empty key)

Constraints:
- 1 <= len(lines) <= 2 * 10^5
- 1 <= number of columns <= 50
- 1 <= len(key_columns) <= number of columns
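
One way to read the requirements above is sketched below in Python. It is a minimal sketch, not a reference solution, and it rests on a few assumptions the statement does not spell out: values are stripped of surrounding whitespace before the null-token comparison (the examples suggest this), non-null values are returned in their stripped form, rows whose field count differs from the header are skipped, and "sorted by composite key lexicographically" means ordering by the tuple of key-column strings.

```python
def clean_and_dedup_csv(
    lines: list[str],
    key_columns: list[str],
    null_tokens: set[str],
) -> list[dict[str, str]]:
    # The header is a plain comma split; the statement rules out quoted commas.
    header = lines[0].split(",")
    # Lowercase the tokens once so each comparison is case-insensitive.
    lowered_nulls = {token.lower() for token in null_tokens}

    # Maps composite key -> most recent cleaned row, so later rows overwrite earlier ones.
    latest: dict[tuple[str, ...], dict[str, str]] = {}

    for line in lines[1:]:
        fields = line.split(",")
        if len(fields) != len(header):
            # Assumption: malformed rows are skipped; the statement does not cover them.
            continue

        row: dict[str, str] = {}
        for name, raw in zip(header, fields):
            value = raw.strip()  # Assumption: whitespace is trimmed before matching.
            row[name] = "" if value.lower() in lowered_nulls else value

        key = tuple(row[column] for column in key_columns)
        if "" in key:
            # An empty key component makes the row invalid.
            continue
        latest[key] = row  # Last row wins.

    # Tuples of strings compare element by element, i.e. lexicographically.
    return [latest[key] for key in sorted(latest)]
```

Applied to the two examples from the statement, this sketch returns [{"id": "1", "email": "", "age": "20"}] and [{"id": "2", "email": "b@x.com"}] respectively. Each line is processed once and the final sort dominates, so the running time is roughly O(n * c + u log u) for n rows, c columns, and u unique keys.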