You’re on the trust & safety engineering team at a large e-commerce marketplace processing millions of customer support tickets per day. To route tickets to the right specialist queues (refunds, shipping issues, account access), you want to automatically cluster tickets that describe the same issue, even when customers use different wording.
Each ticket is a short text string. You’re also given a list of synonym pairs (e.g., ("ship", "send")) that should be treated as equivalent, transitively (if ship ~ send and send ~ mail, then ship ~ mail).
Two tickets are considered similar if, after synonym normalization and tokenization, they share at least threshold distinct canonical tokens. Tickets should be clustered by connected components: if A is similar to B and B is similar to C, then A, B, C must end up in the same cluster even if A is not directly similar to C.
[a-z]+ sequences (punctuation and numbers are separators).Implement group_similar_tasks(tasks, synonyms, threshold) returning clusters of ticket indices.
Return a list of clusters (each cluster is a list of indices). Sort indices within each cluster ascending, and sort clusters by their smallest index.
Example 1
tasks = ["Refund for duplicate charge", "Money back for double billing", "Track my shipment", "Where is my package?"], synonyms = [["refund","money"],["back","refund"],["duplicate","double"],["charge","billing"],["shipment","package"],["track","where"]], threshold = 2[[0, 1], [2, 3]]Example 2
tasks = ["ship item", "send package", "mail parcel", "refund item"], synonyms = [["ship","send"],["package","parcel"],["send","mail"]], threshold = 2[[0], [1, 2], [3]]threshold = 2.1 <= tasks.length <= 10^40 <= synonyms.length <= 5 * 10^41 <= threshold <= 20<= 200a-z, length 1..30