Cluster Tickets Using Synonym DSU

Problem

You’re on the trust & safety engineering team at a large e-commerce marketplace processing millions of customer support tickets per day. To route tickets to the right specialist queues (refunds, shipping issues, account access), you want to automatically cluster tickets that describe the same issue, even when customers use different wording.

Each ticket is a short text string. You’re also given a list of synonym pairs (e.g., ("ship", "send")) that should be treated as equivalent, transitively (if ship ~ send and send ~ mail, then ship ~ mail).

Two tickets are considered similar if, after synonym normalization and tokenization, they share at least threshold distinct canonical tokens. Tickets should be clustered by connected components: if A is similar to B and B is similar to C, then A, B, C must end up in the same cluster even if A is not directly similar to C.

Tokenization rules

Lowercase everything.
Extract tokens as contiguous [a-z]+ sequences (punctuation and numbers are separators).
Within a ticket, duplicate tokens count once (use a set).

Task

Implement group_similar_tasks(tasks, synonyms, threshold) returning clusters of ticket indices.

Output format

Return a list of clusters (each cluster is a list of indices). Sort indices within each cluster ascending, and sort clusters by their smallest index.

Examples

Example 1

Input: tasks = ["Refund for duplicate charge", "Money back for double billing", "Track my shipment", "Where is my package?"], synonyms = [["refund","money"],["back","refund"],["duplicate","double"],["charge","billing"],["shipment","package"],["track","where"]], threshold = 2
Output: [[0, 1], [2, 3]]
Explanation: After canonicalization, tickets 0 and 1 share at least two tokens (refund/money/back and duplicate/double and charge/billing). Tickets 2 and 3 share at least two tokens (track/where and shipment/package).

Example 2

Input: tasks = ["ship item", "send package", "mail parcel", "refund item"], synonyms = [["ship","send"],["package","parcel"],["send","mail"]], threshold = 2
Output: [[0], [1, 2], [3]]
Explanation: Tickets 1 and 2 share two canonical tokens (ship/send/mail and package/parcel) so they cluster. Ticket 0 shares only one token with them (ship/send/mail), so it stays separate when threshold = 2.

Constraints

1 <= tasks.length <= 10^4
0 <= synonyms.length <= 5 * 10^4
1 <= threshold <= 20
Each task length <= 200
Synonym words are lowercase a-z, length 1..30

Problem

Tokenization rules

Lowercase everything.
Extract tokens as contiguous [a-z]+ sequences (punctuation and numbers are separators).
Within a ticket, duplicate tokens count once (use a set).

Task

Implement group_similar_tasks(tasks, synonyms, threshold) returning clusters of ticket indices.

Output format

Return a list of clusters (each cluster is a list of indices). Sort indices within each cluster ascending, and sort clusters by their smallest index.

Examples

Example 1

Input: tasks = ["Refund for duplicate charge", "Money back for double billing", "Track my shipment", "Where is my package?"], synonyms = [["refund","money"],["back","refund"],["duplicate","double"],["charge","billing"],["shipment","package"],["track","where"]], threshold = 2
Output: [[0, 1], [2, 3]]
Explanation: After canonicalization, tickets 0 and 1 share at least two tokens (refund/money/back and duplicate/double and charge/billing). Tickets 2 and 3 share at least two tokens (track/where and shipment/package).

Example 2

Input: tasks = ["ship item", "send package", "mail parcel", "refund item"], synonyms = [["ship","send"],["package","parcel"],["send","mail"]], threshold = 2
Output: [[0], [1, 2], [3]]
Explanation: Tickets 1 and 2 share two canonical tokens (ship/send/mail and package/parcel) so they cluster. Ticket 0 shares only one token with them (ship/send/mail), so it stays separate when threshold = 2.

Constraints

1 <= tasks.length <= 10^4
0 <= synonyms.length <= 5 * 10^4
1 <= threshold <= 20
Each task length <= 200
Synonym words are lowercase a-z, length 1..30

Problem

Tokenization rules

Lowercase everything.
Extract tokens as contiguous [a-z]+ sequences (punctuation and numbers are separators).
Within a ticket, duplicate tokens count once (use a set).

Task

Implement group_similar_tasks(tasks, synonyms, threshold) returning clusters of ticket indices.

Output format

Return a list of clusters (each cluster is a list of indices). Sort indices within each cluster ascending, and sort clusters by their smallest index.

Examples

Example 1

Input: tasks = ["Refund for duplicate charge", "Money back for double billing", "Track my shipment", "Where is my package?"], synonyms = [["refund","money"],["back","refund"],["duplicate","double"],["charge","billing"],["shipment","package"],["track","where"]], threshold = 2
Output: [[0, 1], [2, 3]]
Explanation: After canonicalization, tickets 0 and 1 share at least two tokens (refund/money/back and duplicate/double and charge/billing). Tickets 2 and 3 share at least two tokens (track/where and shipment/package).

Example 2

Input: tasks = ["ship item", "send package", "mail parcel", "refund item"], synonyms = [["ship","send"],["package","parcel"],["send","mail"]], threshold = 2
Output: [[0], [1, 2], [3]]
Explanation: Tickets 1 and 2 share two canonical tokens (ship/send/mail and package/parcel) so they cluster. Ticket 0 shares only one token with them (ship/send/mail), so it stays separate when threshold = 2.

Constraints

1 <= tasks.length <= 10^4
0 <= synonyms.length <= 5 * 10^4
1 <= threshold <= 20
Each task length <= 200
Synonym words are lowercase a-z, length 1..30

Problem

Tokenization rules

Lowercase everything.
Extract tokens as contiguous [a-z]+ sequences (punctuation and numbers are separators).
Within a ticket, duplicate tokens count once (use a set).

Task

Implement group_similar_tasks(tasks, synonyms, threshold) returning clusters of ticket indices.

Output format

Return a list of clusters (each cluster is a list of indices). Sort indices within each cluster ascending, and sort clusters by their smallest index.

Examples

Example 1

Input: tasks = ["Refund for duplicate charge", "Money back for double billing", "Track my shipment", "Where is my package?"], synonyms = [["refund","money"],["back","refund"],["duplicate","double"],["charge","billing"],["shipment","package"],["track","where"]], threshold = 2
Output: [[0, 1], [2, 3]]
Explanation: After canonicalization, tickets 0 and 1 share at least two tokens (refund/money/back and duplicate/double and charge/billing). Tickets 2 and 3 share at least two tokens (track/where and shipment/package).

Example 2

Input: tasks = ["ship item", "send package", "mail parcel", "refund item"], synonyms = [["ship","send"],["package","parcel"],["send","mail"]], threshold = 2
Output: [[0], [1, 2], [3]]
Explanation: Tickets 1 and 2 share two canonical tokens (ship/send/mail and package/parcel) so they cluster. Ticket 0 shares only one token with them (ship/send/mail), so it stays separate when threshold = 2.

Constraints

1 <= tasks.length <= 10^4
0 <= synonyms.length <= 5 * 10^4
1 <= threshold <= 20
Each task length <= 200
Synonym words are lowercase a-z, length 1..30

Interview Guides

Problem

Tokenization rules

Task

Output format

Examples

Constraints

Cluster Tickets Using Synonym DSU

Problem

Tokenization rules

Task

Output format

Examples

Constraints

Cluster Tickets Using Synonym DSU

Problem

Tokenization rules

Task

Output format

Examples

Constraints

Cluster Tickets Using Synonym DSU

Problem

Tokenization rules

Task

Output format

Examples

Constraints