Cluster Tickets Using Synonym DSU

Medium

Coding

ArraysStrings

Problem

You’re on the trust & safety engineering team at a large e-commerce marketplace processing millions of customer support tickets per day. To route tickets to the right specialist queues (refunds, shipping issues, account access), you want to automatically cluster tickets that describe the same issue, even when customers use different wording.

Each ticket is a short text string. You’re also given a list of synonym pairs (e.g., ("ship", "send")) that should be treated as equivalent, transitively (if ship ~ send and send ~ mail, then ship ~ mail).

Two tickets are considered similar if, after synonym normalization and tokenization, they share at least threshold distinct canonical tokens. Tickets should be clustered by connected components: if A is similar to B and B is similar to C, then A, B, C must end up in the same cluster even if A is not directly similar to C.

Tokenization rules

Lowercase everything.
Extract tokens as contiguous [a-z]+ sequences (punctuation and numbers are separators).
Within a ticket, duplicate tokens count once (use a set).

Task

Implement group_similar_tasks(tasks, synonyms, threshold) returning clusters of ticket indices.

Output format

Return a list of clusters (each cluster is a list of indices). Sort indices within each cluster ascending, and sort clusters by their smallest index.

Examples

Example 1

Input: tasks = ["Refund for duplicate charge", "Money back for double billing", "Track my shipment", "Where is my package?"], synonyms = [["refund","money"],["back","refund"],["duplicate","double"],["charge","billing"],["shipment","package"],["track","where"]], threshold = 2
Output: [[0, 1], [2, 3]]
Explanation: After canonicalization, tickets 0 and 1 share at least two tokens (refund/money/back and duplicate/double and charge/billing). Tickets 2 and 3 share at least two tokens (track/where and shipment/package).

Example 2

Input: tasks = ["ship item", "send package", "mail parcel", "refund item"], synonyms = [["ship","send"],["package","parcel"],["send","mail"]], threshold = 2
Output: [[0], [1, 2], [3]]
Explanation: Tickets 1 and 2 share two canonical tokens (ship/send/mail and package/parcel) so they cluster. Ticket 0 shares only one token with them (ship/send/mail), so it stays separate when threshold = 2.

Constraints

1 <= tasks.length <= 10^4
0 <= synonyms.length <= 5 * 10^4
1 <= threshold <= 20
Each task length <= 200
Synonym words are lowercase a-z, length 1..30

Examples

Example 1

Input

tasks = ["Refund for duplicate charge", "Money back for double billing", "Track my shipment", "Where is my package?"], synonyms = [["refund","money"],["back","refund"],["duplicate","double"],["charge","billing"],["shipment","package"],["track","where"]], threshold = 2

Output[[0, 1], [2, 3]]WhySynonym DSU collapses related words; tickets 0&1 share 2+ canonical tokens, and tickets 2&3 share 2+ canonical tokens.

Example 2

Input

tasks = ["ship item", "send package", "mail parcel", "refund item"], synonyms = [["ship","send"],["package","parcel"],["send","mail"]], threshold = 2

Output[[0], [1, 2], [3]]WhyTickets 1 and 2 share both canonical tokens (ship/send/mail and package/parcel). Ticket 0 shares only one token with them, so it does not join when threshold=2.

Constraints

1 <= tasks.length <= 10^4
0 <= synonyms.length <= 5 * 10^4
1 <= threshold <= 20
Each task description length <= 200
Synonym words contain only lowercase English letters a-z and have length 1..30
Tokenization uses only contiguous a-z sequences; all other characters are separators

Function Signature

def group_similar_tasks(tasks: list[str], synonyms: list[list[str]], threshold: int) -> list[list[int]]:

Problem

Tokenization rules

Lowercase everything.
Extract tokens as contiguous [a-z]+ sequences (punctuation and numbers are separators).
Within a ticket, duplicate tokens count once (use a set).

Task

Implement group_similar_tasks(tasks, synonyms, threshold) returning clusters of ticket indices.

Output format

Return a list of clusters (each cluster is a list of indices). Sort indices within each cluster ascending, and sort clusters by their smallest index.

Examples

Example 1

Input: tasks = ["Refund for duplicate charge", "Money back for double billing", "Track my shipment", "Where is my package?"], synonyms = [["refund","money"],["back","refund"],["duplicate","double"],["charge","billing"],["shipment","package"],["track","where"]], threshold = 2
Output: [[0, 1], [2, 3]]
Explanation: After canonicalization, tickets 0 and 1 share at least two tokens (refund/money/back and duplicate/double and charge/billing). Tickets 2 and 3 share at least two tokens (track/where and shipment/package).

Example 2

Input: tasks = ["ship item", "send package", "mail parcel", "refund item"], synonyms = [["ship","send"],["package","parcel"],["send","mail"]], threshold = 2
Output: [[0], [1, 2], [3]]
Explanation: Tickets 1 and 2 share two canonical tokens (ship/send/mail and package/parcel) so they cluster. Ticket 0 shares only one token with them (ship/send/mail), so it stays separate when threshold = 2.

Constraints

1 <= tasks.length <= 10^4
0 <= synonyms.length <= 5 * 10^4
1 <= threshold <= 20
Each task length <= 200
Synonym words are lowercase a-z, length 1..30

Examples

Example 1

Input

tasks = ["Refund for duplicate charge", "Money back for double billing", "Track my shipment", "Where is my package?"], synonyms = [["refund","money"],["back","refund"],["duplicate","double"],["charge","billing"],["shipment","package"],["track","where"]], threshold = 2

Output[[0, 1], [2, 3]]WhySynonym DSU collapses related words; tickets 0&1 share 2+ canonical tokens, and tickets 2&3 share 2+ canonical tokens.

Example 2

Input

tasks = ["ship item", "send package", "mail parcel", "refund item"], synonyms = [["ship","send"],["package","parcel"],["send","mail"]], threshold = 2

Output[[0], [1, 2], [3]]WhyTickets 1 and 2 share both canonical tokens (ship/send/mail and package/parcel). Ticket 0 shares only one token with them, so it does not join when threshold=2.

Constraints

1 <= tasks.length <= 10^4
0 <= synonyms.length <= 5 * 10^4
1 <= threshold <= 20
Each task description length <= 200
Synonym words contain only lowercase English letters a-z and have length 1..30
Tokenization uses only contiguous a-z sequences; all other characters are separators

Function Signature

def group_similar_tasks(tasks: list[str], synonyms: list[list[str]], threshold: int) -> list[list[int]]:

Practice Python

Python 3.10

Open on desktop for the full Python editor with syntax highlighting and autocomplete.

Up next

Cluster Support Tickets for AnnotationMedium

Classify Support Tickets with TF-IDFMedium ASummarize Customer Support OutcomesMedium

Next question

Problem

Tokenization rules

Lowercase everything.
Extract tokens as contiguous [a-z]+ sequences (punctuation and numbers are separators).
Within a ticket, duplicate tokens count once (use a set).

Task

Implement group_similar_tasks(tasks, synonyms, threshold) returning clusters of ticket indices.

Output format

Return a list of clusters (each cluster is a list of indices). Sort indices within each cluster ascending, and sort clusters by their smallest index.

Examples

Example 1

Input: tasks = ["Refund for duplicate charge", "Money back for double billing", "Track my shipment", "Where is my package?"], synonyms = [["refund","money"],["back","refund"],["duplicate","double"],["charge","billing"],["shipment","package"],["track","where"]], threshold = 2
Output: [[0, 1], [2, 3]]
Explanation: After canonicalization, tickets 0 and 1 share at least two tokens (refund/money/back and duplicate/double and charge/billing). Tickets 2 and 3 share at least two tokens (track/where and shipment/package).

Example 2

Input: tasks = ["ship item", "send package", "mail parcel", "refund item"], synonyms = [["ship","send"],["package","parcel"],["send","mail"]], threshold = 2
Output: [[0], [1, 2], [3]]
Explanation: Tickets 1 and 2 share two canonical tokens (ship/send/mail and package/parcel) so they cluster. Ticket 0 shares only one token with them (ship/send/mail), so it stays separate when threshold = 2.

Constraints

1 <= tasks.length <= 10^4
0 <= synonyms.length <= 5 * 10^4
1 <= threshold <= 20
Each task length <= 200
Synonym words are lowercase a-z, length 1..30