Dataford
Interview Guides
Upgrade
All questions/Coding/Cluster Tickets Using Synonym DSU

Cluster Tickets Using Synonym DSU

Medium
Coding
ArraysStrings

Problem

Problem

You’re on the trust & safety engineering team at a large e-commerce marketplace processing millions of customer support tickets per day. To route tickets to the right specialist queues (refunds, shipping issues, account access), you want to automatically cluster tickets that describe the same issue, even when customers use different wording.

Each ticket is a short text string. You’re also given a list of synonym pairs (e.g., ("ship", "send")) that should be treated as equivalent, transitively (if ship ~ send and send ~ mail, then ship ~ mail).

Two tickets are considered similar if, after synonym normalization and tokenization, they share at least threshold distinct canonical tokens. Tickets should be clustered by connected components: if A is similar to B and B is similar to C, then A, B, C must end up in the same cluster even if A is not directly similar to C.

Tokenization rules

  1. Lowercase everything.
  2. Extract tokens as contiguous [a-z]+ sequences (punctuation and numbers are separators).
  3. Within a ticket, duplicate tokens count once (use a set).

Task

Implement group_similar_tasks(tasks, synonyms, threshold) returning clusters of ticket indices.

Output format

Return a list of clusters (each cluster is a list of indices). Sort indices within each cluster ascending, and sort clusters by their smallest index.

Examples

Example 1

  • Input: tasks = ["Refund for duplicate charge", "Money back for double billing", "Track my shipment", "Where is my package?"], synonyms = [["refund","money"],["back","refund"],["duplicate","double"],["charge","billing"],["shipment","package"],["track","where"]], threshold = 2
  • Output: [[0, 1], [2, 3]]
  • Explanation: After canonicalization, tickets 0 and 1 share at least two tokens (refund/money/back and duplicate/double and charge/billing). Tickets 2 and 3 share at least two tokens (track/where and shipment/package).

Example 2

  • Input: tasks = ["ship item", "send package", "mail parcel", "refund item"], synonyms = [["ship","send"],["package","parcel"],["send","mail"]], threshold = 2
  • Output: [[0], [1, 2], [3]]
  • Explanation: Tickets 1 and 2 share two canonical tokens (ship/send/mail and package/parcel) so they cluster. Ticket 0 shares only one token with them (ship/send/mail), so it stays separate when threshold = 2.

Constraints

  • 1 <= tasks.length <= 10^4
  • 0 <= synonyms.length <= 5 * 10^4
  • 1 <= threshold <= 20
  • Each task length <= 200
  • Synonym words are lowercase a-z, length 1..30

Examples

Example 1
Inputtasks = ["Refund for duplicate charge", "Money back for double billing", "Track my shipment", "Where is my package?"], synonyms = [["refund","money"],["back","refund"],["duplicate","double"],["charge","billing"],["shipment","package"],["track","where"]], threshold = 2Output[[0, 1], [2, 3]]WhySynonym DSU collapses related words; tickets 0&1 share 2+ canonical tokens, and tickets 2&3 share 2+ canonical tokens.
Example 2
Inputtasks = ["ship item", "send package", "mail parcel", "refund item"], synonyms = [["ship","send"],["package","parcel"],["send","mail"]], threshold = 2Output[[0], [1, 2], [3]]WhyTickets 1 and 2 share both canonical tokens (ship/send/mail and package/parcel). Ticket 0 shares only one token with them, so it does not join when threshold=2.

Constraints

  • 1 <= tasks.length <= 10^4
  • 0 <= synonyms.length <= 5 * 10^4
  • 1 <= threshold <= 20
  • Each task description length <= 200
  • Synonym words contain only lowercase English letters a-z and have length 1..30
  • Tokenization uses only contiguous a-z sequences; all other characters are separators

Function Signature

def group_similar_tasks(tasks: list[str], synonyms: list[list[str]], threshold: int) -> list[list[int]]:

Problem

Problem

You’re on the trust & safety engineering team at a large e-commerce marketplace processing millions of customer support tickets per day. To route tickets to the right specialist queues (refunds, shipping issues, account access), you want to automatically cluster tickets that describe the same issue, even when customers use different wording.

Each ticket is a short text string. You’re also given a list of synonym pairs (e.g., ("ship", "send")) that should be treated as equivalent, transitively (if ship ~ send and send ~ mail, then ship ~ mail).

Two tickets are considered similar if, after synonym normalization and tokenization, they share at least threshold distinct canonical tokens. Tickets should be clustered by connected components: if A is similar to B and B is similar to C, then A, B, C must end up in the same cluster even if A is not directly similar to C.

Tokenization rules

  1. Lowercase everything.
  2. Extract tokens as contiguous [a-z]+ sequences (punctuation and numbers are separators).
  3. Within a ticket, duplicate tokens count once (use a set).

Task

Implement group_similar_tasks(tasks, synonyms, threshold) returning clusters of ticket indices.

Output format

Return a list of clusters (each cluster is a list of indices). Sort indices within each cluster ascending, and sort clusters by their smallest index.

Examples

Example 1

  • Input: tasks = ["Refund for duplicate charge", "Money back for double billing", "Track my shipment", "Where is my package?"], synonyms = [["refund","money"],["back","refund"],["duplicate","double"],["charge","billing"],["shipment","package"],["track","where"]], threshold = 2
  • Output: [[0, 1], [2, 3]]
  • Explanation: After canonicalization, tickets 0 and 1 share at least two tokens (refund/money/back and duplicate/double and charge/billing). Tickets 2 and 3 share at least two tokens (track/where and shipment/package).

Example 2

  • Input: tasks = ["ship item", "send package", "mail parcel", "refund item"], synonyms = [["ship","send"],["package","parcel"],["send","mail"]], threshold = 2
  • Output: [[0], [1, 2], [3]]
  • Explanation: Tickets 1 and 2 share two canonical tokens (ship/send/mail and package/parcel) so they cluster. Ticket 0 shares only one token with them (ship/send/mail), so it stays separate when threshold = 2.

Constraints

  • 1 <= tasks.length <= 10^4
  • 0 <= synonyms.length <= 5 * 10^4
  • 1 <= threshold <= 20
  • Each task length <= 200
  • Synonym words are lowercase a-z, length 1..30

Examples

Example 1
Inputtasks = ["Refund for duplicate charge", "Money back for double billing", "Track my shipment", "Where is my package?"], synonyms = [["refund","money"],["back","refund"],["duplicate","double"],["charge","billing"],["shipment","package"],["track","where"]], threshold = 2Output[[0, 1], [2, 3]]WhySynonym DSU collapses related words; tickets 0&1 share 2+ canonical tokens, and tickets 2&3 share 2+ canonical tokens.
Example 2
Inputtasks = ["ship item", "send package", "mail parcel", "refund item"], synonyms = [["ship","send"],["package","parcel"],["send","mail"]], threshold = 2Output[[0], [1, 2], [3]]WhyTickets 1 and 2 share both canonical tokens (ship/send/mail and package/parcel). Ticket 0 shares only one token with them, so it does not join when threshold=2.

Constraints

  • 1 <= tasks.length <= 10^4
  • 0 <= synonyms.length <= 5 * 10^4
  • 1 <= threshold <= 20
  • Each task description length <= 200
  • Synonym words contain only lowercase English letters a-z and have length 1..30
  • Tokenization uses only contiguous a-z sequences; all other characters are separators

Function Signature

def group_similar_tasks(tasks: list[str], synonyms: list[list[str]], threshold: int) -> list[list[int]]:
Practice Python
Python 3.10
Open on desktop for the full Python editor with syntax highlighting and autocomplete.
Up next
Cluster Support Tickets for AnnotationMediumAbzoobaClassify Support Tickets with TF-IDFMediumASummarize Customer Support OutcomesMedium
Next question