You’re building an analytics pipeline for a logistics marketplace with 5M+ daily active users where dispatchers create free-form task descriptions (e.g., “Pick-up @ 5th Ave!!!”, “pickup at fifth avenue”, “PICK UP 5th ave”). Product and ops teams want reliable dashboards for task volumes by type and location, but noisy text causes fragmented counts and missed trends.
Write a function that cleans and standardizes each task description into a canonical form so that semantically identical descriptions normalize to the same string.
Given a list of strings tasks, produce a list cleaned of the same length where each element is normalized as follows:
1. Convert the string to lowercase.
2. Treat every character that is not a lowercase letter (a-z) or digit (0-9) as a separator and replace it by a single space.
3. Split the string on spaces into tokens, discarding empty tokens.
4. Remove any token that appears in stopwords.
5. If a token is a key in synonyms, replace it with synonyms[token] (applied once per token).
6. Remove duplicate tokens.
7. Sort the remaining tokens lexicographically and join them with single spaces.

Function signature: tasks: list[str], stopwords: set[str], synonyms: dict[str, str]; return list[str].

Example 1:
Input: tasks = ["Pick-up @ 5th Ave!!!", "pickup at fifth avenue", "PICK UP 5th ave"], stopwords = {"at"}, synonyms = {"pick": "pickup", "up": "pickup", "fifth": "5th", "avenue": "ave"}
Output: ["5th ave pickup", "5th ave pickup", "5th ave pickup"]
Explanation: at is removed, synonyms map variants to the same tokens, duplicates are removed, then tokens are sorted.

Example 2:
Input: tasks = [" !!! ", "Deliver--package to Door #2"], stopwords = {"to"}, synonyms = {"deliver": "dropoff", "door": "door"}
Output: ["", "2 door dropoff package"]
Explanation: deliver becomes dropoff, to is removed, # is a separator, and tokens are sorted.

Constraints:
- 1 <= tasks.length <= 2 * 10^5
- 0 <= len(tasks[i]) <= 10^4
- The total length of all strings in tasks is <= 2 * 10^6
- 0 <= |stopwords| <= 10^5
- 0 <= |synonyms| <= 10^5
- synonyms keys and values are lowercase alphanumeric tokens
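One way to satisfy the rules above is a single pass per string: lowercase, split on non-alphanumeric runs, filter stopwords, map synonyms, then deduplicate and sort. The sketch below is a minimal reference implementation; the function name `normalize_tasks` is an assumption (the problem does not fix a name), and it assumes stopwords are removed before synonym substitution, which is consistent with both examples.

```python
import re

def normalize_tasks(tasks: list[str],
                    stopwords: set[str],
                    synonyms: dict[str, str]) -> list[str]:
    # Sketch of one possible solution; name and step ordering are assumptions.
    cleaned = []
    for task in tasks:
        # Steps 1-3: lowercase, then split on runs of characters that are
        # not a-z or 0-9; filter out the empty tokens re.split can produce
        # at the string edges.
        tokens = [t for t in re.split(r"[^a-z0-9]+", task.lower()) if t]
        # Step 4: drop stopwords. Step 5: apply the synonym map once per token
        # (tokens without an entry pass through unchanged).
        mapped = [synonyms.get(t, t) for t in tokens if t not in stopwords]
        # Steps 6-7: deduplicate, sort lexicographically, join with spaces.
        cleaned.append(" ".join(sorted(set(mapped))))
    return cleaned
```

Each string is processed in O(k log k) for k tokens, so the total work stays within the 2 * 10^6 character budget; `set` membership makes the stopword and synonym lookups O(1) per token.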