You’re building an analytics pipeline for a logistics marketplace with 5M+ daily active users where dispatchers create free-form task descriptions (e.g., “Pick-up @ 5th Ave!!!”, “pickup at fifth avenue”, “PICK UP 5th ave”). Product and ops teams want reliable dashboards for task volumes by type and location, but noisy text causes fragmented counts and missed trends.
Write a function that cleans and standardizes each task description into a canonical form so that semantically identical descriptions normalize to the same string.
Given a list of strings tasks, produce a list cleaned of the same length where each element is normalized as follows:
1. Convert the string to lowercase.
2. Treat every character that is not a lowercase letter (a-z) or digit (0-9) as a separator and replace it by a single space.
3. Split the string on spaces into tokens, discarding empty tokens.
4. Remove any token that appears in stopwords.
5. If a token is a key in synonyms, replace it with synonyms[token] (applied once per token).
6. Remove duplicate tokens.
7. Sort the remaining tokens lexicographically and join them with single spaces.

Function signature: tasks: list[str], stopwords: set[str], synonyms: dict[str, str]; return list[str].

Example 1:
Input: tasks = ["Pick-up @ 5th Ave!!!", "pickup at fifth avenue", "PICK UP 5th ave"], stopwords = {"at"}, synonyms = {"pick": "pickup", "up": "pickup", "fifth": "5th", "avenue": "ave"}
Output: ["5th ave pickup", "5th ave pickup", "5th ave pickup"]
Explanation: at is removed, synonyms map variants to the same tokens, duplicates are removed, then tokens are sorted.

Example 2:
Input: tasks = [" !!! ", "Deliver--package to Door #2"], stopwords = {"to"}, synonyms = {"deliver": "dropoff", "door": "door"}
Output: ["", "2 door dropoff package"]
Explanation: deliver becomes dropoff, to is removed, # is a separator, and tokens are sorted.

Constraints:
- 1 <= tasks.length <= 2 * 10^5
- 0 <= len(tasks[i]) <= 10^4
- The total length of all strings in tasks is <= 2 * 10^6
- 0 <= |stopwords| <= 10^5
- 0 <= |synonyms| <= 10^5
- synonyms keys and values are lowercase alphanumeric tokens
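One way to satisfy the rules above is a single pass per string: lowercase, split on non-alphanumeric runs, filter stopwords, map synonyms, then deduplicate and sort. The sketch below is a minimal reference implementation; the function name `normalize_tasks` is an assumption (the problem does not fix a name), and it assumes stopwords are removed before synonym substitution, which is consistent with both examples.

```python
import re

def normalize_tasks(tasks: list[str],
                    stopwords: set[str],
                    synonyms: dict[str, str]) -> list[str]:
    # Sketch of one possible solution; name and step ordering are assumptions.
    cleaned = []
    for task in tasks:
        # Steps 1-3: lowercase, then split on runs of characters that are
        # not a-z or 0-9; filter out the empty tokens re.split can produce
        # at the string edges.
        tokens = [t for t in re.split(r"[^a-z0-9]+", task.lower()) if t]
        # Step 4: drop stopwords. Step 5: apply the synonym map once per token
        # (tokens without an entry pass through unchanged).
        mapped = [synonyms.get(t, t) for t in tokens if t not in stopwords]
        # Steps 6-7: deduplicate, sort lexicographically, join with spaces.
        cleaned.append(" ".join(sorted(set(mapped))))
    return cleaned
```

Each string is processed in O(k log k) for k tokens, so the total work stays within the 2 * 10^6 character budget; `set` membership makes the stopword and synonym lookups O(1) per token.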