
You are handed a large collection of unstructured text drawn from emails, notes, and documents, and the business wants a practical way to turn it into something useful. The text is messy, inconsistent, and only partially labeled, so you need to decide how to clean it, represent it, and model it before anyone can trust the output.
How would you approach a project involving unstructured data?
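
One way to make the clean, represent, and model steps concrete is a small baseline like the sketch below. It is only an illustration under stated assumptions: the sample documents, the `billing` and `support` labels, the stopword list, and every function name are hypothetical, and the TF-IDF plus nearest-centroid approach is one possible starting point, not a recommended final design. It uses only the Python standard library.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real project would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "in", "on", "for", "to", "of", "and", "your", "after"}

def clean(text):
    """Lowercase, drop email addresses and URLs, keep alphabetic tokens of 2+ letters."""
    text = re.sub(r"\S+@\S+|https?://\S+", " ", text.lower())
    return [tok for tok in re.findall(r"[a-z]{2,}", text) if tok not in STOPWORDS]

def tfidf_vectors(token_lists):
    """Represent each document as an L2-normalized sparse TF-IDF dict."""
    n_docs = len(token_lists)
    doc_freq = Counter(tok for toks in token_lists for tok in set(toks))
    vectors = []
    for toks in token_lists:
        counts = Counter(toks)
        # Smoothed inverse document frequency, so no weight is ever zero.
        vec = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
               for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def fit_centroids(vectors, labels):
    """Average the vectors of each label; documents labeled None are skipped."""
    sums, counts = {}, Counter()
    for vec, label in zip(vectors, labels):
        if label is None:
            continue
        counts[label] += 1
        acc = sums.setdefault(label, Counter())
        for t, w in vec.items():
            acc[t] += w
    return {label: {t: w / counts[label] for t, w in acc.items()}
            for label, acc in sums.items()}

def predict(vec, centroids):
    """Return the label whose centroid is most similar to the vector."""
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

# Made-up sample data: four labeled documents and two unlabeled ones (label None).
docs = [
    ("Invoice attached, payment due Friday. Contact billing@example.com", "billing"),
    ("Your payment for invoice 1042 is overdue", "billing"),
    ("Cannot log in, password reset link is broken", "support"),
    ("App crashes on login after the update", "support"),
    ("Please resend the invoice, the payment failed", None),
    ("Login page shows an error after password reset", None),
]

labels = [label for _, label in docs]
vectors = tfidf_vectors([clean(text) for text, _ in docs])
centroids = fit_centroids(vectors, labels)
predictions = [predict(vec, centroids)
               for vec, label in zip(vectors, labels) if label is None]
print(predictions)
```

A baseline this simple is mainly useful for trust-building: the predictions on the unlabeled documents can be spot-checked by hand, and the same held-out comparison can later be applied to any heavier model that replaces it.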