You are building an NLP system that needs to route and organize financial documents such as earnings releases, annual reports, fund commentaries, and regulatory filings. The documents vary widely in length, structure, and writing style, and some classes are much more common than others. You have historical labeled documents, but the labels were created by different teams over time, so the data may contain noise and inconsistent formatting.
How would you build a text classification model for financial documents?
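
One possible first step, before any model is chosen, is to clean up the label strings and quantify the class imbalance. The sketch below is a minimal illustration under stated assumptions, not a full answer: it assumes the "inconsistent formatting" shows up as case, whitespace, and separator differences in the label names, and the sample labels, `normalize_label`, and `balanced_class_weights` are hypothetical names introduced here. The weighting formula, n_samples / (n_classes * class_count), is the same heuristic scikit-learn uses for `class_weight="balanced"`.

```python
from collections import Counter


def normalize_label(raw: str) -> str:
    """Collapse case, whitespace, and separator differences in a label string."""
    # Assumption: teams differ only in casing, spacing, and "-" vs "_" usage.
    unified = raw.strip().lower().replace("-", " ").replace("_", " ")
    return "_".join(unified.split())


def balanced_class_weights(labels: list[str]) -> dict[str, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Hypothetical historical labels written by different teams.
raw_labels = [
    "Earnings Release",
    "earnings_release",
    " Annual-Report ",
    "earnings release",
    "Fund Commentary",
    "EARNINGS RELEASE",
]

labels = [normalize_label(r) for r in raw_labels]
weights = balanced_class_weights(labels)

for label, weight in sorted(weights.items()):
    print(f"{label}: {weight:.2f}")
```

The resulting weights could be passed to a loss function or sampler so that rare document types are not drowned out by common ones. Noisy labels that differ in meaning rather than formatting would need a separate treatment, such as manual review of a sample.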