Business Context
NorthGrid Energy wants to convert free-text equipment logs and operator notes into structured signals for predictive maintenance and incident triage. You need to design NLP features from noisy operational text and build a model that classifies each note into a failure category.
Data
You have 420,000 historical records from substations and field service systems.
- Sources: SCADA alarm summaries, technician logs, shift handoff notes, operator comments
- Text length: 5-180 tokens (median: 32)
- Language: English, with abbreviations, misspellings, equipment codes, timestamps, and copied alarm strings
- Labels: 6 classes: power_loss, sensor_fault, communication_issue, mechanical_issue, scheduled_maintenance, other
- Label distribution: moderately imbalanced; other is 28%, communication_issue is 9%
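One common way to counter a moderately imbalanced label distribution is to weight each class inversely to its frequency during training. The sketch below computes such weights with the formula n / (k * count), where n is the number of records and k the number of classes. The function name `balanced_class_weights` and the toy label list are illustrative assumptions, not part of the dataset.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Return {class: weight} with weight = n / (k * count).

    Rare classes get weights above 1, frequent classes below 1.
    """
    counts = Counter(labels)
    n = len(labels)
    k = len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}

# Toy example (hypothetical labels, not real NorthGrid data).
toy_labels = ["other"] * 3 + ["communication_issue"]
weights = balanced_class_weights(toy_labels)
```

Weights of this form can typically be passed to a linear classifier or used to scale a per-example loss; whether they help the recall targets should be checked on validation data rather than assumed.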
Success Criteria
A good solution should achieve:
- Macro-F1 >= 0.80 on a held-out test set
- Recall >= 0.88 for both power_loss and mechanical_issue
- Stable performance on unseen sites and new operators
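The first two criteria can be checked mechanically once predictions exist. The following is a minimal pure-Python sketch that computes per-class precision, recall, and F1, then tests the macro-F1 and recall thresholds stated above. The helper names `per_class_scores` and `meets_criteria` are hypothetical, and in practice a library implementation of the same metrics would normally be used.

```python
from collections import Counter

def per_class_scores(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1)}, computed one-vs-rest."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = {}
    for c in labels:
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[c] = (precision, recall, f1)
    return scores

def meets_criteria(y_true, y_pred, labels,
                   critical=("power_loss", "mechanical_issue"),
                   min_macro_f1=0.80, min_recall=0.88):
    """Return (macro_f1, passed) for the thresholds in the success criteria."""
    scores = per_class_scores(y_true, y_pred, labels)
    macro_f1 = sum(f1 for _, _, f1 in scores.values()) / len(labels)
    recall_ok = all(scores[c][1] >= min_recall for c in critical)
    return macro_f1, macro_f1 >= min_macro_f1 and recall_ok
```

Macro-F1 averages the per-class F1 scores without weighting by class size, so a weak minority class such as communication_issue pulls the score down as much as a weak majority class would.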
Constraints
- Inference must run in <50 ms per note in a batch scoring service
- The pipeline must be explainable enough for operations analysts
- Notes may contain IDs, device names, and repeated boilerplate that should not dominate predictions
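One way to keep IDs, device names, and timestamps from dominating predictions is to replace them with placeholder tokens before feature extraction. The sketch below does this with regular expressions. The patterns and placeholder names (`<ts>`, `<time>`, `<equip>`, `<num>`) are assumptions about what the notes look like, not a description of NorthGrid's actual formats, and would need tuning against real samples.

```python
import re

# Order matters: full timestamps are masked before bare times and numbers.
_PATTERNS = [
    # ISO-like date plus time, e.g. "2023-04-01 13:22:05" (lowercased input).
    (re.compile(r"\b\d{4}-\d{2}-\d{2}[ t]\d{2}:\d{2}(?::\d{2})?\b"), "<ts>"),
    # Bare clock times, e.g. "13:22".
    (re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\b"), "<time>"),
    # Assumed equipment-code shape: 2-5 letters, optional hyphen, 2-6 digits.
    (re.compile(r"\b[a-z]{2,5}-?\d{2,6}\b"), "<equip>"),
    # Any remaining integer or decimal number.
    (re.compile(r"\b\d+(?:\.\d+)?\b"), "<num>"),
]

def normalize(note: str) -> str:
    """Lowercase, mask volatile tokens, and collapse whitespace."""
    text = note.lower()
    for pattern, placeholder in _PATTERNS:
        text = pattern.sub(placeholder, text)
    return re.sub(r"\s+", " ", text).strip()

example = normalize("XFMR-1042 tripped at 2023-04-01 13:22:05, temp 87.5")
```

Masking keeps the signal that an equipment code or reading was present while removing the specific value, which also makes model explanations easier for analysts to read.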
Requirements
- Build an NLP pipeline for noisy logs and operator notes.
- Describe what text normalization and domain-specific preprocessing you would apply.
- Implement a baseline using TF-IDF + linear classifier and a stronger transformer-based model in Python.
- Explain how you would handle abbreviations, rare tokens, duplicated templates, and class imbalance.
- Define an evaluation plan, including split strategy, metrics, and error analysis.
- Identify which features from logs or notes are most useful and which may create leakage or poor generalization.
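For the split strategy and duplicated templates, one possible approach is to assign whole sites to either train or test, so that test performance reflects unseen sites, and to drop repeated template text before training. The sketch below uses only the standard library. The field names `text` and `label`, the `site_id` argument, and the 20% test fraction are illustrative assumptions.

```python
import hashlib

def assign_split(site_id: str, test_fraction: float = 0.2) -> str:
    """Deterministically assign every record of a site to one split."""
    digest = hashlib.sha256(site_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "test" if bucket < test_fraction else "train"

def dedupe_templates(records):
    """Keep the first record per (whitespace-normalized text, label) pair."""
    seen = set()
    kept = []
    for rec in records:
        key = (" ".join(rec["text"].lower().split()), rec["label"])
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept

# Hypothetical records for illustration only.
records = [
    {"text": "Comms lost to RTU", "label": "communication_issue"},
    {"text": "comms  lost to rtu", "label": "communication_issue"},
    {"text": "Breaker stuck open", "label": "mechanical_issue"},
]
unique_records = dedupe_templates(records)
```

Hashing the site identifier keeps the assignment stable across reruns without storing a lookup table. Deduplicating before the split matters because an identical template appearing in both train and test would inflate the measured scores.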