Business Context
QueryFind, a consumer web search product, wants to predict whether the top search result satisfied the user's need, so that ranking issues can be detected quickly and low-quality results can be demoted. You need to build an NLP system that estimates satisfaction from the query, the result snippet, the landing-page text, and lightweight behavioral signals.
Data
- Volume: 2.4M search sessions collected over 6 months
- Unit of prediction: one query-result pair for the top-ranked result
- Text fields: query (2-12 tokens), result title (5-20 tokens), snippet (20-180 tokens), landing-page extract (100-800 tokens)
- Language: English only
- Labels: Satisfied, Partially Satisfied, Not Satisfied
- Label distribution: 58% satisfied, 24% partially satisfied, 18% not satisfied
- Weak supervision source: reformulation rate, dwell time, pogo-sticking, and explicit thumbs-up/down feedback
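The fields above can be captured in a single record type. This is a minimal sketch; the names `SearchSession` and `Satisfaction`, and the exact shape of the behavioral signals, are assumptions, not given in the spec:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Satisfaction(Enum):
    SATISFIED = "satisfied"
    PARTIALLY_SATISFIED = "partially_satisfied"
    NOT_SATISFIED = "not_satisfied"

@dataclass
class SearchSession:
    # one query-result pair for the top-ranked result
    query: str                  # 2-12 tokens
    result_title: str           # 5-20 tokens
    snippet: str                # 20-180 tokens
    landing_page_extract: str   # 100-800 tokens
    # weak supervision signals (hypothetical encodings)
    reformulated: bool          # did the user reformulate the query?
    dwell_time_s: float         # seconds on the landing page
    pogo_stick: bool            # quick return to results page
    explicit_feedback: Optional[int]  # +1 thumbs-up, -1 thumbs-down, None absent
    label: Satisfaction         # derived from the signals above
```

Keeping the raw signals alongside the derived label makes it easy to re-derive labels later when the weak-supervision rules change.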
Success Criteria
A good solution should achieve macro-F1 >= 0.78 and recall >= 0.85 on the Not Satisfied class, and produce calibrated probabilities that can support ranking and monitoring decisions.
Constraints
- Inference latency must stay below 80 ms per query-result pair at p95
- The model must run in a Python service on a single T4 GPU or CPU fallback
- Training data contains noisy labels derived from behavior, so robustness matters more than leaderboard accuracy
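The p95 latency budget is straightforward to verify offline. A minimal sketch, assuming a hypothetical `predict_fn` that scores one query-result pair; in production this would be measured end-to-end in the serving path, not just around the model call:

```python
import time
import numpy as np

def p95_latency_ms(predict_fn, examples, warmup=10, runs=200):
    """Measure per-example p95 latency of predict_fn in milliseconds."""
    # warm up caches / JIT / GPU kernels before timing
    for _ in range(warmup):
        predict_fn(examples[0])
    timings = []
    for example in (examples * (runs // len(examples) + 1))[:runs]:
        t0 = time.perf_counter()
        predict_fn(example)
        timings.append((time.perf_counter() - t0) * 1000.0)
    return float(np.percentile(timings, 95))
```

A result below 80 ms on the target hardware (single T4 GPU, with a CPU fallback check) would satisfy the constraint.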
Requirements
- Formulate the task as a supervised NLP problem and define the target label.
- Design a preprocessing pipeline for query, snippet, and landing-page text.
- Implement a baseline and a transformer-based model in Python.
- Explain how you would handle weak labels, class imbalance, and short-vs-long text fields.
- Define an evaluation plan with offline metrics, validation strategy, and error analysis.
- Describe how you would decide whether the model is good enough for production use.
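As a starting point for the baseline requirement, one common approach is TF-IDF over the concatenated text fields feeding a class-weighted logistic regression, wrapped in probability calibration. This is a sketch under assumptions: the `[Q]`/`[T]`/`[S]`/`[P]` field tags, the hyperparameters, and sigmoid (Platt) calibration are illustrative choices, not prescribed by the task:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV

def join_fields(query, title, snippet, page):
    # field tags let the model weight short query text separately
    # from the much longer landing-page extract
    return f"[Q] {query} [T] {title} [S] {snippet} [P] {page}"

def make_baseline():
    return Pipeline([
        ("tfidf", TfidfVectorizer(
            ngram_range=(1, 2),
            max_features=200_000,
            sublinear_tf=True,       # dampen counts from long page text
        )),
        # class_weight="balanced" addresses the 58/24/18 label skew;
        # sigmoid calibration yields the calibrated probabilities the
        # success criteria ask for
        ("clf", CalibratedClassifierCV(
            LogisticRegression(max_iter=1000, class_weight="balanced"),
            method="sigmoid",
            cv=3,
        )),
    ])
```

A transformer cross-encoder over the same concatenated input would be the natural next step; this linear baseline mainly anchors the evaluation and sanity-checks the weak labels.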