Business Context
QueryFind, a consumer search platform, wants to annotate search results based on how well each result satisfies the user’s need for a given query. These annotations will be used to train downstream ranking models and improve search quality.
Data
You are given query-result pairs with human relevance judgments collected from editorial raters.
- Volume: 850,000 labeled query-result pairs from the last 12 months
- Text fields: query text, result title, snippet, URL path, and optional landing-page body text
- Text length: queries are 2-12 tokens; snippets are 20-180 tokens; landing pages are truncated to 512 tokens
- Language: English only
- Label distribution: Fully Satisfies 11%, Highly Satisfies 19%, Moderately Satisfies 33%, Slightly Satisfies 22%, Fails to Satisfy 15%
Success Criteria
A strong solution should achieve macro-F1 >= 0.78 and weighted F1 >= 0.84 while keeping confusion between adjacent relevance classes low (e.g., Highly Satisfies predicted as Moderately Satisfies). Because these labels feed ranking systems, predicted probabilities must be well calibrated and behave consistently with the ordinal label scale; a sketch of the headline metrics follows.
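Where the exact definitions matter, the headline metrics can be computed with scikit-learn plus a simple binned calibration error. A minimal sketch, assuming hypothetical numpy arrays `y_true`/`y_pred` of integer grades 0 (Fails to Satisfy) through 4 (Fully Satisfies) and an `(n, 5)` probability matrix `y_prob`:

```python
import numpy as np
from sklearn.metrics import f1_score

def headline_metrics(y_true, y_pred, y_prob, n_bins=10):
    """Macro/weighted F1 plus a simple expected calibration error (ECE)."""
    macro_f1 = f1_score(y_true, y_pred, average="macro")
    weighted_f1 = f1_score(y_true, y_pred, average="weighted")

    # ECE: bin predictions by confidence, compare accuracy to confidence.
    conf = y_prob.max(axis=1)
    correct = (y_pred == y_true).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return {"macro_f1": macro_f1, "weighted_f1": weighted_f1, "ece": ece}
```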
Constraints
- Batch inference on 5M query-result pairs per day
- Per-pair inference latency under 80ms on a T4 GPU
- Model must be deployable in Python and exportable to ONNX (an export sketch follows this list)
- Training must fit on a single 16GB GPU
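The ONNX constraint is worth verifying early, since some architectures trace more cleanly than others. A minimal export sketch, assuming a fine-tuned Hugging Face sequence-classification checkpoint at a hypothetical `checkpoints/relevance-model` path:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical checkpoint directory for the fine-tuned model.
# torchscript=True makes the model return plain tuples, which traces cleanly.
model = AutoModelForSequenceClassification.from_pretrained(
    "checkpoints/relevance-model", torchscript=True)
tokenizer = AutoTokenizer.from_pretrained("checkpoints/relevance-model")
model.eval()

# A dummy query/document pair traces the graph; dynamic axes keep batch
# size and sequence length flexible for batch inference.
dummy = tokenizer("example query", "example result text", return_tensors="pt")
torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    "relevance-model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq"},
        "attention_mask": {0: "batch", 1: "seq"},
        "logits": {0: "batch"},
    },
    opset_version=17,
)
```

The exported graph can then be benchmarked with ONNX Runtime on a T4 to confirm the 80ms per-pair budget holds at the intended batch size.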
Requirements
- Build an NLP model that predicts how well a search result satisfies the user need behind a query.
- Define a realistic preprocessing pipeline for the query, title, snippet, URL path, and page text (see the input-packing sketch below).
- Implement training and evaluation in modern Python using transformer-based fine-tuning (a Trainer-style sketch follows).
- Explain how class imbalance and ordinal confusion between neighboring labels would be handled (see the weighted-loss sketch below).
- Describe how the model would be validated before its outputs feed a search ranking pipeline (see the validation-report sketch at the end).
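For the preprocessing requirement, one workable packing puts the query on the first segment and the concatenated result fields on the second, so truncation never eats the query. A sketch assuming a DeBERTa-v3-small backbone and a 256-token budget (both assumptions, sized with the latency constraint in mind):

```python
from transformers import AutoTokenizer

# Assumed backbone and length budget; both are tunable choices.
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-small")
MAX_LEN = 256

def build_inputs(query, title, snippet, url_path, page_text=None):
    """Pack (query, result fields) as a sentence pair; truncate only the
    result side so the query always survives intact."""
    sep = f" {tokenizer.sep_token} "
    # Split the URL path on '/' so its tokens are visible to the subword model.
    doc = sep.join(part for part in [
        title.strip(),
        snippet.strip(),
        url_path.replace("/", " / ").strip(),
        (page_text or "").strip(),
    ] if part)
    return tokenizer(
        query.strip(),
        doc,
        truncation="only_second",
        max_length=MAX_LEN,
        padding="max_length",
        return_tensors="pt",
    )
```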
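For training and evaluation, a Hugging Face Trainer loop is a reasonable baseline. The sketch below assumes the same backbone, hyperparameters sized for a single 16GB GPU, and already-tokenized `train_ds`/`dev_ds` datasets plus a `compute_metrics` wrapper (hypothetical names, e.g. wrapping `headline_metrics` above):

```python
from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments)

model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-small", num_labels=5)  # assumed backbone

args = TrainingArguments(
    output_dir="checkpoints/relevance-model",
    per_device_train_batch_size=32,
    gradient_accumulation_steps=2,   # effective batch of 64 on one 16GB GPU
    learning_rate=2e-5,
    num_train_epochs=3,
    fp16=True,                       # mixed precision to fit the memory budget
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="macro_f1",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,          # tokenized datasets (assumed prepared)
    eval_dataset=dev_ds,
    compute_metrics=compute_metrics, # assumed wrapper over headline_metrics
)
trainer.train()
```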
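For the imbalance and ordinal-confusion requirement, a common first step is inverse-frequency class weighting combined with label smoothing, which softens the penalty for near-miss predictions on adjacent grades; CORAL/CORN-style ordinal heads are a heavier alternative if adjacent confusion stays high. A sketch as a Trainer subclass, with weights derived from the label distribution stated above:

```python
import torch
import torch.nn.functional as F
from transformers import Trainer

# Inverse-frequency weights from the stated label distribution, ordered
# Fails (15%), Slightly (22%), Moderately (33%), Highly (19%), Fully (11%).
freqs = torch.tensor([0.15, 0.22, 0.33, 0.19, 0.11])
class_weights = 1.0 / freqs
class_weights = class_weights * len(freqs) / class_weights.sum()

class WeightedTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss = F.cross_entropy(
            outputs.logits,
            labels,
            weight=class_weights.to(outputs.logits.device),
            label_smoothing=0.1,  # softens hard one-hot targets between grades
        )
        return (loss, outputs) if return_outputs else loss
```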
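For pre-deployment validation, a held-out report can check the normalized confusion matrix, the share of errors that jump more than one grade, and whether the probability-weighted expected grade tracks the true grade. A sketch using the same hypothetical arrays as the metrics example:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import confusion_matrix

def validation_report(y_true, y_pred, y_prob):
    """Pre-deployment checks on a held-out slice."""
    cm = confusion_matrix(y_true, y_pred, normalize="true")

    # Distant confusions (more than one grade off) are the ones that
    # visibly damage ranking order; adjacent confusions are more forgivable.
    errs = y_pred != y_true
    distant = np.abs(y_pred - y_true) > 1
    distant_rate = distant.sum() / max(errs.sum(), 1)

    # Ordinal sanity check: the probability-weighted expected grade
    # should correlate strongly with the true grade.
    expected_grade = (y_prob * np.arange(y_prob.shape[1])).sum(axis=1)
    rho, _ = spearmanr(expected_grade, y_true)

    return {"confusion": cm, "distant_error_rate": distant_rate,
            "spearman_rho": rho}
```

If calibration drifts on the held-out slice, temperature scaling fitted on the dev set is a cheap correction before export; a shadow evaluation against the live ranking pipeline could then serve as the final gate.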