Business Context
DocuSearch, a document management platform hosting more than 1 million documents, wants to improve its search functionality. The platform currently retrieves documents with BM25, a lexical ranking function. The product team is considering a dense retrieval model built on a BERT-style encoder to improve search relevance and user experience.
Dataset
| Feature Group | Count | Examples |
|---|---|---|
| Documents | 1M | text_content, title, author, tags, date_created |
| Queries | 500K | user_search_terms, user_id, timestamp |
- Size: 1 million documents, 500,000 queries for evaluation
- Target: Relevance score (binary: relevant or not relevant)
- Class balance: Approximately 70% non-relevant, 30% relevant results
- Missing data: Minimal missing data in metadata fields
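
Because the target is a binary relevance label, ranking quality for either model can be summarized with standard rank-based metrics. The following is a minimal sketch, assuming each query's results are available as a list of 0/1 labels in ranked order; the function names and the choice of precision@k and mean reciprocal rank are illustrative assumptions, not metrics prescribed by the brief.

```python
from typing import Sequence


def precision_at_k(relevance: Sequence[int], k: int) -> float:
    """Fraction of the top-k results that are relevant (labels are 0 or 1)."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Missing positions (fewer than k results) count as non-relevant.
    return sum(relevance[:k]) / k


def reciprocal_rank(relevance: Sequence[int]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, label in enumerate(relevance, start=1):
        if label:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Sequence[Sequence[int]]) -> float:
    """Average reciprocal rank over a set of queries."""
    if not runs:
        raise ValueError("at least one query is required")
    return sum(reciprocal_rank(r) for r in runs) / len(runs)
```

Running both retrievers over the same evaluation queries and comparing these numbers side by side gives a like-for-like accuracy comparison.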
Requirements
- Compare the performance of BM25 and a dense retrieval model (e.g., a BERT-based encoder) on the document retrieval task.
- Analyze trade-offs in terms of speed, accuracy, and interpretability.
- Provide insights on which model to implement based on evaluation metrics.
- Discuss potential challenges in deploying the chosen model in production.
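
As a reference point for the comparison and for the interpretability trade-off, here is a minimal sketch of BM25 scoring that also reports each query term's contribution to a document's score. It is not the platform's implementation: the `BM25` class, whitespace tokenization, the defaults `k1=1.5` and `b=0.75`, and the smoothed non-negative IDF variant are all assumptions made for illustration.

```python
import math
from collections import Counter


class BM25:
    """Toy in-memory BM25 index (hypothetical helper, not a production index)."""

    def __init__(self, docs, k1=1.5, b=0.75):
        # Assumption: lowercase + whitespace split stands in for a real tokenizer.
        self.docs = [d.lower().split() for d in docs]
        self.k1, self.b = k1, b
        self.n = len(self.docs)
        self.avgdl = sum(len(d) for d in self.docs) / self.n
        self.tf = [Counter(d) for d in self.docs]
        # Document frequency: number of documents containing each term.
        self.df = Counter(t for d in self.docs for t in set(d))

    def idf(self, term):
        n_t = self.df.get(term, 0)
        # Smoothed IDF; the "+ 1.0" keeps the value non-negative.
        return math.log((self.n - n_t + 0.5) / (n_t + 0.5) + 1.0)

    def explain(self, query, doc_id):
        """Per-term score contributions; repeated query terms count once."""
        dl = len(self.docs[doc_id])
        norm = self.k1 * (1 - self.b + self.b * dl / self.avgdl)
        contributions = {}
        for term in query.lower().split():
            f = self.tf[doc_id][term]
            if f:
                contributions[term] = self.idf(term) * f * (self.k1 + 1) / (f + norm)
        return contributions

    def score(self, query, doc_id):
        return sum(self.explain(query, doc_id).values())

    def rank(self, query):
        """Document ids ordered from highest to lowest score."""
        return sorted(range(self.n), key=lambda i: self.score(query, i), reverse=True)
```

The `explain` output is what makes BM25 easy to justify to users: a document ranks higher because specific query terms matched and carried specific weights. A dense model produces a single similarity score between embeddings and offers no comparable per-term breakdown out of the box.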
Constraints
- Retrieval latency must be under 200 ms per query for a seamless user experience.
- The model should be interpretable enough to explain why certain documents are ranked higher than others.
- Budget constraints limit the use of extensive cloud resources for model training and inference.
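
The latency constraint can be checked the same way for both models. The sketch below times an arbitrary search callable and compares a tail percentile against the 200 ms budget. Note the assumptions: the brief says "per query" without naming a percentile, so the 95th percentile used here is a choice made for illustration, and `search_fn`, `measure_latencies_ms`, `percentile`, and `meets_budget` are hypothetical names.

```python
import math
import time


def measure_latencies_ms(search_fn, queries):
    """Wall-clock latency of search_fn for each query, in milliseconds."""
    latencies = []
    for q in queries:
        start = time.perf_counter()
        search_fn(q)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0..100)."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_budget(latencies_ms, budget_ms=200.0, pct=95):
    """True if the chosen percentile is strictly under the budget."""
    return percentile(latencies_ms, pct) < budget_ms
```

For a fair comparison, the timed callable should cover the full request path, including query encoding for the dense model, and should run on hardware similar to what the budget allows in production.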