Context
ResearchCorp, a scientific research organization, has accumulated thousands of internal research PDFs that are difficult to search and analyze. Researchers currently lose significant time searching these documents by hand, which hurts productivity. To improve accessibility, ResearchCorp aims to build an ETL pipeline that extracts text from the PDFs, indexes the data for a retrieval-augmented generation (RAG) system, and makes it searchable.
Scale Requirements
- Data Volume: Process 10,000 PDFs per day, averaging 5MB each.
- Throughput: Support 1,000 documents/hour during peak hours.
- Latency: Indexing should complete within 2 hours of PDF ingestion.
- Storage: Use a scalable storage solution for raw PDFs and indexed data, targeting a retention period of 5 years.
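A quick back-of-the-envelope check helps validate these numbers before sizing storage. The sketch below derives daily ingest volume, the raw-PDF footprint over the 5-year retention window, and the per-document time budget implied by the peak throughput target (index size is deliberately left out, since it depends on the chosen mappings):

```python
# Capacity arithmetic for the figures above (raw PDFs only; the
# search index adds overhead that depends on mappings and analyzers).

PDFS_PER_DAY = 10_000
AVG_SIZE_MB = 5
RETENTION_YEARS = 5
PEAK_DOCS_PER_HOUR = 1_000

daily_raw_gb = PDFS_PER_DAY * AVG_SIZE_MB / 1024                 # ~48.8 GB/day
retained_raw_tb = daily_raw_gb * 365 * RETENTION_YEARS / 1024    # ~87 TB over 5 years
seconds_per_doc = 3600 / PEAK_DOCS_PER_HOUR                      # 3.6 s/doc at peak

print(f"raw ingest: {daily_raw_gb:.1f} GB/day")
print(f"5-year raw retention: {retained_raw_tb:.1f} TB")
print(f"peak budget: {seconds_per_doc:.1f} s per document")
```

At roughly 87 TB of raw PDFs over five years, S3 with lifecycle rules (e.g. transitioning older objects to a colder storage class) is the natural fit within the stated AWS constraint.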
Requirements
- Extraction: Implement a robust PDF text extraction process using tools like Apache Tika or PyMuPDF.
- Transformation: Clean and standardize extracted text, applying NLP techniques for entity recognition and keyword extraction.
- Loading: Index transformed data into a search engine (e.g., Elasticsearch) for fast retrieval.
- Data Quality: Implement validation checks to ensure that only valid and complete documents are indexed.
- Orchestration: Use Apache Airflow for scheduling and monitoring the ETL jobs, ensuring fault tolerance and retries.
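The orchestration layer ties the steps together. Below is a skeleton Airflow DAG showing where the retry policy lives; the DAG id, schedule, and task callables are placeholders (the stubs stand in for the real extract/transform/load functions), and the `schedule` parameter shown assumes Airflow 2.4+ (older versions use `schedule_interval`):

```python
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables; the real pipeline would import its ETL functions.
def extract_pdfs(): ...
def clean_and_tag(): ...
def index_to_elasticsearch(): ...

default_args = {
    "retries": 3,                        # fault tolerance: retry failed tasks
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="pdf_etl",                    # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",                  # matches the hourly throughput target
    default_args=default_args,
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_pdfs)
    transform = PythonOperator(task_id="transform", python_callable=clean_and_tag)
    load = PythonOperator(task_id="load", python_callable=index_to_elasticsearch)

    extract >> transform >> load
```

Airflow's built-in retry handling and task-level monitoring cover the fault-tolerance requirement; the 2-hour indexing latency target can be tracked with SLAs or task-duration alerts.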
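The transformation step can be sketched as follows. This is a minimal, dependency-free illustration: the stop-word list is hypothetical, and the frequency-based keyword heuristic is a stand-in for the real NLP stage (e.g. spaCy for entity recognition), not a substitute for it:

```python
import re
from collections import Counter

# Illustrative stop-word list; a production pipeline would use an NLP
# library's curated list and proper entity recognition.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}

def clean_text(raw: str) -> str:
    """Collapse whitespace and strip control characters from extracted text."""
    text = re.sub(r"[\x00-\x08\x0b-\x1f]", " ", raw)  # drop control chars
    return re.sub(r"\s+", " ", text).strip()          # normalize whitespace

def top_keywords(text: str, n: int = 5) -> list[str]:
    """Naive frequency-based keyword extraction (stand-in for real NLP)."""
    words = [w for w in re.findall(r"[a-z]{3,}", text.lower())
             if w not in STOP_WORDS]
    return [w for w, _ in Counter(words).most_common(n)]

cleaned = clean_text("Protein\x0cfolding  studies\nof protein   structures")
print(cleaned)                   # "Protein folding studies of protein structures"
print(top_keywords(cleaned, 2))  # "protein" ranks first (appears twice)
```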
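The data-quality gate can be expressed as a simple predicate run before indexing. The thresholds below (200 characters minimum, 90% of pages yielding text) are illustrative defaults, not requirements from the brief:

```python
from dataclasses import dataclass

@dataclass
class ExtractedDoc:
    doc_id: str
    text: str
    page_count: int        # pages reported by the PDF
    extracted_pages: int   # pages that actually yielded text

def is_valid(doc: ExtractedDoc, min_chars: int = 200,
             min_page_ratio: float = 0.9) -> bool:
    """Gate before indexing: reject empty or truncated extractions.
    Thresholds are illustrative and should be tuned per corpus."""
    if len(doc.text.strip()) < min_chars:
        return False       # likely a scanned/image-only PDF needing OCR
    if doc.page_count == 0:
        return False
    return doc.extracted_pages / doc.page_count >= min_page_ratio

ok = ExtractedDoc("r-001", "x" * 500, page_count=10, extracted_pages=10)
truncated = ExtractedDoc("r-002", "x" * 500, page_count=10, extracted_pages=5)
print(is_valid(ok), is_valid(truncated))  # True False
```

Documents that fail the gate can be routed to a quarantine location (e.g. a separate S3 prefix) for manual review rather than silently dropped.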
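For the loading step, Elasticsearch's `_bulk` API expects newline-delimited JSON: one action line followed by one source line per document. The sketch below builds that body with only the standard library; the index name and field names are assumptions, and a real pipeline would typically send batches via the official client's bulk helpers instead of hand-rolling HTTP requests:

```python
import json

def to_bulk_ndjson(docs: list[dict], index: str = "research-docs") -> str:
    """Build an Elasticsearch _bulk request body (NDJSON):
    an action line, then a source line, for each document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc["doc_id"]}}))
        lines.append(json.dumps({"title": doc["title"],
                                 "body": doc["text"],
                                 "keywords": doc["keywords"]}))
    return "\n".join(lines) + "\n"   # the bulk API requires a trailing newline

body = to_bulk_ndjson([{"doc_id": "r-001", "title": "Folding",
                        "text": "protein folding study",
                        "keywords": ["protein"]}])
print(body)
```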
Constraints
- Infrastructure: Limited to existing AWS resources (EC2, S3, RDS).
- Budget: Monthly cloud spend should not exceed $5,000.
- Compliance: Ensure all data processing complies with internal data governance policies.