Business Context
LexiSearch is building a Transformer-based encoder for classifying legal clauses and summarizing contract sections. The team noticed that a bag-of-words baseline cannot capture meaning that depends on token order, so they want to evaluate positional encodings and explain why they are necessary in self-attention models.
Data
- Volume: 180,000 English legal text segments from contracts and policy documents
- Text length: 20-512 tokens, with a median of 140 tokens
- Language: English only
- Labels: 6 clause types for the downstream classification task; moderately imbalanced, with the largest class at 31% and the smallest at 8%
- Special challenge: Many examples contain the same tokens in different orders, where order changes meaning (e.g., obligations, exceptions, termination conditions)
Constraints
- Inference latency must remain under 120ms per document on a single T4 GPU
- The solution must support sequence lengths up to 512 tokens
- The team wants an implementation in modern Python using PyTorch and Hugging Face Transformers
- The explanation must compare fixed sinusoidal and learned positional embeddings
Requirements
- Explain why self-attention alone is permutation-invariant (permuting the input tokens merely permutes the outputs, so pooled representations are unchanged) and why positional information is required; a toy demonstration follows this list.
- Implement a preprocessing pipeline for tokenization, padding, truncation, and attention masks; a tokenizer sketch appears after this list.
- Build and compare two Transformer encoders: one with fixed sinusoidal positional encodings and one with learned positional embeddings; both variants are sketched after this list.
- Fine-tune both models on the clause classification task and report performance and training trade-offs.
- Show, with examples, how removing positional encodings affects predictions on texts with reordered tokens; an ablation sketch follows this list.
- Recommend which positional encoding strategy to deploy and justify the choice.
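Reference Sketches
For the permutation-invariance requirement, the following is a minimal sketch in plain PyTorch with random toy embeddings (not LexiSearch code): a single self-attention layer without positional information maps a permuted input to an identically permuted output, so any order-insensitive pooling, such as the mean, produces the same vector for both orderings.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_model = 6, 16
x = torch.randn(seq_len, d_model)       # toy "token embeddings" carrying no positional signal
w_q = torch.randn(d_model, d_model)
w_k = torch.randn(d_model, d_model)
w_v = torch.randn(d_model, d_model)

def self_attention(tokens: torch.Tensor) -> torch.Tensor:
    """Single-head scaled dot-product self-attention over a (seq_len, d_model) input."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = (q @ k.T) / d_model ** 0.5
    return F.softmax(scores, dim=-1) @ v

perm = torch.randperm(seq_len)
out_original = self_attention(x)
out_permuted = self_attention(x[perm])

# Reordering the tokens only reorders the outputs by the same permutation ...
print(torch.allclose(out_original[perm], out_permuted, atol=1e-5))                    # True
# ... so a mean-pooled "sentence vector" cannot tell the two orderings apart.
print(torch.allclose(out_original.mean(dim=0), out_permuted.mean(dim=0), atol=1e-5))  # True
```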
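For the preprocessing requirement, a minimal tokenization sketch assuming Hugging Face's `AutoTokenizer`; the `bert-base-uncased` checkpoint, the example clauses, and the label ids are illustrative placeholders rather than the team's actual configuration.

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint

texts = [
    "The supplier may terminate the agreement upon thirty days written notice.",
    "Upon thirty days written notice the agreement may terminate the supplier.",
]
labels = [3, 5]  # hypothetical clause-type ids

# Pad/truncate to the 512-token ceiling from the constraints; the returned
# attention_mask marks real tokens (1) versus padding (0).
encoded = tokenizer(
    texts,
    padding="max_length",
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
batch = {
    "input_ids": encoded["input_ids"],
    "attention_mask": encoded["attention_mask"],
    "labels": torch.tensor(labels),
}
print(batch["input_ids"].shape, batch["attention_mask"].shape)  # torch.Size([2, 512]) twice
```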
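For the encoder comparison, sketches of the two positional encoding variants to be evaluated; the class names, the `(batch, seq_len, d_model)` input convention, and an even `d_model` are assumptions made for illustration.

```python
import math
import torch
import torch.nn as nn

class SinusoidalPositionalEncoding(nn.Module):
    """Fixed sin/cos encodings as in Vaswani et al. (2017); adds no trainable parameters."""
    def __init__(self, d_model: int, max_len: int = 512):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)                     # (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe)                                    # fixed, not learned

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); add the first seq_len position vectors.
        return x + self.pe[: x.size(1)].unsqueeze(0)

class LearnedPositionalEmbedding(nn.Module):
    """Trainable per-position vectors, as used by BERT-style encoders."""
    def __init__(self, d_model: int, max_len: int = 512):
        super().__init__()
        self.pos_emb = nn.Embedding(max_len, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        positions = torch.arange(x.size(1), device=x.device)
        return x + self.pos_emb(positions).unsqueeze(0)
```

Either module would be added to the token embeddings before the encoder stack. The sinusoidal variant introduces no extra parameters and can in principle extrapolate beyond lengths seen in training, while the learned variant can adapt to corpus-specific position patterns but is capped at `max_len`; the fine-tuning comparison in the requirements should show which matters more for 512-token contract segments.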
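For the reordered-token analysis, a hedged sketch of one way to probe what positional information contributes: zero the learned position-embedding table of an already fine-tuned BERT-style classifier and compare its predictions on a clause and a token-reordered counterpart. The checkpoint name is hypothetical, and the `model.bert.embeddings.position_embeddings` attribute path assumes a BERT architecture; other architectures expose the table differently.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "lexisearch/clause-classifier"  # hypothetical fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint).eval()

# Two clauses built from the same tokens in different orders, with different meanings.
pair = [
    "The tenant may terminate the lease if the landlord breaches the agreement.",
    "The landlord may terminate the agreement if the tenant breaches the lease.",
]

def predict(texts):
    enc = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        return model(**enc).logits.argmax(dim=-1)

print("with positions:   ", predict(pair))  # predictions may differ, since the meanings differ

# Ablate order information by zeroing the learned position-embedding table in place.
model.bert.embeddings.position_embeddings.weight.data.zero_()
# Without positions the encoder sees only a bag of tokens, so the two clauses
# should now receive the same prediction (up to floating-point noise).
print("without positions:", predict(pair))
```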