Business Context
LexiSearch is building a Transformer-based encoder for classifying legal clauses and summarizing contract sections. The team noticed that a bag-of-words baseline cannot capture meaning that depends on token order, so they want to evaluate positional encodings and explain why they are necessary in self-attention models.
Data
- Volume: 180,000 English legal text segments from contracts and policy documents
- Text length: 20-512 tokens, with a median of 140 tokens
- Language: English only
- Labels: 6 clause types for the downstream classification task; moderately imbalanced, with the largest class at 31% and the smallest at 8%
- Special challenge: Many examples contain the same tokens in different orders, where order changes meaning (e.g., obligations, exceptions, termination conditions)
Constraints
- Inference latency must remain under 120ms per document on a single T4 GPU
- The solution must support sequence lengths up to 512 tokens
- The team wants an implementation in modern Python using PyTorch and Hugging Face Transformers
- The explanation must compare fixed sinusoidal and learned positional embeddings
Requirements
- Explain why self-attention alone is permutation-invariant (permuting the input tokens merely permutes the outputs, so pooled representations are unchanged) and why positional information is required; a toy demonstration follows this list.
- Implement a preprocessing pipeline for tokenization, padding, truncation, and attention masks; a tokenizer sketch appears after this list.
- Build and compare two Transformer encoders: one with fixed sinusoidal positional encodings and one with learned positional embeddings; both variants are sketched after this list.
- Fine-tune both models on the clause classification task and report performance and training trade-offs.
- Show, with examples, how removing positional encodings affects predictions on texts with reordered tokens; an ablation sketch follows this list.
- Recommend which positional encoding strategy to deploy and justify the choice.
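Reference Sketches
For the permutation-invariance requirement, the following is a minimal sketch in plain PyTorch with random toy embeddings (not LexiSearch code): a single self-attention layer without positional information maps a permuted input to an identically permuted output, so any order-insensitive pooling, such as the mean, produces the same vector for both orderings.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_model = 6, 16
x = torch.randn(seq_len, d_model)       # toy "token embeddings" carrying no positional signal
w_q = torch.randn(d_model, d_model)
w_k = torch.randn(d_model, d_model)
w_v = torch.randn(d_model, d_model)

def self_attention(tokens: torch.Tensor) -> torch.Tensor:
    """Single-head scaled dot-product self-attention over a (seq_len, d_model) input."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = (q @ k.T) / d_model ** 0.5
    return F.softmax(scores, dim=-1) @ v

perm = torch.randperm(seq_len)
out_original = self_attention(x)
out_permuted = self_attention(x[perm])

# Reordering the tokens only reorders the outputs by the same permutation ...
print(torch.allclose(out_original[perm], out_permuted, atol=1e-5))                    # True
# ... so a mean-pooled "sentence vector" cannot tell the two orderings apart.
print(torch.allclose(out_original.mean(dim=0), out_permuted.mean(dim=0), atol=1e-5))  # True
```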
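For the preprocessing requirement, a minimal tokenization sketch assuming Hugging Face's `AutoTokenizer`; the `bert-base-uncased` checkpoint, the example clauses, and the label ids are illustrative placeholders rather than the team's actual configuration.

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint

texts = [
    "The supplier may terminate the agreement upon thirty days written notice.",
    "Upon thirty days written notice the agreement may terminate the supplier.",
]
labels = [3, 5]  # hypothetical clause-type ids

# Pad/truncate to the 512-token ceiling from the constraints; the returned
# attention_mask marks real tokens (1) versus padding (0).
encoded = tokenizer(
    texts,
    padding="max_length",
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
batch = {
    "input_ids": encoded["input_ids"],
    "attention_mask": encoded["attention_mask"],
    "labels": torch.tensor(labels),
}
print(batch["input_ids"].shape, batch["attention_mask"].shape)  # torch.Size([2, 512]) twice
```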
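For the encoder comparison, sketches of the two positional encoding variants to be evaluated; the class names, the `(batch, seq_len, d_model)` input convention, and an even `d_model` are assumptions made for illustration.

```python
import math
import torch
import torch.nn as nn

class SinusoidalPositionalEncoding(nn.Module):
    """Fixed sin/cos encodings as in Vaswani et al. (2017); adds no trainable parameters."""
    def __init__(self, d_model: int, max_len: int = 512):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)                     # (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe)                                    # fixed, not learned

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); add the first seq_len position vectors.
        return x + self.pe[: x.size(1)].unsqueeze(0)

class LearnedPositionalEmbedding(nn.Module):
    """Trainable per-position vectors, as used by BERT-style encoders."""
    def __init__(self, d_model: int, max_len: int = 512):
        super().__init__()
        self.pos_emb = nn.Embedding(max_len, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        positions = torch.arange(x.size(1), device=x.device)
        return x + self.pos_emb(positions).unsqueeze(0)
```

Either module would be added to the token embeddings before the encoder stack. The sinusoidal variant introduces no extra parameters and can in principle extrapolate beyond lengths seen in training, while the learned variant can adapt to corpus-specific position patterns but is capped at `max_len`; the fine-tuning comparison in the requirements should show which matters more for 512-token contract segments.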
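For the reordered-token analysis, a hedged sketch of one way to probe what positional information contributes: zero the learned position-embedding table of an already fine-tuned BERT-style classifier and compare its predictions on a clause and a token-reordered counterpart. The checkpoint name is hypothetical, and the `model.bert.embeddings.position_embeddings` attribute path assumes a BERT architecture; other architectures expose the table differently.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "lexisearch/clause-classifier"  # hypothetical fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint).eval()

# Two clauses built from the same tokens in different orders, with different meanings.
pair = [
    "The tenant may terminate the lease if the landlord breaches the agreement.",
    "The landlord may terminate the agreement if the tenant breaches the lease.",
]

def predict(texts):
    enc = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        return model(**enc).logits.argmax(dim=-1)

print("with positions:   ", predict(pair))  # predictions may differ, since the meanings differ

# Ablate order information by zeroing the learned position-embedding table in place.
model.bert.embeddings.position_embeddings.weight.data.zero_()
# Without positions the encoder sees only a bag of tokens, so the two clauses
# should now receive the same prediction (up to floating-point noise).
print("without positions:", predict(pair))
```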