OpenAI wants a lightweight classifier that routes incoming prompts on the Chat Completions API into product-relevant categories such as coding, summarization, safety-sensitive, and factual Q&A. You are given embeddings and metadata derived from a GPT-style transformer, and your task is to build a supervised model that performs accurate multi-class classification under tight latency constraints.
The dataset was generated from 1.2M historical prompts labeled offline using a frozen GPT-style transformer encoder head. Each row represents one prompt and includes dense embedding features, prompt statistics, and limited metadata.
| Feature Group | Count | Examples |
|---|---|---|
| Transformer embeddings | 768 | pooled_hidden_0 ... pooled_hidden_767 |
| Prompt statistics | 9 | token_count, avg_token_idf, punctuation_ratio, code_block_count |
| Categorical metadata | 5 | surface, customer_tier, locale, model_family, hour_bucket |
| Quality flags | 4 | language_detect_confidence, pii_flag, truncation_flag, moderation_score |
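The feature groups above suggest a simple mixed-type pipeline: pass the dense embeddings through as-is, scale the numeric prompt statistics, and one-hot encode the categorical metadata. A minimal sketch with scikit-learn, using synthetic data and only a few illustrative columns from the table (the real data has 768 embedding dimensions; 8 are used here for brevity):

```python
# Sketch of a lightweight routing classifier over the feature groups above.
# Column names follow the table; all data below is synthetic, for illustration only.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 400
emb_cols = [f"pooled_hidden_{i}" for i in range(8)]  # 768 in the real data
num_cols = ["token_count", "punctuation_ratio", "moderation_score"]
cat_cols = ["surface", "locale"]

df = pd.DataFrame(rng.normal(size=(n, len(emb_cols))), columns=emb_cols)
df["token_count"] = rng.integers(1, 512, size=n)
df["punctuation_ratio"] = rng.random(n)
df["moderation_score"] = rng.random(n)
df["surface"] = rng.choice(["api", "playground"], size=n)
df["locale"] = rng.choice(["en-US", "de-DE"], size=n)
y = rng.choice(["coding", "summarization", "safety_sensitive", "factual_qa"], size=n)

# Embeddings pass through untouched; stats are scaled; metadata is one-hot encoded.
pre = ColumnTransformer([
    ("emb", "passthrough", emb_cols),
    ("num", StandardScaler(), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])
clf = Pipeline([("pre", pre), ("lr", LogisticRegression(max_iter=1000))])
clf.fit(df, y)

probs = clf.predict_proba(df.head(3))
print(probs.shape)  # one probability per category for each of the 3 rows
```

A linear model over pooled transformer embeddings is a common baseline when latency matters: inference is a single matrix multiply, which is easy to keep under a tight p95 budget.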
A good solution should achieve macro-F1 of at least 0.78 overall and recall above 0.90 on the safety_sensitive class, both on a held-out test set, while keeping p95 online inference latency below 20 ms per request.
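The three targets can each be checked directly. A sketch of the evaluation, with synthetic labels, predictions, and per-request timings standing in for real model output:

```python
# Sketch: checking the stated targets — macro-F1, safety_sensitive recall,
# and p95 per-request latency. All data here is synthetic, for illustration only.
import numpy as np
from sklearn.metrics import f1_score, recall_score

labels = ["coding", "summarization", "safety_sensitive", "factual_qa"]
rng = np.random.default_rng(1)
y_true = rng.choice(labels, size=1000)
# Simulate a classifier that is right ~85% of the time.
y_pred = np.where(rng.random(1000) < 0.85, y_true, rng.choice(labels, size=1000))

# Macro-F1 averages per-class F1 with equal weight, so rare classes count fully.
macro_f1 = f1_score(y_true, y_pred, average="macro")

# Per-class recall; the safety_sensitive entry is the one with its own threshold.
per_class_recall = recall_score(y_true, y_pred, average=None, labels=labels)
safety_recall = per_class_recall[labels.index("safety_sensitive")]

# p95 latency: the 95th percentile of per-request inference times (ms).
lat_ms = rng.gamma(shape=2.0, scale=4.0, size=1000)  # synthetic timings
p95_ms = np.percentile(lat_ms, 95)

print(f"macro-F1={macro_f1:.3f} safety_recall={safety_recall:.3f} p95={p95_ms:.1f} ms")
```

Note that macro-F1 and per-class recall pull in different directions: biasing the classifier toward safety_sensitive raises its recall but can depress precision (and thus macro-F1) on the other classes, so the two thresholds have to be tuned jointly.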