Business Context
Squarespace wants to power multiple NLP features inside Squarespace AI Assistant, including content generation, intent routing, moderation, and merchant support summarization. Design a model serving stack that reliably handles these language workloads for website owners and commerce customers with low latency and safe fallbacks.
Data
You will serve models over mixed text traffic from several Squarespace surfaces: website copy prompts, customer support chats, help-center search queries, and Commerce product descriptions.
- Volume: ~8M requests/day, with 5x peak traffic during launches and seasonal commerce events
- Text length: 5-2,500 tokens per request; median 180 tokens
- Language: English-first, with growing multilingual traffic
- Workload mix: ~55% generation, 25% classification/routing, 15% summarization, 5% moderation
- Label distribution: Highly imbalanced for safety/moderation classes; most traffic is benign
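The traffic figures above imply concrete per-task capacity targets. As a starting point for capacity planning or load-test generation, the numbers from the brief can be turned into peak requests-per-second per workload; the helper below is an illustrative sketch, not part of any existing system.

```python
# Load model built from the stated traffic shape: ~8M requests/day,
# a 5x peak factor, and the given workload mix. The constants come
# from the brief; the function name is hypothetical.
REQS_PER_DAY = 8_000_000
PEAK_FACTOR = 5
MIX = {
    "generation": 0.55,
    "classification": 0.25,
    "summarization": 0.15,
    "moderation": 0.05,
}

def peak_rps_by_task() -> dict[str, float]:
    """Split peak requests-per-second across the workload mix."""
    base_rps = REQS_PER_DAY / 86_400  # average RPS over a day
    return {task: base_rps * PEAK_FACTOR * share for task, share in MIX.items()}
```

At ~93 RPS average, a 5x peak means provisioning for roughly 460 RPS overall, more than half of it GPU-bound generation.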
Success Criteria
A strong solution should deliver p95 latency under 300 ms for lightweight NLP tasks and under 2.5 s for generation, maintain 99.9% availability, degrade safely during traffic spikes, and provide measurable quality monitoring for drift and hallucination.
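The latency targets can be verified offline against recorded request samples. A minimal sketch of such an SLO check follows, assuming per-task latency samples in milliseconds; the budget values come from the criteria above, while the function names are illustrative.

```python
# Offline SLO check: compute p95 latency per task class and compare
# against the stated budgets (300 ms lightweight, 2.5 s generation).
import statistics

SLO_MS = {"lightweight": 300, "generation": 2500}

def p95(samples_ms: list[float]) -> float:
    """95th-percentile latency from a list of samples (needs >= 2 points)."""
    cuts = statistics.quantiles(samples_ms, n=100)  # 99 cut points
    return cuts[94]  # the 95th percentile

def meets_slo(task_class: str, samples_ms: list[float]) -> bool:
    return p95(samples_ms) <= SLO_MS[task_class]
```

In production the same check would run continuously over a sliding window rather than a static list, feeding the monitoring requirements described later.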
Constraints
- Some workloads require streaming responses in the Squarespace editor
- Sensitive customer content must remain in approved infrastructure
- GPU capacity is limited and expensive
- Models will be updated frequently with new prompts, adapters, and safety policies
Requirements
- Architect a serving system for multiple NLP task types, not just one model.
- Explain request routing, batching, autoscaling, caching, and fallback behavior.
- Describe preprocessing for prompts, long inputs, and multilingual traffic.
- Propose how you would serve both fine-tuned transformer classifiers and larger generative models.
- Define monitoring for latency, cost, safety, and model-quality regressions.
- Include a modern Python implementation sketch for inference routing and evaluation.
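As a seed for the requested implementation sketch, the routing and fallback requirements can be expressed as a small task-based dispatch table: each task type maps to a primary model pool, a cheaper fallback, and a latency budget matching the success criteria. All endpoint names here are hypothetical placeholders, not real services.

```python
# Minimal sketch of task-based inference routing with safe fallback.
# Pool names ("llm-gpu-pool", etc.) are illustrative assumptions.
from dataclasses import dataclass
from enum import Enum

class TaskType(Enum):
    GENERATION = "generation"
    CLASSIFICATION = "classification"
    SUMMARIZATION = "summarization"
    MODERATION = "moderation"

@dataclass(frozen=True)
class Route:
    primary: str      # e.g. a GPU-backed generative pool
    fallback: str     # e.g. a small CPU model or rule-based path
    timeout_s: float  # per-task latency budget

ROUTES = {
    TaskType.GENERATION:     Route("llm-gpu-pool", "llm-small-cpu", 2.5),
    TaskType.CLASSIFICATION: Route("clf-transformer", "clf-keyword", 0.3),
    TaskType.SUMMARIZATION:  Route("sum-gpu-pool", "extractive-sum", 2.5),
    TaskType.MODERATION:     Route("mod-classifier", "rule-based-filter", 0.3),
}

def route_request(task: TaskType, healthy: set[str]) -> tuple[str, float]:
    """Pick an endpoint and latency budget for a request.

    Falls back to the cheaper path when the primary pool is
    unhealthy or shedding load during a traffic spike.
    """
    route = ROUTES[task]
    endpoint = route.primary if route.primary in healthy else route.fallback
    return endpoint, route.timeout_s
```

A full answer would layer dynamic batching, caching, and autoscaling signals onto this dispatch core, but the table makes the degradation story explicit: every task has a cheap path that keeps the product responsive when GPU capacity runs out.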