Test Smart Compose Feature

Context

InboxPro, an email productivity app, wants to launch Smart Compose, a feature that suggests the rest of a sentence while a user writes an email. Product leadership wants a rigorous online experiment before rollout.

Hypothesis Seed

The team believes Smart Compose will reduce email drafting friction, leading to more sent emails per active user and faster compose completion. However, there is risk that low-quality suggestions could frustrate users, increase undo/delete behavior, or reduce reply quality.

Constraints

Eligible traffic: 120,000 daily active compose users
Only 60% of DAU are eligible because the feature is available only in English on mobile and desktop
Maximum experiment duration: 14 days
Randomization must be stable for the same user across sessions
False positives are costly because poor suggestions could damage trust and retention; false negatives are acceptable if the feature can be iterated and re-tested next month
Engineering asks for a clear ship/no-ship decision by the end of the 14-day window
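
One common way to meet the stable-randomization constraint is to hash the user ID with an experiment-specific salt, so the same user lands in the same arm in every session and on every device. The sketch below is illustrative only: the function name assign_arm, the salt string, and the 50/50 split are assumptions, not part of the prompt.

```python
import hashlib


def assign_arm(user_id: str, salt: str = "smart-compose-v1", treatment_pct: int = 50) -> str:
    """Deterministically map a user to an arm.

    The salt and the 50/50 split are hypothetical choices for illustration.
    """
    # Hash salt + user ID so assignment is stable across sessions and
    # independent of assignments made by other experiments (different salts).
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # bucket in 0..99
    return "treatment" if bucket < treatment_pct else "control"
```

Because the assignment depends only on the salt and the user ID, no assignment table has to be stored, and changing the salt reshuffles users for a later re-test.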

Deliverables

Define the experiment hypothesis, the primary metric, and 2-4 guardrail metrics. Be explicit about the metric formula, unit of analysis, baseline, and minimum detectable effect (MDE).
Calculate the required sample size per arm using explicit assumptions for alpha, power, baseline rate/variance, and MDE. Translate that into expected runtime given the available traffic.
Choose the unit of randomization and justify it. Explain whether the analysis unit should match the randomization unit and how you would handle any mismatch.
Pre-register the analysis plan: statistical test, peeking policy, treatment of secondary metrics, and how you will check for sample ratio mismatch.
State a professional decision rule for ship / don’t ship / iterate that respects guardrails, and discuss key pitfalls such as novelty effects, interference across users, and SUTVA violations.
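
The sample ratio mismatch check mentioned above is usually a chi-square goodness-of-fit test on the number of users assigned to each arm. A minimal sketch follows, assuming a planned 50/50 split; the function name srm_p_value is hypothetical, and the alert threshold (0.001 is a common choice) is an assumption that would be fixed in the pre-registration.

```python
import math


def srm_p_value(n_control: int, n_treatment: int, expected_treatment_share: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for arm counts.

    Counts should be in the randomization unit (users), not sessions.
    """
    total = n_control + n_treatment
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_treatment - expected_treatment) ** 2 / expected_treatment)
    # For 1 degree of freedom, P(chi2_1 > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))
```

A very small p-value suggests the assignment or logging pipeline is broken, in which case the metric comparison should not be trusted until the cause is found.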

Assume the current baseline send-through rate per compose session is 0.38 and that Smart Compose, if it works, improves it by roughly 2% in relative terms (to about 0.388). Also assume each eligible user starts 1.8 compose sessions per day on average.
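
The sample size deliverable can be sketched with the standard two-proportion formula. Everything beyond the 0.38 baseline, the 2% relative lift, and the 1.8 sessions per user per day is an assumption here: alpha = 0.01 two-sided (chosen because false positives are costly), power = 0.80, treating the 120,000 figure as already-eligible users, and a design_effect parameter that inflates the session count to account for sessions being clustered within users.

```python
import math
from statistics import NormalDist


def sessions_per_arm(baseline: float, relative_mde: float, alpha: float, power: float) -> int:
    """Sessions needed per arm for a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def runtime_days(n_per_arm: int, eligible_users_per_day: int,
                 sessions_per_user_per_day: float, design_effect: float = 1.0) -> int:
    """Days needed to collect n_per_arm sessions with a 50/50 split.

    design_effect > 1 is a hypothetical inflation for within-user correlation.
    """
    sessions_per_arm_per_day = eligible_users_per_day * sessions_per_user_per_day / 2
    return math.ceil(n_per_arm * design_effect / sessions_per_arm_per_day)


n = sessions_per_arm(baseline=0.38, relative_mde=0.02, alpha=0.01, power=0.80)
days = runtime_days(n, eligible_users_per_day=120_000, sessions_per_user_per_day=1.8)
```

Under these assumptions the requirement is on the order of 95,000 sessions per arm, which the stated traffic covers quickly. The binding constraint is then less about raw sample size and more about running long enough to cover weekly cycles and novelty effects within the 14-day limit.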

