Testing Non-Deterministic GenAI Outputs

Core Question

Explain the main behaviors of GenAI systems that affect testing, and how you would adapt a coding-oriented test strategy to handle them.

Address these sub-questions:

Which GenAI behaviors most commonly break standard deterministic test assumptions?

How would you design automated tests for outputs that are variable but still valid?

What failure modes should be covered beyond simple correctness, such as hallucination, formatting drift, and prompt sensitivity?

Scope Guidance

The interviewer expects a practical engineering explanation, not a product discussion. Focus on test design, validation logic, reproducibility, robustness checks, and how algorithmic techniques such as normalization, rule-based validation, and similarity scoring can be used in code-driven evaluation pipelines.

Non-Determinism

GenAI outputs are often stochastic, so the same prompt may produce different valid responses across runs. Testing must therefore verify properties or constraints of the output rather than rely only on exact string equality.

# Cheap structural check only: a substring match does not prove the output is valid JSON.
valid = output.strip().startswith('{') and 'answer' in output
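One way to turn the idea above into a repeatable test is to sample the system several times and assert properties on each response. The following is a minimal sketch, not a definitive implementation: `fake_generate` is a hypothetical stand-in for a real model call, and the property names and the "Paris" fact check are illustrative assumptions.

```python
import json
import random


def fake_generate(prompt: str, rng: random.Random) -> str:
    """Hypothetical stand-in for a model call: varied wording, same meaning."""
    phrasing = rng.choice(["Paris", "Paris.", "The capital is Paris"])
    return json.dumps({"answer": phrasing})


def check_properties(output: str) -> list[str]:
    """Return the violated properties; an empty list means the output passes."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return ["not_json"]
    if not isinstance(data, dict):
        return ["not_object"]
    answer = data.get("answer")
    if not isinstance(answer, str) or not answer.strip():
        return ["missing_answer"]
    # Check the fact itself, not the exact wording around it.
    if "paris" not in answer.lower():
        return ["wrong_fact"]
    return []


def pass_rate(prompt: str, runs: int = 20, seed: int = 0) -> float:
    """Fraction of sampled outputs that satisfy every property."""
    rng = random.Random(seed)  # seeded so the test itself is reproducible
    passed = sum(
        not check_properties(fake_generate(prompt, rng)) for _ in range(runs)
    )
    return passed / runs
```

A test can then assert that `pass_rate(prompt)` stays at or above an agreed threshold, which tolerates harmless variation in wording while still failing on structural or factual errors.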

Prompt Sensitivity

Small wording, formatting, or context changes in prompts can cause large output changes. Good testing includes prompt perturbation tests to measure whether behavior remains within acceptable bounds.

variants = [prompt, prompt + '\nBe concise.', prompt.replace('summarize', 'briefly summarize')]
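A perturbation test can be sketched as follows: generate several rewordings of one prompt, run each through the system, and measure how often the outputs agree. This is an illustrative sketch only; `fake_model` is a hypothetical toy labeler standing in for a real model call, and the particular perturbations are assumptions chosen for the example.

```python
from collections import Counter


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a model call: a toy sentiment labeler."""
    return "positive" if "great" in prompt.lower() else "negative"


def make_variants(prompt: str) -> list[str]:
    """Perturbations that should not change the expected label."""
    return [
        prompt,
        prompt + "\nBe concise.",
        prompt.replace("Classify", "Please classify"),
        "  " + prompt.upper() + "  ",
    ]


def agreement_rate(prompt: str) -> float:
    """Share of variants that produce the most common label."""
    labels = [fake_model(variant) for variant in make_variants(prompt)]
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)
```

Reporting an agreement rate rather than a single pass/fail makes it possible to set an acceptable bound per task and to track drift when the prompt or model changes.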

Oracle Design

A test oracle defines how correctness is judged. For GenAI, the oracle is often a combination of exact checks, schema validation, keyword coverage, semantic similarity, and safety rules.

required_keys = {'label', 'reason'}
passed = isinstance(result, dict) and required_keys.issubset(result)
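The layered oracle described above might be composed roughly like this. It is a sketch under stated assumptions: the function name `oracle`, its parameters, and the failure labels are hypothetical, and `difflib.SequenceMatcher` is used as a simple lexical stand-in for semantic similarity (a real pipeline would more likely compare embeddings).

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting noise is ignored."""
    return " ".join(text.lower().split())


def oracle(result, reference_reason, allowed_labels, required_terms,
           min_similarity=0.5):
    """Return the list of failed checks; an empty list means the result passes."""
    required_keys = {"label", "reason"}
    # Schema check first: later checks assume both keys exist.
    if not isinstance(result, dict) or not required_keys.issubset(result):
        return ["schema"]
    failures = []
    # Exact check on the closed-vocabulary field.
    if result["label"] not in allowed_labels:
        failures.append("label")
    reason = normalize(str(result["reason"]))
    # Keyword coverage on the free-text field.
    if any(term.lower() not in reason for term in required_terms):
        failures.append("keywords")
    # Lexical similarity to a reference explanation (assumed proxy).
    score = SequenceMatcher(None, reason, normalize(reference_reason)).ratio()
    if score < min_similarity:
        failures.append("similarity")
    return failures
```

Returning the names of failed checks, rather than a single boolean, keeps test reports actionable: a schema failure and a low similarity score usually call for different fixes.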

Metamorphic Testing

When exact expected outputs are hard to define, input transformations with a known expected effect can still verify consistency. For example, reordering irrelevant context or changing whitespace should not materially change a classification result.

assert classify(text) == classify('  ' + text + '  ')
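The single assertion above generalizes to a small table of label-preserving transformations, each checked against the baseline prediction. The sketch below assumes a hypothetical keyword-based `classify` in place of a real model-backed classifier, and the transformation names are illustrative.

```python
import random


def classify(text: str) -> str:
    """Hypothetical stand-in for a model-backed classifier."""
    return "urgent" if "outage" in text.lower() else "routine"


def transformations(sentences: list[str], seed: int = 0) -> dict[str, str]:
    """Build variants that should leave the label unchanged."""
    rng = random.Random(seed)
    shuffled = sentences[:]
    rng.shuffle(shuffled)  # reorder context without adding or removing facts
    joined = " ".join(sentences)
    return {
        "whitespace": "  " + joined + "  ",
        "reordered": " ".join(shuffled),
        "extra_newlines": "\n\n".join(sentences),
    }


def violated_relations(sentences: list[str]) -> list[str]:
    """Names of transformations whose output label differs from the baseline."""
    baseline = classify(" ".join(sentences))
    return [
        name
        for name, variant in transformations(sentences).items()
        if classify(variant) != baseline
    ]
```

No expected label is needed here: the test only asserts that `violated_relations(...)` is empty, so it works even when the correct answer for an input is unknown.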

Failure Mode Coverage

Testing should cover more than task success. Common GenAI-specific failures include hallucinations, instruction omission, unsafe content, malformed structured output, and unstable behavior across repeated runs.

if not json_valid(output):
    failures.append('schema_error')
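Several of the failure modes listed above can be collected by one checker that returns named failures. The following is a rough sketch with deliberately crude heuristics, all of which are assumptions for illustration: the `summary` field, the failure labels, the number-matching proxy for hallucination, the banned-term list for unsafe content, and the word limit as an example of an ignored instruction.

```python
import json
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def detect_failures(output: str, source_text: str,
                    banned_terms: set[str], max_words: int = 50) -> list[str]:
    """Return named failure modes found in one model output."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return ["malformed_json"]
    if not isinstance(data, dict) or "summary" not in data:
        return ["schema_error"]
    summary = str(data["summary"])
    failures = []
    # Crude hallucination proxy: numbers that never appear in the source.
    if set(NUMBER.findall(summary)) - set(NUMBER.findall(source_text)):
        failures.append("unsupported_number")
    # Crude safety proxy: case-insensitive banned-term match.
    if any(term.lower() in summary.lower() for term in banned_terms):
        failures.append("unsafe_content")
    # Instruction check: the prompt is assumed to have set a word limit.
    if len(summary.split()) > max_words:
        failures.append("instruction_ignored")
    return failures


def is_unstable(labels: list[str]) -> bool:
    """True when repeated runs on the same input disagree."""
    return len(set(labels)) > 1
```

Counting each failure name across a test suite gives a per-mode failure rate, which is usually more informative than one aggregate accuracy number.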

