In Databricks, unit-testing a Spark transformation often means verifying that an output collection of rows matches the expected result even when row order is not guaranteed. Implement a Python function that compares two datasets and reports whether they are equivalent after applying deterministic normalization rules.
Write a function assert_spark_rows_equal(actual_rows, expected_rows, key_columns).
Parameters:
- actual_rows: list of dictionaries representing transformed output rows
- expected_rows: list of dictionaries representing expected rows
- key_columns: list of column names used to sort rows deterministically before comparison

The function should return True if:
- both lists contain the same number of rows, and
- after sorting each list by key_columns, the rows at each position are equal.

Normalization rules:
- A column missing from a row is treated as None.
Input: actual_rows = [{"id": 2, "value": "b"}, {"id": 1, "value": "a"}], expected_rows = [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}], key_columns = ["id"]
Output: True
Explanation: The rows are identical after sorting by id.
Example 2
Input: actual_rows = [{"id": 1, "value": "a"}], expected_rows = [{"id": 1, "value": "x"}], key_columns = ["id"]
Output: False
Explanation: The value field differs.
Constraints:
- 0 <= len(actual_rows), len(expected_rows) <= 10^4
- 0 <= len(key_columns) <= 10
- all elements of key_columns are strings
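A minimal reference sketch of one way to implement this. It assumes that a column absent from a row normalizes to None, and that key values within a column are mutually comparable (the sort key wraps values so None sorts before non-None instead of raising a TypeError):

```python
def assert_spark_rows_equal(actual_rows, expected_rows, key_columns):
    """Return True when both row lists match after deterministic sorting."""
    if len(actual_rows) != len(expected_rows):
        return False

    # Collect every column name seen, so sparse rows normalize consistently.
    all_columns = set()
    for row in actual_rows + expected_rows:
        all_columns.update(row)

    def normalize(row):
        # Assumed rule: a column missing from a row is treated as None.
        return {col: row.get(col) for col in all_columns}

    def sort_key(row):
        # (is-not-None, value) pairs let None sort deterministically
        # before any non-None value of the same column.
        return tuple((row.get(col) is not None, row.get(col))
                     for col in key_columns)

    actual_sorted = sorted((normalize(r) for r in actual_rows), key=sort_key)
    expected_sorted = sorted((normalize(r) for r in expected_rows), key=sort_key)
    return actual_sorted == expected_sorted
```

Sorting both sides by the same key makes the comparison order-insensitive, which mirrors how Spark actions like collect() give no ordering guarantee without an explicit orderBy.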