Tokenize Databricks Agent JSON Logs

Problem

In Databricks, AI agent traces from the Mosaic AI Agent Framework or Model Serving can be exported as nested JSON logs. Write a Python function that parses a JSON log string and returns a normalized sequence of lowercase tokens from selected interaction fields.

The function should traverse a JSON document (an object or an array) that may contain nested dictionaries and lists, and extract string values only from these keys: user_message, agent_response, tool_name, tool_input, tool_output, retrieved_context, and model. Ignore all other keys and all non-string values. Tokenization rules: convert each extracted string to lowercase, split it on any non-alphanumeric character, discard empty tokens, and preserve encounter order.
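The tokenization rules for a single extracted string can be sketched as follows. This is a minimal illustration, not a reference solution: the helper name tokenize is hypothetical, and the sketch assumes "alphanumeric" means ASCII letters and digits only, which the problem statement does not specify.

```python
import re
from typing import List

# Assumption: "alphanumeric" means ASCII letters and digits only, so
# underscores, punctuation, whitespace, and non-ASCII characters all
# act as separators.
_TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text: str) -> List[str]:
    """Lowercase the text and return its maximal alphanumeric runs in order."""
    # Matching runs of allowed characters is equivalent to splitting on
    # everything else and discarding the empty pieces.
    return _TOKEN_RE.findall(text.lower())


print(tokenize("Use spark.read.format"))
```

Matching alphanumeric runs with findall avoids the empty strings that a plain split produces when two separators are adjacent or when the string starts or ends with a separator.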

Input / Output

Input: log_json — a valid JSON string representing a nested object or array
Output: a list of tokens in the order they are encountered during a left-to-right depth-first traversal

Examples

Example 1

Input: log_json = '{"trace_id":"t1","user_message":"How do I query Delta Lake?","steps":[{"tool_name":"Vector Search","tool_input":"Unity Catalog docs"}],"agent_response":"Use spark.read.format"}'
Output: ["how","do","i","query","delta","lake","vector","search","unity","catalog","docs","use","spark","read","format"]

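Explanation: Only allowed keys are processed, and strings are lowercased. Punctuation and whitespace act as token separators, so spark.read.format yields three tokens. The value of trace_id is skipped because that key is not in the allowed set.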

Example 2

Input: log_json = '[{"model":"DBRX","agent_response":"Call Foundation Model APIs."},{"ignored":"x","tool_output":"mlflow.evaluate scores"}]'
Output: ["dbrx","call","foundation","model","apis","mlflow","evaluate","scores"]

Explanation: The key "ignored" is skipped because it is not in the allowed set.

Constraints

The JSON string is valid
Total number of JSON nodes is at most 10^5
Total length of all string values is at most 10^6
Keys are case-sensitive and must match exactly
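
One possible approach, given as a sketch under stated assumptions rather than a definitive solution: parse the string with json.loads and walk the result with an explicit stack, so the traversal itself does not recurse on inputs with up to 10^5 nodes. The function name tokenize_agent_log is hypothetical. The sketch assumes ASCII-only alphanumerics, descends into every nested dictionary and list regardless of its key, and extracts a string only when it is the direct value of an allowed key (strings that are list elements are ignored). The problem statement leaves these points open.

```python
import json
import re
from typing import Any, List, Optional, Tuple

# Keys whose direct string values are tokenized (case-sensitive, exact match).
ALLOWED_KEYS = frozenset({
    "user_message", "agent_response", "tool_name", "tool_input",
    "tool_output", "retrieved_context", "model",
})

# Assumption: "alphanumeric" means ASCII letters and digits only.
_TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize_agent_log(log_json: str) -> List[str]:
    """Return lowercase tokens from allowed keys in left-to-right DFS order."""
    tokens: List[str] = []
    # Each stack entry is (key the value was found under, value).
    # List elements and the root have no key, so they carry None.
    stack: List[Tuple[Optional[str], Any]] = [(None, json.loads(log_json))]

    while stack:
        key, value = stack.pop()
        if isinstance(value, str):
            if key in ALLOWED_KEYS:
                tokens.extend(_TOKEN_RE.findall(value.lower()))
        elif isinstance(value, dict):
            # Push in reverse so the first entry is popped first.
            for child_key, child in reversed(list(value.items())):
                stack.append((child_key, child))
        elif isinstance(value, list):
            for child in reversed(value):
                stack.append((None, child))
        # Numbers, booleans, and null are ignored.

    return tokens


example = '[{"model":"DBRX","agent_response":"Call Foundation Model APIs."},{"ignored":"x","tool_output":"mlflow.evaluate scores"}]'
print(tokenize_agent_log(example))
```

Pushing every child as a (key, value) pair, instead of tokenizing strings while iterating over a dictionary, is what preserves encounter order: a string that follows a nested list in the same object is only processed after that list has been fully traversed. Each node is pushed and popped once and each character is scanned once, so the running time is linear in the number of nodes plus the total string length.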

