
In Databricks, AI agent traces from the Mosaic AI Agent Framework or Model Serving can be exported as nested JSON logs. Write a Python function that parses a JSON log string and returns a normalized sequence of lowercase tokens from selected interaction fields.
The function should traverse a JSON value that may contain nested dictionaries and lists, and extract string values only from these keys: user_message, agent_response, tool_name, tool_input, tool_output, retrieved_context, and model. Tokenization rules: convert to lowercase, split on any non-alphanumeric character, discard empty tokens, and preserve encounter order. Do not extract strings stored under any other key, and ignore all non-string values; nested dictionaries and lists are still traversed whatever key they appear under, as with steps in Example 1.
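The tokenization rules above can be sketched for a single string as follows. This is a minimal illustration, not a reference solution: the helper name tokenize is hypothetical, and it assumes "alphanumeric" means ASCII letters and digits, which the problem statement does not specify.

```python
import re

# Assumption: "alphanumeric" is read as ASCII letters and digits only.
# Matching runs of [a-z0-9] is equivalent to splitting on every other
# character and discarding the empty pieces.
_TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text: str) -> list[str]:
    """Lowercase the text and return its alphanumeric runs in order."""
    return _TOKEN_RE.findall(text.lower())
```

Under these assumptions, tokenize("Use spark.read.format") yields use, spark, read, format, and a string made only of punctuation yields an empty list.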
log_json: a valid JSON string representing a nested object or array
Example 1
Input: log_json = '{"trace_id":"t1","user_message":"How do I query Delta Lake?","steps":[{"tool_name":"Vector Search","tool_input":"Unity Catalog docs"}],"agent_response":"Use spark.read.format"}'
Output: ["how","do","i","query","delta","lake","vector","search","unity","catalog","docs","use","spark","read","format"]
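Explanation: Only string values under allowed keys are tokenized, so trace_id is skipped. Strings are lowercased and split on spaces and punctuation, which is why spark.read.format yields three tokens.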
Example 2
Input: log_json = '[{"model":"DBRX","agent_response":"Call Foundation Model APIs."},{"ignored":"x","tool_output":"mlflow.evaluate scores"}]'
Output: ["dbrx","call","foundation","model","apis","mlflow","evaluate","scores"]
Explanation: The key ignored is skipped because it is not in the allowed set.
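One possible way to approach the task is sketched below. It is not an official solution: the name extract_tokens is hypothetical, and the code assumes that "alphanumeric" means ASCII letters and digits and that only a string stored directly as the value of an allowed key is extracted (strings that are items of a list are skipped, even when the list sits under an allowed key). An explicit stack is used instead of recursion so the traversal itself does not depend on Python's recursion limit.

```python
import json
import re

ALLOWED_KEYS = frozenset({
    "user_message", "agent_response", "tool_name", "tool_input",
    "tool_output", "retrieved_context", "model",
})
# Assumption: "alphanumeric" is read as ASCII letters and digits only.
_TOKEN_RE = re.compile(r"[a-z0-9]+")


def extract_tokens(log_json: str) -> list[str]:
    """Return lowercase tokens from allowed string fields, in encounter order."""
    tokens: list[str] = []
    # Each stack entry is (key, value); key is None for the root and list items.
    stack = [(None, json.loads(log_json))]
    while stack:
        key, value = stack.pop()
        if isinstance(value, dict):
            # Push in reverse so the first entry is popped (visited) first.
            stack.extend(reversed(list(value.items())))
        elif isinstance(value, list):
            # Assumption: list items carry no key, so bare strings in a list
            # are never extracted; nested containers are still traversed.
            stack.extend((None, item) for item in reversed(value))
        elif isinstance(value, str) and key in ALLOWED_KEYS:
            tokens.extend(_TOKEN_RE.findall(value.lower()))
        # Numbers, booleans, null, and strings under other keys are ignored.
    return tokens
```

Calling extract_tokens on the inputs of Example 1 and Example 2 reproduces the outputs listed above. Document order is preserved because json.loads keeps object keys in the order they appear in the string.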