In a Databricks observability workflow, you need to process a massive web server log and identify which client IP addresses generated the most server-side failures. Write a function that scans log lines and returns the top k IP addresses by number of HTTP 5xx responses.
Implement a function that takes:
log_lines: a list of strings, where each string is one log entry
k: an integer
Each valid log line follows this simplified format:
"<ip> - - [timestamp] \"METHOD PATH HTTP/1.1\" status bytes"
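A line should count toward an IP only if it is well formed and its status code is an integer in the range 500 to 599.
Return a list of at most k [ip, count] pairs, sorted by count in descending order.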
Ignore malformed lines and lines with non-integer status codes.
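The counting rule above can be sketched in Python as a small per-line check. This is one possible reading of the simplified format, not a definitive parser: the regular expression, the helper name failing_ip, and the decision to strip surrounding whitespace are assumptions made for illustration. It treats any line that does not match the pattern exactly, or whose status field is not made of ASCII digits, as malformed.

```python
import re
from typing import Optional

# ASSUMPTION: a "valid" line is <ip> - - [anything] "anything" <digits> <token>,
# with single spaces between fields. The problem does not define this precisely.
_LOG_LINE = re.compile(r'(\S+) - - \[[^\]]*\] "[^"]*" ([0-9]+) \S+')


def failing_ip(line: str) -> Optional[str]:
    """Return the client IP if the line is well formed and has a 5xx status, else None."""
    match = _LOG_LINE.fullmatch(line.strip())
    if match is None:
        # Malformed line, or a status field that is not an integer.
        return None
    ip, status = match.groups()
    return ip if 500 <= int(status) <= 599 else None
```

Because the status group only admits ASCII digits, the int() conversion cannot raise, so non-integer status codes are rejected by the pattern itself rather than by exception handling.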
Example 1
Input: log_lines = ["10.0.0.1 - - [a] \"GET / HTTP/1.1\" 500 12", "10.0.0.2 - - [a] \"GET /x HTTP/1.1\" 503 9", "10.0.0.1 - - [a] \"GET /y HTTP/1.1\" 502 7"], k = 2
Output: [["10.0.0.1", 2], ["10.0.0.2", 1]]
Explanation: 10.0.0.1 has two 5xx responses and 10.0.0.2 has one.
Example 2
Input: log_lines = ["1.1.1.1 - - [a] \"GET / HTTP/1.1\" 200 10", "bad line", "2.2.2.2 - - [a] \"GET / HTTP/1.1\" 500 8"], k = 10
Output: [["2.2.2.2", 1]]
Explanation: non-5xx and malformed lines are ignored.
1 <= len(log_lines) <= 10^5
1 <= len(log_lines[i]) <= 300
1 <= k <= 10^4
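With input of this size, a single pass that tallies counts in a hash map and then selects the k largest entries should be sufficient. The Python sketch below shows one way to do that. The function name top_k_5xx_ips and the regular expression are illustrative assumptions, and because no tie-break rule is given here, the sketch assumes that IPs with equal counts are ordered by ascending IP string.

```python
import heapq
import re
from collections import Counter
from typing import List

# ASSUMPTION: a "valid" line is <ip> - - [anything] "anything" <digits> <token>.
_LOG_LINE = re.compile(r'(\S+) - - \[[^\]]*\] "[^"]*" ([0-9]+) \S+')


def top_k_5xx_ips(log_lines: List[str], k: int) -> List[list]:
    """Return up to k [ip, count] pairs for the IPs with the most 5xx responses."""
    counts: Counter = Counter()
    for line in log_lines:
        match = _LOG_LINE.fullmatch(line.strip())
        # Skip malformed lines; count only statuses in the range 500 to 599.
        if match and 500 <= int(match.group(2)) <= 599:
            counts[match.group(1)] += 1

    # Highest count first. ASSUMPTION: ties are broken by ascending IP string.
    top = heapq.nsmallest(k, counts.items(), key=lambda item: (-item[1], item[0]))
    return [[ip, count] for ip, count in top]


example = [
    '10.0.0.1 - - [a] "GET / HTTP/1.1" 500 12',
    '10.0.0.2 - - [a] "GET /x HTTP/1.1" 503 9',
    '10.0.0.1 - - [a] "GET /y HTTP/1.1" 502 7',
]
print(top_k_5xx_ips(example, 2))  # [['10.0.0.1', 2], ['10.0.0.2', 1]]
```

The scan is linear in the total number of characters read, and heapq.nsmallest keeps only k candidates at a time, so selection costs roughly O(m log k) for m distinct failing IPs instead of a full sort.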