Audit Logs Example Queries(Python)

CLUSTERS

Admins typically need to monitor the number of clusters that their users create. Combine this with the ability to track the number of clusters without autotermination and you can identify any idle clusters that won't ever shut down. If you're following along with the sample data, then you'll see that all the data occur on one date: 12/28/19.

%sql
USE audit_logs;
%sql
SELECT date, count(*) AS num_clusters 
FROM clusters 
WHERE actionName = 'create' 
GROUP BY 1 
ORDER BY 1 ASC

There’s not much context because we don’t have data from other days, but for the sake of this exercise, let’s assume that the number of clusters more than tripled any other day and the number of users did not change meaningfully during that time period. If this were truly the case, then the only reasonable explanation would be that the clusters were likely created programmatically using jobs. Additionally, 12/28/19 was a Saturday, so we don't expect there to be many interactive clusters created.

Since we know that the cluster_creator tag in the requestParams field tell us who / what created a cluster, we can use to see what caused the spike.

%sql
SELECT requestParams.cluster_creator, actionName, count(*) 
FROM clusters 
WHERE date = '2019-12-28' 
GROUP BY 1,2 
ORDER BY 3 DESC

Based on the results above, we notice that JOB_LAUNCHER created 709 clusters, out of 714 total clusters created on 12/28/19, which confirms our intuition.

Our next step is to figure out which jobs created this, which we can extract from the cluster name. Databricks jobs clusters follow this naming convention job-<jobId>-run-<runId>, so we can parse the jobId using split().

%sql
SELECT split(requestParams.cluster_name, "-")[1] AS jobId, count(*) 
FROM clusters 
WHERE actionName = 'create' AND date = '2019-12-28'
GROUP BY 1 
ORDER BY 2 DESC

Here we see that jobId 31303 is the culprit for the vast majority of clusters created on 12/28/19. Another piece of information that the audit logs store in requestParams is the user_id of the user who created the job. Since the creator of the job is immutable, we can just take the first record.

%sql
SELECT requestParams.user_id 
FROM clusters 
WHERE actionName = 'create' AND date = '2019-12-28' AND split(requestParams.cluster_name, "-")[1] = '31303' 
LIMIT 1

You can utilize the SCIM API to get the email address of the user_id to discuss or just inspect the job yourself.

As discussed above, it's helpful for Databricks administrators to understand the number of interactive clusters created without any autotermination, as DBUs accrue based on cluster uptime, not utilization.

%sql
SELECT date, count(*) AS num_clusters 
FROM clusters 
WHERE actionName = 'create' AND requestParams.autotermination_minutes = 0 AND requestParams.cluster_creator IS null 
GROUP BY 1
ORDER BY 1 ASC

If you're utilizing our example data, you'll notie that there are 5 clusters whose cluster_creator is null which means that they were created by users and not by jobs.

%sql
SELECT DISTINCT email, requestParams.cluster_name
FROM clusters 
WHERE actionName = 'create' AND requestParams.autotermination_minutes = 0 AND requestParams.cluster_creator IS null 

By selecting the creator's email address and the cluster's name, we can identify which cluster we need to terminate and which user we need to better educate on best practices around resource management.