[Script] Installing Databricks Libraries on Clusters Using API and CLI
I received a ticket to install several libraries on several Databricks clusters. The ticket also asked me to automate the process, because installing every library through the GUI is tedious and a waste of time.
Using Databricks CLI
The Databricks command-line interface (CLI) provides an easy-to-use interface to the Azure Databricks platform. The open source project is hosted on GitHub. The CLI is built on top of the Databricks REST API and is organized into command groups based on primary endpoints.
First, install databricks-cli if you don’t have it yet:
pip install databricks-cli
Then, configure the databricks CLI using:
databricks configure --token
It will prompt you for the Databricks host URL and a token. The prompt should look like this:
This is what you need to generate a token:
Now get the cluster IDs. The same command works whether you have one cluster or many:
CLUSTERS_ID=$(databricks clusters list --output json | jq -c -r '.clusters[].cluster_id')
To install the libraries, run a for loop over the $CLUSTERS_ID variable:
for CLUSTER_ID in $CLUSTERS_ID
do
  databricks libraries install --maven-coordinates "com.databricks.labs:overwatch_2.12:0.6.1.1" --cluster-id "$CLUSTER_ID"
  databricks libraries install --pypi-package adal --cluster-id "$CLUSTER_ID"
done
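Library installation runs in the background, so the loop finishing does not mean the libraries are ready. Below is a minimal sketch for checking afterwards, assuming the legacy databricks-cli installed above, which provides the `databricks libraries cluster-status` subcommand. The function name `show_library_status` is only illustrative.

```shell
# show_library_status is a hypothetical helper name, not part of the CLI.
# It prints the library status of every cluster ID passed as an argument.
show_library_status() {
  for cluster_id in "$@"; do
    echo "== ${cluster_id} =="
    databricks libraries cluster-status --cluster-id "${cluster_id}"
  done
}
```

You would call it as `show_library_status $CLUSTERS_ID`, leaving the variable unquoted so the shell splits it into one argument per cluster.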
Here you can get the maven coordinates:
Here you can look for pip packages:
Using the API
The Databricks REST API allows for programmatic management of various Azure Databricks resources.
To authenticate to the API, you need both the workspace URL and a token. You can copy the workspace URL from your browser:
Then, you have to create the following variables:
DATABRICKS_TOKEN=dapie5f72dbdsfsdg28adsfdsgfc59efdgfda48ac-9
WORKSPACEURL=https://adb-359439532458.3.azuredatabricks.net
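On a pipeline agent these two values usually arrive from secret variables rather than being typed into the script, so it can help to fail early when one of them is missing. The sketch below assumes the variable names used above; `check_settings` is a made-up helper name.

```shell
# check_settings is a hypothetical helper, not part of any Databricks tooling.
# It fails early when a required variable is empty and normalizes the URL.
check_settings() {
  if [ -z "${DATABRICKS_TOKEN:-}" ]; then
    echo "DATABRICKS_TOKEN is not set" >&2
    return 1
  fi
  if [ -z "${WORKSPACEURL:-}" ]; then
    echo "WORKSPACEURL is not set" >&2
    return 1
  fi
  # Drop one trailing slash so "${WORKSPACEURL}/api/..." has no double slash.
  WORKSPACEURL="${WORKSPACEURL%/}"
}
```

Running `check_settings || exit 1` before the first curl call stops the script before any request is sent with an empty token.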
Get the cluster IDs with the following command:
CLUSTERS_ID=$(curl --header "Authorization: Bearer $DATABRICKS_TOKEN" -X GET \
  "${WORKSPACEURL}/api/2.0/clusters/list" | jq -c -r '.clusters[].cluster_id')
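The command above returns every cluster, including terminated ones. As far as I know, the install call fails for a terminated cluster, so you may prefer to keep only the running ones. The sketch below assumes each entry in the response carries a `state` field with the value `RUNNING` for active clusters; `running_cluster_ids` is an illustrative name.

```shell
# running_cluster_ids is a hypothetical helper name.
# It reads a clusters/list JSON response on stdin and prints the IDs of
# clusters whose state is RUNNING. The "?" keeps jq quiet when the
# response has no "clusters" key at all.
running_cluster_ids() {
  jq -r '.clusters[]? | select(.state == "RUNNING") | .cluster_id'
}
```

To use it, pipe the same curl call into `running_cluster_ids` instead of the plain jq filter.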
Finally, install the libraries with the following for loop:
for CLUSTER_ID in $CLUSTERS_ID
do
  cat << EOF > install-libraries.json
{
  "cluster_id": "${CLUSTER_ID}",
  "libraries": [
    {
      "maven": {
        "coordinates": "com.databricks.labs:overwatch_2.12:0.6.1.1"
      }
    },
    {
      "pypi": {
        "package": "adal"
      }
    }
  ]
}
EOF
  curl --header "Authorization: Bearer $DATABRICKS_TOKEN" --request POST \
    "${WORKSPACEURL}/api/2.0/libraries/install" \
    --data @install-libraries.json
done
The loop above writes a file called “install-libraries.json” at run time, so nothing has to exist on the machine beforehand. That is convenient when you cannot prepare files on the host in advance, on a pipeline agent for instance.
The cluster ID is also written into the file, which makes it easy to run against multiple clusters.
All you need to do is change the library entries and add whatever else you need.
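If the list of PyPI packages changes often, editing the JSON by hand gets old quickly. Here is a rough sketch that builds the same payload shape from command-line arguments instead. `build_pypi_payload` is a made-up helper, it only covers PyPI entries, and it assumes the cluster ID and package names contain no characters that need JSON escaping.

```shell
# build_pypi_payload is a hypothetical helper, not part of any Databricks tool.
# Usage: build_pypi_payload <cluster-id> <package>...
# Assumes none of the arguments need JSON escaping.
build_pypi_payload() {
  cluster_id=$1
  shift
  libraries=""
  for package in "$@"; do
    # Separate entries with a comma, but not before the first one.
    libraries="${libraries:+${libraries},}{\"pypi\":{\"package\":\"${package}\"}}"
  done
  printf '{"cluster_id":"%s","libraries":[%s]}\n' "${cluster_id}" "${libraries}"
}
```

Inside the loop you could then redirect its output to the file, for example `build_pypi_payload "$CLUSTER_ID" adal requests > install-libraries.json`, and keep the curl call unchanged.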
You can also find out more here:
And that’s it! If you have any questions, please drop them in the comments section.
Happy Studying!