[Script] Installing Databricks Libraries on Clusters Using API and CLI

Rafael Medeiros
3 min read · Dec 15, 2022


I received a ticket to install multiple libraries on multiple Databricks clusters. The ticket also asked me to automate the process, because installing every library by hand through the GUI is tedious and a waste of time.

Using Databricks CLI

The Databricks command-line interface (CLI) provides an easy-to-use interface to the Azure Databricks platform. The open source project is hosted on GitHub. The CLI is built on top of the Databricks REST API and is organized into command groups based on primary endpoints.

First, install databricks-cli if you don’t have it yet:

pip install databricks-cli

Then, configure the Databricks CLI with:

databricks configure --token

It will prompt you for the Databricks workspace URL (host) and a personal access token.

To generate a token, open User Settings > Access Tokens in the workspace and click Generate New Token.
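Once configured, the CLI stores the host and token in a profile file in your home directory. Assuming default settings, ~/.databrickscfg ends up looking roughly like this (both values below are placeholders):

```ini
[DEFAULT]
host = https://adb-1234567890123456.7.azuredatabricks.net
token = dapi0123456789abcdef
```

Every subsequent CLI command reads this file, so you only have to configure once per machine.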

Now, get the IDs of the clusters; it doesn't matter whether you have one or many:

CLUSTERS_ID=$(databricks clusters list --output json | jq -c -r '.clusters[].cluster_id')
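If you're new to jq, here is what that filter does, run against a trimmed, made-up sample of the JSON that `databricks clusters list --output json` returns (the cluster IDs are invented):

```shell
# Hypothetical sample of the clusters list response, trimmed to the field we need
SAMPLE='{"clusters":[{"cluster_id":"0001-000000-aaaa111"},{"cluster_id":"0002-000000-bbbb222"}]}'

# Same filter as above: pull out every cluster_id, one per line
CLUSTERS_ID=$(echo "$SAMPLE" | jq -c -r '.clusters[].cluster_id')
echo "$CLUSTERS_ID"
```

The `-r` flag strips the JSON quotes so the IDs come out as plain strings ready for a shell loop.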

To install the libraries, run a for loop over the $CLUSTERS_ID variable:

for CLUSTER_ID in $CLUSTERS_ID
do
  databricks libraries install --maven-coordinates "com.databricks.labs:overwatch_2.12:0.6.1.1" --cluster-id "$CLUSTER_ID"
  databricks libraries install --pypi-package adal --cluster-id "$CLUSTER_ID"
done
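If you need more than these two libraries, it helps to keep the package lists in variables so adding one is a single-line change. A minimal sketch follows; the `echo` makes it a dry run that only prints the commands (drop the `echo` to actually execute them), and the cluster ID is a made-up stand-in for the real list gathered above:

```shell
# Library lists -- extend these as needed (values taken from the examples above)
MAVEN_LIBS="com.databricks.labs:overwatch_2.12:0.6.1.1"
PYPI_LIBS="adal"

# Placeholder cluster list; in practice reuse the $CLUSTERS_ID variable from earlier
CLUSTERS_ID="0001-000000-aaaa111"

for CLUSTER_ID in $CLUSTERS_ID; do
  for COORD in $MAVEN_LIBS; do
    echo databricks libraries install --maven-coordinates "$COORD" --cluster-id "$CLUSTER_ID"
  done
  for PKG in $PYPI_LIBS; do
    echo databricks libraries install --pypi-package "$PKG" --cluster-id "$CLUSTER_ID"
  done
done > install-commands.txt

cat install-commands.txt
```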

Here you can look up the Maven coordinates:

https://mvnrepository.com/

And here you can search for pip packages:

https://pypi.org/

Using the API

The Databricks REST API lets you manage Databricks resources in your cloud workspace programmatically.

To authenticate to the API, you need both the workspace URL and a token. You can copy the workspace URL straight from your browser's address bar.

Then, create the following variables (the values shown are placeholders):

DATABRICKS_TOKEN=dapie5f72dbdsfsdg28adsfdsgfc59efdgfda48ac-9
WORKSPACEURL=https://adb-359439532458.3.azuredatabricks.net

Get the clusters ID with the following command:

CLUSTERS_ID=$(curl --header "Authorization: Bearer $DATABRICKS_TOKEN" -X GET \
${WORKSPACEURL}/api/2.0/clusters/list | jq -c -r '.clusters[].cluster_id')

Finally, install the libraries with the following for loop:

for CLUSTER_ID in $CLUSTERS_ID
do
cat << EOF > install-libraries.json
{
  "cluster_id": "${CLUSTER_ID}",
  "libraries": [
    {
      "maven": {
        "coordinates": "com.databricks.labs:overwatch_2.12:0.6.1.1"
      }
    },
    {
      "pypi": {
        "package": "adal"
      }
    }
  ]
}
EOF

curl --header "Authorization: Bearer $DATABRICKS_TOKEN" --request POST \
${WORKSPACEURL}/api/2.0/libraries/install \
--data @install-libraries.json

done
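To sanity-check the payload without touching a real workspace, you can run just the file-generation half of the loop with a dummy cluster ID and inspect the result (the cluster ID below is made up):

```shell
CLUSTER_ID="0001-000000-aaaa111"   # dummy value, just to exercise the template

# Same heredoc as in the loop above, compacted for readability
cat << EOF > install-libraries.json
{
  "cluster_id": "${CLUSTER_ID}",
  "libraries": [
    { "maven": { "coordinates": "com.databricks.labs:overwatch_2.12:0.6.1.1" } },
    { "pypi":  { "package": "adal" } }
  ]
}
EOF

cat install-libraries.json
```

If the output looks right, the real loop will POST the same shape, with each actual cluster ID substituted in.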

The loop above creates a file called "install-libraries.json" at runtime. This is the best approach when you don't have access to the machine's filesystem ahead of time, on a pipeline agent for instance.

The cluster ID is also written into the file, which makes it easy to run against multiple clusters.

To install different libraries, all you need to do is change the package entries in the JSON payload.


And that’s it! If you have any questions, please drop them in the comments section.

Happy Studying!
