I am used to Snowflake. The last few years of my career have largely been focused on the management and data engineering aspects of Snowflake. Now, I have to learn Databricks. Largely because Databricks doesn’t require a procurement process. It is procurement-driven architecture.
Shifting focus from a database-first, Python-as-an-afterthought platform to a Python-first, database-as-an-afterthought platform is frustrating.
Snowflake has had its 15 minutes of infamy when it comes to security. I suspect it is a mere coincidence that the hacking crew was targeting Snowflake instead of Databricks. Snowflake has since tightened its security defaults, but Databricks is still lagging. At least in one respect.
The tokens
One of the main issues behind the Snowflake hack was an infostealer. Basically, credentials stored on a laptop aren’t sufficiently secure when dealing with sensitive company data. With Snowflake, passwords and account identifiers were commonly stored in .env files, .snowflake folders and the like. This has all changed now: MFA is required, and SSO is strongly recommended.
Databricks, on the other hand, nudges you towards generating personal access tokens when working locally. This is increasingly an anti-pattern. Do not store long-lived authentication credentials on your computer. It is that simple. But you have to fight Databricks if you want to avoid their tokens. It can be done, though.
Configuring SSO-auth on the CLI
In the databricks-cli documentation, you will be instructed to run databricks configure, which prompts you for one of these long-lived tokens that you create in the Databricks portal. OAuth is not presented as an option. You can use it though, with some needless effort.
The databricks configure command simply creates a ~/.databrickscfg file. You might as well do that manually. It isn’t hard. But it is hardly documented.
The structure of these files is familiar to most. If you ignore the advice of your CISO and generate a token, the file will look something like this:
; The profile defined in the DEFAULT section is to be used as a fallback when no profile is explicitly specified.
[DEFAULT]
host = https://adb-123-xyz.azuredatabricks.net
token = <my-insecure-and-long-lived-token-that-should-be-a-fireable-offense>
Instead of this though, you can replace the token with an auth_type entry:
; The profile defined in the DEFAULT section is to be used as a fallback when no profile is explicitly specified.
[DEFAULT]
host = https://adb-123-xyz.azuredatabricks.net
auth_type = databricks-cli
So this lets you authenticate with your browser, as you probably already do with the Azure CLI, the GCP CLI, etc.
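To confirm that the profile works, run any workspace-level command. The first call should open your browser for the OAuth flow and cache a short-lived token. A quick sanity check, assuming a reasonably recent CLI version with OAuth support:
# Opens the browser for SSO on first use,
# then prints the identity the CLI is authenticated as
databricks current-user me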
If you have several workspaces, as one does, you can name them like this:
; The profile defined in the DEFAULT section is to be used as a fallback when no profile is explicitly specified.
[DEFAULT]
host = https://adb-123-xyz.azuredatabricks.net
auth_type = databricks-cli
[DEV]
host = https://adb-123-dev.azuredatabricks.net
auth_type = databricks-cli
[PROD]
host = https://adb-123-prod.azuredatabricks.net
auth_type = databricks-cli
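With named profiles in place, you pick one per command using the CLI’s global --profile flag (or its -p shorthand). A typical invocation might look like this (assuming the DEV workspace has clusters to list):
# Run against the DEV workspace instead of DEFAULT
databricks clusters list --profile DEV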
Workspace-level vs account-level configuration
The configurations above work for authenticating with a given workspace and running workspace-level commands. Running an account-level command, such as listing groups with databricks account groups list, will give you an error.
These commands require a different, account-level configuration. Instead of using your workspace as the host, you use https://accounts.azuredatabricks.net, and you add your account ID as a new parameter. Building on the file above:
; The profile defined in the DEFAULT section is to be used as a fallback when no profile is explicitly specified.
[DEFAULT]
host = https://adb-123-xyz.azuredatabricks.net
auth_type = databricks-cli
[DEV]
host = https://adb-123-dev.azuredatabricks.net
auth_type = databricks-cli
[PROD]
host = https://adb-123-prod.azuredatabricks.net
auth_type = databricks-cli
[ADMIN]
host = https://accounts.azuredatabricks.net
account_id = <my-account-id>
auth_type = databricks-cli
Running account-level commands requires you to use the ADMIN profile (you can of course call it whatever you want), while running commands such as deploying an app in a workspace requires one of the profiles configured with a workspace host URL (DEV, PROD and DEFAULT in this case).
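In practice that means pairing each command with the right kind of profile. For example (these two commands are illustrative; any account-level or workspace-level command follows the same pattern):
# Account-level command: needs the account profile
databricks account groups list --profile ADMIN

# Workspace-level command: needs a workspace profile
databricks workspace list / --profile DEV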
Preventing tokens
Because there may be other users on Databricks whose job isn’t to think deeply about the risks of long-lived tokens, you might want to prevent everyone from generating these tokens. You can do this either via the portal or, if you are not an absurd human, through Terraform:
# Disable personal access tokens for the workspace
resource "databricks_workspace_conf" "this" {
  custom_config = {
    "enableTokensConfig" : "false"
  }
}
You can verify that this toggles the same UI switch in the workspace settings.
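If you would rather verify it from the command line than in the UI, the same key is readable through the workspace-conf REST API. A sketch, assuming jq is installed and the DEFAULT profile points at the workspace in question:
# Use a short-lived OAuth token from the CLI (no PAT involved)
# to query the workspace configuration
curl -s \
  -H "Authorization: Bearer $(databricks auth token --profile DEFAULT | jq -r .access_token)" \
  "https://adb-123-xyz.azuredatabricks.net/api/2.0/workspace-conf?keys=enableTokensConfig"
# Should return {"enableTokensConfig":"false"} once the Terraform is applied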
Documentation is lagging
I have had a similar frustration with Azure. While their security documentation is very clear that everyone should use managed identities and SSO/OAuth authentication, almost every tutorial starts with a CLI command to create a service principal (in practice a username/password). Perhaps because it is bulletproof and gets users from A to B the quickest. Perhaps because the documentation simply isn’t updated to reflect the changing security landscape. In any case: these kinds of shortcuts in documentation can be a security issue. I had to do a lot of googling (and LLMing) in order to get this set up.