Connecting to Databricks
Connections are how CloudZero manages the various Cost Sources that bring Billing, Resource, and other types of data into the platform.
How the Databricks Connection Works
The CloudZero Databricks connection uses API access to a single Databricks workspace to gather consumption and pricing data for all workspaces in the account by querying the billable usage, pricing, and compute system tables.
This billing connection is needed for Databricks on AWS purchased directly from Databricks. For Databricks purchased through the Azure or GCP Marketplace, CloudZero gathers usage and cost information directly from the billing connections to those cloud providers.
Connection Prerequisites
Overview
Unity Catalog
To access Databricks system tables, you must have a workspace enabled for Unity Catalog. See the Databricks Unity Catalog documentation for details.
Billing and Compute Schemas
The system.billing and system.compute "system schemas" must then be enabled in that workspace.
Service Principal
CloudZero requires credentials for a Databricks service principal that can access tables in those schemas in that workspace. We recommend creating a new service principal with narrowly scoped permissions for this purpose. Instructions are provided below.
Warehouse
CloudZero also requires the warehouse ID of a warehouse to use while querying billing and usage information. CloudZero does not require a dedicated warehouse.
Note: Warehouse sizing
If you create a new warehouse for CloudZero billing queries, we recommend a Serverless warehouse with the lowest Auto Stop, Scaling, and Cluster Size settings possible.
Enabling the Billing and Compute Schemas
The goal is to make the billing and usage data that CloudZero needs available for querying. The data will be made available in the Unity Catalog-enabled workspace identified in the prerequisites section. This can be done through the Databricks CLI.
Installing Databricks CLI
If you have never used it before, download and install the Databricks CLI.
You can set up a Databricks CLI profile that connects to your account with the command:
databricks auth login
This command will prompt for 3 pieces of information:
- Databricks Profile Name: account
- Databricks Host: https://accounts.cloud.databricks.com
- Databricks Account ID: <Account ID>
To find this value, see the Databricks documentation on locating your account ID.
Commands to Enable
To get the metastore ID, first make sure you have the ID of the workspace. You can list all workspaces with:
databricks account workspaces list
Then you can look up which metastore is assigned to that workspace:
databricks account metastore-assignments get <workspace-id>
Once you have the metastore ID, you can enable the system schemas for that metastore:
databricks system-schemas enable <METASTORE-ID> compute
databricks system-schemas enable <METASTORE-ID> billing
Additional information about system tables is available in the Databricks documentation.
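The two enable commands in this section differ only in the schema name, so they can be generated from the metastore ID. The following is a minimal Python sketch of that idea, not part of the CloudZero or Databricks tooling. The function names `enable_schema_commands` and `render` are hypothetical, and the sketch only builds the command lines; it does not run them.

```python
import shlex
from typing import List

# The two system schemas this guide enables.
SCHEMAS = ("compute", "billing")


def enable_schema_commands(metastore_id: str) -> List[List[str]]:
    """Return one `databricks system-schemas enable` argv per schema."""
    if not metastore_id or metastore_id.startswith("<"):
        # Guard against pasting the <METASTORE-ID> placeholder unchanged.
        raise ValueError("replace the <METASTORE-ID> placeholder with a real ID")
    return [
        ["databricks", "system-schemas", "enable", metastore_id, schema]
        for schema in SCHEMAS
    ]


def render(commands: List[List[str]]) -> List[str]:
    """Render each argv as a shell-safe command line for review."""
    return [shlex.join(argv) for argv in commands]
```

Printing `render(enable_schema_commands(metastore_id))` lets you review the exact commands before running them in a terminal.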
Configuring a Databricks Service Principal
This section produces the 3 pieces of information needed to create the Databricks connection:
- databricks host: URL for the workspace
- client id: UUID for the service principal
- client secret: secret that allows CloudZero to use the Databricks API as the service principal
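To illustrate how these three values fit together, here is a minimal Python sketch that builds (but does not send) an OAuth client-credentials token request against the workspace. It assumes the service principal authenticates through Databricks OAuth machine-to-machine authentication at the workspace's `/oidc/v1/token` endpoint with the `all-apis` scope. How CloudZero itself authenticates is not described in this document, and the function name `build_token_request` is hypothetical.

```python
import base64
import urllib.parse
import urllib.request


def build_token_request(host: str, client_id: str, client_secret: str) -> urllib.request.Request:
    """Build an OAuth client-credentials token request without sending it."""
    # Client ID and secret travel as HTTP Basic credentials.
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    # Form-encoded body requesting a token for the workspace APIs (assumed scope).
    body = urllib.parse.urlencode(
        {"grant_type": "client_credentials", "scope": "all-apis"}
    ).encode()
    return urllib.request.Request(
        url=host.rstrip("/") + "/oidc/v1/token",
        data=body,
        headers={"Authorization": f"Basic {basic}"},
        method="POST",
    )
```

Passing the returned request to `urllib.request.urlopen` is one way to check that the host, client ID, and secret are a working combination before entering them in CloudZero.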
Creating the Service principal and secret
- Log into the Databricks account console and navigate to "User Management" https://accounts.cloud.databricks.com/users
- Click on "Service principals"
- Click "Add Service principal"
- Fill out a name and click "Add".
- Click on the new principal in the list of service principals
- Click on Generate Secret. Note the Secret and Client ID for later. The Client ID is the UUID for the service principal and can always be viewed; copy the secret as soon as it is generated.
Giving Service Principal access to the workspace
- Log into the Databricks account console and navigate to "Workspaces" https://accounts.cloud.databricks.com/workspaces
- Find the workspace that has the billing and compute schemas enabled, click the kebab menu on the far right, and select "Update"
- Click on "Permissions", then "Add permissions"
- Add the service principal by its Client ID (UUID). It only needs "User" permissions in the workspace.
Ensure Service Principal has SQL access
- Follow the Databricks documentation to find entitlement management for the workspace
- Ensure the Service Principal has the "Databricks SQL access" entitlement enabled.
Alternatively you can manage the service principal's entitlements via its group membership.
Give Service Principal access to the system tables
- Log into the workspace
- Open a SQL editor and run the following commands:
GRANT USE SCHEMA ON SCHEMA system.compute TO `<service principal client id>`;
GRANT SELECT ON TABLE system.compute.clusters TO `<service principal client id>`;
GRANT USE SCHEMA ON SCHEMA system.billing TO `<service principal client id>`;
GRANT SELECT ON TABLE system.billing.list_prices TO `<service principal client id>`;
GRANT SELECT ON TABLE system.billing.usage TO `<service principal client id>`;
- The service principal now has permission to query tables in the compute and billing schemas.
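The five GRANT statements above differ only in the privilege, the object, and the principal, so they can be generated from the service principal's client ID. The following Python sketch is one way to do that, assuming the client ID is a UUID as described earlier. The names `GRANTS` and `grant_statements` are hypothetical. Validating the UUID first also keeps stray characters out of the backtick-quoted principal.

```python
import uuid
from typing import List

# (privilege, object kind, object name) for each grant listed in this guide.
GRANTS = (
    ("USE SCHEMA", "SCHEMA", "system.compute"),
    ("SELECT", "TABLE", "system.compute.clusters"),
    ("USE SCHEMA", "SCHEMA", "system.billing"),
    ("SELECT", "TABLE", "system.billing.list_prices"),
    ("SELECT", "TABLE", "system.billing.usage"),
)


def grant_statements(client_id: str) -> List[str]:
    """Return the GRANT statements for one service principal."""
    # uuid.UUID raises ValueError if client_id is not a well-formed UUID.
    principal = str(uuid.UUID(client_id))
    return [
        f"GRANT {privilege} ON {kind} {name} TO `{principal}`;"
        for privilege, kind, name in GRANTS
    ]
```

The resulting statements can be pasted into the SQL editor in the order returned.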
Create a Databricks Connection
Open the Connections page
To open it, select the gear icon in the sidebar and then select Connections, or go to https://app.cloudzero.com/organization/connections
Navigate to the Databricks Connection Creation page
Select the Add New Connection button. Then select the Databricks tile.
Connection Metadata
- Connection Name: A connection name that will appear in the CloudZero UI
- Billing Account ID: Your Databricks parent account ID (Locate your account ID)
- Workspace URL: URL of the workspace configured with the billing and compute system schemas.
- Warehouse ID: ID of the warehouse to use to query for billing and usage information.
- Client ID: ID of the service principal created for CloudZero to access billing and compute data.
- Client Secret: Secret for that service principal.
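Before filling in the form, it can help to sanity-check the values collected in the earlier sections. The Python sketch below is a hypothetical pre-flight check, not a CloudZero API: the `ConnectionForm` class and `problems` function are invented for illustration, and the checks (non-empty fields, an https workspace URL, a UUID client ID) are assumptions drawn from the field descriptions above.

```python
import uuid
from dataclasses import dataclass
from typing import List
from urllib.parse import urlparse


@dataclass
class ConnectionForm:
    """The six fields requested on the Databricks connection page."""
    connection_name: str
    billing_account_id: str
    workspace_url: str
    warehouse_id: str
    client_id: str
    client_secret: str


def problems(form: ConnectionForm) -> List[str]:
    """Return a list of human-readable issues; empty means nothing obvious is wrong."""
    issues = []
    for field, value in vars(form).items():
        if not value.strip():
            issues.append(f"{field} is empty")
    parsed = urlparse(form.workspace_url)
    if parsed.scheme != "https" or not parsed.netloc:
        issues.append("workspace_url should be an https URL")
    try:
        uuid.UUID(form.client_id)
    except ValueError:
        issues.append("client_id should be the service principal's UUID")
    return issues
```

An empty list does not prove the connection will work; it only catches obvious copy-and-paste mistakes.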
Save the Connection
Select the Save button. You will be redirected back to the Connection Details page in the CloudZero platform, where you should see your newly created connection.
Databricks Connection Notes
Billing Period Ingest Windows
- Newly Created Connection: CloudZero will ingest the most recent 12 months' worth of billing periods, if available.
- Re-enabled Connection: CloudZero will attempt to ingest up to 24 months of billing periods starting from the current billing period and going back to the most recent billing period ingested.
- Steady State: CloudZero will ingest the current billing period and the previous billing period if it is likely to have changed.
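As a rough illustration of these windows, the Python sketch below lists the billing periods covered by a look-back of a given length. It assumes that a billing period is a calendar month and that the window includes the current month; neither detail is stated in this document, and `recent_billing_periods` is a hypothetical helper, not CloudZero's implementation.

```python
from datetime import date
from typing import List


def recent_billing_periods(today: date, months: int = 12) -> List[date]:
    """Return the first day of each of the last `months` calendar months, newest first."""
    year, month = today.year, today.month
    periods = []
    for _ in range(months):
        periods.append(date(year, month, 1))
        # Step back one month, rolling over the year boundary when needed.
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return periods
```

Under these assumptions, `recent_billing_periods(today, 12)` approximates the window for a newly created connection, and `recent_billing_periods(today, 2)` approximates the current and previous periods considered in steady state.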
Tag Prefix
Some information from the Databricks platform will be provided in CloudZero as tags with a prefix of dbx_cz. For example, the cluster name is available via the CloudZero tag dbx_cz:cluster_name.
Customer created tags will be passed through exactly as they appear in Databricks.
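When working with exported tag data, the prefix makes it easy to tell the two kinds of tags apart. The Python sketch below assumes tags are available as a plain dictionary of keys to values, which is an illustrative representation rather than a documented CloudZero format; `split_tags` is a hypothetical helper.

```python
from typing import Dict, Tuple

# Prefix and separator as shown in the dbx_cz:cluster_name example.
CZ_PREFIX = "dbx_cz:"


def split_tags(tags: Dict[str, str]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Separate CloudZero-provided Databricks tags from customer-created tags."""
    # Platform tags have the prefix stripped so keys read like "cluster_name".
    platform = {
        key[len(CZ_PREFIX):]: value
        for key, value in tags.items()
        if key.startswith(CZ_PREFIX)
    }
    # Customer tags are kept exactly as they appear.
    customer = {
        key: value for key, value in tags.items() if not key.startswith(CZ_PREFIX)
    }
    return platform, customer
```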
Multiple Workspaces in an Account
Access to one workspace as described in this document will provide CloudZero with data for all spend associated with the Databricks account. It is not necessary to set up a connection for each workspace.
Default Pricing
The current version of the Databricks cost adaptor uses default pricing for Databricks SKUs.
Overrides for SKU rates can be configured upon request.