Connecting to Databricks

Connections are how CloudZero manages the various Cost Sources that bring Billing, Resource, and other types of data into the platform.

How the Databricks Connection Works

The CloudZero Databricks connection uses API access to a single Databricks workspace to gather consumption and pricing data for all workspaces in the account. It does this by querying the billable usage, pricing, and compute system tables.

This billing connection is needed for Databricks on AWS purchased directly from Databricks. For Databricks purchased through the Azure or GCP Marketplace, CloudZero gathers usage and cost information directly from the billing connections to those cloud providers.

Connection Prerequisites

Overview

Unity Catalog

To access Databricks system tables, you must have a workspace enabled for Unity Catalog. For details, see the Databricks Unity Catalog documentation.

Billing and Compute Schemas

The system.billing and system.compute "system schemas" must then be enabled in that workspace.

Service Principal

CloudZero requires credentials for a Databricks service principal that can access tables in those schemas in that workspace. We recommend creating a new service principal with narrowly scoped permissions for this purpose; instructions appear in Configuring a Databricks Service Principal below.

Warehouse

CloudZero also requires the ID of a warehouse to use when querying billing and usage information. The warehouse does not need to be dedicated to CloudZero.

Note: Warehouse sizing

If you create a new warehouse for CloudZero billing queries, we recommend a Serverless warehouse with the lowest Auto Stop, Scaling, and Cluster Size settings possible.
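To illustrate why a warehouse ID is needed, the sketch below builds (but does not send) a request for the Databricks SQL Statement Execution API, which runs a SQL statement on a named warehouse. This is an assumption about one way to query system tables over the API; this document does not describe the calls CloudZero actually makes. The workspace URL, warehouse ID, and query are placeholders.

```python
import json


def build_statement_request(workspace_url: str, warehouse_id: str, statement: str):
    """Return (url, json_body) for a SQL Statement Execution API request.

    Nothing is sent over the network; this only shows where the
    warehouse ID fits into a query against the system tables.
    """
    url = workspace_url.rstrip("/") + "/api/2.0/sql/statements"
    body = {
        "warehouse_id": warehouse_id,  # the warehouse that executes the query
        "statement": statement,
        "wait_timeout": "30s",         # wait up to 30 seconds for a result
    }
    return url, json.dumps(body)


# Placeholder values for illustration only.
url, body = build_statement_request(
    "https://example.cloud.databricks.com/",
    "abc123",
    "SELECT * FROM system.billing.usage LIMIT 10",
)
print(url)
print(body)
```

Because every query names a warehouse, the service principal also needs permission to use that warehouse, as covered later in this document.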

Enabling the Billing and Compute Schemas

This step makes the billing and usage data that CloudZero needs available for querying. The data will be available in the Unity Catalog-enabled workspace identified in the prerequisites section. You can enable the schemas through the Databricks CLI.

Installing Databricks CLI

If you have not used it before, download and install the Databricks CLI.
You can then set up a Databricks CLI profile that connects to your account with the command:

databricks auth login

This command will prompt for three pieces of information:

  • Databricks Profile Name: account
  • Databricks Host: https://accounts.cloud.databricks.com
  • Databricks Account ID: <Account ID> (see the Databricks documentation on locating your account ID)

Commands to Enable

To get the metastore ID, first make sure you have the ID of the workspace. You can list all workspaces with:

databricks account workspaces list

Then look up the metastore assigned to that workspace:

databricks account metastore-assignments get <workspace-id>

Once you have the metastore ID, you can enable the system schemas for that metastore:

databricks system-schemas enable <METASTORE-ID> compute
databricks system-schemas enable <METASTORE-ID> billing
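If you prefer to script this step, the two enable commands above can be generated from a single metastore ID. The following is a minimal sketch; the function name and the sample metastore ID are hypothetical, and the script only prints the commands rather than running them.

```python
import shlex


def enable_schema_commands(metastore_id: str, schemas=("compute", "billing")):
    """Build one `databricks system-schemas enable` argv list per schema."""
    return [
        ["databricks", "system-schemas", "enable", metastore_id, schema]
        for schema in schemas
    ]


# "example-metastore-id" is a placeholder, not a real metastore ID.
for argv in enable_schema_commands("example-metastore-id"):
    print(shlex.join(argv))
```

To execute the commands instead of printing them, each argv list could be passed to `subprocess.run(argv, check=True)` on a machine where the Databricks CLI is installed and authenticated.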

Additional information about system tables can be found in the Databricks documentation.

Configuring a Databricks Service Principal

This section produces the three pieces of information needed to create the Databricks connection:

  • Databricks host: URL for the workspace
  • Client ID: UUID for the service principal
  • Client secret: secret that lets CloudZero call the Databricks API as the service principal
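To show how these three values fit together, the sketch below builds (but does not send) an OAuth client-credentials token request against the workspace, which is how a service principal typically authenticates to the Databricks API. It is an illustration under that assumption, not a description of CloudZero's internals; the host and credentials are placeholders.

```python
import base64


def build_token_request(databricks_host: str, client_id: str, client_secret: str):
    """Return (url, headers, body) for an OAuth client-credentials token request."""
    url = databricks_host.rstrip("/") + "/oidc/v1/token"
    # The client ID and secret are sent as HTTP Basic credentials.
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    headers = {
        "Authorization": f"Basic {basic}",
        "Content-Type": "application/x-www-form-urlencoded",
    }
    body = "grant_type=client_credentials&scope=all-apis"
    return url, headers, body


# Placeholder values for illustration only.
url, headers, body = build_token_request(
    "https://example.cloud.databricks.com",
    "00000000-0000-0000-0000-000000000000",
    "example-secret",
)
print(url)
```

The access token returned by such a request would then be sent as a bearer token on later API calls.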

Creating the Service Principal and Secret

  • Log in to the Databricks account console and navigate to "User Management": https://accounts.cloud.databricks.com/users
  • Click "Service principals".
  • Click "Add Service principal".
  • Fill out a name and click "Add".
  • Click the new principal in the list of service principals.
  • Click "Generate Secret". Note the Secret and Client ID for later. (The Client ID is the UUID for the service principal and can always be viewed.)

Giving Service Principal access to the workspace

  • Log in to the Databricks account console and navigate to "Workspaces": https://accounts.cloud.databricks.com/workspaces
  • Find the workspace that has the billing and compute schemas enabled, click the kebab menu on the far right, and select "Update".
  • Click "Permissions", then "Add permissions".
  • Add the service principal by its Client ID (UUID). It only needs "User" permissions in the workspace.

Ensure Service Principal has warehouse access

  • Log in to the workspace.
  • Select the warehouse provided in the connection configuration.
  • Click "Permissions". (You must have "admin" access to the workspace to view this.)
  • Ensure the service principal has the "Can Use" permission. (If access is enabled after previously being disabled, it may take a while before the Databricks connection can read from the warehouse.)

Ensure Service Principal has SQL access

Give Service Principal access to the system tables

  • Log in to the workspace.
  • Open a SQL Editor and issue the following commands:
GRANT USE SCHEMA ON SCHEMA system.compute TO `<service principal client id>`;
GRANT SELECT ON TABLE system.compute.clusters TO `<service principal client id>`;
GRANT USE SCHEMA ON SCHEMA system.billing TO `<service principal client id>`;
GRANT SELECT ON TABLE system.billing.list_prices TO `<service principal client id>`;
GRANT SELECT ON TABLE system.billing.usage TO `<service principal client id>`;
  • The service principal now has permission to query tables in the compute and billing schemas.
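To avoid pasting the client ID five times by hand, the grants above can be generated from one value. This is a minimal sketch; the function name is hypothetical, and the schema and table names are copied from the commands listed above.

```python
# Schemas and tables taken from the GRANT commands in this section.
SYSTEM_TABLES = {
    "system.compute": ["clusters"],
    "system.billing": ["list_prices", "usage"],
}


def grant_statements(client_id: str) -> list[str]:
    """Return the GRANT statements for one service principal client ID."""
    if "`" in client_id:
        # The ID is wrapped in backticks, so it must not contain one itself.
        raise ValueError("client ID must not contain backticks")
    principal = f"`{client_id}`"
    statements = []
    for schema, tables in SYSTEM_TABLES.items():
        statements.append(f"GRANT USE SCHEMA ON SCHEMA {schema} TO {principal};")
        for table in tables:
            statements.append(f"GRANT SELECT ON TABLE {schema}.{table} TO {principal};")
    return statements


# Placeholder client ID for illustration only.
for statement in grant_statements("00000000-0000-0000-0000-000000000000"):
    print(statement)
```

The printed statements can then be pasted into the SQL Editor and run as a batch.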

Create a Databricks Connection

Open the Connections page

To find it, select the gear icon in the sidebar and then select Connections, or go directly to https://app.cloudzero.com/organization/connections

Navigate to the Databricks Connection Creation page

Select the Add New Connection button. Then select the Databricks tile.

Connection Metadata

  • Connection Name: A connection name that will appear in the CloudZero UI
  • Billing Account ID: Your Databricks parent account ID (see the Databricks documentation on locating your account ID).
  • Workspace URL: URL of the workspace configured with the billing and compute system schemas.
  • Warehouse ID: ID of the warehouse to use to query for billing and usage information.
  • Client ID: ID of the service principal created for CloudZero to access billing and compute data.
  • Client Secret: Secret for that service principal.
  • Use Fixed IP Egress: Enable to use fixed IP egress for this connection. See the Fixed IP Egress section below.
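Before filling in the form, it can help to collect the values in one place and sanity-check them. The sketch below is a hypothetical pre-flight check: the class, field names, and validation rules are illustrative assumptions and are not part of any CloudZero API.

```python
from dataclasses import dataclass, fields
from urllib.parse import urlparse


@dataclass
class DatabricksConnectionSettings:
    """Hypothetical container mirroring the form fields listed above."""
    connection_name: str
    billing_account_id: str
    workspace_url: str
    warehouse_id: str
    client_id: str
    client_secret: str
    use_fixed_ip_egress: bool = False

    def problems(self) -> list[str]:
        """Return a list of human-readable issues; empty means nothing obvious is wrong."""
        issues = []
        for f in fields(self):
            value = getattr(self, f.name)
            if isinstance(value, str) and not value.strip():
                issues.append(f"{f.name} is empty")
        parsed = urlparse(self.workspace_url)
        if parsed.scheme != "https" or not parsed.netloc:
            issues.append("workspace_url should be an https URL")
        return issues


# Placeholder values for illustration only.
settings = DatabricksConnectionSettings(
    "Databricks prod", "example-account-id",
    "https://example.cloud.databricks.com", "abc123",
    "00000000-0000-0000-0000-000000000000", "example-secret",
)
print(settings.problems())
```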

Save the Connection

Select the Save button. You will be redirected back to the Connection Details page in the CloudZero platform, where you should see your newly created connection.

Databricks Connection Notes

Billing Period Ingest Windows

  • Newly Created Connection: CloudZero will ingest the most recent 12 months' worth of billing periods, if available.
  • Re-enabled Connection: CloudZero will attempt to ingest up to 24 months of billing periods starting from the current billing period and going back to the most recent billing period ingested.
  • Steady State: CloudZero will ingest the current billing period and the previous billing period if it is likely to have changed.
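The three ingest windows above can be sketched as follows. This is an illustration only: it assumes billing periods are calendar months, which this document does not state, and all function names are hypothetical.

```python
from datetime import date


def previous_period(period: date) -> date:
    """First day of the month before the given period."""
    if period.month == 1:
        return date(period.year - 1, 12, 1)
    return date(period.year, period.month - 1, 1)


def periods_back(today: date, count: int) -> list[date]:
    """The current period plus earlier ones, newest first, `count` in total."""
    periods = []
    current = today.replace(day=1)
    for _ in range(count):
        periods.append(current)
        current = previous_period(current)
    return periods


def new_connection_periods(today: date) -> list[date]:
    # Newly created connection: the most recent 12 periods.
    return periods_back(today, 12)


def reenabled_periods(today: date, last_ingested: date) -> list[date]:
    # Re-enabled connection: back to the last ingested period, capped at 24.
    oldest = last_ingested.replace(day=1)
    return [p for p in periods_back(today, 24) if p >= oldest]


def steady_state_periods(today: date, previous_may_have_changed: bool) -> list[date]:
    # Steady state: the current period, plus the previous one if it may have changed.
    return periods_back(today, 2 if previous_may_have_changed else 1)


print(reenabled_periods(date(2024, 3, 15), date(2024, 1, 10)))
```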

Tag Prefix

Some information from the Databricks platform is provided in CloudZero as tags with the prefix dbx_cz. For example, the cluster name is available via the CloudZero tag dbx_cz:cluster_name.

Customer-created tags are passed through exactly as they appear in Databricks.
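The naming rule can be sketched as a small mapping function. The function and argument names are hypothetical; only the dbx_cz prefix and the dbx_cz:cluster_name example come from this document.

```python
TAG_PREFIX = "dbx_cz"


def to_cloudzero_tags(platform_fields: dict, customer_tags: dict) -> dict:
    """Combine platform-provided fields and customer tags into one tag set."""
    tags = dict(customer_tags)  # customer tags pass through unchanged
    for name, value in platform_fields.items():
        tags[f"{TAG_PREFIX}:{name}"] = value  # platform fields get the prefix
    return tags


# Sample values for illustration only.
print(to_cloudzero_tags({"cluster_name": "nightly-etl"}, {"team": "data"}))
```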

Multiple Workspaces in an Account

Access to one workspace as described in this document will provide CloudZero with data for all spend associated with the Databricks account. It is not necessary to set up a connection for each workspace.

Default Pricing

The current version of the Databricks cost adaptor uses default pricing for Databricks SKUs.

Overrides for SKU rates can be configured upon request.

Fixed IP Egress

To establish a Databricks connection, you must provide a Client ID and a Client Secret. Databricks allows access to be restricted to specific IP addresses, which can be configured as follows:

  1. Enable Use Fixed IP Egress for the CloudZero Managed Databricks Connection.
  2. In your Databricks account, navigate to Account Console > Settings > Security tab > IP Access List.
  3. Add a rule that allows the following IP addresses: 52.0.118.180, 52.0.33.111
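After editing the IP access list, a quick check can confirm that both addresses are covered. The sketch below uses Python's standard ipaddress module; the function name is hypothetical, and the entries passed in stand for whatever addresses or CIDR ranges your allow list contains.

```python
import ipaddress

# The two addresses listed in step 3.
CLOUDZERO_EGRESS_IPS = ("52.0.118.180", "52.0.33.111")


def missing_from_allow_list(allow_list_entries) -> list[str]:
    """Return the CloudZero egress IPs not covered by any allow-list entry."""
    networks = [ipaddress.ip_network(entry, strict=False) for entry in allow_list_entries]
    return [
        ip for ip in CLOUDZERO_EGRESS_IPS
        if not any(ipaddress.ip_address(ip) in network for network in networks)
    ]


# Example allow list that covers only one of the two addresses.
print(missing_from_allow_list(["52.0.118.180/32", "10.0.0.0/8"]))
```

An empty result means both addresses are allowed; any address returned still needs a rule.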