Connecting to Databricks

Connections are how CloudZero manages the various Cost Sources that bring Billing, Resource, and other types of data into the platform.

How the Databricks Connection Works

The CloudZero Databricks connection uses API access to a single Databricks workspace to gather consumption and pricing data for all workspaces in the account. It does so by querying the billable usage, pricing, and compute system tables.

This billing connection is needed for Databricks on AWS purchased directly from Databricks. For Databricks purchased through the Azure or GCP Marketplace, CloudZero gathers usage and cost information directly from the billing connections to those cloud providers.

Connection Prerequisites

Databricks settings

Unity Catalog

To access Databricks system tables, you must have a workspace enabled for Unity Catalog. For details, see Databricks Unity Catalog.

Billing and Compute Schemas

The system.billing and system.compute "system schemas" must then be enabled in that workspace.

Service Principal

CloudZero requires credentials for a Databricks service principal that can access tables in those schemas in that workspace. CloudZero recommends you create a new service principal with narrowly scoped permissions for this purpose, as explained later in these instructions.

Warehouse

CloudZero requires the ID of a warehouse to use when querying billing and usage information. CloudZero does not require a dedicated warehouse.

ℹ️

If you are creating a new warehouse for CloudZero billing queries, CloudZero recommends specifying a serverless warehouse with the lowest Auto Stop, Scaling, and Cluster Size settings possible.

Enabling the Billing and Compute Schemas

The goal is to make the billing and usage data CloudZero needs available for querying. The data will be made available in the Unity Catalog-enabled workspace identified in the prerequisites section. You can enable the schemas through the Databricks CLI.

Installing Databricks CLI

If you have not used it before, download and install the Databricks CLI. You can then set up a Databricks CLI profile that connects to your account with the command databricks auth login.

This command prompts for the following information:

  • Databricks Profile Name: account

  • Databricks Host: https://accounts.cloud.databricks.com

  • Databricks Account ID: <Account ID> (Locate your account ID)

Commands to Enable

To enable the system schemas, you first need the metastore ID. Start by finding the ID of the workspace; use the following command to see all the workspaces: databricks account workspaces list.

Then list the metastore assigned to that workspace: databricks account metastore-assignments get <workspace-id>.

When you have the metastore ID, enable the system schemas for that metastore:

databricks system-schemas enable <METASTORE-ID> compute
databricks system-schemas enable <METASTORE-ID> billing

For more information about system tables, see the Databricks documentation.

Configuring a Databricks Service Principal

You must have the following information to create the Databricks connection:

  • Databricks host: URL for the workspace
  • Client ID: UUID for the service principal
  • Client secret: secret that allows CloudZero to use the Databricks API as the service principal

Create the Service Principal and Secret

  • Log in to the Databricks account console and navigate to User Management (https://accounts.cloud.databricks.com/users).
  • Click Service principals.
  • Click Add Service principal.
  • Enter a name and click Add.
  • Click the new principal in the list of Service Principals.
  • Click Generate Secret. Note the Secret and Client ID for later. You can always view the Client ID, which is the UUID for the service principal.
ℹ️

Be sure to generate an OAuth secret; the Service Principal will not function correctly without one. For more information, refer to Databricks authorization methods.

Give the Service Principal access to the workspace

  • Log in to the Databricks account console and navigate to Workspaces https://accounts.cloud.databricks.com/workspaces.
  • Find the workspace that has the billing and compute schemas enabled, click the kebab menu on the far right, and select Update.
  • Click Permissions > Add permissions.
  • Add the Service Principal by its Client ID (UUID). It needs only User permissions in the workspace.

Ensure the Service Principal has warehouse access

  • Log in to the workspace.
  • Select the warehouse provided in the connection configuration.
  • Click Permissions. You must have admin access to the workspace to see this.
  • Ensure the Service Principal has the Can Use permission. If you enabled the permission after it was previously disabled, it may take a while for the Databricks connection to read from the warehouse.

Give the Service Principal access to the system tables

  • Log in to the workspace.
  • Open a SQL editor and issue the following commands:
GRANT USE SCHEMA ON SCHEMA system.compute TO `<service principal client id>`;
GRANT SELECT ON TABLE system.compute.clusters TO `<service principal client id>`;
GRANT USE SCHEMA ON SCHEMA system.billing TO `<service principal client id>`;
GRANT SELECT ON TABLE system.billing.list_prices TO `<service principal client id>`;
GRANT SELECT ON TABLE system.billing.usage TO `<service principal client id>`;

The Service Principal now has permission to query tables in the compute and billing schemas.
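The grant statements above follow one pattern with the service principal's client ID substituted in. As a sketch only (this helper is illustrative and not part of any CloudZero or Databricks tooling), they can be rendered for pasting into a SQL editor like this:

```python
# Illustrative helper: render the five GRANT statements for a given
# service principal client ID, ready to paste into a SQL editor.

GRANTS = [
    ("USE SCHEMA", "SCHEMA system.compute"),
    ("SELECT", "TABLE system.compute.clusters"),
    ("USE SCHEMA", "SCHEMA system.billing"),
    ("SELECT", "TABLE system.billing.list_prices"),
    ("SELECT", "TABLE system.billing.usage"),
]

def render_grants(client_id: str) -> list[str]:
    """Return the GRANT statements with the client ID substituted in."""
    return [f"GRANT {priv} ON {obj} TO `{client_id}`;" for priv, obj in GRANTS]

# Example with a placeholder UUID:
for stmt in render_grants("aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"):
    print(stmt)
```

This keeps the client ID in one place, which helps avoid the common mistake of updating it in some statements but not others.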

Create a Databricks Connection

Select the gear icon from the sidebar and select Connections or navigate to https://app.cloudzero.com/organization/connections.

Settings

On the Connections page, select the Add New Connection button. Then select the Databricks tile.

Enter the connection metadata:

  • Connection Name: The name that will appear for this connection in the CloudZero UI.
  • Billing Account ID: Your Databricks parent account ID (Locate your account ID).
  • Workspace URL: The URL of the workspace where you enabled the billing and compute system schemas; CloudZero will query these schemas for your cost and usage data.
  • Warehouse ID: ID of the warehouse to use to query for billing and usage information.
  • Client ID: ID of the Service Principal created for CloudZero to access billing and compute data.
  • Client Secret: Secret for that Service Principal.
  • Use Fixed IP Egress: Enable to use Databricks fixed IP egress functionality. See the Fixed IP Egress section.

To save the connection, select the Save button. You will return to the Connection Details page in the CloudZero platform, where you should see your newly created connection.

Databricks Connection Notes

Billing Period Ingest Windows

  • Newly Created Connection: CloudZero will ingest the most recent 12 months of billing periods if available.
  • Re-enabled Connection: CloudZero will attempt to ingest up to 24 months of billing periods starting from the current billing period and going back to the most recent billing period ingested.
  • Steady State: CloudZero will ingest the current billing period and the previous billing period if it is likely to have changed.
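The "newly created connection" window above can be sketched as monthly billing-period labels ending with the current month. This is illustrative only; CloudZero's actual ingest scheduling is internal to the platform:

```python
# Sketch of the newly-created-connection ingest window: the most recent
# 12 monthly billing periods, ending with the current month.
# Illustrative only -- not CloudZero's actual scheduling logic.
from datetime import date

def recent_billing_periods(today: date, months: int = 12) -> list[str]:
    """Return YYYY-MM labels for the last `months` periods, oldest first."""
    periods = []
    year, month = today.year, today.month
    for _ in range(months):
        periods.append(f"{year:04d}-{month:02d}")
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return list(reversed(periods))

print(recent_billing_periods(date(2024, 3, 15)))
# 12 labels, "2023-04" through "2024-03"
```

The re-enabled-connection case uses the same idea with a window of up to 24 periods, bounded by the most recent billing period already ingested.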

Tag Prefix

Some information from the Databricks platform will be provided in CloudZero as tags with a prefix of dbx_cz. For example, cluster name is available when you use the CloudZero tag dbx_cz:cluster_name.

Customer-created tags will be passed through exactly as they appear in Databricks.

A list of Databricks information that can be assigned a dbx_cz tag follows:

  • cluster_name
  • cluster_id
  • cluster_source
  • dbr_version
  • dlt_pipeline_id
  • driver_instance_pool_id
  • driver_node_type
  • instance_pool_id
  • job_id
  • job_run_id
  • notebook_id
  • owned_by
  • warehouse_id
  • worker_instance_pool_id
  • worker_node_type
  • workspace_id
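The naming rule above, platform-provided fields gaining the dbx_cz prefix while customer tags pass through unchanged, can be sketched as follows (the field set here is a small illustrative subset, and this helper is not part of CloudZero's actual ingestion code):

```python
# Sketch: map Databricks metadata keys to the tag names they appear
# under in CloudZero. Platform-provided fields get the dbx_cz prefix;
# customer-created tags pass through unchanged. Illustrative only.

# Subset of the platform-provided fields listed above, for illustration:
DBX_FIELDS = {"cluster_name", "cluster_id", "job_id", "warehouse_id", "workspace_id"}

def cloudzero_tag(key: str) -> str:
    """Return the tag name as it appears in CloudZero."""
    return f"dbx_cz:{key}" if key in DBX_FIELDS else key

print(cloudzero_tag("cluster_name"))  # dbx_cz:cluster_name
print(cloudzero_tag("team"))          # team (customer tag, unchanged)
```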

Multiple Workspaces in an Account

Access to one workspace as described in this document will provide CloudZero with data for all spend associated with the Databricks account. It is not necessary to set up a connection for each workspace.

Default Pricing

The Databricks cost adaptor uses default pricing for Databricks SKUs.

Overrides for SKU rates can be configured upon request.

Fixed IP Egress

Databricks allows access to be restricted to specific IP addresses at both the account and workspace level. If your organization restricts IP access at the account level, you can configure access as follows:

  1. Enable Use Fixed IP Address for the CloudZero managed Databricks connection.
  2. In your Databricks account, navigate to Account Console > Settings > Security tab > IP Access List.
  3. Add a rule that allows the following IP addresses: 52.0.118.180, 52.0.33.111

If your organization also restricts access to specific IP addresses at the workspace level, you must add the same IP addresses to the workspace IP Access List.

Databricks Region and Service Details

You may see a discrepancy between raw data in Databricks and CloudZero data in the Explorer.

This happens because when Databricks data is ingested into CloudZero, the sku_name_with_region field is split into two separate fields, one for the SKU name, displayed as Service in CloudZero, and one for the Region.

For example:

sku_name_with_region = ENTERPRISE_SERVERLESS_SQL_COMPUTE_US_EAST_N_VIRGINIA is split into

Service = ENTERPRISE_SERVERLESS_SQL_COMPUTE and Region = US_EAST_N_VIRGINIA.

This results in slightly different behavior when you filter and group spend in CloudZero compared to Databricks.

For example, in Databricks, entries like ENTERPRISE_SERVERLESS_SQL_COMPUTE_US_WEST_OREGON and ENTERPRISE_SERVERLESS_SQL_COMPUTE_US_EAST_N_VIRGINIA appear as separate entities. In CloudZero, grouping by Service combines costs from different regions, such as US-West and US-East, into one service titled ENTERPRISE_SERVERLESS_SQL_COMPUTE.

To replicate a Databricks view in CloudZero, group by Service and add a filter for Region.

Note that some Databricks SKUs do not contain any region information, for example: INTER_AVAILABILITY_ZONE_EGRESS. In these cases, the Region in CloudZero will be set to None.