Databricks metadata collector
The Databricks metadata collector provides read-only access to metadata from an external Databricks account.
The collector harvests metadata from data assets in the Databricks Hive metastore, Unity Catalog (including Delta Lake), Workflows, and Notebooks.
Metadata cataloged
The Databricks collector catalogs the following information.
| Object | Information cataloged |
|---|---|
| Columns | Name, Description, JDBC type, Column Type, Is Nullable, Default Value, Column size, Column index. Extended metadata: Tags. Note: Deprecated columns and any lineage related to these deprecated columns are not cataloged. |
| Table | Name, Description, Schema, Primary key, Foreign key. Extended metadata: Tags, Owner, Type, Creation date, Last Modified, Location, Provider, Version, Size, File Count, Partition Columns, Properties |
| Model | Name, Owner, Description, Created By, Created At, Last Modified By, Last Modified At, Securable Kind, Securable Type |
| Views | Name, Description, SQL definition, Tags |
| Schema | Name. Extended metadata: Tags |
| Database | Type, Name, Server, Port, Environment, JDBC URL. Extended metadata: Tags |
| Notebook | Notebook ID, Path, Language Type (SQL, Python, Scala, R) |
| Function | Name, Description, Function Type |
| Job | Title, Description, Creator, Created At, Job run as, Format, Max Concurrent Runs, Notification On Start, Timeouts (sec), Notification On Success, Schedule, Git Source, Notification on Failure, Tags, List of tasks, List of clusters |
| Cluster | Name, Description, Node Type ID, Driver Node Type ID, Spark Version, Number of Workers, Autoscale Max Workers, Autoscale Min Workers, AWS Attributes, Tags |
| Task | Task Key, Type of Task (Notebook, dbt, Spark jar, Python script, Python wheel, Pipeline task, SQL), Task timeout, Retry interval, Cluster used by the task, Max retries, Depends on, Libraries, Notifications (On start, On success, On failure), Notebook File Path, Notebook Source, Notebook Parameters, Spark Jar Main Class Name, Spark Jar Parameters, Python Script File path, Python Script Parameters, Spark Submit Parameters, Pipeline ID, Pipeline Full Refresh, Python Wheel Package Name, Python Wheel Entry Point, Python Wheel Parameters, SQL Warehouse, SQL Query ID, SQL Dashboard ID, SQL Alert ID, Dbt Project Directory, Dbt Profiles Directory, Dbt warehouse, Dbt catalog, Dbt schema, Dbt commands |
| External Location | Name, External URL, Description, Data Source Type, Created Date, Created By, Owner |
| Storage Credential | Name, Description, Credential, Created Date, Created By, Owner |
| Volume | Name, Description, Type, Owner, Created By, Created At, Last Modified By, Last Modified At, Metastore ID |
| Materialized View | Name, SQL Definition, Created, Last Modified |
| Metric View | Name, Description, YAML Definition, Source Table, Source Table Type, Filter, Created, Last Modified |
The following additional information is cataloged when you run the collector with the Enable Governance Metadata Collection option.
| Object | Information cataloged |
|---|---|
| Row Filter Access Control | Name |
| Column Mask Access Control | Name |
| Attribute Based Access Control | Name, Description, Created by, Created at, Modified by, Modified at, On securable type, For securable type, To principals, Except principals |
| Workspace bindings | Workspace ID, Binding type |
| Privileges | Granted to, Granted by, Privilege type, Granted on object, Inherited from |
Relationships between objects
The harvested metadata includes catalog pages for the following data asset types. Each catalog page has a relationship to the other related data asset types.
| Data asset page | Relationships |
|---|---|
| Table | Columns contained in Table |
| Schema | |
| Database | Schema contained in Database |
| Columns | |
| Table Indexes | Columns |
| Job | |
| Cluster | |
| Task | |
| Notebook | |
| Folder | |
| External Location | |
| Storage Credential | |
| Model | |
| Volume | |
| Materialized View | |
| Metric View | |
| Pipeline | |
| Unity Catalog Metastore | Databases contained in metastore |
| Row Filter Access Control | |
| Column Mask Access Control | |
| Attribute Based Access Control | |
| Catalog | |
Lineage for Databricks
The following lineage information is collected by the Databricks collector.
| Object | Lineage available |
|---|---|
| Column in view | The collector identifies the associated column in an upstream view or table for both Hive metastore and Unity Catalog. Note: Deprecated columns and any lineage related to these deprecated columns are not cataloged. |
| Notebook | Tasks that reference the notebook (only if Databricks Unity Catalog is enabled). |
| Table | |
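As an illustration of what column-level lineage for views means, the following is a minimal sketch in Python. It assumes lineage is held as a simple mapping from a view column to the upstream columns it derives from; the `LINEAGE` data, the column names, and the `upstream_sources` helper are all hypothetical and do not reflect the collector's internal format.

```python
from typing import Dict, List, Set

# Hypothetical lineage edges: each view column maps to the upstream
# view or table columns it is derived from.
LINEAGE: Dict[str, List[str]] = {
    "sales.v_revenue.total": ["sales.orders.amount", "sales.orders.tax"],
    "sales.v_summary.revenue": ["sales.v_revenue.total"],
}


def upstream_sources(column: str, lineage: Dict[str, List[str]]) -> Set[str]:
    """Return the columns with no recorded upstream that feed `column`."""
    sources: Set[str] = set()
    seen: Set[str] = set()
    stack = [column]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against cycles and repeated visits
        seen.add(current)
        parents = lineage.get(current, [])
        if not parents and current != column:
            sources.add(current)  # reached a base table column
        stack.extend(parents)
    return sources


if __name__ == "__main__":
    print(sorted(upstream_sources("sales.v_summary.revenue", LINEAGE)))
```

In this sketch a view built on another view resolves through the intermediate view to the underlying table columns, which is the kind of upstream association described in the table above.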
Authentication supported
The Databricks collector supports personal access token authentication and OAuth service principal authentication.
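To make the two authentication modes concrete, here is a minimal sketch of how requests to the Databricks REST API are typically authenticated in each case. It only builds the requests and does not send them. The helper names `pat_request` and `oauth_token_request`, the host, and the credentials are hypothetical, and the sketch illustrates the general Databricks mechanisms rather than how the collector itself is implemented or configured.

```python
import base64
import urllib.parse
import urllib.request


def pat_request(host: str, token: str, path: str) -> urllib.request.Request:
    """Build a GET request authenticated with a personal access token."""
    return urllib.request.Request(
        f"https://{host}{path}",
        headers={"Authorization": f"Bearer {token}"},
    )


def oauth_token_request(
    host: str, client_id: str, client_secret: str
) -> urllib.request.Request:
    """Build the client-credentials token request for a service principal."""
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    body = urllib.parse.urlencode(
        {"grant_type": "client_credentials", "scope": "all-apis"}
    ).encode()
    return urllib.request.Request(
        f"https://{host}/oidc/v1/token",
        data=body,
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder host and credentials, for illustration only.
    req = pat_request(
        "example.cloud.databricks.com",
        "<personal-access-token>",
        "/api/2.1/unity-catalog/catalogs",
    )
    print(req.get_method(), req.full_url)
```

With a service principal, the access token returned by the token endpoint is then sent as a `Bearer` token in the same way as a personal access token.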