Databricks Unity Catalog: Your Key to Streamlined Data Access and Security
In this post, I’ll give you a quick introduction to Databricks Unity Catalog, which became generally available on Azure in late 2022.
Unity Catalog is Databricks’ unified solution for implementing data governance in the Lakehouse.
Let’s explore what this means by describing what data governance is and why it matters in today’s data-driven businesses. Then we’ll look at how Unity Catalog helps implement a unified data governance solution.
Data Governance
Data governance is the process of managing the availability, usability, integrity, and security of all the data in an enterprise. An enterprise should be able to restrict access to its data to only the users who need it; without the right access controls, it risks compromising customer data and its own reputation.
A successful implementation of a data governance solution will ensure that the data is trustworthy and not misused. It requires the ability to trace the data back to its source for authenticity and also the ability to audit the usage of the data to ensure that it is not misused. It must help implement privacy regulations such as GDPR, CCPA, etc.
Non-compliance with these regulations can expose companies to hefty penalties and damage their reputation with customers. So it’s important that companies implement the right data governance solution.
Let’s now look at the four key areas of data governance where Unity Catalog can help.
How can Unity Catalog help?
Data Access Control:
Unity Catalog allows for fine-grained access control, enabling organizations to define permissions at various levels (e.g., table, column, or row). This ensures that sensitive data is only accessible to authorized users, thereby enhancing data security and compliance.
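As a sketch of what this looks like in practice, here is a hypothetical set of grants (catalog, schema, table, and group names are placeholders I’ve made up for illustration), with a dynamic view used for column-level masking via the built-in `is_account_group_member` function:

```sql
-- Grant a group access down the object hierarchy (placeholder names).
GRANT USE CATALOG ON CATALOG sales TO `data_analysts`;
GRANT USE SCHEMA ON SCHEMA sales.crm TO `data_analysts`;
GRANT SELECT ON TABLE sales.crm.customers TO `data_analysts`;

-- Column-level masking with a dynamic view: only members of a
-- privileged group see the raw email address.
CREATE VIEW sales.crm.customers_masked AS
SELECT
  customer_id,
  CASE WHEN is_account_group_member('pii_readers')
       THEN email ELSE '***REDACTED***' END AS email
FROM sales.crm.customers;
```

You would then grant `SELECT` on the view (rather than the underlying table) to the broader audience.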
Data Audit:
Unity Catalog captures an audit trail of actions performed against your data, recording who accessed which assets and when. This is crucial for understanding how data is used and for meeting compliance and regulatory requirements.
Data Lineage:
Data lineage refers to the process of tracking and visualizing the flow of data from its origin through its various transformations and uses in a data pipeline. It provides a comprehensive view of how data moves and changes over time, enabling organizations to understand the entire lifecycle of their data.
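Lineage captured by Unity Catalog can be explored in the UI, and also queried from the Databricks system tables (assuming system tables are enabled on your metastore). The table name below is a placeholder:

```sql
-- Find upstream sources that feed a given table via recorded lineage events.
SELECT source_table_full_name, target_table_full_name, event_time
FROM system.access.table_lineage
WHERE target_table_full_name = 'sales.crm.customers'
ORDER BY event_time DESC;
```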
Data Discoverability:
Unity Catalog provides a centralized repository for metadata, making it easier to discover and classify data assets. Users can search for datasets, understand their lineage, and see how they are categorized, helping to improve transparency and organization.
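Beyond the search UI, each catalog exposes an `information_schema` that you can query to discover assets programmatically. A hypothetical search for customer-related tables (catalog name is a placeholder) might look like:

```sql
-- List tables in the 'sales' catalog whose name mentions customers.
SELECT table_catalog, table_schema, table_name, comment
FROM sales.information_schema.tables
WHERE table_name LIKE '%customer%';
```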
Advantages of Databricks Unity Catalog over working without it
Using Databricks Unity Catalog provides several key advantages over not using it. It centralizes metadata, making data discovery and management easier, and allows for fine-grained access control, enhancing security and compliance. Unity Catalog automatically tracks data lineage, offering insights into data flow and transformations, which is essential for ensuring data quality.
Additionally, it supports governance features that simplify compliance with regulations and fosters collaboration by providing a shared understanding of data assets. It also facilitates secure and efficient data sharing between teams.
In contrast, working without Unity Catalog can lead to decentralized metadata management, limited access controls, and challenges in tracking data lineage, making auditing and compliance more difficult. Teams may operate in silos, resulting in duplicated efforts and inconsistent data interpretations. Overall, Unity Catalog streamlines data governance, security, and collaboration, enhancing the efficiency and security of data management.
What is a Catalog within Databricks?
A catalog is a new concept within Databricks: a logical container within the metastore used to organize your data assets.
For example, you may decide to have one catalog per business unit or one per development environment. Each catalog may contain one or more schemas (databases), which are simply the next level of containers within catalogs.
Schemas and Databases are synonymous within Databricks.
In the past, most of the documentation referred to this as database, but now the recommendation is to use the term schema. This is just to avoid any confusion with database systems and platforms.
If you use DATABASE in SQL statements (for example, CREATE DATABASE instead of CREATE SCHEMA), it will still work, but I would encourage you to use SCHEMA. Each schema may contain one or more tables, views, or functions. There are no changes between the legacy Hive metastore and Unity Catalog in this respect.
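The hierarchy described above can be sketched with a few statements (all names here are placeholders for illustration):

```sql
-- Catalog -> schema -> table, the three levels of the hierarchy.
CREATE CATALOG IF NOT EXISTS sales;
CREATE SCHEMA IF NOT EXISTS sales.crm;
CREATE TABLE IF NOT EXISTS sales.crm.customers (
  customer_id BIGINT,
  email STRING
);
```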
How do managed tables work in Unity Catalog?
Firstly, all managed tables in Unity Catalog are Delta tables. You cannot create managed tables in Parquet, JSON, CSV, or other formats; for those you will have to use external tables.
Second, by default, the data for managed tables is written to the default storage configured for the metastore. This used to be the DBFS root with the Hive metastore, but now it is the storage account attached to the metastore during its creation. You can change this by specifying a managed location in your CREATE SCHEMA or CREATE CATALOG statements.
In the Hive metastore, when you drop a managed table, the underlying data is deleted immediately. In a Unity Catalog-enabled workspace, however, the data is retained for 30 days. This is a great addition that many people had been requesting; it takes away the fear of accidentally losing data.
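Because the data is retained, a dropped managed table can be restored with the UNDROP command, provided you act within the recovery window (table name is a placeholder):

```sql
DROP TABLE sales.crm.customers;

-- The underlying data is retained, so the table can be restored
-- within the recovery window:
UNDROP TABLE sales.crm.customers;
```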
Managed tables also benefit from automatic maintenance and performance optimizations, so Databricks recommends using managed tables wherever possible.
Finally, with this hierarchical object structure, you will have to use the three-level namespace (catalog.schema.object) to access a table, view, or function.
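A quick sketch of the three-level namespace in use (placeholder names again); you can also set a default catalog and schema for the session to shorten references:

```sql
-- Fully qualified, three-level reference:
SELECT * FROM sales.crm.customers;

-- Or set session defaults and use the short name:
USE CATALOG sales;
USE SCHEMA crm;
SELECT * FROM customers;
```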
Briefing: Unity Catalog setup
Setting up Databricks Unity Catalog involves several key steps. Here’s a brief guide to get you started:
Prerequisites:
- An Azure Databricks workspace
- An Azure Data Lake Storage Gen2 storage account
- An access connector for Azure Databricks
- The Storage Blob Data Contributor role assigned to the access connector on the storage account
Set up:
Step 1: Enable Unity Catalog
Log in to your Databricks workspace, navigate to the Admin Console, and select “Workspace Settings” to enable Unity Catalog if it’s not already enabled.
Step 2: Configure Storage Credential and External Locations
Storage credentials in Databricks Unity Catalog play a vital role in securely managing access to external data storage, ensuring that organizations can maintain robust security practices while efficiently accessing and utilizing their data assets.
You can access the account console at https://accounts.azuredatabricks.net/
Now define external locations in Unity Catalog where your data will reside (e.g., Azure Blob Storage or AWS S3).
External locations define where data is stored in external systems, allowing you to access and work with that data without moving it into Databricks.
They enable you to set up security policies and access controls, ensuring that only authorized users and services can access the data in those locations.
CREATE EXTERNAL LOCATION IF NOT EXISTS <location_name>
URL 'abfss://<container>@<StorageAccount>.dfs.core.windows.net/'
WITH (STORAGE CREDENTIAL `<Storage_Credential_name>`);
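Once the external location exists, a sketch of how you might use it: grant a group the right to create external tables there, then create a table whose data stays at that path. The location name, group, table name, and path suffix are placeholders I’ve introduced for illustration:

```sql
-- Allow a group to create external tables under this location.
GRANT CREATE EXTERNAL TABLE ON EXTERNAL LOCATION my_location TO `data_engineers`;

-- Create an external Delta table; the data remains in the external path.
CREATE TABLE sales.crm.raw_events
USING DELTA
LOCATION 'abfss://<container>@<StorageAccount>.dfs.core.windows.net/raw_events';
```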
Conclusion
Databricks Unity Catalog serves as a powerful tool for implementing effective data governance within the Lakehouse architecture. By providing a unified solution, it enhances data access, security, and compliance. With its capabilities for fine-grained access control, automated data lineage tracking, and centralized metadata management, Unity Catalog ensures that organizations can manage their data assets transparently and securely.
The importance of robust data governance cannot be overstated, as it safeguards sensitive information, ensures compliance with regulations like GDPR and CCPA, and maintains the integrity and usability of data. Unity Catalog addresses these challenges by enabling organizations to trace data back to its source, audit its usage, and manage access efficiently.
Moreover, the introduction of concepts such as catalogs and schemas simplifies the organization of data assets, while managed tables optimize performance and maintenance.
Overall, adopting Databricks Unity Catalog not only streamlines data management but also empowers organizations to harness their data responsibly, driving better business outcomes and fostering a culture of collaboration and trust in data.