Understanding the role of the Hive Metastore in Spark SQL and how Databricks organizes databases, tables and views

Rafael Rampineli
5 min read · Sep 22, 2024


First of all, we need to understand what Hive is. The Hive Metastore is a centralized repository that stores metadata about tables, partitions, and other data structures used by data processing frameworks such as Apache Hive and Apache Spark. It allows Spark SQL to access and manage data stored in formats like Parquet, ORC, and others.

Understanding the role of the Hive Metastore in Spark

Our data is stored in our data lake as files of various types, such as CSV, JSON and Parquet. For Spark to treat this data as tables and columns, we need to register it in a metastore.

A metastore is simply storage for metadata about the data files: things like the location of the files, the format of the data, column names, and so on. Spark uses the metastore provided by the Apache Hive project, called the Hive Metastore, which is the most commonly used metastore in the data lake space.

When it comes to choosing the storage for the Hive Metastore, we have two options: the default Databricks-managed metastore, or an external database of our own. For the external option you can choose between Azure SQL Database, MySQL, MariaDB and a few others.
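If you go with an external metastore, the connection is usually wired up in the cluster's Spark configuration using the standard Hive connection properties. The snippet below is only a rough sketch with placeholder values; the exact driver, URL and metastore version are assumptions that depend on the backend you pick and your Databricks Runtime.

```
# Illustrative cluster Spark config for an external Hive Metastore
# (placeholder values; adjust the driver and URL for Azure SQL, MySQL, etc.)
spark.hadoop.javax.jdo.option.ConnectionURL jdbc:sqlserver://<server>.database.windows.net:1433;database=<metastore-db>
spark.hadoop.javax.jdo.option.ConnectionDriverName com.microsoft.sqlserver.jdbc.SQLServerDriver
spark.hadoop.javax.jdo.option.ConnectionUserName <username>
spark.hadoop.javax.jdo.option.ConnectionPassword <password>
# The client version must match the schema of your external metastore database
spark.sql.hive.metastore.version 2.3.9
spark.sql.hive.metastore.jars builtin
```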

Once we have registered our tables in the Hive Metastore, we can use Spark SQL to access them just as we would in a relational database.

Just to summarize: the data is usually stored in object storage such as ADLS (Azure Data Lake Storage). The Hive Metastore keeps the information about the files, such as their location, the table and column definitions, and so on.

When you run a Spark SQL command, Spark uses the metastore to apply the schema and read the underlying files, as if we were querying an ordinary relational table.
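As a minimal sketch of that flow in a notebook (the table and column names are made up for illustration, and a table called sales is assumed to be registered already):

```python
# Query a table that is registered in the Hive Metastore.
# Spark looks up the schema and the file locations in the metastore,
# then reads the underlying files to answer the query.
result = spark.sql("""
    SELECT region, SUM(amount) AS total_amount
    FROM sales
    GROUP BY region
""")
result.show()

# The same metastore entry is reachable through the DataFrame API as well.
spark.table("sales").printSchema()
```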

Let’s look at how Databricks organizes databases, tables and views.

A Databricks workspace can have a number of databases, which are also referred to as schemas. Each database can contain a number of tables and views. Tables are essentially the structure given to data stored in object storage.
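A small sketch of that hierarchy in a notebook (the database name is hypothetical):

```python
# Create a database (schema) in the metastore and inspect it.
spark.sql("CREATE DATABASE IF NOT EXISTS demo_db")
spark.sql("SHOW DATABASES").show()

# List the tables and views registered under that database.
spark.sql("SHOW TABLES IN demo_db").show()
```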

There are two types of tables in Spark: the first is called a Managed Table, and the other is called an External (or Unmanaged) Table.

In the case of a Managed Table, Spark maintains both the metadata in the Hive Metastore and the data files associated with the table, which are stored in ADLS.
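A minimal sketch of a managed table (the DataFrame and the names are hypothetical); note that Spark decides where the data files live:

```python
# Build a small DataFrame and save it as a managed table.
# Spark writes the data files to the database's default location
# and records the schema and location in the Hive Metastore.
df = spark.createDataFrame(
    [("EMEA", 100.0), ("APAC", 250.0)],
    ["region", "amount"],
)
df.write.mode("overwrite").saveAsTable("demo_db.sales_managed")

# Dropping a managed table removes the metadata AND the data files.
# spark.sql("DROP TABLE demo_db.sales_managed")
```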

In the case of External Tables, Spark manages only the metadata, and we manage the data files ourselves.

What I mean by that is that for External Tables we specify the location of the data files; Spark doesn't decide it for itself.

Another key difference is that dropping an External Table doesn't delete the data files, whereas dropping a Managed Table deletes the data files along with the table definition.

Think of an external table as a way to access data that’s stored somewhere else, like in cloud storage (e.g., AWS S3, Azure Blob Storage). When you create an external table, you tell Databricks where to find the data instead of storing it in Databricks itself.
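For comparison, here is a sketch of an external table pointing at files we manage ourselves (the storage path and names are placeholders):

```python
# Register an external table over files that already live in our own
# storage location. Spark records only the metadata for this table.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo_db.sales_external (region STRING, amount DOUBLE)
    USING PARQUET
    LOCATION 'abfss://data@mydatalake.dfs.core.windows.net/raw/sales/'
""")

# Dropping an external table removes only the metastore entry;
# the Parquet files at the LOCATION above are left untouched.
# spark.sql("DROP TABLE demo_db.sales_external")
```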

I hope that explained the difference between Managed and External Tables.

Views can be built on top of tables with a selection of the data. For example, you could apply a filter to a table, or select just a subset of its columns, and create a view with that result.
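A short sketch, reusing the hypothetical managed table from above:

```python
# Create a view that exposes only a filtered selection of columns.
spark.sql("""
    CREATE OR REPLACE VIEW demo_db.emea_sales AS
    SELECT region, amount
    FROM demo_db.sales_managed
    WHERE region = 'EMEA'
""")

# The view can be queried like any other table.
spark.sql("SELECT * FROM demo_db.emea_sales").show()
```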

Key roles of the Hive Metastore:

  1. Metadata Storage: It stores information about database schemas, tables, columns, data types, and partitioning. This helps Spark SQL understand the structure of the data (you can inspect this metadata directly, as shown in the sketch after this list).
  2. Data Location: It provides the location of data files, which allows Spark to efficiently read the data when executing queries.
  3. Compatibility: By using the Hive Metastore, Spark can interact with existing Hive tables and datasets, enabling interoperability between Spark and Hive.
  4. Catalog Management: It supports multiple catalogs, allowing users to organize and manage different datasets in a structured way.
  5. Schema Evolution: The metastore can handle schema changes over time, accommodating updates without breaking existing queries.
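As a quick illustration of points 1 and 2, the metadata the metastore holds for a table can be inspected directly (the table name, as before, is hypothetical):

```python
# Show the columns, data types, storage format and file location
# that the Hive Metastore has recorded for a table.
spark.sql("DESCRIBE EXTENDED demo_db.sales_managed").show(truncate=False)
```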

Insights to keep in mind:

  1. Centralized Metadata Management: The Hive Metastore acts as a critical component for managing metadata, allowing Spark SQL to treat raw data files as structured tables. This centralized approach simplifies data governance and organization.
  2. Interoperability: By leveraging the Hive Metastore, Spark can integrate seamlessly with existing Hive tables and datasets. This compatibility is particularly valuable for organizations transitioning from Hive to Spark or operating in hybrid environments.
  3. Storage Options: The choice between the Databricks-managed metastore and an external metastore (such as Azure SQL or MySQL) offers flexibility, allowing users to select the solution that best fits their infrastructure and scalability needs.
  4. Table Management: The distinction between Managed and External Tables is crucial for understanding how Spark interacts with data. Managed Tables give Spark full control, while External Tables give users flexibility and control over their data locations.
  5. Schema Evolution: The Hive Metastore's ability to handle schema changes is essential for maintaining the integrity of data processing workflows, ensuring that updates can occur without disrupting existing applications.

The Hive Metastore is a vital element of Spark SQL, facilitating efficient metadata management and streamlined data access. It serves as a centralized repository, allowing structured interaction with data, akin to traditional databases. The option to choose between managed and external storage enhances its versatility across various environments. Understanding its functionality is crucial for organizations aiming to optimize data processing workflows and fully leverage their data lakes. As data complexity increases, the Hive Metastore will remain essential for effective data management.

