Hive vs Iceberg Tables in AWS Athena

10 May 2024

For businesses that rely heavily on data analytics or run data warehouses, the table format they use is critical for optimizing data operations. How tables are designed affects how fast, accurate, and reliable data pipelines are, which in turn affects how well businesses can make decisions, improve operational efficiency, and gain an edge in the market. In this blog post, we'll examine how Hive and Iceberg tables differ in AWS Athena and how they affect your data workflows, especially when you use dbt. Because dbt transforms existing tables into new datasets, the choice of table format directly influences how efficient, performant, consistent, scalable, and flexible your transformation processes are. Choosing the right table format for dbt will help you get the most out of your data analysis and achieve better business results.

What are Hive Tables?

The Hive table format is compatible with many technologies that enable distributed data processing (e.g. Presto, Spark). In this article, we are talking about the Hive table format (backed by the Hive metastore), not the Hive processing engine you may know from the Hadoop world. That engine, which uses MapReduce under the hood, is less prevalent today because better technologies are available for processing this data.

Hive tables are defined and managed through a central metadata repository, the Hive metastore. Underneath, a Hive table consists of one or more column-oriented data files (usually Parquet), which various technologies can then access together as a table.
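As a minimal sketch of what this looks like in Athena (the database, table, columns, and S3 location below are all hypothetical), the DDL registers Parquet files on S3 as a Hive table in the metastore:

```sql
-- Hypothetical names; adjust to your own environment.
CREATE EXTERNAL TABLE analytics.page_views (
    view_id  string,
    user_id  string,
    url      string
)
PARTITIONED BY (event_date string)
STORED AS PARQUET
LOCATION 's3://my-data-lake/page_views/';
```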

What are the limitations of Hive Tables?

Hive tables have a potential drawback in how they can be manipulated. Some of you may have used a database before and know the concept of transactional workflows. In a database, you can add new data to an existing table, modify it, or delete it. This transactional way of working is not supported when using the Hive metastore. Suppose that you've stored data for a specific day in your underlying storage. Later, you find out that a column needs to be changed. Since there is no way to execute an update statement on the existing data, you will have to load all of it again (with a technology of your choice), make the change, and completely overwrite it at the same location.
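To make this concrete with the hypothetical table from above: Athena rejects a row-level UPDATE against a Hive table, and the usual workaround is to rewrite the data, for example with a CTAS statement into a new table and location:

```sql
-- Not supported on a Hive table in Athena (row-level changes require Iceberg):
-- UPDATE analytics.page_views SET url = lower(url) WHERE event_date = '2024-05-01';

-- Workaround: rewrite the data into a new table and location with CTAS.
CREATE TABLE analytics.page_views_fixed
WITH (
    format            = 'PARQUET',
    external_location = 's3://my-data-lake/page_views_fixed/',
    partitioned_by    = ARRAY['event_date']
) AS
SELECT view_id, user_id, lower(url) AS url, event_date
FROM analytics.page_views;
```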

What are Iceberg Tables and how do they differ from Hive tables?

Iceberg tables and Hive tables are very similar. Some people believe that the data itself is stored in a different way, but in fact only the metadata has a different storage and processing mechanism. Iceberg tables rely on Apache Iceberg, an open table format that anyone can use. This doesn't tell you much, right? The main thing you need to know is that the data files themselves (usually in Parquet format) do not change at all. So, how do Iceberg and Hive differ? Iceberg uses a transaction log: the transactions that happen on an Iceberg table are kept alongside the data itself, allowing a user to run transactional (ACID-compliant) workflows that are not very different from those available in a relational database.
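As a sketch of what this looks like in Athena (again with hypothetical names), the table_type property below is what makes a table an Iceberg table, after which row-level, ACID-style statements become available:

```sql
-- Create an Iceberg table; the data files are still Parquet on S3.
CREATE TABLE analytics.page_views_iceberg (
    view_id     string,
    user_id     string,
    url         string,
    event_date  date
)
PARTITIONED BY (day(event_date))
LOCATION 's3://my-data-lake/page_views_iceberg/'
TBLPROPERTIES ('table_type' = 'ICEBERG');

-- Row-level changes are recorded in Iceberg's transaction log
-- instead of requiring a manual rewrite of the table.
UPDATE analytics.page_views_iceberg
SET url = lower(url)
WHERE event_date = DATE '2024-05-01';

DELETE FROM analytics.page_views_iceberg
WHERE user_id = 'test-user';
```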

Iceberg not only supports transactional workflows on column-oriented file formats, but also allows going back to previous states of the data. Iceberg uses the transaction log to access specific states of the data, while the data itself stays physically unchanged. So, suppose you remove a row from an existing Iceberg table. A new entry is added to the transaction log describing the removal and where it happened. When a user queries the table in its most recent state, Iceberg knows from the transaction log that the removed row is no longer part of the table. Does this mean the row was erased underneath? No: when the same user queries an earlier state of the table, it shows up again. In other words, only the metadata changes with each transaction, not the data, which lets users quickly and efficiently query previous states of the data.
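In Athena, this is exposed through time-travel queries. A small sketch against the hypothetical table from above (the timestamp and snapshot id are purely illustrative):

```sql
-- Query the table as it looked at an earlier point in time.
SELECT count(*)
FROM analytics.page_views_iceberg
FOR TIMESTAMP AS OF TIMESTAMP '2024-05-01 00:00:00 UTC';

-- Or pin a specific snapshot id taken from the table's history.
SELECT count(*)
FROM analytics.page_views_iceberg
FOR VERSION AS OF 949530903748831860;
```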

In practice, this means Iceberg does not rewrite or alter your existing data files. When we partition data, we place it in a specific folder structure that allows the processing technology to access it efficiently. For instance, you can store your data by day (one folder per day) so that later processing only reads the folders it actually needs. If you change the partitioning of an Iceberg table, newly written data follows the new folder structure, while the files that are already there stay in place.
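For example, with the day-based layout sketched above, a filter on the partition column means Athena only has to read the matching folders (table and column names remain hypothetical):

```sql
-- Only the folder(s) for 2024-05-01 are scanned, not the whole table.
SELECT count(*)
FROM analytics.page_views
WHERE event_date = '2024-05-01';
```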

How does this relate to AWS Athena and dbt?

By default, AWS Athena uses the Hive table format and metastore, so dbt creates all objects in this format when you use it with Athena. However, the dbt adapter for Athena also supports Iceberg tables: you set the table format in the dbt config (at the project or model level), and dbt does the rest automatically (see the paragraphs above for the differences and benefits of Iceberg). In previous projects we also saw that, without Iceberg, dbt could leave tables locked while writing to them or when a query failed, making those tables unavailable when a dbt run failed. That was a big problem, and Iceberg helps you avoid it as well.
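A minimal sketch of such a model config, assuming the dbt-athena adapter and a hypothetical upstream model stg_page_views; the table_type setting is what switches the materialization from the default Hive format to Iceberg:

```sql
-- models/marts/page_views_iceberg.sql (hypothetical model)
{{ config(
    materialized='table',
    table_type='iceberg',
    format='parquet',
    partitioned_by=['event_date']
) }}

select view_id, user_id, url, event_date
from {{ ref('stg_page_views') }}
```

The same setting can also be applied project-wide in dbt_project.yml (e.g. +table_type: iceberg under the relevant models block), so every model in that folder is materialized as an Iceberg table.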

What are the Practical Implications for Data Teams?

If you use dbt with Athena, think carefully about how you design your tables. Consider your query performance requirements, how often your schema changes, and how much your data will grow when choosing between Hive and Iceberg tables for your scenario. To help you choose, we have created this handy overview.

| | Iceberg | Hive |
| --- | --- | --- |
| Performance | Iceberg's design allows for more precise scanning of data files based on predicates, potentially reducing the amount of data scanned and improving query performance. Efficient transactional workflows can also speed up certain processes. | Query performance may be slower, but the impact is limited when the data rarely needs to be accessed transactionally. |
| Robustness | When using Iceberg in combination with dbt, a table remains accessible while being written to and is not left incomplete when a specific query fails. | A failed or interrupted dbt run can leave a table locked or unavailable. |
| Schema evolution | Supports non-destructive schema evolution. | Schema evolution can be challenging and may require complex alterations to the underlying data files. |
| Ecosystem integration | Growing support and adoption within the data ecosystem. | Well integrated and compatible with a wide range of existing Big Data ecosystems and tools. |

Whether you need to migrate, consolidate, enrich, or harmonize your data, Datashift can help you achieve your goals and leverage the full potential of your data assets. If you want to learn more about how Datashift can help with your data challenges, contact us. We would love to hear from you and show you how to make your data work for you.