Introduction to Apache Iceberg | by Pier Paolo Ippolito | Feb 2024



Thanks to the advent of Data Lakes, easily accessible through cloud providers such as GCP, Azure, and AWS, more and more organizations have been able to cheaply store their unstructured data. However, Data Lakes come with several limitations:

  • Inconsistent reads can happen when mixing batch and streaming workloads or appending new data.
  • Fine-grained modification of existing data can become complex (e.g. to meet GDPR requirements).
  • Performance degrades when handling millions of small files.
  • No ACID (Atomicity, Consistency, Isolation, Durability) transaction support.
  • No schema enforcement/evolution.

To alleviate these issues, Apache Iceberg was created at Netflix in 2017. Apache Iceberg is a table format that provides an additional layer of abstraction to support ACID transactions, time travel, etc., while working with various types of data sources and workloads. The main objective of a table format is to define a protocol for how to best manage and organize all the files composing a table. Apart from Apache Iceberg, other currently popular open table formats are Hudi and Delta Lake.

Apache Iceberg and Delta Lake, for example, share most of the same characteristics, although Iceberg also supports additional file formats such as ORC and Avro. Delta Lake, on the other hand, is currently heavily supported by Databricks and the open-source community, and provides a greater variety of APIs (Figure 1).

Figure 1: Apache Iceberg vs Delta Lake (Image by Author).

Over the years, Apache Iceberg has been open-sourced by Netflix, and many other companies such as Snowflake and Dremio have decided to invest in the project.

Each Apache Iceberg table follows a three-layer architecture:

  • Iceberg Catalog
  • Metadata Layer (with metadata files, manifest lists, and manifest files)
  • Data Layer (the data files themselves, e.g. Parquet, ORC, or Avro)
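The read path through these layers can be sketched as a toy model: the catalog points to the table's current metadata file, which points to a manifest list, whose manifest files track the underlying data files. This is a minimal illustrative sketch only — real Iceberg stores these as JSON and Avro files in object storage, and all names and paths below are hypothetical:

```python
# Toy model of Iceberg's layered read path (illustrative only; all
# file names below are made up, not real Iceberg artifacts).

# Iceberg Catalog: maps a table name to its current metadata file.
catalog = {"db.events": "metadata/v2.metadata.json"}

# Metadata Layer: a metadata file points to a manifest list, and each
# manifest list points to manifest files that track data files.
metadata_files = {
    "metadata/v2.metadata.json": {"manifest_list": "metadata/snap-2.avro"},
}
manifest_lists = {
    "metadata/snap-2.avro": ["metadata/manifest-a.avro", "metadata/manifest-b.avro"],
}
manifests = {
    "metadata/manifest-a.avro": ["data/file-001.parquet"],
    "metadata/manifest-b.avro": ["data/file-002.parquet", "data/file-003.parquet"],
}

def resolve_data_files(table_name: str) -> list[str]:
    """Walk catalog -> metadata file -> manifest list -> manifests -> data files."""
    metadata = metadata_files[catalog[table_name]]
    data_files = []
    for manifest in manifest_lists[metadata["manifest_list"]]:
        data_files.extend(manifests[manifest])
    return data_files

print(resolve_data_files("db.events"))
# → ['data/file-001.parquet', 'data/file-002.parquet', 'data/file-003.parquet']
```

Because every snapshot keeps its own metadata file and manifest list, a query engine can plan a scan (or a time-travel read of an older snapshot) by walking this chain, touching only the files it needs.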