Optimizing Data Pipelines with Apache Iceberg and S3: How We Accelerated Parquet Processing


Overview

Initial task and context

We faced a typical data engineering task: processing a stream of Parquet files containing data on the customer's internal technical processes. The key requirement was to extract metadata from these files so that the data could be queried faster and more conveniently later on.

We chose Apache Iceberg as our main tool, a table format for metadata management that has proven itself in our projects. Initially, the pipeline looked like this (a rough sketch in code follows the list):

▪️ Parsing and saving metadata in a Postgres database.

▪️ Loading files into memory (in row-group-sized chunks).

▪️ Splitting files into "chunks".

▪️ Adding a file_name column to identify the source.

▪️ Writing and registering files in Iceberg.

▪️ Marking files as read.
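A minimal sketch of that first version, assuming a PyArrow reader and PyIceberg's append API (process_file, the catalog name, and the table identifier are illustrative placeholders, not the production code):

import pyarrow as pa
import pyarrow.parquet as pq
from pyiceberg.catalog import load_catalog


def process_file(file_path: str, file_name: str, table_identifier: str) -> None:
    # Hypothetical sketch of the original pipeline: load, chunk, tag, re-write.
    catalog = load_catalog("default")
    iceberg_table = catalog.load_table(table_identifier)

    parquet_file = pq.ParquetFile(file_path)
    for i in range(parquet_file.num_row_groups):
        # Load one row group into memory at a time.
        chunk = parquet_file.read_row_group(i)
        # Add a file_name column so every row can be traced back to its source file.
        chunk = chunk.append_column(
            "file_name", pa.array([file_name] * chunk.num_rows, type=pa.string())
        )
        # Re-write the chunk through Iceberg, which stores it as a new data file.
        iceberg_table.append(chunk)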

At first glance, this is a standard scheme. But scaling revealed serious problems.

Problems of the first implementation

When we switched to parallel loading, Iceberg started throwing errors. The documentation suggests solving this with retries (optimistic concurrency), i.e. repeating the write attempt when a commit conflict occurs. However, the Python library (PyIceberg) did not provide this in a usable form: on every conflict the system wasted resources on unnecessary data-loading operations in S3, so processing time fell short of expectations, and we had to implement the retry logic ourselves.

Finding a solution: separation of operations

We decided to rebuild the process by splitting it into two stages:

▪️ Uploading data (in our case via PyArrow).

▪️ Updating metadata (via PyIceberg).

How it works:

▪️ The files are written separately to S3.

▪️ PyIceberg only registers changes in metadata: it checks for uniqueness, updates links, but does not touch the files themselves.
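For the first stage, writing a Parquet file straight into S3 with PyArrow is enough; a minimal sketch, where the region, bucket path, and sample data are purely illustrative:

import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

# Hypothetical target location; in practice it comes from the pipeline configuration.
s3 = fs.S3FileSystem(region="us-east-1")
target_path = "bucket/raw/2024/data_part_1.parquet"

batch = pa.table({"event_id": [1, 2, 3], "payload": ["a", "b", "c"]})

# The file lands in S3 directly; Iceberg is not involved at this stage.
pq.write_table(batch, target_path, filesystem=s3)

The second stage then only has to tell Iceberg that the file exists, which is what the snippet below does.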

from pyiceberg.io.pyarrow import parquet_files_to_data_files
from pyiceberg.manifest import DataFile
from pyiceberg.table import Table as IcebergTable
from tenacity import retry, stop_after_attempt, wait_random_exponential

.
.
.


table = catalog.load_table((self.namespace, table_name))

data_file = next(
    parquet_files_to_data_files(
        io=table.io,
        file_paths=iter([file_path]),
        table_metadata=table.metadata,
    )
)

.
.
.

@retry(
    wait=wait_random_exponential(multiplier=0.5, max=2),
    stop=stop_after_attempt(20),
    reraise=True,
    before_sleep=_log_retry,
)
def _add_data_file_with_retry(
    self, table: IcebergTable, data_file: DataFile
) -> None:
    # Re-read the latest table metadata so the commit is based on the current snapshot.
    table.refresh()
    with table.transaction() as transaction:
        transaction.update_snapshot().fast_append().append_data_file(
            data_file
        ).commit()

self._add_data_file_with_retry(table=table, data_file=data_file)

Note that when a table is created and data is added via PyIceberg, two logical "folders" are formed:

▪️ metadata/ — stores meta information (schemas, partitions, versions);

▪️ data/ — holds the data files in S3; the files can live anywhere in the bucket (or even in another bucket), since Iceberg only records their paths in the metadata.

For example:

"The data_part_1.parquet file is located at s3://bucket/raw/2024/, and it contains data for January 2024."

This gives you flexibility: you don't have to move files inside S3, just update the metadata.
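Recent PyIceberg releases also expose a higher-level helper for exactly this pattern, add_files, which registers existing Parquet files in place; a minimal sketch, assuming the file already sits in the bucket and its schema matches the table (catalog and table names are illustrative):

from pyiceberg.catalog import load_catalog

# Illustrative catalog and table names.
catalog = load_catalog("default")
table = catalog.load_table("raw.technical_events")

# Only metadata is written: the Parquet file itself is never copied or moved.
table.add_files(["s3://bucket/raw/2024/data_part_1.parquet"])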

New scheme: minimum operations, maximum speed

We abandoned the cumbersome download-and-rewrite scheme and switched to a simplified pipeline:

▪️ Parsing and saving metadata in a Postgres database (as before).

▪️ Registering files in Iceberg under their existing paths in S3.

What has changed:

▪️ There is no need to download files from S3 for processing.

▪️ Iceberg analyzes metadata directly in the bucket, without moving data.

▪️ Operations are performed in fractions of a second; most of the time is spent on inter-service interaction.
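As an illustration, planning a scan in PyIceberg touches only the metadata layer; the Parquet files in S3 are read only when rows are actually materialized (the file_name filter below reuses the column added earlier and the value is only an example):

# Planning a scan reads Iceberg metadata only, not the Parquet payload.
scan = table.scan(row_filter="file_name = 'data_part_1.parquet'")
for task in scan.plan_files():
    print(task.file.file_path, task.file.record_count)

# Data is fetched from S3 only at this point.
arrow_table = scan.to_arrow()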

After the introduction of the new scheme:

▪️ The processing speed has increased significantly: an operation that took minutes is now performed in half a second.

▪️ The load on the infrastructure has dropped: there are no excess data movements, and memory and network are no longer overloaded.

▪️ The flexibility of the system has increased: files no longer need to be moved inside S3, and Iceberg automatically updates the references in its metadata.

▪️ Scalability has improved: the system is ready for volume growth without an urgent hardware upgrade.

This case shows that sometimes the solution to a problem lies not in adding resources, but in revising the architecture. The Iceberg + S3 combination allowed us to:

▪️ reduce data processing time;

▪️ reduce infrastructure costs;

▪️ leave a foundation for future growth.

The system now runs faster, consumes fewer resources, and remains understandable for support.
