Manage concurrent write conflicts in Apache Iceberg on the AWS Glue Data Catalog

AWS Big Data

Consider a streaming pipeline ingesting real-time event data while a scheduled compaction job runs to optimize file sizes. Commits can still fail if the latest metadata is updated after the base metadata version is established. To complete a write, the writer generates new metadata files and commits them to the catalog.
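The conflict described in this excerpt can be sketched as an optimistic compare-and-swap against a catalog. This is a minimal illustration, not the Iceberg or AWS Glue API: `InMemoryCatalog`, `CommitConflict`, and `commit_with_retry` are hypothetical names, and the integer version stands in for a metadata file pointer.

```python
import threading


class CommitConflict(Exception):
    """Raised when the catalog moved past the writer's base metadata version."""


class InMemoryCatalog:
    """Hypothetical stand-in for a catalog holding one table's metadata."""

    def __init__(self):
        self._lock = threading.Lock()
        self.version = 0      # stands in for the current metadata pointer
        self.files = []       # stands in for the table's data file list

    def load(self):
        # Return the base version and a copy of the current state.
        with self._lock:
            return self.version, list(self.files)

    def commit(self, base_version, new_files):
        # Atomic compare-and-swap: succeed only if nobody committed since load().
        with self._lock:
            if self.version != base_version:
                raise CommitConflict(
                    f"base version {base_version} is stale, latest is {self.version}"
                )
            self.files = list(new_files)
            self.version += 1
            return self.version


def commit_with_retry(catalog, make_files, max_retries=3):
    """Reload the latest metadata and reapply the change after each conflict."""
    for attempt in range(max_retries + 1):
        base_version, base_files = catalog.load()
        new_files = make_files(base_files)   # "generate new metadata"
        try:
            return catalog.commit(base_version, new_files)
        except CommitConflict:
            if attempt == max_retries:
                raise
```

In this sketch a streaming writer and a compaction job would each call `commit_with_retry`; whichever loses the race reloads the latest state and tries again, up to `max_retries` times.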

Build a high-performance quant research platform with Apache Iceberg

AWS Big Data

Iceberg offers distinct advantages through its metadata layer over Parquet, such as improved data management, performance optimization, and integration with various query engines. Iceberg's table format separates data files from metadata files, enabling efficient data modifications without full dataset rewrites.

Use open table format libraries on AWS Glue 5.0 for Apache Spark

AWS Big Data

Branching: Branches are independent lineages of snapshot history, each pointing to the head of its lineage. An Iceberg table’s metadata stores a history of snapshots, which is updated with each transaction. Iceberg implements features such as table versioning and concurrency control through the lineage of these snapshots.
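The idea of a branch as a named pointer to the head of a snapshot lineage can be sketched with a toy data model. This is an assumption-laden illustration, not Iceberg's metadata schema or API: `Snapshot`, `TableMetadata`, `commit`, `create_branch`, and `lineage` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Snapshot:
    snapshot_id: int
    parent_id: Optional[int]   # None for the first snapshot


@dataclass
class TableMetadata:
    """Toy model: a snapshot history plus named refs (branches) to heads."""
    snapshots: Dict[int, Snapshot] = field(default_factory=dict)
    refs: Dict[str, int] = field(default_factory=dict)
    next_id: int = 1

    def commit(self, branch: str = "main") -> int:
        # Each transaction adds a snapshot whose parent is the branch head,
        # then moves the branch pointer to the new snapshot.
        snap = Snapshot(self.next_id, self.refs.get(branch))
        self.snapshots[snap.snapshot_id] = snap
        self.refs[branch] = snap.snapshot_id
        self.next_id += 1
        return snap.snapshot_id

    def create_branch(self, name: str, from_branch: str = "main") -> None:
        # A new branch starts at the same head as the source branch.
        self.refs[name] = self.refs[from_branch]

    def lineage(self, branch: str) -> List[int]:
        # Walk parent pointers from the branch head back to the root.
        out: List[int] = []
        current = self.refs.get(branch)
        while current is not None:
            out.append(current)
            current = self.snapshots[current].parent_id
        return out
```

After two commits on `main`, creating a branch and committing to it leaves the two lineages sharing their early history while diverging at the head.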

Run Apache XTable in AWS Lambda for background conversion of open table formats

AWS Big Data

Central to a transactional data lake are open table formats (OTFs) such as Apache Hudi, Apache Iceberg, and Delta Lake, which act as a metadata layer over columnar formats. XTable isn’t a new table format but provides abstractions and tools to translate the metadata associated with existing formats.

Monitoring Apache Iceberg metadata layer using AWS Lambda, AWS Glue, and AWS CloudWatch

AWS Big Data

In this blog post, we’ll discuss how the metadata layer of Apache Iceberg can be used to make data lakes more efficient. You will learn about an open-source solution that can collect important metrics from the Iceberg metadata layer. This ensures that each change is tracked and reversible, enhancing data governance and auditability.

Amazon OpenSearch Service Under the Hood : OpenSearch Optimized Instances(OR1)

AWS Big Data

The following diagram illustrates an indexing flow involving a metadata update in OR1. During indexing operations, individual documents are indexed into Lucene and also appended to a write-ahead log, also known as a translog. In the event of an infrastructure failure, an OpenSearch domain can end up losing one or more nodes.

Apache Iceberg optimization: Solving the small files problem in Amazon EMR

AWS Big Data

Iceberg tables store metadata in manifest files. As the number of data files increases, the amount of metadata stored in these manifest files also increases, leading to longer query planning time. The query runtime also increases because it’s proportional to the number of data or metadata file read operations.
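One common way to reduce the file count is to group small files into bins close to a target size and rewrite each bin as one file. The sketch below plans such groups with a first-fit-decreasing heuristic. It is an illustration only: `plan_compaction` and the 512 MB target are assumptions, not the procedure or defaults from the article or from Iceberg.

```python
from typing import List


def plan_compaction(file_sizes_mb: List[int], target_mb: int = 512) -> List[List[int]]:
    """Group file sizes into bins no larger than target_mb (first-fit decreasing)."""
    groups = []  # each entry is [running_total_mb, [member sizes]]
    for size in sorted(file_sizes_mb, reverse=True):
        for group in groups:
            if group[0] + size <= target_mb:
                group[0] += size
                group[1].append(size)
                break
        else:
            # No existing bin has room, so open a new one.
            groups.append([size, [size]])
    return [members for _, members in groups]


# Five input files collapse into two planned output files.
plan = plan_compaction([300, 200, 100, 400, 12], target_mb=512)
print(len(plan), plan)
```

Fewer output files means fewer manifest entries to scan during planning and fewer read operations at query time, which is the cost the excerpt describes.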