2023, Metadata and Snapshot - Data Leaders Brief

Build a high-performance quant research platform with Apache Iceberg

AWS Big Data

JANUARY 9, 2025

Iceberg offers distinct advantages through its metadata layer over Parquet, such as improved data management, performance optimization, and integration with various query engines. Iceberg's table format separates data files from metadata files, enabling efficient data modifications without full dataset rewrites.

Metadata

Metadata Snapshot Cost-Benefit Optimization

Run Apache XTable in AWS Lambda for background conversion of open table formats

AWS Big Data

NOVEMBER 26, 2024

Central to a transactional data lake are open table formats (OTFs) such as Apache Hudi, Apache Iceberg, and Delta Lake, which act as a metadata layer over columnar formats. Apache XTable was originally open sourced in November 2023 under the name OneTable, with contributions from OneHouse among others, and was licensed under Apache 2.0.

Metadata

Metadata Data Lake Snapshot Data Warehouse

Amazon OpenSearch Service H1 2023 in review

AWS Big Data

AUGUST 23, 2023

Since its release in January 2021, the OpenSearch project has released 14 versions through June 2023. In this post, we provide a review of all the exciting features released in OpenSearch Service in the first half of 2023. In July 2023, we previewed support for a third collection type: vector search.

Snapshot

Snapshot Dashboards Visualization Metrics

Migrate an existing data lake to a transactional data lake using Apache Iceberg

AWS Big Data

OCTOBER 3, 2023

This means the data files in the data lake aren’t modified during the migration and all Apache Iceberg metadata files (manifests, manifest files, and table metadata files) are generated outside the purview of the data. In this method, the metadata are recreated in an isolated environment and colocated with the existing data files.

Data Lake

Data Lake Metadata Snapshot Recreation/Entertainment

Use Amazon Athena with Spark SQL for your open-source transactional table formats

AWS Big Data

JANUARY 24, 2024

These formats enable ACID (atomicity, consistency, isolation, durability) transactions, upserts, and deletes, and advanced features such as time travel and snapshots that were previously only available in data warehouses. It will never remove files that are still required by a non-expired snapshot.

Snapshot

Snapshot Data Lake Metadata Optimization

Use Apache Iceberg in a data lake to support incremental data processing

AWS Big Data

MARCH 2, 2023

Apache Iceberg is an open table format for very large analytic datasets, which captures metadata information on the state of datasets as they evolve and change over time. Apache Iceberg addresses customer needs by capturing rich metadata information about the dataset at the time the individual data files are created.

Data Lake

Data Lake Data Processing Metadata Snapshot

Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes

AWS Big Data

MAY 24, 2023

RIO is really great",date("2023-04-06"),2023)""")

You can check the new snapshot is created after this append operation by querying the Iceberg snapshot:

spark.sql("""SELECT * FROM dev.db.amazon_reviews_iceberg.snapshots""").show()

We expire the old snapshots from the table and keep only the last two.
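The retention idea in this excerpt (keep only the last two snapshots) can be sketched in plain Python. This is a toy model, not Iceberg's implementation: the `Snapshot` class, the `expire_snapshots` function, and the file names are all hypothetical, and a snapshot is reduced to the set of data files it references. The point it illustrates is that a file becomes deletable only when no retained snapshot still needs it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical stand-in for a table snapshot (not an Iceberg class)."""
    snapshot_id: int
    committed_at: int       # commit time; larger means newer
    data_files: frozenset   # data files reachable from this snapshot


def expire_snapshots(snapshots, retain_last=2):
    """Return (kept snapshots, data files that are safe to delete)."""
    if retain_last < 1:
        raise ValueError("retain_last must be at least 1")
    ordered = sorted(snapshots, key=lambda s: s.committed_at)
    kept = ordered[-retain_last:]
    expired = ordered[:-retain_last]
    # Files referenced by any retained snapshot must survive.
    still_needed = set().union(*(s.data_files for s in kept))
    candidates = set().union(*(s.data_files for s in expired))
    return kept, candidates - still_needed


# Illustrative history: each commit adds one file and drops an older one.
history = [
    Snapshot(1, 100, frozenset({"a.parquet"})),
    Snapshot(2, 200, frozenset({"a.parquet", "b.parquet"})),
    Snapshot(3, 300, frozenset({"b.parquet", "c.parquet"})),
    Snapshot(4, 400, frozenset({"c.parquet", "d.parquet"})),
]

kept, removable = expire_snapshots(history, retain_last=2)
print([s.snapshot_id for s in kept])  # [3, 4]
print(sorted(removable))              # ['a.parquet']
```

Here `b.parquet` belongs to an expired snapshot but is still referenced by snapshot 3, so only `a.parquet` is reported as removable.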

Data Lake

Data Lake Snapshot Metadata Optimization

Apache Iceberg optimization: Solving the small files problem in Amazon EMR

AWS Big Data

OCTOBER 3, 2023

Iceberg tables store metadata in manifest files. As the number of data files increases, the amount of metadata stored in these manifest files also increases, leading to longer query planning time. The query runtime also increases because it's proportional to the number of data or metadata file read operations.
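As a rough illustration of why fewer, larger files help, the sketch below groups small files into rewrite batches close to a target size. The function name `plan_compaction`, the 512 MB target, and the first-fit-decreasing strategy are assumptions made for this example; it is not Iceberg's or Amazon EMR's actual compaction logic.

```python
def plan_compaction(file_sizes_mb, target_mb=512):
    """Group small files into batches whose total size stays within target_mb.

    Greedy first-fit-decreasing: each file goes into the first batch that
    still has room; files already at or above the target are left alone.
    """
    groups = []  # each group is a list of file sizes to rewrite together
    for size in sorted(file_sizes_mb, reverse=True):
        if size >= target_mb:
            continue  # already large enough, no rewrite needed
        for group in groups:
            if sum(group) + size <= target_mb:
                group.append(size)
                break
        else:
            groups.append([size])
    return groups


# Illustrative input: 100 files of 8 MB each plus one 600 MB file.
sizes = [8] * 100 + [600]
groups = plan_compaction(sizes, target_mb=512)

files_before = len(sizes)
untouched = sum(1 for s in sizes if s >= 512)
files_after = untouched + len(groups)
print(files_before, files_after)  # 101 3
```

In this toy input, 101 files shrink to 3 after rewriting, so a reader of the table has far fewer file entries to plan over and open.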

Optimization

Optimization Snapshot Data Lake Metadata

Materialized Views in Hive for Iceberg Table Format

Cloudera

FEBRUARY 8, 2024

The snapshot IDs of the source tables involved in the materialized view are also maintained in the metadata. Subsequently, these snapshot IDs are used to determine the delta changes that should be applied to the materialized view rows. Furthermore, it is partitioned on the d_year column.
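The idea of using stored snapshot IDs to find the delta can be sketched as follows. This is a simplified model under stated assumptions, not Hive's implementation: the `MaterializedView` class, its `refresh` method, and the sample data are hypothetical, the source table is assumed to be append-only, and its history is represented as an ordered list of `(snapshot_id, appended_rows)` pairs.

```python
from collections import Counter


class MaterializedView:
    """Toy aggregate view keyed by d_year (hypothetical, append-only source)."""

    def __init__(self):
        self.totals = Counter()          # d_year -> summed amount
        self.source_snapshot_id = None   # last source snapshot folded in

    def refresh(self, history):
        """Apply only rows committed after the recorded snapshot.

        history: list of (snapshot_id, appended_rows) in commit order,
        where each row is a (d_year, amount) tuple.
        Returns the number of delta rows applied.
        """
        ids = [sid for sid, _ in history]
        if self.source_snapshot_id is None:
            start = 0
        else:
            start = ids.index(self.source_snapshot_id) + 1
        delta = history[start:]
        for _, rows in delta:
            for d_year, amount in rows:
                self.totals[d_year] += amount
        if history:
            self.source_snapshot_id = ids[-1]
        return sum(len(rows) for _, rows in delta)


# Snapshot IDs are treated as opaque; ordering comes from the list position.
history = [
    (9001, [(2022, 10), (2023, 5)]),
    (4417, [(2023, 7)]),
]
mv = MaterializedView()
print(mv.refresh(history))   # 3
history.append((7730, [(2024, 1), (2023, 3)]))
print(mv.refresh(history))   # 2
print(dict(mv.totals))       # {2022: 10, 2023: 15, 2024: 1}
```

The second refresh touches only the two rows from the newest snapshot instead of recomputing the view from all source rows.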

Snapshot

Snapshot Metadata Cost-Benefit Data Warehouse

Build a serverless transactional data lake with Apache Iceberg, Amazon EMR Serverless, and Amazon Athena

AWS Big Data

MARCH 10, 2023

Iceberg employs internal metadata management that keeps track of data and empowers a set of rich features at scale. The Data Catalog provides a central location to govern and keep track of the schema and metadata. For example, from 2023/02/20 14:40:41 to 2023-02-20 14:40:41.000 UTC.

Data Lake

Data Lake Sales Data Warehouse Snapshot

How Amazon GTTS runs large-scale ETL jobs on AWS using Amazon MWAA

AWS Big Data

AUGUST 6, 2024

At a high level, the core of Langley’s architecture is based on a set of Amazon Simple Queue Service (Amazon SQS) queues and AWS Lambda functions, and a dedicated RDS database to store ETL job data and metadata. In 2023, AWS announced the upcoming deprecation of Data Pipeline, one of the core services used by Langley.

Cost-Benefit

Cost-Benefit Metadata Snapshot Metrics

Introducing Amazon MWAA support for Apache Airflow version 2.7.2 and deferrable operators

AWS Big Data

NOVEMBER 6, 2023

You can see the time each task spends idling while waiting for the Redshift cluster to be created, snapshotted, and paused. Airflow will cache variables and connections locally so that they can be accessed faster during DAG parsing, without having to fetch them from the secrets backend, environment variables, or metadata database.

Metrics

Metrics Metadata Snapshot Management

Empower Your Cyber Defenders with Real-Time Analytics

Cloudera

NOVEMBER 15, 2024

In fact, according to the Identity Theft Resource Center (ITRC) Annual Data Breach Report, there were 2,365 cyber attacks in 2023 with more than 300 million victims, and a 72% increase in data breaches since 2021. Today, cyber defenders face an unprecedented set of challenges as they work to secure and protect their organizations.

Analytics

Analytics Metadata Snapshot Data-driven

Ethics in action: Building trust through responsible AI development

CIO Business Intelligence

MARCH 5, 2025

Regulations and laws are often volatile as political influences can upend them on a moment's notice: The USA Executive Order (EO) on Safe, Secure and Trustworthy Artificial Intelligence (14110) issued on October 30, 2023, was rescinded on January 20, 2025. associated with every decision made by the AI system.

Risk

Risk Risk Management Measurement Modeling

Implement a serverless CDC process with Apache Iceberg using Amazon DynamoDB and Amazon Athena

AWS Big Data

AUGUST 16, 2023

We fetch the metadata of the users_xxxxxx table from Athena. The following are a few important considerations regarding how the Lambda function handles Iceberg table metadata changes: In this approach, target metadata takes precedence during DML operations. It’s imperative that the source and target metadata match.

Data Lake

Data Lake Metadata Testing Snapshot

Accelerate queries on Apache Iceberg tables through AWS Glue auto compaction

AWS Big Data

DECEMBER 19, 2024

By using features like Iceberg's compaction, OTFs streamline maintenance, making it straightforward to manage object and metadata versioning at scale. Enabling automatic compaction on Iceberg tables reduces metadata overhead on your Iceberg tables and improves query performance. The Data Catalog manages the metadata for the datasets.

Data Lake

Data Lake IoT Metadata Testing

Empower Your Cyber Defenders with Real-Time Analytics Author: Carolyn Duby, Field CTO

Cloudera

NOVEMBER 15, 2024

In fact, according to the Identity Theft Resource Center (ITRC) Annual Data Breach Report, there were 2,365 cyber attacks in 2023 with more than 300 million victims, and a 72% increase in data breaches since 2021. Today, cyber defenders face an unprecedented set of challenges as they work to secure and protect their organizations.

Analytics

Analytics Metadata Snapshot Data-driven

Talk to Your Graph Client for GraphDB

Ontotext

JANUARY 16, 2025

The first version of Talk to Your Graph (or TTYG for short) was released in 2023 and it was my baby. Use of assistant and thread metadata. So what is that metadata? The OpenAI Assistants API provides a set of custom metadata fields for both assistants and threads. Both must be strings. version Internal use only.

Metadata

Metadata Modeling Snapshot Interactive

A Summary Of Gartner’s Recent Innovation Insight Into Data Observability

DataKitchen

AUGUST 8, 2023

On 20 July 2023, Gartner released the article “Innovation Insight: Data Observability Enables Proactive Data Quality” by Melody Chien. Data Observability leverages five critical technologies to create a data awareness AI engine: data profiling, active metadata analysis, machine learning, data monitoring, and data lineage.

Data Quality

Data Quality Testing Snapshot Reporting
