Apache Iceberg is an open table format for very large analytic datasets that captures metadata about the state of datasets as they evolve and change over time. Iceberg has become very popular for its support for ACID transactions in data lakes and features like schema and partition evolution, time travel, and rollback.
In this blog post, we dive into different data aspects and how Cloudinary addresses the twin concerns of vendor lock-in and cost-efficient data analytics by using Apache Iceberg, Amazon Simple Storage Service (Amazon S3), Amazon Athena, Amazon EMR, and AWS Glue. This concept makes Iceberg extremely versatile. These formats enable ACID (atomicity, consistency, isolation, durability) transactions, upserts, and deletes, as well as advanced features such as time travel and snapshots that were previously only available in data warehouses. For example, expiring snapshots older than seven days with Iceberg's SparkActions API (note that expireOlderThan takes an absolute timestamp, not a duration):

    SparkActions.get().expireSnapshots(iceTable).expireOlderThan(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7)).execute()

Expiration will never remove files that are still required by a non-expired snapshot.
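For teams driving table maintenance from SQL rather than the Java API, Iceberg exposes the same operation as a stored procedure. A minimal PySpark sketch, where the catalog, database, and table names are placeholders:

    # Expire snapshots older than a cutoff via Iceberg's stored procedure
    spark.sql("""
        CALL my_catalog.system.expire_snapshots(
            table => 'db.sample_table',
            older_than => TIMESTAMP '2024-01-01 00:00:00'
        )
    """)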
The following are the key components and steps in the integration process: zero-ETL extracts and loads the data into Amazon S3, a highly scalable object storage service, and the data is also registered in the Glue Data Catalog, a metadata repository.
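Once the table is registered in the Data Catalog, it can be queried in place with Athena. A minimal sketch using boto3, where the database, table, and result bucket names are hypothetical:

    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    # Query the Glue Data Catalog table loaded by zero-ETL (names are placeholders)
    response = athena.start_query_execution(
        QueryString="SELECT * FROM orders LIMIT 10",
        QueryExecutionContext={"Database": "zero_etl_db", "Catalog": "AwsDataCatalog"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    print(response["QueryExecutionId"])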
In this post, we discuss ways to modernize your legacy, on-premises, real-time analytics architecture to build serverless data analytics solutions on AWS using Amazon Managed Service for Apache Flink. The example shows a call center streaming data source that sends the latest call center feed every 15 seconds.
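To make this concrete, here is a hedged sketch of how such a feed might be declared with PyFlink; the stream name, schema, and Region are hypothetical, and the Flink Kinesis SQL connector is assumed to be available at runtime:

    from pyflink.table import EnvironmentSettings, TableEnvironment

    t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

    # Declare the call center feed as a streaming source table over Kinesis
    t_env.execute_sql("""
        CREATE TABLE call_center_feed (
            call_id STRING,
            agent_id STRING,
            event_time TIMESTAMP(3)
        ) WITH (
            'connector' = 'kinesis',
            'stream' = 'call-center-stream',
            'aws.region' = 'us-east-1',
            'scan.stream.initpos' = 'LATEST',
            'format' = 'json'
        )
    """)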
Iceberg doesn't optimize file sizes or run automatic table services (for example, compaction or clustering) when writing, so streaming ingestion will create many small data and metadata files. It offers different query types, allowing you to prioritize data freshness (Snapshot Query) or read performance (Read Optimized Query).
This solution only replicates metadata in the Data Catalog, not the actual underlying data. To have a redundant data lake using Lake Formation and AWS Glue in an additional Region, we recommend replicating the Amazon S3-based storage using S3 replication, S3 sync, aws-s3-copy-sync-using-batch, or the S3 Batch Replication process.
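As one option, S3 Replication can be configured programmatically. A minimal boto3 sketch, where the bucket names and IAM role ARN are placeholders and versioning is assumed to be enabled on both buckets:

    import boto3

    s3 = boto3.client("s3")

    # Replicate new objects from the primary bucket to a bucket in another Region
    s3.put_bucket_replication(
        Bucket="primary-data-lake",
        ReplicationConfiguration={
            "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
            "Rules": [
                {
                    "ID": "replicate-all",
                    "Priority": 1,
                    "Status": "Enabled",
                    "Filter": {},
                    "Destination": {"Bucket": "arn:aws:s3:::secondary-data-lake"},
                    "DeleteMarkerReplication": {"Status": "Disabled"},
                }
            ],
        },
    )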
AI needs machine learning (ML), ML needs data science, and data science needs analytics. And they all need lots of data. The takeaway: businesses need control over all their data in order to achieve AI at scale and digital business transformation. Doing data at scale requires a data platform.
Using Apache Iceberg’s compaction results in significant performance improvements, especially for large tables, making a noticeable difference in query performance between compacted and uncompacted data. These files are then reconciled with the remaining data during read time.
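In practice, compaction can be triggered with Iceberg's rewrite_data_files stored procedure. A minimal sketch, with placeholder catalog and table names and a 512 MB target file size:

    # Merge small files into larger ones to speed up reads
    spark.sql("""
        CALL my_catalog.system.rewrite_data_files(
            table => 'db.events',
            options => map('target-file-size-bytes', '536870912')
        )
    """)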
This is the first post in a blog series that offers common architectural patterns for building real-time data streaming infrastructures using Kinesis Data Streams for a wide range of use cases. In this post, we review the common architectural patterns of two use cases: Time Series Data Analysis and Event-Driven Microservices.
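For reference, producing a record to Kinesis Data Streams takes only a few lines. A minimal boto3 sketch, where the stream name and payload are placeholders:

    import json
    import boto3

    kinesis = boto3.client("kinesis", region_name="us-east-1")

    # Publish one event; the partition key determines shard assignment
    kinesis.put_record(
        StreamName="orders-stream",
        Data=json.dumps({"order_id": "1234", "status": "CREATED"}).encode("utf-8"),
        PartitionKey="1234",
    )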
You can see the time each task spends idling while waiting for the Redshift cluster to be created, snapshotted, and paused. Airflow will cache variables and connections locally so that they can be accessed faster during DAG parsing, without having to fetch them from the secrets backend, environment variables, or metadata database.
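For example, a variable lookup during DAG parsing looks like the following; the variable name and default are hypothetical:

    from airflow.models import Variable

    # With local caching enabled, repeated lookups during DAG parsing avoid
    # round trips to the secrets backend or the metadata database
    cluster_id = Variable.get("redshift_cluster_id", default_var="dev-cluster")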
If the asset has AWS Glue Data Quality enabled, you can now quickly visualize the data quality score directly in the catalog search pane. By selecting the corresponding asset, you can understand its content through the readme, glossary terms, and technical and business metadata.
However, there is a fundamental challenge standing in the way of being successful: data. Optimized for analytics: Iceberg tables are designed to deliver analytics faster and more effectively. The metadata-driven approach ensures quick query planning so defenders don’t have to deal with slow processes when they need fast answers.
The record in the "outbox" table contains information about the event that happened inside the application, as well as some metadata that is required for further processing or routing. Note: Cloudera Data Platform (CDP) is a hybrid data platform designed for unmatched freedom to choose any cloud, any analytics, any data.
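A minimal sketch of the outbox pattern, using SQLite for illustration; the table names and event shape are placeholders, and the relay that tails the outbox (for example, a CDC tool) is out of scope here:

    import json
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
        CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                             aggregate_id TEXT, event_type TEXT, payload TEXT);
    """)

    # Write the business row and its event record in ONE transaction, so the
    # event is published if and only if the state change committed
    with conn:
        conn.execute("INSERT INTO orders (id, status) VALUES (?, ?)", (42, "CREATED"))
        conn.execute(
            "INSERT INTO outbox (aggregate_id, event_type, payload) VALUES (?, ?, ?)",
            ("42", "OrderCreated", json.dumps({"order_id": 42, "status": "CREATED"})),
        )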
Furthermore, data events are filtered, enriched, and transformed to a consumable format using a stream processor. The result is made available to the application by querying the latest snapshot. This allows the model to adapt to the latest changes in price and availability.
Organizations across the world are increasingly relying on streaming data, and there is a growing need for real-time data analytics, considering the growing velocity and volume of data being collected. For more information about checkpointing, see the appendix at the end of this post.
You can then apply transformations and store data in Delta format for managing inserts, updates, and deletes. Amazon EMR Serverless is a serverless option in Amazon EMR that makes it easy for data analysts and engineers to run open-source big data analytics frameworks without configuring, managing, and scaling clusters or servers.
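A hedged PySpark sketch of that flow follows; the S3 path is a placeholder, df is assumed to be an existing DataFrame, updates an existing temporary view, and the Delta Lake package must be available to the Spark session:

    # Land change data in Delta format
    df.write.format("delta").mode("append").save("s3://my-lake/bronze/orders/")

    # Apply inserts, updates, and deletes with a MERGE
    spark.sql("""
        MERGE INTO delta.`s3://my-lake/bronze/orders/` AS t
        USING updates AS s
        ON t.order_id = s.order_id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)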
We fetch the metadata of the users_xxxxxx table from Athena. A few important considerations regarding how the Lambda function handles Iceberg table metadata changes: in this approach, target metadata takes precedence during DML operations, so it's imperative that the source and target metadata match.
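Fetching that metadata is straightforward with boto3. A minimal sketch, where the catalog and database names are placeholders (the table name follows the post):

    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    # Retrieve column metadata for the target table
    meta = athena.get_table_metadata(
        CatalogName="AwsDataCatalog",
        DatabaseName="my_database",
        TableName="users_xxxxxx",
    )
    for col in meta["TableMetadata"]["Columns"]:
        print(col["Name"], col["Type"])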
Running HBase on Amazon S3 has several added benefits, including lower costs, data durability, and easier scalability. And during HBase migration, you can export the snapshot files to S3 and use them for recovery. HBase provided by other cloud platforms doesn’t support snapshots.
The open data lakehouse is quickly becoming the standard architecture for unified multifunction analytics on large volumes of data. It combines the flexibility and scalability of data lake storage with the data analytics, data governance, and data management functionality of the data warehouse.
To optimize their security operations, organizations are adopting modern approaches that combine real-time monitoring with scalable data analytics. They are using data lake architectures and Apache Iceberg to efficiently process large volumes of security data while minimizing operational overhead.
REST Catalog value proposition: it provides open, metastore-agnostic APIs for Iceberg metadata operations, dramatically simplifying Iceberg client and metastore/engine integration, and it provides real-time metadata access by directly integrating with the Iceberg-compatible metastore. For example: spark.sql("SELECT * FROM airlines_data.carriers").show()
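Wiring Spark to a REST catalog is a matter of a few settings. A minimal sketch, where the catalog name and URI are placeholders and the Iceberg Spark runtime JAR is assumed to be on the classpath:

    from pyspark.sql import SparkSession

    # Register an Iceberg REST catalog with the Spark session
    spark = (
        SparkSession.builder
        .config("spark.sql.catalog.rest_cat", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.rest_cat.type", "rest")
        .config("spark.sql.catalog.rest_cat.uri", "http://localhost:8181")
        .getOrCreate()
    )

    spark.sql("SELECT * FROM rest_cat.airlines_data.carriers").show()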
For example, you can write some records using a batch ETL Spark job and other data from a Flink application at the same time and into the same table. Third, it allows scenarios such as time travel and rollback, so you can run SQL queries on a point-in-time snapshot of your data, or roll back data to a previously known good version.
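In Spark SQL with Iceberg, both reads look like the following; the table name and snapshot ID are placeholders:

    # Time travel: query the table as of a timestamp or a specific snapshot
    spark.sql("SELECT * FROM my_catalog.db.events TIMESTAMP AS OF '2024-01-01 00:00:00'").show()
    spark.sql("SELECT * FROM my_catalog.db.events VERSION AS OF 4358109269898354983").show()

    # Roll back to a known-good snapshot with Iceberg's stored procedure
    spark.sql("CALL my_catalog.system.rollback_to_snapshot('db.events', 4358109269898354983)")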
OTFs, such as Iceberg, address key limitations in traditional data lakes by offering features like ACID transactions, which maintain data consistency across concurrent operations, and compaction, which helps manage the issue of small files by merging them efficiently. The data is sent to Amazon MSK, which acts as a streaming table.
Data Observability leverages five critical technologies to create a data awareness AI engine: data profiling, active metadata analysis, machine learning, data monitoring, and data lineage. The last of these, data lineage, is primarily used to understand where data came from and how it was transformed.
To capture a more complete picture of the data's journey, it is important to have a DataOps Observability system in place. Data lineage, by contrast, is static and often lags by weeks or months, because it is typically based on snapshots of data and metadata taken at a specific point in time.