In modern data architectures, Apache Iceberg has emerged as a popular table format for data lakes, offering key features including ACID transactions and concurrent write support. The AWS Glue Data Catalog provides this functionality as the Iceberg catalog. On each commit, Iceberg determines the changes in the transaction and writes new data files.
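A minimal sketch of that setup, assuming Spark with the Iceberg runtime and AWS bundle on the classpath (the catalog name glue_catalog, the S3 warehouse path, and the table name are placeholders):

    import org.apache.spark.sql.SparkSession

    // Sketch only: "glue_catalog", the warehouse path, and db.events are placeholders.
    val spark = SparkSession.builder()
      .appName("iceberg-glue-catalog-example")
      .config("spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
      .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
      .config("spark.sql.catalog.glue_catalog.catalog-impl",
        "org.apache.iceberg.aws.glue.GlueCatalog")
      .config("spark.sql.catalog.glue_catalog.warehouse", "s3://my-bucket/warehouse")
      .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
      .getOrCreate()

    // Each commit against this table determines the transaction's changes, writes
    // new data files to S3, and atomically updates the metadata pointer in Glue.
    spark.sql("CREATE TABLE IF NOT EXISTS glue_catalog.db.events (id BIGINT, payload STRING) USING iceberg")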
This post was co-written with Dipankar Mazumdar, Staff Data Engineering Advocate with AWS Partner OneHouse. Data architecture has evolved significantly to handle growing data volumes and diverse workloads. Querying all snapshots, we can see that we created three snapshots with overwrites after the initial one.
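To make that inspection concrete, here is a hedged sketch (reusing the glue_catalog session from the earlier snippet; the table name is a placeholder) that lists every snapshot via Iceberg's built-in snapshots metadata table:

    // After an initial commit plus three overwrites, expect four rows here,
    // the last three with operation = 'overwrite'.
    spark.sql(
      """SELECT committed_at, snapshot_id, operation
        |FROM glue_catalog.db.events.snapshots
        |ORDER BY committed_at""".stripMargin
    ).show(truncate = false)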
Over the past decade, the successful deployment of large-scale data platforms at our customers has acted as a big data flywheel, driving demand to bring in even more data, apply more sophisticated analytics, and onboard many new data practitioners, from business analysts to data scientists.
They understand that a one-size-fits-all approach no longer works, and recognize the value in adopting scalable, flexible tools and open data formats to support interoperability in a modern data architecture to accelerate the delivery of new solutions. Snowflake can query across Iceberg and Snowflake table formats.
While traditional extract, transform, and load (ETL) processes have long been a staple of data integration due to their flexibility, for common use cases such as replication and ingestion they often prove time-consuming, complex, and less adaptable to the fast-changing demands of modern data architectures.
Apache Iceberg brings the reliability and simplicity of SQL tables to big data, while making it possible for processing engines such as Apache Spark, Trino, Apache Flink, Presto, Apache Hive, and Impala to safely work with the same tables at the same time. SparkActions.get().expireSnapshots(iceTable).expireOlderThan(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7)).execute()
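That one-liner needs an Iceberg Table handle and an absolute cutoff timestamp to be runnable; a minimal sketch under the same assumptions as the earlier snippets (placeholder table name, Spark session already configured):

    import java.util.concurrent.TimeUnit
    import org.apache.iceberg.spark.Spark3Util
    import org.apache.iceberg.spark.actions.SparkActions

    // Resolve the Iceberg table behind the catalog identifier (placeholder name).
    val iceTable = Spark3Util.loadIcebergTable(spark, "glue_catalog.db.events")

    // expireOlderThan takes epoch milliseconds, so a 7-day retention window is
    // expressed as "now minus 7 days".
    SparkActions.get()
      .expireSnapshots(iceTable)
      .expireOlderThan(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7))
      .execute()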
Data migration must be performed separately using methods such as S3 replication, S3 sync, aws-s3-copy-sync-using-batch, or S3 Batch Replication. This utility has two modes for replicating Lake Formation and Data Catalog metadata: on-demand and real-time. Nivas Shankar is a Principal Product Manager for AWS Lake Formation.
With data becoming the driving force behind many industries today, having a modern data architecture is pivotal for organizations to be successful. By decoupling storage and compute, data lakes promote cost-effective storage and processing of big data. Why did Orca choose Apache Iceberg?
Since the deluge of big data over a decade ago, many organizations have learned to build applications to process and analyze petabytes of data. Data lakes have served as a central repository to store structured and unstructured data at any scale and in various formats.
Over the years, data lakes on Amazon Simple Storage Service (Amazon S3) have become the default repository for enterprise data and are a common choice for a large set of users who query data for a variety of analytics and machine learning use cases. Analytics use cases on data lakes are always evolving.
A modern data architecture enables companies to ingest virtually any type of data through automated pipelines into a data lake, which provides highly durable and cost-effective object storage at petabyte or exabyte scale. Z-ordering can cluster data for better data colocation, as sketched below.
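As a hedged sketch of that z-ordering step (the table and column names are hypothetical), Iceberg exposes it through the rewrite_data_files procedure with a sort strategy:

    // Rewrite data files so rows with nearby (event_date, device_id) values land in
    // the same files; zorder interleaves the sort across both columns.
    spark.sql(
      """CALL glue_catalog.system.rewrite_data_files(
        |  table => 'db.events',
        |  strategy => 'sort',
        |  sort_order => 'zorder(event_date, device_id)'
        |)""".stripMargin
    ).show()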
Kinesis Data Streams has native integrations with other AWS services such as AWS Glue and Amazon EventBridge to build real-time streaming applications on AWS. Refer to Amazon Kinesis Data Streams integrations for additional details. State snapshot in Amazon S3 – You can store the state snapshot in Amazon S3 for tracking.
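The excerpt doesn't name the stream processor, but if it is Apache Flink (a common pairing with Kinesis Data Streams), storing state snapshots in Amazon S3 might look like this sketch with a placeholder bucket:

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Snapshot operator state every 60 seconds and persist the checkpoints to S3;
    // on failure, Flink restores from the most recent snapshot.
    env.enableCheckpointing(60000L)
    env.getCheckpointConfig.setCheckpointStorage("s3://my-bucket/flink-checkpoints")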
Data lakes and data warehouses are two of the most important data storage and management technologies in a modern data architecture. Data lakes store all of an organization’s data, regardless of its format or structure. The original source file is 2022.csv. He is based in Tokyo, Japan.
The Analytics specialty practice of AWS Professional Services (AWS ProServe) helps customers across the globe with modern data architecture implementations on the AWS Cloud. Table data storage mode – There are two options: Historical – This table in the data lake stores historical updates to records (always append).
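One way the Historical (always append) option could look with Spark, assuming the table setup from the earlier snippets (the input path and table name are placeholders):

    // Append today's extracted changes to the historical table; nothing is updated
    // in place, so every version of every record is retained.
    val updates = spark.read.parquet("s3://my-bucket/incoming/2022-01-01/")
    updates.writeTo("glue_catalog.db.customer_historical").append()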
Success criteria alignment by all stakeholders (producers, consumers, operators, auditors) is key for a successful transition to a new Amazon Redshift modern data architecture. The success criteria are the key performance indicators (KPIs) for each component of the data workflow. The following figure shows a daily usage KPI.
Suvojit Dasgupta is a Principal Data Architect at Amazon Web Services. He leads a team of skilled engineers in designing and building scalable data solutions for AWS customers. He specializes in developing and implementing innovative data architectures to address complex business challenges.
He has over 20 years of experience in software engineering, software architecture, and cloud architecture. He has over 25 years of experience in enterprise data architecture, databases, and data warehousing.

outputs:
  - Name: ClusterStatus
    Selector: '$.Clusters[0].ClusterStatus'
This post is designed to be implemented for a real customer use case, where you get full snapshot data on a daily basis. Vijay Velpula is a Data Architect with AWS Professional Services. He helps customers implement big data and analytics solutions.
The data architecture diagram below shows an example of how you could use AWS services to calculate and visualize an organization’s estimated carbon footprint. Customers have the flexibility to choose the services in each stage of the data pipeline based on their use case.
Furthermore, data events are filtered, enriched, and transformed to a consumable format using a stream processor. The result is made available to the application by querying the latest snapshot. Data streaming enables you to ingest data from a variety of databases across various systems.
Modern analytics is much wider than SQL-based data warehousing. With Amazon Redshift, you can build lake house architectures and perform any kind of analytics, such as interactive analytics, operational analytics, big data processing, visual data preparation, predictive analytics, machine learning, and more.
By analyzing the historical report snapshot, you can identify areas for improvement, implement changes, and measure the effectiveness of those changes. He focuses on modern data architectures and helping customers accelerate their cloud journey with serverless technologies.
With scheduled flows, you can choose either full or incremental data transfer: With full transfer, Amazon AppFlow transfers a snapshot of all records at the time of the flow run from the source to the destination. He’s on a mission to make life easier for customers who are facing complex data integration challenges.
You can then apply transformations and store data in Delta format for managing inserts, updates, and deletes. Amazon EMR Serverless is a serverless option in Amazon EMR that makes it easy for data analysts and engineers to run open-source big data analytics frameworks without configuring, managing, and scaling clusters or servers.
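A minimal sketch of that Delta write path, assuming the Delta Lake libraries are available to the Spark job on EMR Serverless (the paths are placeholders):

    // Read raw input, apply a transformation, and store the result in Delta format;
    // Delta's transaction log is what later enables inserts, updates, and deletes.
    val raw = spark.read.parquet("s3://my-bucket/raw/orders/")
    raw.filter("order_status IS NOT NULL")
      .write
      .format("delta")
      .mode("overwrite")
      .save("s3://my-bucket/curated/orders_delta/")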
Amazon EMR stands as a dynamic force in the cloud, delivering unmatched capabilities for organizations seeking robust big data solutions. Its seamless integration, powerful features, and adaptability make it an indispensable tool for navigating the complexities of data analytics and ML on AWS.
The following maintenance tasks are supported by the framework: Expire snapshots – Snapshots can be used for rollback operations as well as time travel queries. It’s highly recommended to regularly expire snapshots that are no longer needed. Remove old metadata files – Metadata files can accumulate over time, just like snapshots.
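A hedged sketch of both maintenance tasks (the catalog, table, timestamp, and retention values are placeholders): snapshot expiry via Iceberg's SQL procedure, and automatic metadata cleanup via table properties:

    // Expire snapshots older than the cutoff, but always retain the 10 most recent
    // so rollback and recent time travel queries keep working.
    spark.sql(
      """CALL glue_catalog.system.expire_snapshots(
        |  table => 'db.events',
        |  older_than => TIMESTAMP '2023-01-01 00:00:00',
        |  retain_last => 10
        |)""".stripMargin
    )

    // Cap how many old metadata files accumulate by deleting them after commits.
    spark.sql(
      """ALTER TABLE glue_catalog.db.events SET TBLPROPERTIES (
        |  'write.metadata.delete-after-commit.enabled' = 'true',
        |  'write.metadata.previous-versions-max' = '20'
        |)""".stripMargin
    )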
Many enterprises have heterogeneous data platforms and technology stacks across different business units or data domains. For decades, they have been struggling with the scale, speed, and correctness required to derive timely, meaningful, and actionable insights from vast and diverse big data environments.