Data Lake and Snapshot - Data Leaders Brief

Achieve data resilience using Amazon OpenSearch Service disaster recovery with snapshot and restore

AWS Big Data

NOVEMBER 11, 2024

This post focuses on introducing an active-passive approach using a snapshot and restore strategy. Snapshot and restore in OpenSearch Service The snapshot and restore strategy in OpenSearch Service involves creating point-in-time backups, known as snapshots , of your OpenSearch domain.

Snapshot

Snapshot Strategy Dashboards Data Lake

Manage concurrent write conflicts in Apache Iceberg on the AWS Glue Data Catalog

AWS Big Data

APRIL 8, 2025

In modern data architectures, Apache Iceberg has emerged as a popular table format for data lakes, offering key features including ACID transactions and concurrent write support. The Data Catalog provides the functionality as the Iceberg catalog. Determine the changes in transaction, and write new data files.

Snapshot

Snapshot Management Metadata Big Data

Migrate an existing data lake to a transactional data lake using Apache Iceberg

AWS Big Data

OCTOBER 3, 2023

A data lake is a centralized repository that you can use to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data and then run different types of analytics for better business insights.

Data Lake

Data Lake Metadata Snapshot Recreation/Entertainment

Webinars

Automation, Evolved: Your New Playbook For Smarter Knowledge Work

MORE WEBINARS

Use open table format libraries on AWS Glue 5.0 for Apache Spark

AWS Big Data

DECEMBER 4, 2024

Open table formats are emerging in the rapidly evolving domain of big data management, fundamentally altering the landscape of data storage and analysis. By providing a standardized framework for data representation, open table formats break down data silos, enhance data quality, and accelerate analytics at scale.

Snapshot

Snapshot Metadata Data Lake Optimization

Run Apache XTable in AWS Lambda for background conversion of open table formats

AWS Big Data

NOVEMBER 26, 2024

Initially, data warehouses were the go-to solution for structured data and analytical workloads but were limited by proprietary storage formats and their inability to handle unstructured data. Eventually, transactional data lakes emerged to add transactional consistency and performance of a data warehouse to the data lake.

Metadata

Metadata Data Lake Snapshot Data Warehouse

Use Apache Iceberg in a data lake to support incremental data processing

AWS Big Data

MARCH 2, 2023

Iceberg has become very popular for its support for ACID transactions in data lakes and features like schema and partition evolution, time travel, and rollback. and later supports the Apache Iceberg framework for data lakes. The snapshot points to the manifest list. AWS Glue 3.0

Data Lake

Data Lake Data Processing Metadata Snapshot

How Cloudinary transformed their petabyte scale streaming data lake with Apache Iceberg and AWS Analytics

AWS Big Data

JUNE 10, 2024

Solving the small file problem and improving query performance In modern data architectures, stream processing engines such as Amazon EMR are often used to ingest continuous streams of data into data lakes using Apache Iceberg. SparkActions.get().expireSnapshots(iceTable).expireOlderThan(TimeUnit.DAYS.toMillis(7)).execute()

Data Lake

Data Lake Metadata Snapshot Analytics

Build a high-performance quant research platform with Apache Iceberg

AWS Big Data

JANUARY 9, 2025

Iceberg provides time travel and snapshotting capabilities out of the box to manage lookahead bias that could be embedded in the data (such as delayed data delivery). Simplified data corrections and updates Iceberg enhances data management for quants in capital markets through its robust insert, delete, and update capabilities.

Metadata

Metadata Snapshot Cost-Benefit Optimization

Use Apache Iceberg in your data lake with Amazon S3, AWS Glue, and Snowflake

AWS Big Data

APRIL 3, 2024

licensed, 100% open-source data table format that helps simplify data processing on large datasets stored in data lakes. Data engineers use Apache Iceberg because it’s fast, efficient, and reliable at any scale and keeps records of how datasets change over time.

Data Lake

Data Lake Snapshot Metadata Data Architecture

Choosing an open table format for your transactional data lake on AWS

AWS Big Data

JUNE 9, 2023

A modern data architecture enables companies to ingest virtually any type of data through automated pipelines into a data lake, which provides highly durable and cost-effective object storage at petabyte or exabyte scale.

Data Lake

Data Lake Metadata Statistics Optimization

Load data incrementally from transactional data lakes to data warehouses

AWS Big Data

OCTOBER 19, 2023

Data lakes and data warehouses are two of the most important data storage and management technologies in a modern data architecture. Data lakes store all of an organization’s data, regardless of its format or structure.

Data Lake

Data Lake Data Warehouse Visualization Snapshot

Build a serverless transactional data lake with Apache Iceberg, Amazon EMR Serverless, and Amazon Athena

AWS Big Data

MARCH 10, 2023

Since the deluge of big data over a decade ago, many organizations have learned to build applications to process and analyze petabytes of data. Data lakes have served as a central repository to store structured and unstructured data at any scale and in various formats.

Data Lake

Data Lake Sales Data Warehouse Snapshot

Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes

AWS Big Data

MAY 24, 2023

When you build your transactional data lake using Apache Iceberg to solve your functional use cases, you need to focus on operational use cases for your S3 data lake to optimize the production environment. availability. parquet") df.sortWithinPartitions("review_date").writeTo("dev.db.amazon_reviews_iceberg").append()

Data Lake

Data Lake Snapshot Metadata Optimization

Orca Security’s journey to a petabyte-scale data lake with Apache Iceberg and AWS Analytics

AWS Big Data

JULY 20, 2023

With data becoming the driving force behind many industries today, having a modern data architecture is pivotal for organizations to be successful. In this post, we describe Orca’s journey building a transactional data lake using Amazon Simple Storage Service (Amazon S3), Apache Iceberg, and AWS Analytics.

Data Lake

Data Lake Analytics Snapshot Data Quality

Monitoring Apache Iceberg metadata layer using AWS Lambda, AWS Glue, and AWS CloudWatch

AWS Big Data

JULY 29, 2024

In the era of big data, data lakes have emerged as a cornerstone for storing vast amounts of raw data in its native format. They support structured, semi-structured, and unstructured data, offering a flexible and scalable environment for data ingestion from multiple sources.

Metadata

Metadata Snapshot Data Lake Metrics

Simplify operational data processing in data lakes using AWS Glue and Apache Hudi

AWS Big Data

SEPTEMBER 13, 2023

A modern data architecture is an evolutionary architecture pattern designed to integrate a data lake, data warehouse, and purpose-built stores with a unified governance model. The company wanted the ability to continue processing operational data in the secondary Region in the rare event of primary Region failure.

Data Lake

Data Lake Data Processing Metadata Snapshot

Perform upserts in a data lake using Amazon Athena and Apache Iceberg

AWS Big Data

APRIL 27, 2023

Amazon Athena supports the MERGE command on Apache Iceberg tables, which allows you to perform inserts, updates, and deletes in your data lake at scale using familiar SQL statements that are compliant with ACID (Atomic, Consistent, Isolated, Durable).

Data Lake

Data Lake Snapshot Optimization Data Transformation

Building end-to-end data lineage for one-time and complex queries using Amazon Athena, Amazon Redshift, Amazon Neptune and dbt

AWS Big Data

DECEMBER 12, 2024

In the context of comprehensive data governance, Amazon DataZone offers organization-wide data lineage visualization using Amazon Web Services (AWS) services, while dbt provides project-level lineage through model analysis and supports cross-project integration between data lakes and warehouses.

Snapshot

Snapshot Recreation/Entertainment Experimentation Data Lake

Use AWS Glue ETL to perform merge, partition evolution, and schema evolution on Apache Iceberg

AWS Big Data

MARCH 4, 2024

As enterprises collect increasing amounts of data from various sources, the structure and organization of that data often need to change over time to meet evolving analytical needs. Schema evolution enables adding, deleting, renaming, or modifying columns without needing to rewrite existing data.

Snapshot

Snapshot Data Lake Metadata Recreation/Entertainment

Simplify data integration with AWS Glue and zero-ETL to Amazon SageMaker Lakehouse

AWS Big Data

DECEMBER 4, 2024

With this new functionality, customers can create up-to-date replicas of their data from applications such as Salesforce, ServiceNow, and Zendesk in an Amazon SageMaker Lakehouse and Amazon Redshift. SageMaker Lakehouse gives you the flexibility to access and query your data in-place with all Apache Iceberg compatible tools and engines.

Data Integration

Data Integration Data Lake Statistics Data-driven

The AWS Glue Data Catalog now supports storage optimization of Apache Iceberg tables

AWS Big Data

SEPTEMBER 12, 2024

Along with the Glue Data Catalog’s automated compaction feature, these storage optimizations can help you reduce metadata overhead, control storage costs, and improve query performance. Iceberg creates a new version called a snapshot for every change to the data in the table.

Optimization

Optimization Snapshot Metadata Metrics

Implement slowly changing dimensions in a data lake using AWS Glue and Delta

AWS Big Data

MARCH 28, 2023

As organizations across the globe are modernizing their data platforms with data lakes on Amazon Simple Storage Service (Amazon S3), handling SCDs in data lakes can be challenging.

Data Lake

Data Lake Testing Snapshot Big Data

Use Amazon Athena with Spark SQL for your open-source transactional table formats

AWS Big Data

JANUARY 24, 2024

AWS-powered data lakes, supported by the unmatched availability of Amazon Simple Storage Service (Amazon S3), can handle the scale, agility, and flexibility required to combine different data and analytics approaches. It will never remove files that are still required by a non-expired snapshot.

Snapshot

Snapshot Data Lake Metadata Optimization

Apache Iceberg optimization: Solving the small files problem in Amazon EMR

AWS Big Data

OCTOBER 3, 2023

In our previous post Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes , we discussed how you can implement solutions to improve operational efficiencies of your Amazon Simple Storage Service (Amazon S3) data lake that is using the Apache Iceberg open table format and running on the Amazon EMR big data platform.

Optimization

Optimization Snapshot Data Lake Metadata

Implement historical record lookup and Slowly Changing Dimensions Type-2 using Apache Iceberg

AWS Big Data

DECEMBER 9, 2024

Anytime when you need SCD Type-2 snapshot of your Iceberg table, you can create the corresponding representation. This approach combines the power of Icebergs efficient data management with the historical tracking capabilities of SCD Type-2. You can obtain the table snapshots by querying for db.table.snapshots. In the WITH AS (.)

Snapshot

Snapshot Data Warehouse Data Lake Data Quality

Backtesting index rebalancing arbitrage with Amazon EMR and Apache Iceberg

AWS Big Data

JULY 3, 2023

Terminology Let’s first discuss some of the terminology used in this post: Research data lake on Amazon S3 – A data lake is a large, centralized repository that allows you to manage all your structured and unstructured data at any scale. This is where the tagging feature in Apache Iceberg comes in handy.

Snapshot

Snapshot Data Lake Testing Strategy

Introducing Apache Hudi support with AWS Glue crawlers

AWS Big Data

NOVEMBER 22, 2023

Apache Hudi is an open table format that brings database and data warehouse capabilities to data lakes. Apache Hudi helps data engineers manage complex challenges, such as managing continuously evolving datasets with transactions while maintaining query performance. Under Administration , choose Data catalog settings.

Data Lake

Data Lake Snapshot Metadata Optimization

MLOps and DevOps: Why Data Makes It Different

O'Reilly on Data

OCTOBER 19, 2021

ML use cases rarely dictate the master data management solution, so the ML stack needs to integrate with existing data warehouses. To manage the dynamism, we can resort to taking snapshots that represent immutable points in time: of models, of data, of code, and of internal state. Enter the software development layers.

IT

IT Testing Experimentation Software

Introducing native support for Apache Hudi, Delta Lake, and Apache Iceberg on AWS Glue for Apache Spark, Part 2: AWS Glue Studio Visual Editor

AWS Big Data

MARCH 20, 2023

In the first post of this series , we described how AWS Glue for Apache Spark works with Apache Hudi, Linux Foundation Delta Lake, and Apache Iceberg datasets tables using the native support of those data lake formats. Even without prior experience using Hudi, Delta Lake or Iceberg, you can easily achieve typical use cases.

Visualization

Visualization Data Lake Snapshot Big Data

Build a data lake with Apache Flink on Amazon EMR

AWS Big Data

JANUARY 27, 2023

Verify all table metadata is stored in the AWS Glue Data Catalog. Consume data with Athena or Amazon EMR Trino for business analysis. Update and delete source records in Amazon RDS for MySQL and validate the reflection of the data lake tables. Continue the subsequent steps to complete your EMR cluster creation.

Data Lake

Data Lake Metadata Business Analysis Data-driven

Enforce fine-grained access control on Open Table Formats via Amazon EMR integrated with AWS Lake Formation

AWS Big Data

JANUARY 17, 2024

With Amazon EMR 6.15, we launched AWS Lake Formation based fine-grained access controls (FGAC) on Open Table Formats (OTFs), including Apache Hudi, Apache Iceberg, and Delta lake. Many large enterprise companies seek to use their transactional data lake to gain insights and improve decision-making.

Data Lake

Data Lake Snapshot Big Data Data-driven

Migrate Amazon Redshift from DC2 to RA3 to accommodate increasing data volumes and analytics demands

AWS Big Data

AUGUST 9, 2024

These processes retrieve data from around 90 different data sources, resulting in updating roughly 2,000 tables in the data warehouse and 3,000 external tables in Parquet format, accessed through Amazon Redshift Spectrum and a data lake on Amazon Simple Storage Service (Amazon S3). TB of data.

Data Lake

Data Lake Analytics Data Warehouse Data-driven

Manage your data warehouse cost allocations with Amazon Redshift Serverless tagging

AWS Big Data

MARCH 27, 2023

Amazon Redshift Serverless makes it simple to run and scale analytics without having to manage your data warehouse infrastructure. For Filter by resource type , you can filter by Workgroup , Namespace , Snapshot , and Recovery Point. For this post, we don’t include any tag filters, so we can view all the resources across our account.

Data Warehouse

Data Warehouse Management Snapshot Data Lake

Implement disaster recovery with Amazon Redshift

AWS Big Data

JUNE 27, 2024

With built-in features such as automated snapshots and cross-Region replication, you can enhance your disaster resilience with Amazon Redshift. Amazon Redshift supports two kinds of snapshots: automatic and manual, which can be used to recover data. Snapshots are point-in-time backups of the Redshift data warehouse.

Snapshot

Snapshot Data Warehouse Data Processing Strategy

Analyze Elastic IP usage history using Amazon Athena and AWS CloudTrail

AWS Big Data

MAY 15, 2024

By extracting detailed information from CloudTrail and querying it using Athena, this solution streamlines the process of data collection, analysis, and reporting of EIP usage within an AWS account. Additionally, you can analyze activity logs with AWS CloudTrail Lake and Amazon Athena.

Snapshot

Snapshot Data Lake Optimization Reporting

Build and manage your modern data stack using dbt and AWS Glue through dbt-glue, the new “trusted” dbt adapter

AWS Big Data

NOVEMBER 29, 2023

We have seen a strong customer demand to expand its scope to cloud-based data lakes because data lakes are increasingly the enterprise solution for large-scale data initiatives due to their power and capabilities. Let’s say that this company is located in Europe and the data product must comply with the GDPR.

Data Lake

Data Lake Management Metrics Data Warehouse

How Gupshup built their multi-tenant messaging analytics platform on Amazon Redshift

AWS Big Data

FEBRUARY 12, 2024

It makes it fast, simple, and cost-effective to analyze all your data using standard SQL and your existing business intelligence (BI) tools. Moreover, no separate effort is required to process historical data versus live streaming data. Apart from incremental analytics, Redshift simplifies a lot of operational aspects.

Analytics

Analytics Data Warehouse Snapshot Cost-Benefit

Power enterprise-grade Data Vaults with Amazon Redshift – Part 2

AWS Big Data

NOVEMBER 16, 2023

Amazon Redshift is a popular cloud data warehouse, offering a fully managed cloud-based service that seamlessly integrates with an organization’s Amazon Simple Storage Service (Amazon S3) data lake, real-time streams, machine learning (ML) workflows, transactional workflows, and much more—all while providing up to 7.9x

Enterprise

Enterprise Data Warehouse Snapshot Cost-Benefit

Simplifying data processing at Capitec with Amazon Redshift integration for Apache Spark

AWS Big Data

NOVEMBER 10, 2023

The data sourcing problem To ensure the reliability of PySpark data pipelines, it’s essential to have consistent record-level data from both dimensional and fact tables stored in the Enterprise Data Warehouse (EDW). These tables are then joined with tables from the Enterprise Data Lake (EDL) at runtime.

Data Processing

Data Processing Data Lake Data Warehouse Optimization

Open Data Lakehouse powered by Iceberg for all your Data Warehouse needs

Cloudera

APRIL 3, 2023

Every table change creates an Iceberg snapshot, this helps to resolve concurrency issues and allows readers to scan a stable table state every time. During queries the query engines scan both the data files and delete files belonging to the same snapshot and merge them together (i.e. ID, TBL_ICEBERG_PART_2.NAME,

Data Warehouse

Data Warehouse Snapshot Metadata Cost-Benefit

Build a multi-Region and highly resilient modern data architecture using AWS Glue and AWS Lake Formation

AWS Big Data

JANUARY 24, 2023

This solution only replicates metadata in the Data Catalog, not the actual underlying data. To have a redundant data lake using Lake Formation and AWS Glue in an additional Region, we recommend replicating the Amazon S3-based storage using S3 replication , S3 sync, aws-s3-copy-sync-using-batch or S3 Batch replication process.

Data Architecture

Data Architecture Metadata Data Lake Snapshot

Introducing AWS Glue crawler and create table support for Apache Iceberg format

AWS Big Data

AUGUST 16, 2023

Iceberg has become very popular for its support for ACID transactions in data lakes and features like schema and partition evolution, time travel, and rollback. With each crawler run, the crawler inspects each of the S3 paths and catalogs the schema information, such as new tables, deletes, and updates to schemas in the Data Catalog.

Data Lake

Data Lake Metadata Snapshot Management

Materialized Views in Hive for Iceberg Table Format

Cloudera

FEBRUARY 8, 2024

Subsequently, these snapshot IDs are used to determine the delta changes that should be applied to the materialized view rows. Incremental and full rebuild of materialized view We will insert rows into the base table and examine how the materialized view can be updated to reflect the new data.

Snapshot

Snapshot Metadata Cost-Benefit Data Warehouse

Unleashing the power of Presto: The Uber case study

IBM Big Data Hub

SEPTEMBER 25, 2023

Uber understood that digital superiority required the capture of all their transactional data, not just a sampling. They stood up a file-based data lake alongside their analytical database. Because much of the work done on their data lake is exploratory in nature, many users want to execute untested queries on petabytes of data.

OLAP

OLAP Data Lake Data-driven Online Analytical Processing

Achieve data resilience using Amazon OpenSearch Service disaster recovery with snapshot and restore

Manage concurrent write conflicts in Apache Iceberg on the AWS Glue Data Catalog

Webinars

Trending Sources

Migrate an existing data lake to a transactional data lake using Apache Iceberg

Webinars

Use open table format libraries on AWS Glue 5.0 for Apache Spark

Run Apache XTable in AWS Lambda for background conversion of open table formats

Use Apache Iceberg in a data lake to support incremental data processing

How Cloudinary transformed their petabyte scale streaming data lake with Apache Iceberg and AWS Analytics

Build a high-performance quant research platform with Apache Iceberg

Use Apache Iceberg in your data lake with Amazon S3, AWS Glue, and Snowflake

Choosing an open table format for your transactional data lake on AWS

Load data incrementally from transactional data lakes to data warehouses

Build a serverless transactional data lake with Apache Iceberg, Amazon EMR Serverless, and Amazon Athena

Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes

Orca Security’s journey to a petabyte-scale data lake with Apache Iceberg and AWS Analytics

Monitoring Apache Iceberg metadata layer using AWS Lambda, AWS Glue, and AWS CloudWatch

Simplify operational data processing in data lakes using AWS Glue and Apache Hudi

Perform upserts in a data lake using Amazon Athena and Apache Iceberg

Building end-to-end data lineage for one-time and complex queries using Amazon Athena, Amazon Redshift, Amazon Neptune and dbt

Use AWS Glue ETL to perform merge, partition evolution, and schema evolution on Apache Iceberg

Simplify data integration with AWS Glue and zero-ETL to Amazon SageMaker Lakehouse

The AWS Glue Data Catalog now supports storage optimization of Apache Iceberg tables

Implement slowly changing dimensions in a data lake using AWS Glue and Delta

Use Amazon Athena with Spark SQL for your open-source transactional table formats

Apache Iceberg optimization: Solving the small files problem in Amazon EMR

Implement historical record lookup and Slowly Changing Dimensions Type-2 using Apache Iceberg

Backtesting index rebalancing arbitrage with Amazon EMR and Apache Iceberg

Introducing Apache Hudi support with AWS Glue crawlers

MLOps and DevOps: Why Data Makes It Different

Introducing native support for Apache Hudi, Delta Lake, and Apache Iceberg on AWS Glue for Apache Spark, Part 2: AWS Glue Studio Visual Editor

Build a data lake with Apache Flink on Amazon EMR

Enforce fine-grained access control on Open Table Formats via Amazon EMR integrated with AWS Lake Formation

Migrate Amazon Redshift from DC2 to RA3 to accommodate increasing data volumes and analytics demands

Manage your data warehouse cost allocations with Amazon Redshift Serverless tagging

Implement disaster recovery with Amazon Redshift

Analyze Elastic IP usage history using Amazon Athena and AWS CloudTrail

Build and manage your modern data stack using dbt and AWS Glue through dbt-glue, the new “trusted” dbt adapter

How Gupshup built their multi-tenant messaging analytics platform on Amazon Redshift

Power enterprise-grade Data Vaults with Amazon Redshift – Part 2

Simplifying data processing at Capitec with Amazon Redshift integration for Apache Spark

Open Data Lakehouse powered by Iceberg for all your Data Warehouse needs

Build a multi-Region and highly resilient modern data architecture using AWS Glue and AWS Lake Formation

Introducing AWS Glue crawler and create table support for Apache Iceberg format

Materialized Views in Hive for Iceberg Table Format

Unleashing the power of Presto: The Uber case study

Stay Connected