Data Lake, Snapshot and Strategy

Achieve data resilience using Amazon OpenSearch Service disaster recovery with snapshot and restore

AWS Big Data

NOVEMBER 11, 2024

Disaster recovery is vital for organizations, offering a proactive strategy to mitigate the impact of unforeseen events like system failures, natural disasters, or cyberattacks. In Disaster Recovery (DR) Architecture on AWS, Part I: Strategies for Recovery in the Cloud , we introduced four major strategies for disaster recovery (DR) on AWS.

Snapshot

Snapshot Strategy Dashboards Data Lake

Manage concurrent write conflicts in Apache Iceberg on the AWS Glue Data Catalog

AWS Big Data

APRIL 8, 2025

In modern data architectures, Apache Iceberg has emerged as a popular table format for data lakes, offering key features including ACID transactions and concurrent write support. The Data Catalog provides the functionality as the Iceberg catalog. Determine the changes in transaction, and write new data files.

Snapshot

Snapshot Management Metadata Big Data

Build a high-performance quant research platform with Apache Iceberg

AWS Big Data

JANUARY 9, 2025

In our previous post Backtesting index rebalancing arbitrage with Amazon EMR and Apache Iceberg , we showed how to use Apache Iceberg in the context of strategy backtesting. Our analysis shows that Iceberg can accelerate query performance by up to 52%, reduce operational costs, and significantly improve data management at scale.

Metadata

Metadata Snapshot Cost-Benefit Optimization

Webinars

Going Beyond Chatbots: Connecting AI to Your Tools, Systems, & Data

Automation, Evolved: Your New Playbook for Smarter Knowledge Work

MORE WEBINARS

Migrate an existing data lake to a transactional data lake using Apache Iceberg

AWS Big Data

OCTOBER 3, 2023

A data lake is a centralized repository that you can use to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data and then run different types of analytics for better business insights.

Data Lake

Data Lake Metadata Snapshot Recreation/Entertainment

Use open table format libraries on AWS Glue 5.0 for Apache Spark

AWS Big Data

DECEMBER 4, 2024

Open table formats are emerging in the rapidly evolving domain of big data management, fundamentally altering the landscape of data storage and analysis. By providing a standardized framework for data representation, open table formats break down data silos, enhance data quality, and accelerate analytics at scale.

Snapshot

Snapshot Metadata Data Lake Optimization

Use Apache Iceberg in a data lake to support incremental data processing

AWS Big Data

MARCH 2, 2023

Iceberg has become very popular for its support for ACID transactions in data lakes and features like schema and partition evolution, time travel, and rollback. and later supports the Apache Iceberg framework for data lakes. The snapshot points to the manifest list. AWS Glue 3.0

Data Lake

Data Lake Data Processing Metadata Snapshot

How Cloudinary transformed their petabyte scale streaming data lake with Apache Iceberg and AWS Analytics

AWS Big Data

JUNE 10, 2024

A modern data strategy redefines and enables sharing data across the enterprise and allows for both reading and writing of a singular instance of the data using an open table format.

Data Lake

Data Lake Metadata Snapshot Analytics

Choosing an open table format for your transactional data lake on AWS

AWS Big Data

JUNE 9, 2023

A modern data architecture enables companies to ingest virtually any type of data through automated pipelines into a data lake, which provides highly durable and cost-effective object storage at petabyte or exabyte scale.

Data Lake

Data Lake Metadata Statistics Optimization

Build a serverless transactional data lake with Apache Iceberg, Amazon EMR Serverless, and Amazon Athena

AWS Big Data

MARCH 10, 2023

Since the deluge of big data over a decade ago, many organizations have learned to build applications to process and analyze petabytes of data. Data lakes have served as a central repository to store structured and unstructured data at any scale and in various formats.

Data Lake

Data Lake Sales Data Warehouse Snapshot

Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes

AWS Big Data

MAY 24, 2023

When you build your transactional data lake using Apache Iceberg to solve your functional use cases, you need to focus on operational use cases for your S3 data lake to optimize the production environment. availability. You still need to set appropriate EMRFS retries to provide additional resiliency.

Data Lake

Data Lake Snapshot Metadata Optimization

Monitoring Apache Iceberg metadata layer using AWS Lambda, AWS Glue, and AWS CloudWatch

AWS Big Data

JULY 29, 2024

In the era of big data, data lakes have emerged as a cornerstone for storing vast amounts of raw data in its native format. They support structured, semi-structured, and unstructured data, offering a flexible and scalable environment for data ingestion from multiple sources.

Metadata

Metadata Snapshot Data Lake Metrics

Simplify operational data processing in data lakes using AWS Glue and Apache Hudi

AWS Big Data

SEPTEMBER 13, 2023

A modern data architecture is an evolutionary architecture pattern designed to integrate a data lake, data warehouse, and purpose-built stores with a unified governance model. The company wanted the ability to continue processing operational data in the secondary Region in the rare event of primary Region failure.

Data Lake

Data Lake Data Processing Metadata Snapshot

Orca Security’s journey to a petabyte-scale data lake with Apache Iceberg and AWS Analytics

AWS Big Data

JULY 20, 2023

With data becoming the driving force behind many industries today, having a modern data architecture is pivotal for organizations to be successful. In this post, we describe Orca’s journey building a transactional data lake using Amazon Simple Storage Service (Amazon S3), Apache Iceberg, and AWS Analytics.

Data Lake

Data Lake Analytics Snapshot Data Quality

Use AWS Glue ETL to perform merge, partition evolution, and schema evolution on Apache Iceberg

AWS Big Data

MARCH 4, 2024

As enterprises collect increasing amounts of data from various sources, the structure and organization of that data often need to change over time to meet evolving analytical needs. Schema evolution enables adding, deleting, renaming, or modifying columns without needing to rewrite existing data.

Snapshot

Snapshot Data Lake Metadata Recreation/Entertainment

Backtesting index rebalancing arbitrage with Amazon EMR and Apache Iceberg

AWS Big Data

JULY 3, 2023

Backtesting is a process used in quantitative finance to evaluate trading strategies using historical data. This helps traders determine the potential profitability of a strategy and identify any risks associated with it, enabling them to optimize it for better performance.

Snapshot

Snapshot Data Lake Testing Strategy

Use Amazon Athena with Spark SQL for your open-source transactional table formats

AWS Big Data

JANUARY 24, 2024

AWS-powered data lakes, supported by the unmatched availability of Amazon Simple Storage Service (Amazon S3), can handle the scale, agility, and flexibility required to combine different data and analytics approaches.

Snapshot

Snapshot Data Lake Metadata Optimization

Implement slowly changing dimensions in a data lake using AWS Glue and Delta

AWS Big Data

MARCH 28, 2023

As organizations across the globe are modernizing their data platforms with data lakes on Amazon Simple Storage Service (Amazon S3), handling SCDs in data lakes can be challenging.

Data Lake

Data Lake Testing Snapshot Big Data

Optimization Strategies for Iceberg Tables

Cloudera

FEBRUARY 14, 2024

Introduction Apache Iceberg has recently grown in popularity because it adds data warehouse-like capabilities to your data lake making it easier to analyze all your data — structured and unstructured. You can take advantage of a combination of the strategies provided and adapt them to your particular use cases.

Optimization

Optimization Strategy Snapshot Metadata

Implement disaster recovery with Amazon Redshift

AWS Big Data

JUNE 27, 2024

With built-in features such as automated snapshots and cross-Region replication, you can enhance your disaster resilience with Amazon Redshift. To develop your disaster recovery plan, you should complete the following tasks: Define your recovery objectives for downtime and data loss (RTO and RPO) for data and metadata.

Snapshot

Snapshot Data Warehouse Data Processing Strategy

Enforce fine-grained access control on Open Table Formats via Amazon EMR integrated with AWS Lake Formation

AWS Big Data

JANUARY 17, 2024

With Amazon EMR 6.15, we launched AWS Lake Formation based fine-grained access controls (FGAC) on Open Table Formats (OTFs), including Apache Hudi, Apache Iceberg, and Delta lake. Many large enterprise companies seek to use their transactional data lake to gain insights and improve decision-making.

Data Lake

Data Lake Snapshot Big Data Data-driven

Build and manage your modern data stack using dbt and AWS Glue through dbt-glue, the new “trusted” dbt adapter

AWS Big Data

NOVEMBER 29, 2023

We have seen a strong customer demand to expand its scope to cloud-based data lakes because data lakes are increasingly the enterprise solution for large-scale data initiatives due to their power and capabilities. Let’s say that this company is located in Europe and the data product must comply with the GDPR.

Data Lake

Data Lake Management Metrics Data Warehouse

Analyze Elastic IP usage history using Amazon Athena and AWS CloudTrail

AWS Big Data

MAY 15, 2024

By extracting detailed information from CloudTrail and querying it using Athena, this solution streamlines the process of data collection, analysis, and reporting of EIP usage within an AWS account. Additionally, you can analyze activity logs with AWS CloudTrail Lake and Amazon Athena.

Snapshot

Snapshot Data Lake Optimization Reporting

Build a multi-Region and highly resilient modern data architecture using AWS Glue and AWS Lake Formation

AWS Big Data

JANUARY 24, 2023

This solution only replicates metadata in the Data Catalog, not the actual underlying data. To have a redundant data lake using Lake Formation and AWS Glue in an additional Region, we recommend replicating the Amazon S3-based storage using S3 replication , S3 sync, aws-s3-copy-sync-using-batch or S3 Batch replication process.

Data Architecture

Data Architecture Metadata Data Lake Snapshot

Unleashing the power of Presto: The Uber case study

IBM Big Data Hub

SEPTEMBER 25, 2023

Uber understood that digital superiority required the capture of all their transactional data, not just a sampling. They stood up a file-based data lake alongside their analytical database. Because much of the work done on their data lake is exploratory in nature, many users want to execute untested queries on petabytes of data.

OLAP

OLAP Data Lake Data-driven Online Analytical Processing

Interact with Apache Iceberg tables using Amazon Athena and cross account fine-grained permissions using AWS Lake Formation

AWS Big Data

MARCH 23, 2023

Register the S3 path storing the table using Lake Formation We register the S3 full path in Lake Formation: Navigate to the Lake Formation console. In the navigation pane, under Register and ingest , choose Data lake locations. The Iceberg table keeps track of the snapshots. We’d love to hear your feedback!

Interactive

Interactive Snapshot Data Lake Software

Five actionable steps to GDPR compliance (Right to be forgotten) with Amazon Redshift

AWS Big Data

JULY 28, 2023

To successfully respond to a data subject’s requests, organizations should have a clear strategy to determine how data is forgotten, flagged, anonymized, or deleted, and they should have clear guidelines in place for data audits. Note that putting a comprehensive data strategy in place is not in scope for this post.

Snapshot

Snapshot Metadata Measurement Data Warehouse

Build an Amazon Redshift data warehouse using an Amazon DynamoDB single-table design

AWS Big Data

JUNE 21, 2023

A typical ask for this data may be to identify sales trends as well as sales growth on a yearly, monthly, or even daily basis. A key pillar of AWS’s modern data strategy is the use of purpose-built data stores for specific use cases to achieve performance, cost, and scale. This is achieved by partitioning the data.

Data Warehouse

Data Warehouse Data Lake OLAP Cost-Benefit

How Amazon Devices scaled and optimized real-time demand and supply forecasts using serverless analytics

AWS Big Data

FEBRUARY 1, 2023

With data volumes exhibiting a double-digit percentage growth rate year on year and the COVID pandemic disrupting global logistics in 2021, it became more critical to scale and generate near-real-time data. This introduces the need for both polling and pushing the data to access and analyze in near-real time.

Optimization

Optimization Forecasting Data Lake Metadata

Architectural patterns for real-time analytics using Amazon Kinesis Data Streams, part 1

AWS Big Data

JANUARY 8, 2024

It can receive the events from an input Kinesis data stream and route the resulting stream to an output data stream. State snapshot in Amazon S3 – You can store the state snapshot in Amazon S3 for tracking. You can create a stateful functions cluster with Apache Flink based on your application business logic.

Analytics

Analytics IoT Data-driven Snapshot

Unlock scalability, cost-efficiency, and faster insights with large-scale data migration to Amazon Redshift

AWS Big Data

AUGUST 1, 2024

Effective planning, thorough risk assessment, and a well-designed migration strategy are crucial to mitigating these challenges and implementing a successful transition to the new data warehouse environment on Amazon Redshift. Organic strategy – This strategy uses a lift and shift data schema using migration tools.

Data Warehouse

Data Warehouse KPI Optimization Cost-Benefit

Estimating Scope 1 Carbon Footprint with Amazon Athena

AWS Big Data

AUGUST 2, 2023

The data architecture diagram below shows an example of how you could use AWS services to calculate and visualize an organization’s estimated carbon footprint. Customers have the flexibility to choose the services in each stage of the data pipeline based on their use case.

Data Lake

Data Lake Measurement Visualization Data Architecture

Build incremental data pipelines to load transactional data changes using AWS DMS, Delta 2.0, and Amazon EMR Serverless

AWS Big Data

MARCH 3, 2023

Building data lakes from continuously changing transactional data of databases and keeping data lakes up to date is a complex task and can be an operational challenge. You can then apply transformations and store data in Delta format for managing inserts, updates, and deletes.

Data Lake

Data Lake Dashboards Metrics Metadata

Dimensional modeling in Amazon Redshift

AWS Big Data

JULY 19, 2023

We show how to perform extract, transform, and load (ELT), an integration process focused on getting the raw data from a data lake into a staging layer to perform the modeling. He is passionate about using data to drive insights, while aligning with business requirements and objectives.

Modeling

Modeling Sales Data Warehouse Snapshot

How Tricentis unlocks insights across the software development lifecycle at speed and scale using Amazon Redshift

AWS Big Data

MARCH 3, 2023

From detailed design to a beta release, Tricentis had customers expecting to consume data from a data lake specific to only their data, and all of the data that had been generated for over a decade. Data export As stated earlier, some customers want to get an export of their test data and create their data lake.

Software

Software Data Lake Testing Cost-Benefit

Enrich your customer data with geospatial insights using Amazon Redshift, AWS Data Exchange, and Amazon QuickSight

AWS Big Data

MARCH 18, 2024

Namespaces group together all of the resources you use in Redshift Serverless, such as schemas, tables, users, datashares, and snapshots. First, we need to give our Redshift namespace permission via AWS Identity and Access Management (IAM) to access subscriptions on AWS Data Exchange. Workgroup – A collection of compute resources.

Data Warehouse

Data Warehouse Visualization Snapshot Data-driven

Accelerating revenue growth with real-time analytics: Poshmark’s journey

AWS Big Data

MARCH 20, 2023

The data from the Kinesis data stream is consumed by two applications: A Spark streaming application on Amazon EMR is used to write data from the Kinesis data stream to a data lake hosted on Amazon Simple Storage Service (Amazon S3) in a partitioned way.

Analytics

Analytics Data Processing Slice and Dice Data Lake

Build Write-Audit-Publish pattern with Apache Iceberg branching and AWS Glue Data Quality

AWS Big Data

DECEMBER 9, 2024

Apache Iceberg is an open table format that brings atomicity, consistency, isolation, and durability (ACID) transactions to data lakes, streamlining data management. One of its key features is the ability to manage data using branches. We discuss two common strategies to verify the quality of published data.

Data Quality

Data Quality Publishing Snapshot Data Lake

Implement a serverless CDC process with Apache Iceberg using Amazon DynamoDB and Amazon Athena

AWS Big Data

AUGUST 16, 2023

Iceberg manages large collections of files as tables, and it supports modern analytical data lake operations such as record-level insert, update, delete, and time travel queries. Iceberg also helps guarantee data correctness under concurrent write scenarios. We use a sample JSON file as input to Amazon DynamoDB.

Data Lake

Data Lake Metadata Testing Snapshot

Interview with Dominic Sartorio, Senior Vice President for Products & Development, Protegrity

Corinium

APRIL 25, 2019

Most of my days focus on understanding what’s happening in the market, defining overall product strategy and direction, and translating into execution across the various teams. Then when there is a breach, it comes as a shock, “wow, I didn’t even know that application had access to so much sensitive data”. And then there is the Cloud.

Insurance

Insurance Risk IoT Data-driven

Simplify AWS Glue job orchestration and monitoring with Amazon MWAA

AWS Big Data

MAY 19, 2023

Organizations across all industries have complex data processing requirements for their analytical use cases across different analytics systems, such as data lakes on AWS , data warehouses ( Amazon Redshift ), search ( Amazon OpenSearch Service ), NoSQL ( Amazon DynamoDB ), machine learning ( Amazon SageMaker ), and more.

Machine Learning

Machine Learning Metrics Big Data Management

CIOs must mind their own data confidence gap

CIO Business Intelligence

APRIL 17, 2025

Moreover, 68% of vice presidents in charge of AI or data management already see their companies making decisions based on bad data all or most of the time, versus 47% of C-level IT leaders. That emphasis can erode an organizations data foundation over time. Teams tend to prioritize short-term wins over a long-term outlook.

Data-driven

Data-driven Data Quality Snapshot Dashboards

Jumia builds a next-generation data platform with metadata-driven specification frameworks

AWS Big Data

DECEMBER 20, 2024

After the tables are created, the second task transfers HDFS data to a landing bucket in Amazon S3 using AWS DataSync to sync customer data. This process brings data from all the different layers of the data lake. Its highly recommended to regularly expire snapshots that are no longer needed.

Metadata

Metadata Data-driven Snapshot Data Lake

Streamline AWS WAF log analysis with Apache Iceberg and Amazon Data Firehose

AWS Big Data

FEBRUARY 18, 2025

To optimize their security operations, organizations are adopting modern approaches that combine real-time monitoring with scalable data analytics. They are using data lake architectures and Apache Iceberg to efficiently process large volumes of security data while minimizing operational overhead.

Snapshot

Snapshot Optimization Data Lake Metadata

Stream real-time data into Apache Iceberg tables in Amazon S3 using Amazon Data Firehose

AWS Big Data

NOVEMBER 6, 2024

Second, because traditional data warehousing approaches are unable to keep up with the volume, velocity, and variety of data, engineering teams are building data lakes and adopting open data formats such as Parquet and Apache Iceberg to store their data.

Metadata

Metadata Data Lake Management Internet of Things

Achieve data resilience using Amazon OpenSearch Service disaster recovery with snapshot and restore

Manage concurrent write conflicts in Apache Iceberg on the AWS Glue Data Catalog

Webinars

Trending Sources

Build a high-performance quant research platform with Apache Iceberg

Webinars

Migrate an existing data lake to a transactional data lake using Apache Iceberg

Use open table format libraries on AWS Glue 5.0 for Apache Spark

Use Apache Iceberg in a data lake to support incremental data processing

How Cloudinary transformed their petabyte scale streaming data lake with Apache Iceberg and AWS Analytics

Choosing an open table format for your transactional data lake on AWS

Build a serverless transactional data lake with Apache Iceberg, Amazon EMR Serverless, and Amazon Athena

Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes

Monitoring Apache Iceberg metadata layer using AWS Lambda, AWS Glue, and AWS CloudWatch

Simplify operational data processing in data lakes using AWS Glue and Apache Hudi

Orca Security’s journey to a petabyte-scale data lake with Apache Iceberg and AWS Analytics

Use AWS Glue ETL to perform merge, partition evolution, and schema evolution on Apache Iceberg

Backtesting index rebalancing arbitrage with Amazon EMR and Apache Iceberg

Use Amazon Athena with Spark SQL for your open-source transactional table formats

Implement slowly changing dimensions in a data lake using AWS Glue and Delta

Optimization Strategies for Iceberg Tables

Implement disaster recovery with Amazon Redshift

Enforce fine-grained access control on Open Table Formats via Amazon EMR integrated with AWS Lake Formation

Build and manage your modern data stack using dbt and AWS Glue through dbt-glue, the new “trusted” dbt adapter

Analyze Elastic IP usage history using Amazon Athena and AWS CloudTrail

Build a multi-Region and highly resilient modern data architecture using AWS Glue and AWS Lake Formation

Unleashing the power of Presto: The Uber case study

Interact with Apache Iceberg tables using Amazon Athena and cross account fine-grained permissions using AWS Lake Formation

Five actionable steps to GDPR compliance (Right to be forgotten) with Amazon Redshift

Build an Amazon Redshift data warehouse using an Amazon DynamoDB single-table design

How Amazon Devices scaled and optimized real-time demand and supply forecasts using serverless analytics

Architectural patterns for real-time analytics using Amazon Kinesis Data Streams, part 1

Unlock scalability, cost-efficiency, and faster insights with large-scale data migration to Amazon Redshift

Estimating Scope 1 Carbon Footprint with Amazon Athena

Build incremental data pipelines to load transactional data changes using AWS DMS, Delta 2.0, and Amazon EMR Serverless

Dimensional modeling in Amazon Redshift

How Tricentis unlocks insights across the software development lifecycle at speed and scale using Amazon Redshift

Enrich your customer data with geospatial insights using Amazon Redshift, AWS Data Exchange, and Amazon QuickSight

Accelerating revenue growth with real-time analytics: Poshmark’s journey

Build Write-Audit-Publish pattern with Apache Iceberg branching and AWS Glue Data Quality

Implement a serverless CDC process with Apache Iceberg using Amazon DynamoDB and Amazon Athena

Interview with Dominic Sartorio, Senior Vice President for Products & Development, Protegrity

Simplify AWS Glue job orchestration and monitoring with Amazon MWAA

CIOs must mind their own data confidence gap

Jumia builds a next-generation data platform with metadata-driven specification frameworks

Streamline AWS WAF log analysis with Apache Iceberg and Amazon Data Firehose

Stream real-time data into Apache Iceberg tables in Amazon S3 using Amazon Data Firehose

Stay Connected