This post introduces an active-passive approach using a snapshot and restore strategy. The snapshot and restore strategy in OpenSearch Service involves creating point-in-time backups, known as snapshots, of your OpenSearch domain.
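As a rough illustration of that strategy, here is a minimal sketch of taking and restoring a manual snapshot through the OpenSearch `_snapshot` REST API; the domain endpoint, repository name, snapshot name, and credentials are hypothetical placeholders (IAM-secured domains would need SigV4-signed requests rather than basic auth):

```python
# Hedged sketch: manual snapshot and restore via the OpenSearch _snapshot API.
# Endpoint, repository, snapshot name, and credentials are hypothetical.
import requests

host = "https://my-domain.us-east-1.es.amazonaws.com"  # hypothetical domain endpoint
repo = "my-s3-repo"           # snapshot repository previously registered against S3
auth = ("admin", "password")  # replace with SigV4 signing for IAM-secured domains

# Take a point-in-time snapshot of all indexes in the active domain.
resp = requests.put(f"{host}/_snapshot/{repo}/snapshot-2024-01-01", auth=auth)
resp.raise_for_status()

# Restore that snapshot into the passive domain (run against its endpoint).
resp = requests.post(f"{host}/_snapshot/{repo}/snapshot-2024-01-01/_restore", auth=auth)
resp.raise_for_status()
```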
A data lake is a centralized repository that you can use to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure it, and run different types of analytics for better business insights.
Businesses are constantly evolving, and data leaders are challenged every day to meet new requirements. For many enterprises and large organizations, it is not feasible to have one processing engine or tool to deal with the various business requirements. This post is co-written with Andries Engelbrecht and Scott Teal from Snowflake.
Iceberg has become very popular for its support for ACID transactions in data lakes and features like schema and partition evolution, time travel, and rollback. AWS Glue 3.0 and later supports the Apache Iceberg framework for data lakes. The snapshot points to the manifest list.
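As a hedged sketch of what enabling Iceberg in a Spark session looks like (the catalog name and warehouse path below are illustrative assumptions, not values from the post):

```python
# Minimal sketch: a Spark session with an Iceberg catalog backed by the AWS
# Glue Data Catalog. Catalog name and warehouse location are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://my-bucket/warehouse/")
    .getOrCreate()
)
```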
One-time and complex queries are two common scenarios in enterprise data analytics. Complex queries, on the other hand, refer to large-scale data processing and in-depth analysis based on petabyte-level data warehouses in massive data scenarios.
Enterprises and organizations across the globe want to harness the power of data to make better decisions by putting data at the center of every decision-making process. Each write to an Iceberg table creates a new snapshot, or version, of the table, so older snapshots need to be expired periodically to keep metadata and storage in check.
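Iceberg ships a Spark procedure for exactly this housekeeping; a minimal sketch, assuming an Iceberg-enabled Spark session and a hypothetical `glue_catalog.db.sales` table:

```python
# Expire snapshots older than a cutoff while retaining the most recent ten.
# Catalog, table, timestamp, and retention count are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Iceberg extensions assumed configured

spark.sql("""
    CALL glue_catalog.system.expire_snapshots(
        table => 'db.sales',
        older_than => TIMESTAMP '2024-01-01 00:00:00',
        retain_last => 10
    )
""")
```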
Iceberg provides time travel and snapshotting capabilities out of the box to manage lookahead bias that could be embedded in the data (such as delayed data delivery). Iceberg also simplifies data corrections and updates for quants in capital markets through its robust insert, delete, and update capabilities.
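A minimal sketch of Iceberg time travel in Spark SQL, assuming an Iceberg-enabled session; the table name, timestamp, and snapshot ID are hypothetical:

```python
# Pin reads to a past table state to avoid lookahead bias.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Iceberg extensions assumed configured

# Query the table as of a wall-clock timestamp.
spark.sql(
    "SELECT * FROM glue_catalog.db.prices TIMESTAMP AS OF '2024-01-01 00:00:00'"
).show()

# Or as of a specific snapshot ID (hypothetical value).
spark.sql(
    "SELECT * FROM glue_catalog.db.prices VERSION AS OF 4348295034983"
).show()
```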
Since the deluge of big data over a decade ago, many organizations have learned to build applications to process and analyze petabytes of data. Data lakes have served as a central repository to store structured and unstructured data at any scale and in various formats.
With this new functionality, customers can create up-to-date replicas of their data from applications such as Salesforce, ServiceNow, and Zendesk in an Amazon SageMaker Lakehouse and Amazon Redshift. SageMaker Lakehouse gives you the flexibility to access and query your data in place with all Apache Iceberg-compatible tools and engines.
When you build your transactional data lake using Apache Iceberg to solve your functional use cases, you need to focus on operational use cases for your S3 data lake to optimize the production environment. This example is demonstrated on Amazon EMR release emr-6.10.0, which includes Jupyter Enterprise Gateway 2.6.0.
A modern data architecture is an evolutionary architecture pattern designed to integrate a data lake, data warehouse, and purpose-built stores with a unified governance model. The company wanted the ability to continue processing operational data in the secondary Region in the rare event of primary Region failure.
As enterprises collect increasing amounts of data from various sources, the structure and organization of that data often need to change over time to meet evolving analytical needs. Schema evolution enables adding, deleting, renaming, or modifying columns without needing to rewrite existing data.
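A minimal sketch of what that looks like with Iceberg's Spark DDL (these are metadata-only operations, so no data files are rewritten); the table and column names are hypothetical:

```python
# Schema evolution on an Iceberg table: add, rename, widen, and drop columns.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Iceberg extensions assumed configured

spark.sql("ALTER TABLE glue_catalog.db.orders ADD COLUMN discount_pct double")
spark.sql("ALTER TABLE glue_catalog.db.orders RENAME COLUMN cust_id TO customer_id")
spark.sql("ALTER TABLE glue_catalog.db.orders ALTER COLUMN order_qty TYPE bigint")  # int -> bigint widening
spark.sql("ALTER TABLE glue_catalog.db.orders DROP COLUMN legacy_flag")
```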
Amazon Redshift is a popular cloud data warehouse, offering a fully managed cloud-based service that seamlessly integrates with an organization's Amazon Simple Storage Service (Amazon S3) data lake, real-time streams, machine learning (ML) workflows, transactional workflows, and much more—all while providing up to 7.9x better price-performance than other cloud data warehouses.
Terminology: let's first discuss some of the terminology used in this post. Research data lake on Amazon S3 – a data lake is a large, centralized repository that allows you to manage all your structured and unstructured data at any scale. This is where the tagging feature in Apache Iceberg comes in handy.
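A minimal sketch of Iceberg's tag DDL in Spark SQL, assuming an Iceberg-enabled session; the table, tag name, version, and retention period are hypothetical:

```python
# Tag a known-good snapshot so researchers can reproducibly query that state.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Iceberg extensions assumed configured

spark.sql("""
    ALTER TABLE glue_catalog.db.research
    CREATE TAG `eoy-2023` AS OF VERSION 123 RETAIN 365 DAYS
""")

# Later, query the tagged state by name.
spark.sql("SELECT * FROM glue_catalog.db.research VERSION AS OF 'eoy-2023'").show()
```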
To build a data-driven business, it is important to democratize enterprise data assets in a data catalog. With a unified data catalog, you can quickly search datasets and figure out data schema, data format, and location. Verify all table metadata is stored in the AWS Glue Data Catalog.
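One way to perform that verification programmatically is through the Glue API; a hedged sketch, where the database name is a placeholder:

```python
# List every table registered in a Glue Data Catalog database along with its
# storage location, to confirm metadata landed in the catalog.
import boto3

glue = boto3.client("glue")
paginator = glue.get_paginator("get_tables")
for page in paginator.paginate(DatabaseName="analytics_db"):  # hypothetical database
    for table in page["TableList"]:
        location = table.get("StorageDescriptor", {}).get("Location", "n/a")
        print(f"{table['Name']}: {location}")
```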
In the first post of this series, we described how AWS Glue for Apache Spark works with Apache Hudi, Linux Foundation Delta Lake, and Apache Iceberg datasets using the native support of those data lake formats. Even without prior experience using Hudi, Delta Lake, or Iceberg, you can easily achieve typical use cases.
This is both frustrating for companies that would prefer making ML an ordinary, fuss-free, value-generating function like software engineering, and exciting for vendors who see the opportunity to create buzz around a new category of enterprise software. The new category is often called MLOps. Enter the software development layers.
With Amazon EMR 6.15, we launched AWS Lake Formation-based fine-grained access controls (FGAC) on Open Table Formats (OTFs), including Apache Hudi, Apache Iceberg, and Delta Lake. Many large enterprises seek to use their transactional data lake to gain insights and improve decision-making.
With built-in features such as automated snapshots and cross-Region replication, you can enhance your disaster resilience with Amazon Redshift. Amazon Redshift supports two kinds of snapshots: automatic and manual, which can be used to recover data. Snapshots are point-in-time backups of the Redshift data warehouse.
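A minimal sketch of taking a manual snapshot and restoring from it with boto3; the cluster and snapshot identifiers are hypothetical:

```python
# Manual snapshot and restore for a provisioned Redshift cluster.
import boto3

redshift = boto3.client("redshift")

# Take a manual point-in-time snapshot.
redshift.create_cluster_snapshot(
    SnapshotIdentifier="my-cluster-2024-01-01",  # hypothetical names
    ClusterIdentifier="my-cluster",
    ManualSnapshotRetentionPeriod=35,  # keep for 35 days
)

# Recover by restoring the snapshot into a new cluster.
redshift.restore_from_cluster_snapshot(
    ClusterIdentifier="my-cluster-restored",
    SnapshotIdentifier="my-cluster-2024-01-01",
)
```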
AWS Lake Formation helps with enterprise data governance and is important for a data mesh architecture. It works with the AWS Glue Data Catalog to enforce data access and governance. This solution only replicates metadata in the Data Catalog, not the actual underlying data.
The data sourcing problem: to ensure the reliability of PySpark data pipelines, it's essential to have consistent record-level data from both dimensional and fact tables stored in the Enterprise Data Warehouse (EDW). These tables are then joined with tables from the Enterprise Data Lake (EDL) at runtime.
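A minimal PySpark sketch of that runtime join; the EDW and EDL table names are hypothetical stand-ins:

```python
# Join EDW dimension/fact tables with an EDL table at runtime.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

dim_customer = spark.table("edw.dim_customer")  # dimensional table (EDW)
fact_orders = spark.table("edw.fact_orders")    # fact table (EDW)
clickstream = spark.table("edl.clickstream")    # enterprise data lake table (EDL)

enriched = (
    fact_orders
    .join(dim_customer, "customer_id")            # record-level consistency matters here
    .join(clickstream, "customer_id", "left")
)
enriched.show()
```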
It makes it fast, simple, and cost-effective to analyze all your data using standard SQL and your existing business intelligence (BI) tools. Moreover, no separate effort is required to process historical data versus live streaming data. Apart from incremental analytics, Redshift simplifies a lot of operational aspects.
We have seen strong customer demand to expand its scope to cloud-based data lakes, because data lakes are increasingly the enterprise solution for large-scale data initiatives due to their power and capabilities.
Register the S3 path storing the table using Lake Formation: we register the full S3 path in Lake Formation. Navigate to the Lake Formation console; in the navigation pane, under Register and ingest, choose Data lake locations. The Iceberg table keeps track of the snapshots.
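The same registration can also be scripted; a hedged boto3 sketch, where the bucket path is a placeholder:

```python
# Register the S3 location that stores the Iceberg table with Lake Formation.
import boto3

lf = boto3.client("lakeformation")
lf.register_resource(
    ResourceArn="arn:aws:s3:::my-bucket/warehouse/db/my_table/",  # hypothetical path
    UseServiceLinkedRole=True,  # let Lake Formation use its service-linked role
)
```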
Every table change creates an Iceberg snapshot; this helps resolve concurrency issues and allows readers to scan a stable table state every time. During queries, the query engines scan both the data files and the delete files belonging to the same snapshot and merge them together.
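Those snapshots are inspectable through Iceberg's metadata tables; a minimal sketch, with the catalog and table names assumed:

```python
# List the snapshots an Iceberg table keeps, newest first.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Iceberg extensions assumed configured

spark.sql("""
    SELECT snapshot_id, committed_at, operation
    FROM glue_catalog.db.orders.snapshots
    ORDER BY committed_at DESC
""").show(truncate=False)
```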
But even with its rise, AI is still a struggle for some enterprises. AI, and any analytics for that matter, are only as good as the data upon which they are based. Cloudera is now the only provider to offer an open data lakehouse with Apache Iceberg for cloud and on-premises. And that’s where the rub is.
Furthermore, data events are filtered, enriched, and transformed to a consumable format using a stream processor. The result is made available to the application by querying the latest snapshot. For more details, refer to Create a low-latency source-to-data lake pipeline using Amazon MSK Connect, Apache Flink, and Apache Hudi.
In this blog, we will walk through how we can apply existing enterprise data to better understand and estimate Scope 1 carbon footprint using Amazon Simple Storage Service (Amazon S3) and Amazon Athena, a serverless interactive analytics service that makes it easy to analyze data using standard SQL.
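A hedged sketch of what such a query might look like through the Athena API; the database, table, columns, and output bucket are all illustrative assumptions, not the post's actual schema:

```python
# Run a standard-SQL aggregation over S3-resident fuel-usage data via Athena.
import boto3

athena = boto3.client("athena")
resp = athena.start_query_execution(
    QueryString="""
        SELECT facility, SUM(fuel_liters * emission_factor) AS scope1_co2e_kg
        FROM carbon.fuel_usage
        GROUP BY facility
    """,
    QueryExecutionContext={"Database": "carbon"},  # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(resp["QueryExecutionId"])  # poll get_query_execution for completion
```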
Uber understood that digital superiority required the capture of all their transactional data, not just a sampling. They stood up a file-based data lake alongside their analytical database. Because much of the work done on their data lake is exploratory in nature, many users want to execute untested queries on petabytes of data.
Tagging: consider tagging your Amazon Redshift resources to quickly identify which clusters and snapshots contain PII data, the owners, the data retention policy, and so on. Redshift resources, such as namespaces, workgroups, snapshots, and clusters, can be tagged. Tags provide metadata about resources at a glance.
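A minimal boto3 sketch of that tagging pattern; the ARNs, account ID, and tag values are placeholders:

```python
# Tag a Redshift cluster and snapshot so PII scope, ownership, and retention
# are visible at a glance. ARNs and tag values are hypothetical.
import boto3

redshift = boto3.client("redshift")
for arn in (
    "arn:aws:redshift:us-east-1:123456789012:cluster:my-cluster",
    "arn:aws:redshift:us-east-1:123456789012:snapshot:my-cluster/my-snapshot",
):
    redshift.create_tags(
        ResourceName=arn,
        Tags=[
            {"Key": "contains-pii", "Value": "true"},
            {"Key": "owner", "Value": "data-platform"},
            {"Key": "retention-policy", "Value": "35-days"},
        ],
    )
```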
It can receive the events from an input Kinesis data stream and route the resulting stream to an output data stream. State snapshot in Amazon S3 – You can store the state snapshot in Amazon S3 for tracking. You can create a stateful functions cluster with Apache Flink based on your application business logic.
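A hedged PyFlink sketch of enabling those state snapshots (checkpoints) to Amazon S3; the checkpoint interval and bucket path are assumptions:

```python
# Periodically snapshot Flink operator state to S3 via checkpointing.
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(60_000)  # snapshot state every 60 seconds
env.get_checkpoint_config().set_checkpoint_storage_dir(
    "s3://my-flink-state/checkpoints/"  # hypothetical bucket; needs the S3 filesystem plugin
)
```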
For many organizations, a data fabric is a first step to becoming more data driven. A data fabric answers perhaps the biggest question of all: what data do we have to work with? The tremendous overhead placed on IT hampers the speed with which organizations can bring together ever more data to deploy new use cases.
Amazon Redshift is a petabyte-scale, enterprise-grade cloud data warehouse service delivering the best price-performance. Today, tens of thousands of customers run business-critical workloads on Amazon Redshift to cost-effectively and quickly analyze their data using standard SQL and existing business intelligence (BI) tools.
Choose your level of metrics to monitor: Workgroup, Namespace, or Snapshot storage. If we select Workgroup, we can choose from the workgroup-level metrics shown in the following screenshot. The following screenshot shows the metrics available at the snapshot storage level.
Snowflake is a solution for data warehousing, data lakes, and data application development, and specializes in securely sharing and consuming data. Domino Data Lab is the system of record for enterprise data science teams.
By being a truly open table format, Apache Iceberg fits well within the vision of the Cloudera Data Platform (CDP). Let's highlight some of those benefits, and why choosing CDP and Iceberg can future-proof your next-generation data architecture. #4: Enterprise grade: financial regulation, reproducibility for ML Ops.
Depending on your enterprise’s culture and goals, your migration pattern of a legacy multi-tenant data platform to Amazon Redshift could use one of the following strategies: Leapfrog strategy – In this strategy, you move to an AWS modern data architecture and migrate one tenant at a time.
Since we announced the general availability of Apache Iceberg in Cloudera Data Platform (CDP), Cloudera customers, such as Teranet , have built open lakehouses to future-proof their data platforms for all their analytical workloads. Cloudera partners are also benefiting from Apache Iceberg in CDP. ORC open file format support.
Tricentis is the global leader in continuous testing for DevOps, cloud, and enterprise applications. From detailed design to a beta release, Tricentis had customers expecting to consume data from a data lake specific to only their data, along with all of the data that had been generated for over a decade.
Why should Chief Data & Analytics Officers care about data security? Most enterprises in the 21st century regard data as an incredibly valuable asset, and insurance is no exception: know your customers better, know your market better, operate more efficiently, and realize other business benefits. That's the reward.
This data confidence gap between C-level executives and IT leaders at the vice president and director levels could lead to major problems when it comes time to train AI models or roll out other data-driven initiatives, experts warn. Then, after the internal service is finished, IT teams move on to the next thing, Agarwal says.
To optimize their security operations, organizations are adopting modern approaches that combine real-time monitoring with scalable data analytics. They are using datalake architectures and Apache Iceberg to efficiently process large volumes of security data while minimizing operational overhead.
All of this data is essential for investigations and threat hunting, but existing systems often struggle to manage it efficiently. Ingesting the data is often too slow and/or expensive, leading to latent responses and missed opportunities.