A data lake is a centralized repository that you can use to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure it, and run different types of analytics to gain better business insights.
Open table formats are emerging in the rapidly evolving domain of big data management, fundamentally altering the landscape of data storage and analysis. By providing a standardized framework for data representation, open table formats break down data silos, enhance data quality, and accelerate analytics at scale.
Iceberg has become very popular for its support for ACID transactions in data lakes and for features like schema and partition evolution, time travel, and rollback. Each snapshot points to a manifest list, which in turn tracks the table's manifest files. AWS Glue 3.0 and later supports the Apache Iceberg framework for data lakes.
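As a hedged illustration of time travel, an Iceberg-enabled Spark session can query a table as of an earlier timestamp or snapshot; the catalog, table, and snapshot ID below are hypothetical:

    from pyspark.sql import SparkSession

    # Assumes the Iceberg Spark runtime and catalog settings are already on the
    # session config (details omitted here for brevity).
    spark = SparkSession.builder.getOrCreate()

    # Query the table as it existed at an earlier point in time...
    spark.sql("SELECT * FROM glue_catalog.db.ice_table TIMESTAMP AS OF '2024-01-01 00:00:00'").show()

    # ...or as of a specific snapshot ID taken from the table's history.
    spark.sql("SELECT * FROM glue_catalog.db.ice_table VERSION AS OF 123456789012345678").show()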
In this blog post, we dive into different data aspects and how Cloudinary addresses the two concerns of vendor lock-in and cost-efficient data analytics by using Apache Iceberg, Amazon Simple Storage Service (Amazon S3), Amazon Athena, Amazon EMR, and AWS Glue. For example, snapshots older than seven days can be expired with Iceberg's Java SparkActions API:

    SparkActions.get().expireSnapshots(iceTable).expireOlderThan(TimeUnit.DAYS.toMillis(7)).execute()
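A hedged PySpark equivalent of that call uses Iceberg's expire_snapshots stored procedure; the catalog and table names below are hypothetical, and the session comes from the earlier time-travel sketch:

    from datetime import datetime, timedelta

    # Expire snapshots older than 7 days, mirroring the Java call above.
    cutoff = (datetime.utcnow() - timedelta(days=7)).strftime("%Y-%m-%d %H:%M:%S")
    spark.sql(f"""
        CALL glue_catalog.system.expire_snapshots(
            table => 'db.ice_table',
            older_than => TIMESTAMP '{cutoff}'
        )
    """)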
When you build your transactional data lake using Apache Iceberg to solve your functional use cases, you need to focus on operational use cases for your S3 data lake to optimize the production environment for cost and availability. Note the configuration parameters s3.write.tags.write-tag-name and s3.delete.tags.delete-tag-name.
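A hedged sketch of wiring those parameters into a Spark session, so that Iceberg's S3FileIO tags objects on write and on delete and S3 Lifecycle rules can clean them up later; the catalog name and tag values are placeholders:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        # Tag new data files as they are written.
        .config("spark.sql.catalog.glue_catalog.s3.write.tags.write-tag-name", "created")
        # On delete, tag files instead of removing them immediately so a
        # lifecycle rule can expire them on your schedule.
        .config("spark.sql.catalog.glue_catalog.s3.delete.tags.delete-tag-name", "deleted")
        .config("spark.sql.catalog.glue_catalog.s3.delete-enabled", "false")
        .getOrCreate()
    )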
In the era of big data, datalakes have emerged as a cornerstone for storing vast amounts of raw data in its native format. They support structured, semi-structured, and unstructured data, offering a flexible and scalable environment for data ingestion from multiple sources.
With this new functionality, customers can create up-to-date replicas of their data from applications such as Salesforce, ServiceNow, and Zendesk in an Amazon SageMaker Lakehouse and Amazon Redshift. SageMaker Lakehouse gives you the flexibility to access and query your data in-place with all Apache Iceberg compatible tools and engines.
With data becoming the driving force behind many industries today, having a modern data architecture is pivotal for organizations to be successful. In this post, we describe Orca’s journey building a transactional data lake using Amazon Simple Storage Service (Amazon S3), Apache Iceberg, and AWS Analytics.
A modern data architecture is an evolutionary architecture pattern designed to integrate a data lake, data warehouse, and purpose-built stores with a unified governance model. The company wanted the ability to continue processing operational data in the secondary Region in the rare event of primary Region failure.
As organizations across the globe modernize their data platforms with data lakes on Amazon Simple Storage Service (Amazon S3), handling slowly changing dimensions (SCDs) in data lakes can be challenging.
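For example, here is a hedged sketch of one pass of a Type 2 SCD merge on an Iceberg table; all table and column names are hypothetical, and a second pass (not shown) would insert the new version of each changed row:

    # Reuses the Iceberg-enabled SparkSession from the first sketch.
    spark.sql("""
        MERGE INTO glue_catalog.db.dim_customer AS t
        USING staged_updates AS s
        ON t.customer_id = s.customer_id AND t.is_current = true
        WHEN MATCHED AND t.address <> s.address THEN
            UPDATE SET is_current = false, end_date = s.effective_date
        WHEN NOT MATCHED THEN
            INSERT (customer_id, address, is_current, effective_date, end_date)
            VALUES (s.customer_id, s.address, true, s.effective_date, NULL)
    """)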
In our previous post, Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes, we discussed how you can implement solutions to improve operational efficiencies of your Amazon Simple Storage Service (Amazon S3) data lake that uses the Apache Iceberg open table format and runs on the Amazon EMR big data platform.
Apache Hudi is an open table format that brings database and data warehouse capabilities to data lakes. Apache Hudi helps data engineers manage complex challenges, such as managing continuously evolving datasets with transactions while maintaining query performance. Create an S3 bucket if you don't already have one.
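If you prefer to script that prerequisite, a minimal boto3 sketch follows; the bucket name and Region are placeholders, and Regions other than us-east-1 also require a CreateBucketConfiguration:

    import boto3

    s3 = boto3.client("s3", region_name="us-east-1")
    s3.create_bucket(Bucket="example-hudi-data-lake-bucket")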
With Amazon EMR 6.15, we launched AWS Lake Formation-based fine-grained access controls (FGAC) on Open Table Formats (OTFs), including Apache Hudi, Apache Iceberg, and Delta Lake. Many large enterprise companies seek to use their transactional data lake to gain insights and improve decision-making.
We have seen strong customer demand to expand its scope to cloud-based data lakes, because data lakes are increasingly the enterprise solution for large-scale data initiatives due to their power and capabilities. Let’s say that this company is located in Europe and the data product must comply with the GDPR.
In this blog, we share in detail how Cloudera integrates core compute engines, including Apache Hive and Apache Impala in Cloudera Data Warehouse, with Iceberg. We will publish follow-up blogs for other data services. Iceberg basics: Iceberg is an open table format designed for large analytic workloads.
Overview: This blog post describes support for materialized views on the Iceberg table format. Iceberg brings the reliability and simplicity of SQL tables to big data while enabling engines like Hive, Impala, Spark, Trino, Flink, and Presto to work with the same tables at the same time. The materialized views are created as Iceberg tables.
But what most people don’t realize is that behind the scenes, Uber is not just a transportation service; it’s a data and analytics powerhouse. Every day, millions of riders use the Uber app, unwittingly contributing to a complex web of data-driven decisions. They ingest data in snapshots from operational systems.
For this blog, our “primary” workgroup is using Athena engine version 3. Data producer setup: In this section, we present the steps to set up the data producer. Register the S3 path storing the table using Lake Formation: we register the full S3 path in Lake Formation, starting from the Lake Formation console.
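As a hedged alternative to the console steps, the same registration can be done with boto3; the ARN below is a placeholder:

    import boto3

    lf = boto3.client("lakeformation")
    lf.register_resource(
        ResourceArn="arn:aws:s3:::example-producer-bucket/tables/",
        UseServiceLinkedRole=True,
    )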
Introduction: Apache Iceberg has recently grown in popularity because it adds data warehouse-like capabilities to your data lake, making it easier to analyze all your data, structured and unstructured. Problem with too many snapshots: Every time a write operation occurs on an Iceberg table, a new snapshot is created.
With built-in features like time travel, schema evolution, and streamlined data discovery, Iceberg empowers data teams to enhance data lake management while upholding data integrity. Learn more about the next generation of Cloudera Data Platform for Private Cloud.
Extending checkpoint intervals allows Apache Flink to prioritize processing throughput over frequent state snapshots, thereby improving efficiency and performance. You can find more details about recent releases from the Apache Flink blog and release notes: Amazon Managed Service for Apache Flink 1.19 release notes Apache Flink 1.19.0
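A minimal PyFlink sketch of extending the checkpoint interval; the five-minute value is illustrative, not a recommendation:

    from pyflink.datastream import StreamExecutionEnvironment

    env = StreamExecutionEnvironment.get_execution_environment()
    # Checkpoint every 5 minutes so the job spends less time snapshotting state.
    env.enable_checkpointing(300000)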
Therefore, it is critical for organizations to embrace a low-latency, scalable, and reliable data streaming infrastructure to deliver real-time business applications and better customer experiences. The application can receive events from an input Kinesis data stream and route the resulting stream to an output data stream.
We have delivered the performance and reliability of the data warehouse with the flexibility and scale of a data lake with our data service engines and the Hive metastore. Applying the Iceberg table format to all the organization’s data in the data lake makes it more performant and usable at scale.
In this blog, we will walk through how we can apply existing enterprise data to better understand and estimate Scope 1 carbon footprint using Amazon Simple Storage Service (Amazon S3) and Amazon Athena, a serverless interactive analytics service that makes it easy to analyze data using standard SQL.
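A hedged sketch of what such a query might look like, submitted through boto3: multiply activity data (fuel consumed) by an emission factor. The table and column names, database, and output location are all hypothetical:

    import boto3

    athena = boto3.client("athena")
    athena.start_query_execution(
        QueryString="""
            SELECT site_id,
                   SUM(fuel_litres * emission_factor_kg_co2e_per_litre) AS scope1_kg_co2e
            FROM fuel_usage u
            JOIN emission_factors f ON u.fuel_type = f.fuel_type
            GROUP BY site_id
        """,
        QueryExecutionContext={"Database": "sustainability_db"},
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
    )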
Improve performance and overall manageability of Iceberg tables using the new table maintenance capabilities, such as expiring old snapshots and removing their metadata, and compaction to combine small files for more efficient data processing.
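A hedged compaction sketch using Iceberg's rewrite_data_files Spark procedure (snapshot expiration was shown earlier); catalog and table names are hypothetical:

    # Reuses the Iceberg-enabled SparkSession from the first sketch.
    # Combine many small files into fewer, larger ones for faster scans.
    spark.sql("CALL glue_catalog.system.rewrite_data_files(table => 'db.ice_table')")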
Building data lakes from continuously changing transactional data of databases and keeping data lakes up to date is a complex task and can be an operational challenge. You can then apply transformations and store data in Delta format for managing inserts, updates, and deletes. For Type, choose Spark.
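A hedged sketch of applying a CDC batch to a Delta table with MERGE; it assumes a Delta-enabled SparkSession (delta-spark on the classpath), and the table, view, and op-flag column are hypothetical:

    spark.sql("""
        MERGE INTO customers AS t
        USING cdc_batch AS s
        ON t.id = s.id
        WHEN MATCHED AND s.op = 'D' THEN DELETE
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)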
A Better Way Forward: Cloudera’s Open Data Lakehouse. Cloudera offers a solution to these challenges with its open data lakehouse, which combines the flexibility and scalability of data lake storage with data warehouse functionality to unify and simplify the management of cyber log data.
Apache Iceberg's snapshot and time-travel features can help analysts and auditors easily look back in time and analyze the data with the simplicity of SQL, for use cases such as financial regulation and reproducibility for MLOps. The post 5 Reasons to Use Apache Iceberg on Cloudera Data Platform (CDP) appeared first on Cloudera Blog.
We show how to perform extract, load, and transform (ELT), an integration process focused on getting the raw data from a data lake into a staging layer to perform the modeling. We discuss implementing dimensions and facts within Amazon Redshift. Solution overview: The following diagram illustrates the solution architecture.
Today, many customers build data quality validation pipelines using AWS Glue Data Quality's Data Quality Definition Language (DQDL) because, with static rules, dynamic rules, and anomaly detection capability, it's fairly straightforward. One of its key features is the ability to manage data using branches.
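A hedged sketch of a DQDL ruleset mixing static and dynamic rules, created through boto3; the ruleset, database, and table names are hypothetical:

    import boto3

    glue = boto3.client("glue")
    glue.create_data_quality_ruleset(
        Name="orders-ruleset",
        # Static rules plus a dynamic rule comparing against the last 3 runs.
        Ruleset='Rules = [ RowCount > 0, IsComplete "order_id", RowCount > avg(last(3)) ]',
        TargetTable={"DatabaseName": "sales_db", "TableName": "orders"},
    )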
Organizations across all industries have complex data processing requirements for their analytical use cases across different analytics systems, such as data lakes on AWS, data warehouses (Amazon Redshift), search (Amazon OpenSearch Service), NoSQL (Amazon DynamoDB), machine learning (Amazon SageMaker), and more.
It combines the flexibility and scalability of data lake storage with the data analytics, data governance, and data management functionality of the data warehouse. Table Cleanup: As tables grow, they often accumulate unused data files, manifest files, and snapshots that aren’t needed anymore.
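A hedged cleanup sketch using Iceberg's remove_orphan_files Spark procedure, which deletes files no longer referenced by any snapshot; catalog and table names are hypothetical:

    # Reuses the Iceberg-enabled SparkSession from the first sketch.
    spark.sql("CALL glue_catalog.system.remove_orphan_files(table => 'db.ice_table')")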
Data lakes were originally designed to store large volumes of raw, unstructured, or semi-structured data at a low cost, primarily serving big data and analytics use cases. Enabling automatic compaction on Iceberg tables reduces metadata overhead on your Iceberg tables and improves query performance.
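A hedged sketch of turning on AWS Glue's automatic compaction optimizer for an Iceberg table via boto3; the account ID, names, and role ARN are placeholders, and the configuration shape is an assumption based on the Glue CreateTableOptimizer action:

    import boto3

    glue = boto3.client("glue")
    glue.create_table_optimizer(
        CatalogId="111122223333",
        DatabaseName="db",
        TableName="ice_table",
        Type="compaction",
        TableOptimizerConfiguration={
            "roleArn": "arn:aws:iam::111122223333:role/GlueOptimizerRole",
            "enabled": True,
        },
    )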
Cloudera’s open data lakehouse, powered by Apache Iceberg, solves the real-world big data challenges mentioned above by providing a unified, curated, shareable, and interoperable data lake that is accessible by a wide array of Iceberg-compatible compute engines and tools. You will see the two carrier records in the table.