This is part two of a three-part series where we show how to build a data lake on AWS using a modern data architecture. This post shows how to load data from a legacy database (SQL Server) into a transactional data lake (Apache Iceberg) using AWS Glue.
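As a rough sketch of that pattern (not the post's actual code), the following Glue PySpark job reads a SQL Server table over JDBC and writes it to an Iceberg table; the host, credentials, table, and catalog names are hypothetical placeholders.

```python
import sys

from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job bootstrap; assumes the job was launched with Iceberg
# support enabled (e.g., the --datalake-formats iceberg job parameter).
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Read the legacy table over JDBC. In practice, pull credentials from
# AWS Secrets Manager or a Glue connection rather than hardcoding them.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://legacy-host:1433;databaseName=sales")
    .option("dbtable", "dbo.orders")
    .option("user", "etl_user")
    .option("password", "***")
    .load()
)

# Create or replace an Iceberg table registered in the Glue Data Catalog.
df.writeTo("glue_catalog.sales.orders").createOrReplace()
```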
You can use Amazon Redshift to analyze structured and semi-structured data and seamlessly query data lakes and operational databases, using AWS-designed hardware and automated machine learning (ML)-based tuning to deliver top-tier price performance at scale, right out of the box.
As organizations increasingly adopt cloud-based data lakes and warehouses, the demand for streamlined, efficient data transformation tools has grown. Using Athena and the dbt adapter, you can transform raw data in Amazon S3 into well-structured tables suitable for analytics.
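As a hedged illustration of that workflow, the sketch below triggers a hypothetical dbt model through dbt's Python entry point; it assumes dbt-core 1.5+ with the dbt-athena-community adapter installed and an Athena profile already configured.

```python
from dbt.cli.main import dbtRunner

# "stg_orders" is a hypothetical model that SELECTs from raw files in S3
# and materializes them as a well-structured table in the Glue Data Catalog.
runner = dbtRunner()
result = runner.invoke(["run", "--select", "stg_orders"])
print("success:", result.success)
```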
RapidMiner is a visual enterprise data science platform that includes data extraction, data mining, deep learning, artificial intelligence and machine learning (AI/ML), and predictive analytics. It can support AI/ML processes with data preparation, model validation, results visualization, and model optimization.
An organization's data is copied for many reasons, including ingesting datasets into data warehouses, creating performance-optimized copies, and building BI extracts for analysis. Read this whitepaper to learn why organizations frequently end up with unnecessary data copies.
While there is a lot of discussion about the merits of data warehouses, not enough discussion centers around data lakes. We talked about enterprise data warehouses in the past, so let's contrast them with data lakes. Both data warehouses and data lakes are used to store big data.
They opted for Snowflake, a cloud-native data platform ideal for SQL-based analysis. The team landed the data in a data lake implemented with cloud storage buckets and then loaded it into Snowflake, enabling fast access and smooth integration with analytical tools.
Amazon Redshift enables you to efficiently query and retrieve structured and semi-structured data from open-format files in an Amazon S3 data lake without having to load the data into Amazon Redshift tables. Amazon Redshift extends SQL capabilities to your data lake, enabling you to run analytical queries.
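For illustration, here is a minimal boto3 sketch of such a query, joining a local warehouse table with an external (Spectrum) table over S3; the workgroup, schemas, and table names are hypothetical, and the external schema is assumed to already point at the data lake through the Glue Data Catalog.

```python
import boto3

client = boto3.client("redshift-data")
resp = client.execute_statement(
    WorkgroupName="analytics-wg",  # or ClusterIdentifier=... for provisioned
    Database="dev",
    Sql="""
        SELECT w.customer_id, SUM(s.amount) AS lake_spend
        FROM warehouse.customers w
        JOIN spectrum_schema.sales s  -- external schema over the S3 data lake
          ON w.customer_id = s.customer_id
        GROUP BY w.customer_id;
    """,
)
# The call is asynchronous; poll describe_statement with this id for results.
print(resp["Id"])
```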
Today, Amazon Redshift is used by customers across all industries for a variety of use cases, including data warehouse migration and modernization, near real-time analytics, self-service analytics, data lake analytics, machine learning (ML), and data monetization. We have launched new RA3.large instances.
A modern data architecture enables companies to ingest virtually any type of data through automated pipelines into a data lake, which provides highly durable and cost-effective object storage at petabyte or exabyte scale.
Cloudinary is a cloud-based media management platform that provides a comprehensive set of tools and services for managing, optimizing, and delivering images, videos, and other media assets on websites and mobile applications.
Over the years, this customer-centric approach has led to the introduction of groundbreaking features such as zero-ETL, data sharing, streaming ingestion, data lake integration, Amazon Redshift ML, Amazon Q generative SQL, and transactional data lake capabilities.
Many organizations operate data lakes spanning multiple cloud data stores. In these cases, you may want an integrated query layer to seamlessly run analytical queries across these diverse cloud stores and streamline your data analytics processes. The AWS Glue Data Catalog holds the metadata for Amazon S3 and GCS data.
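As a sketch of such an integrated query layer, a single Athena statement can join a table backed by S3 with one backed by GCS, provided both are registered in the Data Catalog; the databases, tables, and result bucket below are hypothetical.

```python
import boto3

athena = boto3.client("athena")
resp = athena.start_query_execution(
    QueryString="""
        SELECT a.order_id, b.clickstream_ts
        FROM s3_db.orders a
        JOIN gcs_db.clicks b ON a.order_id = b.order_id
        LIMIT 100;
    """,
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(resp["QueryExecutionId"])  # poll get_query_execution for completion
```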
Iceberg has become very popular for its support for ACID transactions in data lakes and features like schema and partition evolution, time travel, and rollback. AWS Glue 3.0 and later supports the Apache Iceberg framework for data lakes. The following diagram illustrates the solution architecture.
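For reference, a typical SparkSession configuration for Iceberg backed by the Glue Data Catalog looks roughly like the following; the warehouse path and table name are placeholders.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog",
            "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl",
            "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://my-bucket/warehouse/")
    .getOrCreate()
)

# Time travel then becomes plain SQL, e.g. querying an earlier snapshot:
spark.sql("SELECT * FROM glue_catalog.db.tbl VERSION AS OF 1234567890").show()
```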
Amazon Redshift enables you to directly access data stored in Amazon Simple Storage Service (Amazon S3) using SQL queries and join data across your data warehouse and data lake. With Amazon Redshift, you can query the data in your S3 data lake using a central AWS Glue metastore from your Redshift data warehouse.
Unlocking the true value of data often gets impeded by siloed information. Traditional data management, wherein each business unit ingests raw data in separate data lakes or warehouses, hinders visibility and cross-functional analysis. Amazon DataZone natively supports data sharing for Amazon Redshift data assets.
In the current industry landscape, data lakes have become a cornerstone of modern data architecture, serving as repositories for vast amounts of structured and unstructured data. Maintaining data consistency and integrity across distributed data lakes is crucial for decision-making and analytics.
We often see requests from customers who have started their data journey by building data lakes on Microsoft Azure and now want to extend access to that data to AWS services. In such scenarios, data engineers face challenges in connecting to and extracting data from storage containers on Microsoft Azure.
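One common workaround, sketched below under stated assumptions (the hadoop-azure ABFS connector on the Spark classpath; placeholder account, container, and bucket names), is to read the Azure container directly from a Spark or Glue job and land the data in S3.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Authenticate to the Azure storage account via the Hadoop ABFS connector.
# In practice, fetch the key from AWS Secrets Manager rather than inlining it.
spark.conf.set(
    "fs.azure.account.key.myaccount.dfs.core.windows.net",
    "<storage-account-key>",
)

# Copy Parquet data from the Azure container into the S3 data lake.
df = spark.read.parquet(
    "abfss://mycontainer@myaccount.dfs.core.windows.net/raw/events/")
df.write.mode("overwrite").parquet("s3://my-data-lake/raw/events/")
```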
Since the deluge of big data over a decade ago, many organizations have learned to build applications to process and analyze petabytes of data. Data lakes have served as a central repository to store structured and unstructured data at any scale and in various formats.
These include architectural optimizations to reduce memory usage and query times, with more efficient batch processing to deliver better throughput, faster bulk writes, and accelerated concurrent writes during data replication. The release also delivers enhanced developer-centric features focused on the development of AI applications.
In our previous post Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes, we discussed how you can implement solutions to improve operational efficiencies of your Amazon Simple Storage Service (Amazon S3) data lake that is using the Apache Iceberg open table format and running on the Amazon EMR big data platform.
For many organizations, this centralized data store follows a data lake architecture. Although data lakes provide a centralized repository, making sense of this data and extracting valuable insights can be challenging.
To address this requirement, Redshift Serverless launched the artificial intelligence (AI)-driven scaling and optimization feature, which scales compute based not only on query queuing but also on data volume and query complexity. The slider offers options such as Optimized for cost, which prioritizes cost savings.
With data becoming the driving force behind many industries today, having a modern data architecture is pivotal for organizations to be successful. In this post, we describe Orca's journey building a transactional data lake using Amazon Simple Storage Service (Amazon S3), Apache Iceberg, and AWS Analytics.
The Salesforce Trust Intelligence Platform (TIP) log platform team is responsible for data pipeline and data lake infrastructure, providing log ingestion, normalization, persistence, search, and detection capability to ensure Salesforce is safe from threat actors.
Beyond breaking down silos, modern data architectures need to provide interfaces that make it easy for users to consume data using tools fit for their jobs. Data must be able to freely move to and from data warehouses, data lakes, and data marts, and interfaces must make it easy for users to consume that data.
A modern data architecture is an evolutionary architecture pattern designed to integrate a data lake, data warehouse, and purpose-built stores with a unified governance model. Moreover, the framework should consume compute resources as efficiently as possible relative to the size of the operational tables.
It expands beyond tools and data architecture and views the data organization from the perspective of its processes and workflows. The DataKitchen Platform is a “process hub” that masters and optimizes those processes. Cloud computing has made it much easier to integrate data sets, but that's only the beginning.
When you build your transactional data lake using Apache Iceberg to solve your functional use cases, you also need to focus on operational use cases for your S3 data lake to optimize the production environment.
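A typical operational task of that kind is compacting small files; the minimal sketch below uses Iceberg's built-in rewrite_data_files procedure, assuming a Spark session already configured for the Glue Iceberg catalog (catalog and table names are placeholders).

```python
from pyspark.sql import SparkSession

# Assumes Iceberg extensions and the glue_catalog settings are already in
# place, e.g., via Glue job parameters or the SparkSession config shown earlier.
spark = SparkSession.builder.getOrCreate()

# Rewrite small files toward a 128 MB target file size.
spark.sql("""
    CALL glue_catalog.system.rewrite_data_files(
        table => 'db.orders',
        options => map('target-file-size-bytes', '134217728')
    )
""")
```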
Open table formats are emerging in the rapidly evolving domain of big data management, fundamentally altering the landscape of data storage and analysis. By providing a standardized framework for data representation, open table formats break down data silos, enhance data quality, and accelerate analytics at scale.
Iceberg offers distinct advantages through its metadata layer over Parquet, such as improved data management, performance optimization, and integration with various query engines. Unlike direct Amazon S3 access, Iceberg supports these operations on petabyte-scale data lakes without requiring complex custom code.
You can analyze data or build applications from an Amazon Simple Storage Service (Amazon S3) data lake and 30 data sources, including on-premises data sources or other cloud systems, using SQL or Python. Let's discuss some of the cost-based optimization techniques that contributed to improved query performance.
The AWS Glue Data Catalog now enhances managed table optimization of Apache Iceberg tables by automatically removing data files that are no longer needed. Iceberg creates a new version called a snapshot for every change to the data in the table. As more table changes are made, more data files are created.
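Conceptually, the cleanup this automates resembles Iceberg's snapshot-expiration procedure; here is a hedged sketch with placeholder names, assuming an Iceberg-enabled Spark session.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Iceberg catalog configs assumed set

# Inspect the snapshot history that accumulates as the table changes.
spark.sql(
    "SELECT snapshot_id, committed_at FROM glue_catalog.db.orders.snapshots"
).show()

# Expire old snapshots so their now-unreferenced data files can be removed.
spark.sql("""
    CALL glue_catalog.system.expire_snapshots(
        table => 'db.orders',
        older_than => TIMESTAMP '2024-01-01 00:00:00',
        retain_last => 5
    )
""")
```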
With this new functionality, customers can create up-to-date replicas of their data from applications such as Salesforce, ServiceNow, and Zendesk in an Amazon SageMaker Lakehouse and Amazon Redshift. SageMaker Lakehouse gives you the flexibility to access and query your data in-place with all Apache Iceberg-compatible tools and engines.
Although Jira Cloud provides reporting capability, loading this data into a data lake will facilitate enrichment with other business data, as well as support the use of business intelligence (BI) tools and artificial intelligence (AI) and machine learning (ML) applications. Search for the Jira Cloud connector.
Organizations have chosen to build data lakes on top of Amazon Simple Storage Service (Amazon S3) for many years. A data lake is the most popular choice for organizations to store all their organizational data generated by different teams, across business domains, in all different formats, and even over history.
Figure 3 shows an example processing architecture with data flowing in from internal and external sources. Each data source is updated on its own schedule, for example, daily, weekly, or monthly. The data scientists and analysts have what they need to build analytics for the user. The new Recipes run, and BOOM!
For the first time, we’re consolidating data to create real-time dashboards for revenue forecasting, resource optimization, and labor utilization. We pulled these people together, and defined use cases we could all agree were the best to demonstrate our new data capability. How is the new platform helping?
Whether you are new to Apache Iceberg on AWS or already running production workloads on AWS, this comprehensive technical guide offers detailed guidance, from foundational concepts to advanced optimizations, to help you build your transactional data lake with Apache Iceberg on AWS.
For container terminal operators, data-driven decision-making and efficient data sharing are vital to optimizing operations and boosting supply chain efficiency.
In modern data architectures, Apache Iceberg has emerged as a popular table format for data lakes, offering key features including ACID transactions and concurrent write support. We will also cover the pattern with automatic compaction through AWS Glue Data Catalog table optimization.
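A hedged sketch of enabling that automatic compaction through boto3 follows; the account ID, role ARN, and table names are placeholders, and the parameters should be verified against the current Glue API.

```python
import boto3

glue = boto3.client("glue")

# Ask the Data Catalog to compact this Iceberg table automatically; the role
# must be able to read and write the table's underlying S3 data.
glue.create_table_optimizer(
    CatalogId="123456789012",
    DatabaseName="db",
    TableName="orders",
    Type="compaction",
    TableOptimizerConfiguration={
        "roleArn": "arn:aws:iam::123456789012:role/GlueOptimizerRole",
        "enabled": True,
    },
)
```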
In the era of big data, data lakes have emerged as a cornerstone for storing vast amounts of raw data in its native format. They support structured, semi-structured, and unstructured data, offering a flexible and scalable environment for data ingestion from multiple sources.
Analytics remained one of the key focus areas this year, with significant updates and innovations aimed at helping businesses harness their data more efficiently and accelerate insights. From enhancing data lakes to empowering AI-driven analytics, AWS unveiled new tools and services that are set to shape the future of data and analytics.
More than ever, businesses need to mitigate these risks while discovering the best approach to optimizing generative AI (GenAI) with data management. The data preparation process should take place alongside a long-term strategy built around GenAI use cases, such as content creation, digital assistants, and code generation.