Data Lake, Events and Optimization

Modernize your legacy databases with AWS data lakes, Part 2: Build a data lake using AWS DMS data on Apache Iceberg

AWS Big Data

OCTOBER 30, 2024

This is part two of a three-part series where we show how to build a data lake on AWS using a modern data architecture. This post shows how to load data from a legacy database (SQL Server) into a transactional data lake (Apache Iceberg) using AWS Glue.

Data Lake

Data Lake Data Processing Optimization Machine Learning

From data lakes to insights: dbt adapter for Amazon Athena now supported in dbt Cloud

AWS Big Data

NOVEMBER 22, 2024

The need for streamlined data transformations: As organizations increasingly adopt cloud-based data lakes and warehouses, the demand for efficient data transformation tools has grown. Using Athena and the dbt adapter, you can transform raw data in Amazon S3 into well-structured tables suitable for analytics.

Data Lake

Data Lake Data Warehouse Cost-Benefit Data Transformation

Drug Launch Case Study: Amazing Efficiency Using DataOps

DataKitchen

DECEMBER 9, 2024

They opted for Snowflake, a cloud-native data platform ideal for SQL-based analysis. The team landed the data in a data lake implemented with cloud storage buckets and then loaded it into Snowflake, enabling fast access and smooth integrations with analytical tools.

Data Quality

Data Quality Data Lake Testing Statistics


Choosing an open table format for your transactional data lake on AWS

AWS Big Data

JUNE 9, 2023

A modern data architecture enables companies to ingest virtually any type of data through automated pipelines into a data lake, which provides highly durable and cost-effective object storage at petabyte or exabyte scale.

Data Lake

Data Lake Metadata Statistics Optimization

How Cloudinary transformed their petabyte scale streaming data lake with Apache Iceberg and AWS Analytics

AWS Big Data

JUNE 10, 2024

Enterprises and organizations across the globe want to harness the power of data to make better decisions by putting data at the center of every decision-making process. However, throughout history, data services have held dominion over their customers’ data.

Data Lake

Data Lake Metadata Snapshot Analytics

Apache Iceberg optimization: Solving the small files problem in Amazon EMR

AWS Big Data

OCTOBER 3, 2023

In our previous post, Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes, we discussed how you can implement solutions to improve operational efficiencies of your Amazon Simple Storage Service (Amazon S3) data lake that uses the Apache Iceberg open table format and runs on the Amazon EMR big data platform.

Optimization

Optimization Snapshot Data Lake Metadata

Simplify data integration with AWS Glue and zero-ETL to Amazon SageMaker Lakehouse

AWS Big Data

DECEMBER 4, 2024

With this new functionality, customers can create up-to-date replicas of their data from applications such as Salesforce, ServiceNow, and Zendesk in an Amazon SageMaker Lakehouse and Amazon Redshift. SageMaker Lakehouse gives you the flexibility to access and query your data in-place with all Apache Iceberg compatible tools and engines.

Data Integration

Data Integration Data Lake Statistics Data-driven

How Salesforce optimized their detection and response platform using AWS managed services

AWS Big Data

APRIL 18, 2024

The Salesforce Trust Intelligence Platform (TIP) log platform team is responsible for data pipeline and data lake infrastructure, providing log ingestion, normalization, persistence, search, and detection capability to ensure Salesforce is safe from threat actors. Headquartered in San Francisco, Salesforce, Inc.

Optimization

Optimization Data Lake Management Key Performance Indicator

Enrich your serverless data lake with Amazon Bedrock

AWS Big Data

SEPTEMBER 26, 2024

For many organizations, this centralized data store follows a data lake architecture. Although data lakes provide a centralized repository, making sense of this data and extracting valuable insights can be challenging. Amazon S3 emits an object created event and matches an EventBridge rule.

Data Lake

Data Lake Cost-Benefit Unstructured Data Modeling

Build a high-performance quant research platform with Apache Iceberg

AWS Big Data

JANUARY 9, 2025

Iceberg offers distinct advantages through its metadata layer over Parquet, such as improved data management, performance optimization, and integration with various query engines. Unlike direct Amazon S3 access, Iceberg supports these operations on petabyte-scale data lakes without requiring complex custom code.

Metadata

Metadata Snapshot Cost-Benefit Optimization

Use open table format libraries on AWS Glue 5.0 for Apache Spark

AWS Big Data

DECEMBER 4, 2024

Open table formats are emerging in the rapidly evolving domain of big data management, fundamentally altering the landscape of data storage and analysis. By providing a standardized framework for data representation, open table formats break down data silos, enhance data quality, and accelerate analytics at scale.

Snapshot

Snapshot Metadata Data Lake Optimization

Synchronize data lakes with CDC-based UPSERT using open table format, AWS Glue, and Amazon MSK

AWS Big Data

JULY 31, 2024

In the current industry landscape, data lakes have become a cornerstone of modern data architecture, serving as repositories for vast amounts of structured and unstructured data. Maintaining data consistency and integrity across distributed data lakes is crucial for decision-making and analytics.

Data Lake

Data Lake Marketing Data Processing Management

Simplify operational data processing in data lakes using AWS Glue and Apache Hudi

AWS Big Data

SEPTEMBER 13, 2023

A modern data architecture is an evolutionary architecture pattern designed to integrate a data lake, data warehouse, and purpose-built stores with a unified governance model. Moreover, the framework should consume compute resources as optimally as possible per the size of the operational tables.

Data Lake

Data Lake Data Processing Metadata Snapshot

Implementing a Pharma Data Mesh using DataOps

DataKitchen

AUGUST 19, 2021

Figure 3 shows an example processing architecture with data flowing in from internal and external sources. Each data source is updated on its own schedule, for example, daily, weekly or monthly. The data scientists and analysts have what they need to build analytics for the user. The new Recipes run, and BOOM! Conclusion.

Data Warehouse

Data Warehouse Data Lake Manufacturing Testing

Simplify data ingestion from Amazon S3 to Amazon Redshift using auto-copy

AWS Big Data

OCTOBER 30, 2024

Overview of the auto-copy feature in Amazon Redshift: The auto-copy feature in Amazon Redshift uses the S3 event integration to load data from Amazon S3 into Amazon Redshift automatically, set up with a simple SQL command. You can enable Amazon Redshift auto-copy by creating auto-copy jobs.

Data Warehouse

Data Warehouse Sales Data Lake Recreation/Entertainment

Building end-to-end data lineage for one-time and complex queries using Amazon Athena, Amazon Redshift, Amazon Neptune and dbt

AWS Big Data

DECEMBER 12, 2024

Amazon Athena offers serverless, flexible SQL analytics for one-time queries, enabling direct querying of Amazon Simple Storage Service (Amazon S3) data for rapid, cost-effective instant analysis. In this post, we use dbt for data modeling on both Amazon Athena and Amazon Redshift.

Snapshot

Snapshot Recreation/Entertainment Experimentation Data Lake

Deriving Value from Data Lakes with AI

Sisense

DECEMBER 23, 2019

AI and ML are the only ways to derive value from massive data lakes, cloud-native data warehouses, and other huge stores of information. Your platform should be able to effortlessly connect to numerous large datasets: cloud-native data warehouses and data lakes with millions or billions of rows.

Data Lake

Data Lake Machine Learning Data Warehouse Digital Transformation

Perform upserts in a data lake using Amazon Athena and Apache Iceberg

AWS Big Data

APRIL 27, 2023

Amazon Athena supports the MERGE command on Apache Iceberg tables, which allows you to perform inserts, updates, and deletes in your data lake at scale using familiar SQL statements that are compliant with ACID (Atomic, Consistent, Isolated, Durable).

Data Lake

Data Lake Snapshot Optimization Data Transformation
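
As a rough illustration of the MERGE pattern described in the Athena and Iceberg entry above, the following Python sketch composes a MERGE statement covering the insert, update, and delete cases. Everything named here is an assumption for illustration: the helper `build_merge_sql`, the tables `customers_iceberg` and `customers_changes`, and the `op` change-flag column are hypothetical and do not come from the post. The helper only builds a string and does not run anything against Athena.

```python
def build_merge_sql(target, source, key, columns, delete_flag="op"):
    """Compose a MERGE statement for an upsert with deletes.

    All identifiers are interpolated directly, so they must come from
    trusted configuration, never from user input.
    `delete_flag` is a hypothetical column whose value 'D' marks deletions.
    """
    set_clause = ", ".join(f"{col} = s.{col}" for col in columns)
    all_columns = [key, *columns]
    insert_columns = ", ".join(all_columns)
    insert_values = ", ".join(f"s.{col}" for col in all_columns)
    return "\n".join([
        f"MERGE INTO {target} AS t",
        f"USING {source} AS s",
        f"ON t.{key} = s.{key}",
        # Matched rows flagged as deleted are removed from the target.
        f"WHEN MATCHED AND s.{delete_flag} = 'D' THEN DELETE",
        # Remaining matched rows are updated in place.
        f"WHEN MATCHED THEN UPDATE SET {set_clause}",
        # Unmatched rows are inserted unless they are delete markers.
        f"WHEN NOT MATCHED AND s.{delete_flag} <> 'D' "
        f"THEN INSERT ({insert_columns}) VALUES ({insert_values})",
    ])


# Hypothetical table and column names, for illustration only.
merge_sql = build_merge_sql(
    "customers_iceberg", "customers_changes", "id", ["name", "city"]
)
print(merge_sql)
```

The WHEN clauses are evaluated in order, so the delete case is listed before the general update case.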

How EUROGATE established a data mesh architecture using Amazon DataZone

AWS Big Data

JANUARY 15, 2025

For container terminal operators, data-driven decision-making and efficient data sharing are vital to optimizing operations and boosting supply chain efficiency. She has been heavily involved in the Data Sharing Project, focusing on the implementation of Amazon DataZone into EUROGATE's IT environment.

IoT

IoT Machine Learning Metadata Data-driven

Nexthink scales to trillions of events per day with Amazon MSK

AWS Big Data

MARCH 29, 2024

Real-time data streaming and event processing present scalability and management challenges. AWS offers a broad selection of managed real-time data streaming services to effortlessly run these workloads at any scale. This allows IT to evolve from reactive problem-solving to proactive optimization.

Data-driven

Data-driven Cost-Benefit Metrics Management

The Internet of Things: Real-Time Data and Analytics Enable Business Innovation

David Menninger's Analyst Perspectives

MAY 20, 2019

This innovation means that virtually any appropriately designed device can generate and transmit data about its operations, which can facilitate monitoring and a range of automatic functions.

Internet of Things

Internet of Things IoT Analytics Optimization

Deploy and Optimize Your Snowflake Environment Faster With Accelerators

CDW Research Hub

JULY 18, 2022

One modern data platform solution that provides simplicity and flexibility to grow is Snowflake’s data cloud and platform. These Snowflake accelerators reduce the time to analytics for your users at all levels so you can make data-driven decisions faster. Security Data Lake. Optimizing Snowflake functionality.

Optimization

Optimization Data Lake Data Warehouse Manufacturing

Data Lakes: What Are They and Who Needs Them?

Jet Global

JULY 2, 2019

To address the flood of data and the needs of enterprise businesses to store, sort, and analyze that data, a new storage solution has evolved: the data lake. What’s in a Data Lake? Data warehouses do a great job of standardizing data from disparate sources for analysis. Taking a Dip.

Data Lake

Data Lake Data Warehouse Big Data Machine Learning

Automate replication of relational sources into a transactional data lake with Apache Iceberg and AWS Glue

AWS Big Data

FEBRUARY 14, 2023

Organizations have chosen to build data lakes on top of Amazon Simple Storage Service (Amazon S3) for many years. A data lake is the most popular choice for organizations to store all their organizational data generated by different teams, across business domains, from all different formats, and even over history.

Data Lake

Data Lake Statistics Data Architecture Finance

Enable business users to analyze large datasets in your data lake with Amazon QuickSight

AWS Big Data

JUNE 23, 2023

Imperva Cloud WAF protects hundreds of thousands of websites and blocks billions of security events every day. Events and many other security data types are stored in Imperva’s Threat Research Multi-Region data lake. Imperva harnesses data to improve their business outcomes.

Data Lake

Data Lake Dashboards Cost-Benefit Data Warehouse

Monitoring Apache Iceberg metadata layer using AWS Lambda, AWS Glue, and AWS CloudWatch

AWS Big Data

JULY 29, 2024

In the era of big data, data lakes have emerged as a cornerstone for storing vast amounts of raw data in its native format. They support structured, semi-structured, and unstructured data, offering a flexible and scalable environment for data ingestion from multiple sources.

Metadata

Metadata Snapshot Data Lake Metrics

Top analytics announcements of AWS re:Invent 2024

AWS Big Data

FEBRUARY 26, 2025

This premier event showcased groundbreaking advancements, keynotes from AWS leadership, hands-on technical sessions, and exciting product launches. Analytics remained one of the key focus areas this year, with significant updates and innovations aimed at helping businesses harness their data more efficiently and accelerate insights.

Analytics

Analytics Data Lake Metadata Data Warehouse

Ingest, transform, and deliver events published by Amazon Security Lake to Amazon OpenSearch Service

AWS Big Data

JUNE 19, 2023

Security Lake automatically centralizes security data from cloud, on-premises, and custom sources into a purpose-built data lake stored in your account. With Security Lake, you can get a more complete understanding of your security data across your entire organization.

Publishing

Publishing Dashboards Visualization Management

Petabyte-scale log analytics with Amazon S3, Amazon OpenSearch Service, and Amazon OpenSearch Ingestion

AWS Big Data

MARCH 7, 2024

At the same time, they need to optimize operational costs to unlock the value of this data for timely insights and do so with a consistent performance. With this massive data growth, data proliferation across your data stores, data warehouse, and data lakes can become equally challenging.

Data Lake

Data Lake Analytics Dashboards Metrics

Why Modernizing the First Mile of the Data Pipeline Can Accelerate all Analytics

Cloudera

AUGUST 13, 2021

Every enterprise is trying to collect and analyze data to get better insights into their business. Whether it is consuming log files, sensor metrics, and other unstructured data, most enterprises manage and deliver data to the data lake and leverage various applications like ETL tools, search engines, and databases for analysis.

Analytics

Analytics Data Lake Unstructured Data Enterprise

Driving Business Value and ROI from a Hybrid Cloud Data Lake

Alation

FEBRUARY 20, 2020

For many enterprises, a hybrid cloud data lake is no longer a trend, but becoming reality. Not only can resources be quickly provisioned and optimized for different workloads and processing needs, but it can be done cost effectively. earthquake, flood, or fire), where the data collected does not need to be as tightly controlled.

Data Lake

Data Lake ROI Metadata Cost-Benefit

Data in 2021: Ventana Research Market Agenda

David Menninger's Analyst Perspectives

FEBRUARY 26, 2021

Ventana Research recently announced its 2021 Market Agenda for data, continuing the guidance we have offered for nearly two decades to help organizations derive optimal value and improve business outcomes.

Marketing

Marketing Optimization Data Lake IT

Build an end-to-end serverless streaming pipeline with Apache Kafka on Amazon MSK using Python

AWS Big Data

MARCH 21, 2024

You want real-time access to this data so you can monitor performance in real time, and detect and mitigate issues quickly. You also need longer-term access to this data for machine learning (ML) models to run predictive maintenance assessments, find optimization opportunities, and forecast demand.

Data Lake

Data Lake Management Modeling Optimization
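
When an AWS Lambda function consumes from Amazon MSK, as a serverless pipeline like the one in the entry above might, each Kafka record value arrives base64-encoded. The sketch below shows one way to decode those values in Python. The function name `decode_kafka_records`, the topic name, and the sample payload are assumptions made for illustration and are not taken from the post.

```python
import base64
import json


def decode_kafka_records(event):
    """Return the UTF-8 decoded value of every record in a Lambda MSK event."""
    decoded = []
    # Lambda groups records under "records", keyed by "<topic>-<partition>".
    for batch in event.get("records", {}).values():
        for record in batch:
            decoded.append(base64.b64decode(record["value"]).decode("utf-8"))
    return decoded


# Hypothetical payload and event, built here only to exercise the function.
payload = {"turbine_id": 7, "rpm": 14.2}
sample_event = {
    "eventSource": "aws:kafka",
    "records": {
        "sensor-topic-0": [
            {
                "topic": "sensor-topic",
                "partition": 0,
                "offset": 15,
                "value": base64.b64encode(
                    json.dumps(payload).encode("utf-8")
                ).decode("ascii"),
            }
        ]
    },
}

messages = decode_kafka_records(sample_event)
print(messages)
```

Decoding to text first and parsing JSON as a separate step keeps the function usable for payloads that are not JSON.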

Analyze Elastic IP usage history using Amazon Athena and AWS CloudTrail

AWS Big Data

MAY 15, 2024

You can use this solution regularly as part of your cost-optimization efforts to safely remove unused EIPs to reduce your costs. Check out the GitHub repo to regularly run this analysis as part of your cost-optimization strategy to identify and release inactive EIPs to reduce costs.

Snapshot

Snapshot Optimization Data Lake Reporting

Detect, mask, and redact PII data using AWS Glue before loading into Amazon OpenSearch Service

AWS Big Data

JANUARY 12, 2024

Leadership and development teams can spend more time optimizing current solutions and even experimenting with new use cases, rather than maintaining the current infrastructure. With the ability to move fast on AWS, you also need to be responsible with the data you’re receiving and processing as you continue to scale.

Data Lake

Data Lake Cost-Benefit Visualization Structured Data

Porsche Carrera Cup Brasil gets real-time data boost

CIO Business Intelligence

MAY 21, 2024

In the annual Porsche Carrera Cup Brasil, data is essential to keep drivers safe and sustain optimal performance of race cars. Until recently, getting at and analyzing that essential data was a laborious affair that could take hours, and only once the race was over.

Broadcasting

Broadcasting Recreation/Entertainment Manufacturing Data Lake

Modernize your data observability with Amazon OpenSearch Service zero-ETL integration with Amazon S3

AWS Big Data

JUNE 5, 2024

The integration is a new way for customers to query operational logs in Amazon S3 and Amazon S3-based data lakes without needing to switch between tools to analyze operational data. You can now analyze data in cloud object stores and simultaneously use the operational analytics and visualizations of OpenSearch Service.

Data Lake

Data Lake Dashboards Cost-Benefit Visualization

Use Amazon Athena with Spark SQL for your open-source transactional table formats

AWS Big Data

JANUARY 24, 2024

AWS-powered data lakes, supported by the unmatched availability of Amazon Simple Storage Service (Amazon S3), can handle the scale, agility, and flexibility required to combine different data and analytics approaches. The timestamp clause lets us travel back without altering current data.

Snapshot

Snapshot Data Lake Metadata Optimization

DataOps Observability: Taming the Chaos (Part 3)

DataKitchen

NOVEMBER 18, 2022

As he thinks through the various journeys that data take in his company, Jason sees that his dashboard idea would require extracting or testing for events along the way. So, the only way for a data journey to truly observe what’s happening is to get his tools and pipelines to auto-report events. Data and tool tests.

Testing

Testing Statistics Measurement Metrics

Enhance query performance using AWS Glue Data Catalog column-level statistics

AWS Big Data

NOVEMBER 22, 2023

Today, we’re making available a new capability of AWS Glue Data Catalog that allows generating column-level statistics for AWS Glue tables. These statistics are now integrated with the cost-based optimizers (CBO) of Amazon Athena and Amazon Redshift Spectrum, resulting in improved query performance and potential cost savings.

Statistics

Statistics Data Lake Optimization Data-driven

Enforce fine-grained access control on Open Table Formats via Amazon EMR integrated with AWS Lake Formation

AWS Big Data

JANUARY 17, 2024

With Amazon EMR 6.15, we launched AWS Lake Formation based fine-grained access controls (FGAC) on Open Table Formats (OTFs), including Apache Hudi, Apache Iceberg, and Delta Lake. Many large enterprise companies seek to use their transactional data lake to gain insights and improve decision-making.

Data Lake

Data Lake Snapshot Big Data Data-driven

7 key Microsoft Azure analytics services (plus one extra)

CIO Business Intelligence

JUNE 29, 2022

If you can’t make sense of your business data, you’re effectively flying blind. Insights hidden in your data are essential for optimizing business operations, finetuning your customer experience, and developing new products — or new lines of business, like predictive maintenance. Azure Data Lake Analytics.

Data Lake

Data Lake Analytics Data Warehouse Machine Learning

Introducing AWS Glue serverless Spark UI for better monitoring and troubleshooting

AWS Big Data

NOVEMBER 20, 2023

Customers often use Apache Spark Web UI, a popular debugging tool that is part of open source Apache Spark, to help fix problems and optimize job performance. Prerequisites: Complete the following prerequisite steps: Enable Spark UI event logs for your job runs. For more information, refer to Data Catalog and crawlers in AWS Glue.

Visualization

Visualization Optimization Data Lake Management

Use AWS Glue to streamline SFTP data processing

AWS Big Data

AUGUST 13, 2024

With AWS Glue, you can discover and connect to hundreds of diverse data sources and manage your data in a centralized data catalog. It enables you to visually create, run, and monitor extract, transform, and load (ETL) pipelines to load data into your data lakes. Choose the Job details tab.

Data Processing

Data Processing Visualization Data Lake
