Data Lake, Events and Testing - Data Leaders Brief

Drug Launch Case Study: Amazing Efficiency Using DataOps

DataKitchen

DECEMBER 9, 2024

data engineers delivered over 100 lines of code and 1.5 data quality tests every day to support a cast of analysts and customers. They opted for Snowflake, a cloud-native data platform ideal for SQL-based analysis. It is necessary to have more than a data lake and a database.

Data Quality

Data Quality Data Lake Testing Statistics

From data lakes to insights: dbt adapter for Amazon Athena now supported in dbt Cloud

AWS Big Data

NOVEMBER 22, 2024

The need for streamlined data transformations: As organizations increasingly adopt cloud-based data lakes and warehouses, the demand for efficient data transformation tools has grown. Using Athena and the dbt adapter, you can transform raw data in Amazon S3 into well-structured tables suitable for analytics.

Data Lake

Data Lake Data Warehouse Cost-Benefit Data Transformation

Introducing simplified interaction with the Airflow REST API in Amazon MWAA

AWS Big Data

OCTOBER 23, 2024

The Airflow REST API facilitates a wide range of use cases, from centralizing and automating administrative tasks to building event-driven, data-aware data pipelines. Event-driven architectures – The enhanced API facilitates seamless integration with external events, enabling the triggering of Airflow DAGs based on these events.
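
As a rough illustration of the event-driven triggering described above, the sketch below builds (but does not send) the HTTP request that would start a DAG run when an external event arrives. It assumes the Airflow 2 stable REST API path `/api/v1/dags/{dag_id}/dagRuns`; the base URL, the DAG name `process_upload`, and the event payload are hypothetical, and authentication and the Amazon MWAA-specific invocation mechanism are deliberately left out.

```python
import json
from urllib.request import Request


def build_trigger_request(base_url: str, dag_id: str, event: dict) -> Request:
    """Build a POST request asking Airflow to start a run of `dag_id`.

    The external event is passed to the DAG through the run's `conf` field.
    Nothing is sent over the network here.
    """
    url = f"{base_url.rstrip('/')}/api/v1/dags/{dag_id}/dagRuns"
    body = json.dumps({"conf": event}).encode("utf-8")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical endpoint, DAG name, and event payload.
req = build_trigger_request(
    "https://airflow.example.com",
    "process_upload",
    {"bucket": "raw", "key": "a.csv"},
)
print(req.get_method(), req.full_url)
```

Passing the event through `conf` keeps the DAG itself generic: the same DAG can react to different events without code changes.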

Interactive

Interactive Testing Data-driven Data Lake

Run Apache XTable in AWS Lambda for background conversion of open table formats

AWS Big Data

NOVEMBER 26, 2024

Initially, data warehouses were the go-to solution for structured data and analytical workloads but were limited by proprietary storage formats and their inability to handle unstructured data. Eventually, transactional data lakes emerged to add transactional consistency and performance of a data warehouse to the data lake.

Metadata

Metadata Data Lake Snapshot Data Warehouse

How Volkswagen streamlined access to data across multiple data lakes using Amazon DataZone – Part 1

AWS Big Data

JULY 18, 2024

Over the years, organizations have invested in creating purpose-built, cloud-based data lakes that are siloed from one another. A major challenge is enabling cross-organization discovery and access to data across these multiple data lakes, each built on different technology stacks.

Data Lake

Data Lake Publishing Metadata Data-driven

Using AWS AppSync and AWS Lake Formation to access a secure data lake through a GraphQL API

AWS Big Data

OCTOBER 9, 2023

Data lakes have been gaining popularity for storing vast amounts of data from diverse sources in a scalable and cost-effective way. As the number of data consumers grows, data lake administrators often need to implement fine-grained access controls for different user profiles.

Data Lake

Data Lake Testing Big Data Management

How Cloudinary transformed their petabyte scale streaming data lake with Apache Iceberg and AWS Analytics

AWS Big Data

JUNE 10, 2024

Enterprises and organizations across the globe want to harness the power of data to make better decisions by putting data at the center of every decision-making process. However, throughout history, data services have held dominion over their customers’ data.

Data Lake

Data Lake Metadata Snapshot Analytics

Monitor data pipelines in a serverless data lake

AWS Big Data

AUGUST 9, 2023

Combining a data lake with a serverless paradigm brings significant cost and performance benefits. By monitoring application logs, you can gain insights into job execution and troubleshoot issues promptly to ensure the overall health and reliability of data pipelines.

Data Lake

Data Lake Metrics Cost-Benefit Testing

Implementing a Pharma Data Mesh using DataOps

DataKitchen

AUGUST 19, 2021

Figure 3 shows an example processing architecture with data flowing in from internal and external sources. Each data source is updated on its own schedule, for example, daily, weekly or monthly. The data scientists and analysts have what they need to build analytics for the user. The new Recipes run, and BOOM! Conclusion.

Data Warehouse

Data Warehouse Data Lake Manufacturing Testing

Amazon Q data integration adds DataFrame support and in-prompt context-aware job creation

AWS Big Data

DECEMBER 20, 2024

The DataFrame code generation now extends beyond AWS Glue DynamicFrame to support a broader range of data processing scenarios. The TICKIT dataset records sales activities on the fictional TICKIT website, where users can purchase and sell tickets online for different types of events such as sports games, shows, and concerts.

Data Integration

Data Integration Visualization Data Processing Big Data

Enrich your serverless data lake with Amazon Bedrock

AWS Big Data

SEPTEMBER 26, 2024

For many organizations, this centralized data store follows a data lake architecture. Although data lakes provide a centralized repository, making sense of this data and extracting valuable insights can be challenging. Amazon S3 emits an object created event and matches an EventBridge rule.
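
To make the last step concrete, here is a minimal sketch of how an S3 "Object Created" event might be matched against an EventBridge-style rule pattern. The matcher is a simplified stand-in that handles only exact-value matching on nested fields, not the full EventBridge pattern language, and the bucket name and object key are hypothetical.

```python
def matches(pattern: dict, event: dict) -> bool:
    """Return True if every field in `pattern` is satisfied by `event`.

    Leaf values in the pattern are lists of allowed values; nested dicts
    are matched recursively. Fields absent from the pattern are ignored.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Rule pattern: object-created events from one (hypothetical) bucket.
PATTERN = {
    "source": ["aws.s3"],
    "detail-type": ["Object Created"],
    "detail": {"bucket": {"name": ["example-raw-bucket"]}},
}

# Trimmed-down event; real events carry more fields.
event = {
    "source": "aws.s3",
    "detail-type": "Object Created",
    "detail": {
        "bucket": {"name": "example-raw-bucket"},
        "object": {"key": "docs/report.pdf"},
    },
}

other = {**event, "detail": {"bucket": {"name": "another-bucket"}}}

print(matches(PATTERN, event), matches(PATTERN, other))
```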

Data Lake

Data Lake Cost-Benefit Unstructured Data Modeling

Choosing an open table format for your transactional data lake on AWS

AWS Big Data

JUNE 9, 2023

A modern data architecture enables companies to ingest virtually any type of data through automated pipelines into a data lake, which provides highly durable and cost-effective object storage at petabyte or exabyte scale.

Data Lake

Data Lake Metadata Statistics Optimization

Simplify operational data processing in data lakes using AWS Glue and Apache Hudi

AWS Big Data

SEPTEMBER 13, 2023

A modern data architecture is an evolutionary architecture pattern designed to integrate a data lake, data warehouse, and purpose-built stores with a unified governance model. The company wanted the ability to continue processing operational data in the secondary Region in the rare event of primary Region failure.

Data Lake

Data Lake Data Processing Metadata Snapshot

Build a high-performance quant research platform with Apache Iceberg

AWS Big Data

JANUARY 9, 2025

These features allow efficient data corrections, gap-filling in time series, and historical data updates without disrupting ongoing analyses or compromising data integrity. Unlike direct Amazon S3 access, Iceberg supports these operations on petabyte-scale data lakes without requiring complex custom code.

Metadata

Metadata Snapshot Cost-Benefit Optimization

Build a real-time GDPR-aligned Apache Iceberg data lake

AWS Big Data

FEBRUARY 24, 2023

Data lakes are a popular choice for today’s organizations to store data about their business activities. As a best practice in data lake design, data should be immutable once stored. A data lake built on AWS uses Amazon Simple Storage Service (Amazon S3) as its primary storage environment.

Data Lake

Data Lake Metadata Testing Data Warehouse

Implement slowly changing dimensions in a data lake using AWS Glue and Delta

AWS Big Data

MARCH 28, 2023

As organizations across the globe are modernizing their data platforms with data lakes on Amazon Simple Storage Service (Amazon S3), handling SCDs in data lakes can be challenging.

Data Lake

Data Lake Testing Snapshot Big Data

Build an ETL process for Amazon Redshift using Amazon S3 Event Notifications and AWS Step Functions

AWS Big Data

AUGUST 31, 2023

It also helps you to securely access your data in operational databases, data lakes or third-party datasets with minimal movement or copying. Amazon S3 Event Notifications is an Amazon S3 feature that you can enable in order to receive notifications when specific events occur in your S3 bucket.

Data Warehouse

Data Warehouse Data-driven Testing Business Intelligence

Apache Iceberg optimization: Solving the small files problem in Amazon EMR

AWS Big Data

OCTOBER 3, 2023

In our previous post, Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes, we discussed how you can implement solutions to improve operational efficiencies of your Amazon Simple Storage Service (Amazon S3) data lake that is using the Apache Iceberg open table format and running on the Amazon EMR big data platform.

Optimization

Optimization Snapshot Data Lake Metadata

Perform upserts in a data lake using Amazon Athena and Apache Iceberg

AWS Big Data

APRIL 27, 2023

Amazon Athena supports the MERGE command on Apache Iceberg tables, which allows you to perform inserts, updates, and deletes in your data lake at scale using familiar SQL statements that are compliant with ACID (Atomic, Consistent, Isolated, Durable).
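
The insert, update, and delete behavior of a MERGE can be illustrated with a small in-memory sketch. This is not Athena or Iceberg code: the `merge` function, the `id`/`value`/`op` field names, and the sample rows are all hypothetical, and the sketch makes no attempt to model ACID guarantees.

```python
def merge(target: dict, source: list) -> dict:
    """Apply MERGE-like semantics to a keyed table held in memory.

    target maps id -> value; each source row is a dict with an "id",
    an optional "value", and an optional "op" of "delete".
    """
    result = dict(target)  # leave the original table untouched
    for row in source:
        key = row["id"]
        if key in result:
            if row.get("op") == "delete":
                del result[key]             # matched and flagged: delete
            else:
                result[key] = row["value"]  # matched: update
        elif row.get("op") != "delete":
            result[key] = row["value"]      # not matched: insert
    return result


table = {1: "a", 2: "b"}
changes = [
    {"id": 1, "value": "a2"},   # update existing row
    {"id": 2, "op": "delete"},  # delete existing row
    {"id": 3, "value": "c"},    # insert new row
]
merged = merge(table, changes)
print(merged)
```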

Data Lake

Data Lake Snapshot Optimization Data Transformation

DataOps Observability: Taming the Chaos (Part 3)

DataKitchen

NOVEMBER 18, 2022

As he thinks through the various journeys that data take in his company, Jason sees that his dashboard idea would require extracting or testing for events along the way. So, the only way for a data journey to truly observe what’s happening is to get his tools and pipelines to auto-report events.

Testing

Testing Statistics Measurement Metrics

Ingest, transform, and deliver events published by Amazon Security Lake to Amazon OpenSearch Service

AWS Big Data

JUNE 19, 2023

Security Lake automatically centralizes security data from cloud, on-premises, and custom sources into a purpose-built data lake stored in your account. With Security Lake, you can get a more complete understanding of your security data across your entire organization.

Publishing

Publishing Dashboards Visualization Management

A Day in the Life of a DataOps Engineer

DataKitchen

OCTOBER 11, 2021

The data engineer then emails the BI Team, who refreshes a Tableau dashboard. Figure 1: Example data pipeline with manual processes. There are no automated tests, so errors frequently pass through the pipeline. Figure 2: Example data pipeline with DataOps automation. Adding Tests to Reduce Stress.

Testing

Testing Metadata Dashboards Statistics

Top analytics announcements of AWS re:Invent 2024

AWS Big Data

FEBRUARY 26, 2025

This premier event showcased groundbreaking advancements, keynotes from AWS leadership, hands-on technical sessions, and exciting product launches. Analytics remained one of the key focus areas this year, with significant updates and innovations aimed at helping businesses harness their data more efficiently and accelerate insights.

Analytics

Analytics Data Lake Metadata Data Warehouse

Backtesting index rebalancing arbitrage with Amazon EMR and Apache Iceberg

AWS Big Data

JULY 3, 2023

Terminology: Let’s first discuss some of the terminology used in this post: Research data lake on Amazon S3 – A data lake is a large, centralized repository that allows you to manage all your structured and unstructured data at any scale.

Snapshot

Snapshot Data Lake Testing Strategy

Simplify data lake access control for your enterprise users with trusted identity propagation in AWS IAM Identity Center, AWS Lake Formation, and Amazon S3 Access Grants

AWS Big Data

MAY 29, 2024

To enable access control with Lake Formation for Redshift tables, we use data sharing in Lake Formation. Data access requests by the specific users are logged to CloudTrail. Later in this post, we also briefly touch upon using CloudTrail Lake to query the data access events.

Data Lake

Data Lake Enterprise Management Business Intelligence

Porsche Carrera Cup Brasil gets real-time data boost

CIO Business Intelligence

MAY 21, 2024

Unlike many other events, which consist of multiple racing teams and manufacturers, Porsche Carrera Cup Brasil provides and maintains all 75 cars used in the race. “If I don’t do predictive maintenance, if I have to do corrective maintenance at events, a lot of money is wasted.”

Broadcasting

Broadcasting Recreation/Entertainment Manufacturing Data Lake

Automated data governance with AWS Glue Data Quality, sensitive data detection, and AWS Lake Formation

AWS Big Data

OCTOBER 10, 2023

Data governance is the process of ensuring the integrity, availability, usability, and security of an organization’s data. Due to the volume, velocity, and variety of data being ingested in data lakes, it can get challenging to develop and maintain policies and procedures to ensure data governance at scale for your data lake.

Data Quality

Data Quality Data Governance Data Lake Testing

Why the Data Journey Manifesto?

DataKitchen

JUNE 12, 2023

We had been talking about “Agile Analytic Operations,” “DevOps for Data Teams,” and “Lean Manufacturing For Data,” but the concept was hard to get across and communicate. I spent much time de-categorizing DataOps: we are not discussing ETL, Data Lake, or Data Science.

Testing

Testing Dashboards Data Lake Data Science

Data’s dark secret: Why poor quality cripples AI and growth

CIO Business Intelligence

APRIL 8, 2025

Comparison of modern data architectures: Architecture Definition Strengths Weaknesses Best used when Data warehouse Centralized, structured and curated data repository. Inflexible schema, poor for unstructured or real-time data. Data lake Raw storage for all types of structured and unstructured data.

Data Quality

Data Quality Data-driven Key Performance Indicator Metadata

Integrate custom applications with AWS Lake Formation – Part 1

AWS Big Data

NOVEMBER 19, 2024

Due to these limitations, the application should not be used for arbitrary tests. In this post, we provide instructions on how to deploy a sample API application integrated with Lake Formation that implements the solution architecture. We also show how to test the function with Lambda tests.

Data Lake

Data Lake Metadata Testing Data Processing

Implement alerts in Amazon OpenSearch Service with PagerDuty

AWS Big Data

JUNE 8, 2023

By combining automated alerting through a third-party service like PagerDuty, an incident management platform, with the robust and powerful alerting plugin provided by OpenSearch Service, businesses can proactively manage and respond to critical events. Leave the defaults and choose Next.

Data Lake

Data Lake Dashboards Metrics Testing

Automate schema evolution at scale with Apache Hudi in AWS Glue

AWS Big Data

FEBRUARY 7, 2023

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning (ML), and application development. Apache Hudi supports ACID transactions and CRUD operations on a data lake. You don’t alter queries separately in the data lake.

Data Lake

Data Lake Testing Big Data Structured Data

Modernize your data observability with Amazon OpenSearch Service zero-ETL integration with Amazon S3

AWS Big Data

JUNE 5, 2024

The integration is a new way for customers to query operational logs in Amazon S3 and Amazon S3-based data lakes without needing to switch between tools to analyze operational data. Amazon S3 is an object storage service offering industry-leading scalability, data availability, security, and performance.

Data Lake

Data Lake Dashboards Cost-Benefit Visualization

Moving Enterprise Data From Anywhere to Any System Made Easy

Cloudera

JUNE 2, 2022

Over the last decade, we have often heard about the proliferation of data creating sources (mobile applications, laptops, sensors, enterprise apps) in heterogeneous environments (cloud, on-prem, edge) resulting in the exponential growth of data being created.

Enterprise

Enterprise Data Lake Data Collection Data-driven

How GamesKraft uses Amazon Redshift data sharing to support growing analytics workloads

AWS Big Data

NOVEMBER 13, 2023

Amazon Redshift is a fully managed data warehousing service that offers both provisioned and serverless options, making it more efficient to run and scale analytics without having to manage your data warehouse. Additionally, data is extracted from vendor APIs that include data related to product, marketing, and customer experience.

Data Warehouse

Data Warehouse Analytics Data Lake Data Science

Query AWS Glue Data Catalog views using Amazon Athena and Amazon Redshift

AWS Big Data

AUGUST 8, 2024

Today’s data lakes are expanding across lines of business operating in diverse landscapes and using various engines to process and analyze data. Traditionally, SQL views have been used to define and share filtered data sets that meet the requirements of these lines of business for easier consumption.

Data Lake

Data Lake Sales Marketing Big Data

Better, faster decisions: Why businesses thrive on real-time data

CIO Business Intelligence

SEPTEMBER 8, 2022

To access data in real time, and ensure that it provides actionable insights for all stakeholders, organizations should invest in the foundational components that enable more efficient, scalable, and secure data collection, processing, and analysis. Nichol (@PeterBNichol), Chief Technology Officer at OROCA Innovations.

Cost-Benefit

Cost-Benefit Internet of Things Data-driven Data Lake

How smava makes loans transparent and affordable using Amazon Redshift Serverless

AWS Big Data

DECEMBER 21, 2023

To bring their customers the best deals and user experience, smava follows the modern data architecture principles with a data lake as a scalable, durable data store and purpose-built data stores for analytical processing and data consumption.

Data Lake

Data Lake Data Warehouse Data-driven B2B

Access Amazon Athena in your applications using the WebSocket API

AWS Big Data

MARCH 2, 2023

Many organizations are building data lakes to store and analyze large volumes of structured, semi-structured, and unstructured data. In addition, many teams are moving towards a data mesh architecture, which requires them to expose their data sets as easily consumable data products.

Data Lake

Data Lake Testing Interactive Unstructured Data

Real-Time Data at Verizon: It’s as Critical as Air

CIO Business Intelligence

MAY 12, 2022

The biggest challenge for any big enterprise is organizing the data that has organically grown across the organization over the last several years. Everyone has data lakes, data ponds – whatever you want to call them. How do you get your arms around all the data you have? This isn’t unique to Verizon.

Testing

Testing Advertising Data Lake Marketing

Moving Enterprise Data From Anywhere to Any System Made Easy

CIO Business Intelligence

JULY 13, 2022

Over the last decade, we have often heard about the proliferation of data creating sources (mobile applications, laptops, sensors, enterprise apps) in heterogeneous environments (cloud, on-prem, edge) resulting in the exponential growth of data being created.

Enterprise

Enterprise Data Lake Data Collection Data-driven

Governing data in relational databases using Amazon DataZone

AWS Big Data

MAY 7, 2024

It also makes it easier for engineers, data scientists, product managers, analysts, and business users to access data throughout an organization to discover, use, and collaborate to derive data-driven insights. Note that a managed data asset is an asset for which Amazon DataZone can manage permissions.

Metadata

Metadata Data Lake Data Processing Data-driven

Enhance query performance using AWS Glue Data Catalog column-level statistics

AWS Big Data

NOVEMBER 22, 2023

Data lakes are designed for storing vast amounts of raw, unstructured, or semi-structured data at a low cost, and organizations share those datasets across multiple departments and teams. The queries on these large datasets read vast amounts of data and can perform complex join operations on multiple datasets.

Statistics

Statistics Data Lake Optimization Data-driven

Simplify access management with Amazon Redshift and AWS Lake Formation for users in an External Identity Provider

AWS Big Data

FEBRUARY 15, 2024

You might be modernizing your data architecture using Amazon Redshift to enable access to your data lake and data in your data warehouse, and are looking for a centralized and scalable way to define and manage the data access based on IdP identities. For IAM role, choose a Lake Formation user-defined role.

Management

Management Data Lake Sales Data Warehouse
