Everyone talks about data quality, as they should. Our research shows that improving the quality of information is the top benefit of data preparation activities. Data quality efforts are focused on clean data. Yes, clean data is important, but so is bad data.
Data engineers delivered over 100 lines of code and 1.5 data quality tests every day to support a cast of analysts and customers. They opted for Snowflake, a cloud-native data platform ideal for SQL-based analysis. It is necessary to have more than a data lake and a database.
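To make that cadence concrete, here is a minimal sketch of one such automated check run against Snowflake with the snowflake-connector-python library; the connection parameters and the ORDERS table are hypothetical placeholders, not details from the original team.

```python
# A minimal data quality test against Snowflake (all names are placeholders).
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",      # hypothetical account identifier
    user="dq_runner",
    password="change-me",
    warehouse="ANALYTICS_WH",
    database="ANALYTICS",
    schema="PUBLIC",
)

def assert_no_nulls(table: str, column: str) -> None:
    """Fail loudly if a required column contains NULLs."""
    cur = conn.cursor()
    try:
        cur.execute(f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL")
        null_count = cur.fetchone()[0]
        if null_count > 0:
            raise ValueError(f"{table}.{column}: {null_count} NULL rows")
    finally:
        cur.close()

assert_no_nulls("ORDERS", "ORDER_ID")
```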
A DataOps Approach to Data Quality: The Growing Complexity of Data Quality. Data quality issues are widespread, affecting organizations across industries, from manufacturing to healthcare and financial services. 73% of data practitioners do not trust their data (IDC).
The need for streamlined data transformations: As organizations increasingly adopt cloud-based data lakes and warehouses, the demand for efficient data transformation tools has grown. Using Athena and the dbt adapter, you can transform raw data in Amazon S3 into well-structured tables suitable for analytics.
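As a hedged illustration of that workflow, the sketch below invokes a dbt run programmatically with dbt-core's dbtRunner, assuming a dbt project already configured with the dbt-athena adapter (profile pointing at an Athena workgroup and an S3 staging location); the model and project names are made up.

```python
# Run a single dbt model that turns raw S3 data into a curated Athena table.
from dbt.cli.main import dbtRunner, dbtRunnerResult

runner = dbtRunner()

# Materialize the hypothetical model 'curated_orders' from the project dir.
result: dbtRunnerResult = runner.invoke(
    ["run", "--select", "curated_orders", "--project-dir", "./analytics"]
)

if not result.success:
    raise RuntimeError(f"dbt run failed: {result.exception}")
```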
As technology and business leaders, your strategic initiatives, from AI-powered decision-making to predictive insights and personalized experiences, are all fueled by data. Yet, despite growing investments in advanced analytics and AI, organizations continue to grapple with a persistent and often underestimated challenge: poor data quality.
Amazon SageMaker Lakehouse, now generally available, unifies all your data across Amazon Simple Storage Service (Amazon S3) data lakes and Amazon Redshift data warehouses, helping you build powerful analytics and AI/ML applications on a single copy of data. Having confidence in your data is key.
Data lakes are centralized repositories that can store all structured and unstructured data at any desired scale. The power of the data lake lies in the fact that it often is a cost-effective way to store data.
Today, customers are embarking on data modernization programs by migrating on-premises data warehouses and data lakes to the AWS Cloud to take advantage of the scale and advanced analytical capabilities of the cloud. Some customers build custom in-house data parity frameworks to validate data during migration.
Data governance is the process of ensuring the integrity, availability, usability, and security of an organization’s data. Due to the volume, velocity, and variety of data being ingested in data lakes, it can get challenging to develop and maintain policies and procedures to ensure data governance at scale for your data lake.
Unlocking the true value of data often gets impeded by siloed information. Traditional data management—wherein each business unit ingests raw data in separate data lakes or warehouses—hinders visibility and cross-functional analysis. Amazon DataZone natively supports data sharing for Amazon Redshift data assets.
We are excited to announce the General Availability of AWS Glue Data Quality. Our journey started by working backward from our customers who create, manage, and operate data lakes and data warehouses for analytics and machine learning. It takes days for data engineers to identify and implement data quality rules.
They establish data quality rules to ensure the extracted data is of high quality for accurate business decisions. These rules assess the data based on fixed criteria reflecting current business states. We are excited to talk about how to use dynamic rules, a new capability of AWS Glue Data Quality.
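A sketch of what that can look like with boto3, assuming a hypothetical sales_db.orders Glue table and IAM role; the last rule in the DQDL string is dynamic, comparing today's row count against the average of the last three runs rather than a fixed threshold:

```python
import boto3

glue = boto3.client("glue")

# DQDL ruleset: two fixed rules plus one dynamic rule using last(3).
ruleset = """
Rules = [
    IsComplete "order_id",
    ColumnValues "status" in ["NEW", "SHIPPED", "RETURNED"],
    RowCount > avg(last(3))
]
"""

glue.create_data_quality_ruleset(
    Name="orders-dq-dynamic",
    Ruleset=ruleset,
    TargetTable={"DatabaseName": "sales_db", "TableName": "orders"},
)

# Kick off a managed evaluation run against the catalog table.
run = glue.start_data_quality_ruleset_evaluation_run(
    DataSource={"GlueTable": {"DatabaseName": "sales_db", "TableName": "orders"}},
    Role="arn:aws:iam::123456789012:role/GlueDataQualityRole",  # hypothetical
    RulesetNames=["orders-dq-dynamic"],
)
print(run["RunId"])
```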
With data becoming the driving force behind many industries today, having a modern data architecture is pivotal for organizations to be successful. In this post, we describe Orca’s journey building a transactional data lake using Amazon Simple Storage Service (Amazon S3), Apache Iceberg, and AWS Analytics.
In recent years, data lakes have become a mainstream architecture, and data quality validation is a critical factor to improve the reusability and consistency of the data. In this post, we provide benchmark results of running increasingly complex data quality rulesets over a predefined test dataset.
Talend is a data integration and management software company that offers applications for cloud computing, big data integration, application integration, data quality and master data management.
Under that focus, Informatica's conference emphasized capabilities across six areas (all strong areas for Informatica): data integration, data management, data quality & governance, Master Data Management (MDM), data cataloging, and data security.
AWS Glue Data Quality allows you to measure and monitor the quality of data in your data repositories. It’s important for business users to be able to see quality scores and metrics to make confident business decisions and debug data quality issues. An AWS Glue crawler crawls the results.
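One hedged way to pull those scores for a dashboard, assuming evaluation runs like the one above have already produced results for a hypothetical sales_db.orders table:

```python
import boto3

glue = boto3.client("glue")

# List recent data quality results for the table, then fetch each score.
listing = glue.list_data_quality_results(
    Filter={"DataSource": {"GlueTable": {"DatabaseName": "sales_db",
                                         "TableName": "orders"}}}
)
for item in listing["Results"]:
    result = glue.get_data_quality_result(ResultId=item["ResultId"])
    # Overall score plus per-rule PASS/FAIL outcomes.
    print(result["Score"], [r["Result"] for r in result["RuleResults"]])
```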
Ensuring that data is available, secure, correct, and fit for purpose is neither simple nor cheap. Companies end up paying outside consultants enormous fees while still having to suffer the effects of poor data quality and lengthy cycle times. For example, DataOps can be used to automate data integration.
With this new functionality, customers can create up-to-date replicas of their data from applications such as Salesforce, ServiceNow, and Zendesk in an Amazon SageMaker Lakehouse and Amazon Redshift. SageMaker Lakehouse gives you the flexibility to access and query your data in place with all Apache Iceberg-compatible tools and engines.
To address the flood of data and the needs of enterprise businesses to store, sort, and analyze that data, a new storage solution has evolved: the data lake. What’s in a Data Lake? Data warehouses do a great job of standardizing data from disparate sources for analysis. Taking a Dip.
First-generation – expensive, proprietary enterprise data warehouse and business intelligence platforms maintained by a specialized team drowning in technical debt. Second-generation – gigantic, complex data lake maintained by a specialized team drowning in technical debt.
Making your data lake a “governed data lake” is the game changer. Without governance, organizations put both the security and the protection of their data at risk. A governed data lake contains data that’s accessible, clean, trusted and protected.
Poor-quality data can lead to incorrect insights, bad decisions, and lost opportunities. AWS Glue Data Quality measures and monitors the quality of your dataset. It supports both data quality at rest and data quality in AWS Glue extract, transform, and load (ETL) pipelines.
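For the in-pipeline case, a minimal sketch using the EvaluateDataQuality transform inside a Glue ETL script (it only runs in the Glue job environment; the table and context names are assumptions):

```python
from awsglue.context import GlueContext
from awsgluedq.transforms import EvaluateDataQuality
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the hypothetical catalog table as a DynamicFrame.
orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="orders"
)

ruleset = 'Rules = [ IsComplete "order_id", ColumnValues "amount" > 0 ]'

# Evaluate in-flight data before writing it downstream.
EvaluateDataQuality.apply(
    frame=orders,
    ruleset=ruleset,
    publishing_options={
        "dataQualityEvaluationContext": "orders_etl_check",
        "enableDataQualityCloudWatchMetrics": True,
        "enableDataQualityResultsPublishing": True,
    },
)
```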
You can use AWS Glue to create, run, and monitor data integration and ETL (extract, transform, and load) pipelines and catalog your assets across multiple data stores. Hundreds of thousands of customers use data lakes for analytics and ML to make data-driven business decisions.
However, they do contain effective data management, organization, and integrity capabilities. As a result, users can easily find what they need, and organizations avoid the operational and cost burdens of storing unneeded or duplicate data copies. On the other hand, they don’t support transactions or enforce data quality.
We pulled these people together, and defined use cases we could all agree were the best to demonstrate our new data capability. Once they were identified, we had to determine whether we had the right data. Then we migrated the data to our new data lake, and stood up the new platform.
On the agribusiness side we source, purchase, and process agricultural commodities and offer a diverse portfolio of products including grains, soybean meal, blended feed ingredients, and top-quality oils for the food industry to add value to the commodities our customers desire. The data can also help us enrich our commodity products.
In the era of big data, data lakes have emerged as a cornerstone for storing vast amounts of raw data in its native format. They support structured, semi-structured, and unstructured data, offering a flexible and scalable environment for data ingestion from multiple sources.
cycle_end"', "sagemakedatalakeenvironment_sub_db", ctas_approach=False) A similar approach is used to connect to shared data from Amazon Redshift, which is also shared using Amazon DataZone. The data science and AI teams are able to explore and use new data sources as they become available through Amazon DataZone.
Poor data quality is one of the top barriers faced by organizations aspiring to be more data-driven. Ill-timed business decisions and misinformed business processes, missed revenue opportunities, failed business initiatives and complex data systems can all stem from data quality issues.
In order to help maintain data privacy while validating and standardizing data for use, the IDMC platform offers a Data Quality Accelerator for Crisis Response.
Domain ownership recognizes that the teams generating the data have the deepest understanding of it and are therefore best suited to manage, govern, and share it effectively. This principle ensures that data accountability remains close to the source, fostering higher data quality and relevance.
The core issue plaguing many organizations is the presence of out-of-control databases or data lakes characterized by: Unrestrained Data Changes: Numerous users and tools incessantly alter data, leading to a tumultuous environment. Monitoring freshness, schema changes, volume, and column health is standard.
“All of a sudden, you’re trying to give this data to somebody who’s not a data person,” he says, “and it’s really easy for them to draw erroneous or misleading insights from that data.” As more companies use the cloud and cloud-native development, normalizing data has become more complicated.
These formats, exemplified by Apache Iceberg, Apache Hudi, and Delta Lake, address persistent challenges in traditional data lake structures by offering an advanced combination of flexibility, performance, and governance capabilities. For more information, refer to What are deletion vectors?
In modern data architectures, Apache Iceberg has emerged as a popular table format for data lakes, offering key features including ACID transactions and concurrent write support. Both operations target the same partition based on customer_id, leading to potential conflicts because they’re modifying an overlapping dataset.
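A sketch of that scenario in PySpark, assuming a hypothetical Iceberg table glue.sales.orders registered in a Glue-backed catalog: writer A upserts via MERGE while writer B deletes in the same customer_id range, and the loser of the optimistic-concurrency race retries.

```python
from pyspark.sql import SparkSession

# Catalog settings are abbreviated; a real setup also needs catalog-impl,
# warehouse location, and the Iceberg runtime jars on the classpath.
spark = (SparkSession.builder
         .appName("iceberg-concurrency")
         .config("spark.sql.catalog.glue", "org.apache.iceberg.spark.SparkCatalog")
         .getOrCreate())

# Stage incoming rows as a temp view for the MERGE (hypothetical data).
spark.createDataFrame(
    [(1, 42, "SHIPPED")], "order_id INT, customer_id INT, status STRING"
).createOrReplaceTempView("updates")

# Writer A: upsert into the partition via MERGE.
spark.sql("""
    MERGE INTO glue.sales.orders t
    USING updates u
    ON t.order_id = u.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# Writer B: a concurrent DELETE touching the same customer_id partition can
# fail validation at commit time, so wrap it in a simple retry loop.
for attempt in range(3):
    try:
        spark.sql("DELETE FROM glue.sales.orders WHERE customer_id = 42")
        break
    except Exception:  # e.g. an Iceberg commit/validation conflict
        if attempt == 2:
            raise
```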
The following are the key components of the Bluestone Data Platform: Data mesh architecture – Bluestone adopted a data mesh architecture, a paradigm that distributes data ownership across different business units. This enables data-driven decision-making across the organization.
It’s stored in corporate data warehouses, data lakes, and a myriad of other locations – and while some of it is put to good use, it’s estimated that around 73% of this data remains unexplored. Improving data quality. Unexamined and unused data is often of poor quality. Data augmentation.
To provide a response that includes the enterprise context, each user prompt needs to be augmented with a combination of insights from structured data from the data warehouse and unstructured data from the enterprise data lake. Implement data privacy policies. Implement data quality by data type and source.
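A minimal sketch of that augmentation step; the two retrieval helpers are hypothetical stand-ins for a warehouse query and a document search over the data lake, and the context strings are illustrative only.

```python
def fetch_structured_context(question: str) -> str:
    # Hypothetical: run a SQL query against the warehouse, summarize the rows.
    return "Q3 revenue by region: NA $12M, EMEA $8M (illustrative)"

def fetch_unstructured_context(question: str) -> str:
    # Hypothetical: retrieve relevant document chunks from the data lake.
    return "From the Q3 ops review: EMEA growth slowed (illustrative)"

def build_augmented_prompt(question: str) -> str:
    # Combine both context types with the user question into one prompt.
    return (
        "Answer using only the context below.\n"
        f"Structured context:\n{fetch_structured_context(question)}\n"
        f"Unstructured context:\n{fetch_unstructured_context(question)}\n"
        f"Question: {question}"
    )

print(build_augmented_prompt("Why did EMEA revenue dip in Q3?"))
```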
For the past 5 years, BMS has used a custom framework called Enterprise Data Lake Services (EDLS) to create ETL jobs for business users. Data quality check – The data quality module enables you to perform quality checks on a huge amount of data and generate reports that describe and validate the data quality.
One of the core features of AWS Lake Formation is the delegation of permissions on a subset of resources such as databases, tables, and columns in the AWS Glue Data Catalog to data stewards, empowering them to make decisions regarding who should get access to their resources and helping you decentralize the permissions management of your data lakes.
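A short boto3 sketch of such a delegation, with a hypothetical steward role and catalog table; PermissionsWithGrantOption is what lets the steward pass access onward:

```python
import boto3

lf = boto3.client("lakeformation")

lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier":
               "arn:aws:iam::123456789012:role/SalesDataSteward"},  # hypothetical
    Resource={"Table": {"DatabaseName": "sales_db", "Name": "orders"}},
    Permissions=["SELECT", "DESCRIBE"],
    PermissionsWithGrantOption=["SELECT"],  # steward may re-grant SELECT
)
```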
AWS Lake Formation and the AWS Glue Data Catalog form an integral part of a data governance solution for data lakes built on Amazon Simple Storage Service (Amazon S3), with multiple AWS analytics services integrating with them. In 2022, we talked about the enhancements we had done to these services. Well integrated!
As organizations process vast amounts of data, maintaining an accurate historical record is crucial. History management in data systems is fundamental for compliance, business intelligence, data quality, and time-based analysis. He’s passionate about helping customers use Apache Iceberg for their data lakes on AWS.
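As a hedged illustration of history management with Iceberg, assuming an existing Spark session spark with an Iceberg catalog named glue and the hypothetical table glue.sales.orders:

```python
# `spark`: an existing SparkSession with the Iceberg catalog configured.

# Inspect the commit history (one row per snapshot).
spark.sql("SELECT snapshot_id, committed_at, operation "
          "FROM glue.sales.orders.snapshots").show()

# Query the table as of a point in time (time travel).
spark.sql("SELECT * FROM glue.sales.orders "
          "FOR TIMESTAMP AS OF TIMESTAMP '2024-01-01 00:00:00'").show()
```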
One of the bank’s key challenges related to strict cybersecurity requirements is to implement field-level encryption for personally identifiable information (PII), Payment Card Industry (PCI) data, and data that is classified as high privacy risk (HPR). Only users with required permissions are allowed to access data in clear text.
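A generic illustration of field-level encryption (not the bank's actual implementation): a KMS-generated data key encrypts only the high-risk columns before the record lands in the lake; the key alias and field names are hypothetical.

```python
import base64
import boto3
from cryptography.fernet import Fernet

kms = boto3.client("kms")

# Generate a 256-bit data key under a KMS customer master key (hypothetical alias).
key = kms.generate_data_key(KeyId="alias/pii-key", KeySpec="AES_256")
fernet = Fernet(base64.urlsafe_b64encode(key["Plaintext"]))

def encrypt_field(value: str) -> str:
    """Encrypt a single sensitive field; non-sensitive fields stay readable."""
    return fernet.encrypt(value.encode()).decode()

record = {"customer_id": 42, "ssn": "123-45-6789"}
record["ssn"] = encrypt_field(record["ssn"])  # only the PII field is encrypted
```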