Data engineers delivered over 100 lines of code and 1.5 data quality tests every day to support a cast of analysts and customers. They opted for Snowflake, a cloud-native data platform ideal for SQL-based analysis. More than a data lake and a database is needed.
The need for streamlined data transformations: As organizations increasingly adopt cloud-based data lakes and warehouses, the demand for efficient data transformation tools has grown. Using Athena and the dbt adapter, you can transform raw data in Amazon S3 into well-structured tables suitable for analytics.
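A minimal sketch of the kind of transformation this enables, using boto3 to submit an Athena CTAS statement that materializes raw S3 data as a Parquet table (dbt's Athena adapter compiles models into statements of this shape); the database, table, and bucket names are hypothetical:

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Hypothetical CTAS transformation: raw landing data -> curated Parquet table.
ctas = """
CREATE TABLE analytics.orders_clean
WITH (format = 'PARQUET',
      external_location = 's3://example-bucket/curated/orders/') AS
SELECT order_id,
       CAST(order_ts AS timestamp) AS order_ts,
       LOWER(TRIM(status)) AS status
FROM raw.orders
WHERE order_id IS NOT NULL
"""

response = athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)
print("Started query:", response["QueryExecutionId"])
```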
A DataOps Approach to Data Quality: The Growing Complexity of Data Quality. Data quality issues are widespread, affecting organizations across industries, from manufacturing to healthcare and financial services. According to IDC, 73% of data practitioners do not trust their data.
As technology and business leaders, your strategic initiatives, from AI-powered decision-making to predictive insights and personalized experiences, are all fueled by data. Yet, despite growing investments in advanced analytics and AI, organizations continue to grapple with a persistent and often underestimated challenge: poor data quality.
Today, customers are embarking on data modernization programs by migrating on-premises data warehouses and data lakes to the AWS Cloud to take advantage of the scale and advanced analytical capabilities of the cloud. Some customers build custom in-house data parity frameworks to validate data during migration.
Data governance is the process of ensuring the integrity, availability, usability, and security of an organization’s data. Due to the volume, velocity, and variety of data being ingested in data lakes, it can get challenging to develop and maintain policies and procedures to ensure data governance at scale for your data lake.
Ensuring that data is available, secure, correct, and fit for purpose is neither simple nor cheap. Companies end up paying outside consultants enormous fees while still suffering the effects of poor data quality and lengthy cycle times. When a job is automated, there is little advantage to outsourcing.
In recent years, data lakes have become a mainstream architecture, and data quality validation is a critical factor to improve the reusability and consistency of the data. In this post, we provide benchmark results of running increasingly complex data quality rulesets over a predefined test dataset.
AWS Glue Data Quality allows you to measure and monitor the quality of data in your data repositories. It’s important for business users to be able to see quality scores and metrics to make confident business decisions and debug data quality issues. An AWS Glue crawler crawls the results.
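As a rough illustration of the service (not the exact setup from the excerpt), a DQDL ruleset can be attached to a cataloged table and evaluated with boto3; the database, table, ruleset, and role names are all assumptions:

```python
import boto3

glue = boto3.client("glue")

# Hypothetical ruleset written in DQDL, Glue Data Quality's rule language.
ruleset = """Rules = [
    RowCount > 0,
    IsComplete "customer_id",
    Uniqueness "order_id" > 0.99
]"""

glue.create_data_quality_ruleset(
    Name="orders_dq_rules",
    Ruleset=ruleset,
    TargetTable={"DatabaseName": "analytics", "TableName": "orders_clean"},
)

run = glue.start_data_quality_ruleset_evaluation_run(
    DataSource={"GlueTable": {"DatabaseName": "analytics", "TableName": "orders_clean"}},
    Role="arn:aws:iam::123456789012:role/GlueDQRole",  # placeholder role ARN
    RulesetNames=["orders_dq_rules"],
)
print("Evaluation run:", run["RunId"])
```

The resulting scores are what the crawler mentioned above would then surface for business users.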
First-generation – expensive, proprietary enterprise data warehouse and business intelligence platforms maintained by a specialized team drowning in technical debt. Second-generation – gigantic, complex data lake maintained by a specialized team drowning in technical debt.
Your Chance: Want to test an agile business intelligence solution? Business intelligence is moving away from the traditional engineering model: analysis, design, construction, testing, and implementation. Test BI in a small group and deploy the software internally. Finalize testing. Test throughout the lifecycle.
The core issue plaguing many organizations is the presence of out-of-control databases or data lakes characterized by: Unrestrained Data Changes: Numerous users and tools incessantly alter data, leading to a tumultuous environment. Monitoring freshness, schema changes, volume, and column health is standard.
The data engineer then emails the BI Team, who refreshes a Tableau dashboard. Figure 1: Example data pipeline with manual processes. There are no automated tests, so errors frequently pass through the pipeline. Figure 2: Example data pipeline with DataOps automation. Adding Tests to Reduce Stress.
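As a minimal illustration of the automated tests the excerpt calls for (a sketch, not the post's actual implementation), a batch can be gated on a few assertions before anyone refreshes a dashboard; the column names and input path are made up:

```python
import pandas as pd

def check_orders(df: pd.DataFrame) -> list[str]:
    """Return data test failures; an empty list means the batch may proceed."""
    failures = []
    if df.empty:
        failures.append("orders batch is empty")
    if df["order_id"].duplicated().any():
        failures.append("duplicate order_id values")
    if df["amount"].lt(0).any():
        failures.append("negative order amounts")
    return failures

df = pd.read_parquet("s3://example-bucket/staging/orders/")  # hypothetical staging path
problems = check_orders(df)
if problems:
    raise ValueError(f"Pipeline halted, failed tests: {problems}")
```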
For the past 5 years, BMS has used a custom framework called Enterprise Data Lake Services (EDLS) to create ETL jobs for business users. Manually upgrading, testing, and deploying over 5,000 jobs every few quarters was time-consuming, error-prone, costly, and not sustainable.
As he thinks through the various journeys that data takes in his company, Jason sees that his dashboard idea would require extracting or testing for events along the way. So the only way to truly observe what’s happening along a data journey is to get his tools and pipelines to auto-report events. Data and tool tests.
Migrating to Amazon Redshift offers organizations the potential for improved price-performance, enhanced data processing, faster query response times, and better integration with technologies such as machine learning (ML) and artificial intelligence (AI). Additional considerations – Factor in additional tasks beyond schema conversion.
It also makes it easier for engineers, data scientists, product managers, analysts, and business users to access data throughout an organization to discover, use, and collaborate to derive data-driven insights. Note that a managed data asset is an asset for which Amazon DataZone can manage permissions.
A modern data platform entails maintaining data across multiple layers, targeting diverse platform capabilities like high performance, ease of development, cost-effectiveness, and DataOps features such as CI/CD, lineage, and unit testing. AWS Glue – AWS Glue is used to load files into Amazon Redshift through the S3 datalake.
Most innovation platforms make you rip the data out of your existing applications and move it to some other environment (a data warehouse, data lake, data lakehouse, or data cloud) before you can do any innovation.
Real-Time Intelligence, on the other hand, takes that further by supporting data in AWS, Google Cloud Platform, Kafka installations, and on-prem installations. “We introduced the Real-Time Hub,” says Arun Ulagaratchagan, CVP, Azure Data at Microsoft. Morrone notes that in the next race, the company will test Azure Private 5G Core.
It’s common to ingest multiple data sources into Amazon Redshift to perform analytics. Often, each data source will have its own processes of creating and maintaining data, which can lead to data quality challenges within and across sources. Answering questions as simple as “How many unique customers do we have?” can become surprisingly difficult.
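To make that concrete, here is one hedged way to ask the question across two inconsistently maintained sources via the Redshift Data API; the workgroup, database, and schema names are invented:

```python
import boto3

rsd = boto3.client("redshift-data")

# Count customers that are identical after light normalization, exposing
# duplicates introduced by each source's own maintenance process.
sql = """
SELECT COUNT(DISTINCT LOWER(TRIM(email))) AS unique_customers
FROM (
    SELECT email FROM crm.customers
    UNION ALL
    SELECT email FROM billing.customers
) AS combined
"""

resp = rsd.execute_statement(
    WorkgroupName="analytics-wg",  # assumed Redshift Serverless workgroup
    Database="dev",
    Sql=sql,
)
print("Statement id:", resp["Id"])
```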
And where data was available, the ability to access and interpret it proved problematic. Big data can grow too big, too fast. Left unchecked, data lakes became data swamps. Some data lake implementations required expensive ‘cleansing pumps’ to make them navigable again.
The complexities of compliance: In May, the Italian Data Protection Authority highlighted that training the models on which gen AI systems are based always requires a huge amount of data, often obtained by web scraping, a massive and indiscriminate collection carried out on the web.
Many customers need an ACID transaction (atomic, consistent, isolated, durable) data lake that can log change data capture (CDC) from operational data sources. There is also demand for merging real-time data into batch data. The Delta Lake framework provides both capabilities.
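A minimal PySpark sketch of the CDC half, merging a change feed into a Delta table; it assumes a Spark session configured with the Delta Lake package, and the paths, join key, and op column are placeholders:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()  # assumes delta-spark is on the classpath

cdc = spark.read.json("s3://example-bucket/cdc/orders/")  # hypothetical CDC feed

target = DeltaTable.forPath(spark, "s3://example-bucket/delta/orders/")
(
    target.alias("t")
    .merge(cdc.alias("s"), "t.order_id = s.order_id")
    .whenMatchedDelete(condition="s.op = 'D'")  # apply deletes from the feed
    .whenMatchedUpdateAll()                     # apply updates
    .whenNotMatchedInsertAll()                  # apply inserts
    .execute()
)
```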
In Foundry’s 2022 Data & Analytics Study , 88% of IT decision-makers agree that data collection and analysis have the potential to fundamentally change their business models over the next three years. The ability to pivot quickly to address rapidly changing customer or market demands is driving the need for real-time data.
Data has become an invaluable asset for businesses, offering critical insights to drive strategic decision-making and operational optimization. This separation means changes can be tested thoroughly before being deployed to live operations. It helps HEMA centralize all data assets across disparate data stacks into a single catalog.
Unless, of course, the rest of their data also resides in the Google Cloud. In this post we showcase how we used AWS Glue to move siloed digital analytics data, with inconsistent arrival times, to Amazon S3 (our data lake) and our central data warehouse (DWH), Snowflake. It consists of full-day and intraday tables.
With in-place table migration, you can rapidly convert to Iceberg tables, since there is no need to regenerate data files; only metadata is regenerated, and the newly generated metadata then points to the source data files. Data quality using table rollback.
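For illustration, Iceberg exposes both operations as Spark SQL procedures; a sketch assuming an Iceberg-enabled catalog, with the catalog, table name, and snapshot ID as placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes an Iceberg-enabled Spark catalog

# In-place migration: only metadata is rewritten; data files stay where they are.
spark.sql("CALL spark_catalog.system.migrate('db.events')")

# Table rollback for data quality: revert to a known-good snapshot after a bad write.
spark.sql("CALL spark_catalog.system.rollback_to_snapshot('db.events', 123456789)")
```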
As organizations become data-driven and awash in an overwhelming amount of data from multiple sources (AI, IoT, ML, etc.), they will need to get a better handle on data quality and focus on data management processes and practices.
Before starting any production workloads after migration, you need to test your new workflows to ensure no disruption to production systems. Migrating workloads to AWS Glue AWS Glue is a serverless data integration service that helps analytics users to discover, prepare, move, and integrate data from multiple sources.
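One hedged sketch of such a pre-cutover check: run the migrated Glue job once and poll it to completion before pointing production at it; the job name is hypothetical:

```python
import time
import boto3

glue = boto3.client("glue")
JOB = "migrated-orders-etl"  # placeholder for the migrated job

run_id = glue.start_job_run(JobName=JOB)["JobRunId"]

# Poll until the run reaches a terminal state, then assert success.
while True:
    state = glue.get_job_run(JobName=JOB, RunId=run_id)["JobRun"]["JobRunState"]
    if state in ("SUCCEEDED", "FAILED", "ERROR", "TIMEOUT", "STOPPED"):
        break
    time.sleep(30)

assert state == "SUCCEEDED", f"Smoke test failed: run ended in state {state}"
```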
Tricentis is the global leader in continuous testing for DevOps, cloud, and enterprise applications. Speed changes everything, and continuous testing across the entire CI/CD lifecycle is the key. Tricentis instills that confidence by providing software tools that enable Agile Continuous Testing (ACT) at scale.
Amazon Redshift is a popular cloud data warehouse, offering a fully managed cloud-based service that seamlessly integrates with an organization’s Amazon Simple Storage Service (Amazon S3) data lake, real-time streams, machine learning (ML) workflows, transactional workflows, and much more, all while providing up to 7.9x better price-performance than other cloud data warehouses.
Domain teams should continually monitor for data errors with data validation checks and incorporate data lineage to track usage. Establish and enforce data governance by ensuring all data used is accurate, complete, and compliant with regulations. This calls for additional planning, documentation, and testing.
With data volumes exhibiting a double-digit percentage growth rate year on year and the COVID pandemic disrupting global logistics in 2021, it became more critical to scale and generate near-real-time data. You can visually create, run, and monitor extract, transform, and load (ETL) pipelines to load data into your data lakes.
Some data seems more analytical, while other data is operational (external facing). We recommend identifying the data sources and tables that need to be governed, establishing the governance owner and data quality details, and saving those details in the catalog. Where do you govern? Here’s an example.
For any data user in an enterprise today, data profiling is a key tool for resolving data quality issues and building new data solutions. In this blog, we’ll cover the definition of data profiling, top use cases, and share important techniques and best practices for data profiling today.
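As a minimal illustration of what a profile computes, the per-column basics (completeness, cardinality, ranges) can be derived with pandas; the input file is a placeholder:

```python
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical input

# Column-level profile: type, completeness, and cardinality.
profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "null_pct": df.isna().mean().round(3),
    "distinct": df.nunique(),
})
print(profile)

# Adds min/max/mean for numeric columns and top frequencies for categoricals.
print(df.describe(include="all").T)
```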
Therefore, it’s crucial to keep the schema definition in the Schema Registry and the Data Catalog table in sync. To avoid this, it’s recommended to use a data quality check mechanism to identify such anomalies and take appropriate action in case of unexpected behavior. (MSKSchemaName: name of the schema, for example test-schema-registry.)
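One conceivable drift check along those lines, comparing the latest registered schema against the Data Catalog table’s columns with boto3; the registry, schema, database, and table names are assumptions, and the parsing assumes an Avro record schema:

```python
import json
import boto3

glue = boto3.client("glue")

# Latest schema version from the Schema Registry (hypothetical names).
version = glue.get_schema_version(
    SchemaId={"RegistryName": "msk-registry", "SchemaName": "test-schema-registry"},
    SchemaVersionNumber={"LatestVersion": True},
)
registry_fields = {f["name"] for f in json.loads(version["SchemaDefinition"])["fields"]}

# Columns currently recorded on the Data Catalog table.
table = glue.get_table(DatabaseName="analytics", Name="events")["Table"]
catalog_columns = {c["Name"] for c in table["StorageDescriptor"]["Columns"]}

drift = registry_fields.symmetric_difference(catalog_columns)
if drift:
    print(f"Schema drift detected; investigate fields: {sorted(drift)}")
```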
It’s only when companies take their first stab at manually cataloging and documenting operational systems, processes, and the associated data, both at rest and in motion, that they realize how time-consuming the entire data prepping and mapping effort is, and why that work is sure to be compounded by human error and data quality issues.
As the latest iteration in this pursuit of high-quality data sharing, DataOps combines a range of disciplines. It synthesizes all we’ve learned about agile, data quality, and ETL/ELT. This produces end-to-end lineage so business and technology users alike can understand the state of a data lake and/or lakehouse.
Storage-centric approach: In the storage-centric approach, people try to address data silos by throwing everything into a data lake or a data warehouse. But although this helps somewhat in terms of architecture, these data lakes soon become unwieldy.
Companies experimenting with gen AI usually create enterprise-level accounts with cloud-based services such as OpenAI’s ChatGPT or Anthropic’s Claude, and the first field tests and productivity gains lead them to look for other opportunities to deploy the technology.
“Data quality is achieved by defining a process that involves all of the company’s stakeholders, along with dedicated measurement tools,” notes Francesco Saverio Colasuonno, Data & Analytics Office Manager at INAIL. “We expect to extend it to other use cases.”
At a certain point, as the demand keeps growing, the data volumes rapidly increase. Data is no longer stored in CSV files, but in a dedicated, purpose-built data lake / data warehouse. Existing preprocessing, data ingestion, and data quality processes can be converted from Java/Spark into Java UDFs.
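A hedged sketch of what such a conversion can look like, registering an inline Java UDF in Snowflake through the Python connector; the connection parameters and the UDF itself are illustrative, not the post’s actual code:

```python
import snowflake.connector  # assumes the snowflake-connector-python package

conn = snowflake.connector.connect(
    account="example_account", user="example_user", password="...",  # placeholders
    warehouse="ANALYTICS_WH", database="RAW", schema="PUBLIC",
)

# An inline Java UDF standing in for a former Spark preprocessing step.
create_udf = """
CREATE OR REPLACE FUNCTION normalize_email(email STRING)
RETURNS STRING
LANGUAGE JAVA
HANDLER = 'Normalizer.normalize'
AS $$
class Normalizer {
    public static String normalize(String email) {
        return email == null ? null : email.trim().toLowerCase();
    }
}
$$
"""
cur = conn.cursor()
cur.execute(create_udf)
cur.execute("SELECT normalize_email('  Alice@Example.COM ')")
print(cur.fetchone())
```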
Amazon Redshift helps you break down the data silos and allows you to run unified, self-service, real-time, and predictive analytics on all data across your operational databases, data lake, data warehouse, and third-party datasets with built-in governance.