Complex Data Transformations: Test Planning Best Practices. Ensuring data accuracy with structured testing and best practices. Introduction: Data transformations and conversions are crucial for data pipelines, enabling organizations to process, integrate, and refine raw data into meaningful insights.
Common challenges and practical mitigation strategies for reliable data transformations. Introduction: Data transformations are important processes in data engineering, enabling organizations to structure, enrich, and integrate data for analytics, reporting, and operational decision-making.
With this launch of JDBC connectivity, Amazon DataZone expands its support for data users, including analysts and scientists, allowing them to work in their preferred environments—whether it’s SQL Workbench, Domino, or Amazon-native solutions—while ensuring secure, governed access within Amazon DataZone. Choose Test connection.
In this post, we'll see the fundamental procedures, tools, and techniques that data engineers, data scientists, and QA/testing teams use to ensure high-quality data as soon as it's deployed. First, we look at how unit and integration tests uncover transformation errors at an early stage, using frameworks such as PyTest, JUnit, and NUnit.
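As a minimal illustration of that first step, the sketch below uses PyTest to unit-test a small transformation function; the function, rates, and field names are hypothetical and not taken from the post.

```python
# Hypothetical example: unit-testing a simple transformation with PyTest.
import pytest


def normalize_currency(rows, rate):
    """Convert a list of {'amount': float} records using a conversion rate."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return [{**r, "amount": round(r["amount"] * rate, 2)} for r in rows]


def test_normalize_currency_converts_amounts():
    rows = [{"amount": 10.0}, {"amount": 2.5}]
    out = normalize_currency(rows, rate=0.9)
    assert [r["amount"] for r in out] == [9.0, 2.25]


def test_normalize_currency_rejects_bad_rate():
    with pytest.raises(ValueError):
        normalize_currency([{"amount": 1.0}], rate=0)
```

Running `pytest` against a file like this catches transformation bugs long before the pipeline reaches production.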
Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing ETL, business intelligence (BI), and reporting tools. dbt Cloud is a hosted service that helps data teams productionize dbt deployments.
Fragmented systems, inconsistent definitions, outdated architecture, and manual processes contribute to a silent erosion of trust in data. When financial data is inconsistent, reporting becomes unreliable. A compliance report is rejected because timestamps don't match across systems. Assign domain data stewards.
How dbt Core helps data teams test, validate, and monitor complex data transformations and conversions. Introduction: dbt Core, an open-source framework for developing, testing, and documenting SQL-based data transformations, has become a must-have tool for modern data teams as the complexity of data pipelines grows.
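For instance, a team could trigger dbt tests from Python using dbt's programmatic entry point. This is only a sketch, assuming dbt-core 1.5+ and a dbt project in the working directory; the "stg_orders" selector is a hypothetical model name.

```python
# Minimal sketch: running dbt tests programmatically (dbt-core >= 1.5 assumed).
from dbt.cli.main import dbtRunner

dbt = dbtRunner()
# Equivalent to running `dbt test --select stg_orders` from the CLI.
res = dbt.invoke(["test", "--select", "stg_orders"])

if not res.success:
    raise SystemExit("dbt tests failed; inspect target/run_results.json for details")
print("All selected dbt tests passed")
```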
How GX helps data teams validate, test, and monitor complex data pipelines. Introduction: Data flows from diverse sources, and transformations are becoming increasingly complex. Great Expectations can enable a wide range of data transformation and conversion operations.
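A minimal sketch of that idea, using the older pandas-backed Great Expectations API (pre-1.0 style); the DataFrame, columns, and thresholds are illustrative assumptions.

```python
# Sketch: validating a small batch of data with Great Expectations (pre-1.0 API).
import great_expectations as ge
import pandas as pd

df = pd.DataFrame({"order_id": [1, 2, 3], "amount": [19.99, 5.00, 42.50]})
gdf = ge.from_pandas(df)

# Declare expectations about the transformed data.
gdf.expect_column_values_to_not_be_null("order_id")
gdf.expect_column_values_to_be_between("amount", min_value=0, max_value=10_000)

# Validate the batch against the declared expectations.
results = gdf.validate()
print("validation passed:", results["success"])
```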
In this article, we will detail everything that is at stake when we talk about DQM: why it is essential, how to measure data quality, the pillars of good quality management, and some data quality control techniques. But first, let's define what data quality actually is.
It's Essential: Verifying Data Transformations (Part 4). Uncovering the leading problems in data transformation workflows and practical ways to detect and prevent them. In Parts 1–3 of this series of blogs, categories of data transformations were identified as among the top causes of data quality defects in data pipeline workflows.
AI is transforming how senior data engineers and data scientists validate data transformations and conversions. Artificial intelligence-based verification approaches aid in the detection of anomalies, the enforcement of data integrity, and the optimization of pipelines for improved efficiency.
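One hedged illustration of such an approach (not necessarily the article's method) is flagging anomalous post-transformation metrics with scikit-learn's IsolationForest; the batch sizes below are made up.

```python
# Illustrative sketch: anomaly detection on pipeline metrics with IsolationForest.
import numpy as np
from sklearn.ensemble import IsolationForest

# Pretend these are post-transformation row counts per batch; 500 looks suspicious.
values = np.array([[100], [102], [98], [101], [97], [500]])

model = IsolationForest(contamination=0.2, random_state=42)
labels = model.fit_predict(values)  # -1 = anomaly, 1 = normal

for v, label in zip(values.ravel(), labels):
    print(f"batch size {v}: {'anomaly' if label == -1 else 'ok'}")
```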
The data organization wants to run the Value Pipeline as robustly as a Six Sigma factory, and it must be able to implement and deploy process improvements as rapidly as a Silicon Valley start-up. The data engineer builds data transformations. Their product is the data. Create tests. Run the factory.
DataOps Observability can help you ensure that your complex data pipelines and processes are accurate and that they deliver as designed. Observability also validates that your data transformations, models, and reports are performing as expected, letting you monitor your data operations without replacing staff or systems.
Before we dive in, let's define strands of AI, Machine Learning, and Data Science: Business intelligence (BI) leverages software and services to transform data into actionable insights that inform an organization's strategic and tactical business decisions. Once the model has been trained, it will need to be tested.
Data transformation plays a pivotal role in providing the necessary data insights for businesses in any organization, small and large. To gain these insights, customers often perform ETL (extract, transform, and load) jobs from their source systems and output an enriched dataset.
Azure Blob Storage serves as the data lake to store raw data. Azure Databricks, a big data analytics platform built on Apache Spark, performs the actual data transformations. Azure Machine Learning can then use this data to train, test, and deploy machine learning models.
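A rough sketch of what such a Databricks transformation step might look like in PySpark; the storage account, container paths, and column names are placeholder assumptions, not details from the excerpt.

```python
# Sketch of a Databricks-style raw-to-prepared transformation in PySpark.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("raw-to-prepared").getOrCreate()

# Read raw JSON events from the data lake (hypothetical path).
raw = spark.read.json("abfss://raw@examplestorageacct.dfs.core.windows.net/events/")

# Clean and enrich the data before handing it to downstream ML training.
prepared = (
    raw.filter(F.col("event_type").isNotNull())
       .withColumn("event_date", F.to_date("event_timestamp"))
       .dropDuplicates(["event_id"])
)

prepared.write.mode("overwrite").parquet(
    "abfss://prepared@examplestorageacct.dfs.core.windows.net/events/"
)
```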
A modern data platform entails maintaining data across multiple layers, targeting diverse platform capabilities like high performance, ease of development, cost-effectiveness, and DataOps features such as CI/CD, lineage, and unit testing. It does this by helping teams handle the T in ETL (extract, transform, and load) processes.
Data analytics draws from a range of disciplines — including computer programming, mathematics, and statistics — to perform analysis on data in an effort to describe, predict, and improve performance. What are the four types of data analytics? Data analytics and data science are closely related.
The goal of DataOps Observability is to provide visibility of every journey that data takes from source to customer value across every tool, environment, data store, data and analytic team, and customer so that problems are detected, localized and raised immediately. That data then fills several database tables.
What is the difference between business analytics and data analytics? Business analytics is a subset of data analytics. Data analytics is used across disciplines to find trends and solve problems using data mining, data cleansing, data transformation, data modeling, and more.
All this contributes to your overall data integrity profile. Logical data integrity is designed to guard against human error. We’ll explore this concept in detail in the testing section below. Data integrity: A process and a state. There are two means for ensuring data integrity: process and testing.
However, you might face significant challenges when planning for a large-scale data warehouse migration. This will enable right-sizing the Redshift data warehouse to meet workload demands cost-effectively. This report shows how tables, views, and stored procedures rely on each other.
dbt allows data teams to produce trusted data sets for reporting, ML modeling, and operational workflows using SQL, with a simple workflow that follows software engineering best practices like modularity, portability, and continuous integration/continuous development (CI/CD).
Many data pipeline failures and quality issues that are detected by data observability tools in production could have been prevented earlier in the pipeline lifecycle with better pre-production testing strategies. Testing helps identify transformation errors and data quality issues early, minimizing risks.
Airflow has been adopted by many Cloudera Data Platform (CDP) customers in the public cloud as the next-generation orchestration service to set up and operationalize complex data pipelines. The same Airflow job can now be used to generate different SQL reports.
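As a hedged sketch of that pattern, the Airflow 2.x DAG below fans a single job definition out into several SQL reports; the report names, queries, and the choice to merely print the rendered SQL (rather than run it through a warehouse hook) are illustrative assumptions.

```python
# Sketch: one Airflow DAG generating multiple SQL reports (Airflow 2.4+ assumed).
from datetime import datetime

from airflow.decorators import dag, task

# Hypothetical report catalog: name -> SQL to run.
REPORTS = {
    "daily_sales": "SELECT order_date, SUM(amount) FROM sales GROUP BY order_date",
    "daily_signups": "SELECT signup_date, COUNT(*) FROM users GROUP BY signup_date",
}


@dag(start_date=datetime(2024, 1, 1), schedule="@daily", catchup=False)
def sql_reports():
    @task
    def run_report(name: str, sql: str) -> str:
        # A real deployment would execute this via a warehouse hook; here we
        # only log the rendered SQL to keep the sketch self-contained.
        print(f"Running report {name}: {sql}")
        return sql

    # One task instance per report, all driven by the same job definition.
    for name, sql in REPORTS.items():
        run_report.override(task_id=f"run_{name}")(name, sql)


sql_reports()
```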
At the heart of CDP is SDX, a unified context layer for governance and security that makes it easy to create a secure data lake and run workloads that address all stages of your data lifecycle (collect, enrich, report, serve, and predict). Enrich – Data Engineering (Apache Spark and Apache Hive).
The main driving factors include lower total cost of ownership, scalability, stability, improved ingestion connectors (such as Data Prepper , Fluent Bit, and OpenSearch Ingestion), elimination of external cluster managers like Zookeeper, enhanced reporting, and rich visualizations with OpenSearch Dashboards.
The rapid adoption of serverless data lake architectures—with ever-growing datasets that need to be ingested from a variety of sources, followed by complex data transformation and machine learning (ML) pipelines—can present a challenge. Disable the rules after testing to avoid repeated messages.
In addition to using native managed AWS services that BMS didn’t need to worry about upgrading, BMS was looking to offer an ETL service to non-technical business users that could visually compose data transformation workflows and seamlessly run them on the AWS Glue Apache Spark-based serverless data integration engine.
The data products used inside the company include insights from user journeys, operational reports, and marketing campaign results, among others. The data platform serves on average 60 thousand queries per day. The data volume is in double-digit TBs with steady growth as business and data sources evolve.
Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing ETL (extract, transform, and load), business intelligence (BI), and reporting tools. All columns should be masked for them.
Our approach The migration initiative consisted of two main parts: building the new architecture and migrating data pipelines from the existing tool to the new architecture. Often, we would work on both in parallel, testing one component of the architecture while developing another at the same time.
According to Evanta’s 2022 CIO Leadership Perspectives study, CIOs’ second top priority within the IT function is around data and analytics, with CIOs seeing advancing organizational use of data as key to reaching enterprise objectives. To get there, Angel-Johnson has embarked on a master data management initiative.
Data transforms businesses. That’s where the data lifecycle comes into play. Managing data and its flow, from the edge to the cloud, is one of the most important tasks in the process of gaining data intelligence. This has also improved analytics for ad hoc business report queries.
Each CDH dataset has three processing layers: source (raw data), prepared (transformed data in Parquet), and semantic (combined datasets). It is possible to define stages (DEV, INT, PROD) in each layer to allow structured release and test without affecting PROD.
dbt is an open source, SQL-first templating engine that allows you to write repeatable and extensible data transforms in Python and SQL. dbt is predominantly used by data warehouse (such as Amazon Redshift) customers who are looking to keep their data transform logic separate from storage and engine.
The following table summarizes the new workload management configuration for the consumer cluster. Queue: etl. Usage: for ingestion from multiple data integration. Concurrency scaling mode: auto. Concurrency on main / memory %: auto. Query monitoring rules: stop action on query runtime (seconds) > 3600.
Modak Nabu relies on a framework of “Botworks”, a series of micro-jobs to accomplish various data transformation steps from ingestion to profiling and indexing. Cloudera Data Engineering within CDP provides: a fully managed Spark-on-Kubernetes service that hides the complexity of running production DE workloads at scale.
These tools empower analysts and data scientists to easily collaborate on the same data, with their choice of tools and analytic engines. No more lock-in, unnecessary data transformations, or data movement across tools and clouds just to extract insights out of the data.
Enterprise data is brought into data lakes and data warehouses to carry out analytical, reporting, and data science use cases using AWS analytical services like Amazon Athena, Amazon Redshift, Amazon EMR, and so on. A question arises on what level of detail we need to include in the table metadata.
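For example, table metadata registered in the AWS Glue Data Catalog can be inspected with boto3, as in the sketch below; the database name, table name, and region are hypothetical.

```python
# Sketch: reading table metadata from the AWS Glue Data Catalog with boto3.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Hypothetical database and table names.
table = glue.get_table(DatabaseName="sales_db", Name="orders")["Table"]

print("Description:", table.get("Description", "<none>"))
for col in table["StorageDescriptor"]["Columns"]:
    print(f"{col['Name']}: {col['Type']} - {col.get('Comment', '<no comment>')}")
```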
Cloudera Data Warehouse. Efficient batch data processing. Complex data transformations. Built-in workflow: In addition to querying capabilities, Rill includes scheduled exports and alerts to stay on top of regular reporting and provide opportunities to dive deeper. Apache Hive. Joins and subqueries. Apache Druid.
To accomplish this interchange, the method uses data mining and machine learning, and it contains components like a data dictionary to define the fields used by the model, and data transformation to map user data and make it easier for the system to mine that data. No programming or scripting required.
Duplicating data from a production database to a lower or lateral environment and masking personally identifiable information (PII) to comply with regulations enables development, testing, and reporting without impacting critical systems or exposing sensitive customer data. PII detection and scrubbing.
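A minimal sketch of that masking step in pandas, with hypothetical column names and a simple hash-based tokenization; a real implementation would follow the organization's masking policy and regulatory requirements.

```python
# Sketch: masking PII columns before copying data into a lower environment.
import hashlib

import pandas as pd


def mask_value(value: str) -> str:
    """Replace a sensitive value with a stable, irreversible token."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


customers = pd.DataFrame({
    "customer_id": [1, 2],
    "email": ["alice@example.com", "bob@example.com"],
    "phone": ["555-0100", "555-0199"],
})

# Hypothetical list of columns flagged as PII.
pii_columns = ["email", "phone"]

masked = customers.copy()
for col in pii_columns:
    masked[col] = masked[col].map(mask_value)

print(masked)
```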