We need robust versioning for data, models, code, and preferably even the internal state of applications—think Git on steroids to answer the inevitable questions: What changed? The applications must be integrated with the surrounding business systems so ideas can be tested and validated in the real world in a controlled manner.
In this post, we'll see the fundamental procedures, tools, and techniques that data engineers, data scientists, and QA/testing teams use to ensure high-quality data as soon as it's deployed. First, we look at how unit and integration tests uncover transformation errors at an early stage, using frameworks such as PyTest, JUnit, or NUnit.
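As a minimal illustration of a unit test for a transformation, the following PyTest sketch uses a hypothetical normalize_orders function and made-up records, not code from the article:

```python
# A minimal PyTest sketch: the transformation under test (normalize_orders) and
# the input/expected records are hypothetical stand-ins.
import pandas as pd


def normalize_orders(df: pd.DataFrame) -> pd.DataFrame:
    """Example transformation: trim and lowercase emails, drop rows without an order ID."""
    out = df.copy()
    out["email"] = out["email"].str.strip().str.lower()
    return out.dropna(subset=["order_id"])


def test_normalize_orders_cleans_email_and_drops_missing_ids():
    raw = pd.DataFrame(
        {"order_id": ["101", None], "email": [" John.Doe@Example.com ", "x@y.com"]}
    )
    result = normalize_orders(raw)

    assert list(result["order_id"]) == ["101"]
    assert list(result["email"]) == ["john.doe@example.com"]
```

Tests like this run in CI before a pipeline is deployed, so schema or logic regressions surface long before they reach production data.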
Together with price-performance, Amazon Redshift offers capabilities such as a serverless architecture, machine learning integration within your data warehouse, and secure data sharing across the organization. dbt Cloud is a hosted service that helps data teams productionize dbt deployments. Choose Test Connection.
Managing tests of complex data transformations when automated data testing tools lack important features? Introduction: Data transformations are at the core of modern business intelligence, blending and converting disparate datasets into coherent, reliable outputs.
Common challenges and practical mitigation strategies for reliable data transformations. Introduction: Data transformations are important processes in data engineering, enabling organizations to structure, enrich, and integrate data for analytics, reporting, and operational decision-making.
How GX helps data teams validate, test, and monitor complex data pipelines. Introduction: Data flows from diverse sources, and transformations are becoming increasingly complex. Great Expectations can enable a wide range of data transformations and conversion operations.
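As a hedged sketch of column-level expectations (using the older pandas-dataset style great_expectations API; entry points and method names differ across GX versions, and the DataFrame here is made up):

```python
# A minimal sketch of column-level expectations with great_expectations.
# Uses the older pandas-based API; newer GX versions expose a different
# validator interface, and the data/columns are hypothetical.
import great_expectations as ge
import pandas as pd

orders = pd.DataFrame({"order_id": [101, 102, 103], "amount": [19.99, 5.00, 42.50]})
gdf = ge.from_pandas(orders)

# Two simple expectations: no null IDs, amounts within a plausible range.
result_ids = gdf.expect_column_values_to_not_be_null("order_id")
result_amounts = gdf.expect_column_values_to_be_between("amount", min_value=0, max_value=10_000)

print(result_ids)
print(result_amounts)
```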
Your generated jobs can use a variety of data transformations, including filters, projections, unions, joins, and aggregations, giving you the flexibility to handle complex data processing requirements. In this post, we discuss how Amazon Q data integration transforms ETL workflow development.
Since reporting is part of an effective DQM practice, we will also go through some data quality metrics examples you can use to assess your efforts. But first, let's define what data quality actually is. What is the definition of data quality? Why Do You Need Data Quality Management? date, month, and year).
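As a small illustration, two common data quality metrics, completeness and validity, can be computed directly over a table; the customer records and email rule below are hypothetical:

```python
# A small sketch of two common data quality metrics: completeness and validity.
# The customer table and the email format rule are purely illustrative.
import re
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "email": ["a@example.com", None, "not-an-email", "d@example.com"],
})

# Completeness: share of non-null values in a column.
completeness = customers["email"].notna().mean()

# Validity: share of non-null values matching the expected format.
email_pattern = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
valid = customers["email"].dropna().apply(lambda v: bool(email_pattern.match(v)))
validity = valid.mean()

print(f"completeness={completeness:.0%}, validity={validity:.0%}")
```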
For each service, you need to learn the supported authorization and authentication methods, data access APIs, and frameworks to onboard and test data sources. This new capability simplifies your data journey and helps you meet your security requirements.
AI is transforming how senior data engineers and data scientists validate data transformations and conversions. Artificial intelligence-based verification approaches aid in the detection of anomalies, the enforcement of data integrity, and the optimization of pipelines for improved efficiency.
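As a deliberately simple illustration of automated anomaly detection on pipeline metrics (a fixed z-score rule rather than a learned model, with made-up numbers):

```python
# A toy sketch of statistical anomaly detection on a pipeline metric
# (daily row counts). The data and the z-score threshold are illustrative;
# AI-based checks would typically learn these patterns instead.
import statistics

daily_row_counts = [10_120, 9_980, 10_250, 10_040, 2_300, 10_110]  # one suspicious drop

mean = statistics.mean(daily_row_counts)
stdev = statistics.stdev(daily_row_counts)

for day, count in enumerate(daily_row_counts):
    z = (count - mean) / stdev
    if abs(z) > 2:
        print(f"day {day}: row count {count} looks anomalous (z={z:.1f})")
```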
I, thankfully, learned this early in my career, at a time when I could still refer to myself as a software developer. Upload your data, click through a workflow, walk away. If you're a professional data scientist, you already have the knowledge and skills to test these models. It does not exist in the code.
Prompt with no metadata: For the first test, we used a basic prompt containing just the SQL-generating instructions and no table metadata. A question arises as to what level of detail we need to include in the table metadata. As tables undergo schema changes, updating metadata for each change can be time-consuming and requires effort.
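A sketch of the two prompt variants being compared might look like the following; the table schema, column comments, and instruction wording are hypothetical, not taken from the post:

```python
# Hypothetical prompt templates: one with no table metadata, one that injects
# a schema description so the model can ground its SQL in real column names.
TABLE_METADATA = """
Table: sales_orders
Columns:
  order_id   BIGINT   -- unique order identifier
  customer   VARCHAR  -- customer name
  amount     DECIMAL  -- order total in USD
  order_date DATE     -- date the order was placed
""".strip()

QUESTION = "What was the total revenue per customer in 2023?"

prompt_no_metadata = (
    "You are a SQL assistant. Write a SQL query that answers the question.\n"
    f"Question: {QUESTION}"
)

prompt_with_metadata = (
    "You are a SQL assistant. Write a SQL query that answers the question, "
    "using only the tables and columns described below.\n"
    f"{TABLE_METADATA}\n"
    f"Question: {QUESTION}"
)

print(prompt_with_metadata)
```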
All these pitfalls are avoidable with the right data integrity policies in place. Means of ensuring data integrity. Data integrity can be divided into two areas: physical and logical. Physical data integrity refers to how data is stored and accessed. Data integrity: A process and a state.
What is data management? Data management can be defined in many ways. Usually the term refers to the practices, techniques, and tools that allow access and delivery through different fields and data structures in an organisation. Extract, Transform, Load (ETL). Data transformation. Microsoft Azure.
We refer to multiple masking policies being attached to a table as a multi-modal masking policy. SELECT * FROM svv_attached_masking_policy; Now you can test that different users can see the same data masked differently based on their roles. Check that the masking policies are created with the following code: -- 1.1-
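As a hedged end-to-end sketch of the idea above, the snippet below creates and attaches a hypothetical masking policy and then checks svv_attached_masking_policy, using the redshift_connector Python driver. The policy name, table, role, and connection details are illustrative, and the exact dynamic data masking DDL should be confirmed against the Amazon Redshift documentation.

```python
# A hedged sketch only: create and attach a hypothetical masking policy, then
# query svv_attached_masking_policy. Table, role, policy name, and connection
# details are illustrative; confirm the DDL against the Redshift docs.
import redshift_connector

statements = [
    """CREATE MASKING POLICY mask_email_full
       WITH (email VARCHAR(256))
       USING ('*****'::VARCHAR(256))""",
    """ATTACH MASKING POLICY mask_email_full
       ON orders(email)
       TO ROLE analyst
       PRIORITY 10""",
]

conn = redshift_connector.connect(
    host="my-workgroup.123456789012.us-east-1.redshift-serverless.amazonaws.com",
    database="dev",
    user="admin",
    password="...",
)
cursor = conn.cursor()
for statement in statements:
    cursor.execute(statement)
cursor.execute("SELECT * FROM svv_attached_masking_policy")
print(cursor.fetchall())
conn.commit()
conn.close()
```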
For more information on this foundation, refer to A Detailed Overview of the Cost Intelligence Dashboard. Additionally, it manages table definitions in the AWS Glue Data Catalog, containing references to data sources and targets of extract, transform, and load (ETL) jobs in AWS Glue.
Traditionally, such a legacy call center analytics platform would be built on a relational database that stores data from streaming sources. Data transformations through stored procedures and the use of materialized views to curate datasets and generate insights are a known pattern with relational databases.
Developers need to onboard new data sources, chain multiple data transformation steps together, and explore data as it travels through the flow. Figure 5: Parameter references in the configuration panel and auto-complete. Figure 7: Test sessions provide an interactive experience that NiFi developers love.
To grow the power of data at scale for the long term, it’s highly recommended to design an end-to-end development lifecycle for your data integration pipelines. The following are common asks from our customers: Is it possible to develop and test AWS Glue data integration jobs on my local laptop?
Kinesis Data Firehose is a fully managed service for delivering near-real-time streaming data to various destinations for storage and performing near-real-time analytics. You can perform analytics on VPC flow logs delivered from your VPC using the Kinesis Data Firehose integration with Datadog as a destination.
Additionally, you can configure OpenSearch Ingestion to apply data transformations before delivery. The content includes a reference architecture, a step-by-step guide on infrastructure setup, sample code for implementing the solution within a use case, and an AWS Cloud Development Kit (AWS CDK) application for deployment.
However, you might face significant challenges when planning for a large-scale data warehouse migration. For an example, refer to How JPMorgan Chase built a data mesh architecture to drive significant value to enhance their enterprise data platform. Platform architects define a well-architected platform.
Our approach The migration initiative consisted of two main parts: building the new architecture and migrating data pipelines from the existing tool to the new architecture. Often, we would work on both in parallel, testing one component of the architecture while developing another at the same time.
Load the dataset: First, create a new table in your Redshift Serverless endpoint and copy the test data into it by doing the following: Open the Query Editor V2 and log in using the admin user name and details defined when the endpoint was created.
so you have some reference as to where each item fits (and this will also make it easier for you to pick tools for the priority order referenced in Context #3 above). If you can show ROI on a DW, it would be a good use of your money to go with Omniture Discover, WebTrends Data Mart, or Coremetrics Explore, and embrace Multiplicity.
In the next sections, we explore the following topics: the DAG file, in order to understand how to define and then pass the correlation ID in the AWS Glue and EMR tasks; and the code needed in the Python scripts to output information based on the correlation ID. Refer to the GitHub repo for the detailed DAG definition and Spark scripts.
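As a hedged sketch of the script-side piece, assuming the correlation ID is passed to the AWS Glue job as a job argument named correlation_id (an assumption, not taken from the repo), the snippet below reads it with getResolvedOptions and includes it in log output; awsglue.utils is only available inside the Glue job runtime:

```python
# Sketch of a Glue Spark script reading a correlation ID passed as a job
# argument and emitting it in logs. The argument name is an assumption.
import logging
import sys

from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME", "correlation_id"])

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(args["JOB_NAME"])
logger.info("correlation_id=%s starting transformation", args["correlation_id"])
```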
Building a starter version of anything can often be straightforward, but building something with enterprise-grade scale, security, resiliency, and performance typically requires knowledge of and adherence to battle-tested best practices, and using the right tools and features in the right scenario. Data Vault 2.0
Duplicating data from a production database to a lower or lateral environment and masking personally identifiable information (PII) to comply with regulations enables development, testing, and reporting without impacting critical systems or exposing sensitive customer data. PII detection and scrubbing.
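A toy sketch of the scrubbing idea, using simple regexes for email and US-style phone formats (purely illustrative; production masking typically relies on dedicated PII detection services):

```python
# A small sketch of regex-based PII scrubbing for lower environments.
# Patterns cover only simple email and US-style phone formats and are
# illustrative, not a complete PII detection solution.
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")


def scrub_pii(text: str) -> str:
    text = EMAIL.sub("<EMAIL>", text)
    return PHONE.sub("<PHONE>", text)


print(scrub_pii("Contact jane.doe@example.com or 555-123-4567 for details."))
# Contact <EMAIL> or <PHONE> for details.
```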
For more details on how to configure and schedule the log collector, refer to the yarn-log-collector GitHub repo. Transform the YARN job history logs from JSON to CSV: After obtaining YARN logs, you run a YARN log organizer, yarn-log-organizer.py, which is a parser that transforms JSON-based logs into CSV files.
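The snippet below is not the actual yarn-log-organizer.py; it is a simplified sketch of the same JSON-to-CSV idea, assuming one JSON record per line and illustrative field names and file paths:

```python
# Simplified, hypothetical JSON-to-CSV step: flattens one JSON record per line
# into rows of a CSV file. Field names and paths are illustrative only.
import csv
import json

FIELDS = ["applicationId", "user", "startedTime", "finishedTime", "state"]

with open("yarn-logs.json") as src, open("yarn-logs.csv", "w", newline="") as dst:
    writer = csv.DictWriter(dst, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for line in src:
        record = json.loads(line)
        writer.writerow(record)  # missing fields are written as empty strings
```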
Data transformation plays a pivotal role in providing the necessary data insights for businesses in any organization, small or large. To gain these insights, customers often perform ETL (extract, transform, and load) jobs from their source systems and output an enriched dataset.
Example data: The following code shows an example of raw order data from the stream: Record1: { "orderID":"101", "email":" john. To address the challenges with the raw data, we can implement a comprehensive data transformation process using Redshift ML integrated with an LLM in an ETL workflow.
AWS Glue Studio is a graphical interface that makes it easy to create, run, and monitor extract, transform, and load (ETL) jobs in AWS Glue. It allows you to visually compose datatransformation workflows using nodes that represent different data handling steps, which later are converted automatically into code to run.
With these features, you can now build data pipelines completely in standard SQL that are serverless, simpler to build, and able to operate at scale. Typically, data transformation processes are used to perform this operation, and a final consistent view is stored in an S3 bucket or folder.
citibike-tripdata-destination-ACCOUNT_ID – The bucket used for storing the transformed dataset. When implementing the solution in this post, replace references to airflow-blog-bucket-ACCOUNT_ID and citibike-tripdata-destination-ACCOUNT_ID with the names of your own S3 buckets. Run the DAG: Let's look at how to run the DAGs.
In this post, we discuss why AWS recommends moving from Kinesis Data Analytics for SQL Applications to Amazon Kinesis Data Analytics for Apache Flink to take advantage of Apache Flink’s advanced streaming capabilities. View the stream data. Transform and enrich the data. Manipulate the data with Python.
Creating an external schema from the data share database on the consumer, mirroring that of the producer cluster with identical names. Testing: Conducting an internal week-long regression testing and auditing process to meticulously validate all data points by running the same workload and twice the workload.
You can also use the data transformation feature of Data Firehose to invoke a Lambda function to perform data transformation in batches. You can test this solution yourself using the AWS Samples GitHub repository. This method uses GZIP compression to optimize storage consumption and query performance.
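For reference, a minimal sketch of a Firehose transformation Lambda following the standard record contract (recordId, result, base64-encoded data) might look like the following; the transformation itself (uppercasing a field) is just a placeholder:

```python
# Sketch of a Data Firehose transformation Lambda. Each incoming record is
# base64-decoded, transformed, and returned with recordId, result, and
# re-encoded data. The field being transformed is a placeholder.
import base64
import json


def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        payload["status"] = payload.get("status", "").upper()  # placeholder transform
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(json.dumps(payload).encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```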
It has not been specifically designed for heavy data transformation tasks. Now that the data is on Amazon S3, you can delete the directory that has been downloaded from your Linux machine. Create the Lambda functions: For step-by-step instructions on how to create a Lambda function, refer to Getting started with Lambda.
A source of unpredictable workloads is dbt Cloud, which SafetyCulture uses to manage data transformations in the form of models. Whenever models are created or modified, a dbt Cloud CI job is triggered to test the models by materializing them in Amazon Redshift. Refer to Connect dbt Cloud to Redshift for setup steps.
The following AWS services are used for data ingestion, processing, and load: Amazon AppFlow is a fully managed integration service that enables you to securely transfer data between SaaS applications like Salesforce, SAP, Marketo, Slack, and ServiceNow, and AWS services like Amazon S3 and Amazon Redshift , in just a few clicks.
We are going to turn our attention away from expanding our catalog of models [as mentioned previously in the book] and instead take a closer look at the data. Feature engineering refers to the manipulation—addition, deletion, combination, mutation—of the features. Separate out a hold-out test set. Don't peek at it.
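A minimal sketch of that last step, splitting off a hold-out set up front (using a scikit-learn toy dataset as a stand-in for real data), could look like this:

```python
# Sketch: carve out a hold-out test set once, before any feature engineering
# or model tuning. The dataset and split ratio are illustrative.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)

# The hold-out set is split off up front and not touched until final evaluation.
X_work, X_holdout, y_work, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_work.shape, X_holdout.shape)
```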
AWS DMS enables us to capture deltas, including deletes from the source database, through the use of Change Data Capture (CDC) configuration. CDC in DMS enables us to capture deltas without writing code and without missing any changes, which is critical for the integrity of the data. Under Transforms , choose SQL Query.
We use Apache Spark as our main data processing engine and have over 1,000 Spark applications running over massive amounts of data every day. These Spark applications implement our business logic, ranging from data transformation and machine learning (ML) model inference to operational tasks. Their costs were climbing.
Alternatively, you can use AWS Glue for Apache Spark, which provides built-in support for bucketing configurations during the data transformation process. AWS Glue allows you to define bucketing parameters, such as the number of buckets and the columns to bucket on, providing an optimized data layout for efficient querying with Athena.
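As a hedged illustration of bucketing from the Spark side (general Spark, not Glue-specific configuration), a write might look like the sketch below; the table, column, bucket count, and S3 path are hypothetical, and Spark's bucketing layout is not guaranteed to match the Hive-style bucketing that Athena expects:

```python
# Sketch of writing a bucketed table with Spark. bucketBy requires writing via
# saveAsTable; the path, table name, column, and bucket count are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bucketing-example").enableHiveSupport().getOrCreate()

orders = spark.read.parquet("s3://my-bucket/raw/orders/")

(
    orders.write
    .bucketBy(16, "customer_id")
    .sortBy("customer_id")
    .format("parquet")
    .mode("overwrite")
    .saveAsTable("analytics.orders_bucketed")
)
```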