Much has been written about the struggles of deploying machine learning projects to production. As with many burgeoning fields and disciplines, we don't yet have a shared canonical infrastructure stack or best practices for developing and deploying data-intensive applications. An Overarching Concern: Correctness and Testing.
At Atlanta's Hartsfield-Jackson International Airport, an IT pilot has led to a wholesale data journey destined to transform operations at the world's busiest airport, fueled by machine learning and generative AI. They're trying to get a handle on their data estate right now.
Think about what the model results tell you: "Maybe a random forest isn't the best tool to split this data, but XLNet is." If none of your models performed well, that tells you that your dataset (your choice of raw data, feature selection, and feature engineering) is not amenable to machine learning.
Managing tests of complex data transformations when automated data testing tools lack important features? Introduction: Data transformations are at the core of modern business intelligence, blending and converting disparate datasets into coherent, reliable outputs.
Within seconds of transactional data being written into Amazon Aurora (a fully managed modern relational database service offering performance and high availability at scale), the data is seamlessly made available in Amazon Redshift for analytics and machine learning. Choose Test Connection.
Purchase Ready-Made Big Data Solutions for Healthcare Applications. There is also a range of different data-driven solutions you can start using right now. Such products usually come with a standard set of tools, and you can test several of them to pick the best option. Big Data is the Key to Hospital Management.
GSK had been pursuing DataOps capabilities such as automation, containerization, automated testing and monitoring, and reusability, for several years. Workiva also prioritized improving the data lifecycle of machine learning models, which otherwise can be very time consuming for the team to monitor and deploy.
In this post, we'll see the fundamental procedures, tools, and techniques that data engineers, data scientists, and QA/testing teams use to ensure high-quality data as soon as it's deployed. First, we look at how unit and integration tests uncover transformation errors at an early stage (e.g., PyTest, JUnit, NUnit).
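To make the idea concrete, here is a minimal sketch of a PyTest-style unit test for a data transformation. The `normalize_prices` step and its field names are hypothetical illustrations, not taken from the article.

```python
# A tiny transformation step and two unit tests that catch errors early.
# normalize_prices() is a hypothetical example transformation.

def normalize_prices(rows):
    """Strip a leading currency symbol and cast price strings to floats."""
    return [
        {**row, "price": float(row["price"].lstrip("$"))}
        for row in rows
    ]

def test_strips_currency_symbol():
    rows = [{"sku": "A1", "price": "$19.99"}]
    assert normalize_prices(rows)[0]["price"] == 19.99

def test_preserves_other_fields():
    rows = [{"sku": "B2", "price": "5.00"}]
    assert normalize_prices(rows)[0]["sku"] == "B2"
```

Run with `pytest` and each `test_*` function is discovered and executed automatically; a failing assertion pinpoints the broken transformation before it reaches production.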
In the fast-evolving landscape of data science and machine learning, efficiency is not just desirable; it's essential. Imagine a world where every data practitioner, from seasoned data scientists to budding developers, has an intelligent assistant at their fingertips.
Although CRISP-DM is not perfect, the CRISP-DM framework offers a pathway for machine learning using AzureML for Microsoft Data Platform professionals. AI vs. ML vs. Data Science vs. Business Intelligence. They may also learn from evidence, but the data and the modelling fundamentally come from humans in some way.
You can use it for big data analytics and machine learning workloads. Azure Databricks Delta Live Tables: These provide a more straightforward way to build and manage data pipelines for the latest, high-quality data in Delta Lake. Azure Blob Storage serves as the data lake to store raw data.
The goal, she explained, is to knock down data silos between those groups, using multiple data lakes supported by strong security and governance, to drive positive impact across the supply chain, manufacturing, and the clinical trials of new drugs. Four ways to improve data-driven business transformation.
How dbt Core helps data teams test, validate, and monitor complex data transformations and conversions. Introduction: dbt Core, an open-source framework for developing, testing, and documenting SQL-based data transformations, has become a must-have tool for modern data teams as the complexity of data pipelines grows.
AI is transforming how senior data engineers and data scientists validate data transformations and conversions. Artificial intelligence-based verification approaches aid in the detection of anomalies, the enforcement of data integrity, and the optimization of pipelines for improved efficiency.
How GX helps data teams validate, test, and monitor complex data pipelines. Introduction: Data flows from diverse sources, and transformations are becoming increasingly complex. Great Expectations can enable a wide range of data transformation and conversion operations.
The exam tests general knowledge of the platform and applies to multiple roles, including administrator, developer, data analyst, data engineer, data scientist, and system architect. The exam is designed for seasoned and high-achiever data science thought and practice leaders.
Build data validation rules directly into ingestion layers so that invalid data is stopped at the gate rather than detected after damage is done. Use lineage tooling to trace data from source to report. Understanding how data transforms and where it breaks is crucial for auditability and root-cause resolution.
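A validation gate like the one described above can be sketched in a few lines. The rule names and record fields here are illustrative assumptions, not from the excerpt: each incoming record is checked against named rules, and failures are rejected with reasons before anything lands downstream.

```python
# A minimal ingestion-layer validation gate: records failing any rule
# are rejected with the names of the rules they broke.
# Rule names and fields are hypothetical examples.

RULES = {
    "amount_is_positive": lambda r: r.get("amount", 0) > 0,
    "currency_present":   lambda r: bool(r.get("currency")),
}

def validate_at_gate(records):
    """Split incoming records into accepted rows and rejects with reasons."""
    accepted, rejected = [], []
    for rec in records:
        failures = [name for name, rule in RULES.items() if not rule(rec)]
        if failures:
            rejected.append({"record": rec, "failures": failures})
        else:
            accepted.append(rec)
    return accepted, rejected
```

Keeping the reject reasons alongside the record is what makes later root-cause analysis cheap: the gate itself documents why each row was stopped.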
Data analytics draws from a range of disciplines — including computer programming, mathematics, and statistics — to perform analysis on data in an effort to describe, predict, and improve performance. What are the four types of data analytics? Data analytics methods and techniques.
What is the difference between business analytics and data analytics? Business analytics is a subset of data analytics. Data analytics is used across disciplines to find trends and solve problems using data mining, data cleansing, data transformation, data modeling, and more.
For each service, you need to learn the supported authorization and authentication methods, data access APIs, and framework to onboard and test data sources. To solve for these challenges, we launched Amazon SageMaker Lakehouse unified data connectivity.
We’re excited to announce the general availability of the open source adapters for dbt for all the engines in CDP — Apache Hive , Apache Impala , and Apache Spark, with added support for Apache Livy and Cloudera Data Engineering. This variety can result in a lack of standardization, leading to data duplication and inconsistency.
In the beginning, CDP ran only on AWS with a set of services that supported a handful of use cases and workload types. CDP Data Warehouse: a Kubernetes-based service that allows business analysts to deploy data warehouses with secure, self-service access to enterprise data. Predict – Data Engineering (Apache Spark).
AI and machine learning (ML) are not just catchy buzzwords; they're vital to the future of our planet and your business. Doing it right can mean the difference between thriving in the new world of data and disappearing from it. Be sure test cases represent the diversity of app users. Can a chatbot help improve relations?
Our approach The migration initiative consisted of two main parts: building the new architecture and migrating data pipelines from the existing tool to the new architecture. Often, we would work on both in parallel, testing one component of the architecture while developing another at the same time.
Airflow has been adopted by many Cloudera Data Platform (CDP) customers in the public cloud as the next-generation orchestration service to set up and operationalize complex data pipelines. With this Technical Preview release, any CDE customer can test drive the new authoring interface by setting up the latest CDE service.
Once released, consumers use datasets from different providers for analysis, machine learning (ML) workloads, and visualization. Each CDH dataset has three processing layers: source (raw data), prepared (transformed data in Parquet), and semantic (combined datasets).
Before the data is put into the model comes a process called feature engineering: transforming the original data columns to impose certain business assumptions or simply increase model accuracy. The classical approach is to assume the adstock function (typically linear) and test out various values of its decay parameter.
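As a sketch of the feature-engineering step described above, here is the classical linear (geometric) adstock transform, where each period carries over a decayed share of past marketing spend. The decay value in the usage comment is a hypothetical example, not from the excerpt.

```python
# A minimal linear adstock transform: a_t = x_t + decay * a_{t-1}.
# The decay parameter would be tuned by testing several candidate values.

def adstock(spend, decay):
    """Return the adstocked series for a list of per-period spend values."""
    out, carry = [], 0.0
    for x in spend:
        carry = x + decay * carry
        out.append(carry)
    return out

# Example with a hypothetical decay of 0.5:
# adstock([100, 0, 0], 0.5) -> [100.0, 50.0, 25.0]
```

In practice one would fit the downstream model with several candidate decay values and keep the one that maximizes accuracy, exactly the "test out various values" loop the excerpt describes.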
In addition to using native managed AWS services that BMS didn't need to worry about upgrading, BMS was looking to offer an ETL service to non-technical business users that could visually compose data transformation workflows and seamlessly run them on the AWS Glue Apache Spark-based serverless data integration engine.
Today's general availability announcement covers Iceberg running within key data services in the Cloudera Data Platform (CDP), including Cloudera Data Warehousing (CDW), Cloudera Data Engineering (CDE), and Cloudera Machine Learning (CML). Read why the future of data lakehouses is open.
Continuing from my previous blog post about how awesome and easy it is to develop web-based applications backed by Cloudera Operational Database (COD), I started a small project to integrate COD with another CDP cloud experience, Cloudera Machine Learning (CML). Now, let's start testing our model! Go to runner.py.
Also, you can run other types of business applications, such as web applications and machine learning (ML) TensorFlow workloads, on the same EKS cluster. We also share a Spark benchmark solution that suits all Amazon EMR deployment options, so you can replicate the process in your environment for your own performance test cases.
The advent of rapid adoption of serverless data lake architectures, with ever-growing datasets that need to be ingested from a variety of sources, followed by complex data transformation and machine learning (ML) pipelines, can present a challenge. Disable the rules after testing to avoid repeated messages.
Modak Nabu relies on a framework of "Botworks", a series of micro-jobs to accomplish various data transformation steps, from ingestion to profiling and indexing. Cloudera Data Engineering within CDP provides a fully managed Spark-on-Kubernetes service that hides the complexity of running production DE workloads at scale.
The general availability covers Iceberg running within some of the key data services in CDP, including Cloudera Data Warehouse (CDW), Cloudera Data Engineering (CDE), and Cloudera Machine Learning (CML). Cloudera Data Engineering (Spark 3) with Airflow enabled.
Apache Spark unifies batch processing, real-time processing, stream analytics, machine learning, and interactive query in one platform. Why choose K8s for Apache Spark? The Test and Development queues have fixed resource limits; all other queues are only limited by the size of the cluster.
It has not been specifically designed for heavy data transformation tasks. Step Functions helps developers use AWS services to build distributed applications, automate processes, orchestrate microservices, and create data and machine learning (ML) pipelines. Note that Lambda is a general-purpose serverless engine.
Amazon Redshift is used to analyze structured and semi-structured data across data warehouses, operational databases, and data lakes. Amazon EMR provides a big data environment for data processing, interactive analysis, and machinelearning using open source frameworks such as Apache Spark, Apache Hive, and Presto.
Data transforms businesses. That's where the data lifecycle comes into play. Managing data and its flow, from the edge to the cloud, is one of the most important tasks in the process of gaining data intelligence.
Migrating to Amazon Redshift offers organizations the potential for improved price-performance, enhanced data processing, faster query response times, and better integration with technologies such as machine learning (ML) and artificial intelligence (AI). Platform architects define a well-architected platform.
To grow the power of data at scale for the long term, it’s highly recommended to design an end-to-end development lifecycle for your data integration pipelines. The following are common asks from our customers: Is it possible to develop and test AWS Glue data integration jobs on my local laptop?
According to Evanta’s 2022 CIO Leadership Perspectives study, CIOs’ second top priority within the IT function is around data and analytics, with CIOs seeing advancing organizational use of data as key to reaching enterprise objectives. Others also list data initiatives as a top issue for CIOs.
To accomplish this interchange, the method uses data mining and machine learning, and it contains components like a data dictionary to define the fields used by the model, and data transformation to map user data and make it easier for the system to mine that data. Validation summary of models.
They use various AWS analytics services, such as Amazon EMR, to enable their analysts and data scientists to apply advanced analytics techniques to interactively develop and test new surveillance patterns and improve investor protection. Melody Yang is a Senior Big Data Solutions Architect for Amazon EMR at AWS.
Many thanks to AWP Pearson for the permission to excerpt "Manual Feature Engineering: Manipulating Data for Fun and Profit" from the book, Machine Learning with Python for Everyone by Mark E. Missing values can be filled in based on expert knowledge, heuristics, or by some machine learning techniques.
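The simplest of the heuristic approaches mentioned in the excerpt is replacing missing entries with a summary statistic of the observed values. As a sketch (plain Python, median imputation; the choice of median is an illustrative assumption, not from the book):

```python
# A minimal heuristic missing-value imputation: fill None entries
# with the median of the observed values in the column.

def impute_median(values):
    """Replace None entries with the median of the non-missing values."""
    observed = sorted(v for v in values if v is not None)
    n = len(observed)
    median = (observed[n // 2] if n % 2
              else (observed[n // 2 - 1] + observed[n // 2]) / 2)
    return [median if v is None else v for v in values]
```

Expert-knowledge or model-based imputation follows the same pattern: the fill value is just computed from a richer source than a single column statistic.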