Together with price-performance, Amazon Redshift offers capabilities such as serverless architecture, machine learning integration within your data warehouse and secure data sharing across the organization. dbt Cloud is a hosted service that helps data teams productionize dbt deployments.
Common challenges and practical mitigation strategies for reliable data transformations. Introduction: Data transformations are important processes in data engineering, enabling organizations to structure, enrich, and integrate data for analytics, reporting, and operational decision-making.
Managing tests of complex data transformations when automated data testing tools lack important features? Introduction: Data transformations are at the core of modern business intelligence, blending and converting disparate datasets into coherent, reliable outputs.
Amazon DataZone recently announced the expansion of data analysis and visualization options for your project-subscribed data within Amazon DataZone using the Amazon Athena JDBC driver. Refer to the detailed blog post on how you can use this to connect through various other tools. Get started with our technical documentation.
With this launch of JDBC connectivity, Amazon DataZone expands its support for data users, including analysts and scientists, allowing them to work in their preferred environments—whether it’s SQL Workbench, Domino, or Amazon-native solutions—while ensuring secure, governed access within Amazon DataZone. Choose Test connection.
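For teams scripting the same connection, here is a minimal sketch in Python over a JDBC bridge; the driver class, JAR file name, region, S3 output location, and table name are illustrative assumptions rather than values from the post:

    import jaydebeapi  # pip install jaydebeapi; requires a local JVM and the Athena JDBC driver JAR

    # Hypothetical connection properties; substitute your own region and results bucket.
    conn = jaydebeapi.connect(
        "com.simba.athena.jdbc.Driver",
        "jdbc:awsathena://AwsRegion=us-east-1;S3OutputLocation=s3://my-athena-results/;",
        [],
        "AthenaJDBC42.jar",
    )
    cur = conn.cursor()
    cur.execute("SELECT * FROM my_subscribed_table LIMIT 10")  # hypothetical subscribed table
    print(cur.fetchall())
    conn.close()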
It's Essential: Verifying Data Transformations (Part 4). Uncovering the leading problems in data transformation workflows and practical ways to detect and prevent them. In Parts 1–3 of this series of blog posts, categories of data transformations were identified as among the top causes of data quality defects in data pipeline workflows.
DataOps establishes a process hub that automates data production and analytics development workflows so that the data team is more efficient, innovative and less prone to error. In this blog, we’ll explore the role of the DataOps Engineer in driving the data organization to higher levels of productivity. Create tests.
GSK had been pursuing DataOps capabilities such as automation, containerization, automated testing and monitoring, and reusability for several years. DataOps provides the "continuous delivery equivalent for Machine Learning" and enables teams to manage the complexities around continuous training, A/B testing, and deploying without downtime.
Your Chance: Want to test a professional logistics analytics software? Use our 14-day free trial today & transform your supply chain! Now’s the time to strike.
Also known as data validation, integrity refers to the structural testing of data to ensure that the data complies with procedures. This means there are no unintended data errors and that the data corresponds to its appropriate designation. Here, it all comes down to the data transformation error rate.
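As a concrete illustration of that metric, here is a minimal Python sketch that computes a transformation error rate over a batch of records; the field name and integrity rule are hypothetical:

    # A minimal sketch, assuming records are dicts and "valid" means the
    # age field parses as an integer.
    def error_rate(records):
        def is_valid(rec):
            try:
                int(rec["age"])  # hypothetical integrity rule: age must be an integer
                return True
            except (KeyError, ValueError, TypeError):
                return False
        failures = sum(1 for rec in records if not is_valid(rec))
        return failures / len(records) if records else 0.0

    rows = [{"age": "34"}, {"age": "unknown"}, {}]
    print(f"error rate: {error_rate(rows):.0%}")  # error rate: 67%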
For each service, you need to learn the supported authorization and authentication methods, data access APIs, and framework to onboard and test data sources. This approach simplifies your data journey and helps you meet your security requirements.
Here are a few examples that we have seen of how this can be done: Batch ETL with Azure Data Factory and Azure Databricks: In this pattern, Azure Data Factory is used to orchestrate and schedule batch ETL processes, while Azure Blob Storage serves as the data lake to store raw data.
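A minimal sketch of the Databricks transformation step in such a pattern is shown below, assuming Azure Data Factory triggers the job; the storage paths and column names are hypothetical:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("batch-etl").getOrCreate()

    # Read raw JSON events from the Blob Storage data lake (hypothetical container/account).
    raw = spark.read.json("wasbs://raw@myaccount.blob.core.windows.net/events/")

    # Deduplicate and derive a date column, then write curated Parquet back to the lake.
    curated = raw.dropDuplicates(["event_id"]).withColumn("event_date", F.to_date("event_ts"))
    curated.write.mode("overwrite").parquet("wasbs://curated@myaccount.blob.core.windows.net/events/")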
Extrinsic Control Deficit: Many of these changes stem from tools and processes beyond the immediate control of the data team. Unregulated ETL/ELT Processes: The absence of stringent data quality tests in ETL (Extract, Transform, Load) or ELT (Extract, Load, Transform) processes further exacerbates the problem.
DataOps Observability can help you ensure that your complex data pipelines and processes are accurate and that they deliver as designed. Observability also validates that your data transformations, models, and reports are performing as expected, without replacing staff or systems to monitor your data operations.
We live in a world of data: There’s more of it than ever before, in a ceaselessly expanding array of forms and locations. Dealing with Data is your window into the ways data teams are tackling the challenges of this new world to help their companies and their customers thrive. Data integrity: A process and a state.
This blog post dives into the strategic considerations and steps involved in migrating from Solr to OpenSearch. For example, the following creates a collection called test with one shard and no replicas. Multiple processor stages can be chained to form a pipeline for data transformation.
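On the OpenSearch side, the equivalent setup can be scripted with the opensearch-py client; this sketch creates a test index with one shard and no replicas and registers a two-stage ingest pipeline, with the host and processor fields as assumptions:

    from opensearchpy import OpenSearch

    client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])  # hypothetical endpoint

    # Index named "test" with one shard and no replicas, mirroring the Solr collection.
    client.indices.create(
        index="test",
        body={"settings": {"number_of_shards": 1, "number_of_replicas": 0}},
    )

    # Chain two processor stages into an ingest pipeline for data transformation.
    client.ingest.put_pipeline(
        id="transform-docs",
        body={"processors": [
            {"lowercase": {"field": "title"}},
            {"remove": {"field": "tmp_field", "ignore_missing": True}},
        ]},
    )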
DataOps Engineers implement the continuous deployment of data analytics. They give data scientists tools to instantiate development sandboxes on demand. They automate the data operations pipeline and create platforms used to test and monitor data from ingestion to published charts and graphs.
We’re excited to announce the general availability of the open source adapters for dbt for all the engines in CDP — Apache Hive , Apache Impala , and Apache Spark, with added support for Apache Livy and Cloudera Data Engineering. This variety can result in a lack of standardization, leading to data duplication and inconsistency.
Developers need to onboard new data sources, chain multiple data transformation steps together, and explore data as it travels through the flow. This allows developers to make changes to their processing logic on the fly while running some test data through their flow and validating that their changes work as intended.
Airflow has been adopted by many Cloudera Data Platform (CDP) customers in the public cloud as the next generation orchestration service to set up and operationalize complex data pipelines. With this Technical Preview release, any CDE customer can test drive the new authoring interface by setting up the latest CDE service.
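For readers new to Airflow, a pipeline is declared as a DAG of tasks in Python; here is a minimal sketch of the kind of DAG such a service orchestrates, with the task names and schedule as hypothetical choices:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="nightly_transform",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract = BashOperator(task_id="extract", bash_command="echo extracting")
        transform = BashOperator(task_id="transform", bash_command="echo transforming")
        extract >> transform  # run the transformation only after extraction succeeds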
The goal of DataOps Observability is to provide visibility of every journey that data takes from source to customer value across every tool, environment, data store, data and analytic team, and customer so that problems are detected, localized and raised immediately. A data journey spans and tracks multiple pipelines.
We just announced the general availability of Cloudera DataFlow Designer , bringing self-service data flow development to all CDP Public Cloud customers. In our previous DataFlow Designer blog post , we introduced you to the new user interface and highlighted its key capabilities.
dbt is an open source, SQL-first templating engine that allows you to write repeatable and extensible data transforms in Python and SQL. dbt is predominantly used by data warehouse customers (such as Amazon Redshift) who are looking to keep their data transform logic separate from storage and engine.
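By way of illustration, here is a minimal sketch of a dbt Python model (the SQL flavor is analogous); it assumes a hypothetical upstream model named raw_orders and an adapter whose Python models expose a Spark-style DataFrame:

    # models/completed_orders.py — a dbt Python model (sketch).
    def model(dbt, session):
        orders = dbt.ref("raw_orders")  # upstream relation, materialized by dbt
        # Keep only completed orders; dbt materializes the returned DataFrame as a table.
        return orders.filter(orders.status == "complete")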
This blog post is co-written with James Sun from Snowflake. Customers rely on data from different sources such as mobile applications, clickstream events from websites, historical data, and more to deduce meaningful patterns to optimize their products, services, and processes. Provide a name of your choice for the environment.
Predict – Data Engineering (Apache Spark). CDP Data Engineering (1) – a service purpose-built for data engineers focused on deploying and orchestrating data transformation using Spark at scale. (3) Data Visualization is in Tech Preview on AWS and Azure.
This enabled new use cases with customers that were using a mix of Spark and Hive to perform data transformations. Along with delivering the world’s first true hybrid data cloud, stay tuned for product announcements that will drive even more business value with innovative data ops and engineering capabilities.
This integration empowers developers and data scientists alike with advanced capabilities for code completion, generation, and troubleshooting. Whether you’re tackling data transformation challenges or refining intricate machine learning models, our Copilot is designed to be your reliable partner in innovation.
Modak Nabu relies on a framework of “Botworks,” a series of micro-jobs to accomplish various data transformation steps from ingestion to profiling and indexing. Cloudera Data Engineering within CDP provides a fully managed Spark-on-Kubernetes service that hides the complexity of running production DE workloads at scale.
Continuing from my previous blog post about how awesome and easy it is to develop web-based applications backed by Cloudera Operational Database (COD), I started a small project to integrate COD with another CDP cloud experience, Cloudera Machine Learning (CML). Now, let’s start testing our model! (b) Basic data transformation.
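A minimal sketch of those two steps in Python follows; the endpoint URL, access key, payload format, and field names are hypothetical placeholders, not the project's actual values:

    import requests

    # (b) Basic data transformation: cast a raw string reading into a numeric feature.
    raw = {"sensor_id": "42", "reading": "17.3"}
    features = {"reading": float(raw["reading"])}

    # Test the deployed model by POSTing the features to its serving endpoint.
    resp = requests.post(
        "https://modelservice.example.cloudera.site/model",  # hypothetical endpoint
        json={"accessKey": "YOUR_ACCESS_KEY", "request": features},  # assumed payload shape
    )
    print(resp.json())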
Be sure test cases represent the diversity of app users. As an AI product manager, here are some important data-related questions you should ask yourself: What is the problem you’re trying to solve? What data transformations are needed from your data scientists to prepare the data? The perfect fit.
    cd /home/ec2-user/SageMaker
    BASE_S3_PATH="s3://aws-blogs-artifacts-public/artifacts/BDB-4265"
    aws s3 cp "${BASE_S3_PATH}/0_create_tables_with_metadata.ipynb" ./
Prompt with no metadata: For the first test, we used a basic prompt containing just the SQL-generating instructions and no table metadata.
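Here is a sketch of what such a bare prompt might look like when sent through the Amazon Bedrock Converse API; the model ID and prompt wording are assumptions, not the post's exact setup:

    import boto3

    bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

    # SQL-generating instructions only, with no table metadata attached.
    prompt = "Write a SQL query that counts orders per customer. Return only SQL."
    resp = bedrock.converse(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # hypothetical model choice
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    print(resp["output"]["message"]["content"][0]["text"])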
Data transforms businesses. That’s where the data lifecycle comes into play. Managing data and its flow, from the edge to the cloud, is one of the most important tasks in the process of gaining data intelligence.
The general availability covers Iceberg running within some of the key data services in CDP, including Cloudera Data Warehouse ( CDW ), Cloudera Data Engineering ( CDE ), and Cloudera Machine Learning ( CML ). In this first blog, we shared with you how to use Apache Iceberg in Cloudera Data Platform to build an open lakehouse.
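For a flavor of what this looks like in practice, here is a minimal PySpark sketch creating and querying an Iceberg table, assuming the session is already configured with an Iceberg catalog; the database and table names are hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("iceberg-demo").getOrCreate()

    # Create an Iceberg table, append a row, and read it back.
    spark.sql("CREATE TABLE IF NOT EXISTS db.events (id BIGINT, ts TIMESTAMP) USING iceberg")
    spark.sql("INSERT INTO db.events VALUES (1, current_timestamp())")
    spark.sql("SELECT * FROM db.events").show()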
These tools empower analysts and data scientists to easily collaborate on the same data, with their choice of tools and analytic engines. No more lock-in, unnecessary data transformations, or data movement across tools and clouds just to extract insights out of the data.
Cloudera Data Warehouse. Efficient batch data processing. Complex data transformations. Together, Cloudera and Rill Data are dedicated to building and maintaining the data infrastructure that best supports our customers with cost-performant queries, resilience, and distributed real-time metrics.
The Test and Development queues have fixed resource limits. YuniKorn thus empowers Apache Spark to become an enterprise-grade platform, supporting a variety of applications ranging from large-scale data transformation to analytics to machine learning.
Detailed Data and Model Lineage Tracking*: Ensures comprehensive tracking and documentation of data transformations and model lifecycle events, enhancing reproducibility and auditability.
Load the dataset: First, create a new table in your Redshift Serverless endpoint and copy the test data into it by doing the following: Open the Query Editor V2 and log in using the admin user name and details defined when the endpoint was created.
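The same steps can also be scripted against the Redshift Data API instead of Query Editor V2; in this sketch the workgroup, database, table definition, S3 path, and IAM role are all hypothetical:

    import boto3

    rsd = boto3.client("redshift-data", region_name="us-east-1")

    # Create the table, then COPY the test data from S3 into it.
    for sql in (
        "CREATE TABLE customers (name VARCHAR(64), postcode VARCHAR(8), state VARCHAR(32))",
        "COPY customers FROM 's3://my-bucket/test-data/' "
        "IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole' CSV",
    ):
        rsd.execute_statement(WorkgroupName="my-workgroup", Database="dev", Sql=sql)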
As we review data transformation and modernization strategies with our clients, we find many are investigating Snowflake as a data warehouse solution due to its ease of use, speed, and increased flexibility over a traditional data warehouse offering. Validate and test through the entire migration. Sirius can help.
In perhaps a preview of things to come next year, we decided to test how a Data Catalog might work with Tableau on the same data. You can check out a self-service data prep flow from catalog to viz in this recorded version here. Rita Sallam Introduces the Data Prep Rodeo.
To accomplish this interchange, the method uses data mining and machine learning, and it contains components like a data dictionary to define the fields used by the model and data transformation to map user data and make it easier for the system to mine that data. Simple interpretation of models in English.
Data transformation plays a pivotal role in providing the necessary data insights for businesses in any organization, small and large. To gain these insights, customers often perform ETL (extract, transform, and load) jobs from their source systems and output an enriched dataset.
Building a data-driven business includes choosing the right software and implementing best practices around its use. Every year when budget time rolls around, many organizations find themselves asking the same question: “What are we going to do about our data?” This is a summary article. New year, same questions.