print("Creating a test variable.")
response = client.create(
    key="test",
    value="Test value",
    description="Test description",
)
print(response)

print("\nListing all variables.")
variables = client.list()
print(variables)

print("\nGetting the test variable.")
Together with price-performance, Amazon Redshift offers capabilities such as a serverless architecture, machine learning integration within your data warehouse, and secure data sharing across the organization. dbt Cloud is a hosted service that helps data teams productionize dbt deployments. Choose Test Connection.
We need robust versioning for data, models, code, and preferably even the internal state of applications—think Git on steroids to answer inevitable questions: What changed? The applications must be integrated with the surrounding business systems so ideas can be tested and validated in the real world in a controlled manner.
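As a minimal sketch of the "what changed?" idea (not any specific versioning tool), each artifact of a pipeline run can be content-hashed and the hashes recorded per run; the artifact names and payloads below are invented for illustration:

import hashlib
import json

def fingerprint(payload: bytes) -> str:
    """Short content hash; any byte-level change yields a new version id."""
    return hashlib.sha256(payload).hexdigest()[:12]

# Hypothetical artifacts for one pipeline run: data, code, and model state.
run_manifest = {
    "data.csv": fingerprint(b"id,amount\n1,10.5\n"),
    "transform.py": fingerprint(b"def run(df): return df"),
    "model.pkl": fingerprint(b"serialized-model-bytes"),
}
# Store one manifest per run; diffing two manifests answers "what changed?"
print(json.dumps(run_manifest, indent=2))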
Selecting the strategies and tools for validating data transformations and data conversions in your data pipelines. Data transformations and data conversions are crucial to ensure that raw data is organized, processed, and ready for useful analysis.
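As a sketch of what such validation can look like in practice (the column names and checks here are illustrative, not from the article), a transformation can be bracketed by simple assertions:

import pandas as pd

def validate_transform(raw: pd.DataFrame, transformed: pd.DataFrame) -> None:
    """Basic post-transformation checks: completeness, keys, and ranges."""
    assert len(transformed) == len(raw), "row count changed unexpectedly"
    assert transformed["customer_id"].notna().all(), "null keys introduced"
    assert transformed["customer_id"].is_unique, "duplicate keys introduced"
    assert transformed["amount"].ge(0).all(), "negative amounts after conversion"

raw = pd.DataFrame({"customer_id": [1, 2], "amount": ["10.5", "3.2"]})
transformed = raw.assign(amount=raw["amount"].astype(float))
validate_transform(raw, transformed)  # raises AssertionError on any violation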
Managing tests of complex data transformations when automated data-testing tools lack important features? Data transformations are at the core of modern business intelligence, blending and converting disparate datasets into coherent, reliable outputs.
Pruitt says the airport’s new capabilities provide data-driven insights for improving operations, passenger experience, and non-aeronautical revenue across airport business units. Applying AI to elevate ROI: Pruitt and Databricks recently finished a pilot test with Microsoft called Smart Flow.
These innovations run AI search flows to uncover relevant information through semantic, cross-language, and content understanding; adapt information ranking to individual behaviors; and enable guided conversations to pinpoint answers. Ingest flows are created to enrich data as it’s added to an index.
Amazon DataZone recently announced the expansion of data analysis and visualization options for your project-subscribed data within Amazon DataZone using the Amazon Athena JDBC driver. Joel has led data transformation projects on fraud analytics, claims automation, and Master Data Management.
However, with all good things come many challenges, and businesses often struggle with managing their information in the correct way. Oftentimes, the data being collected and used is incomplete or damaged, leading to many other issues that can considerably harm the company. Enter data quality management.
To work effectively, big data requires a large number of high-quality information sources. Where is all of that data going to come from? Proactivity: Another key benefit of big data in the logistics industry is that it encourages informed decision-making and proactivity.
Your generated jobs can use a variety of data transformations, including filters, projections, unions, joins, and aggregations, giving you the flexibility to handle complex data processing requirements. In this post, we discuss how Amazon Q data integration transforms ETL workflow development.
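For a flavor of what such a job looks like, here is a generic PySpark sketch chaining a filter, projection, join, and aggregation (this is not actual generated code; the bucket paths, tables, and columns are made up):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Hypothetical source tables on S3.
orders = spark.read.parquet("s3://example-bucket/orders/")
customers = spark.read.parquet("s3://example-bucket/customers/")

result = (
    orders.filter(F.col("status") == "shipped")           # filter
          .select("order_id", "customer_id", "amount")    # projection
          .join(customers, "customer_id")                 # join
          .groupBy("region")                              # aggregation
          .agg(F.sum("amount").alias("total_amount"))
)
result.write.mode("overwrite").parquet("s3://example-bucket/totals/")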
AI is transforming how senior data engineers and data scientists validate data transformations and conversions. Artificial intelligence-based verification approaches aid in the detection of anomalies, the enforcement of data integrity, and the optimization of pipelines for improved efficiency.
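A minimal sketch of one such anomaly check, using a plain robust statistic (median absolute deviation) rather than any particular AI product; the column name, values, and threshold are assumptions:

import pandas as pd

def flag_anomalies(df: pd.DataFrame, column: str, k: float = 3.5) -> pd.DataFrame:
    """Flag values far from the median, using the robust MAD estimate."""
    med = df[column].median()
    mad = (df[column] - med).abs().median()
    return df[(df[column] - med).abs() > k * 1.4826 * mad]

daily_loads = pd.DataFrame({"rows_loaded": [10_020, 9_950, 10_130, 9_980, 120]})
print(flag_anomalies(daily_loads, "rows_loaded"))  # flags the 120-row load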
Extrinsic Control Deficit: Many of these changes stem from tools and processes beyond the immediate control of the data team. Unregulated ETL/ELT Processes: The absence of stringent data quality tests in ETL (Extract, Transform, Load) or ELT (Extract, Load, Transform) processes further exacerbates the problem.
As data-centric AI, automated metadata management, and privacy-aware data sharing mature, the opportunity to embed data quality into the enterprise’s core has never been more significant. Data lives across siloed systems (ERP, CRM, cloud platforms, spreadsheets) with little integration or consistency.
Here are a few examples that we have seen of how this can be done: Batch ETL with Azure Data Factory and Azure Databricks: In this pattern, Azure Data Factory is used to orchestrate and schedule batch ETL processes. Azure Blob Storage serves as the data lake to store raw data.
A modern data platform entails maintaining data across multiple layers, targeting diverse platform capabilities like high performance, ease of development, cost-effectiveness, and DataOps features such as CI/CD, lineage, and unit testing. It does this by helping teams handle the T in ETL (extract, transform, and load) processes.
DataOps Observability can help you ensure that your complex data pipelines and processes are accurate and that they deliver as designed. Observability also validates that your data transformations, models, and reports are performing as expected, and lets you monitor your data operations without replacing staff or systems.
The new approach involved federating its vast and globally dispersed data repositories in the cloud with Amazon Web Services (AWS). Unifying its data within a centralized architecture allows AstraZeneca’s researchers to easily tag, search, share, transform, analyze, and govern petabytes of information at a scale unthinkable a decade ago.
What is the difference between business analytics and data analytics? Business analytics is a subset of data analytics. Data analytics is used across disciplines to find trends and solve problems using data mining, data cleansing, data transformation, data modeling, and more.
Before we dive in, let’s define strands of AI, Machine Learning, and Data Science: Business intelligence (BI) leverages software and services to transform data into actionable insights that inform an organization’s strategic and tactical business decisions. Once the model has been trained, it will need to be tested.
The space agency created and still uses “mission control” where many screens share detailed data about all aspects of a space flight. That shared information is the basis for monitoring mission status, making decisions and changes, and then communicating to all people involved. DataOps Observability Starts with Data Journeys.
Keeping data quality high ensures that the insights your end-users pull are aligned with reality and can help them (and the company at large) make smarter, data-driven decisions, as well as pipe quality information to customer-facing apps. All this contributes to your overall data integrity profile.
Amazon Redshift gives you more flexibility in how you apply data masking to protect sensitive information stored in SUPER columns containing semi-structured data.

SELECT * FROM svv_attached_masking_policy;

Now you can test that different users can see the same data masked differently based on their roles.
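For context, attaching a masking policy generally follows the pattern sketched below from Python using AWS’s redshift_connector driver. This is a hedged sketch: the connection details, table, column, and policy names are invented, and the exact DDL (especially for SUPER paths) should be verified against the Redshift documentation:

import redshift_connector  # AWS's Python driver for Redshift

# Placeholder connection details.
conn = redshift_connector.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    database="dev",
    user="awsuser",
    password="...",
)
cur = conn.cursor()

# Hypothetical policy: replace an SSN with a fixed masked literal.
cur.execute("""
    CREATE MASKING POLICY mask_ssn
    WITH (ssn VARCHAR(11))
    USING ('XXX-XX-XXXX'::VARCHAR(11));
""")
cur.execute("""
    ATTACH MASKING POLICY mask_ssn
    ON customers(ssn)
    TO PUBLIC;
""")

# Verify the attachment, as the post does.
cur.execute("SELECT * FROM svv_attached_masking_policy;")
print(cur.fetchall())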
To qualify for the aCAP exam, you need a master’s degree and less than three years of related experience in data or analytics. The exam tests general knowledge of the platform and applies to multiple roles, including administrator, developer, data analyst, data engineer, data scientist, and system architect.
Developers need to onboard new data sources, chain multiple data transformation steps together, and explore data as it travels through the flow. The side panel is context-sensitive and instantly displays relevant configuration information as you navigate through your flow components.
However, you might face significant challenges when planning for a large-scale data warehouse migration. This will enable right-sizing the Redshift data warehouse to meet workload demands cost-effectively. Additional considerations – Factor in additional tasks beyond schema conversion.
In addition to using native managed AWS services that BMS didn’t need to worry about upgrading, BMS was looking to offer an ETL service to non-technical business users that could visually compose data transformation workflows and seamlessly run them on the AWS Glue Apache Spark-based serverless data integration engine.
Writing SQL queries requires not just remembering the SQL syntax rules, but also knowledge of the tables’ metadata: data about table schemas, relationships among the tables, and possible column values. Can it also help write SQL queries? The answer is yes.
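A minimal sketch of the idea: supply the table metadata in the prompt so a model can ground its SQL. The schema text is invented, and ask_llm stands in for whichever chat-completion client you use; it is not a real library call:

# Sketch: ground SQL generation in table metadata.
SCHEMA = """
Table orders(order_id INT, customer_id INT, amount DECIMAL, status VARCHAR)
Table customers(customer_id INT, region VARCHAR)
status values: 'pending', 'shipped', 'cancelled'
"""

def build_prompt(question: str) -> str:
    return (
        "Given these tables and column values:\n"
        f"{SCHEMA}\n"
        f"Write a SQL query to answer: {question}\n"
        "Return only the SQL."
    )

prompt = build_prompt("Total shipped order amount per region")
# sql = ask_llm(prompt)  # placeholder for an actual LLM client call
print(prompt)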
We also share a Spark benchmark solution that suits all Amazon EMR deployment options, so you can replicate the process in your environment for your own performance test cases. The solution uses the TPC-DS dataset and unmodified data schema and table relationships, but derives queries from TPC-DS to support the SparkSQL test cases.
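To give a flavor of how a single SparkSQL test case can be timed (a generic sketch, not the benchmark solution itself; the table and query are stand-ins for the TPC-DS derived ones):

import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparksql-benchmark-sketch").getOrCreate()

# Stand-in for a TPC-DS table such as store_sales.
spark.range(1_000_000).withColumnRenamed("id", "ss_item_sk") \
     .createOrReplaceTempView("store_sales")

query = ("SELECT ss_item_sk % 100 AS bucket, COUNT(*) AS cnt "
         "FROM store_sales GROUP BY ss_item_sk % 100")

start = time.perf_counter()
spark.sql(query).collect()  # force full execution, not just query planning
elapsed = time.perf_counter() - start
print(f"query completed in {elapsed:.2f}s")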
Duplicating data from a production database to a lower or lateral environment and masking personally identifiable information (PII) to comply with regulations enables development, testing, and reporting without impacting critical systems or exposing sensitive customer data. See AWS Glue: How it works for further details.
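A simple illustration of the masking step itself, in pandas and independent of the Glue-based approach the post describes (the column names and masking rules are invented):

import hashlib
import pandas as pd

def mask_pii(df: pd.DataFrame) -> pd.DataFrame:
    """Pseudonymize direct identifiers while keeping data usable for testing."""
    out = df.copy()
    out["email"] = out["email"].map(
        lambda e: hashlib.sha256(e.encode()).hexdigest()[:10] + "@example.com"
    )
    out["ssn"] = "XXX-XX-" + out["ssn"].str[-4:]  # keep last four digits only
    return out

prod = pd.DataFrame({"email": ["jane@corp.com"], "ssn": ["123-45-6789"]})
print(mask_pii(prod))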
As businesses strive to make informed decisions, the amount of data being generated and required for analysis is growing exponentially. This trend is no exception for Dafiti , an ecommerce company that recognizes the importance of using data to drive strategic decision-making processes.
The techniques for managing organisational data in a standardised approach that minimises inefficiency. Extract, Transform, Load (ETL): the extraction of raw data, its transformation into a suitable format for business needs, and its loading into a data warehouse. Data transformation. Microsoft Azure.
dbt is an open source, SQL-first templating engine that allows you to write repeatable and extensible data transforms in Python and SQL. dbt is predominantly used by data warehouse customers (such as Amazon Redshift users) who are looking to keep their data transform logic separate from storage and engine.
For more information on this foundation, refer to A Detailed Overview of the Cost Intelligence Dashboard. Each CDH dataset has three processing layers: source (raw data), prepared (transformed data in Parquet), and semantic (combined datasets). Within each stage, it’s possible to create resources for storing actual data.
Customers are increasingly demanding access to real-time data, and freight transportation provider Estes Express Lines is among the rising tide of enterprises overhauling their data operations to deliver it. At one point, I had 15 people on my data team and seven of them were engaged only in data analysis.”
Modak Nabu relies on a framework of “Botworks,” a series of micro-jobs that accomplish various data transformation steps, from ingestion to profiling and indexing. Cloudera Data Engineering within CDP provides a fully managed Spark-on-Kubernetes service that hides the complexity of running production DE workloads at scale.
You can perform log analysis on these logs to understand users’ application behavior and patterns to make informed decisions. Analyzing VPC flow logs helps you understand how your applications are communicating over the VPC network and acts as a main source of information to the network in your VPC. Choose Create delivery stream.
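For a feel of the raw material, the default VPC flow log record format can be parsed with a few lines; the sample record below is synthetic:

# Default VPC flow log fields, in order (version 2 format).
FIELDS = ("version account_id interface_id srcaddr dstaddr srcport dstport "
          "protocol packets bytes start end action log_status").split()

sample = ("2 123456789012 eni-0a1b2c3d 10.0.1.5 10.0.2.9 443 49152 "
          "6 10 8400 1700000000 1700000060 ACCEPT OK")

record = dict(zip(FIELDS, sample.split()))
if record["action"] == "REJECT":
    print(f"blocked: {record['srcaddr']} -> {record['dstaddr']}:{record['dstport']}")
print(record["bytes"], "bytes between", record["srcaddr"], "and", record["dstaddr"])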
Traditionally, such a legacy call center analytics platform would be built on a relational database that stores data from streaming sources. Data transformations through stored procedures and use of materialized views to curate datasets and generate insights is a known pattern with relational databases.
Overview of the dataset being used: the dataset we use mimics a source that holds customer information. This source has a manual process of inserting and updating customer data, which has led to multiple instances of non-unique customers being represented with duplicate records.
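A minimal sketch of how such duplicates can be collapsed (in pandas, with invented columns; the post itself likely uses a different stack):

import pandas as pd

customers = pd.DataFrame({
    "customer_id": [101, 101, 102],
    "email": ["a@x.com", "a@x.com", "b@x.com"],
    "updated_at": ["2024-01-01", "2024-03-01", "2024-02-15"],
})

# Keep the most recent record per customer; duplicates came from manual entry.
deduped = (
    customers.sort_values("updated_at")
             .drop_duplicates(subset="customer_id", keep="last")
)
print(deduped)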
Predict – Data Engineering (Apache Spark). CDP Data Engineering – a service purpose-built for data engineers focused on deploying and orchestrating data transformation using Spark at scale. Data Visualization is in Tech Preview on AWS and Azure.
Be sure your test cases represent the diversity of app users. As an AI product manager, here are some important data-related questions you should ask yourself: What is the problem you’re trying to solve? What data transformations are needed from your data scientists to prepare the data?
Data transformation plays a pivotal role in providing the necessary data insights for businesses in any organization, small or large. To gain these insights, customers often perform ETL (extract, transform, and load) jobs from their source systems and output an enriched dataset.
The data products from the Business Vault and Data Mart stages are now available for consumers. smava decided to use Tableau for business intelligence, data visualization, and further analytics. The data transformations are managed with dbt to simplify the workflow governance and team collaboration.