The need for streamlined data transformations: as organizations increasingly adopt cloud-based data lakes and warehouses, the demand for efficient data transformation tools has grown. This approach helps manage storage costs while maintaining the flexibility to analyze historical trends when needed.
We need robust versioning for data, models, code, and preferably even the internal state of applications; think Git on steroids to answer the inevitable questions: What changed? The applications must be integrated with the surrounding business systems so ideas can be tested and validated in the real world in a controlled manner.
Together with price-performance, Amazon Redshift offers capabilities such as serverless architecture, machine learning integration within your data warehouse, and secure data sharing across the organization. dbt Cloud is a hosted service that helps data teams productionize dbt deployments. When configuring the connection, choose Test Connection.
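Before testing the connection from dbt Cloud, it can help to sanity-check connectivity from Python first. Below is a minimal sketch using the redshift_connector driver; the endpoint, database, and credentials are placeholders.

```python
# A minimal sketch of sanity-checking a Redshift connection from Python
# before wiring it into dbt Cloud, using the redshift_connector driver.
# The endpoint, database, and credentials below are placeholders.
import redshift_connector

conn = redshift_connector.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
    database="dev",
    user="awsuser",
    password="example-password",  # pull from a secrets manager in practice
)
cursor = conn.cursor()
cursor.execute("SELECT 1")  # cheap round trip to confirm connectivity
print(cursor.fetchone())    # expected: (1,)
cursor.close()
conn.close()
```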
Complex Data Transformations: Test Planning Best Practices. Ensuring data accuracy with structured testing and best practices. Data transformations and conversions are crucial for data pipelines, enabling organizations to process, integrate, and refine raw data into meaningful insights.
Selecting the strategies and tools for validating data transformations and data conversions in your data pipelines. Data transformations and data conversions are crucial to ensure that raw data is organized, processed, and ready for useful analysis.
Common challenges and practical mitigation strategies for reliable data transformations. Data transformations are essential processes in data engineering, enabling organizations to structure, enrich, and integrate data for analytics, reporting, and operational decision-making.
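The structured testing these pieces describe can start small. Below is a minimal sketch of post-transformation validation checks, assuming a pandas DataFrame named `orders` produced by the pipeline step under test; the columns and rules are illustrative.

```python
# A minimal sketch of structured post-transformation checks, assuming a
# pandas DataFrame `orders` produced by the pipeline step under test.
# The columns and rules are illustrative.
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "amount_usd": [19.99, 5.00, 42.50],
    "status": ["shipped", "pending", "shipped"],
})

def validate_orders(df: pd.DataFrame) -> list[str]:
    """Return human-readable failures; an empty list means the batch passes."""
    failures = []
    if df["order_id"].duplicated().any():
        failures.append("duplicate order_id values")
    if df["amount_usd"].lt(0).any():
        failures.append("negative amounts after conversion")
    if not df["status"].isin(["pending", "shipped", "cancelled"]).all():
        failures.append("unexpected status codes")
    return failures

assert validate_orders(orders) == [], "transformation output failed validation"
```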
Big data is revolutionizing many fields of business, and logistics analytics is no exception. The complex and ever-evolving nature of logistics makes it an essential use case for big data applications.
AI is transforming how senior data engineers and data scientists validate data transformations and conversions. AI-based verification approaches help detect anomalies, enforce data integrity, and optimize pipelines for improved efficiency.
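As a hedged illustration of the anomaly-detection idea, the sketch below flags a daily load whose row count deviates sharply from recent history using a simple z-score; the counts and the threshold of 3 are illustrative assumptions, not the article's method.

```python
# A hedged sketch of the anomaly-detection idea: flag a daily load whose
# row count deviates sharply from recent history via a simple z-score.
# The counts and the threshold of 3 are illustrative assumptions.
from statistics import mean, stdev

daily_row_counts = [10_120, 10_340, 9_980, 10_210, 10_400, 10_150, 4_230]

history, latest = daily_row_counts[:-1], daily_row_counts[-1]
z_score = (latest - mean(history)) / stdev(history)
if abs(z_score) > 3:
    print(f"anomaly: today's load of {latest} rows has z-score {z_score:.1f}")
```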
Let's go through the ten Azure data pipeline tools. Azure Data Factory: this cloud-based data integration service allows you to create data-driven workflows for orchestrating and automating data movement and transformation. You can use it for big data analytics and machine learning workloads.
However, you might face significant challenges when planning for a large-scale data warehouse migration. The scope typically includes the ETL processes that capture source data, the functional refinement and creation of data products, the aggregation for business metrics, and the consumption from analytics, business intelligence (BI), and ML.
Amazon EMR on EKS provides a deployment option for Amazon EMR that allows organizations to run open-source big data frameworks on Amazon Elastic Kubernetes Service (Amazon EKS). This performance-optimized runtime offered by Amazon EMR makes your Spark jobs run fast and cost-effectively. Tested EMR release: EMR 6.10.0.
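For readers who want to try this, the sketch below shows one way a Spark job might be submitted to EMR on EKS with boto3's `emr-containers` client; the virtual cluster ID, IAM role ARN, and S3 script path are placeholders.

```python
# A hedged sketch of submitting a Spark job to EMR on EKS with boto3's
# emr-containers client. The virtual cluster ID, IAM role ARN, and S3
# script path are placeholders.
import boto3

emr = boto3.client("emr-containers", region_name="us-east-1")

response = emr.start_job_run(
    name="example-spark-job",
    virtualClusterId="abcdef1234567890",  # placeholder
    executionRoleArn="arn:aws:iam::111122223333:role/emr-eks-job-role",  # placeholder
    releaseLabel="emr-6.10.0-latest",  # matches the EMR 6.10.0 release noted above
    jobDriver={
        "sparkSubmitJobDriver": {
            "entryPoint": "s3://example-bucket/jobs/etl.py",  # placeholder script
            "sparkSubmitParameters": "--conf spark.executor.instances=2",
        }
    },
)
print(response["id"])  # job run ID, usable with describe_job_run to poll status
```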
What is the difference between business analytics and data analytics? Business analytics is a subset of data analytics. Data analytics is used across disciplines to find trends and solve problems using data mining, data cleansing, data transformation, data modeling, and more.
If you can show ROI on a data warehouse, it would be a good use of your money to go with Omniture Discover, WebTrends Data Mart, or Coremetrics Explore. If you have evolved to a stage where you need behavior targeting, then get Omniture Test and Target or SiteSpect. Experimentation and Testing Tools [The "Why", Part 1]: Google Website Optimizer.
The exam tests general knowledge of the platform and applies to multiple roles, including administrator, developer, data analyst, data engineer, data scientist, and system architect. Candidates for the exam are tested on ML, AI solutions, NLP, computer vision, and predictive analytics.
The main driving factors include lower total cost of ownership, scalability, stability, improved ingestion connectors (such as Data Prepper, Fluent Bit, and OpenSearch Ingestion), elimination of external cluster managers like ZooKeeper, enhanced reporting, and rich visualizations with OpenSearch Dashboards.
Also known as data validation, integrity refers to the structural testing of data to ensure that the data complies with procedures. This means there are no unintended data errors and that the data corresponds to its appropriate designation. Here, it all comes down to the data transformation error rate.
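A minimal sketch of what a transformation error rate could look like in practice: the share of output records that fail their integrity checks. The records and rules below are illustrative.

```python
# A minimal sketch of a transformation error rate: the share of output
# records that fail their integrity checks. Records and rules are illustrative.
records = [
    {"id": 1, "age": 34},
    {"id": 2, "age": -5},    # fails: age out of range
    {"id": 3, "age": 51},
    {"id": 4, "age": None},  # fails: missing value
]

failed = sum(1 for r in records if r["age"] is None or not 0 <= r["age"] <= 130)
error_rate = failed / len(records)
print(f"transformation error rate: {error_rate:.0%}")  # 50%
```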
Data lakes provide a centralized repository for data from various sources, enabling organizations to unlock valuable insights and drive data-driven decision-making. However, as data volumes continue to grow, optimizing data layout and organization becomes crucial for efficient querying and analysis.
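One common layout optimization is hive-style partitioning. The sketch below, using pandas with pyarrow, writes a small dataset as Parquet partitioned by a query-friendly column so engines can prune files; the column names and values are illustrative.

```python
# A hedged sketch of one common layout optimization: writing a dataset as
# Parquet partitioned hive-style by a query-friendly column, so engines
# can prune files. Requires pyarrow; columns and values are illustrative.
import pandas as pd

events = pd.DataFrame({
    "event_date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "user_id": [101, 102, 103],
    "action": ["click", "view", "click"],
})

# Each distinct event_date becomes its own directory (event_date=2024-01-01/...),
# so a query filtered on event_date reads only the matching partition.
events.to_parquet("events_parquet/", partition_cols=["event_date"])
```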
Prompt with no metadata: for the first test, we used a basic prompt containing just the SQL-generating instructions and no table metadata. Enriching the prompt: you can enhance the prompts with query optimization rules like partition pruning. You can add more such query optimization rules to the instructions.
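The sketch below illustrates the enrichment idea in Python: the same SQL-generation instructions, optionally extended with table metadata and an optimization rule. The schema and rule text are illustrative assumptions, not the article's actual prompt.

```python
# A minimal sketch of the enrichment idea: the same SQL-generation
# instructions, optionally extended with table metadata and an optimization
# rule. The schema and rule text are illustrative assumptions.
BASE_PROMPT = "Write a SQL query that answers: {question}"

TABLE_METADATA = (
    "Table sales(order_id bigint, region string, amount double), "
    "partitioned by sale_date (string, YYYY-MM-DD)"
)

OPTIMIZATION_RULES = (
    "Always filter on the partition column sale_date to enable partition pruning."
)

def build_prompt(question: str, enriched: bool = True) -> str:
    prompt = BASE_PROMPT.format(question=question)
    if enriched:
        prompt += f"\n\n{TABLE_METADATA}\n\n{OPTIMIZATION_RULES}"
    return prompt

print(build_prompt("total sales per region for March 2024"))
```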
BMW Group uses 4,500 AWS Cloud accounts across the entire organization but is faced with the challenge of reducing unnecessary costs, optimizing spend, and having a central place to monitor costs. The ultimate goal is to raise awareness of cloud efficiency and optimize cloud utilization in a cost-effective and sustainable manner.
AWS offers Redshift Test Drive to validate whether the configuration chosen for Amazon Redshift is ideal for your workload before migrating the production environment. At this point, only one-time queries and those made by Amazon QuickSight reached the new cluster. We removed the DC2 cluster and completed the migration.
Data Warehouse: in addition to a number of performance optimizations, DW has added new features for better scalability, monitoring, and reliability to enable self-service access with security and performance. Predict: Data Engineering (Apache Spark).
Since the release of Cloudera Data Engineering (CDE) more than a year ago, our number one goal was operationalizing Spark pipelines at scale with first-class tooling designed to streamline automation and observability. This enabled new use cases with customers that were using a mix of Spark and Hive to perform data transformations.
Our approach: the migration initiative consisted of two main parts, building the new architecture and migrating data pipelines from the existing tool to the new architecture. Often, we would work on both in parallel, testing one component of the architecture while developing another.
Data transformation plays a pivotal role in providing the necessary data insights for businesses in any organization, small or large. To gain these insights, customers often perform ETL (extract, transform, and load) jobs from their source systems and output an enriched dataset.
This service supports a range of optimized AI models, enabling seamless and scalable AI inference. By leveraging the NVIDIA NeMo platform and optimized versions of open-source models like Llama 3 and Mistral, businesses can harness the latest advancements in natural language processing, computer vision, and other AI domains.
Building a starter version of anything can often be straightforward, but building something with enterprise-grade scale, security, resiliency, and performance typically requires knowledge of and adherence to battle-tested best practices, and using the right tools and features in the right scenario. Data Vault 2.0
Additionally, a TCO calculator generates the TCO estimate for an optimized EMR cluster to facilitate the migration. To optimize EMR cluster cost-effectiveness, the following table provides general guidelines for choosing the proper type of EMR cluster and Amazon Elastic Compute Cloud (Amazon EC2) instance family.
It supports modern analytical data lake operations such as create table as select (CTAS), upsert and merge, and time travel queries. Athena also supports creating views and performing VACUUM (snapshot expiration) on Apache Iceberg tables to optimize storage and performance. You can perform a bulk load using a CTAS statement.
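As a hedged illustration, a CTAS bulk load like the one described can be submitted through the Athena API with boto3; the database, table, and S3 locations below are placeholders.

```python
# A hedged sketch of submitting an Iceberg CTAS bulk load through the
# Athena API with boto3. Database, table, and S3 locations are placeholders.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

ctas = """
CREATE TABLE iceberg_db.orders_iceberg
WITH (table_type = 'ICEBERG',
      is_external = false,
      location = 's3://example-bucket/iceberg/orders/')
AS SELECT * FROM raw_db.orders
"""

response = athena.start_query_execution(
    QueryString=ctas,
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)
print(response["QueryExecutionId"])  # poll with get_query_execution
```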
Allows them to iteratively develop processing logic and test with as little overhead as possible. Plays nicely with existing CI/CD processes to promote a data pipeline to production. Provides monitoring, alerting, and troubleshooting for production data pipelines.
The Evolution of AI-Powered Assistance At Cloudera, we understand the challenges faced by data practitioners. The complexities of modern data workflows often translate into countless hours spent coding, debugging, and optimizing models.
Cloudera Data Warehouse: efficient batch data processing, complex data transformations, support for data rollup and summarization, and highly optimized time series queries. This reduces the total cost of ownership and frees internal resources for higher-priority tasks than Druid maintenance and optimization.
When you start the process of designing your data model for Amazon Keyspaces, it's essential to possess a comprehensive understanding of your access patterns, similar to the approach used in other NoSQL databases. Additionally, you can configure OpenSearch Ingestion to apply data transformations before delivery.
Amazon Redshift enables you to use SQL to analyze structured and semi-structured data across data warehouses, operational databases, and data lakes, using AWS-designed hardware and machine learning (ML) to deliver the best price-performance at scale.
YuniKorn is designed for big data workloads, natively supporting frameworks such as Spark, Flink, and TensorFlow to run efficiently on Kubernetes. It is optimized for performance and suitable for high-throughput, large-scale environments. The YuniKorn scheduler provides an optimal solution for managing resource quotas by using resource queues.
To fuel self-service analytics and provide the real-time information customers and internal stakeholders need to meet customers' shipping requirements, the Richmond, VA-based company, which operates a fleet of more than 8,500 tractors and 34,000 trailers, has embarked on a data transformation journey to improve data integration and data management.
Having run a data engineering program at Insight for several years, we've identified three broad categories of data engineers: software engineers who focus on building data pipelines. In some cases, they work to deploy data science models into production with an eye toward optimization, scalability, and maintainability.
In Transform to Win , we explore the challenges facing modern companies, diving into their individual digital transformations and the people who drive them. Learn about the changes they’re making to not just remain competitive, but win in the future to stand the test of time.
Sean Im, CEO, Samsung SDS America: "In the field of generative AI and foundation models, watsonx is a platform that will enable us to meet our customers' requirements in terms of optimization and security, while allowing them to benefit from the dynamism and innovations of the open-source community."
This method uses GZIP compression to optimize storage consumption and query performance. You can also use the data transformation feature of Data Firehose to invoke a Lambda function to perform data transformation in batches, as sketched below. You can test this solution yourself using the AWS Samples GitHub repository.
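The transformation Lambda follows a fixed contract: records arrive base64-encoded and must be returned with a `recordId`, a `result` status, and re-encoded `data`. A minimal sketch, where the uppercase transform is a placeholder:

```python
# A minimal sketch of a Data Firehose transformation Lambda: records arrive
# base64-encoded and must be returned with recordId, a result status, and
# re-encoded data. The uppercase transform is a placeholder.
import base64

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        payload = base64.b64decode(record["data"]).decode("utf-8")
        transformed = payload.upper()  # placeholder transformation
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(transformed.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```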
We use Apache Spark as our main data processing engine and have over 1,000 Spark applications running over massive amounts of data every day. These Spark applications implement our business logic, ranging from data transformation and machine learning (ML) model inference to operational tasks. Their costs were climbing.
Tricentis is the global leader in continuous testing for DevOps, cloud, and enterprise applications. Speed changes everything, and continuous testing across the entire CI/CD lifecycle is the key. Tricentis instills that confidence by providing software tools that enable Agile Continuous Testing (ACT) at scale.
Customers rely on data from different sources such as mobile applications, clickstream events from websites, historical data, and more to deduce meaningful patterns to optimize their products, services, and processes. If you’re testing on a different Amazon MWAA version, update the requirements file accordingly.
Within a large enterprise, a huge amount of data accumulates over the years: many decisions have been made and different methods have been tested. This is knowledge anyone could acquire, but it would take far longer than optimal. This is one of the main diagnostic tests.
DataBrew is a visual data preparation tool that enables you to clean and normalize data without writing any code. The more than 200 transformations it provides are now available for use in an AWS Glue Studio visual job. We can use knowledge of the data to optimize the join by filtering down to only the data we really need.
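The filter-before-join optimization translates directly to code. Below is a hedged PySpark sketch (Glue Studio generates similar Spark logic under the hood); the table paths, column names, and predicate are illustrative.

```python
# A hedged PySpark sketch of the filter-before-join optimization described
# above; Glue Studio generates similar Spark logic. Paths, column names,
# and the predicate are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("filter-before-join").getOrCreate()

orders = spark.read.parquet("s3://example-bucket/orders/")
customers = spark.read.parquet("s3://example-bucket/customers/")

# Keep only the rows we actually need before joining, so less data is shuffled.
recent = orders.filter(orders.order_year == 2024)
joined = recent.join(customers, on="customer_id", how="inner")
joined.show()
```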