AWS Glue Data Quality is built on Deequ, an open source tool developed and used at Amazon to calculate data quality metrics and verify data quality constraints and changes in the data distribution, so you can focus on describing how your data should look instead of implementing verification algorithms.
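Deequ itself runs on Apache Spark, so it cannot be shown standalone here; as a minimal pure-Python sketch of the declarative idea (you describe how the data should look, and the framework evaluates the constraints), consider the following. All names (`records`, `is_complete`, the constraint labels) are hypothetical and chosen for illustration, not taken from the Deequ API:

```python
# Illustrative only: a pure-Python sketch of declarative data-quality
# constraints in the spirit of Deequ (the real library runs on Spark).
records = [
    {"order_id": 1, "status": "shipped"},
    {"order_id": 2, "status": "pending"},
    {"order_id": 3, "status": None},
]

def is_complete(rows, column):
    """Fraction of rows where the given column is non-null."""
    return sum(r[column] is not None for r in rows) / len(rows)

# Declare what the data should look like, not how to verify it.
constraints = {
    "order_id is complete": lambda rows: is_complete(rows, "order_id") == 1.0,
    "status is mostly complete": lambda rows: is_complete(rows, "status") >= 0.6,
}

results = {name: check(records) for name, check in constraints.items()}
print(results)
```

In the real service, the equivalent declaration would be written as an AWS Glue Data Quality ruleset and evaluated against the table for you.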
In this post, we showcase how to use AWS Glue with AWS Glue Data Quality, sensitive data detection transforms, and AWS Lake Formation tag-based access control to automate data governance. For the purpose of this post, the following governance policies are defined: No PII data should exist in tables or columns tagged as public.
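The "no PII in public columns" policy above can be pictured as a scan over tagged columns. The toy example below is an assumption-laden sketch, not the AWS Glue sensitive data detection transform: the `table` layout, the tag names, and the email-only regex are all hypothetical, and a real regex-based check would cover far more identifier types:

```python
import re

# Illustrative only: a toy scan enforcing "no PII in columns tagged public".
# Real deployments would use the AWS Glue sensitive data detection
# transform rather than hand-written regexes.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def column_has_pii(values):
    """True if any value in the column looks like an email address."""
    return any(EMAIL_RE.search(str(v)) for v in values)

# Hypothetical table: column name -> (governance tag, sample values)
table = {
    "user_note": ("public", ["call me", "mail: jane@example.com"]),
    "country":   ("public", ["DE", "US"]),
}

violations = [
    col for col, (tag, values) in table.items()
    if tag == "public" and column_has_pii(values)
]
print(violations)
```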
Apache Airflow is an open source tool used to programmatically author, schedule, and monitor sequences of processes and tasks, referred to as workflows. Our solution uses an end-to-end ETL pipeline orchestrated by Amazon MWAA that looks for new incremental files in an Amazon S3 location in Account A, where the raw data is present.
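The "look for new incremental files" step in that pipeline boils down to set difference between what is in the bucket and what has already been processed. The sketch below uses plain Python sets under that assumption; in Amazon MWAA this would typically be an S3 sensor or a boto3 listing inside a DAG task, and the key names here are made up:

```python
# Illustrative only: the core "find new incremental files" step from the
# pipeline above, sketched with plain sets instead of real S3 listings.
def new_keys(current_listing, already_processed):
    """Return keys present in the bucket but not yet processed, sorted."""
    return sorted(set(current_listing) - set(already_processed))

processed = {"raw/2024/01/file1.csv"}
listing = ["raw/2024/01/file1.csv", "raw/2024/01/file2.csv"]
print(new_keys(listing, processed))
```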
From enhancing data lakes to empowering AI-driven analytics, AWS unveiled new tools and services that are set to shape the future of data and analytics. This zero-ETL integration reduces the complexity and operational burden of data replication to let you focus on deriving insights from your data.
They have dev, test, and production clusters running critical workloads and want to upgrade their clusters to CDP Private Cloud Base, gaining Hive-on-Tez for better ETL performance and Sentry-to-Ranger migration tools. Customer environment: the customer has three environments: development, test/QA, and production.
Data Pipeline has been a foundational service for getting customers off the ground for their extract, transform, load (ETL) and infra provisioning use cases. Before starting any production workloads after migration, you need to test your new workflows to ensure no disruption to production systems. Choose ETL jobs.
Over time, using the wrong tool for the job can wreak havoc on environmental health. Exercise caution when using CDSW as an all-purpose workflow management and scheduling tool. Spark is primarily used to create ETL workloads by data engineers and data scientists. So which open source pipeline tool is better, NiFi or Airflow?
Due to this limitation, the cost of failures of long-running extract, transform, and load (ETL) and batch queries on Trino was high in terms of completion time, compute wastage, and spend. These new enhancements in Trino with Amazon EMR provide improved resiliency for running ETL and batch workloads on Spot Instances with reduced costs.
Import existing Excel or CSV files, use the drag-and-drop feature to extract the mappings from your ETL scripts, or manually populate the inventory to then be visualized with the lineage analyzer. Data Cataloging: Catalog and sync metadata with data management and governance artifacts according to business requirements in real time.
The following are common asks from our customers: Is it possible to develop and test AWS Glue data integration jobs on my local laptop? The software development lifecycle on AWS defines the following six phases: Plan, Design, Implement, Test, Deploy, and Maintain. Test In the testing phase, you check the implementation for bugs.
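One common answer to "can I test my data integration job on my laptop?" is to keep the transform logic in a plain function separate from the job wiring, so the testing phase above needs no cloud resources at all. The function and its fields below are hypothetical, a sketch of the pattern rather than any AWS Glue API:

```python
# Illustrative only: keeping transform logic in a plain function makes it
# testable locally before it is wired into an AWS Glue job.
def normalize(row):
    """Hypothetical transform: trim strings and upper-case country codes."""
    return {
        "name": row["name"].strip(),
        "country": row["country"].strip().upper(),
    }

def test_normalize():
    assert normalize({"name": " Ada ", "country": "de"}) == {
        "name": "Ada",
        "country": "DE",
    }

test_normalize()
print("ok")
```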
BMW Cloud Efficiency Analytics (CLEA) is a homegrown tool developed within the BMW FinOps CoE (Center of Excellence) aiming to optimize and reduce costs across all these accounts. In this post, we explore how the BMW Group FinOps CoE implemented their Cloud Efficiency Analytics tool (CLEA), powered by Amazon QuickSight and Amazon Athena.
A data domain producer maintains its own ETL stack, using AWS Glue and AWS Lambda to process the data and AWS Glue DataBrew to profile it and prepare the data asset (data product) before cataloging it in the AWS Glue Data Catalog in their account. Producers ingest data into their S3 buckets through pipelines they manage, own, and operate.
Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it straightforward and cost-effective to analyze all your data using standard SQL and your existing extract, transform, and load (ETL); business intelligence (BI); and reporting tools. Add any necessary tags to the instance.
In order to mature our data marts, it became clear that we needed to provide Analysts and other data consumers with all tracked digital analytics data in our DWH as they depend on it for analyses, reporting, campaign evaluation, product development and A/B testing. We made sure to add appropriate tags for cost monitoring.
This data transformation tool enables data analysts and engineers to transform, test and document data in the cloud data warehouse. What tools do you use? For this use case, we leveraged Fivetran, dbt, and Alation to build an internal tool that estimates unique buyer personas. How does this help the end user?
But social extends even further; it includes the comments users make, assets they flag as favorites, the watches they set on assets, who they tag and how, and their level of expertise. At its best, a data catalog should empower data analysts, scientists, and anyone curious about data with tools to explore and understand it. Simply put?
SageMaker Lakehouse organizes data using logical containers called catalogs, enabling teams to seamlessly query and analyze data across their entire ecosystem, from S3 data lakes to Amazon Redshift warehouses, using familiar Apache Iceberg compatible tools. Under LF-Tags or catalog resources, select Named Data Catalog resources.
Traditionally, answering this question would involve multiple data exports, complex extract, transform, and load (ETL) processes, and careful data synchronization across systems. Users can write data to managed RMS tables using Iceberg APIs, Amazon Redshift, or Zero-ETL ingestion from supported data sources.