In this post, we provide benchmark results of running increasingly complex data quality rulesets over a predefined test dataset. The test dataset contains 104 columns and 1 million rows stored in Parquet format. We define eight different AWS Glue ETL jobs in which we run the data quality rulesets.
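For context, a ruleset in such a job is expressed in DQDL and evaluated with the EvaluateDataQuality transform inside a Glue for Spark script. The following is a minimal sketch of that pattern, not the post's actual benchmark code; the database, table, and column names (benchmark_db, test_dataset_parquet, id) are illustrative placeholders.

```python
# Minimal sketch: evaluating a DQDL ruleset inside an AWS Glue for Spark job.
# Database, table, and column names are placeholders, not the post's resources.
from awsglue.context import GlueContext
from awsgluedq.transforms import EvaluateDataQuality
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the 1M-row / 104-column Parquet test dataset from the Data Catalog
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="benchmark_db",           # hypothetical
    table_name="test_dataset_parquet"  # hypothetical
)

# A small DQDL ruleset; the benchmark jobs grow this list progressively
ruleset = """Rules = [
    RowCount > 900000,
    ColumnCount = 104,
    IsComplete "id"
]"""

results = EvaluateDataQuality.apply(
    frame=dyf,
    ruleset=ruleset,
    publishing_options={
        "dataQualityEvaluationContext": "benchmark_ruleset_1",
        "enableDataQualityCloudWatchMetrics": True,
        "enableDataQualityResultsPublishing": True,
    },
)
results.toDF().show(truncate=False)
```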
Lake Formation tag-based access control (LF-TBAC) is an authorization strategy that defines permissions based on attributes. In Lake Formation, these attributes are called LF-Tags. You can attach LF-Tags to Data Catalog resources, Lake Formation principals, and table columns, and then view the LF-Tags associated with a database.
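As a rough illustration of how LF-Tags are created and attached with boto3 (the classification tag key, its values, and the database, table, and column names below are placeholders):

```python
# Sketch: creating an LF-Tag and attaching it to a Data Catalog database and a
# table column with boto3. Tag keys/values and resource names are examples only.
import boto3

lf = boto3.client("lakeformation")

# Define the tag once per catalog
lf.create_lf_tag(TagKey="classification", TagValues=["public", "private"])

# Attach it to a database...
lf.add_lf_tags_to_resource(
    Resource={"Database": {"Name": "sales_db"}},
    LFTags=[{"TagKey": "classification", "TagValues": ["private"]}],
)

# ...and to a specific column
lf.add_lf_tags_to_resource(
    Resource={
        "TableWithColumns": {
            "DatabaseName": "sales_db",
            "Name": "orders",
            "ColumnNames": ["customer_email"],
        }
    },
    LFTags=[{"TagKey": "classification", "TagValues": ["private"]}],
)

# Inspect the LF-Tags associated with the database
print(lf.get_resource_lf_tags(Resource={"Database": {"Name": "sales_db"}}))
```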
AWS Glue is a serverless data integration service that enables you to run extract, transform, and load (ETL) workloads on your data at scale. Because an AWS Glue profile is a resource identified by an ARN, all the default IAM controls apply, including action-based, resource-based, and tag-based authorization.
In this post, we showcase how to use AWS Glue with AWS Glue Data Quality, sensitive data detection transforms, and AWS Lake Formation tag-based access control to automate data governance. For the purpose of this post, the following governance policies are defined: No PII data should exist in tables or columns tagged as public.
This post demonstrates how to orchestrate an end-to-end extract, transform, and load (ETL) pipeline using Amazon Simple Storage Service (Amazon S3), AWS Glue, and Amazon Redshift Serverless with Amazon MWAA. This is done by invoking AWS Glue ETL jobs and writing to data objects in a Redshift Serverless cluster in Account B.
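A minimal sketch of such a DAG is shown below, assuming Airflow 2.4 or later on MWAA and a recent amazon provider package (for GlueJobOperator and RedshiftDataOperator with Redshift Serverless workgroup support); the job name, workgroup, database, and COPY statement are placeholders.

```python
# Sketch of an Amazon MWAA DAG that chains a Glue ETL job and a Redshift
# Serverless load. Job, workgroup, and table names are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator
from airflow.providers.amazon.aws.operators.redshift_data import RedshiftDataOperator

with DAG(
    dag_id="s3_glue_redshift_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    transform = GlueJobOperator(
        task_id="run_glue_etl",
        job_name="orders-etl-job",       # existing Glue job (placeholder)
        wait_for_completion=True,
    )

    load = RedshiftDataOperator(
        task_id="copy_into_redshift",
        workgroup_name="analytics-wg",   # Redshift Serverless workgroup (placeholder)
        database="dev",
        sql="COPY public.orders FROM 's3://curated-bucket/orders/' IAM_ROLE default FORMAT PARQUET;",
    )

    transform >> load
```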
To gain these insights, customers often perform ETL (extract, transform, and load) jobs from their source systems and output an enriched dataset. This team is allowed to create AWS Glue for Spark jobs in development, test, and production environments. This feature helps reduce cost and optimize your ETL jobs.
Data Pipeline has been a foundational service for getting customers off the ground for their extract, transform, load (ETL) and infrastructure provisioning use cases. Before starting any production workloads after migration, you need to test your new workflows to ensure there is no disruption to production systems. Choose ETL jobs.
This zero-ETL integration reduces the complexity and operational burden of data replication to let you focus on deriving insights from your data. When end-users in the appropriate user groups access Amazon S3 using AWS Glue ETL for Apache Spark, they will then automatically have the necessary permissions to read and write data.
Due to this limitation, the cost of failures of long-running extract, transform, and load (ETL) and batch queries on Trino was high in terms of completion time, compute wastage, and spend. These new enhancements in Trino with Amazon EMR provide improved resiliency for running ETL and batch workloads on Spot Instances with reduced costs.
We illustrate a cross-account sharing use case, where a Lake Formation steward in producer account A shares a federated Hive database and tables using LF-Tags with consumer account B. The admin then sets up Lake Formation tag-based access control (LF-TBAC) on the federated Hive database and shares it with account B.
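On the producer side, that cross-account LF-TBAC share can be expressed roughly as follows with boto3; the consumer account ID and the domain tag expression are placeholders, not the post's actual values.

```python
# Sketch of the producer-side (account A) cross-account grant: share everything
# matching an LF-Tag expression with consumer account B. IDs/tags are placeholders.
import boto3

lf = boto3.client("lakeformation")

CONSUMER_ACCOUNT_B = "222222222222"  # placeholder account ID

# Grant DESCRIBE on databases matching the tag expression
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": CONSUMER_ACCOUNT_B},
    Resource={
        "LFTagPolicy": {
            "ResourceType": "DATABASE",
            "Expression": [{"TagKey": "domain", "TagValues": ["hive-federated"]}],
        }
    },
    Permissions=["DESCRIBE"],
    PermissionsWithGrantOption=["DESCRIBE"],
)

# Repeat at TABLE scope so the consumer can query the shared tables
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": CONSUMER_ACCOUNT_B},
    Resource={
        "LFTagPolicy": {
            "ResourceType": "TABLE",
            "Expression": [{"TagKey": "domain", "TagValues": ["hive-federated"]}],
        }
    },
    Permissions=["SELECT", "DESCRIBE"],
    PermissionsWithGrantOption=["SELECT", "DESCRIBE"],
)
```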
They have development, test, and production clusters running critical workloads and want to upgrade these clusters to CDP Private Cloud Base, adopting Hive-on-Tez for better ETL performance. Customer environment: the customer has three environments (development, test and QA, and production).
It also applies general software engineering principles like integrating with Git repositories, setting up DRYer code, adding functional test cases, and including external libraries. Tests are assertions you make about your models and other resources in your dbt project (such as sources, seeds, and snapshots).
Spark is primarily used by data engineers and data scientists to create ETL workloads. Impala only masquerades as an ETL pipeline tool: use NiFi or Airflow instead. It is common for Cloudera Data Platform (CDP) users to ‘test’ pipeline development and creation with Impala because it facilitates fast, iterative development and testing.
Test access to the producer cataloged Amazon S3 data using EMR Serverless in the consumer account. Test access using Athena queries in the consumer account. Test access using SageMaker Studio in the consumer account. It is recommended to use test accounts. The producer account will host the EMR cluster and S3 buckets.
Import existing Excel or CSV files, use the drag-and-drop feature to extract the mappings from your ETL scripts, or manually populate the inventory to then be visualized with the lineage analyzer. Data Cataloging: Catalog and sync metadata with data management and governance artifacts according to business requirements in real time.
The following are common asks from our customers: Is it possible to develop and test AWS Glue data integration jobs on my local laptop? The software development lifecycle on AWS defines the following six phases: Plan, Design, Implement, Test, Deploy, and Maintain. Test: in the testing phase, you check the implementation for bugs.
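One common way to answer the local-development question is to factor the job's transformation logic into plain PySpark functions and exercise them with pytest on a local SparkSession (or inside the AWS-provided Glue container image). A small sketch, with a hypothetical add_order_total transform and columns:

```python
# Sketch: unit-testing a Glue job's transform logic locally with pytest and a
# plain SparkSession. The transform function and columns are hypothetical.
import pytest
from pyspark.sql import SparkSession
from pyspark.sql import functions as F


def add_order_total(df):
    """Transform under test: would normally live in the Glue job script."""
    return df.withColumn("total", F.col("quantity") * F.col("unit_price"))


@pytest.fixture(scope="session")
def spark():
    return (
        SparkSession.builder.master("local[2]")
        .appName("glue-local-test")
        .getOrCreate()
    )


def test_add_order_total(spark):
    df = spark.createDataFrame([(2, 5.0), (3, 1.5)], ["quantity", "unit_price"])
    result = {r["quantity"]: r["total"] for r in add_order_total(df).collect()}
    assert result == {2: 10.0, 3: 4.5}
```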
It’s easier to map, move, and test data for regular maintenance of existing structures, movement from legacy systems to new systems during a merger or acquisition, or a modernization effort. The benefits include quicker project delivery, greater productivity, reduced costs, and support for digital transformation.
A data domain producer maintains its own ETL stack, using AWS Glue and AWS Lambda to process data and AWS Glue DataBrew to profile the data and prepare the data asset (data product) before cataloging it in the AWS Glue Data Catalog in their account. Producers ingest data into their S3 buckets through pipelines they manage, own, and operate.
Additionally, it manages table definitions in the AWS Glue Data Catalog, containing references to data sources and targets of extract, transform, and load (ETL) jobs in AWS Glue. It is possible to define stages (DEV, INT, PROD) in each layer to allow structured release and testing without affecting PROD.
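A rough sketch of registering such a table definition in the Data Catalog with boto3, using a stage-suffixed database name; every name, path, and column below is illustrative rather than taken from the post.

```python
# Sketch: registering a Parquet table definition in the AWS Glue Data Catalog
# with boto3, using a stage-suffixed database. All names are placeholders.
import boto3

glue = boto3.client("glue")

glue.create_table(
    DatabaseName="curated_dev",  # e.g. curated_dev / curated_int / curated_prod
    TableInput={
        "Name": "orders",
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        "StorageDescriptor": {
            "Columns": [
                {"Name": "order_id", "Type": "string"},
                {"Name": "amount", "Type": "double"},
            ],
            "Location": "s3://curated-bucket-dev/orders/",
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
        "PartitionKeys": [{"Name": "dt", "Type": "string"}],
    },
)
```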
Synthea is a synthetic patient generator that creates realistic patient data and associated medical records that can be used for testing healthcare software applications. To learn more about PyDeequ as a data testing framework, see Testing data quality at scale with PyDeequ. The checks themselves are built with a PyDeequ verification suite (onData(df).useRepository(metricsRepository).addCheck(…)), sketched in full below.
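Completed, the truncated PyDeequ call could look roughly like the following; the column names, S3 paths, and checks are placeholders rather than the post's actual ruleset.

```python
# Sketch completing the truncated PyDeequ call: run checks on a DataFrame and
# persist the metrics to a repository. Columns and paths are placeholders.
import pydeequ
from pydeequ.checks import Check, CheckLevel
from pydeequ.repository import FileSystemMetricsRepository, ResultKey
from pydeequ.verification import VerificationResult, VerificationSuite
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.jars.packages", pydeequ.deequ_maven_coord)
         .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
         .getOrCreate())

df = spark.read.parquet("s3://bucket/synthea/patients/")  # placeholder path

metricsRepository = FileSystemMetricsRepository(spark, "s3://bucket/dq-metrics.json")
resultKey = ResultKey(spark, ResultKey.current_milli_time(), {"dataset": "patients"})

check = (Check(spark, CheckLevel.Error, "Synthea patient checks")
         .isComplete("patient_id")   # placeholder column
         .isUnique("patient_id")
         .hasSize(lambda size: size > 0))

result = (VerificationSuite(spark)
          .onData(df)
          .useRepository(metricsRepository)
          .addCheck(check)
          .saveOrAppendResult(resultKey)
          .run())

VerificationResult.checkResultsAsDataFrame(spark, result).show(truncate=False)
```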
Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it straightforward and cost-effective to analyze all your data using standard SQL and your existing extract, transform, and load (ETL); business intelligence (BI); and reporting tools. Add any necessary tags to the instance.
Developers need to understand the application APIs, write implementation and test code, and maintain the code for future API changes. Optionally, provide a description for the flow and tags. Test the solution: log in to your Salesforce account, and edit any record in the Account object. Choose Create flow. Choose Next.
It’s obvious that the manual road is very challenging when it comes to discovering and synthesizing data that resides in different formats across thousands of unharvested, undocumented databases, applications, ETL processes, and procedural code.
To mature our data marts, it became clear that we needed to provide analysts and other data consumers with all tracked digital analytics data in our DWH, as they depend on it for analyses, reporting, campaign evaluation, product development, and A/B testing. We made sure to add appropriate tags for cost monitoring.
This data transformation tool enables data analysts and engineers to transform, test and document data in the cloud data warehouse. First, we pulled unique contact titles into a Google Sheet from Hubspot and other sources, and tagged them with internal attributes; this helps downstream analysts get work done.
But social extends even further; it includes the comments users make, assets they flag as favorites, the watches they set on assets, who they tag and how, and their level of expertise. People often use the same data sources, so capturing knowledge about that usage, (including their role, team, and expertise), layers on important context.
The data engineering team owns the extract, transform, and load (ETL) application that will process the raw data to create and maintain the Iceberg tables. The ETL application will use IAM role-based access to the Iceberg table, and the data analyst gets Lake Formation permissions to query the same tables. Choose Grant.
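The analyst-side grant might be expressed with boto3 roughly as below; the role ARN, database, and table names are placeholders.

```python
# Sketch: granting the data analyst Lake Formation SELECT on the Iceberg table
# while the ETL application keeps IAM role-based access. Names are placeholders.
import boto3

lf = boto3.client("lakeformation")

lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/data-analyst-role"
    },
    Resource={
        "Table": {
            "DatabaseName": "iceberg_db",
            "Name": "events",
        }
    },
    Permissions=["SELECT", "DESCRIBE"],
)
```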
Built-in zero-ETL connectors reduce data silos by integrating various data sources, enabling unified analytics across teams. Under LF-Tags or catalog resources, select Named Data Catalog resources. The Lake Formation admin verifies that the shared resources are accessible by running test queries in Athena. Choose Grant.
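The verification step can also be scripted; the following is a sketch using boto3 and Athena, with placeholder database, table, and output-location values.

```python
# Sketch: run a test query with Athena and poll for the result.
# Database, table, and output location are placeholders.
import time

import boto3

athena = boto3.client("athena")

execution = athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM shared_db.orders",
    QueryExecutionContext={"Database": "shared_db"},
    ResultConfiguration={"OutputLocation": "s3://athena-results-bucket/queries/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(rows)
```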
Traditionally, answering this question would involve multiple data exports, complex extract, transform, and load (ETL) processes, and careful data synchronization across systems. Users can write data to managed RMS tables using Iceberg APIs, Amazon Redshift, or Zero-ETL ingestion from supported data sources.
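As a sketch of the Iceberg API path, a Spark job could append rows through the DataFrameWriterV2 API as follows, assuming an Iceberg catalog named glue_catalog is already configured in the Spark session; the table and columns are placeholders.

```python
# Sketch: appending to an Iceberg table through Spark's DataFrameWriterV2 API,
# assuming a catalog named "glue_catalog" is configured in the session.
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("iceberg-append").getOrCreate()

updates = spark.createDataFrame([
    Row(order_id="o-1001", amount=42.5),
    Row(order_id="o-1002", amount=17.0),
])

# Appends new rows; Redshift and zero-ETL consumers would then see them in the
# same managed table.
updates.writeTo("glue_catalog.sales_db.orders").append()
```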