These data processing and analytical services support Structured Query Language (SQL) to interact with the data. Writing SQL queries requires not just remembering SQL syntax rules, but also knowledge of the tables' metadata: data about table schemas, relationships among the tables, and possible column values.
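To make this concrete, here is a minimal, self-contained sketch using SQLite (rather than any particular analytics service named above): the script first inspects the table metadata (schemas, foreign-key relationships, and possible column values) and only then composes the query. All table and column names are illustrative.

```python
import sqlite3

# Set up two small illustrative tables in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        amount REAL
    );
    INSERT INTO customers VALUES (1, 'EMEA'), (2, 'APAC');
    INSERT INTO orders VALUES (10, 1, 120.0), (11, 2, 80.0);
""")

# Table metadata: column names and types ...
print(conn.execute("PRAGMA table_info(orders)").fetchall())
# ... relationships between tables ...
print(conn.execute("PRAGMA foreign_key_list(orders)").fetchall())
# ... and possible column values.
print(conn.execute("SELECT DISTINCT region FROM customers").fetchall())

# Only with that metadata in hand can the query be written correctly.
query = """
    SELECT c.region, SUM(o.amount) AS revenue
    FROM orders o JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY c.region
"""
print(conn.execute(query).fetchall())
```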
Yet, despite growing investments in advanced analytics and AI, organizations continue to grapple with a persistent and often underestimated challenge: poor data quality. Fragmented systems, inconsistent definitions, legacy infrastructure and manual workarounds introduce critical risks.
Institutional Data & AI Platform architecture: The Institutional Division has implemented a self-service data platform that enables domain teams to build and manage data products autonomously. The following diagram illustrates the building blocks of the Institutional Data & AI Platform.
Since reporting is part of an effective DQM, we will also go through some examples of data quality metrics you can use to assess your efforts. But first, let’s define what data quality actually is. What is the definition of data quality? 2 – Data profiling. This is definitely not in line with reality.
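To make the metrics idea concrete, here is a minimal profiling sketch in Python with pandas; the column names and the three metrics chosen (completeness, uniqueness, validity) are illustrative assumptions, not a prescribed standard.

```python
import pandas as pd

# Small illustrative dataset with hypothetical columns.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, None],
    "email": ["a@example.com", None, "b@example", "c@example.com"],
})

# Completeness: share of non-null values per column.
completeness = df.notna().mean()

# Uniqueness: distinct non-null keys divided by non-null rows.
uniqueness = df["customer_id"].nunique(dropna=True) / df["customer_id"].notna().sum()

# Validity: share of emails matching a simple pattern.
validity = df["email"].str.contains(r"^[^@]+@[^@]+\.[^@]+$", na=False).mean()

print(completeness, uniqueness, validity, sep="\n")
```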
In addition to using native managed AWS services that BMS didn’t need to worry about upgrading, BMS was looking to offer an ETL service to non-technical business users that could visually compose data transformation workflows and seamlessly run them on the AWS Glue Apache Spark-based serverless data integration engine.
Leveraging AWS’s managed service was crucial for us to access business insights faster, apply standardized data definitions, and tap into generative AI potential. Publish data assets – As the data producer from the retail team, you must ingest individual data assets into Amazon DataZone.
The goal is to examine five major methods of verifying and validating data transformations in data pipelines, with an eye toward high-quality data deployment. First, we look at how unit and integration tests uncover transformation errors at an early stage. Applicability varies by transformation type.
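As an illustration of that first method, here is a small pytest-style unit test for a hypothetical pandas transformation; the function and column names are invented for the example.

```python
import pandas as pd

# The transformation under test: normalize currency amounts from cents to dollars.
def to_dollars(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["amount_usd"] = out["amount_cents"] / 100
    return out.drop(columns=["amount_cents"])

# A unit test that pins down both the schema and the values the
# transformation is expected to produce.
def test_to_dollars_converts_and_drops_source_column():
    raw = pd.DataFrame({"order_id": [1, 2], "amount_cents": [1250, 99]})
    result = to_dollars(raw)
    assert list(result.columns) == ["order_id", "amount_usd"]
    assert result["amount_usd"].tolist() == [12.5, 0.99]
```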
Data analysts and engineers use dbt to transform, test, and document data in the cloud data warehouse. Yet every dbt transformation contains vital metadata that is not captured – until now. Data Transformation in the Modern Data Stack. Improve data analysis accuracy.
How dbt Core helps data teams test, validate, and monitor complex data transformations and conversions. dbt Core, an open-source framework for developing, testing, and documenting SQL-based data transformations, has become a must-have tool for modern data teams as the complexity of data pipelines grows.
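For teams that want to run those checks from Python, dbt-core (1.5 and later) exposes a programmatic entry point; the sketch below assumes an existing dbt project and uses a hypothetical model selector.

```python
# A minimal sketch of invoking dbt Core's test command programmatically.
# The selector "stg_orders" is a hypothetical model name.
from dbt.cli.main import dbtRunner

runner = dbtRunner()
result = runner.invoke(["test", "--select", "stg_orders"])

# result.success is False if any data test failed, which makes this easy
# to wire into CI for validating transformations before deployment.
if not result.success:
    raise SystemExit("dbt tests failed")
```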
Led by Frank Pörschmann of iDIGMA GmbH, an IT industry veteran and data governance strategist, it examined “The What & Why of Data Governance.” The What: Data Governance Defined. Data governance has no standard definition. But he cautions that “this [notion] is something which is not really likely to happen.”
Building a Data Culture Within a Finance Department. Our finance users tell us that their first exposure to the Alation Data Catalog often comes soon after the launch of organization-wide data transformation efforts. After all, finance is one of the greatest consumers of data within a business.
Solution overview: The following diagram illustrates the solution architecture. The solution uses AWS Glue as an ETL engine to extract data from the source Amazon RDS database. Built-in data transformations then scrub columns containing PII using predefined masking functions. This saves time over manually defining schemas.
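The sketch below illustrates the masking idea with plain PySpark column functions rather than the Glue Studio built-in transform itself; the column names and masking rules are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Stand-in example for masking PII columns during an ETL step.
spark = SparkSession.builder.appName("pii-masking-sketch").getOrCreate()

df = spark.createDataFrame(
    [("Jane Doe", "jane.doe@example.com", "123-45-6789")],
    ["full_name", "email", "ssn"],
)

masked = (
    # Keep only the first character and the domain of the email address.
    df.withColumn("email", F.regexp_replace("email", r"(^[^@]).*(@.*$)", r"$1***$2"))
      # Fully redact the SSN column.
      .withColumn("ssn", F.lit("***-**-****"))
)
masked.show(truncate=False)
```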
It seamlessly consolidates data from various data sources within AWS, including AWS Cost Explorer (and forecasting with Cost Explorer ), AWS Trusted Advisor , and AWS Compute Optimizer. Data providers and consumers are the two fundamental users of a CDH dataset. You might notice that this differs slightly from traditional ETL.
dbt is an open source, SQL-first templating engine that allows you to write repeatable and extensible data transforms in Python and SQL. dbt is predominantly used by customers of data warehouses (such as Amazon Redshift) who are looking to keep their data transform logic separate from storage and engine.
A combination of Amazon Redshift Spectrum and COPY commands is used to ingest the survey data stored as CSV files. For files with unknown structures, AWS Glue crawlers are used to extract metadata and create table definitions in the Data Catalog.
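A rough sketch of issuing such a COPY command through the Amazon Redshift Data API with boto3 might look like the following; the cluster, database, table, S3 path, and IAM role are placeholder values.

```python
import boto3

# Submit a COPY statement to Redshift via the Data API (no direct DB connection needed).
client = boto3.client("redshift-data")

copy_sql = """
    COPY survey_responses
    FROM 's3://example-bucket/surveys/2024/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-copy-role'
    FORMAT AS CSV
    IGNOREHEADER 1;
"""

response = client.execute_statement(
    ClusterIdentifier="example-cluster",
    Database="analytics",
    DbUser="etl_user",
    Sql=copy_sql,
)
# The statement id can be polled with describe_statement to check completion.
print(response["Id"])
```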
The data flow life cycle with Cloudera DataFlow for the Public Cloud (CDF-PC): Data flows in CDF-PC follow a bespoke life cycle that starts with either creating a new draft from scratch or opening an existing flow definition from the Catalog.
We then discuss a solution to further enhance end-to-end observability by modifying the task definitions that make up the data pipeline. For this post, we focus on task definitions for two services, AWS Glue and Amazon EMR; however, the same method can be applied across different services.
The modern data stack is a data management system built out of cloud-based data systems. A given modern data stack will usually include components for data ingestion from your data sources, data transformation, data storage, and data analysis and reporting.
In this blog, we’ll delve into the critical role of governance and data modeling tools in supporting a seamless data mesh implementation and explore how erwin tools can be used in that role. erwin also provides data governance, metadata management and data lineage software called erwin Data Intelligence by Quest.
We chatted about industry trends, why decentralization has become a hot topic in the data world, and how metadata drives many data-centric use cases. But, through it all, Mohan says it’s critical to view everything through the same lens: gaining business value from data. Data fabric is a technology architecture.
Due to this low complexity, the solution uses AWS serverless services to ingest the data, transform it, and make it available for analytics. The Data Catalog now contains references to the machine-readable data. Use the Data Catalog and transform the hospital price transparency data.
You can also use the data transformation feature of Data Firehose to invoke a Lambda function to perform data transformation in batches. Athena is used to run geospatial queries on the location data stored in the S3 buckets. Choose Run.
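A minimal sketch of the Lambda handler shape that Firehose record transformation expects is shown below; the payload fields (such as speed_ms) are hypothetical, but the recordId/result/data response contract is the one Firehose requires.

```python
import base64
import json

def handler(event, context):
    # Firehose delivers a batch of base64-encoded records; each response
    # record must echo the recordId and report "Ok", "Dropped", or
    # "ProcessingFailed".
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        # Hypothetical enrichment: derive km/h from a speed_ms field.
        payload["speed_kmh"] = round(payload.get("speed_ms", 0) * 3.6, 2)
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(json.dumps(payload).encode()).decode(),
        })
    return {"records": output}
```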
This was, without question, a significant departure from traditional analytic environments, which often meant vendor lock-in and the inability to work with data at scale. Another unexpected challenge was the introduction of Spark as a processing framework for big data. Comprehensive data security and data governance (i.e.
Organizations have spent a lot of time and money trying to harmonize data across diverse platforms, including cleansing, uploading metadata, converting code, defining business glossaries, tracking data transformations and so on. And there’s control of that landscape to facilitate insight and collaboration and limit risk.
To ingest the data, smava uses a set of popular third-party customer data platforms complemented by custom scripts. After the data lands in Amazon S3, smava uses the AWS Glue Data Catalog and crawlers to automatically catalog the available data, capture the metadata, and provide an interface that allows querying all data assets.
For example, GPS, social media, and cell phone handoff data are modeled as graphs, while data catalogs, data lineage, and MDM tools leverage knowledge graphs for linking metadata with semantics. RDF is used extensively for data publishing and data interchange and is based on W3C and other industry standards.
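As a small illustration of RDF in Python, the sketch below uses rdflib to describe a dataset with a tiny, made-up vocabulary and serialize it as Turtle.

```python
from rdflib import Graph, Literal, Namespace, RDF, URIRef

# A hypothetical mini-vocabulary for catalog metadata.
EX = Namespace("http://example.org/catalog/")

g = Graph()
dataset = URIRef(EX["orders_table"])

# Describe the dataset and link it to a business term and an upstream source.
g.add((dataset, RDF.type, EX.Dataset))
g.add((dataset, EX.hasBusinessTerm, Literal("Customer Order")))
g.add((dataset, EX.derivedFrom, EX["raw_orders"]))

# Serialize as Turtle for publishing or interchange.
print(g.serialize(format="turtle"))
```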
This adds an additional ETL step, making the data even more stale. The data lakehouse was created to solve these problems. The data warehouse storage layer is removed from lakehouse architectures; instead, continuous data transformation is performed within the BLOB storage. Data fabric promotes data discoverability.
It’s for that reason that, even as the first BCBS-239 implementation deadline came into effect a few years ago, McKinsey reported that one-third of Global Systemically Important Banks had focused on “documenting data lineage up to the level of provisioning data elements and including data transformation.”
Data ingestion – Steps 1 and 2 use AWS DMS, which connects to the source database and moves full and incremental data (CDC) to Amazon S3 in Parquet format. Data transformation – Steps 3 and 4 represent an EMR Serverless Spark application (Amazon EMR 6.9). Let’s refer to this S3 bucket as the raw layer.
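A simplified PySpark sketch of that transformation step might look like the following: read the full-load and CDC Parquet files from the raw layer, keep only the latest change per key, and write a curated copy. The bucket paths, key column, change-timestamp column, and op flag are assumptions about the DMS output layout.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cdc-merge-sketch").getOrCreate()

# Read everything DMS has written to the raw layer (full load + CDC files).
raw = spark.read.parquet("s3://example-raw-layer/orders/")

latest = (
    # Keep only the most recent change per business key.
    raw.withColumn(
        "rn",
        F.row_number().over(
            Window.partitionBy("order_id").orderBy(F.col("change_timestamp").desc())
        ),
    )
    .filter(F.col("rn") == 1)
    .drop("rn")
    # Drop rows whose latest change is a delete (hypothetical op flag).
    .filter(F.col("op") != "D")
)

# Write the curated layer used by downstream consumers.
latest.write.mode("overwrite").parquet("s3://example-curated-layer/orders/")
```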
You should look for a data warehouse that is scalable, flexible, and efficient. Popular cloud data warehouses today include Snowflake, Databricks, and BigQuery. If your organization is large, you definitely need to look for robustness. Good data warehouses should be reliable. How Can I Build a Modern Data Stack?
Alternatively, you can use AWS Glue for Apache Spark, which provides built-in support for bucketing configurations during the data transformation process. AWS Glue allows you to define bucketing parameters, such as the number of buckets and the columns to bucket on, providing an optimized data layout for efficient querying with Athena.
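As an alternative illustration of the same layout decision, bucketing can also be expressed in an Athena CTAS statement rather than in the Glue job itself; the boto3 sketch below uses hypothetical database, table, column, and S3 location names.

```python
import boto3

athena = boto3.client("athena")

# CTAS statement that writes a bucketed copy of a table (names are placeholders).
ctas = """
    CREATE TABLE sales_bucketed
    WITH (
        format = 'PARQUET',
        external_location = 's3://example-curated/sales_bucketed/',
        bucketed_by = ARRAY['customer_id'],
        bucket_count = 16
    ) AS
    SELECT * FROM sales_raw;
"""

response = athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print(response["QueryExecutionId"])
```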
The quick and dirty definition of data mapping is the process of connecting different types of data from various data sources. Data mapping is a crucial step in data modeling and can help organizations achieve their business goals by enabling data integration, migration, transformation, and quality.
Introduction: Why should I read the definitive guide to embedded analytics? The Definitive Guide to Embedded Analytics is designed to answer any and all questions you have about the topic. It is now most definitely a need-to-have. Data Transformation and Enrichment: Data can be enriched for analysis.
Streaming pipelines used Spark Streaming to ingest real-time data from Kafka, writing raw datasets to an Amazon Simple Storage Service (Amazon S3) data lake while simultaneously loading them into BigQuery and Google Cloud Storage to build logical data layers.
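A simplified sketch of that ingestion pattern, written with Spark Structured Streaming rather than the original Spark Streaming API, could look like this; broker addresses, the topic name, and the S3 paths are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-to-s3-sketch").getOrCreate()

# Read real-time events from Kafka (placeholder broker and topic).
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092")
    .option("subscribe", "clickstream-events")
    .option("startingOffsets", "latest")
    .load()
    .select(F.col("value").cast("string").alias("raw_json"), "timestamp")
)

# Write raw datasets to the S3 data lake in micro-batches.
query = (
    events.writeStream.format("parquet")
    .option("path", "s3://example-data-lake/raw/clickstream/")
    .option("checkpointLocation", "s3://example-data-lake/checkpoints/clickstream/")
    .trigger(processingTime="1 minute")
    .start()
)
query.awaitTermination()
```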
Second, during data integration, data harmonization and standardization are essential. Different schemas, naming standards, and data definitions are frequently used by disparate source systems, which can lead to datasets that are incompatible or conflicting.
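A small pandas sketch of that harmonization step is shown below; the source systems, column mappings, and country-code normalization are invented for illustration.

```python
import pandas as pd

# Two hypothetical source systems with different naming and formatting conventions.
crm = pd.DataFrame({"CustId": [1], "SignupDt": ["2024-01-05"], "Ctry": ["DE"]})
billing = pd.DataFrame({"customer_id": [1], "signup_date": ["05.01.2024"], "country": ["Germany"]})

# Canonical column names and value mappings (illustrative).
CANONICAL_COLUMNS = {"CustId": "customer_id", "SignupDt": "signup_date", "Ctry": "country"}
COUNTRY_CODES = {"Germany": "DE"}

crm = crm.rename(columns=CANONICAL_COLUMNS)
crm["signup_date"] = pd.to_datetime(crm["signup_date"], format="%Y-%m-%d")

billing["signup_date"] = pd.to_datetime(billing["signup_date"], format="%d.%m.%Y")
billing["country"] = billing["country"].replace(COUNTRY_CODES)

# After harmonization the two sources share one schema and one set of definitions.
harmonized = pd.concat([crm, billing], ignore_index=True)
print(harmonized)
```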
We use the built-in features of Data Firehose, including AWS Lambda for necessary data transformation and Amazon Simple Notification Service (Amazon SNS) for near real-time alerts. AWS Glue – The AWS Glue Data Catalog is your persistent technical metadata store in the AWS Cloud.