Data Transformation, Metadata and Reference

Enriching metadata for accurate text-to-SQL generation for Amazon Athena

AWS Big Data

OCTOBER 14, 2024

These data processing and analytical services support Structured Query Language (SQL) to interact with the data. Writing SQL queries requires not just remembering the SQL syntax rules, but also knowledge of the table metadata: table schemas, relationships among the tables, and possible column values.
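As a rough illustration of the three kinds of metadata named above (schemas, relationships, and possible column values), the sketch below renders them into a text prompt for a text-to-SQL model. It is not the method from the article: the `TABLES` structure, the `render_metadata` and `build_prompt` helpers, and the prompt wording are all hypothetical, and no model or Athena call is made.

```python
# Hypothetical table metadata: schema, a relationship note, and sample column values.
TABLES = [
    {
        "name": "orders",
        "columns": [
            {"name": "order_id", "type": "bigint", "description": "Primary key"},
            {
                "name": "status",
                "type": "string",
                "description": "Order state",
                "sample_values": ["PENDING", "SHIPPED"],
            },
            {
                "name": "customer_id",
                "type": "bigint",
                "description": "References customers.customer_id",
            },
        ],
    },
]


def render_metadata(tables: list[dict]) -> str:
    """Turn table metadata into a plain-text description, one line per column."""
    lines = []
    for table in tables:
        lines.append(f"Table {table['name']}:")
        for col in table["columns"]:
            line = f"  - {col['name']} ({col['type']}): {col['description']}"
            samples = col.get("sample_values")
            if samples:
                line += f" [possible values: {', '.join(samples)}]"
            lines.append(line)
    return "\n".join(lines)


def build_prompt(question: str, tables: list[dict]) -> str:
    """Combine the metadata description and the user question into one prompt."""
    return (
        "Use only the tables and columns described below.\n\n"
        f"{render_metadata(tables)}\n\n"
        f"Question: {question}\nSQL:"
    )


prompt = build_prompt("How many orders have shipped?", TABLES)
print(prompt)
```

Listing possible values next to a column is what lets a generated query filter on `'SHIPPED'` rather than a guessed spelling.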

Metadata

Metadata Data Lake Modeling Data Warehouse

The Ultimate Guide to Modern Data Quality Management (DQM) For An Effective Data Quality Control Driven by The Right Metrics

datapine

SEPTEMBER 29, 2022

Because reporting is part of effective DQM, we will also go through some examples of data quality metrics you can use to assess your efforts. But first, let’s define what data quality actually is. What is the definition of data quality? Why Do You Need Data Quality Management? 2 – Data profiling.

Data Quality

Data Quality Metrics Data-driven Management

Modernize a legacy real-time analytics application with Amazon Managed Service for Apache Flink

AWS Big Data

OCTOBER 11, 2023

Traditionally, such a legacy call center analytics platform would be built on a relational database that stores data from streaming sources. Transforming data through stored procedures and using materialized views to curate datasets and generate insights are known patterns with relational databases.

Management

Management Metadata Analytics Dashboards


From Raw Inputs to Polished Outputs: The Art of Testing Data Transformations

Wayne Yaddow

MARCH 5, 2025

The goal is to examine five major methods of verifying and validating data transformations in data pipelines with an eye toward high-quality data deployment. First, we look at how unit and integration tests uncover transformation errors at an early stage. Applicability by Transformation Type 2.

Testing

Testing Data Transformation Statistics Metadata

Introducing a new unified data connection experience with Amazon SageMaker Lakehouse unified data connectivity

AWS Big Data

DECEMBER 16, 2024

With the ability to browse metadata, you can understand the structure and schema of the data source, identify relevant tables and fields, and discover useful data assets you may not be aware of. This new capability can simplify your data journey. To learn more, refer to Amazon SageMaker Unified Studio.

Visualization

Visualization Data Processing Testing Publishing

BMW Cloud Efficiency Analytics powered by Amazon QuickSight and Amazon Athena

AWS Big Data

NOVEMBER 15, 2023

For more information on this foundation, refer to A Detailed Overview of the Cost Intelligence Dashboard. It seamlessly consolidates data from various data sources within AWS, including AWS Cost Explorer (and forecasting with Cost Explorer), AWS Trusted Advisor, and AWS Compute Optimizer.

Dashboards

Dashboards Analytics Metadata Data Warehouse

Deliver decompressed Amazon CloudWatch Logs to Amazon S3 and Splunk using Amazon Data Firehose

AWS Big Data

APRIL 2, 2024

You can see the decompressed data has metadata information such as logGroup, logStream, and subscriptionFilters, and the actual data is included within the message field under logEvents (the following example shows CloudTrail events in CloudWatch Logs).
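A minimal sketch of reading one such decompressed record follows. The field names logGroup, logStream, subscriptionFilters, logEvents, and message come from the teaser above; the sample values, the `extract_messages` helper, and the flattened output shape are assumptions made for illustration, not the code from the article.

```python
import json

# Sample payload shaped like a decompressed CloudWatch Logs subscription record.
# All values are made up for illustration.
SAMPLE_RECORD = json.dumps({
    "messageType": "DATA_MESSAGE",
    "owner": "111122223333",
    "logGroup": "example-log-group",
    "logStream": "example-log-stream",
    "subscriptionFilters": ["example-filter"],
    "logEvents": [
        {"id": "1", "timestamp": 1712016000000,
         "message": "{\"eventName\": \"ListBuckets\"}"},
        {"id": "2", "timestamp": 1712016001000,
         "message": "{\"eventName\": \"GetObject\"}"},
    ],
})


def extract_messages(raw_record: str) -> list[dict]:
    """Return one flat dict per log event, carrying the record-level metadata."""
    record = json.loads(raw_record)
    return [
        {
            "logGroup": record["logGroup"],
            "logStream": record["logStream"],
            "subscriptionFilters": record["subscriptionFilters"],
            "timestamp": event["timestamp"],
            "message": event["message"],
        }
        for event in record.get("logEvents", [])
    ]


rows = extract_messages(SAMPLE_RECORD)
for row in rows:
    print(row["logGroup"], row["timestamp"], row["message"])
```

Each output row repeats the record-level metadata so that a single event can be traced back to its log group and stream after it lands downstream.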

Metadata

Metadata Marketing Analytics Data Transformation

Copy and mask PII between Amazon RDS databases using visual ETL jobs in AWS Glue Studio

AWS Big Data

AUGUST 26, 2024

Solution overview: The solution uses AWS Glue as an ETL engine to extract data from the source Amazon RDS database. Built-in data transformations then scrub columns containing PII using predefined masking functions. This saves time over manually defining schemas.

Visualization

Visualization Metadata Data Transformation Testing

Power enterprise-grade Data Vaults with Amazon Redshift – Part 1

AWS Big Data

NOVEMBER 16, 2023

Data Vault 2.0 allows for the following: agile data warehouse development; parallel data ingestion; a scalable approach to handle multiple data sources even on the same entity; a high level of automation; historization; and full lineage support. However, Data Vault 2.0 … JOB_NAME (All): The process name from the ETL framework.

Enterprise

Enterprise Data Warehouse Data Lake Optimization

How healthcare organizations can analyze and create insights using price transparency data

AWS Big Data

OCTOBER 11, 2023

Under the Transparency in Coverage (TCR) rule, hospitals and payors are required to publish their pricing data in a machine-readable format. For more information, refer to Delivering Consumer-friendly Healthcare Transparency in Coverage On AWS. The Data Catalog now contains references to the machine-readable data.

Visualization

Visualization Dashboards Data-driven Gap analysis

Automate discovery of data relationships using ML and Amazon Neptune graph technology

AWS Big Data

APRIL 19, 2023

Encounter 4 appears to refer to the customer with ID 8, but the email doesn’t match, and no Customer_ID is given. We took this a step further by creating a blueprint to create smart recommendations by linking similar data products using graph technology and ML.

Technology

Technology Data-driven Machine Learning Sales

Improve observability across Amazon MWAA tasks

AWS Big Data

FEBRUARY 6, 2023

In the next sections, we explore the following topics: the DAG file, in order to understand how to define and then pass the correlation ID in the AWS Glue and EMR tasks; and the code needed in the Python scripts to output information based on the correlation ID. Refer to the GitHub repo for the detailed DAG definition and Spark scripts.

Management

Management Interactive Publishing Metadata

Gain insights from historical location data using Amazon Location Service and AWS analytics services

AWS Big Data

MARCH 13, 2024

You can also use the data transformation feature of Data Firehose to invoke a Lambda function to perform data transformation in batches. Athena is used to run geospatial queries on the location data stored in the S3 buckets. You can test this solution yourself using the AWS Samples GitHub repository.

Analytics

Analytics IoT Metadata Internet of Things

How to use foundation models and trusted governance to manage AI workflow risk

IBM Big Data Hub

OCTOBER 16, 2023

AI governance refers to the practice of directing, managing and monitoring an organization’s AI activities. It includes processes that trace and document the origin of data, models and associated metadata and pipelines for audits. Capture and document model metadata for report generation.

Risk

Risk Modeling Management Metadata

Introducing Cloudera DataFlow Designer: Self-service, No-Code Dataflow Design

Cloudera

DECEMBER 9, 2022

Developers need to onboard new data sources, chain multiple data transformation steps together, and explore data as it travels through the flow. Figure 5: Parameter references in the configuration panel and auto-complete. Enabling self-service for developers. Interactivity when needed while saving costs.

Testing

Testing Cost-Benefit Interactive Visualization

Build incremental data pipelines to load transactional data changes using AWS DMS, Delta 2.0, and Amazon EMR Serverless

AWS Big Data

MARCH 3, 2023

Data ingestion – Steps 1 and 2 use AWS DMS, which connects to the source database and moves full and incremental data (CDC) to Amazon S3 in Parquet format. Let’s refer to this S3 bucket as the raw layer. Data transformation – Steps 3 and 4 represent an EMR Serverless Spark application (Amazon EMR 6.9

Data Lake

Data Lake Dashboards Metrics Metadata

Cross-account integration between SaaS platforms using Amazon AppFlow

AWS Big Data

APRIL 25, 2023

The following AWS services are used for data ingestion, processing, and load: Amazon AppFlow is a fully managed integration service that enables you to securely transfer data between SaaS applications like Salesforce, SAP, Marketo, Slack, and ServiceNow, and AWS services like Amazon S3 and Amazon Redshift, in just a few clicks.

Sales

Sales Visualization Software Metadata

Enforce fine-grained access control on Open Table Formats via Amazon EMR integrated with AWS Lake Formation

AWS Big Data

JANUARY 17, 2024

Another popular transaction data lake use case is incremental query. Incremental query refers to a query strategy that focuses on processing and analyzing only the new or updated data within a data lake since the last query. Melody Yang is a Senior Big Data Solution Architect for Amazon EMR at AWS.
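The idea of an incremental query can be sketched in plain Python: keep a checkpoint from the previous run and read only rows committed after it. This is an illustration under stated assumptions, not how Hudi, Iceberg, Delta Lake, or Amazon EMR implement it; the `Row` type, its `commit_ts` field, and the `incremental_query` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    key: str
    value: int
    commit_ts: int  # hypothetical commit timestamp recorded with each change


def incremental_query(rows: list[Row], last_checkpoint: int) -> tuple[list[Row], int]:
    """Return rows committed after last_checkpoint, plus the new checkpoint."""
    changed = [r for r in rows if r.commit_ts > last_checkpoint]
    # If nothing changed, the checkpoint stays where it was.
    new_checkpoint = max((r.commit_ts for r in changed), default=last_checkpoint)
    return changed, new_checkpoint


table = [Row("a", 1, 100), Row("b", 2, 200), Row("a", 3, 300)]

# First run: everything is new relative to checkpoint 0.
first_batch, checkpoint = incremental_query(table, last_checkpoint=0)

# A new commit arrives; the second run sees only that row.
table.append(Row("c", 4, 400))
second_batch, checkpoint = incremental_query(table, last_checkpoint=checkpoint)
print(second_batch, checkpoint)
```

The second call touches only the row added since the first call, which is the cost saving the incremental strategy is after.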

Data Lake

Data Lake Snapshot Big Data Data-driven

Optimize data layout by bucketing with Amazon Athena and AWS Glue to accelerate downstream queries

AWS Big Data

APRIL 25, 2024

Alternatively, you can use AWS Glue for Apache Spark, which provides built-in support for bucketing configurations during the data transformation process. AWS Glue allows you to define bucketing parameters, such as the number of buckets and the columns to bucket on, providing an optimized data layout for efficient querying with Athena.
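To make the two bucketing parameters concrete (the number of buckets and the column to bucket on), here is a small standalone sketch that assigns rows to buckets by hashing a column value. It does not use AWS Glue or Athena, and the CRC32 hash is an assumption chosen for determinism; those engines use their own hash functions, so bucket numbers here will not match theirs. The `bucket_for` and `bucketize` helpers are hypothetical.

```python
import zlib
from collections import defaultdict


def bucket_for(value: str, num_buckets: int) -> int:
    """Map a column value to a bucket index in [0, num_buckets)."""
    return zlib.crc32(value.encode("utf-8")) % num_buckets


def bucketize(rows: list[dict], column: str, num_buckets: int) -> dict[int, list[dict]]:
    """Group rows by the bucket of their value in the given column."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_for(str(row[column]), num_buckets)].append(row)
    return dict(buckets)


rows = [
    {"customer_id": "c-1", "amount": 10},
    {"customer_id": "c-2", "amount": 25},
    {"customer_id": "c-1", "amount": 7},
    {"customer_id": "c-3", "amount": 40},
]
layout = bucketize(rows, column="customer_id", num_buckets=4)

# A lookup for one customer only needs to read that customer's bucket.
target = bucket_for("c-1", 4)
matches = [r for r in layout[target] if r["customer_id"] == "c-1"]
print(target, matches)
```

Because every row with the same key lands in the same bucket, a query that filters on the bucketed column can skip the other buckets entirely.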

Optimization

Optimization Data Lake Cost-Benefit Reporting

The Modern Data Stack Explained: What The Future Holds

Alation

JANUARY 17, 2023

A modern data stack relies on cloud computing, whereas a legacy data stack stores data on servers instead of in the cloud. Modern data stacks provide access for more data professionals than a legacy data stack. Examples of data transformation tools include dbt and Dataform.

Data Warehouse

Data Warehouse Cost-Benefit Data Science Data Transformation

Build a data lake with Apache Flink on Amazon EMR

AWS Big Data

JANUARY 27, 2023

With a unified data catalog, you can quickly search datasets and figure out data schema, data format, and location. The AWS Glue Data Catalog provides a uniform repository where disparate systems can store and find metadata to keep track of data in data silos. Refer to Catalogs for more information.

Data Lake

Data Lake Metadata Business Analysis Data-driven

Build and manage your modern data stack using dbt and AWS Glue through dbt-glue, the new “trusted” dbt adapter

AWS Big Data

NOVEMBER 29, 2023

dbt is an open source, SQL-first templating engine that allows you to write repeatable and extensible data transforms in Python and SQL. dbt is predominantly used by customers of data warehouses (such as Amazon Redshift) who are looking to keep their data transform logic separate from storage and engine.

Data Lake

Data Lake Management Metrics Data Warehouse

Empower your Jira data in a data lake with Amazon AppFlow and AWS Glue

AWS Big Data

AUGUST 1, 2023

For GlueDatabaseName, enter a unique name for the Data Catalog database to hold the Jira data table metadata (the default is jiralake). This mode will scan all data and disable the change data capture (CDC) features of the stack. For full instructions, refer to Jira Cloud connector for Amazon AppFlow.

Data Lake

Data Lake Data Transformation Data-driven Cost-Benefit

Empowering data mesh: The tools to deliver BI excellence

erwin

APRIL 16, 2024

In this blog, we’ll delve into the critical role of governance and data modeling tools in supporting a seamless data mesh implementation and explore how erwin tools can be used in that role. erwin also provides data governance, metadata management and data lineage software called erwin Data Intelligence by Quest.

Metadata

Metadata Data Quality Data Governance Modeling

How Infomedia built a serverless data pipeline with change data capture using AWS Glue and Apache Hudi

AWS Big Data

MARCH 15, 2023

To populate the database, the Infomedia team developed a data pipeline using Amazon Simple Storage Service (Amazon S3) for data storage, AWS Glue for data transformations, and Apache Hudi for CDC and record-level updates.

Cost-Benefit

Cost-Benefit Data Processing Optimization Data-driven

Addressing the Three Scalability Challenges in Modern Data Platforms

Cloudera

NOVEMBER 22, 2021

In addition, more data is becoming available for processing / enrichment of existing and new use cases e.g., recently we have experienced a rapid growth in data collection at the edge and an increase in availability of frameworks for processing that data. As a result, alternative data integration technologies (e.g.,

Data Processing

Data Processing Data Warehouse Enterprise Visualization

Why The Public Sector Needs Data Governance

Alation

NOVEMBER 22, 2022

Before you implement a data governance framework, you need to know the data you already have. This means you need to: Inventory data: Know all information resources and relevant metadata. Classify data: Organize structured and unstructured data into relevant categories. Reuse metadata productively.

Data Governance

Data Governance Metadata Data-driven Unstructured Data

Tableau further democratizes analytics with AI-fueled features

CIO Business Intelligence

APRIL 30, 2024

Einstein Copilot for Tableau remains in beta, but Tableau announced two new features for the AI assistant as well: AI-assisted data transformation. This feature can automate a data transformation pipeline with step-by-step suggestions for preparing data for analysis.

Analytics

Analytics Metrics Visualization Dashboards

Ingest telemetry messages in near real time with Amazon API Gateway, Amazon Data Firehose, and Amazon Location Service

AWS Big Data

NOVEMBER 14, 2024

We use the built-in features of Data Firehose, including AWS Lambda for necessary data transformation and Amazon Simple Notification Service (Amazon SNS) for near real-time alerts. AWS Glue – The AWS Glue Data Catalog is your persistent technical metadata store in the AWS Cloud.

Data Lake

Data Lake Metadata Testing Data-driven

Introducing the HubSpot connector for AWS Glue

AWS Big Data

DECEMBER 2, 2024

AWS Glue establishes a secure connection to HubSpot using OAuth for authorization and TLS for data encryption in transit. AWS Glue also supports the ability to apply complex data transformations, enabling efficient data integration and preparation to meet your needs. For more information on AWS Glue, visit AWS Glue.

Data Lake

Data Lake Testing Data Integration Metadata

Streamline AWS WAF log analysis with Apache Iceberg and Amazon Data Firehose

AWS Big Data

FEBRUARY 18, 2025

These include managing complex extract, transform, and load (ETL) processes, handling schema validation, providing reliable delivery, and maintaining custom code for data transformations. Firehose delivers streaming data with configurable buffering options that can be optimized for near-zero latency.

Snapshot

Snapshot Optimization Data Lake Metadata

Hybrid big data analytics with Amazon EMR on AWS Outposts

AWS Big Data

JANUARY 29, 2025

We also use the AWS Glue Data Catalog as the external Hive compatible metastore, which serves as the central technical metadata catalog. The Data Catalog is a centralized metadata repository for all your data assets across various data sources. We also submit Spark jobs as a step on the EMR cluster.

Big Data

Big Data Data Analytics Analytics Interactive

“You Complete Me,” said Data Lineage to DataOps Observability.

DataKitchen

JANUARY 23, 2023

It is important to have additional tools and processes in place to understand the impact of data errors and to minimize their effect on the data pipeline and downstream systems. These operations can include data movement, validation, cleaning, transformation, aggregation, analysis, and more.

Testing

Testing Data Governance Data Quality Data-driven

What Is Embedded Analytics?

Jet Global

MAY 1, 2023

… that gathers data from many sources. Multi-Source Data Blending: Data from multiple sources is compiled and the output is a single view, metric, or visualization. Data Transformation and Enrichment: Data can be enriched for analysis. Ask your vendors for references. It’s all about context.

Analytics

Analytics Cost-Benefit Visualization Dashboards

What is Data Mapping?

Jet Global

FEBRUARY 23, 2024

This field guide to data mapping will explore how data mapping connects volumes of data for enhanced decision-making. Why Data Mapping is Important: Data mapping is a critical element of any data management initiative, such as data integration, data migration, data transformation, data warehousing, or automation.

Data Warehouse

Data Warehouse Reporting Data Transformation Visualization

Data Leaders Brief
