Data Transformation and Reference

Reference guide to build inventory management and forecasting solutions on AWS

AWS Big Data

APRIL 11, 2023

ElastiCache manages the real-time application data caching, allowing your customers to experience microsecond response times while supporting high-throughput handling of hundreds of millions of operations per second. In the inventory management and forecasting solution, AWS Glue is recommended for data transformation.

Forecasting

Forecasting Management IoT Data-driven

Streamline AI-driven analytics with governance: Integrating Tableau with Amazon DataZone

AWS Big Data

OCTOBER 30, 2024

You can now use your tool of choice, including Tableau, to quickly derive business insights from your data while using standardized definitions and decentralized ownership. Refer to the detailed blog post on how you can use this to connect through various other tools.

Analytics

Analytics Visualization Data Governance Data-driven

Accelerate your data workflows with Amazon Redshift Data API persistent sessions

AWS Big Data

NOVEMBER 22, 2024

You can create temporary tables once and reference them throughout, without having to constantly refresh database connections and restart from scratch. For the applicable limits, refer to Redshift Quotas and Limits. After 24 hours the session is forcibly closed, and in-progress queries are terminated.
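The session reuse described in this excerpt can be sketched as follows. This is a minimal illustration, not the post's own code: it assumes the `execute_statement` and `describe_statement` calls of the boto3 `redshift-data` client, with `SessionKeepAliveSeconds` on the first call and the returned `SessionId` on later calls, as I understand that API. The helper name `run_in_session`, the workgroup and database names, and the `_StubClient` class are hypothetical; the stub exists only so the sketch runs without AWS credentials.

```python
import time


def run_in_session(client, statements, workgroup, database, keep_alive_seconds=600):
    """Run SQL statements one after another in a single Data API session.

    `client` is expected to behave like boto3.client("redshift-data").
    The first call opens the session; later calls reuse its SessionId,
    so temporary tables created earlier remain visible.
    """
    session_id = None
    statement_ids = []
    for sql in statements:
        if session_id is None:
            # First statement: connect and ask for the session to be kept alive.
            response = client.execute_statement(
                WorkgroupName=workgroup,
                Database=database,
                Sql=sql,
                SessionKeepAliveSeconds=keep_alive_seconds,
            )
            session_id = response["SessionId"]
        else:
            # Later statements: reuse the session instead of reconnecting.
            response = client.execute_statement(Sql=sql, SessionId=session_id)
        statement_ids.append(response["Id"])

        # The Data API is asynchronous, so wait before sending the next statement.
        while True:
            status = client.describe_statement(Id=response["Id"])["Status"]
            if status in ("FINISHED", "FAILED", "ABORTED"):
                break
            time.sleep(1)
        if status != "FINISHED":
            raise RuntimeError(f"Statement {response['Id']} ended with status {status}")
    return session_id, statement_ids


class _StubClient:
    """Hypothetical stand-in for the real client so the sketch runs offline."""

    def __init__(self):
        self.calls = []

    def execute_statement(self, **kwargs):
        self.calls.append(kwargs)
        return {"Id": f"stmt-{len(self.calls)}", "SessionId": "session-1"}

    def describe_statement(self, Id):
        return {"Id": Id, "Status": "FINISHED"}


stub = _StubClient()
session_id, statement_ids = run_in_session(
    stub,
    [
        "CREATE TEMP TABLE recent_orders AS SELECT 1 AS order_id",
        "SELECT COUNT(*) FROM recent_orders",
    ],
    workgroup="example-workgroup",  # hypothetical name
    database="dev",                 # hypothetical name
)
print(session_id, statement_ids)
```

With a real client you would pass `boto3.client("redshift-data")` instead of the stub; check the parameter names against the current boto3 documentation before relying on them.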

Data Warehouse

Data Warehouse Recreation/Entertainment Cost-Benefit Data-driven

Ingest data from Google Analytics 4 and Google Sheets to Amazon Redshift using Amazon AppFlow

AWS Big Data

JANUARY 6, 2025

With Amazon AppFlow, you can run data flows at nearly any scale and at the frequency you choose: on a schedule, in response to a business event, or on demand. You can configure data transformation capabilities such as filtering and validation to generate rich, ready-to-use data as part of the flow itself, without additional steps.

Analytics

Analytics Data Warehouse Big Data Metrics

Amazon OpenSearch Service launches flow builder to empower rapid AI search innovation

AWS Big Data

MAY 2, 2025

This middleware consists of custom code that runs data flows to stitch data transformations, search queries, and AI enrichments in varying combinations tailored to use cases, datasets, and requirements. Ingest flows are created to enrich data as it's added to an index. Flows are a pipeline of processor resources.

Machine Learning

Machine Learning Visualization Dashboards Metadata

The Ultimate Guide to Modern Data Quality Management (DQM) For An Effective Data Quality Control Driven by The Right Metrics

datapine

SEPTEMBER 29, 2022

Reporting being part of an effective DQM, we will also go through some data quality metrics examples you can use to assess your efforts in the matter. But first, let’s define what data quality actually is. What is the definition of data quality? Why Do You Need Data Quality Management? date, month, and year).

Data Quality

Data Quality Metrics Data-driven Management

Functional Gaps in Your Data Transformation Testing Tools?

Wayne Yaddow

FEBRUARY 11, 2025

Managing tests of complex data transformations when automated data testing tools lack important features? Data transformations are at the core of modern business intelligence, blending and converting disparate datasets into coherent, reliable outputs.

Testing

Testing Data Transformation Data Quality Statistics

Amazon Q data integration adds DataFrame support and in-prompt context-aware job creation

AWS Big Data

DECEMBER 20, 2024

Your generated jobs can use a variety of data transformations, including filters, projections, unions, joins, and aggregations, giving you the flexibility to handle complex data processing requirements. In this post, we discuss how Amazon Q data integration transforms ETL workflow development.

Data Integration

Data Integration Visualization Data Processing Big Data

Ensuring Data Transformation Results with Great Expectations

Wayne Yaddow

MARCH 12, 2025

Data quality rules are codified into structured Expectation Suites by Great Expectations instead of relying on ad-hoc scripts or manual checks. The framework ensures that your data transformations comply with rigorous specifications from the moment they are created through every iteration of your pipeline.

Data Transformation

Data Transformation Data Quality Testing Data Warehouse

From Raw Inputs to Polished Outputs: The Art of Testing Data Transformations

Wayne Yaddow

MARCH 5, 2025

The goal is to examine five major methods of verifying and validating data transformations in data pipelines with an eye toward high-quality data deployment. First, we look at how unit and integration tests uncover transformation errors at an early stage.

Testing

Testing Data Transformation Statistics Metadata

Key Challenges Affecting Data Transformations—Dev and Testing

Wayne Yaddow

FEBRUARY 6, 2025

Common challenges and practical mitigation strategies for reliable data transformations. Data transformations are important processes in data engineering, enabling organizations to structure, enrich, and integrate data for analytics, reporting, and operational decision-making.

Testing

Testing Data Transformation Data-driven Manufacturing

Data Engineers Are Using AI to Verify Data Transformations

Wayne Yaddow

FEBRUARY 26, 2025

AI is transforming how senior data engineers and data scientists validate data transformations and conversions. Artificial intelligence-based verification approaches aid in the detection of anomalies, the enforcement of data integrity, and the optimization of pipelines for improved efficiency.

Data Transformation

Data Transformation Testing Data-driven Data Quality

Enriching metadata for accurate text-to-SQL generation for Amazon Athena

AWS Big Data

OCTOBER 14, 2024

Adding data transformation details to metadata can be challenging because of the dispersed nature of this information across data processing pipelines, making it difficult to extract and incorporate into table-level metadata. Maintaining lists of possible values for the columns requires continuous updates.

Metadata

Metadata Data Lake Modeling Data Warehouse

MLOps and DevOps: Why Data Makes It Different

O'Reilly on Data

OCTOBER 19, 2021

but to reference concrete tooling used today in order to ground what could otherwise be a somewhat abstract exercise. Adapted from the book Effective Data Science Infrastructure. Let’s now take a tour of the various layers, to begin to map the territory. Along the way, we’ll provide illustrative examples. Model Development.

IT

IT Testing Experimentation Software

Introducing a new unified data connection experience with Amazon SageMaker Lakehouse unified data connectivity

AWS Big Data

DECEMBER 16, 2024

With the ability to browse metadata, you can understand the structure and schema of the data source, identify relevant tables and fields, and discover useful data assets you may not be aware of. This new capability can simplify your data journey. To learn more, refer to Amazon SageMaker Unified Studio.

Visualization

Visualization Data Processing Testing Publishing

How the BMW Group analyses semiconductor demand with AWS Glue

AWS Big Data

APRIL 26, 2023

The Cloud Data Hub processes and combines anonymized data from vehicle sensors and other sources across the enterprise to make it easily accessible for internal teams creating customer-facing and internal applications. To learn more about the Cloud Data Hub, refer to BMW Group Uses AWS-Based Data Lake to Unlock the Power of Data.

Forecasting

Forecasting Manufacturing Data Lake Big Data

Modernize a legacy real-time analytics application with Amazon Managed Service for Apache Flink

AWS Big Data

OCTOBER 11, 2023

Traditionally, such a legacy call center analytics platform would be built on a relational database that stores data from streaming sources. Transforming data through stored procedures and using materialized views to curate datasets and generate insights is a known pattern with relational databases.

Management

Management Metadata Analytics Dashboards

The Best Data Management Tools For Small Businesses

Smart Data Collective

APRIL 29, 2020

What is data management? Data management can be defined in many ways. Usually the term refers to the practices, techniques and tools that allow access and delivery through different fields and data structures in an organisation. Extraction, Transform, Load (ETL). Data transformation.

Management

Management Data Warehouse Digital Transformation Dashboards

Introducing blueprint discovery and other UI enhancements for Amazon OpenSearch Ingestion

AWS Big Data

MAY 22, 2024

Amazon OpenSearch Ingestion is a fully managed serverless pipeline that allows you to ingest, filter, transform, enrich, and route data to an Amazon OpenSearch Service domain or Amazon OpenSearch Serverless collection. You can control the costs OCUs incur by configuring maximum OCUs that a pipeline is allowed to scale.

Data Architecture

Data Architecture Visualization Data Transformation Management

Unlocking near real-time analytics with petabytes of transaction data using Amazon Aurora Zero-ETL integration with Amazon Redshift and dbt Cloud

AWS Big Data

NOVEMBER 27, 2024

Together with price-performance, Amazon Redshift offers capabilities such as serverless architecture, machine learning integration within your data warehouse and secure data sharing across the organization. dbt Cloud is a hosted service that helps data teams productionize dbt deployments.

Data Warehouse

Data Warehouse Analytics Testing Sales

BMW Cloud Efficiency Analytics powered by Amazon QuickSight and Amazon Athena

AWS Big Data

NOVEMBER 15, 2023

For more information on this foundation, refer to A Detailed Overview of the Cost Intelligence Dashboard. Additionally, it manages table definitions in the AWS Glue Data Catalog, containing references to data sources and targets of extract, transform, and load (ETL) jobs in AWS Glue.

Analytics

Analytics Dashboards Metadata Data Warehouse

Transition from Amazon CloudSearch to Amazon OpenSearch Service

AWS Big Data

JULY 25, 2024

OpenSearch Ingestion can ingest data from a wide variety of sources, such as Amazon Simple Storage Service (Amazon S3) buckets and HTTP endpoints, and has a rich ecosystem of built-in processors to take care of your most complex data transformation needs.

Cost-Benefit

Cost-Benefit Machine Learning Dashboards Management

Automating the Automators: Shift Change in the Robot Factory

O'Reilly on Data

JANUARY 17, 2023

I, thankfully, learned this early in my career, at a time when I could still refer to myself as a software developer. Instead of invoking the open-source scikit-learn or Keras calls to build models, your team now goes from Pandas data transforms straight to … the API calls for AWS AutoPilot or GCP Vertex AI.

Machine Learning

Machine Learning Predictive Modeling Software Modeling

Deliver decompressed Amazon CloudWatch Logs to Amazon S3 and Splunk using Amazon Data Firehose

AWS Big Data

APRIL 2, 2024

With these settings, you can now seamlessly ingest decompressed CloudWatch log data into Splunk using Firehose. Pricing: The Firehose decompression feature decompresses the data and charges per GB of decompressed data. To understand decompression pricing, refer to Amazon Data Firehose pricing.

Metadata

Metadata Marketing Analytics Data Transformation

Stream VPC Flow Logs to Datadog via Amazon Kinesis Data Firehose

AWS Big Data

JUNE 20, 2023

Kinesis Data Firehose is a fully managed service for delivering near-real-time streaming data to various destinations for storage and performing near-real-time analytics. You can perform analytics on VPC flow logs delivered from your VPC using the Kinesis Data Firehose integration with Datadog as a destination.

Dashboards

Dashboards Visualization Metrics Data Transformation

Stream data to Amazon S3 for real-time analytics using the Oracle GoldenGate S3 handler

AWS Big Data

AUGUST 8, 2024

Oracle GoldenGate for Oracle Database and Big Data adapters: Oracle GoldenGate is a real-time data integration and replication tool used for disaster recovery, data migrations, and high availability. You can use temporary credentials; for more details, refer to Using temporary credentials with AWS resources.

Analytics

Analytics Big Data Software Data Integration

Tableau further democratizes analytics with AI-fueled features

CIO Business Intelligence

APRIL 30, 2024

Einstein Copilot for Tableau remains in beta, but Tableau announced two new features for the AI assistant as well: AI-assisted data transformation. This feature can automate a data transformation pipeline with step-by-step suggestions for preparing data for analysis.

Analytics

Analytics Metrics Visualization Dashboards

Build and manage your modern data stack using dbt and AWS Glue through dbt-glue, the new “trusted” dbt adapter

AWS Big Data

NOVEMBER 29, 2023

dbt is an open source, SQL-first templating engine that allows you to write repeatable and extensible data transforms in Python and SQL. dbt is predominantly used by data warehouse (such as Amazon Redshift) customers who are looking to keep their data transform logic separate from storage and engine.

Data Lake

Data Lake Management Metrics Data Warehouse

An AI Chat Bot Wrote This Blog Post …

DataKitchen

DECEMBER 9, 2022

The goal of DataOps is to help organizations make better use of their data to drive business decisions and improve outcomes. ChatGPT> DataOps is a term that refers to the set of practices and tools that organizations use to improve the quality and speed of data analytics and machine learning.

Machine Learning

Machine Learning Data-driven Optimization Data Analytics

How Open Universities Australia modernized their data platform and significantly reduced their ETL costs with AWS Cloud Development Kit and AWS Step Functions

AWS Big Data

JANUARY 30, 2025

We set up our AWS CDK to refer to the contents of a specific directory and define a resource (for example, an AWS Step Functions state machine or an AWS Glue job) for each file it found in that directory. We also used it as a repository for storing code that could be retrieved and used by other services.

Data Warehouse

Data Warehouse Data Architecture Machine Learning Data Transformation

Improve observability across Amazon MWAA tasks

AWS Big Data

FEBRUARY 6, 2023

In the next sections, we explore the following topics: the DAG file, in order to understand how to define and then pass the correlation ID in the AWS Glue and EMR tasks; and the code needed in the Python scripts to output information based on the correlation ID. Refer to the GitHub repo for the detailed DAG definition and Spark scripts.

Management

Management Interactive Publishing Metadata

Time for New Partnership Paradigms to Be Future-fit

CIO Business Intelligence

DECEMBER 6, 2023

Airbus was conceiving an ambitious plan to develop an open aviation data platform, Skywise, as a single platform of reference for all major aviation players that would enable them to improve their operational performance and business results and support Airbus’ own digital transformation.

Digital Transformation

Digital Transformation Software Cost-Benefit Manufacturing

Amazon Redshift data ingestion options

AWS Big Data

SEPTEMBER 5, 2024

If storing operational data in a data warehouse is a requirement, synchronization of tables between operational data stores and Amazon Redshift tables is supported. In scenarios where data transformation is required, you can use Redshift stored procedures to modify data in Redshift tables.

IoT

IoT Data Warehouse Cost-Benefit Reporting

Empower your Jira data in a data lake with Amazon AppFlow and AWS Glue

AWS Big Data

AUGUST 1, 2023

For full instructions, refer to Jira Cloud connector for Amazon AppFlow. You can do this by updating the CloudFormation stack with a flag that includes the CDC and data transformation steps. This will enable both the CDC steps and the data transformation steps for the Jira data. Choose Update.

Data Lake

Data Lake Data Transformation Data-driven Cost-Benefit

Introducing Amazon Q data integration in AWS Glue

AWS Big Data

APRIL 30, 2024

Amazon Q Developer can now generate complex data integration jobs with multiple sources, destinations, and data transformations. Generated jobs can use a variety of data transformations, including filter, project, union, join, and custom user-supplied SQL.

Data Integration

Data Integration Data Lake Data Warehouse Software

Automate discovery of data relationships using ML and Amazon Neptune graph technology

AWS Big Data

APRIL 19, 2023

Encounter 4 appears to refer to the customer with ID 8, but the email doesn’t match, and no Customer_ID is given. To learn more about ML in Neptune, refer to Amazon Neptune ML for machine learning on graphs. You can also explore Neptune notebooks demonstrating ML and data science for graphs.

Technology

Technology Data-driven Machine Learning Sales

Ten new visual transforms in AWS Glue Studio

AWS Big Data

MAY 9, 2023

AWS Glue Studio is a graphical interface that makes it easy to create, run, and monitor extract, transform, and load (ETL) jobs in AWS Glue. It allows you to visually compose data transformation workflows using nodes that represent different data handling steps, which later are converted automatically into code to run.

Visualization

Visualization Marketing Big Data IT

Unlock scalable analytics with a secure connectivity pattern in AWS Glue to read from or write to Snowflake

AWS Big Data

AUGUST 19, 2024

Refer to Enabling AWS PrivateLink in the Snowflake documentation to verify the steps, required access level, and service level to set the configurations. For Data sources, search for and select Snowflake. To obtain the Snowflake PrivateLink account URL, refer to parameters obtained in the prerequisites. Choose Next.

Analytics

Analytics Data-driven Data Integration Data Lake

Build a data lake with Apache Flink on Amazon EMR

AWS Big Data

JANUARY 27, 2023

For example, the Flink FileSystem connector has FileSystemTableFactory to read/write data in Hadoop Distributed File System (HDFS) or Amazon Simple Storage Service (Amazon S3), the Flink HBase connector has HBase2DynamicTableFactory to read/write data in HBase, and the Flink Kafka connector has KafkaDynamicTableFactory to read/write data in Kafka.

Data Lake

Data Lake Metadata Business Analysis Data-driven

How healthcare organizations can analyze and create insights using price transparency data

AWS Big Data

OCTOBER 11, 2023

Under the Transparency in Coverage (TCR) rule, hospitals and payors are required to publish their pricing data in a machine-readable format. For more information, refer to Delivering Consumer-friendly Healthcare Transparency in Coverage On AWS. The Data Catalog now contains references to the machine-readable data.

Visualization

Visualization Dashboards Data-driven Gap analysis

Power enterprise-grade Data Vaults with Amazon Redshift – Part 1

AWS Big Data

NOVEMBER 16, 2023

Data Vault 2.0 allows for the following: agile data warehouse development; parallel data ingestion; a scalable approach to handle multiple data sources even on the same entity; a high level of automation; historization; and full lineage support. However, Data Vault 2.0 JOB_NAME All The process name from the ETL framework.

Enterprise

Enterprise Data Warehouse Data Lake Optimization

How Infomedia built a serverless data pipeline with change data capture using AWS Glue and Apache Hudi

AWS Big Data

MARCH 15, 2023

To populate the database, the Infomedia team developed a data pipeline using Amazon Simple Storage Service (Amazon S3) for data storage, AWS Glue for data transformations, and Apache Hudi for CDC and record-level updates.

Cost-Benefit

Cost-Benefit Data Processing Optimization Data-driven

7 key Microsoft Azure analytics services (plus one extra)

CIO Business Intelligence

JUNE 29, 2022

But the features in Power BI Premium are now more powerful than the functionality in Azure Analysis Services, so while the service isn’t going away, Microsoft will offer an automated migration tool in the second half of this year for customers who want to move their data models into Power BI instead. Azure Data Factory.

Data Lake

Data Lake Analytics Data Warehouse Machine Learning

Deep dive into the AWS ProServe Hadoop Migration Delivery Kit TCO tool

AWS Big Data

FEBRUARY 6, 2023

For more details on how to configure and schedule the log collector, refer to the yarn-log-collector GitHub repo. Transform the YARN job history logs from JSON to CSV: After obtaining YARN logs, you run a YARN log organizer, yarn-log-organizer.py, which is a parser to transform JSON-based logs to CSV files.
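To make the JSON-to-CSV step concrete, here is a minimal sketch using only the Python standard library. It is not the code of yarn-log-organizer.py, which the excerpt does not show. The function name `yarn_apps_to_csv`, the `{"apps": {"app": [...]}}` input layout, and the field names in the sample are assumptions made for illustration.

```python
import csv
import io
import json


def yarn_apps_to_csv(json_text, fields):
    """Flatten a JSON document shaped like {"apps": {"app": [...]}} into CSV text.

    Only the columns named in `fields` are written; missing values become
    empty strings and extra keys in the JSON records are ignored.
    """
    payload = json.loads(json_text)
    apps = (payload.get("apps") or {}).get("app") or []

    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fields, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for app in apps:
        writer.writerow({name: app.get(name, "") for name in fields})
    return buffer.getvalue()


# Hypothetical sample record; real logs would carry many more fields.
sample = json.dumps(
    {
        "apps": {
            "app": [
                {
                    "id": "application_1",
                    "user": "hadoop",
                    "elapsedTime": 1200,
                    "queue": "default",
                }
            ]
        }
    }
)

csv_text = yarn_apps_to_csv(sample, ["id", "user", "elapsedTime"])
print(csv_text)
```

Passing the column list explicitly keeps the CSV header stable even when individual log records omit a field or add new ones.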

Dashboards

Dashboards Optimization Data Lake Cost-Benefit
