Accurately predicting demand for products allows businesses to optimize inventory levels, minimize stockouts, and reduce holding costs. In today’s highly competitive business landscape, it’s essential for retailers to optimize their inventory management processes to maximize profitability and improve customer satisfaction.
Maintaining reusable database sessions helps optimize the use of database connections, preventing the API server from exhausting the available connections and improving overall system scalability. For details, refer to Redshift Quotas and Limits. After 24 hours a session is forcibly closed, and in-progress queries are terminated.
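As a rough illustration of that pattern, here is a minimal sketch of reusing sessions through a connection pool; the endpoint, credentials, and pool sizes are placeholders, and psycopg2 is just one client that could be used:

```python
# Minimal sketch: reuse database connections from a pool instead of opening a
# new session per API request. Endpoint, database, and credentials are placeholders.
from psycopg2.pool import ThreadedConnectionPool

pool = ThreadedConnectionPool(
    minconn=1,
    maxconn=10,  # keep well below the warehouse's connection quota
    host="my-cluster.xxxxxx.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="api_user",
    password="...",
)

def run_query(sql, params=None):
    """Borrow a connection, run a query, and return the connection to the pool."""
    conn = pool.getconn()
    try:
        with conn.cursor() as cur:
            cur.execute(sql, params)
            return cur.fetchall()
    finally:
        pool.putconn(conn)
```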
This new JDBC connectivity feature enables our governed data to flow seamlessly into these tools, supporting productivity across our teams.” Amazon DataZone addresses your data sharing challenges and optimizes data availability.
but to reference concrete tooling used today in order to ground what could otherwise be a somewhat abstract exercise. Adapted from the book Effective Data Science Infrastructure. However, none of these layers help with modeling and optimization. Let’s now take a tour of the various layers, to begin to map the territory.
With Amazon AppFlow, you can run data flows at nearly any scale and at the frequency you choose: on a schedule, in response to a business event, or on demand. You can configure data transformation capabilities such as filtering and validation to generate rich, ready-to-use data as part of the flow itself, without additional steps.
Since reporting is part of an effective DQM, we will also go through some data quality metrics examples you can use to assess your efforts. But first, let’s define what data quality actually is.
Common challenges and practical mitigation strategies for reliable data transformations. Data transformations are important processes in data engineering, enabling organizations to structure, enrich, and integrate data for analytics, reporting, and operational decision-making.
AI is transforming how senior data engineers and data scientists validate data transformations and conversions. Artificial intelligence-based verification approaches aid in the detection of anomalies, the enforcement of data integrity, and the optimization of pipelines for improved efficiency.
Adding data transformation details to metadata can be challenging because of the dispersed nature of this information across data processing pipelines, making it difficult to extract and incorporate into table-level metadata. You can enhance the prompts with query optimization rules such as partition pruning.
Data lakes provide a centralized repository for data from various sources, enabling organizations to unlock valuable insights and drive data-driven decision-making. However, as data volumes continue to grow, optimizing data layout and organization becomes crucial for efficient querying and analysis.
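One common way to optimize layout is to partition the curated copy of the data by a frequently filtered column, so queries scanning a narrow range read only the matching prefixes. A minimal sketch, assuming hypothetical bucket names and an event_date column:

```python
# Minimal sketch: lay out data in S3 partitioned by date so queries that
# filter on a date range only read the relevant prefixes.
# Bucket paths and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("layout-optimization").getOrCreate()

events = spark.read.json("s3://my-data-lake/raw/events/")

(events
    .repartition("event_date")      # group rows by the partition key
    .write
    .partitionBy("event_date")      # one S3 prefix per date
    .mode("overwrite")
    .parquet("s3://my-data-lake/curated/events/"))
```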
The goal of DataOps is to help organizations make better use of their data to drive business decisions and improve outcomes. ChatGPT> DataOps is a term that refers to the set of practices and tools that organizations use to improve the quality and speed of data analytics and machine learning.
BMW Group uses 4,500 AWS Cloud accounts across the entire organization but is faced with the challenge of reducing unnecessary costs, optimizing spend, and having a central place to monitor costs. For more information on this foundation, refer to A Detailed Overview of the Cost Intelligence Dashboard.
Together with price-performance, Amazon Redshift offers capabilities such as serverless architecture, machine learning integration within your data warehouse and secure data sharing across the organization. dbt Cloud is a hosted service that helps data teams productionize dbt deployments.
If you want deeper control over your infrastructure for cost and latency optimization, you can choose OpenSearch Service’s managed clusters deployment option. With managed clusters, you get granular control over the instances you would like to use, indexing and data-sharding strategy, and more.
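For example, index-level sharding is something you set yourself on a managed cluster. A minimal sketch using the opensearch-py client, with a placeholder domain endpoint and index name and illustrative shard counts:

```python
# Minimal sketch: on an OpenSearch Service managed cluster you control the
# index sharding strategy yourself. Endpoint, credentials, and index name
# are placeholders.
from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],
    http_auth=("admin", "..."),
    use_ssl=True,
)

client.indices.create(
    index="product-catalog",
    body={
        "settings": {
            "index": {
                "number_of_shards": 3,    # sized to the expected data volume
                "number_of_replicas": 1,  # one replica for availability
            }
        }
    },
)
```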
Auto-copy enhances the COPY command by adding jobs for automatic ingestion of data. If storing operational data in a data warehouse is a requirement, synchronization of tables between operational data stores and Amazon Redshift tables is supported.
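A minimal sketch of creating such a job through the Redshift Data API, assuming the auto-copy COPY ... JOB CREATE syntax and placeholder table, bucket, role, and workgroup names:

```python
# Minimal sketch: create an auto-copy job so new files landing under an S3
# prefix are ingested into a Redshift table automatically. All names and ARNs
# below are placeholders.
import boto3

redshift_data = boto3.client("redshift-data")

copy_job_sql = """
COPY public.orders
FROM 's3://my-ingest-bucket/orders/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS PARQUET
JOB CREATE orders_auto_copy_job
AUTO ON;
"""

redshift_data.execute_statement(
    WorkgroupName="my-serverless-workgroup",
    Database="dev",
    Sql=copy_job_sql,
)
```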
Notably, a partner with global reach can be particularly valuable to an organisation with a global operational presence, since the structure of most multinational organisations is optimised to support their core business rather than initiatives like digital transformation.
Data Vault 2.0 allows for the following:
- Agile data warehouse development
- Parallel data ingestion
- A scalable approach to handle multiple data sources, even on the same entity
- A high level of automation
- Historization
- Full lineage support
However, Data Vault 2.0
In this post, we explore how AWS Glue can serve as the data integration service to bring the data from Snowflake for your data integration strategy, enabling you to harness the power of your data ecosystem and drive meaningful outcomes across various use cases. Store the extracted and transformed data in Amazon S3.
However, you might face significant challenges when planning for a large-scale data warehouse migration. This includes the ETL processes that capture source data, the functional refinement and creation of data products, the aggregation for business metrics, and the consumption from analytics, business intelligence (BI), and ML.
Additionally, a TCO calculator generates the TCO estimation of an optimized EMR cluster for facilitating the migration. For more details on how to configure and schedule the log collector, refer to the yarn-log-collector GitHub repo. For more information on how to use the YARN log organizer, refer to the yarn-log-organizer GitHub repo.
With these settings, you can now seamlessly ingest decompressed CloudWatch log data into Splunk using Firehose. This enables you to run high-performance, cost-efficient analytics on streaming data in Amazon S3 using services such as Amazon Athena , Amazon EMR , Amazon Redshift Spectrum , and Amazon QuickSight.
dbt is an open source, SQL-first templating engine that allows you to write repeatable and extensible data transforms in Python and SQL. dbt is predominantly used by customers of data warehouses (such as Amazon Redshift) who are looking to keep their data transform logic separate from storage and engine.
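As a rough sketch of the Python side, a dbt Python model is a function that returns a dataframe; the upstream model names and columns below are hypothetical, and the dataframe API assumes a PySpark-backed adapter:

```python
# Minimal sketch of a dbt Python model (e.g. models/orders_enriched.py).
# Upstream model names and columns are hypothetical.
def model(dbt, session):
    dbt.config(materialized="table")

    orders = dbt.ref("stg_orders")        # upstream dbt model
    customers = dbt.ref("stg_customers")  # upstream dbt model

    # Keep only completed orders, then enrich with customer attributes.
    # The returned dataframe is materialized as the model's table.
    completed = orders.filter(orders.status == "completed")
    return completed.join(customers, on="customer_id", how="left")
```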
Oracle GoldenGate for Oracle Database and Big Data adapters: Oracle GoldenGate is a real-time data integration and replication tool used for disaster recovery, data migrations, and high availability. You can use temporary credentials; for more details, refer to Using temporary credentials with AWS resources.
AWS Glue is a serverless data discovery, load, and transformation service that will prepare data for consumption in BI and AI/ML activities. This solution uses Amazon AppFlow to retrieve data from the Jira Cloud. For full instructions, refer to Jira Cloud connector for Amazon AppFlow.
We set up our AWS CDK to refer to the contents of a specific directory and define a resource (for example, an AWS Step Functions state machine or an AWS Glue job) for each file it found in that directory. We also used it as a repository for storing code that could be retrieved and used by other services.
In this post, we provide a detailed overview of streaming messages with Amazon Managed Streaming for Apache Kafka (Amazon MSK) and Amazon ElastiCache for Redis , covering technical aspects and design considerations that are essential for achieving optimal results. We also discuss the key features, considerations, and design of the solution.
If you can’t make sense of your business data, you’re effectively flying blind. Insights hidden in your data are essential for optimizing business operations, fine-tuning your customer experience, and developing new products — or new lines of business, like predictive maintenance.
It supports modern analytical data lake operations such as create table as select (CTAS), upsert and merge, and time travel queries. Athena also supports the ability to create views and perform VACUUM (snapshot expiration) on Apache Iceberg tables to optimize storage and performance.
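A minimal sketch of a time travel query submitted through the Athena API; the database, table, timestamp, and result location are placeholders:

```python
# Minimal sketch: run an Iceberg time travel query through Athena.
# Database, table, timestamp, and output location are placeholders.
import boto3

athena = boto3.client("athena")

time_travel_sql = """
SELECT *
FROM sales_iceberg
FOR TIMESTAMP AS OF TIMESTAMP '2024-01-01 00:00:00 UTC'
LIMIT 100;
"""

response = athena.start_query_execution(
    QueryString=time_travel_sql,
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(response["QueryExecutionId"])
```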
Under the Transparency in Coverage (TCR) rule, hospitals and payors are required to publish their pricing data in a machine-readable format. For more information, refer to Delivering Consumer-friendly Healthcare Transparency in Coverage On AWS. The Data Catalog now contains references to the machine-readable data.
Infomedia was looking to build a cloud-based data platform to take advantage of highly scalable data storage with flexible and cloud-native processing tools to ingest, transform, and deliver datasets to their SaaS applications. The raw input data is stored in Amazon S3 in JSON format (called the bronze dataset layer).
Data transformation plays a pivotal role in providing the necessary data insights for businesses in any organization, small or large. To gain these insights, customers often perform ETL (extract, transform, and load) jobs from their source systems and output an enriched dataset.
Stored procedures are commonly used to encapsulate logic for datatransformation, data validation, and business-specific logic. You can also schedule stored procedures to automate data processing on Amazon Redshift. For more information, refer to Bringing your stored procedures to Amazon Redshift.
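A minimal sketch of that pattern, creating and calling a transformation procedure through the Redshift Data API; the table, column, and cluster names are placeholders:

```python
# Minimal sketch: a stored procedure that encapsulates a transformation step,
# created and then invoked via the Redshift Data API. All names are placeholders.
import boto3

redshift_data = boto3.client("redshift-data")

create_proc_sql = """
CREATE OR REPLACE PROCEDURE public.sp_refresh_daily_sales()
AS $$
BEGIN
    TRUNCATE TABLE public.daily_sales;
    INSERT INTO public.daily_sales
    SELECT order_date, SUM(amount) AS total_amount
    FROM public.orders
    GROUP BY order_date;
END;
$$ LANGUAGE plpgsql;
"""

for sql in (create_proc_sql, "CALL public.sp_refresh_daily_sales();"):
    redshift_data.execute_statement(
        ClusterIdentifier="my-redshift-cluster",
        Database="dev",
        Sql=sql,
    )
```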
Furthermore, it allows for necessary actions to be taken, such as rectifying errors in the data source, refining data transformation processes, and updating data quality rules. This automated approach reduces the need for manual intervention and streamlines the data quality evaluation process.
By preserving historical versions, data lake time travel provides benefits such as auditing and compliance, data recovery and rollback, reproducible analysis, and data exploration at different points in time. Another popular transaction data lake use case is incremental query.
In addition, more data is becoming available for processing and enrichment of existing and new use cases; for example, we have recently experienced rapid growth in data collection at the edge and an increase in the availability of frameworks for processing that data. As a result, alternative data integration technologies (e.g.,
Amazon Redshift enables you to use SQL to analyze structured and semi-structured data across data warehouses, operational databases, and data lakes, using AWS-designed hardware and machine learning (ML) to deliver the best price-performance at scale. Shashank Tewari is a Senior Technical Account Manager at AWS.
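For semi-structured data, a SUPER column can be navigated with PartiQL-style paths. A minimal sketch with placeholder connection details and a hypothetical clickstream schema:

```python
# Minimal sketch: query semi-structured JSON stored in a SUPER column using
# dot-path navigation. Connection details, table, and field names are placeholders.
import redshift_connector

conn = redshift_connector.connect(
    host="my-cluster.xxxxxx.us-east-1.redshift.amazonaws.com",
    database="dev",
    user="analyst",
    password="...",
)

with conn.cursor() as cur:
    # clickstream_events.payload is a SUPER column holding nested JSON.
    cur.execute("""
        SELECT payload.user_id, payload.device.os
        FROM clickstream_events
        WHERE payload.event_type = 'purchase'
        LIMIT 10;
    """)
    print(cur.fetchall())
```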
AWS Glue Studio is a graphical interface that makes it easy to create, run, and monitor extract, transform, and load (ETL) jobs in AWS Glue. It allows you to visually compose datatransformation workflows using nodes that represent different data handling steps, which later are converted automatically into code to run.
This method uses GZIP compression to optimize storage consumption and query performance. You can also use the data transformation feature of Data Firehose to invoke a Lambda function to perform data transformation in batches. You can test this solution yourself using the AWS Samples GitHub repository.
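A minimal sketch of such a transformation Lambda; the record envelope (recordId, result, base64-encoded data) is the Firehose contract, while the payload field names are hypothetical:

```python
# Minimal sketch of a Data Firehose transformation Lambda: each invocation
# receives a batch of records, and every record must be returned with a
# recordId, a result status, and base64-encoded data.
import base64
import json

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))

        # Example transformation: keep only selected fields (hypothetical schema).
        transformed = {
            "timestamp": payload.get("timestamp"),
            "level": payload.get("level", "INFO"),
            "message": payload.get("message"),
        }

        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(
                (json.dumps(transformed) + "\n").encode("utf-8")
            ).decode("utf-8"),
        })
    return {"records": output}
```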
Data ingestion – Steps 1 and 2 use AWS DMS, which connects to the source database and moves full and incremental data (CDC) to Amazon S3 in Parquet format. Let’s refer to this S3 bucket as the raw layer. Data transformation – Steps 3 and 4 represent an EMR Serverless Spark application (Amazon EMR 6.9
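A minimal sketch of what the transformation step might look like as a Spark job, assuming the DMS task is configured to write an Op column and using hypothetical key, timestamp, and bucket names:

```python
# Minimal sketch: a Spark job (for example on EMR Serverless) reads the Parquet
# files AWS DMS wrote to the raw layer and produces a deduplicated, curated copy.
# Bucket paths and column names are hypothetical; the "Op" column assumes the
# DMS task is configured to include it.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("dms-cdc-transform").getOrCreate()

# Full-load and CDC Parquet files written by AWS DMS to the raw layer.
raw = spark.read.parquet("s3://raw-layer/customers/")

# Keep the latest image per key and drop deletes; full-load rows without an
# Op value are treated as inserts.
window = Window.partitionBy("customer_id").orderBy(F.col("update_ts").desc())

latest = (raw
    .withColumn("rn", F.row_number().over(window))
    .filter((F.col("rn") == 1) &
            (F.coalesce(F.col("Op"), F.lit("I")) != "D"))
    .drop("rn"))

latest.write.mode("overwrite").parquet("s3://curated-layer/customers/")
```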
With EMR Serverless, you don’t have to configure, optimize, secure, or operate clusters to run applications with these frameworks. You can run analytics workloads at any scale with automatic scaling that resizes resources in seconds to meet changing data volumes and processing requirements. Do you have follow-up questions or feedback?
With our strategy in mind, we factored in our consumers and consuming services, which primarily are Sisense Fusion Analytics and Cloud Data Teams. Interestingly, this ad hoc analysis benefits from a single source of truth that is easy to query, allowing quick querying of raw data alongside the cleanest data (i.e.,
Databases can be stored either on a local server or in the cloud and can be accessed for reporting in many different ways, from limited native tools included with the system collecting the data itself, to Excel exports or various direct connectivity options. Enter the Warehouse.
Customers rely on data from different sources such as mobile applications, clickstream events from websites, historical data, and more to deduce meaningful patterns to optimize their products, services, and processes. citibike-tripdata-destination-ACCOUNT_ID – The bucket used for storing the transformed dataset.
We use Apache Spark as our main data processing engine and have over 1,000 Spark applications running over massive amounts of data every day. These Spark applications implement our business logic, ranging from data transformation and machine learning (ML) model inference to operational tasks. Their costs were climbing.