With the growing emphasis on data, organizations are constantly seeking more efficient and agile ways to integrate their data, especially from a wide variety of applications. SageMaker Lakehouse gives you the flexibility to access and query your data in place with any Apache Iceberg-compatible tool or engine.
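As a minimal sketch of what in-place querying can look like (the catalog name, table, and S3 warehouse path below are illustrative assumptions, not values from the post), a PySpark session can register the AWS Glue Data Catalog as an Iceberg catalog and query a table directly:

    from pyspark.sql import SparkSession

    # Assumes the iceberg-spark-runtime jar is on the classpath; the catalog
    # name "glue_catalog", the table "sales_db.orders", and the warehouse
    # path are placeholders.
    spark = (
        SparkSession.builder
        .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
        .config("spark.sql.catalog.glue_catalog.warehouse", "s3://my-bucket/warehouse/")
        .getOrCreate()
    )

    # Query the Iceberg table in place, with no copy into another store.
    spark.sql("SELECT * FROM glue_catalog.sales_db.orders LIMIT 10").show()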
According to a study from Rocket Software and Foundry, 76% of IT decision-makers say challenges around accessing mainframe data and contextual metadata are a barrier to mainframe data usage, while 64% view integrating mainframe data with cloud data sources as the primary challenge.
For any modern data-driven company, having smooth data integration pipelines is crucial. These pipelines pull data from various sources, transform it, and load it into destination systems for analytics and reporting. This post demonstrates how the new enhanced metrics help you monitor and debug AWS Glue jobs.
AWS Glue has made this more straightforward with the launch of AWS Glue job observability metrics, which provide valuable insights into your data integration pipelines built on AWS Glue. This post walks through how to integrate AWS Glue job observability metrics with Grafana using Amazon Managed Grafana.
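For readers who want to inspect these metrics outside Grafana, here is a minimal boto3 sketch that discovers what AWS Glue publishes to CloudWatch; it assumes only the "Glue" CloudWatch namespace and deliberately avoids hard-coding metric names, since the observability metric set varies by job type:

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Enumerate whatever Glue metrics exist in this account/region rather
    # than assuming specific names.
    paginator = cloudwatch.get_paginator("list_metrics")
    for page in paginator.paginate(Namespace="Glue"):
        for metric in page["Metrics"]:
            dims = {d["Name"]: d["Value"] for d in metric["Dimensions"]}
            print(metric["MetricName"], dims)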
Since Apache Iceberg is well supported by AWS data services and Cloudinary was already using Spark on Amazon EMR, they could integrate writing to the Data Catalog and start an additional Spark cluster to handle data maintenance and compaction. For example, for certain queries, Athena runtime was 2x–4x faster than Snowflake.
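Iceberg ships its table-maintenance operations as Spark procedures, so a compaction pass is a single SQL call. A hedged sketch, assuming a session with the Iceberg SQL extensions enabled and a catalog registered as "glue_catalog" (the table name is a placeholder):

    from pyspark.sql import SparkSession

    # Assumes spark.sql.extensions includes IcebergSparkSessionExtensions and
    # that "glue_catalog" was configured at session startup; "db.events" is a
    # placeholder table.
    spark = SparkSession.builder.getOrCreate()

    # Rewrite small files into ~128 MiB files (134217728 bytes).
    spark.sql("""
        CALL glue_catalog.system.rewrite_data_files(
            table => 'db.events',
            options => map('target-file-size-bytes', '134217728')
        )
    """)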
In the era of big data, data lakes have emerged as a cornerstone for storing vast amounts of raw data in its native format. They support structured, semi-structured, and unstructured data, offering a flexible and scalable environment for data ingestion from multiple sources. The default output is log-based.
With data becoming the driving force behind many industries today, having a modern data architecture is pivotal for organizations to be successful. In this post, we describe Orca’s journey building a transactional data lake using Amazon Simple Storage Service (Amazon S3), Apache Iceberg, and AWS analytics services.
In Part 2 of this series, we discussed how to enable AWS Glue job observability metrics and integrate them with Grafana for real-time monitoring. In this post, we explore how to connect QuickSight to Amazon CloudWatch metrics and build graphs to uncover trends in AWS Glue job observability metrics.
cycle_end"', "sagemakedatalakeenvironment_sub_db", ctas_approach=False) A similar approach is used to connect to shared data from Amazon Redshift, which is also shared using Amazon DataZone. While real-time data is processed by other applications, this setup maintains high-performance analytics without the expense of continuous processing.
Amazon Redshift is a fast, scalable, and fully managed cloud data warehouse that allows you to process and run your complex SQL analytics workloads on structured and semi-structured data. It also helps you securely access your data in operational databases, data lakes, or third-party datasets with minimal movement or copying of data.
The Perilous State of Today’s Data Environments: Data teams often navigate a labyrinth of chaos within their databases. Extrinsic Control Deficit: Many of these changes stem from tools and processes beyond the immediate control of the data team. Identifying Anomalies: Use advanced algorithms to detect anomalies in data patterns.
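"Advanced algorithms" can start small; as an illustrative assumption (the excerpt names no specific method), a rolling z-score over a pipeline metric flags points that drift outside expected bounds:

    import pandas as pd

    def flag_anomalies(series: pd.Series, window: int = 24, threshold: float = 3.0) -> pd.Series:
        # Mark points more than `threshold` rolling standard deviations
        # away from the rolling mean.
        mean = series.rolling(window, min_periods=window).mean()
        std = series.rolling(window, min_periods=window).std()
        return (series - mean).abs() > threshold * std

    # Example: hourly row counts from an ingestion pipeline, one injected spike.
    counts = pd.Series([1000 + (i % 7) for i in range(48)])
    counts.iloc[40] = 5000
    print(counts[flag_anomalies(counts)])  # flags the spike at index 40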
The ability to discover and access data via Denodo Platform is enabled by Denodo Data Catalog, which provides a search-based interface for finding data sources based on metadata or content, as well as metrics related to data popularity and usage.
Jon Pruitt, director of IT at Hartsfield-Jackson Atlanta International Airport, and his team crafted a visual business intelligence dashboard for a top executive in its Emergency Response Team to provide key metrics at a glance, including weather status, terminal occupancy, concessions operations, and parking capacity.
Analytics remained one of the key focus areas this year, with significant updates and innovations aimed at helping businesses harness their data more efficiently and accelerate insights. From enhancing data lakes to empowering AI-driven analytics, AWS unveiled new tools and services that are set to shape the future of data and analytics.
In this post, we show how Ruparupa implemented an incrementally updated data lake to get insights into their business using Amazon Simple Storage Service (Amazon S3), AWS Glue, Apache Hudi, and Amazon QuickSight. An AWS Glue ETL job, using the Apache Hudi connector, updates the S3 data lake hourly with incremental data.
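A hedged sketch of what such an hourly Hudi upsert can look like inside a Glue/Spark job; the table name, keys, and S3 path are illustrative assumptions, not Ruparupa's actual configuration:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # Hudi jars assumed on the classpath

    # Stand-in for the incremental batch extracted since the last run.
    incremental_df = spark.createDataFrame(
        [("o-1", "2024-01-01T00:00:00", 42.0)], ["order_id", "updated_at", "amount"]
    )

    # Placeholder table name, record key, precombine field, and target path.
    hudi_options = {
        "hoodie.table.name": "sales_orders",
        "hoodie.datasource.write.recordkey.field": "order_id",
        "hoodie.datasource.write.precombine.field": "updated_at",
        "hoodie.datasource.write.operation": "upsert",
    }

    (incremental_df.write.format("hudi")
        .options(**hudi_options)
        .mode("append")
        .save("s3://my-datalake/sales_orders/"))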
This would be a straightforward task were it not for the fact that, in the digital era, there has been an explosion of data – collected and stored everywhere – much of it poorly governed, ill-understood, and irrelevant. Further, data management activities don’t end once the AI model has been developed. Addressing the Challenge.
At Stitch Fix, we have been powered by data science since its foundation and rely on many modern data lake and data processing technologies. In our infrastructure, Apache Kafka has emerged as a powerful tool for managing event streams and facilitating real-time data processing.
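As a minimal, generic illustration (not Stitch Fix's actual code; the broker address and topic are placeholder assumptions), a kafka-python consumer reading an event stream might look like this:

    import json
    from kafka import KafkaConsumer

    # Placeholder broker and topic names.
    consumer = KafkaConsumer(
        "style-events",
        bootstrap_servers=["localhost:9092"],
        group_id="analytics-ingest",
        value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
        auto_offset_reset="earliest",
    )

    for message in consumer:
        # Each message is one event; hand it to the processing layer.
        print(message.topic, message.offset, message.value)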
As organizations increasingly rely on data stored across various platforms, such as Snowflake, Amazon Simple Storage Service (Amazon S3), and various software as a service (SaaS) applications, the challenge of bringing these disparate data sources together has never been more pressing.
Hundreds of thousands of organizations build data integration pipelines to extract and transform data. They establish data quality rules to ensure the extracted data is of high quality for accurate business decisions. These rules assess the data based on fixed criteria reflecting current business states.
We are excited to announce the General Availability of AWS Glue Data Quality. Our journey started by working backward from our customers who create, manage, and operate data lakes and data warehouses for analytics and machine learning. For an up-to-date list, refer to Data Quality Definition Language (DQDL).
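To make DQDL concrete, here is a hedged boto3 sketch that registers a small ruleset against a Glue table; the database, table, and rules are illustrative assumptions, and DQDL supports many more rule types than shown:

    import boto3

    glue = boto3.client("glue")

    # Two simple DQDL rules; "sales_db"/"orders" are placeholder names.
    ruleset = 'Rules = [ IsComplete "order_id", ColumnValues "amount" > 0 ]'

    glue.create_data_quality_ruleset(
        Name="orders-basic-quality",
        Ruleset=ruleset,
        TargetTable={"TableName": "orders", "DatabaseName": "sales_db"},
    )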
However, enterprise data generated from siloed sources combined with the lack of a data integration strategy creates challenges for provisioning the data for generative AI applications. As part of the transformation, the objects need to be treated to ensure data privacy (for example, PII redaction).
Amazon MSK enables us to tailor the data retention duration to our specific requirements, ranging from seconds to unlimited. This flexibility grants uninterrupted data availability to our application, which wasn’t possible with our previous architecture.
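Retention in Kafka (and therefore MSK) is a per-topic configuration; a hedged sketch with kafka-python's admin client, where the broker endpoint and topic are placeholders and retention.ms = -1 means retain indefinitely:

    from kafka.admin import ConfigResource, ConfigResourceType, KafkaAdminClient

    # Placeholder MSK bootstrap broker and topic name.
    admin = KafkaAdminClient(bootstrap_servers="b-1.example.kafka.us-east-1.amazonaws.com:9092")

    # -1 retains records indefinitely; any millisecond value narrows the window.
    admin.alter_configs([
        ConfigResource(ConfigResourceType.TOPIC, "clickstream", configs={"retention.ms": "-1"})
    ])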
AWS Glue is a serverless data integration service that makes it simple to discover, prepare, and combine data for analytics, machine learning (ML), and application development. Hundreds of thousands of customers use data lakes for analytics and ML to make data-driven business decisions. Choose Save ruleset.
Let’s go through the ten Azure data pipeline tools. Azure Data Factory: This cloud-based data integration service allows you to create data-driven workflows for orchestrating and automating data movement and transformation. Azure Blob Storage serves as the data lake to store raw data.
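As a rough sketch only (the subscription, resource group, factory, and dataset names are placeholder assumptions, and model details vary across SDK versions), a one-activity copy pipeline via the Azure Data Factory Python SDK looks roughly like this:

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        BlobSink, BlobSource, CopyActivity, DatasetReference, PipelineResource,
    )

    # Placeholder subscription ID; both blob datasets are assumed to exist.
    adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

    copy_activity = CopyActivity(
        name="CopyRawToCurated",
        inputs=[DatasetReference(type="DatasetReference", reference_name="RawBlobDataset")],
        outputs=[DatasetReference(type="DatasetReference", reference_name="CuratedBlobDataset")],
        source=BlobSource(),
        sink=BlobSink(),
    )

    adf.pipelines.create_or_update(
        "my-resource-group", "my-data-factory", "CopyRawPipeline",
        PipelineResource(activities=[copy_activity]),
    )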
Vyaire developed a custom data integration platform, iDataHub, powered by AWS services such as AWS Glue, AWS Lambda, and Amazon API Gateway. In this post, we share how we extracted data from SAP ERP using AWS Glue and the SAP SDK. Prahalathan M is the Data Integration Architect at Vyaire Medical Inc.
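SAP extraction through the SDK ultimately comes down to RFC calls; as a generic, hedged illustration (the connection parameters and the MARA table are placeholders, not Vyaire's actual setup), PyRFC can read a table like so:

    from pyrfc import Connection

    # Placeholder SAP application-server connection details.
    conn = Connection(
        ashost="sap.example.com", sysnr="00", client="100",
        user="rfc_user", passwd="********",
    )

    # RFC_READ_TABLE returns each row as a delimiter-joined string in DATA/WA.
    result = conn.call("RFC_READ_TABLE", QUERY_TABLE="MARA", DELIMITER="|", ROWCOUNT=5)
    for row in result["DATA"]:
        print(row["WA"].split("|"))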
The following figure shows some of the metrics derived from the study. Data ingestion: You have to build ingestion pipelines based on factors like the types of data sources (on-premises data stores, files, SaaS applications, third-party data) and the flow of data (unbounded streams or batch data).
Loading complex multi-point datasets into a dimensional model, identifying issues, and validating data integrity of the aggregated and merged data points are the biggest challenges that clinical quality management systems face. Although data lakes resemble data vaults, a data vault provides more features of a data warehouse.
Since its launch in 2006, Amazon Simple Storage Service (Amazon S3) has experienced major growth, supporting multiple use cases such as hosting websites, creating data lakes, serving as object storage for consumer applications, storing logs, and archiving data. For Report path prefix, enter cur-data/account-cur-daily.
To verify the data quality of the sources through statistically relevant metrics, AWS Glue Data Quality runs data quality tasks on relevant AWS Glue tables. Foundations for a data lake with data governance controls and data quality checks.
While this approach provides isolation, it creates another significant challenge: duplication of data, metadata, and security policies, or a ‘split-brain’ data lake. Now the admins need to synchronize multiple copies of the data and metadata and ensure that users across the many clusters are not viewing stale information.
In the future, customers will be able to deploy Data Entities and replicate transactional tables in an Azure Data Lake. This includes the ability to drill down through live D365 F&SCM data through balances, journal entries, and into subledger transactions to find and fix data integrity and reconciliation issues fast.
With data volumes exhibiting a double-digit percentage growth rate year on year and the COVID pandemic disrupting global logistics in 2021, it became more critical to scale and generate near-real-time data. You can visually create, run, and monitor extract, transform, and load (ETL) pipelines to load data into your data lakes.
It has been well documented since the publication of the State of DevOps 2019 DORA metrics that with DevOps, companies can deploy software 208 times more often and 106 times faster, recover from incidents 2,604 times faster, and release 7 times fewer defects. Finally, data integrity is of paramount importance.
Companies are faced with the daunting task of ingesting all this data, cleansing it, and using it to provide outstanding customer experience. Typically, companies ingest data from multiple sources into their data lake to derive valuable insights from the data. This will open the ML transforms page. Choose Next.
During a recent house move I discovered an old notebook with metrics from my time as a Data Warehouse Project Manager, which I used to estimate data delivery projects. For the delivery of a single data mart with…
Fast food companies like Domino’s, McDonald’s, and KFC collect massive amounts of data, including customer data and other key business metrics for their operations. It is also with this customer data that they experiment and roll out new products every month. So make sure you have a data strategy in place.
We show how to perform extract, load, and transform (ELT), an integration process focused on getting the raw data from a data lake into a staging layer to perform the modeling. The data (business process) needs to be integrated across various departments; in this case, marketing can access the sales data.
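A hedged sketch of the raw-to-staging load, assuming Amazon Redshift as the target (the cluster endpoint, credentials, bucket, IAM role, and table are all placeholders); transformation into the modeled layer happens afterwards, which is the "T" at the end of ELT:

    import redshift_connector

    # Placeholder cluster endpoint and credentials.
    conn = redshift_connector.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
        database="analytics", user="etl_user", password="********",
    )

    # COPY raw Parquet files from the data lake into a staging table.
    cursor = conn.cursor()
    cursor.execute("""
        COPY staging.sales
        FROM 's3://my-datalake/raw/sales/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS PARQUET
    """)
    conn.commit()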
To optimize data analytics and AI workloads, organizations need a data store built on an open data lakehouse architecture. This type of architecture combines the performance and usability of a data warehouse with the flexibility and scalability of a data lake.
Amazon Redshift helps you break down the data silos and allows you to run unified, self-service, real-time, and predictive analytics on all data across your operational databases, data lake, data warehouse, and third-party datasets with built-in governance.
Troubleshooting these production issues requires extensive analysis of logs and metrics, often leading to extended downtimes and delayed insights from critical data pipelines. Usually, troubleshooting requires an experienced data engineer to manually go over the following steps to identify the root cause.
Organizations across all industries have complex data processing requirements for their analytical use cases across different analytics systems, such as data lakes on AWS, data warehouses (Amazon Redshift), search (Amazon OpenSearch Service), NoSQL (Amazon DynamoDB), machine learning (Amazon SageMaker), and more.
We have seen strong customer demand to expand its scope to cloud-based data lakes because data lakes are increasingly the enterprise solution for large-scale data initiatives due to their power and capabilities. The gold model joins the technical logs with billing data and organizes the metrics per business unit.
Organizations of all sizes are dealing with exponentially increasing data volumes and data sources, which create challenges such as siloed information, increased technical complexity across various systems, and slow reporting of important business metrics.
Third, AWS continues adding support for more data sources, including connections to software as a service (SaaS) applications, on-premises applications, and other clouds, so organizations can act on their data. They can now analyze business metrics in near-real time and make data-driven decisions faster than ever before.