Data Architecture, Events and Metadata

Manage concurrent write conflicts in Apache Iceberg on the AWS Glue Data Catalog

AWS Big Data

APRIL 8, 2025

In modern data architectures, Apache Iceberg has emerged as a popular table format for data lakes, offering key features including ACID transactions and concurrent write support. These conflicts are particularly common in large-scale data cleanup operations. Determine the changes in transaction, and write new data files.

Snapshot

Snapshot Management Metadata Big Data

Run Apache XTable in AWS Lambda for background conversion of open table formats

AWS Big Data

NOVEMBER 26, 2024

This post was co-written with Dipankar Mazumdar, Staff Data Engineering Advocate with AWS Partner OneHouse. Data architecture has evolved significantly to handle growing data volumes and diverse workloads. This allows the existing data to be interpreted as if it were originally written in any of these formats.

Metadata

Metadata Data Lake Snapshot Data Warehouse

Data’s dark secret: Why poor quality cripples AI and growth

CIO Business Intelligence

APRIL 8, 2025

We also examine how centralized, hybrid and decentralized data architectures support scalable, trustworthy ecosystems. As data-centric AI, automated metadata management and privacy-aware data sharing mature, the opportunity to embed data quality into the enterprise’s core has never been more significant.

Data Quality

Data Quality Data-driven Key Performance Indicator Metadata

How EUROGATE established a data mesh architecture using Amazon DataZone

AWS Big Data

JANUARY 15, 2025

Need for a data mesh architecture: Because entities in the EUROGATE group generate vast amounts of data from various sources (across departments, locations, and technologies), the traditional centralized data architecture struggles to keep up with the demands for real-time insights, agility, and scalability.

IoT

IoT Machine Learning Metadata Data-driven

How to Manage Risk with Modern Data Architectures

Cloudera

JUNE 29, 2023

To improve the way they model and manage risk, institutions must modernize their data management and data governance practices. Implementing a modern data architecture makes it possible for financial institutions to break down legacy data silos, simplifying data management, governance, and integration — and driving down costs.

Data Architecture

Data Architecture Risk Management Risk Management

Modernize your legacy databases with AWS data lakes, Part 2: Build a data lake using AWS DMS data on Apache Iceberg

AWS Big Data

OCTOBER 30, 2024

This is part two of a three-part series where we show how to build a data lake on AWS using a modern data architecture. This post shows how to load data from a legacy database (SQL Server) into a transactional data lake (Apache Iceberg) using AWS Glue.

Data Lake

Data Lake Data Processing Optimization Machine Learning

How HPE Aruba Supply Chain optimized cost and performance by migrating to an AWS modern data architecture

AWS Big Data

SEPTEMBER 11, 2024

This post describes how HPE Aruba automated their Supply Chain management pipeline, and re-architected and deployed their data solution by adopting a modern data architecture on AWS. Each file arrives as a pair with a tail metadata file in CSV format containing the size and name of the file.

Data Architecture

Data Architecture Optimization Data Warehouse Metadata

Simplify data integration with AWS Glue and zero-ETL to Amazon SageMaker Lakehouse

AWS Big Data

DECEMBER 4, 2024

While traditional extract, transform, and load (ETL) processes have long been a staple of data integration due to their flexibility, for common use cases such as replication and ingestion, they often prove time-consuming, complex, and less adaptable to the fast-changing demands of modern data architectures.

Data Integration

Data Integration Data Lake Statistics Data-driven

Build a multi-Region and highly resilient modern data architecture using AWS Glue and AWS Lake Formation

AWS Big Data

JANUARY 24, 2023

This solution only replicates metadata in the Data Catalog, not the actual underlying data. To have a redundant data lake using Lake Formation and AWS Glue in an additional Region, we recommend replicating the Amazon S3-based storage using S3 replication, S3 sync, aws-s3-copy-sync-using-batch or S3 Batch replication process.

Data Architecture

Data Architecture Metadata Data Lake Snapshot

Top 10 Metadata Management Influencers, Sites, and Blogs You Must Follow in 2021

Octopai

APRIL 19, 2021

Aptly named, metadata management is the process in which BI and Analytics teams manage metadata, which is the data that describes other data. In other words, data is the content and metadata is the context. Without metadata, BI teams are unable to understand the data’s full story. Dataconomy.

Metadata

Metadata Management Business Intelligence Data Governance

Top analytics announcements of AWS re:Invent 2024

AWS Big Data

FEBRUARY 26, 2025

This premier event showcased groundbreaking advancements, keynotes from AWS leadership, hands-on technical sessions, and exciting product launches. Analytics remained one of the key focus areas this year, with significant updates and innovations aimed at helping businesses harness their data more efficiently and accelerate insights.

Analytics

Analytics Data Lake Metadata Data Warehouse

How Cloudinary transformed their petabyte scale streaming data lake with Apache Iceberg and AWS Analytics

AWS Big Data

JUNE 10, 2024

Enterprises and organizations across the globe want to harness the power of data to make better decisions by putting data at the center of every decision-making process. However, throughout history, data services have held dominion over their customers’ data. This concept makes Iceberg extremely versatile.

Data Lake

Data Lake Metadata Snapshot Analytics

How Cloudera Data Flow Enables Successful Data Mesh Architectures

Cloudera

OCTOBER 7, 2021

Those decentralization efforts appeared under different monikers through time, e.g., data marts versus data warehousing implementations (a popular architectural debate in the era of structured data) then enterprise-wide data lakes versus smaller, typically BU-Specific, “data ponds”.

Metadata

Metadata Cost-Benefit Enterprise Interactive

Use Apache Iceberg in your data lake with Amazon S3, AWS Glue, and Snowflake

AWS Big Data

APRIL 3, 2024

They understand that a one-size-fits-all approach no longer works, and recognize the value in adopting scalable, flexible tools and open data formats to support interoperability in a modern data architecture to accelerate the delivery of new solutions.

Data Lake

Data Lake Snapshot Metadata Data Architecture

What is data governance? Best practices for managing data assets

CIO Business Intelligence

MARCH 24, 2023

The program must introduce and support standardization of enterprise data. Programs must support proactive and reactive change management activities for reference data values and the structure/use of master data and metadata.

Data Governance

Data Governance Management Metadata Data Quality

How Cargotec uses metadata replication to enable cross-account data sharing

AWS Big Data

JUNE 7, 2023

They chose AWS Glue as their preferred data integration tool due to its serverless nature, low maintenance, ability to control compute resources in advance, and scale when needed. To share the datasets, they needed a way to share access to the data and access to catalog metadata in the form of tables and views.

Metadata

Metadata Data Lake Machine Learning Big Data

A Day in the Life of a DataOps Engineer

DataKitchen

OCTOBER 11, 2021

First, you must understand the existing challenges of the data team, including the data architecture and end-to-end toolchain. The DataOps pipeline you have built has enough automated tests to catch errors, and error events are tied to some form of real-time alerts. Monitoring Job Metadata.

Testing

Testing Metadata Dashboards Statistics

Top 6 Benefits of Automating End-to-End Data Lineage

erwin

SEPTEMBER 17, 2020

With complex data architectures and systems within so many organizations, tracking data in motion and data at rest is daunting to say the least. Harvesting the data through automation seamlessly removes ambiguity and speeds up the processing time-to-market capabilities.

Cost-Benefit

Cost-Benefit Data Governance Metadata Reporting

How Volkswagen streamlined access to data across multiple data lakes using Amazon DataZone – Part 1

AWS Big Data

JULY 18, 2024

Data domain producers publish data assets using datasource run to Amazon DataZone in the Central Governance account. This populates the technical metadata in the business data catalog for each data asset. Data ownership remains with the producer.

Data Lake

Data Lake Publishing Metadata Data-driven

Extracting key insights from Amazon S3 access logs with AWS Glue for Ray

AWS Big Data

SEPTEMBER 7, 2023

These logs can track activity, such as data access patterns, lifecycle and management activity, and security events. AWS Glue Data Catalog stores information as metadata tables, where each table specifies a single data store. Alternatively, you can configure the crawler to run based on Amazon S3 events.

Metadata

Metadata Dashboards Metrics Visualization

Petabyte-scale log analytics with Amazon S3, Amazon OpenSearch Service, and Amazon OpenSearch Ingestion

AWS Big Data

MARCH 7, 2024

At the same time, they need to optimize operational costs to unlock the value of this data for timely insights and do so with a consistent performance. With this massive data growth, data proliferation across your data stores, data warehouse, and data lakes can become equally challenging.

Data Lake

Data Lake Analytics Dashboards Metrics

Architectural patterns for real-time analytics using Amazon Kinesis Data Streams, part 1

AWS Big Data

JANUARY 8, 2024

It aims to provide a framework to create low-latency streaming applications on the AWS Cloud using Amazon Kinesis Data Streams and AWS purpose-built data analytics services. In this post, we will review the common architectural patterns of two use cases: Time Series Data Analysis and Event Driven Microservices.

Analytics

Analytics IoT Data-driven Snapshot

Simplify operational data processing in data lakes using AWS Glue and Apache Hudi

AWS Big Data

SEPTEMBER 13, 2023

The Analytics specialty practice of AWS Professional Services (AWS ProServe) helps customers across the globe with modern data architecture implementations on the AWS Cloud. The company wanted the ability to continue processing operational data in the secondary Region in the rare event of primary Region failure.

Data Lake

Data Lake Data Processing Metadata Snapshot

Processing large records with Amazon Kinesis Data Streams

AWS Big Data

OCTOBER 16, 2023

To meet this need, AWS offers Amazon Kinesis Data Streams, a powerful and scalable real-time data streaming service. With Kinesis Data Streams, you can effortlessly collect, process, and analyze streaming data in real time at any scale. The following diagram illustrates the architecture of this solution.

Cost-Benefit

Cost-Benefit Testing Optimization Strategy

Build streaming data pipelines with Amazon MSK Serverless and IAM authentication

AWS Big Data

SEPTEMBER 6, 2023

The rising trend in today’s tech landscape is the use of streaming data and event-oriented structures. They are being applied in numerous ways, including monitoring website traffic, tracking industrial Internet of Things (IoT) devices, analyzing video game player behavior, and managing data for cutting-edge analytics systems.

Testing

Testing Metadata Cost-Benefit Internet of Things

Embedding AI Into Every Aspect of Your Business

Cloudera

JULY 20, 2021

Invest in maturing and improving your enterprise business metrics and metadata repositories, a multitiered data architecture, continuously improving data quality, and managing data acquisitions. Enhanced customer experiences by accelerating the use of data across the organization.

Manufacturing

Manufacturing Forecasting IoT Insurance

Choosing an open table format for your transactional data lake on AWS

AWS Big Data

JUNE 9, 2023

A modern data architecture enables companies to ingest virtually any type of data through automated pipelines into a data lake, which provides highly durable and cost-effective object storage at petabyte or exabyte scale. The following table summarizes the features.

Data Lake

Data Lake Metadata Statistics Optimization

IBM named a leader in the 2022 Gartner® Magic Quadrant™ for Data Integration Tools

IBM Big Data Hub

AUGUST 24, 2022

A data fabric architecture can help – it requires strong data integration capabilities facilitating governed data access blending the right delivery pattern to match the use case. Remote runtime data integration as-a-service execution capabilities for on-premises and multi-cloud execution.

Data Integration

Data Integration Metadata Data-driven Data Architecture

AWS Lake Formation 2022 year in review

AWS Big Data

JANUARY 31, 2023

In this post, we are excited to summarize the features that the AWS Glue Data Catalog, AWS Glue crawler, and Lake Formation teams delivered in 2022. Whether you are a data platform builder, data engineer, data scientist, or any technology leader interested in data lake solutions, this post is for you.

Data Lake

Data Lake Data Governance Data Architecture Machine Learning

Exploring real-time streaming for generative AI Applications

AWS Big Data

MARCH 25, 2024

For example, in a chatbot, data events could pertain to an inventory of flights and hotels or price changes that are constantly ingested to a streaming storage engine. Furthermore, data events are filtered, enriched, and transformed to a consumable format using a stream processor.

Data Lake

Data Lake Unstructured Data Management Snapshot

KGF 2023: Bikes To The Moon, Datastrophies, Abstract Art And A Knowledge Graph Forum To Embrace Them All

Ontotext

DECEMBER 1, 2023

The event held the space for presentations, discussions, and one-on-one meetings, where more than 20 partners and 1,064 registrants from 41 countries, spanning 25 industries, came together. Most organisations are missing this ability to connect all the data together.

Metadata

Metadata Sales Machine Learning Consulting

How Finance is Leveraging Automated Data Lineage for Regulations Compliance

Octopai

APRIL 8, 2020

While there are many factors that led to this event, one critical dynamic was the inadequacy of the data architectures supporting banks and their risk management systems. It required banks to maintain data architecture supporting risk aggregation at all times.

Finance

Finance Cost-Benefit Metadata Data Architecture

Demystifying Modern Data Platforms

Cloudera

SEPTEMBER 15, 2022

July brings summer vacations, holiday gatherings, and for the first time in two years, the return of the Massachusetts Institute of Technology (MIT) Chief Data Officer symposium as an in-person event. A key area of focus for the symposium this year was the design and deployment of modern data platforms. What is a data fabric?

Data Lake

Data Lake Data Architecture Data-driven Data Warehouse

How Zurich Insurance Group built a log management solution on AWS

AWS Big Data

JULY 16, 2024

Priority 2 logs, such as operating system security logs, firewall, identity provider (IdP), email metadata, and AWS CloudTrail, are ingested into Amazon OpenSearch Service to enable the following capabilities. She currently serves as the Global Head of Cyber Data Management at Zurich Group.

Insurance

Insurance Management Cost-Benefit Optimization

Data Centric Revolution: Is Knowledge Ontology the Missing Link?

TDAN

JUNE 6, 2023

I think in the rare event the term came up I internally conflated it with Knowledge Graphs and moved on. You would think that after knocking around in semantics and knowledge graphs for over two decades I’d have had a pretty good idea about Knowledge Management, but it turns out I didn’t. The first tap […]

Management

Management IT Data Architecture Metadata

How Knowledge Graphs Power Data Mesh and Data Fabric

Ontotext

APRIL 10, 2024

“Any enterprise CEO really ought to be able to ask a question that involves connecting data across the organization, be able to run a company effectively, and especially to be able to respond to unexpected events. Most organizations are missing this ability to connect all the data together.”

Metadata

Metadata Data Lake Data Warehouse Data Quality

Design a data mesh on AWS that reflects the envisioned organization

AWS Big Data

JANUARY 22, 2024

Cost and resource efficiency – This is an area where Acast observed a reduction in data duplication, and therefore cost reduction (in some accounts, removing the copy of data 100%), by reading data across accounts while enabling scaling. In this approach, teams responsible for generating data are referred to as producers.

Data-driven

Data-driven Advertising Metadata Data Architecture

How smava makes loans transparent and affordable using Amazon Redshift Serverless

AWS Big Data

DECEMBER 21, 2023

Overview of solution: As a data-driven company, smava relies on the AWS Cloud to power their analytics use cases. smava ingests data from various external and internal data sources into a landing stage on the data lake based on Amazon Simple Storage Service (Amazon S3).

Data Lake

Data Lake Data Warehouse Data-driven B2B

Data integrity vs. data quality: Is there a difference?

IBM Big Data Hub

JULY 13, 2023

This is the practice of creating, updating and consistently enforcing the processes, rules and standards that prevent errors, data loss, data corruption, mishandling of sensitive or regulated data, and data breaches. Effective data security protocols and tools contribute to strong data integrity.

Data Quality

Data Quality Data Integration Metadata Cost-Benefit

Insights from Gartner Data & Analytics Summit Orlando 2023

Alation

MARCH 31, 2023

Alation was proud to have been among the thought leaders at the annual gathering of data experts from around the world. The foundations of successful data governance: The state of data governance was also top of mind. The active metadata helix: Indeed, automation was on everyone’s minds. We couldn’t agree more.

Data Analytics

Data Analytics Analytics Metadata Data Governance

The Advantages Of Live Data-Streaming In The Competitive Financial Services Sector (Part I)

Cloudera

AUGUST 21, 2020

For example, if a credit card was used in the United States and shortly afterward the same card was used in Spain to withdraw the same amount, these two events in isolation could appear legitimate. However, in the context of time and geography, these two events point to a pattern of fraud.

Enterprise

Enterprise Data Lake Strategy Metadata

Create an end-to-end data strategy for Customer 360 on AWS

AWS Big Data

MARCH 26, 2024

You can subscribe to data products that help enrich customer profiles, for example demographics data, advertising data, and financial markets data. Amazon Kinesis ingests streaming events in real time from point-of-sales systems, clickstream data from mobile apps and websites, and social media data.

Data Strategy

Data Strategy Strategy Data Warehouse Prescriptive Analytics

The Cloud Connection: How Governance Supports Security

Alation

APRIL 14, 2022

In today’s AI/ML-driven world of data analytics, explainability needs a repository just as much as those doing the explaining need access to metadata, e.g., information about the data being used. The Cloud Data Migration Challenge. Supports the ability to interact with the actual data and perform analysis on it.

Metadata

Metadata Data Governance Data-driven Modeling

Announcing the 2019 Data Impact Awards

Cloudera

MAY 22, 2019

The program recognizes organizations that are using Cloudera’s platform and services to unlock the power of data, with massive business and social impact. Cloudera’s data superheroes design modern data architectures that work across hybrid and multi-cloud and solve complex data management and analytic use cases spanning from the Edge to AI.

Machine Learning

Machine Learning Data Architecture IoT Data Warehouse
