Events and Metadata - Data Leaders Brief

Manage concurrent write conflicts in Apache Iceberg on the AWS Glue Data Catalog

AWS Big Data

APRIL 8, 2025

Consider a streaming pipeline ingesting real-time event data while a scheduled compaction job runs to optimize file sizes. However, commits can still fail if the latest metadata is updated after the base metadata version is established. Generate new metadata files. Commit the metadata files to the catalog.
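The commit sequence in this excerpt (establish a base metadata version, generate new metadata, commit to the catalog) can be sketched as an optimistic compare-and-swap with retry. This is a minimal illustration only: `InMemoryCatalog`, `commit_with_retry`, and the `before_commit` hook are hypothetical names, not the Iceberg or AWS Glue API, and a real catalog would perform the version check atomically.

```python
from dataclasses import dataclass, field


class CommitConflict(Exception):
    """Raised when the catalog moved on after the writer read its base version."""


@dataclass
class InMemoryCatalog:
    # Hypothetical stand-in for a catalog; not the AWS Glue or Iceberg API.
    version: int = 0
    files: list = field(default_factory=list)

    def commit(self, base_version: int, new_files: list) -> int:
        # Accept the commit only if nobody committed since base_version was read.
        if base_version != self.version:
            raise CommitConflict(f"base {base_version} != latest {self.version}")
        self.files.extend(new_files)
        self.version += 1
        return self.version


def commit_with_retry(catalog, new_files, max_retries=3, before_commit=None):
    """Re-read the base version and retry whenever another writer wins the race."""
    for attempt in range(1, max_retries + 1):
        base = catalog.version          # establish the base metadata version
        if before_commit:
            before_commit(attempt)      # hook that lets another writer interleave
        try:
            return catalog.commit(base, new_files), attempt
        except CommitConflict:
            continue                    # base is stale; refresh and try again
    raise CommitConflict(f"gave up after {max_retries} attempts")


catalog = InMemoryCatalog()


def compaction_interleaves(attempt):
    # Simulate a compaction job committing during the streaming writer's first attempt.
    if attempt == 1:
        catalog.commit(catalog.version, ["compacted-0.parquet"])


version, attempts = commit_with_retry(
    catalog, ["events-0.parquet"], before_commit=compaction_interleaves
)
print(version, attempts)
```

In this sketch the first attempt fails because the simulated compaction job advances the catalog version after the base was read, and the second attempt succeeds against the refreshed base.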

Snapshot

Snapshot Management Metadata Big Data

Build a high-performance quant research platform with Apache Iceberg

AWS Big Data

JANUARY 9, 2025

Iceberg offers distinct advantages through its metadata layer over Parquet, such as improved data management, performance optimization, and integration with various query engines. Iceberg's table format separates data files from metadata files, enabling efficient data modifications without full dataset rewrites.

Metadata

Metadata Snapshot Cost-Benefit Optimization

SAP Datasphere Powers Business at the Speed of Data

Rocket-Powered Data Science

MARCH 20, 2023

Datasphere goes beyond the “big three” data usage end-user requirements (ease of discovery, access, and delivery) to include data orchestration (data ops and data transformations) and business data contextualization (semantics, metadata, catalog services). As you would guess, maintaining context relies on metadata.

Data Warehouse

Data Warehouse Metadata Digital Transformation Machine Learning

Run Apache XTable in AWS Lambda for background conversion of open table formats

AWS Big Data

NOVEMBER 26, 2024

Central to a transactional data lake are open table formats (OTFs) such as Apache Hudi, Apache Iceberg, and Delta Lake, which act as a metadata layer over columnar formats. XTable isn’t a new table format but provides abstractions and tools to translate the metadata associated with existing formats.

Metadata

Metadata Data Lake Snapshot Data Warehouse

Use Amazon Kinesis Data Streams to deliver real-time data to Amazon OpenSearch Service domains with Amazon OpenSearch Ingestion

AWS Big Data

NOVEMBER 11, 2024

For example, you can use metadata about the Kinesis data stream name to index by data stream (${getMetadata("kinesis_stream_name")}), or you can use document fields to index data depending on the CloudWatch log group or other document data (${path/to/field/in/document}).
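To make the two placeholder styles concrete, here is a small sketch of how such an index-name template could be resolved against an event. It is an illustration of the idea, not how OpenSearch Ingestion implements it: `resolve_index_name`, the regular expression, and the sample field names (`cloudwatch`, `log_group`) are all assumptions.

```python
import re

# Matches ${getMetadata("key")} or ${path/to/field}; an illustrative grammar only.
_PLACEHOLDER = re.compile(r'\$\{(?:getMetadata\("([^"]+)"\)|([^}]+))\}')


def resolve_index_name(template: str, document: dict, metadata: dict) -> str:
    """Substitute metadata lookups and slash-separated document paths in a template."""

    def lookup(match: re.Match) -> str:
        meta_key, field_path = match.group(1), match.group(2)
        if meta_key is not None:
            # ${getMetadata("key")} reads from the event's metadata.
            return str(metadata[meta_key])
        # ${path/to/field} walks nested keys in the document itself.
        value = document
        for part in field_path.strip("/").split("/"):
            value = value[part]
        return str(value)

    return _PLACEHOLDER.sub(lookup, template)


# Hypothetical event: the metadata carries the stream name, the document a log group.
metadata = {"kinesis_stream_name": "app-logs"}
document = {"cloudwatch": {"log_group": "payments"}}

by_stream = resolve_index_name(
    'logs-${getMetadata("kinesis_stream_name")}', document, metadata
)
by_field = resolve_index_name("logs-${cloudwatch/log_group}", document, metadata)
print(by_stream, by_field)
```

A missing metadata key or document field raises `KeyError` here; a real pipeline would need its own policy for that case.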

Metadata

Metadata Metrics Analytics Data Processing

Metadata Management Best Practices: How to Plan Your Metadata Management Program

Octopai

NOVEMBER 10, 2021

Metadata has been defined as the who, what, where, when, why, and how of data. Without the context given by metadata, data is just a bunch of numbers and letters. But going on a rampage to define, categorize, and otherwise metadata-ize your data doesn’t necessarily give you the key to the value in your data. Hold on tight!

Metadata

Metadata Management Interactive Strategy

Monitoring Apache Iceberg metadata layer using AWS Lambda, AWS Glue, and AWS CloudWatch

AWS Big Data

JULY 29, 2024

In this blog post, we’ll discuss how the metadata layer of Apache Iceberg can be used to make data lakes more efficient. You will learn about an open-source solution that can collect important metrics from the Iceberg metadata layer. This ensures that each change is tracked and reversible, enhancing data governance and auditability.

Metadata

Metadata Snapshot Data Lake Metrics

Introducing simplified interaction with the Airflow REST API in Amazon MWAA

AWS Big Data

OCTOBER 23, 2024

The Airflow REST API facilitates a wide range of use cases, from centralizing and automating administrative tasks to building event-driven, data-aware data pipelines. Event-driven architectures – The enhanced API facilitates seamless integration with external events, enabling the triggering of Airflow DAGs based on these events.

Interactive

Interactive Testing Data-driven Data Lake

Implement a custom subscription workflow for unmanaged Amazon S3 assets published with Amazon DataZone

AWS Big Data

DECEMBER 19, 2024

The proposed solution involves creating a custom subscription workflow that uses the event-driven architecture of Amazon DataZone. Amazon DataZone keeps you informed of key activities (events) within your data portal, such as subscription requests, updates, comments, and system events. Enter a name for the asset.

Publishing

Publishing Unstructured Data Metadata Data-driven

Use open table format libraries on AWS Glue 5.0 for Apache Spark

AWS Big Data

DECEMBER 4, 2024

An Iceberg table’s metadata stores a history of snapshots, which are updated with each transaction. Over time, this creates multiple data files and metadata files as changes accumulate. Additionally, they can impact query performance due to the overhead of handling large amounts of metadata.

Snapshot

Snapshot Metadata Data Lake Optimization

The Power of Graph Databases, Linked Data, and Graph Algorithms

Rocket-Powered Data Science

MARCH 10, 2020

Entities are the nodes in the graph — these can be people, events, objects, concepts, or places. Each of those cases deeply involves entities (people, objects, events, actions, concepts, and places) and their relationships (touch points, both causal and simple associations).

Metadata

Metadata Machine Learning Prescriptive Analytics ROI

Are You Content with Your Organization’s Content Strategy?

Rocket-Powered Data Science

JULY 6, 2021

This is accomplished through tags, annotations, and metadata (TAM). Smart content includes labeled (tagged, annotated) metadata (TAM). The key to success is to start enhancing and augmenting content management systems (CMS) with additional features: semantic content and context. Collect, curate, and catalog (i.e.,

Strategy

Strategy Machine Learning Metadata Knowledge Discovery

Enhance data governance with enforced metadata rules in Amazon DataZone

AWS Big Data

NOVEMBER 20, 2024

We’re excited to announce a new feature in Amazon DataZone that offers enhanced metadata governance for your subscription approval process. With this update, domain owners can define and enforce metadata requirements for data consumers when they request access to data assets. Key benefits The feature benefits multiple stakeholders.

Metadata

Metadata Data Governance Metrics Marketing

How ANZ Institutional Division built a federated data platform to enable their domain teams to build data products to support business outcomes

AWS Big Data

DECEMBER 4, 2024

The Institutional Data & AI platform adopts a federated approach to data while centralizing the metadata to facilitate simpler discovery and sharing of data products. A data portal for consumers to discover data products and access associated metadata. Subscription workflows that simplify access management to the data products.

Metadata

Metadata Data Governance Data Quality Data-driven

Manage users and group memberships on Amazon QuickSight using SCIM events generated in IAM Identity Center with Azure AD

AWS Big Data

MARCH 22, 2023

If your organization uses Microsoft Azure Active Directory (Azure AD) for centralized authentication and utilizes its user attributes to organize the users, you can enable federation across all QuickSight accounts as well as manage users and their group membership in QuickSight using events generated in the AWS platform.

Management

Management Metadata Enterprise Testing

Business Strategies for Deploying Disruptive Tech: Generative AI and ChatGPT

Rocket-Powered Data Science

FEBRUARY 15, 2023

Know thy data: understand what it is (formats, types, sampling, who, what, when, where, why), encourage the use of data across the enterprise, and enrich your datasets with searchable (semantic and content-based) metadata (labels, annotations, tags). Do not covet thy data’s correlations: a random six-sigma event is one-in-a-million.

Strategy

Strategy Experimentation Uncertainty Machine Learning

The New O’Reilly Answers: The R in “RAG” Stands for “Royalties”

O'Reilly on Data

JUNE 14, 2024

It offers a wealth of books, on-demand courses, live events, short-form posts, interactive labs, expert playlists, and more—formed from the proprietary content of thousands of independent authors, industry experts, and several of the largest education publishers in the world.

Metadata

Metadata Publishing Data-driven Modeling

How EUROGATE established a data mesh architecture using Amazon DataZone

AWS Big Data

JANUARY 15, 2025

From here, the metadata is published to Amazon DataZone by using AWS Glue Data Catalog. The data in the central data warehouse in Amazon Redshift is then processed for analytical needs and the metadata is shared to the consumers through Amazon DataZone. This process is shown in the following figure.

IoT

IoT Machine Learning Metadata Data-driven

Modernize your legacy databases with AWS data lakes, Part 2: Build a data lake using AWS DMS data on Apache Iceberg

AWS Big Data

OCTOBER 30, 2024

We recommend using AWS Step Functions Workflow Studio, and setting up Amazon S3 event notifications and an SNS FIFO queue to receive the filename as messages. Because a CDC file can contain data for multiple tables, the job loops over the tables in a file and loads the table metadata from the source table (RDS column names).

Data Lake

Data Lake Data Processing Optimization Machine Learning

Deep automation in machine learning

O'Reilly on Data

DECEMBER 19, 2018

Metadata analysis makes it possible to build data catalogs, which in turn allow humans to discover data that’s relevant to their projects. The customer demographics are different; but more than that, the event sources are different. Joe Hellerstein on how "Metadata services can lead to performance and organizational improvements".

Machine Learning

Machine Learning Software Metadata Testing

Ingest, transform, and deliver events published by Amazon Security Lake to Amazon OpenSearch Service

AWS Big Data

JUNE 19, 2023

When it comes to near-real-time analysis of data as it arrives in Security Lake and responding to security events your company cares about, Amazon OpenSearch Service provides the necessary tooling to help you make sense of the data found in Security Lake. Under Log and event sources, specify what the subscriber is authorized to ingest.

Publishing

Publishing Dashboards Visualization Management

Data’s dark secret: Why poor quality cripples AI and growth

CIO Business Intelligence

APRIL 8, 2025

As data-centric AI, automated metadata management and privacy-aware data sharing mature, the opportunity to embed data quality into the enterprise's core has never been more significant. Data fabric Metadata-rich integration layer across distributed systems. Implementation complexity, relies on robust metadata management.

Data Quality

Data Quality Data-driven Key Performance Indicator Metadata

Simplify data integration with AWS Glue and zero-ETL to Amazon SageMaker Lakehouse

AWS Big Data

DECEMBER 4, 2024

The data is also registered in the Glue Data Catalog, a metadata repository. Amazon EventBridge, a serverless event bus service, triggers a downstream process that allows you to build event-driven architecture as soon as your new data arrives in your target. Check CloudWatch log events for the SEED Load.

Data Integration

Data Integration Data Lake Statistics Data-driven

Demystify data sharing and collaboration patterns on AWS: Choosing the right tool for the job

AWS Big Data

OCTOBER 21, 2024

Business analysts enhance the data with business metadata/glossaries and publish the same as data assets or data products. Users can search for assets in the Amazon DataZone catalog, view the metadata assigned to them, and access the assets. Amazon Athena is used to query and explore the data.

Sales

Sales Data-driven Data Processing Key Performance Indicator

Federate to Amazon Redshift Query Editor v2 with Microsoft Entra ID

AWS Big Data

DECEMBER 10, 2024

Upon successful authentication, the custom claims provider triggers the custom authentication extensions token issuance start event listener. The custom authentication extension calls an Azure function (your REST API endpoint) with information about the event, user profile, session data, and other context. Select it and choose Next.

Sales

Sales Metadata Enterprise Testing

Amazon OpenSearch Service Under the Hood : OpenSearch Optimized Instances(OR1)

AWS Big Data

APRIL 17, 2024

The following diagram illustrates an indexing flow involving a metadata update in OR1. During indexing operations, individual documents are indexed into Lucene and also appended to a write-ahead log, also known as a translog. In the event of an infrastructure failure, an OpenSearch domain can end up losing one or more nodes.

Optimization

Optimization Snapshot Metadata Cost-Benefit

Modernize a legacy real-time analytics application with Amazon Managed Service for Apache Flink

AWS Big Data

OCTOBER 11, 2023

This may require frequent truncation in certain tables to retain only the latest stream of events. The second streaming data source constitutes metadata information about the call center organization and agents that gets refreshed throughout the day. Agent states are reported in agent-state events.

Management

Management Metadata Analytics Dashboards

Amazon DataZone introduces OpenLineage-compatible data lineage visualization in preview

AWS Big Data

JULY 8, 2024

With the Amazon DataZone OpenLineage-compatible API, domain administrators and data producers can capture and store lineage events beyond what is available in Amazon DataZone, including transformations in Amazon Simple Storage Service (Amazon S3), AWS Glue, and other AWS services.

Visualization

Visualization Metadata Publishing Sales

The Superpowers of Ontotext’s Relation and Event Detector

Ontotext

FEBRUARY 26, 2024

Ontotext’s Relation and Event Detector (RED) is designed to assess and analyze the impact of market-moving events. Entity linking allows events to be associated with specific companies in the graph and correlated with information from 3rd party databases, namely Crunchbase, and public information about stock prices.

Data-driven

Data-driven Risk Modeling Risk Management

Top analytics announcements of AWS re:Invent 2024

AWS Big Data

FEBRUARY 26, 2025

This premier event showcased groundbreaking advancements, keynotes from AWS leadership, hands-on technical sessions, and exciting product launches. S3 Metadata is designed to automatically capture metadata from objects as they are uploaded into a bucket, and to make that metadata queryable in a read-only table.

Analytics

Analytics Data Lake Metadata Data Warehouse

Apache Iceberg optimization: Solving the small files problem in Amazon EMR

AWS Big Data

OCTOBER 3, 2023

Iceberg tables store metadata in manifest files. As the number of data files increases, the amount of metadata stored in these manifest files also increases, leading to longer query planning time. The query runtime also increases because it’s proportional to the number of data or metadata file read operations.

Optimization

Optimization Snapshot Data Lake Metadata

Three Emerging Analytics Products Derived from Value-driven Data Innovation and Insights Discovery in the Enterprise

Rocket-Powered Data Science

JULY 19, 2023

using high-dimensional data feature space to disambiguate events that seem to be similar, but are not). The thing itself (or the data about the thing) may not be surprising (though it could be), but the context (the “metadata”, which is “other data about the primary data”) provides a signal that something needs attention here.

Data-driven

Data-driven Enterprise Analytics Machine Learning

How Cloudinary transformed their petabyte scale streaming data lake with Apache Iceberg and AWS Analytics

AWS Big Data

JUNE 10, 2024

Data-driven decisions lead to more effective responses to unexpected events, increase innovation and allow organizations to create better experiences for their customers. Short overview of Cloudinary’s infrastructure Cloudinary infrastructure handles over 20 billion requests daily with every request generating event logs.

Data Lake

Data Lake Metadata Snapshot Analytics

The Need For Personalized Data Journeys for Your Data Consumers

DataKitchen

OCTOBER 20, 2023

It assigns unique identifiers to each data item—referred to as ‘payloads’—related to each event. Payload DJs facilitate capturing metadata, lineage, and test results at each phase, enhancing tracking efficiency and reducing the risk of data loss.

Insurance

Insurance Metadata Data-driven Data Quality

How AppsFlyer modernized their interactive workload by moving to Amazon Athena and saved 80% of costs

AWS Big Data

AUGUST 8, 2024

AppsFlyer develops a leading measurement solution focused on privacy, which enables marketers to gauge the effectiveness of their marketing activities and integrates them with the broader marketing world, managing a vast volume of 100 billion events every day. This led the team to examine partition indexing.

Interactive

Interactive Metadata Optimization Testing

Data Intelligence in the Next Normal; Why, Who and When?

erwin

JANUARY 14, 2021

As we enter 2021, we will also be building off the events of 2020 – both positive and negative – including the acceleration of digital transformation as the next normal begins to be defined. Technical metadata is what makes up database schema and table definitions.

Digital Transformation

Digital Transformation Metadata Big Data Data-driven

How Cargotec uses metadata replication to enable cross-account data sharing

AWS Big Data

JUNE 7, 2023

In this blog, we discuss the technical challenges faced by Cargotec in replicating their AWS Glue metadata across AWS accounts, and how they navigated these challenges successfully to enable cross-account data sharing. Solution overview Cargotec required a single catalog per account that contained metadata from their other AWS accounts.

Metadata

Metadata Data Lake Machine Learning Big Data

Why data observability is essential to AI governance

erwin

DECEMBER 9, 2024

Maybe your AI model monitors sales data, and the data is spiking for one region of the country due to a world event. Metadata is the basis of trust for data forensics as we answer the questions of fact or fiction when it comes to the data we see. Let's give a for instance.

Metadata

Metadata Data Quality Sales Modeling

The Increasing Importance of Open Table Formats

David Menninger's Analyst Perspectives

OCTOBER 31, 2024

Initially called Onetable, the project became Apache XTable in September 2024 and provides a lightweight translation layer to translate metadata between table formats without the need to duplicate or modify the data. In 2023, Onehouse announced an initiative to provide interoperability across table formats.

Data Lake

Data Lake Unstructured Data Data Warehouse Software

The Benefits of a Knowledge Graph-based Metadata Hub

Ontotext

DECEMBER 15, 2022

Enter metadata. Metadata describes data and includes information such as how old data is, where it was created, who owns it, and what concepts (or other data) it relates to. As a result, leveraging metadata has become a core capability for businesses trying to extract value from their data. Knowledge (metadata) layer.

Metadata

Metadata Unstructured Data Structured Data Enterprise

Integrate custom applications with AWS Lake Formation – Part 2

AWS Big Data

NOVEMBER 19, 2024

Unfiltered Table Metadata This tab displays the response of the AWS Glue API GetUnfilteredTableMetadata policies for the selected table. Get table data and metadata for this user to see how Lake Formation permissions are enforced and so the two users can see different data (on the Authorized Data tab).

Data Processing

Data Processing Metadata Publishing Testing

How Cloudera Data Flow Enables Successful Data Mesh Architectures

Cloudera

OCTOBER 7, 2021

Data and Metadata: Data inputs and data outputs produced based on the application logic. Also included, business and technical metadata, related to both data inputs / data outputs, that enable data discovery and achieving cross-organizational consensus on the definitions of data assets.

Metadata

Metadata Cost-Benefit Enterprise Interactive

Gartner Data & Analytics Summit – March 3-5 in Orlando Florida

Octopai

FEBRUARY 26, 2025

Gartner Data & Analytics Summit is a must-attend event for Data Leaders and experts. To kick off the conference season, Octopai and Cloudera will be attending and sponsoring the Gartner D&A Orlando event from March 3-5 in Orlando, Florida.

Data Analytics

Data Analytics Analytics Metadata Marketing

Disaster recovery strategies for Amazon MWAA – Part 1

AWS Big Data

JANUARY 16, 2024

Within Airflow, the metadata database is a core component storing configuration variables, roles, permissions, and DAG run histories. A healthy metadata database is therefore critical for your Airflow environment. The AWS Health Dashboard provides information about AWS Health events that can affect your account.

Strategy

Strategy Metadata Metrics Dashboards
