Metadata and Reference - Data Leaders Brief

Enriching metadata for accurate text-to-SQL generation for Amazon Athena

AWS Big Data

OCTOBER 14, 2024

Writing SQL queries requires not just remembering the SQL syntax rules, but also knowledge of the tables metadata, which is data about table schemas, relationships among the tables, and possible column values. Although LLMs can generate syntactically correct SQL queries, they still need the table metadata for writing accurate SQL query.

Metadata

Metadata Data Lake Modeling Data Warehouse

Expand data access through Apache Iceberg using Delta Lake UniForm on AWS

AWS Big Data

NOVEMBER 14, 2024

Under the hood, UniForm generates Iceberg metadata files (including metadata and manifest files) that are required for Iceberg clients to access the underlying data files in Delta Lake tables. Both Delta Lake and Iceberg metadata files reference the same data files. in Delta Lake public document. Appendix 1.

Metadata

Metadata Data Warehouse Big Data Data Lake

Run Apache XTable in AWS Lambda for background conversion of open table formats

AWS Big Data

NOVEMBER 26, 2024

Central to a transactional data lake are open table formats (OTFs) such as Apache Hudi , Apache Iceberg , and Delta Lake , which act as a metadata layer over columnar formats. For more examples and references to other posts, refer to the following GitHub repository. This post is one of multiple posts about XTable on AWS.

Metadata

Metadata Data Lake Snapshot Data Warehouse

Webinars

Agent Tooling: Connecting AI to Your Tools, Systems & Data

Automation, Evolved: Your New Playbook for Smarter Knowledge Work

Data Talks, CFOs Listen: Why Analytics Are Key To Better Spend Management

Mastering Apache Airflow® 3.0: What’s New (and What’s Next) for Data Orchestration

MORE WEBINARS

Manage concurrent write conflicts in Apache Iceberg on the AWS Glue Data Catalog

AWS Big Data

APRIL 8, 2025

However, commits can still fail if the latest metadata is updated after the base metadata version is established. Iceberg uses a layered architecture to manage table state and data: Catalog layer Maintains a pointer to the current table metadata file, serving as the single source of truth for table state.

Snapshot

Snapshot Management Metadata Big Data

Metadata is Like Packaging: Seeing Beyond the Library Card Metaphor

Ontotext

MARCH 19, 2021

way we package information has a lot to do with metadata. The somewhat conventional metaphor about metadata is the one of the library card. This metaphor has it that books are the data and library cards are the metadata helping us find what we need, want to know more about or even what we don’t know we were looking for.

Metadata

Metadata Publishing Enterprise Management

Write queries faster with Amazon Q generative SQL for Amazon Redshift

AWS Big Data

NOVEMBER 7, 2024

Amazon Q generative SQL for Amazon Redshift uses generative AI to analyze user intent, query patterns, and schema metadata to identify common SQL query patterns directly within Amazon Redshift, accelerating the query authoring process for users and reducing the time required to derive actionable data insights.

Metadata

Metadata Sales Data Warehouse Optimization

RDF-Star: Metadata Complexity Simplified

Ontotext

JUNE 10, 2021

This is a graph of millions of edges and vertices – in enterprise data management terms it is a giant piece of master/reference data. Not Every Graph is a Knowledge Graph: Schemas and Semantic Metadata Matter. open-world vs. closed-world assumptions). They aren’t concerned with publishing or integrating data.

Metadata

Metadata Cost-Benefit OLAP Modeling

Build a high-performance quant research platform with Apache Iceberg

AWS Big Data

JANUARY 9, 2025

Iceberg offers distinct advantages through its metadata layer over Parquet, such as improved data management, performance optimization, and integration with various query engines. Icebergs table format separates data files from metadata files, enabling efficient data modifications without full dataset rewrites.

Metadata

Metadata Snapshot Cost-Benefit Optimization

Best Practices for Metadata Management

Alation

JULY 19, 2021

What Is Metadata? Metadata is information about data. A clothing catalog or dictionary are both examples of metadata repositories. Indeed, a popular online catalog, like Amazon, offers rich metadata around products to guide shoppers: ratings, reviews, and product details are all examples of metadata.

Metadata

Metadata Management Data Governance Machine Learning

Use open table format libraries on AWS Glue 5.0 for Apache Spark

AWS Big Data

DECEMBER 4, 2024

For more details, refer to Iceberg Release 1.6.1. An Iceberg table’s metadata stores a history of snapshots, which are updated with each transaction. Over time, this creates multiple data files and metadata files as changes accumulate. You can set a time limit for how long to keep a reference when you create it.

Snapshot

Snapshot Metadata Data Lake Optimization

Streamline data discovery with precise technical identifier search in Amazon SageMaker Unified Studio

AWS Big Data

APRIL 9, 2025

Whether youre a data analyst seeking a specific metric or a data steward validating metadata compliance, this update delivers a more precise, governed, and intuitive search experience. Refer to the product documentation to learn more about how to set up metadata rules for subscription and publishing workflows.

Metadata

Metadata Metrics Cost-Benefit Data-driven

What Your Phone Number’s Metadata Means for Data Privacy

Smart Data Collective

AUGUST 11, 2023

One of the issues that we need to be aware of is the role of phone metadata. Our Phone Metadata Can Be a Threat to Our Data Privacy Data privacy protections against government surveillance often focus on communications content and exclude communications metadata. This can lead to some serious data privacy concerns.

Metadata

Metadata IT Management

Enhance data governance with enforced metadata rules in Amazon DataZone

AWS Big Data

NOVEMBER 20, 2024

We’re excited to announce a new feature in Amazon DataZone that offers enhanced metadata governance for your subscription approval process. With this update, domain owners can define and enforce metadata requirements for data consumers when they request access to data assets. Key benefits The feature benefits multiple stakeholders.

Metadata

Metadata Data Governance Metrics Marketing

The New O’Reilly Answers: The R in “RAG” Stands for “Royalties”

O'Reilly on Data

JUNE 14, 2024

At the same time, Miso went about an in-depth chunking and metadata-mapping of every book in the O’Reilly catalog to generate enriched vector snippet embeddings of each work. If the original Answers release was a LLM-driven retrieval engine, today’s new version of Answers is an LLM-driven research engine (in the truest sense).

Metadata

Metadata Publishing Data-driven Modeling

Recap of Amazon Redshift key product announcements in 2024

AWS Big Data

DECEMBER 17, 2024

We have enhanced data sharing performance with improved metadata handling, resulting in data sharing first query execution that is up to four times faster when the data sharing producers data is being updated. Launch summary Following is the launch summary which provides the announcement links and reference blogs for the key announcements.

Data Lake

Data Lake Data Warehouse Data-driven Optimization

Manage access controls in generative AI-powered search applications using Amazon OpenSearch Service and Amazon Cognito

AWS Big Data

NOVEMBER 19, 2024

Solution overview By combining the powerful vector search capabilities of OpenSearch Service with the access control features provided by Amazon Cognito , this solution enables organizations to manage access controls based on custom user attributes and document metadata. Refer to Service Quotas for more details.

Management

Management Metadata Manufacturing Testing

Security Reference Architecture Summary for Cloudera Data Platform

Cloudera

JANUARY 21, 2022

System metadata is reviewed and updated regularly. We’ve summarized the key security features of a CDP Private Cloud Base cluster and subsequent posts will go into more detail with reference implementation examples of all of the key features. . Auditing procedures keep track of who accesses the cluster (and how).

Data Processing

Data Processing Management Cost-Benefit Finance

A Reference Architecture for the Cloudera Private Cloud Base Data Platform

Cloudera

JULY 15, 2021

All three will be quorums of Zookeepers and HDFS Journal nodes to track changes to HDFS Metadata stored on the Namenodes. In summary we have provided a reference for the tuning and configuration of the host resources in order to maximise the performance and security of your cluster. Networking .

Data Processing

Data Processing Metadata Testing Management

How Eightfold AI implemented metadata security in a multi-tenant data analytics environment with Amazon Redshift

AWS Big Data

NOVEMBER 29, 2023

The Eightfold Talent Intelligence Platform integrates with Amazon Redshift metadata security to implement visibility of data catalog listing of names of databases, schemas, tables, views, stored procedures, and functions in Amazon Redshift. This post discusses restricting listing of data catalog metadata as per the granted permissions.

Metadata

Metadata Data Warehouse Analytics Data Analytics

What Is a Metadata Catalog? (And How it Can Dramatically Improve Your Data Accuracy)

Octopai

JANUARY 31, 2022

If you’re a mystery lover, I’m sure you’ve read that classic tale: Sherlock Holmes and the Case of the Deceptive Data, and you know how a metadata catalog was a key plot element. Let me tell you about metadata and cataloging.”. A metadata catalog, Holmes informed Guy, addresses all the benign reasons for inaccurate data.

Metadata

Metadata IT Unstructured Data IoT

Introducing Amazon MWAA micro environments for Apache Airflow

AWS Big Data

NOVEMBER 19, 2024

Pricing and availability Amazon MWAA pricing dimensions remains unchanged, and you only pay for what you use: The environment class Metadata database storage consumed Metadata database storage pricing remains the same. Refer to Amazon Managed Workflows for Apache Airflow Pricing for rates and more details.

Metadata

Metadata Cost-Benefit Metrics Optimization

Reduce your compute costs for stream processing applications with Kinesis Client Library 3.0

AWS Big Data

NOVEMBER 6, 2024

reduces the Amazon DynamoDB cost associated with KCL by optimizing read operations on the DynamoDB table storing metadata. KCL uses DynamoDB to store metadata such as shard-worker mapping and checkpoints. x benefits, refer to Use features of the AWS SDK for Java 2.x. Refer to Step 4 of Migrating from KCL 2.x x to KCL 3.x

Cost-Benefit

Cost-Benefit Metadata Optimization Publishing

Amazon SageMaker Lakehouse now supports attribute-based access control

AWS Big Data

APRIL 24, 2025

Ava defines the user attributes as static IAM tags that could also include attributes stored in the identity provider (IdP) or as session tags dynamically to represent the user metadata. For more details, refer to Tags for AWS Identity and Access Management resources and Pass session tags in AWS STS.

Sales

Sales Data Lake Management Data-driven

Introducing MongoDB Atlas metadata collection with AWS Glue crawlers

AWS Big Data

FEBRUARY 6, 2023

For instructions, refer to How to Set Up a MongoDB Cluster. Choose the table to view the schema and other metadata. Conclusion In this post, we showed how to set up an AWS Glue crawler to crawl over a MongoDB Atlas collection, gathering metadata and creating table records in the AWS Glue Data Catalog.

Metadata

Metadata Data Lake Machine Learning Big Data

Implement a custom subscription workflow for unmanaged Amazon S3 assets published with Amazon DataZone

AWS Big Data

DECEMBER 19, 2024

To learn more about working with events using EventBridge, refer to Events via Amazon EventBridge default bus. After you create the asset, you can add glossaries or metadata forms, but its not necessary for this post. We refer to this role as the instance-role throughout the post. Enter a name for the asset.

Publishing

Publishing Unstructured Data Metadata Data-driven

Integrate custom applications with AWS Lake Formation – Part 2

AWS Big Data

NOVEMBER 19, 2024

Unfiltered Table Metadata This tab displays the response of the AWS Glue API GetUnfilteredTableMetadata policies for the selected table. Get table data and metadata for this user to see how Lake Formation permissions are enforced and so the two users can see different data (on the Authorized Data tab).

Data Processing

Data Processing Metadata Publishing Testing

Accelerate SQL code migration from Google BigQuery to Amazon Redshift using BladeBridge

AWS Big Data

NOVEMBER 7, 2024

BladeBridge provides a configurable framework to seamlessly convert legacy metadata and code into more modern services such as Amazon Redshift. For more details, refer to the BladeBridge Analyzer Demo. Refer to this BladeBridge documentation to get more details on SQL and expression conversion.

Data Warehouse

Data Warehouse Reporting Big Data Data Lake

The Power of Graph Databases, Linked Data, and Graph Algorithms

Rocket-Powered Data Science

MARCH 10, 2020

The book is awesome, an absolute must-have reference volume, and it is free (for now, downloadable from Neo4j ). Any node and its relationship to a particular node becomes a type of contextual metadata for that particular note. Graph Algorithms book. Context may include time, location, related events, nearby entities, and more.

Metadata

Metadata Machine Learning Prescriptive Analytics ROI

Metadata Management and Data Governance with Cloudera SDX

Cloudera

JANUARY 26, 2024

This will allow a data office to implement access policies over metadata management assets like tags or classifications, business glossaries, and data catalog entities, laying the foundation for comprehensive data access control. First, a set of initial metadata objects are created by the data steward.

Metadata

Metadata Data Governance Management Finance

Do I Need a Data Catalog?

erwin

JUNE 26, 2020

Organizations with particularly deep data stores might need a data catalog with advanced capabilities, such as automated metadata harvesting to speed up the data preparation process. Three Types of Metadata in a Data Catalog. The metadata provides information about the asset that makes it easier to locate, understand and evaluate.

Metadata

Metadata Cost-Benefit Measurement Data-driven

Data Intelligence and Its Role in Combating Covid-19

erwin

MARCH 30, 2020

Unraveling Data Complexities with Metadata Management. Metadata management will be critical to the process for cataloging data via automated scans. Essentially, metadata management is the administration of data that describes other data, with an emphasis on associations and lineage. Data lineage to support impact analysis.

Metadata

Metadata IT Data Governance Data Quality

Introducing support for Apache Kafka on Raft mode (KRaft) with Amazon MSK clusters

AWS Big Data

MAY 29, 2024

Since its inception, Apache Kafka has depended on Apache Zookeeper for storing and replicating the metadata of Kafka brokers and topics. the Kafka community has adopted KRaft (Apache Kafka on Raft), a consensus protocol, to replace Kafka’s dependency on ZooKeeper for metadata management. For Metadata mode , select KRaft.

Metadata

Metadata Cost-Benefit Management Big Data

Modernize a legacy real-time analytics application with Amazon Managed Service for Apache Flink

AWS Big Data

OCTOBER 11, 2023

The second streaming data source constitutes metadata information about the call center organization and agents that gets refreshed throughout the day. For the template and setup information, refer to Test Your Streaming Data Solution with the New Amazon Kinesis Data Generator. We use two datasets in this post.

Management

Management Metadata Analytics Dashboards

Alation and Salesforce partner on data governance for Data Cloud

CIO Business Intelligence

SEPTEMBER 19, 2024

This enables companies to directly access key metadata (tags, governance policies, and data quality indicators) from over 100 data sources in Data Cloud, it said. Additional to that, we are also allowing the metadata inside of Alation to be read into these agents.”

Data Governance

Data Governance Metadata Unstructured Data Structured Data

The AWS Glue Data Catalog now supports storage optimization of Apache Iceberg tables

AWS Big Data

SEPTEMBER 12, 2024

Along with the Glue Data Catalog’s automated compaction feature, these storage optimizations can help you reduce metadata overhead, control storage costs, and improve query performance. The Glue Data Catalog monitors tables daily, removes snapshots from table metadata, and removes the data files and orphan files that are no longer needed.

Optimization

Optimization Snapshot Metadata Metrics

Demystify data sharing and collaboration patterns on AWS: Choosing the right tool for the job

AWS Big Data

OCTOBER 21, 2024

Let’s briefly describe the capabilities of the AWS services we referred above: AWS Glue is a fully managed, serverless, and scalable extract, transform, and load (ETL) service that simplifies the process of discovering, preparing, and loading data for analytics. Amazon Athena is used to query, and explore the data.

Sales

Sales Data-driven Data Processing Key Performance Indicator

Proposals for model vulnerability and security

O'Reilly on Data

MARCH 20, 2019

Data poisoning refers to someone systematically changing your training data to manipulate your model’s predictions. Watermarking is a term borrowed from the deep learning security literature that often refers to putting special pixels into an image to trigger a desired outcome from your model. Data poisoning attacks. Watermark attacks.

Modeling

Modeling Machine Learning Predictive Modeling Consulting

Visualize Amazon DynamoDB insights in Amazon QuickSight using the Amazon Athena DynamoDB connector and AWS Glue

AWS Big Data

NOVEMBER 17, 2023

These include internet-scale web and mobile applications, low-latency metadata stores, high-traffic retail websites, Internet of Things (IoT) and time series data, online gaming, and more. Table metadata, such as column names and data types, is stored using the AWS Glue Data Catalog. To create an S3 bucket, refer to Creating a bucket.

Visualization

Visualization Metadata Testing Internet of Things

Migrate from Google BigQuery to Amazon Redshift using AWS Glue and Custom Auto Loader Framework

AWS Big Data

JUNE 2, 2023

We use AWS Glue , a fully managed, serverless, ETL (extract, transform, and load) service, and the Google BigQuery Connector for AWS Glue (for more information, refer to Migrating data from Google BigQuery to Amazon S3 using AWS Glue custom connectors ). If you don’t have one, refer to Amazon Redshift Serverless. An S3 bucket.

Metadata

Metadata Data Warehouse Big Data Analytics

Amazon OpenSearch Service launches flow builder to empower rapid AI search innovation

AWS Big Data

MAY 2, 2025

We will use generative multimodal AI to modernize image search, eliminating the need for labor to maintain image tags and other metadata. Without any image metadata, OpenSearch finds images of sunset-colored dresses, and responds with accurate and colorful commentary. Were looking for a low-maintenance image search solution.

Machine Learning

Machine Learning Visualization Dashboards Metadata

Disaster recovery strategies for Amazon MWAA – Part 2

AWS Big Data

JUNE 17, 2024

Backup and restore architecture The backup and restore strategy involves periodically backing up Amazon MWAA metadata to Amazon Simple Storage Service (Amazon S3) buckets in the primary Region. Refer to the detailed deployment steps in the README file to deploy it in your own accounts. The steps are as follows: [1.a]

Strategy

Strategy Metadata Recreation/Entertainment Metrics

Deep automation in machine learning

O'Reilly on Data

DECEMBER 19, 2018

Metadata analysis makes it possible to build data catalogs, which in turn allow humans to discover data that’s relevant to their projects. Both refer to the source of the data: where does the data come from, how was it gathered, and how was it modified along the way? Companies are investing in these foundational technologies.

Machine Learning

Machine Learning Software Metadata Testing

Amazon DataZone introduces OpenLineage-compatible data lineage visualization in preview

AWS Big Data

JULY 8, 2024

It also offers reference implementation of an object model to persist metadata along with integration to major data and analytics tools. Lineage form types – Form types, or facets , provide additional metadata or context about lineage entities or events, enabling richer and more descriptive lineage information. Choose Run.

Visualization

Visualization Metadata Publishing Sales

Configure SAML federation with Amazon OpenSearch Serverless and Keycloak

AWS Big Data

JULY 24, 2024

aoss:UpdateSecurityConfig – Modify a given SAML provider configuration, including the XML metadata. For more details, refer to Considerations. IdP Metadata by choosing the link on the Realm settings page. You need this metadata, when you create the SAML provider in OpenSearch Serverless. Choose Create SAML provider.

Dashboards

Dashboards Metadata Visualization Consulting

Enriching metadata for accurate text-to-SQL generation for Amazon Athena

Expand data access through Apache Iceberg using Delta Lake UniForm on AWS

Webinars

Trending Sources

Run Apache XTable in AWS Lambda for background conversion of open table formats

Webinars

Manage concurrent write conflicts in Apache Iceberg on the AWS Glue Data Catalog

Metadata is Like Packaging: Seeing Beyond the Library Card Metaphor

Write queries faster with Amazon Q generative SQL for Amazon Redshift

RDF-Star: Metadata Complexity Simplified

Build a high-performance quant research platform with Apache Iceberg

Best Practices for Metadata Management

Use open table format libraries on AWS Glue 5.0 for Apache Spark

Streamline data discovery with precise technical identifier search in Amazon SageMaker Unified Studio

What Your Phone Number’s Metadata Means for Data Privacy

Enhance data governance with enforced metadata rules in Amazon DataZone

The New O’Reilly Answers: The R in “RAG” Stands for “Royalties”

Recap of Amazon Redshift key product announcements in 2024

Manage access controls in generative AI-powered search applications using Amazon OpenSearch Service and Amazon Cognito

Security Reference Architecture Summary for Cloudera Data Platform

A Reference Architecture for the Cloudera Private Cloud Base Data Platform

How Eightfold AI implemented metadata security in a multi-tenant data analytics environment with Amazon Redshift

What Is a Metadata Catalog? (And How it Can Dramatically Improve Your Data Accuracy)

Introducing Amazon MWAA micro environments for Apache Airflow

Reduce your compute costs for stream processing applications with Kinesis Client Library 3.0

Amazon SageMaker Lakehouse now supports attribute-based access control

Introducing MongoDB Atlas metadata collection with AWS Glue crawlers

Implement a custom subscription workflow for unmanaged Amazon S3 assets published with Amazon DataZone

Integrate custom applications with AWS Lake Formation – Part 2

Accelerate SQL code migration from Google BigQuery to Amazon Redshift using BladeBridge

The Power of Graph Databases, Linked Data, and Graph Algorithms

Metadata Management and Data Governance with Cloudera SDX

Do I Need a Data Catalog?

Data Intelligence and Its Role in Combating Covid-19

Introducing support for Apache Kafka on Raft mode (KRaft) with Amazon MSK clusters

Modernize a legacy real-time analytics application with Amazon Managed Service for Apache Flink

Alation and Salesforce partner on data governance for Data Cloud

The AWS Glue Data Catalog now supports storage optimization of Apache Iceberg tables

Demystify data sharing and collaboration patterns on AWS: Choosing the right tool for the job

Proposals for model vulnerability and security

Visualize Amazon DynamoDB insights in Amazon QuickSight using the Amazon Athena DynamoDB connector and AWS Glue

Migrate from Google BigQuery to Amazon Redshift using AWS Glue and Custom Auto Loader Framework

Amazon OpenSearch Service launches flow builder to empower rapid AI search innovation

Disaster recovery strategies for Amazon MWAA – Part 2

Deep automation in machine learning

Amazon DataZone introduces OpenLineage-compatible data lineage visualization in preview

Configure SAML federation with Amazon OpenSearch Serverless and Keycloak

Stay Connected