Metadata and Modeling - Data Leaders Brief

Enriching metadata for accurate text-to-SQL generation for Amazon Athena

AWS Big Data

OCTOBER 14, 2024

Writing SQL queries requires not only remembering SQL syntax rules but also knowing the tables' metadata: data about table schemas, relationships among the tables, and possible column values. Generative AI models can translate natural language questions into valid SQL queries, a capability known as text-to-SQL generation.

Metadata

Metadata Data Lake Modeling Data Warehouse

Octopai Acquisition Enhances Metadata Management to Trust Data Across Entire Data Estate

Cloudera

NOVEMBER 13, 2024

This acquisition delivers access to trusted data so organizations can build reliable AI models and applications by combining data from anywhere in their environment. It leverages knowledge graphs to keep track of all the data sources and data flows, using AI to fill the gaps so you have the most comprehensive metadata management solution.

Metadata

Metadata Management Data Governance Data-driven

Underlying Engineering Behind Alexa’s Contextual ASR

Analytics Vidhya

SEPTEMBER 17, 2022

Conventionally, an automatic speech recognition (ASR) system leverages a single statistical language model to rectify ambiguities, regardless of context. Any type of contextual information, like device context, conversational context, and metadata, […].

Metadata

Metadata Statistics Data Science Publishing

What are model governance and model operations?

O'Reilly on Data

JUNE 19, 2019

A look at the landscape of tools for building and deploying robust, production-ready machine learning models. We are also beginning to see researchers share sample code written in popular open source libraries, and some even share pre-trained models. Model development. Model governance. Source: Ben Lorica.

Modeling

Modeling Machine Learning Testing Metrics

How to Operationalize Data From Multiple Sources to Deliver Actionable Insights

Speaker: Speakers from SafeGraph, Facteus, AWS Data Exchange, SimilarWeb, and AtScale

Join this webinar to learn how to blend Geospatial data (from SafeGraph), Financial Market and Transaction Data (from Facteus), & Global Websites Visit and Engagement KPIs (from SimilarWeb) to enrich, augment, and improve self-service analytics as well as predictive models.

Metadata

Neptune.ai - A Metadata Store for MLOps

Analytics Vidhya

JANUARY 27, 2022

A centralized location for research and production teams to govern models and experiments by storing metadata throughout the ML model lifecycle. When working on a machine learning project, it’s one thing to receive impressive results from a single model-training run. Keeping track of […].

Metadata

Metadata Machine Learning Data Science Publishing

Knowledge Graphs are Critical to Data Intelligence and AI

David Menninger's Analyst Perspectives

MAY 22, 2025

These catalogs combine technical and business metadata and data governance capabilities with knowledge graph functionality to deliver a holistic, business-level view of data production and consumption. I recently described how business data catalogs are evolving into data intelligence catalogs.

Metadata

Metadata Enterprise Data-driven Publishing

Manage concurrent write conflicts in Apache Iceberg on the AWS Glue Data Catalog

AWS Big Data

APRIL 8, 2025

We will explore Iceberg's concurrency model, examine common conflict scenarios, and provide practical implementation patterns of both automatic retry mechanisms and situations requiring custom conflict resolution logic for building resilient data pipelines. Generate new metadata files. Commit the metadata files to the catalog.

Snapshot

Snapshot Management Metadata Big Data

What is SCOR? A model to improve supply chain management

CIO Business Intelligence

MAY 20, 2025

That's where the SCOR model comes in. What is the SCOR model? The SCOR model is designed to evaluate your supply chain for effectiveness and efficiency of sales and operational planning (S&OP). What is the main focus of the SCOR model? model to further address the growing need for digitization of supply chains.

Modeling

Modeling Management Metrics Measurement

Proposals for model vulnerability and security

O'Reilly on Data

MARCH 20, 2019

Apply fair and private models, white-hat and forensic model debugging, and common sense to protect machine learning models from malicious actors. Like many others, I’ve known for some time that machine learning models themselves could pose security risks. This is like a denial-of-service (DoS) attack on your model itself.

Modeling

Modeling Machine Learning Predictive Modeling Consulting

Specialized tools for machine learning development and model governance are becoming essential

O'Reilly on Data

APRIL 2, 2019

Model packaging: companies are using MLflow to incorporate custom logic and dependencies as part of a model’s package abstraction before deploying it to their production environment (example: a recommendation system might be programmed to not display certain images to minors). Model governance. there aren’t enough of them.

Machine Learning

Machine Learning Modeling Data Science Software

Build a high-performance quant research platform with Apache Iceberg

AWS Big Data

JANUARY 9, 2025

Iceberg offers distinct advantages through its metadata layer over Parquet, such as improved data management, performance optimization, and integration with various query engines. Iceberg's table format separates data files from metadata files, enabling efficient data modifications without full dataset rewrites.

Metadata

Metadata Snapshot Cost-Benefit Optimization

The Symbiotic Relationship Between Data Governance and AI

David Menninger's Analyst Perspectives

MAY 14, 2025

GenAI models also learn from historical data, which may contain biases. Enterprises need robust mechanisms to detect and rectify bias during model training and deployment. Our research illustrates a gap between awareness of the need for governance in AI initiatives and policies to govern AI and machine learning models.

Data Governance

Data Governance Data Quality Data-driven Metadata

SAP Datasphere Powers Business at the Speed of Data

Rocket-Powered Data Science

MARCH 20, 2023

Datasphere goes beyond the “big three” data usage end-user requirements (ease of discovery, access, and delivery) to include data orchestration (data ops and data transformations) and business data contextualization (semantics, metadata, catalog services). As you would guess, maintaining context relies on metadata.

Data Warehouse

Data Warehouse Metadata Digital Transformation Machine Learning

Enterprises can gain an edge with Metadata Management

CIO Business Intelligence

SEPTEMBER 6, 2024

Central to this is metadata management, a critical component for driving future success. AI and ML need large amounts of accurate data for companies to get the most out of the technology. Let’s dive into what that looks like, what workarounds some IT teams use today, and why metadata management is the key to success.

Metadata

Metadata Enterprise Management Cost-Benefit

Monitoring Apache Iceberg metadata layer using AWS Lambda, AWS Glue, and AWS CloudWatch

AWS Big Data

JULY 29, 2024

In this blog post, we’ll discuss how the metadata layer of Apache Iceberg can be used to make data lakes more efficient. You will learn about an open-source solution that can collect important metrics from the Iceberg metadata layer. This ensures that each change is tracked and reversible, enhancing data governance and auditability.

Metadata

Metadata Snapshot Data Lake Metrics

Run Apache XTable in AWS Lambda for background conversion of open table formats

AWS Big Data

NOVEMBER 26, 2024

Central to a transactional data lake are open table formats (OTFs) such as Apache Hudi, Apache Iceberg, and Delta Lake, which act as a metadata layer over columnar formats. XTable isn’t a new table format but provides abstractions and tools to translate the metadata associated with existing formats.

Metadata

Metadata Data Lake Snapshot Data Warehouse

The state of data quality in 2020

O'Reilly on Data

FEBRUARY 11, 2020

These include the basics, such as metadata creation and management, data provenance, data lineage, and other essentials. They’re still struggling with the basics: tagging and labeling data, creating (and managing) metadata, managing unstructured data, etc. They don’t have the resources they need to clean up data quality problems.

Data Quality

Data Quality Metadata Data Governance Publishing

Bridging the gap between mainframe data and hybrid cloud environments

CIO Business Intelligence

FEBRUARY 27, 2025

According to a study from Rocket Software and Foundry, 76% of IT decision-makers say challenges around accessing mainframe data and contextual metadata are a barrier to mainframe data usage, while 64% view integrating mainframe data with cloud data sources as the primary challenge.

Metadata

Metadata Data Lake Cost-Benefit Forecasting

Write queries faster with Amazon Q generative SQL for Amazon Redshift

AWS Big Data

NOVEMBER 7, 2024

Amazon Q generative SQL for Amazon Redshift uses generative AI to analyze user intent, query patterns, and schema metadata to identify common SQL query patterns directly within Amazon Redshift, accelerating the query authoring process for users and reducing the time required to derive actionable data insights.

Metadata

Metadata Sales Data Warehouse Optimization

How EUROGATE established a data mesh architecture using Amazon DataZone

AWS Big Data

JANUARY 15, 2025

EUROGATE's data science team aims to create machine learning models that integrate key data sources from various AWS accounts, allowing for training and deployment across different container terminals. From here, the metadata is published to Amazon DataZone by using AWS Glue Data Catalog. This process is shown in the following figure.

IoT

IoT Machine Learning Metadata Data-driven

The New O’Reilly Answers: The R in “RAG” Stands for “Royalties”

O'Reilly on Data

JUNE 14, 2024

Generative AI models are trained on large repositories of information and media. They are then able to take in prompts and produce outputs based on the statistical weights of the pretrained models of those corpora. The newest Answers release is again built with an open source model—in this case, Llama 3.

Metadata

Metadata Publishing Data-driven Modeling

How ANZ Institutional Division built a federated data platform to enable their domain teams to build data products to support business outcomes

AWS Big Data

DECEMBER 4, 2024

This model balances node or domain-level autonomy with enterprise-level oversight, creating a scalable and consistent framework across ANZ. The Institutional Data & AI platform adopts a federated approach to data while centralizing the metadata to facilitate simpler discovery and sharing of data products.

Metadata

Metadata Data Governance Data Quality Data-driven

Comprehensive data management for AI: The next-gen data management engine that will drive AI to new heights

CIO Business Intelligence

NOVEMBER 19, 2024

You pull an open-source large language model (LLM) to train on your corporate data so that the marketing team can build better assets, and the customer service team can provide customer-facing chatbots. You build your model, but the history and context of the data you used are lost, so there is no way to trace your model back to the source.

Management

Management Unstructured Data Deep Learning Metadata

How BMW streamlined data access using AWS Lake Formation fine-grained access control

AWS Big Data

OCTOBER 29, 2024

The CDH is used to create, discover, and consume data products through a central metadata catalog, while enforcing permission policies and tightly integrating data engineering, analytics, and machine learning services to streamline the user journey from data to insight.

Data Lake

Data Lake Sales Metadata Machine Learning

When is data too clean to be useful for enterprise AI?

CIO Business Intelligence

NOVEMBER 27, 2024

Data quality for AI needs to cover bias detection, infringement prevention, skew detection in data for model features, and noise detection. Not all columns are equal, so you need to prioritize cleaning data features that matter to your model, and your business outcomes. asks Friedman.

Enterprise

Enterprise Data Quality Structured Data Modeling

Automating ethics

O'Reilly on Data

MARCH 22, 2019

When the model for spam detection is systematically wrong, users can correct it. For example, we wouldn’t want real estate agents “correcting” a model to recommend houses based on race or religion; and we could even discuss whether similar behavior would be appropriate for spam detection. It's possible to abuse or to game any solution.

Metadata

Metadata Advertising Insurance Modeling

PyCaret 2.2: Efficient Pipelines for Model Development

Domino Data Lab

JANUARY 11, 2021

Even for experienced developers and data scientists, the process of developing a model could involve stringing together many steps from many packages, in ways that might not be as elegant or efficient as one might like. the experience is still rooted in the same goal: simple efficiency for the whole model development lifecycle.

Modeling

Modeling Metrics Data Science Testing

The Power of Graph Databases, Linked Data, and Graph Algorithms

Rocket-Powered Data Science

MARCH 10, 2020

And this: perhaps the most powerful node in a graph model for real-world use cases might be “context”. How does one express “context” in a data model? After all, the standard relational model of databases instantiated these types of relationships in its very foundation decades ago: the ERD (Entity-Relationship Diagram).

Metadata

Metadata Machine Learning Prescriptive Analytics ROI

Deep automation in machine learning

O'Reilly on Data

DECEMBER 19, 2018

We need to do more than automate model building with autoML; we need to automate tasks at every stage of the data pipeline. There is no GitHub for data, though we are starting to see version control projects for machine learning models, such as DVC. Automation is more than model building. Toward a sustainable ML practice.

Machine Learning

Machine Learning Software Metadata Testing

Data’s dark secret: Why poor quality cripples AI and growth

CIO Business Intelligence

APRIL 8, 2025

These strategies, such as investing in AI-powered cleansing tools and adopting federated governance models, not only address the current data quality challenges but also pave the way for improved decision-making, operational efficiency and customer satisfaction. Data fabric: a metadata-rich integration layer across distributed systems.

Data Quality

Data Quality Data-driven Key Performance Indicator Metadata

Manage access controls in generative AI-powered search applications using Amazon OpenSearch Service and Amazon Cognito

AWS Big Data

NOVEMBER 19, 2024

Solution overview: By combining the powerful vector search capabilities of OpenSearch Service with the access control features provided by Amazon Cognito, this solution enables organizations to manage access controls based on custom user attributes and document metadata. If you don’t already have an AWS account, you can create one.

Management

Management Metadata Manufacturing Testing

Empower financial analytics by creating structured knowledge bases using Amazon Bedrock and Amazon Redshift

AWS Big Data

MAY 20, 2025

Enable Amazon Bedrock large language model (LLM) access for Amazon Nova Pro. Then complete the following steps to enable model access in Amazon Bedrock: On the Amazon Bedrock console, in the navigation pane, choose Model access. Choose Enable specific models. The following diagram provides an overview of the solution.

Structured Data

Structured Data Data Warehouse Analytics Finance

Data Insights for Everyone — The Semantic Layer to the Rescue

Rocket-Powered Data Science

SEPTEMBER 20, 2021

They realized that the search results would probably not provide an answer to my question, but the results would simply list websites that included my words on the page or in the metadata tags: “Texas”, “Cows”, “How”, etc. The semantic layer bridges the gaps between the data cloud, the decision-makers, and the data science modelers.

Data Science

Data Science Forecasting Business Intelligence Sales

Recap of Amazon Redshift key product announcements in 2024

AWS Big Data

DECEMBER 17, 2024

We have enhanced data sharing performance with improved metadata handling, resulting in data sharing first query execution that is up to four times faster when the data sharing producer's data is being updated. Lakehouse allows you to use preferred analytics engines and AI models of your choice with consistent governance across all your data.

Data Lake

Data Lake Data Warehouse Data-driven Optimization

Are You Content with Your Organization’s Content Strategy?

Rocket-Powered Data Science

JULY 6, 2021

This is accomplished through tags, annotations, and metadata (TAM). Smart content includes labeled (tagged, annotated) metadata (TAM). The key to success is to start enhancing and augmenting content management systems (CMS) with additional features: semantic content and context. Collect, curate, and catalog (i.e.,

Strategy

Strategy Machine Learning Metadata Knowledge Discovery

What you need to know about product management for AI

O'Reilly on Data

MARCH 31, 2020

Instead of writing code with hard-coded algorithms and rules that always behave in a predictable manner, ML engineers collect a large number of examples of input and output pairs and use them as training data for their models. The model is produced by code, but it isn’t code; it’s an artifact of the code and the training data.

Management

Management Machine Learning Experimentation Metrics

Copyright, AI, and Provenance

O'Reilly on Data

DECEMBER 12, 2023

If the output of a model can’t be owned by a human, who (or what) is responsible if that output infringes existing copyright? In an article in The New Yorker, Jaron Lanier introduces the idea of data dignity, which implicitly distinguishes between training a model and generating output using a model.

Modeling

Modeling Sales Software Statistics

Have we reached the end of ‘too expensive’ for enterprise software?

CIO Business Intelligence

JANUARY 9, 2025

Generative artificial intelligence (genAI) and in particular large language models (LLMs) are changing the way companies develop and deliver software. The commodity effect of LLMs over specialized ML models: One of the most notable transformations generative AI has brought to IT is the democratization of AI capabilities.

Software

Software Enterprise Key Performance Indicator Machine Learning

Introducing Amazon MWAA micro environments for Apache Airflow

AWS Big Data

NOVEMBER 19, 2024

Additionally, customers adopting a federated deployment model find it challenging to provide isolated environments for different teams or departments, and at the same time optimize cost. It’s essential to monitor key metrics such as metadata database memory usage, and CPU utilization of the worker/scheduler hybrid container.

Metadata

Metadata Cost-Benefit Metrics Optimization

Cloudera and Snowflake Partner to Deliver the Most Comprehensive Open Data Lakehouse

Cloudera

OCTOBER 23, 2024

In August, we wrote about how in a future where distributed data architectures are inevitable, unifying and managing operational and business metadata is critical to successfully maximizing the value of data, analytics, and AI. It is a critical feature for delivering unified access to data in distributed, multi-engine architectures.

Metadata

Metadata Data Lake Dashboards Interactive

Demystify data sharing and collaboration patterns on AWS: Choosing the right tool for the job

AWS Big Data

OCTOBER 21, 2024

As a producer, you can also monetize your data through the subscription model using AWS Data Exchange. To achieve this, they plan to use machine learning (ML) models to extract insights from data. Business analysts enhance the data with business metadata/glossaries and publish the same as data assets or data products.

Sales

Sales Data-driven Data Processing Key Performance Indicator

Business Strategies for Deploying Disruptive Tech: Generative AI and ChatGPT

Rocket-Powered Data Science

FEBRUARY 15, 2023

While generative AI has been around for several years, the arrival of ChatGPT (a conversational AI tool for all business occasions, built and trained from large language models) has been like a brilliant torch brought into a dark room, illuminating many previously unseen opportunities. So, if you have 1 trillion data points (e.g.,

Strategy

Strategy Experimentation Uncertainty Machine Learning

AI adoption in the enterprise 2020

O'Reilly on Data

MARCH 18, 2020

Whether it’s controlling for common risk factors—bias in model development, missing or poorly conditioned data, the tendency of models to degrade in production—or instantiating formal processes to promote data governance, adopters will have their work cut out for them as they work to establish reliable AI production lines.

Enterprise

Enterprise Deep Learning Data Governance Risk
