Additionally, multiple copies of the same data locked in proprietary systems contribute to version control issues, redundancies, staleness, and management headaches. It leverages knowledge graphs to keep track of all the data sources and data flows, using AI to fill the gaps so you have the most comprehensive metadata management solution.
Managing metadata across tools and teams is a growing challenge for organizations building modern data and AI platforms. Teams use Collibra to curate business context, classify sensitive data, and manage access to information in line with compliance requirements. This post was co-written with Vasiliki Nikolopoulou from Collibra.
It helps you track, manage, and deploy models, and it manages the entire machine learning lifecycle. MLflow also manages models after deployment. Managing ML projects without MLflow is challenging. Reproducibility: MLflow standardizes how experiments are managed.
Writing SQL queries requires not just remembering SQL syntax rules, but also knowledge of the table metadata: data about table schemas, relationships among the tables, and possible column values. Although LLMs can generate syntactically correct SQL queries, they still need the table metadata to write accurate ones.
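A minimal, stdlib-only sketch of that idea: serialize the table metadata (schemas, relationships, and value hints) into the prompt so the model can ground its SQL. The table and column names below are hypothetical examples, not from any specific system.

```python
# Hypothetical table metadata: schemas, value hints, and join relationships.
TABLE_METADATA = {
    "orders": {
        "columns": {"order_id": "INT", "customer_id": "INT", "status": "VARCHAR"},
        "notes": "status is one of 'open', 'shipped', 'cancelled'",
    },
    "customers": {
        "columns": {"customer_id": "INT", "country": "VARCHAR"},
        "notes": "customer_id joins to orders.customer_id",
    },
}

def build_sql_prompt(question: str, metadata: dict) -> str:
    """Render table schemas, relationships, and value hints into an LLM prompt."""
    lines = ["You are a SQL assistant. Use only these tables:"]
    for table, info in metadata.items():
        cols = ", ".join(f"{c} {t}" for c, t in info["columns"].items())
        lines.append(f"TABLE {table} ({cols})  -- {info['notes']}")
    lines.append(f"Question: {question}")
    lines.append("Return a single syntactically valid SQL query.")
    return "\n".join(lines)

prompt = build_sql_prompt("How many open orders per country?", TABLE_METADATA)
print(prompt)
```

The resulting prompt carries everything the excerpt says the model needs: the schema, the join path between the two tables, and the possible status values.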
Install them with: pip install pypdf langchain. If you want to manage dependencies neatly, create a requirements.txt file containing pypdf, langchain, and requests, and run: pip install -r requirements.txt. Step 1: Set Up the PDF Parser (parser.py). The core class CustomPDFParser uses pypdf to extract text and metadata from each PDF page.
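A hedged sketch of what CustomPDFParser might look like (the class name follows the post, but the exact methods and metadata fields are assumptions). The pypdf import is deferred into the method so the whitespace-cleaning helper works even before pypdf is installed.

```python
class CustomPDFParser:
    """Extract text and per-page metadata from a PDF using pypdf (sketch)."""

    @staticmethod
    def clean_text(text: str) -> str:
        # Collapse the runs of whitespace that PDF text extraction leaves behind.
        return " ".join(text.split())

    def parse(self, path: str) -> list[dict]:
        # Deferred import: requires `pip install pypdf`.
        from pypdf import PdfReader
        reader = PdfReader(path)
        pages = []
        for number, page in enumerate(reader.pages, start=1):
            pages.append({
                "page_number": number,
                "text": self.clean_text(page.extract_text() or ""),
                "metadata": {"source": path, "total_pages": len(reader.pages)},
            })
        return pages

print(CustomPDFParser.clean_text("hello   world\n  again"))
```

Each page becomes a dict of text plus metadata, which is the shape downstream chunking and indexing steps typically expect.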
The landscape of big data management has been transformed by the rising popularity of open table formats such as Apache Iceberg, Apache Hudi, and Linux Foundation Delta Lake. Both Delta Lake and Iceberg metadata files reference the same data files.
It is appealing to migrate from self-managed OpenSearch and Elasticsearch clusters on legacy versions to Amazon OpenSearch Service to enjoy the ease of use, native integration with AWS services, and rich features from the open-source environment (OpenSearch is now part of the Linux Foundation).
The second use case enables the creation of reports containing shop floor key metrics for different management levels. In addition, the team aligned on business metadata attributes that would help with data discovery. The data solution uses Amazon DataZone glossaries and metadata forms to provide business context to their data.
The CDH is used to create, discover, and consume data products through a central metadata catalog, while enforcing permission policies and tightly integrating data engineering, analytics, and machine learning services to streamline the user journey from data to insight. This led to inefficiencies in data governance and access control.
We have enhanced data sharing performance with improved metadata handling, resulting in first query execution for data sharing that is up to four times faster when the data sharing producer's data is being updated. You can also create new data lake tables using Redshift Managed Storage (RMS) as a native storage option.
Amazon DataZone, a data management service, helps you catalog, discover, share, and govern data stored across AWS, on-premises systems, and third-party sources. This Lambda function contains the logic to manage access policies for the subscribed unmanaged asset, automating the subscription process for unstructured S3 assets.
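As a hedged illustration of the kind of policy logic such a Lambda function might contain (this is a sketch, not the DataZone implementation; the bucket, prefix, and ARN are hypothetical), here is a helper that builds an S3 bucket-policy statement granting a subscriber read access to an asset prefix:

```python
import json

def build_read_statement(bucket: str, prefix: str, principal_arn: str) -> dict:
    """Build one S3 bucket-policy statement granting read access to a prefix."""
    return {
        "Sid": "DataZoneSubscriptionRead",
        "Effect": "Allow",
        "Principal": {"AWS": principal_arn},
        "Action": ["s3:GetObject"],
        "Resource": [f"arn:aws:s3:::{bucket}/{prefix}*"],
    }

statement = build_read_statement(
    "analytics-bucket",            # hypothetical bucket
    "assets/reports/",             # hypothetical asset prefix
    "arn:aws:iam::111122223333:role/SubscriberRole",  # hypothetical subscriber role
)
print(json.dumps(statement, indent=2))
```

A real automation would merge this statement into the bucket's existing policy and remove it again when the subscription is revoked.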
Let’s briefly describe the capabilities of the AWS services we referred to above: AWS Glue is a fully managed, serverless, and scalable extract, transform, and load (ETL) service that simplifies the process of discovering, preparing, and loading data for analytics. As stated earlier, the first step involves data ingestion.
This isn’t just about making data management effortless—it’s about using AI to make your data work harder for you, unlocking insights that might otherwise remain hidden, and enabling everyone in your organization to work with data confidently, regardless of their technical expertise. Having confidence in your data is key.
However, managing schema evolution at scale presents significant challenges. To address this challenge, this post demonstrates how to build such a solution by combining Amazon Simple Storage Service (Amazon S3) for data storage, AWS Glue Data Catalog for schema management, and Amazon Athena for one-time querying.
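The schema-management side of that solution can be illustrated with a small sketch (purely an assumption about how additive schema evolution might be reconciled before updating a catalog table definition; this is not the AWS Glue API):

```python
def merge_schemas(current: dict, incoming: dict) -> dict:
    """Union of column definitions; new columns are additive, but a type
    change on an existing column is rejected as a breaking evolution."""
    merged = dict(current)
    for column, col_type in incoming.items():
        if column in merged and merged[column] != col_type:
            raise ValueError(
                f"type conflict on {column}: {merged[column]} vs {col_type}"
            )
        merged[column] = col_type
    return merged

# Hypothetical example: a new batch adds a user_agent column.
v1 = {"event_id": "string", "ts": "timestamp"}
v2 = {"event_id": "string", "ts": "timestamp", "user_agent": "string"}
print(merge_schemas(v1, v2))
```

In a pipeline like the one described, the merged schema would then be written back to the Data Catalog so Athena queries see the superset of columns across all partitions.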
Open table formats are emerging in the rapidly evolving domain of big data management, fundamentally altering the landscape of data storage and analysis. The adoption of open table formats is a crucial consideration for organizations looking to optimize their data management practices and extract maximum value from their data.
It combines the flexibility and scalability of data lake storage with the data analytics, data governance, and data management functionality of the data warehouse. They reduce data management effort and overhead by automating some of the most tedious lakehouse maintenance tasks. Go sign up for our 5-day trial here to see for yourself.
Combine data processing, AI analysis, and professional reporting without jumping between tools or managing complex infrastructure. Integration with Feature Stores: Connect the workflow output to feature stores like Feast or Tecton for automated feature pipeline creation and management.
Zero-ETL is a set of fully managed integrations by AWS that minimizes the need to build ETL data pipelines. We take care of the ETL for you by automating the creation and management of data replication. Zero-ETL provides service-managed replication. Glue ETL offers customer-managed data ingestion. What is zero-ETL?
The post My Take on the 2024 Gartner® Critical Capabilities for Data Integration Tools Report appeared first on Data Management Blog - Data Integration and Modern Data Management Articles, Analysis and Information.
This API-first approach offers several advantages: you get access to cutting-edge capabilities without managing infrastructure, you can experiment with different models quickly, and you can focus on application logic rather than model implementation. Understanding Model Capabilities : Each foundation model excels in different areas.
Third, some services require you to set up and manage compute resources used for federated connectivity, and capabilities like connection testing and data preview aren't available in all services. This approach simplifies your data journey and helps you meet your security requirements. For Database, enter your database name. Choose Add data.
Inevitably, the majority of companies will find themselves managing distributed systems, often in multiple clouds and on-premises. Cloudera's investment in and support for open metadata standards, our true hybrid architecture, and our native Spark offering for Iceberg combine to make us the ideal Iceberg data lakehouse.
Today, they play a critical role in syncing with customer applications, enabling the ability to manage concurrent data operations while maintaining the integrity and consistency of information. By using features like Iceberg's compaction, OTFs streamline maintenance, making it straightforward to manage object and metadata versioning at scale.
Today, organizations look to data and to technology to help them understand historical results, and predict the future needs of the enterprise to manage everything from suppliers and supplies to new locations, new products and services, hiring, training and investments. But too much data can also create issues.
Will the new creative, diverse and scalable data pipelines you are building also incorporate the AI governance guardrails needed to manage and limit your organizational risk? Metadata is the basis of trust for data forensics as we answer the questions of fact or fiction when it comes to the data we see.
schema.yml: YAML file defining metadata, tests, and descriptions for the models in this directory. schema.yml: YAML file defining metadata, tests, and descriptions for the staging models. With Dagster, you can easily manage diverse data operations across your ecosystem. The project tree also includes setup.py and an orchestration directory containing assets.py and __init__.py.
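A schema.yml of the kind described might look like this (the model and column names are illustrative, not from the original project):

```yaml
# Hypothetical schema.yml for a staging directory
version: 2

models:
  - name: stg_orders
    description: "Staging model for raw order events"
    columns:
      - name: order_id
        description: "Primary key of the order"
        tests:
          - unique
          - not_null
      - name: status
        description: "Order lifecycle state"
        tests:
          - accepted_values:
              values: ["open", "shipped", "cancelled"]
```

Keeping descriptions and tests next to each model is what lets downstream tools surface this metadata in catalogs and lineage views.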
We have ingestion engineers, analytic engineers, stewards, governors, modelers, owners, scientists, product managers, compliance officers, and executives. Data Governance Teams: Data Governance professionals employ quality testing as a means to enhance data catalogs with high-quality metadata. But it also introduces a problem.
The post The R in RAG appeared first on Data Management Blog - Data Integration and Modern Data Management Articles, Analysis and Information. Many know that it stands for retrieval augmented generation, but recently I’ve encountered some confusion around the “R” (retrieval) aspect of RAG. I think that much of that confusion.
This blog post summarizes our findings, focusing on NER as a first-step key task for knowledge extraction. You can use the Ontotext Metadata Studio (OMDS) to integrate any NER model and apply it to your documents to extract the entities you are interested in.
SageMaker is natively integrated with Apache Airflow and Amazon Managed Workflows for Apache Airflow (Amazon MWAA), and is used to automate the workflow orchestration for jobs, querybooks, and notebooks with a Python-based DAG definition. To use the sample data provided in this blog post, your domain should be in the us-east-1 Region.
Together, you can use these capabilities to author, manage, operate, and monitor data processing workloads across your organization. Select the Amazon S3 source node and enter the following values: S3 URI: s3://aws-bigdata-blog/generated_synthetic_reviews/data/product_category=Apparel/ Format: Parquet Select Update node.
Amazon Managed Streaming for Apache Kafka (Amazon MSK) now offers a new broker type called Express brokers. Express brokers provide straightforward operations with hands-free storage management by offering unlimited storage without pre-provisioning, eliminating disk-related bottlenecks.
This feature will be discussed in detail later in this blog. The raw metadata is assumed to be not more than 100 GB. Vamshi Vijay Nakkirtha is a software engineering manager working on the OpenSearch Project and Amazon OpenSearch Service. For detailed implementation steps, see the OpenSearch documentation.
Ali Tore, Senior Vice President of Advanced Analytics at Salesforce, highlighting the value of this integration, says: “We’re excited to partner with Amazon to bring Tableau’s powerful data exploration and AI-driven analytics capabilities to customers managing data across organizational boundaries with Amazon DataZone.”
Better Metadata Management Add Descriptions and Data Product tags to tables and columns in the Data Catalog for improved governance. Enhanced Column Profiling Displays Get clearer insights with redesigned views in the Data Catalog, Profiling Results, Hygiene Issues, and Test Results pages. DataOps just got more intelligent.
Given the importance of data in the world today, organizations face the dual challenges of managing large-scale, continuously incoming data while vetting its quality and reliability. AWS Glue is a serverless data integration service that you can use to effectively monitor and manage data quality through AWS Glue Data Quality.
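With AWS Glue Data Quality, such checks are expressed in the Data Quality Definition Language (DQDL). A minimal ruleset might look like this (the column names and thresholds are hypothetical):

```
Rules = [
    RowCount > 0,
    IsComplete "order_id",
    Uniqueness "order_id" > 0.99,
    ColumnValues "status" in ["open", "shipped", "cancelled"]
]
```

A ruleset like this can be evaluated against incoming data on a schedule, so quality regressions surface before downstream consumers see the data.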
REST Catalog Value Proposition It provides open, metastore-agnostic APIs for Iceberg metadata operations, dramatically simplifying the Iceberg client and metastore/engine integration. It provides real time metadata access by directly integrating with the Iceberg-compatible metastore. You will see the 2 carrier records in the table.
With native Jira integration, teams can now create and manage workflows directly within their existing project management environment. Microsoft Entra ID SSO support : Simplify authentication and enhance security through centralized identity management. Modern data stack integration DBT integration : This is a game-changer.
In August, we wrote about how in a future where distributed data architectures are inevitable, unifying and managing operational and business metadata is critical to successfully maximizing the value of data, analytics, and AI. It is a critical feature for delivering unified access to data in distributed, multi-engine architectures.
This is where business glossaries and metadata come in. Metadata management tools and business glossary capabilities can help align these definitions early, before the move. The post Model first, move smart: Why data modeling is the key to successful migrations appeared first on erwin Expert Blog. They happen by design.
The problem isn’t just the volume of the data, but also how difficult it is to manage and make sense of it. All of this data is essential for investigations and threat hunting, but existing systems often struggle to manage it efficiently. In many traditional systems, query planning can take as long as executing the query itself.
To address these challenges, organizations often build bespoke integrations between services, tools, and their own access management systems. Build with projects: Build, train, and deploy ML, generative AI, and foundation models with fully managed infrastructure, tools, and workflows.
Organizations today face the challenge of managing and deriving insights from an ever-expanding universe of data in real time. The cost of commercial observability solutions becomes prohibitive, forcing teams to manage multiple separate tools and increasing both operational overhead and troubleshooting complexity.