Additionally, multiple copies of the same data locked in proprietary systems contribute to version control issues, redundancies, staleness, and management headaches. It leverages knowledge graphs to keep track of all the data sources and data flows, using AI to fill the gaps so you have the most comprehensive metadata management solution.
Writing SQL queries requires not just remembering SQL syntax rules, but also knowledge of the tables' metadata: data about table schemas, relationships among the tables, and possible column values. Although LLMs can generate syntactically correct SQL queries, they still need that table metadata to write accurate SQL.
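To make this concrete, here is a minimal sketch of grounding SQL generation in table metadata. The `llm_complete` call, the table DDL, and the sample values are illustrative placeholders, not any particular provider's API.

```python
# Hedged sketch: assemble a prompt that gives the model the schema context it needs.
# `llm_complete`, the table, and the sample values are hypothetical.

def build_sql_prompt(question: str, table_ddl: str, sample_values: dict) -> str:
    """Combine the question with the table schema and known column values."""
    value_hints = "\n".join(
        f"- {column}: {', '.join(values)}" for column, values in sample_values.items()
    )
    return (
        "You are a SQL assistant. Use only the tables and columns below.\n\n"
        f"Schema:\n{table_ddl}\n\n"
        f"Known column values:\n{value_hints}\n\n"
        f"Question: {question}\nSQL:"
    )

prompt = build_sql_prompt(
    question="Total revenue per region last quarter",
    table_ddl="CREATE TABLE sales (region VARCHAR, amount DECIMAL(10,2), sold_at DATE)",
    sample_values={"region": ["EMEA", "AMER", "APAC"]},
)
# sql = llm_complete(prompt)  # hypothetical call to whichever model you use
```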
To mitigate this issue, various compression techniques can be used to improve memory usage and computational efficiency. Amazon OpenSearch Service, as a vector database, supports scalar and product quantization techniques that reduce memory usage and operational costs.
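As a rough illustration, the opensearch-py sketch below creates a k-NN index that uses the faiss scalar quantization (fp16) encoder. The host, credentials, index name, and vector dimension are assumptions, and you should confirm that your domain's OpenSearch version supports this encoder.

```python
# Hedged sketch: create a k-NN index with faiss scalar quantization (fp16)
# to roughly halve vector memory. All names and credentials are placeholders.
from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],
    http_auth=("user", "password"),
    use_ssl=True,
)

index_body = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "embedding": {
                "type": "knn_vector",
                "dimension": 768,
                "method": {
                    "name": "hnsw",
                    "engine": "faiss",
                    "space_type": "l2",
                    "parameters": {
                        "encoder": {"name": "sq", "parameters": {"type": "fp16"}}
                    },
                },
            }
        }
    },
}

client.indices.create(index="documents", body=index_body)
```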
It combines the flexibility and scalability of data lake storage with the data analytics, data governance, and data management functionality of the data warehouse. Let’s take a look at some of the features in Cloudera Lakehouse Optimizer, the benefits they provide, and the road ahead for this service.
First query response times for dashboard queries have significantly improved by optimizing code execution and reducing compilation overhead. We have enhanced autonomics algorithms to generate and implement smarter and quicker optimal data layout recommendations for distribution and sort keys, further optimizing performance.
It is appealing to migrate from self-managed OpenSearch and Elasticsearch clusters running legacy versions to Amazon OpenSearch Service to enjoy the ease of use, native integration with AWS services, and rich features from the open-source environment (OpenSearch is now part of the Linux Foundation).
Traditional machine learning systems excel at classification, prediction, and optimization—they analyze existing data to make decisions about new inputs. Instead of optimizing for accuracy metrics, you evaluate creativity, coherence, and usefulness. This difference shapes everything about how you work with these systems.
Open table formats are emerging in the rapidly evolving domain of big data management, fundamentally altering the landscape of data storage and analysis. The adoption of open table formats is a crucial consideration for organizations looking to optimize their data management practices and extract maximum value from their data.
Let’s briefly describe the capabilities of the AWS services referred to above: AWS Glue is a fully managed, serverless, and scalable extract, transform, and load (ETL) service that simplifies the process of discovering, preparing, and loading data for analytics. As stated earlier, the first step involves data ingestion.
Today, they play a critical role in syncing with customer applications, making it possible to manage concurrent data operations while maintaining the integrity and consistency of information. By using features like Iceberg's compaction, OTFs streamline maintenance, making it straightforward to manage object and metadata versioning at scale.
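As a rough illustration of that maintenance, the PySpark sketch below calls Iceberg's rewrite_data_files and expire_snapshots procedures. The catalog name (glue_catalog), table name, and option values are assumptions, and the Spark session must already be configured with the Iceberg extensions and catalog.

```python
# Hedged sketch: routine Iceberg table maintenance from PySpark.
# Assumes the session is configured with the Iceberg runtime and a catalog
# named "glue_catalog"; table and option values are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-maintenance").getOrCreate()

# Compact small data files into larger ones.
spark.sql("""
    CALL glue_catalog.system.rewrite_data_files(
        table => 'analytics.orders',
        options => map('min-input-files', '5')
    )
""").show()

# Expire old snapshots so object and metadata versions don't pile up.
spark.sql("""
    CALL glue_catalog.system.expire_snapshots(
        table => 'analytics.orders',
        retain_last => 10
    )
""").show()
```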
Combine data processing, AI analysis, and professional reporting without jumping between tools or managing complex infrastructure. Integration with feature stores: connect the workflow output to feature stores like Feast or Tecton for automated feature pipeline creation and management.
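To make the Feast side of that integration concrete, here is a hedged sketch. It assumes a Feast repository (feature_store.yaml) already exists, and the feature view, entity column, and feature names are invented for illustration.

```python
# Hedged sketch: materialize features and pull training data from Feast.
# The repo, feature view ("customer_stats"), and columns are hypothetical.
from datetime import datetime, timedelta

import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # points at an existing feature_store.yaml

# Load the latest feature values into the online store.
store.materialize(
    start_date=datetime.utcnow() - timedelta(days=1),
    end_date=datetime.utcnow(),
)

# Build a training frame for the entities produced by the workflow.
entity_df = pd.DataFrame({
    "customer_id": [1001, 1002],
    "event_timestamp": [datetime.utcnow(), datetime.utcnow()],
})
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["customer_stats:purchase_count", "customer_stats:avg_order_value"],
).to_df()
print(training_df.head())
```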
However, managing schema evolution at scale presents significant challenges. To address this challenge, this post demonstrates how to build such a solution by combining Amazon Simple Storage Service (Amazon S3) for data storage, AWS Glue Data Catalog for schema management, and Amazon Athena for one-time querying.
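As a small sketch of what schema management in the Data Catalog can look like in practice, the boto3 snippet below lists a Glue table's stored schema versions, which is one way to track how a schema has evolved. The region, database, and table names are invented for illustration.

```python
# Hedged sketch: inspect how a Glue Data Catalog table's schema has changed
# over time. Region, database, and table names are placeholders.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

response = glue.get_table_versions(DatabaseName="sales_db", TableName="orders")
for version in response["TableVersions"]:
    columns = version["Table"]["StorageDescriptor"]["Columns"]
    print(version["VersionId"], [(c["Name"], c["Type"]) for c in columns])
```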
Zero-ETL is a set of fully managed integrations by AWS that minimizes the need to build ETL data pipelines. We take care of the ETL for you by automating the creation and management of data replication. Zero-ETL provides service-managed replication, whereas Glue ETL offers customer-managed data ingestion.
Better metadata management: add Descriptions and Data Product tags to tables and columns in the Data Catalog for improved governance. With the updated TestGen 3.0, you have the power to score, monitor, and optimize your data quality like never before. DataOps just got more intelligent.
This blog post summarizes our findings, focusing on NER as a key first-step task for knowledge extraction. We also experimented with prompt optimization tools; however, these experiments did not yield promising results. In many cases, prompt optimizers removed crucial entity-specific information and oversimplified the prompts.
Finally, the purchase_patterns table examines customer purchase behavior over time, aiding in understanding buying trends and optimizing the customer journey. `schema.yml`: YAML file defining metadata, tests, and descriptions for the models in this directory. `customer_demographics.sql`: model for transforming customer demographic data.
Amazon Managed Streaming for Apache Kafka (Amazon MSK) now offers a new broker type called Express brokers. Express brokers provide straightforward operations with hands-free storage management by offering unlimited storage without pre-provisioning, eliminating disk-related bottlenecks.
Are you incurring significant cross-Availability Zone traffic costs when running an Apache Kafka client in containerized environments on Amazon Elastic Kubernetes Service (Amazon EKS) that consumes data from Amazon Managed Streaming for Apache Kafka (Amazon MSK) topics? An Apache Kafka consumer registers to read from a topic.
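A common mitigation is KIP-392 follower fetching: the consumer advertises its rack (its Availability Zone id) via client.rack and fetches from the in-zone replica. Below is a hedged sketch using the confluent-kafka Python client; the bootstrap endpoint, group id, topic, and AZ id are placeholders, and the MSK cluster must have a rack-aware replica selector enabled for this to take effect.

```python
# Hedged sketch: a consumer that prefers the replica in its own Availability Zone
# to reduce cross-AZ traffic. All endpoints and names are placeholders.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "b-1.my-cluster.kafka.us-east-1.amazonaws.com:9092",
    "group.id": "baggage-events-reader",
    "client.rack": "use1-az1",        # the AZ id where this pod runs
    "auto.offset.reset": "earliest",
})

consumer.subscribe(["events"])
try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            print(msg.error())
            continue
        print(msg.key(), len(msg.value()))  # replace with real processing
finally:
    consumer.close()
```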
The problem isn’t just the volume of the data, but also how difficult it is to manage and make sense of it. All of this data is essential for investigations and threat hunting, but existing systems often struggle to manage it efficiently. In many traditional systems, query planning can take as long as executing the query itself.
Importance of baggage analytics Baggage management is a process that starts at baggage check-in and ends with the passenger claiming their baggage in a happy path scenario. The following figure explains the high-level baggage management process and respective key performance indicators (KPI).
Organizations today face the challenge of managing and deriving insights from an ever-expanding universe of data in real time. The cost of commercial observability solutions becomes prohibitive, forcing teams to manage multiple separate tools and increasing both operational overhead and troubleshooting complexity.
You can’t optimize what you don’t understand. This is where business glossaries and metadata come in. Metadata management tools and business glossary capabilities can help align these definitions early, before the move. Maybe the only person who understood how that legacy CRM database was structured retired last year.
Superior functionality: enjoy advanced metadata management, model and database comparisons, roundtrip engineering, and deep integration with data catalogs and business glossaries.
Healthcare systems face significant challenges managing vast amounts of data while maintaining regulatory compliance, security, and performance. In this post, we address common multi-tenancy challenges and provide actionable solutions for security, tenant isolation, workload management, and cost optimization across diverse healthcare tenants.
In this blog post, we will demonstrate how business units can use Amazon SageMaker Unified Studio to discover, subscribe to, and analyze these distributed data assets. SageMaker Lakehouse streamlines connecting to, cataloging, and managing permissions on data from multiple sources.
It’s like optimizing your website’s load time while your checkout process is broken: you’re getting better at the wrong thing. Second, too many metrics fragment your attention. Instead of focusing on the few metrics that matter for your specific use case, you’re trying to optimize multiple dimensions simultaneously.
Despite their advantages, traditional data lake architectures often grapple with challenges such as understanding deviations from the most optimal state of the table over time, identifying issues in data pipelines, and monitoring a large number of tables.
This is part of our series of blog posts on recent enhancements to Impala; the entire collection is available here. In this installment, Impala Optimizations for Small Queries, we’ll discuss the various phases Impala takes a query through and how small query optimizations are incorporated into the design of each phase, starting with the query planner design.
Relational databases benefit from decades of tweaks and optimizations to deliver performance. This is a graph of millions of edges and vertices – in enterprise data management terms it is a giant piece of master/reference data. Not Every Graph is a Knowledge Graph: Schemas and Semantic Metadata Matter.
Iceberg tables store metadata in manifest files. As the number of data files increases, the amount of metadata stored in these manifest files also increases, leading to longer query planning time. The query runtime also increases because it’s proportional to the number of data and metadata file read operations.
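For illustration, here is a minimal PySpark call to Iceberg's rewrite_manifests procedure, which compacts manifest metadata to shorten query planning. The catalog and table names are placeholders, and the session is assumed to be configured for Iceberg.

```python
# Hedged sketch: compact Iceberg manifest files to speed up query planning.
# Assumes a Spark session configured with the Iceberg runtime and a catalog
# named "glue_catalog"; the table name is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rewrite-manifests").getOrCreate()
spark.sql("CALL glue_catalog.system.rewrite_manifests('analytics.orders')").show()
```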
With all good things come many challenges, and businesses often struggle to manage their information in the correct way. Enter data quality management. What is data quality management (DQM), and why do you need it?
We’re excited to announce a new feature in Amazon DataZone that offers enhanced metadata governance for your subscription approval process. With this update, domain owners can define and enforce metadata requirements for data consumers when they request access to data assets. Key benefits The feature benefits multiple stakeholders.
This blog post is co-written with Hardeep Randhawa and Abhay Kumar from HPE. Their large inventory requires extensive supply chain management to source parts, make products, and distribute them globally. Each file arrives as a pair with a tail metadata file in CSV format containing the size and name of the file.
Enterprises are trying to manage data chaos. For decades, data modeling has been the optimal way to design and deploy new relational databases with high-quality data sources and support application development. erwin DM 2020 is an essential source of metadata and a critical enabler of data governance and intelligence efforts.
With all these diverse metadata sources, it is difficult to understand the complicated web they form, much less get a simple visual flow of data lineage and impact analysis. The metadata-driven suite automatically finds, models, ingests, catalogs, and governs cloud data assets. But let’s be honest – no one likes to move.
You can also manage the effectiveness of your business and ensure you understand which systems are critical for business continuity and for measuring corporate performance. The most streamlined way to achieve this is by using a data catalog, which can provide a first stop for users ahead of working in BI platforms.
On AWS, you can run Trino on Amazon EMR , where you have the flexibility to run your preferred version of open source Trino on Amazon Elastic Compute Cloud (Amazon EC2) instances that you manage, or on Amazon Athena for a serverless experience. and later, S3 file metadata-based join optimizations are turned on by default.
The main use of business intelligence is to help business units, managers, top executives, and other operational workers make better-informed decisions backed up with accurate data. The top management believed that tackling this turnover would be key in improving the customer experience and that this would lead to higher revenues.
The Ozone Manager is a critical component of Ozone. It is a replicated, highly-available service that is responsible for managing the metadata for all objects stored in Ozone. As Ozone scales to exabytes of data, it is important to ensure that Ozone Manager can perform at scale.
Metadata management plays a critical role within the modern data management stack. However, as data volumes continue to grow, manual approaches to metadata management are suboptimal and can result in missed opportunities. This puts into perspective the role of active metadata management.
Apache Iceberg is an open table format for very large analytic datasets, which captures metadata information on the state of datasets as they evolve and change over time. Apache Iceberg addresses customer needs by capturing rich metadata information about the dataset at the time the individual data files are created.
In a previous blog, I explained that data lineage is basically the history of data, including a data set’s origin, characteristics, quality, and movement over time. This information is critical to regulatory compliance, change management, and data governance, not to mention delivering an optimal customer experience.
Cloudinary is a cloud-based media management platform that provides a comprehensive set of tools and services for managing, optimizing, and delivering images, videos, and other media assets on websites and mobile applications.
To improve the way they model and manage risk, institutions must modernize their data management and data governance practices. Implementing a modern data architecture makes it possible for financial institutions to break down legacy data silos, simplifying data management, governance, and integration — and driving down costs.