Writing SQL queries requires not just remembering the SQL syntax rules, but also knowledge of the tables' metadata: table schemas, relationships among the tables, and possible column values. Although LLMs can generate syntactically correct SQL queries, they still need that table metadata to write accurate ones.
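As a rough sketch of the idea, the snippet below embeds schema metadata in a prompt before asking a model for SQL. The schema, the question, and the ask_llm() call are all illustrative placeholders, not a specific vendor's API.

```python
# Minimal sketch: supplying table metadata to an LLM for text-to-SQL.
# TABLE_METADATA and ask_llm() are placeholders, not a real vendor API.

TABLE_METADATA = """
CREATE TABLE orders (
    order_id    BIGINT,   -- primary key
    customer_id BIGINT,   -- FK -> customers.customer_id
    status      VARCHAR,  -- one of: 'pending', 'shipped', 'delivered'
    order_date  DATE
);
"""

def build_prompt(question: str) -> str:
    # Ground the model in schema, relationships, and column values
    # so it can emit an accurate, syntactically valid query.
    return (
        "Given this schema:\n"
        f"{TABLE_METADATA}\n"
        f"Write a SQL query to answer: {question}\n"
        "Return only SQL."
    )

prompt = build_prompt("How many orders shipped in March 2024?")
# sql = ask_llm(prompt)  # call your model provider of choice here
```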
Octopai leverages knowledge graphs to keep track of all the data sources and data flows, using AI to fill the gaps so you have the most comprehensive metadata management solution. Together, Cloudera and Octopai will help reinvent how customers manage their metadata and track lineage across all their data sources.
We will also cover the pattern with automatic compaction through AWS Glue Data Catalog table optimization. Consider a streaming pipeline ingesting real-time event data while a scheduled compaction job runs to optimize file sizes. Load the table's latest metadata, and determine which metadata version is used as the base for the update.
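For context, enabling the managed compaction optimizer boils down to one API call; a minimal sketch with boto3, assuming the account ID, database, table, and IAM role ARN shown here are placeholders you would replace:

```python
import boto3

# Sketch: enable automatic compaction for an Iceberg table registered
# in the AWS Glue Data Catalog (all identifiers are illustrative).
glue = boto3.client("glue")

glue.create_table_optimizer(
    CatalogId="111122223333",  # your AWS account ID
    DatabaseName="events_db",
    TableName="clickstream",
    Type="compaction",
    TableOptimizerConfiguration={
        "roleArn": "arn:aws:iam::111122223333:role/GlueOptimizerRole",
        "enabled": True,
    },
)
```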
Iceberg offers distinct advantages through its metadata layer over Parquet, such as improved data management, performance optimization, and integration with various query engines. Iceberg's table format separates data files from metadata files, enabling efficient data modifications without full dataset rewrites.
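You can see that separation directly from a client library. Here is a small sketch using pyiceberg; the catalog and table names are assumptions for illustration:

```python
from pyiceberg.catalog import load_catalog

# Iceberg keeps table state in metadata files that point to immutable
# data files, so a commit swaps metadata pointers instead of rewriting data.
catalog = load_catalog("default")          # configured via ~/.pyiceberg.yaml
table = catalog.load_table("analytics.events")

print(table.metadata_location)             # current metadata JSON file
snapshot = table.current_snapshot()
if snapshot:
    # The snapshot summary records what the last commit changed.
    print(snapshot.snapshot_id, snapshot.summary)
```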
Better Metadata Management: Add Descriptions and Data Product tags to tables and columns in the Data Catalog for improved governance. With the updated TestGen 3.0, you have the power to score, monitor, and optimize your data quality like never before. DataOps just got more intelligent.
Central to this is metadata management, a critical component for driving future success. AI and ML need large amounts of accurate data for companies to get the most out of the technology. Let's dive into what that looks like, what workarounds some IT teams use today, and why metadata management is the key to success.
Despite their advantages, traditional data lake architectures often grapple with challenges such as understanding deviations from the optimal state of a table over time, identifying issues in data pipelines, and monitoring a large number of tables. Addressing these challenges is essential for optimizing read and write performance.
Impala Optimizations for Small Queries. We'll discuss the various phases Impala takes a query through and how small-query optimizations are incorporated into the design of each phase. Query optimization in databases is a long-standing area of research, with much emphasis on finding near-optimal query plans.
Relational databases benefit from decades of tweaks and optimizations to deliver performance. Not Every Graph is a Knowledge Graph: Schemas and Semantic Metadata Matter. This metadata should then be represented, along with its intricate relationships, in a connected knowledge graph model that can be understood by the business teams.
Amazon OpenSearch Service recently introduced the OpenSearch Optimized Instance family (OR1), which delivers up to 30% price-performance improvement over existing memory optimized instances in internal benchmarks, and uses Amazon Simple Storage Service (Amazon S3) to provide 11 9s of durability.
Amazon Q generative SQL for Amazon Redshift uses generative AI to analyze user intent, query patterns, and schema metadata to identify common SQL query patterns directly within Amazon Redshift. This accelerates the query authoring process for users and reduces the time required to derive actionable data insights.
The AWS Glue Data Catalog now enhances managed table optimization of Apache Iceberg tables by automatically removing data files that are no longer needed. Along with the Glue Data Catalog’s automated compaction feature, these storage optimizations can help you reduce metadata overhead, control storage costs, and improve query performance.
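The managed feature handles this for you, but it can help to see the equivalent manual maintenance. The sketch below uses Iceberg's built-in Spark procedures; the catalog name, table name, and cutoff timestamp are illustrative:

```python
from pyspark.sql import SparkSession

# Requires a Spark session configured with an Iceberg catalog
# (here assumed to be named "glue_catalog").
spark = SparkSession.builder.appName("iceberg-maintenance").getOrCreate()

# Expire old snapshots so their data files become eligible for cleanup.
spark.sql("""
    CALL glue_catalog.system.expire_snapshots(
        table => 'events_db.clickstream',
        older_than => TIMESTAMP '2024-01-01 00:00:00'
    )
""")

# Remove files no longer referenced by any table metadata.
spark.sql("CALL glue_catalog.system.remove_orphan_files(table => 'events_db.clickstream')")
```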
The adoption of open table formats is a crucial consideration for organizations looking to optimize their data management practices and extract maximum value from their data. An Iceberg table’s metadata stores a history of snapshots, which are updated with each transaction. In earlier posts, we discussed AWS Glue 5.0 for Apache Spark.
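That snapshot history is queryable through Iceberg's metadata tables. A quick sketch, assuming an Iceberg-enabled Spark session and illustrative catalog/table names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("snapshot-history").getOrCreate()

# Every transaction appends a snapshot; the "snapshots" metadata table
# exposes the full history without touching any data files.
spark.sql("""
    SELECT snapshot_id, committed_at, operation, summary
    FROM glue_catalog.events_db.clickstream.snapshots
    ORDER BY committed_at DESC
""").show(truncate=False)
```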
First, what active metadata management isn't: "Okay, you metadata! Now, what active metadata management is (well, kind of): "Okay, you metadata! I will, of course, end up with a very amateurish finished product, because I used sub-optimal tools to do the job. That takes active metadata management.
Customers maintain multiple MWAA environments to separate development stages, optimize resources, manage versions, enhance security, ensure redundancy, customize settings, improve scalability, and facilitate experimentation. If you run a small environment class such as mw1.micro, remember to monitor its performance using the recommended metrics to maintain optimal operation.
Data needs to be accompanied by the metadata that explains and gives it context. Without metadata, data is just a bunch of meaningless, unspecified numbers or words that are about as useful as a bunch of rocks (or shells). And without effective metadata discovery capabilities, metadata isn’t all that useful either.
Whether you're a data analyst seeking a specific metric or a data steward validating metadata compliance, this update delivers a more precise, governed, and intuitive search experience. This supports data hygiene and infrastructure cost optimization.
Some challenges include data infrastructure that allows scaling and optimizing for AI; data management to inform AI workflows where data lives and how it can be used; and associated data services that help data scientists protect AI workflows and keep their models clean.
Iceberg tables store metadata in manifest files. As the number of data files increases, the amount of metadata stored in these manifest files also increases, leading to longer query planning times. The query runtime also increases because it's proportional to the number of data or metadata file read operations.
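Compaction is the usual remedy. A sketch with Iceberg's Spark procedures, again with illustrative catalog and table names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compaction").getOrCreate()

# Merge small data files toward the table's target file size,
# reducing the number of file-read operations per query.
spark.sql("CALL glue_catalog.system.rewrite_data_files(table => 'events_db.clickstream')")

# Rewrite manifests so fewer metadata files must be read at plan time.
spark.sql("CALL glue_catalog.system.rewrite_manifests('events_db.clickstream')")
```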
How RFS works: OpenSearch and Elasticsearch snapshots are a directory tree that contains both data and metadata. Metadata files exist in the snapshot to provide details about the snapshot as a whole, the source cluster's global metadata and settings, each index in the snapshot, and each shard in the snapshot.
Starting today, the Athena SQL engine uses a cost-based optimizer (CBO), a new feature that uses table and column statistics stored in the AWS Glue Data Catalog as part of the table’s metadata. Let’s discuss some of the cost-based optimization techniques that contributed to improved query performance.
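The CBO only helps if statistics exist. One way to generate them, sketched with boto3 under the assumption that your boto3 version exposes the Glue column-statistics task API (names and ARNs are placeholders):

```python
import boto3

# Sketch: kick off a column statistics collection task so Athena's
# cost-based optimizer has table/column stats to plan with.
glue = boto3.client("glue")

glue.start_column_statistics_task_run(
    DatabaseName="events_db",
    TableName="clickstream",
    Role="arn:aws:iam::111122223333:role/GlueStatsRole",
    SampleSize=25.0,  # percent of rows to sample
)
```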
This workload imbalance presents a challenge for customers seeking to optimize their resource utilization and stream processing efficiency. The post explains why the imbalance results in higher costs, and how KCL 3.0 reduces the Amazon DynamoDB cost associated with KCL by optimizing read operations on the DynamoDB table storing metadata, among other benefits.
For container terminal operators, data-driven decision-making and efficient data sharing are vital to optimizing operations and boosting supply chain efficiency. From here, the metadata is published to Amazon DataZone by using AWS Glue Data Catalog. This post is co-written by Dr. Leonard Heilig and Meliena Zlotos from EUROGATE.
First query response times for dashboard queries have significantly improved by optimizing code execution and reducing compilation overhead. We have enhanced autonomics algorithms to generate and implement smarter, quicker data layout recommendations for distribution and sort keys, further optimizing performance.
Within the ANZ enterprise data mesh strategy, aligning data mesh nodes with the ANZ Group’s divisional structure provides optimal alignment between data mesh principles and organizational structure, as shown in the following diagram. A data portal for consumers to discover data products and access associated metadata.
As data-centric AI, automated metadata management, and privacy-aware data sharing mature, the opportunity to embed data quality into the enterprise's core has never been more significant. Data fabric: a metadata-rich integration layer across distributed systems; its trade-offs are implementation complexity and reliance on robust metadata management.
Some of the benefits are detailed below: optimizing metadata for greater reach and branding benefits. One of the most overlooked factors is metadata. Metadata is important for numerous reasons. Search engines crawl the metadata of image files, videos, and other visual creative when they are indexing websites.
We’re excited to announce a new feature in Amazon DataZone that offers enhanced metadata governance for your subscription approval process. With this update, domain owners can define and enforce metadata requirements for data consumers when they request access to data assets. Key benefits The feature benefits multiple stakeholders.
This can help you optimize long-term cost for high-throughput use cases. This includes adding common fields to associate metadata with the indexed documents, as well as parsing the log data to make data more searchable. In general, we recommend using one Kinesis data stream for your log aggregation workload.
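A minimal sketch of that enrichment step: attach common metadata fields to each log record before sending it to a single aggregation stream (stream name, fields, and values are illustrative):

```python
import json
import time
import boto3

kinesis = boto3.client("kinesis")

def ship_log(raw_line: str, app: str, env: str) -> None:
    record = {
        "message": raw_line,
        "app": app,                 # common fields that make logs searchable
        "env": env,
        "ingested_at": int(time.time()),
    }
    kinesis.put_record(
        StreamName="log-aggregation",     # one stream for the whole workload
        Data=json.dumps(record).encode(),
        PartitionKey=app,                 # spreads load across shards
    )

ship_log("GET /health 200 3ms", app="checkout", env="prod")
```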
For decades, data modeling has been the optimal way to design and deploy new relational databases with high-quality data sources and support application development. That's because it's the best way to visualize metadata, and metadata is now the heart of enterprise data management and data governance/intelligence efforts.
Organizations with particularly deep data stores might need a data catalog with advanced capabilities, such as automated metadata harvesting to speed up the data preparation process. The optimal, most streamlined way to achieve this is by using a data catalog, which can provide a first stop for users ahead of working in BI platforms.
S3 Tables are specifically optimized for analytics workloads, resulting in up to 3 times faster query throughput and up to 10 times higher transactions per second compared to self-managed tables. These metadata tables are stored in S3 Tables, the new S3 storage offering optimized for tabular data.
Cloudinary is a cloud-based media management platform that provides a comprehensive set of tools and services for managing, optimizing, and delivering images, videos, and other media assets on websites and mobile applications. This concept makes Iceberg extremely versatile.
We won’t be writing code to optimize scheduling in a manufacturing plant; we’ll be training ML algorithms to find optimum performance based on historical data. With machine learning, the challenge isn’t writing the code; the algorithms are implemented in a number of well-known and highly optimized libraries.
Choose the table to view the schema and other metadata. Conclusion: In this post, we showed how to set up an AWS Glue crawler to crawl over a MongoDB Atlas collection, gathering metadata and creating table records in the AWS Glue Data Catalog. Note that the crawler captured nested data as a STRUCT and correctly listed the ARRAY fields.
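The crawler setup itself is a short boto3 call; a sketch assuming a Glue connection named "mongodb-atlas" already exists (all other names are placeholders):

```python
import boto3

glue = boto3.client("glue")

# Sketch: crawl a MongoDB Atlas collection into the Glue Data Catalog.
glue.create_crawler(
    Name="mongodb-atlas-crawler",
    Role="arn:aws:iam::111122223333:role/GlueCrawlerRole",
    DatabaseName="mongodb_catalog",   # target Glue database for table records
    Targets={
        "MongoDBTargets": [
            {"ConnectionName": "mongodb-atlas", "Path": "sample_db/orders"}
        ]
    },
)
glue.start_crawler(Name="mongodb-atlas-crawler")
```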
Each file arrives as a pair with a tail metadata file in CSV format containing the size and name of the file. This metadata file is later used to read source file names during processing into the staging layer. These files follow the same naming pattern, with a daily system-generated timestamp appended to each file name.
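Reading such a tail metadata file is straightforward; the sketch below assumes a file name and a two-column layout (file_name, file_size_bytes) purely for illustration:

```python
import csv

def read_tail_metadata(path: str) -> list[dict]:
    """Parse the CSV metadata file that accompanies each data file."""
    with open(path, newline="") as f:
        # Expected columns (assumed): file_name, file_size_bytes
        return list(csv.DictReader(f))

for entry in read_tail_metadata("orders_20240321_metadata.csv"):
    print(entry["file_name"], entry["file_size_bytes"])
```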
Recall the following key attributes of a machine learning project: Unlike traditional software, where the goal is to meet a functional specification, in ML the goal is to optimize a metric. Metadata and artifacts needed for audits: as an example, the output from the components of MLflow will be very pertinent for audits.
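A quick sketch of what that audit trail looks like in practice, using MLflow's standard tracking API (experiment, parameter, metric, and artifact names are illustrative):

```python
import mlflow

# Record the optimized metric plus audit artifacts for later review.
mlflow.set_experiment("churn-model")
with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("max_depth", 6)
    mlflow.log_metric("auc", 0.91)            # the metric being optimized
    # Any file an auditor might need to replay the run; path is illustrative.
    mlflow.log_artifact("feature_spec.json")
```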
Data inventory optimization is about efficiently solving the right problem. In this column, we will return to the idea of lean manufacturing and explore the critical area of inventory management on the factory floor.
With all these diverse metadata sources, it is difficult to understand the complicated web they form, much less get a simple visual flow of data lineage and impact analysis. The metadata-driven suite automatically finds, models, ingests, catalogs, and governs cloud data assets for regulatory compliance (GDPR, CCPA, HIPAA, SOX, PCI DSS).
When you use Trino on Amazon EMR or Athena, you get the latest open source community innovations along with proprietary, AWS-developed optimizations. Starting from Amazon EMR 6.8.0 and Athena engine version 2, AWS has been developing query plan and engine behavior optimizations that improve query performance on Trino.
In this context, Amazon DataZone is the optimal choice for managing the enterprise data platform. As stated earlier, the first step involves data ingestion. Business analysts enhance the data with business metadata/glossaries and publish them as data assets or data products. Amazon Athena is used to query and explore the data.
Over the last year, Amazon Redshift added several performance optimizations for data lake queries across multiple areas of the query engine, such as rewrite, planning, scan execution, and consuming AWS Glue Data Catalog column statistics. Some of the queries in our benchmark experienced up to a 12x speedup.
Achieving consistently high performance requires an efficient routing system, optimizing traffic between the services your application depends on. In summary, IBM NS1 Connect offers a range of traffic steering options to meet diverse business needs to help ensure optimal application performance in the “now” era.
As the economy slowed, they focused on cost optimization. Even if you don’t have a formal data intelligence program in place, there is a good possibility your organization has intelligence about its data, because it is difficult for data to exist without some form of associated metadata.