Writing SQL queries requires not just remembering SQL syntax rules, but also knowledge of the table metadata: data about table schemas, relationships among the tables, and possible column values. Although LLMs can generate syntactically correct SQL queries, they still need the table metadata to write accurate SQL queries.
Iceberg offers distinct advantages over Parquet through its metadata layer, such as improved data management, performance optimization, and integration with various query engines. Iceberg's table format separates data files from metadata files, enabling efficient data modifications without full dataset rewrites.
We will also cover the pattern with automatic compaction through AWS Glue Data Catalog table optimization. Consider a streaming pipeline ingesting real-time event data while a scheduled compaction job runs to optimize file sizes. Load the table's latest metadata and determine which metadata version is used as the base for the update.
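As a rough illustration of that point, the sketch below embeds table DDL (schema, relationships, and allowed column values) directly in an LLM prompt; the table definitions and helper function are hypothetical, not from the original post.

```python
# Minimal sketch (hypothetical tables and helper) of supplying table metadata
# to an LLM prompt so it can generate an accurate SQL query.
TABLE_METADATA = """
CREATE TABLE orders (
    order_id     BIGINT,      -- primary key
    customer_id  BIGINT,      -- references customers.customer_id
    status       VARCHAR(16), -- one of: 'PLACED', 'SHIPPED', 'DELIVERED'
    order_date   DATE
);
CREATE TABLE customers (
    customer_id  BIGINT,
    region       VARCHAR(32)
);
"""

def build_prompt(question: str) -> str:
    # The schema, relationships, and allowed column values are included so the
    # model does not have to guess them.
    return (
        "You are a SQL assistant. Use only the tables described below.\n"
        f"{TABLE_METADATA}\n"
        f"Question: {question}\n"
        "Return a single ANSI SQL query."
    )

print(build_prompt("How many delivered orders were placed per region last month?"))
```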
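The load-then-commit step described above follows the optimistic concurrency pattern; the sketch below illustrates the general retry loop, using a purely illustrative catalog interface rather than any specific Iceberg API.

```python
# Illustrative sketch of optimistic concurrency: a writer loads the latest
# table metadata, builds its change against that base version, and retries if
# another writer (for example, the compaction job) committed first.
# The catalog object and its method names are placeholders, not a real API.
def commit_with_retry(catalog, table_name, apply_change, max_attempts=5):
    for _ in range(max_attempts):
        base = catalog.load_table_metadata(table_name)    # latest metadata version
        updated = apply_change(base)                      # new snapshot built on that base
        if catalog.commit(table_name, expected_base=base.version, new_metadata=updated):
            return updated                                # atomic swap succeeded
        # Another commit won the race; reload the metadata and try again.
    raise RuntimeError("could not commit after retries")
```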
Amazon Q generative SQL for Amazon Redshift uses generative AI to analyze user intent, query patterns, and schema metadata to identify common SQL query patterns directly within Amazon Redshift, accelerating the query authoring process for users and reducing the time required to derive actionable data insights.
The adoption of open table formats is a crucial consideration for organizations looking to optimize their data management practices and extract maximum value from their data. For more details, refer to Iceberg Release 1.6.1. An Iceberg table’s metadata stores a history of snapshots, which are updated with each transaction.
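One way to see that snapshot history is through Iceberg's metadata tables. A minimal PySpark sketch, assuming an Iceberg catalog named glue_catalog and placeholder database and table names:

```python
# Minimal PySpark sketch: inspect an Iceberg table's snapshot history via its
# "snapshots" metadata table. The catalog name (glue_catalog), database, and
# table are placeholders; the Iceberg catalog settings are assumed to be
# supplied through the job's Spark configuration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-snapshot-history").getOrCreate()

snapshots = spark.sql("""
    SELECT snapshot_id, parent_id, committed_at, operation
    FROM glue_catalog.analytics.events.snapshots
    ORDER BY committed_at DESC
""")
snapshots.show(truncate=False)
```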
First query response times for dashboard queries have improved significantly through optimized code execution and reduced compilation overhead. We have enhanced autonomics algorithms to generate and implement smarter and quicker optimal data layout recommendations for distribution and sort keys, further optimizing performance.
Relational databases benefit from decades of tweaks and optimizations to deliver performance. This is a graph of millions of edges and vertices – in enterprise data management terms it is a giant piece of master/reference data. Not Every Graph is a Knowledge Graph: Schemas and Semantic Metadata Matter.
The AWS Glue Data Catalog now enhances managed table optimization of Apache Iceberg tables by automatically removing data files that are no longer needed. Along with the Glue Data Catalog’s automated compaction feature, these storage optimizations can help you reduce metadata overhead, control storage costs, and improve query performance.
Whether you're a data analyst seeking a specific metric or a data steward validating metadata compliance, this update delivers a more precise, governed, and intuitive search experience. This supports data hygiene and infrastructure cost optimization.
Customers maintain multiple MWAA environments to separate development stages, optimize resources, manage versions, enhance security, ensure redundancy, customize settings, improve scalability, and facilitate experimentation. Refer to Amazon Managed Workflows for Apache Airflow Pricing for rates and more details.
This workload imbalance presents a challenge for customers seeking to optimize their resource utilization and stream processing efficiency. The latest KCL release also reduces the Amazon DynamoDB cost associated with KCL by optimizing read operations on the DynamoDB table storing metadata. For AWS SDK for Java 2.x benefits, refer to Use features of the AWS SDK for Java 2.x.
Impala Optimizations for Small Queries. We’ll discuss the various phases Impala takes a query through and how small query optimizations are incorporated into the design of each phase. For a more in-depth description of these phases please refer to Impala: A Modern, Open-Source SQL Engine for Hadoop. Query Planner Design.
Iceberg tables store metadata in manifest files. As the number of data files increases, the amount of metadata stored in these manifest files also increases, leading to longer query planning times. The query runtime also increases because it's proportional to the number of data or metadata file read operations.
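A common mitigation is periodic maintenance of the table. A hedged PySpark sketch using Iceberg's Spark procedures, with placeholder catalog, database, and table names:

```python
# Hedged PySpark sketch: compact small data files and rewrite manifests with
# Iceberg's Spark procedures so manifest metadata stays small and query
# planning stays fast. Names are placeholders; the Iceberg catalog settings
# are assumed to come from the job configuration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-maintenance").getOrCreate()

# Merge many small data files into fewer, larger ones.
spark.sql("CALL glue_catalog.system.rewrite_data_files(table => 'analytics.events')")

# Rewrite manifests so planning reads fewer metadata files.
spark.sql("CALL glue_catalog.system.rewrite_manifests('analytics.events')")
```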
We’re excited to announce a new feature in Amazon DataZone that offers enhanced metadata governance for your subscription approval process. With this update, domain owners can define and enforce metadata requirements for data consumers when they request access to data assets. Key benefits: the feature benefits multiple stakeholders.
Starting today, the Athena SQL engine uses a cost-based optimizer (CBO), a new feature that uses table and column statistics stored in the AWS Glue Data Catalog as part of the table’s metadata. Let’s discuss some of the cost-based optimization techniques that contributed to improved query performance.
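As a rough sketch of what the optimizer consumes, the following boto3 call reads column statistics from the Glue Data Catalog; the database, table, and column names are placeholders.

```python
# Hedged boto3 sketch: read the column statistics stored in the AWS Glue Data
# Catalog, which the Athena cost-based optimizer uses for planning.
import boto3

glue = boto3.client("glue")
resp = glue.get_column_statistics_for_table(
    DatabaseName="analytics",
    TableName="events",
    ColumnNames=["event_type", "event_date"],
)
for col in resp["ColumnStatisticsList"]:
    # Each entry carries per-column statistics such as distinct counts and nulls.
    print(col["ColumnName"], col["StatisticsData"]["Type"])
```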
Bob now knows that he can quickly build Amazon QuickSight dashboards with queries that are optimized using Redshift's cost-based optimizer. Ava defines the user attributes as static IAM tags, which could also include attributes stored in the identity provider (IdP), or dynamically as session tags to represent the user metadata.
BladeBridge provides a configurable framework to seamlessly convert legacy metadata and code into more modern services such as Amazon Redshift. For more details, refer to the BladeBridge Analyzer Demo. Refer to this BladeBridge documentation to get more details on SQL and expression conversion.
Unfiltered Table Metadata: this tab displays the response of the AWS Glue GetUnfilteredTableMetadata API for the selected table. Get table data and metadata for this user to see how Lake Formation permissions are enforced and how the two users can see different data (on the Authorized Data tab).
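For reference, a minimal boto3 sketch of that GetUnfilteredTableMetadata call, with placeholder account, database, and table names:

```python
# Hedged boto3 sketch of calling GetUnfilteredTableMetadata, the API whose
# response the tab above displays. Identifiers are placeholders.
import boto3

glue = boto3.client("glue")
resp = glue.get_unfiltered_table_metadata(
    CatalogId="111122223333",
    DatabaseName="analytics",
    Name="events",
    SupportedPermissionTypes=["COLUMN_PERMISSION"],
)
# The response reflects what Lake Formation lets this caller see, including
# any authorized columns and whether the table is registered with Lake Formation.
print(resp.get("AuthorizedColumns"))
print(resp.get("IsRegisteredWithLakeFormation"))
```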
For instructions, refer to How to Set Up a MongoDB Cluster. Choose the table to view the schema and other metadata. Conclusion In this post, we showed how to set up an AWS Glue crawler to crawl over a MongoDB Atlas collection, gathering metadata and creating table records in the AWS Glue Data Catalog.
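A hedged boto3 sketch of defining such a crawler over a MongoDB Atlas collection, with placeholder connection, role, database, and path values:

```python
# Hedged boto3 sketch: define and start an AWS Glue crawler over a MongoDB
# Atlas collection so its schema lands in the Glue Data Catalog. All names
# and the role ARN are placeholders.
import boto3

glue = boto3.client("glue")
glue.create_crawler(
    Name="mongodb-atlas-crawler",
    Role="arn:aws:iam::111122223333:role/GlueCrawlerRole",
    DatabaseName="mongodb_catalog_db",
    Targets={
        "MongoDBTargets": [
            {"ConnectionName": "mongodb-atlas-connection", "Path": "sample_db/orders"}
        ]
    },
)
glue.start_crawler(Name="mongodb-atlas-crawler")
```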
When you use Trino on Amazon EMR or Athena, you get the latest open source community innovations along with proprietary, AWS-developed optimizations. Starting from Amazon EMR 6.8.0 and Athena engine version 2, AWS has been developing query plan and engine behavior optimizations that improve query performance on Trino.
Organizations with particularly deep data stores might need a data catalog with advanced capabilities, such as automated metadata harvesting to speed up the data preparation process. The most effective and streamlined way to achieve this is by using a data catalog, which can provide a first stop for users ahead of working in BI platforms.
Let’s briefly describe the capabilities of the AWS services we referred to above: AWS Glue is a fully managed, serverless, and scalable extract, transform, and load (ETL) service that simplifies the process of discovering, preparing, and loading data for analytics. As stated earlier, the first step involves data ingestion.
We won’t be writing code to optimize scheduling in a manufacturing plant; we’ll be training ML algorithms to find optimum performance based on historical data. With machine learning, the challenge isn’t writing the code; the algorithms are implemented in a number of well-known and highly optimized libraries.
Each file arrives as a pair with a tail metadata file in CSV format containing the size and name of the file. This metadata file is later used to read source file names during processing into the staging layer. These files follow the same naming pattern, with a daily system-generated timestamp appended to each file name.
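Purely as an illustration of how such a tail metadata file might be consumed, the sketch below reads a CSV with assumed columns file_name and size_bytes; the column layout and file name are hypothetical, not taken from the source.

```python
# Illustrative sketch only: read the tail metadata CSV that arrives alongside
# each data file to recover the source file name and size before loading the
# staging layer. The columns (file_name, size_bytes) are an assumption.
import csv

def read_tail_metadata(path: str) -> list[dict]:
    with open(path, newline="") as f:
        return [
            {"file_name": row["file_name"], "size_bytes": int(row["size_bytes"])}
            for row in csv.DictReader(f)
        ]

for entry in read_tail_metadata("orders_20240101_093000.metadata.csv"):
    print(entry["file_name"], entry["size_bytes"])
```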
Over the last year, Amazon Redshift added several performance optimizations for data lake queries across multiple areas of the query engine, such as rewrite, planning, scan execution, and consuming AWS Glue Data Catalog column statistics. Some of the queries in our benchmark experienced up to a 12x speedup.
Although we don’t cover optimizing your jobs for costs in this post, you can refer to Monitor and optimize cost on AWS Glue for Apache Spark to learn how to fine-tune your AWS Glue jobs for performance, efficiency, and cost optimization. Let’s dive in! If the tables don’t exist, Athena creates them.
Each storage format implements this functionality in slightly different ways; for a comparison, refer to Choosing an open table format for your transactional data lake on AWS. For more information, refer to Amazon S3: Allows read and write access to objects in an S3 Bucket.
We group the new capabilities into four categories: Discover and secure, Connect with data sharing, Scale and optimize, and Audit and monitor. Let’s dive deeper and discuss the new capabilities introduced in 2023. To learn more about DataZone, refer to the User Guide. This enhancement simplifies many use cases to avoid metadata duplication.
Metadata management performs a critical role within the modern data management stack. However, as data volumes continue to grow, manual approaches to metadata management are sub-optimal and can result in missed opportunities. This puts into perspective the role of active metadata management. Improve data discovery.
Data quality refers to the assessment of the information you have, relative to its purpose and its ability to serve that purpose. While the digital age has been successful in prompting innovation far and wide, it has also facilitated what is referred to as the “data crisis” – low-quality data. 2 – Data profiling.
To learn more about Express brokers, refer to Introducing Express brokers for Amazon MSK to deliver high throughput and faster scaling for your Kafka clusters. However, you can use Amazon MSK Replicator to copy all data and metadata from your existing MSK cluster to a new cluster composed of Express brokers.
Use case: consider a large company that relies heavily on data-driven insights to optimize its customer support processes. The data is also registered in the Glue Data Catalog, a metadata repository. The database will be used to store the metadata related to the data integrations performed by zero-ETL.
Ontotext’s approach is to optimize models and algorithms through human contribution and benchmarking in order to create better and more accurate AI. To be able to annotate the specified content consistently and unambiguously, these experts usually follow a set of specific conventions, which are referred to as “annotation guidelines”.
They understand data modeling, including conceptualization and database optimization, and demonstrate a commitment to continuing education. According to Dataversity , good data architects have a solid understanding of the cloud, databases, and the applications and programs used by those databases.
While data management has become a common term for the discipline, it is sometimes referred to as data resource management or enterprise information management (EIM). Programs must support proactive and reactive change management activities for reference data values and the structure/use of master data and metadata.
If you simply run queries without considering the optimal data layout on Amazon S3, you end up with a high volume of data scanned, long-running queries, and increased cost. Partitioning is a common technique to lay out your data optimally for distributed analytics engines. We can also see the partition metadata on the AWS Glue console.
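A minimal PySpark sketch of that layout, writing data partitioned by a date column so queries on that column prune partitions; the bucket and column names are placeholders.

```python
# Minimal PySpark sketch: write data to Amazon S3 partitioned by date so that
# query engines can prune partitions instead of scanning the full dataset.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioned-layout").getOrCreate()
events = spark.read.parquet("s3://my-bucket/raw/events/")

(events
    .write
    .partitionBy("event_date")          # one S3 prefix per day
    .mode("overwrite")
    .parquet("s3://my-bucket/curated/events/"))

# A query that filters on the partition column only reads the matching prefixes.
count = (spark.read.parquet("s3://my-bucket/curated/events/")
         .filter("event_date = '2024-01-01'")
         .count())
print(count)
```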
The Iceberg specification allows seamless table evolution such as schema and partition evolution, and its design is optimized for usage on Amazon S3. Iceberg stores the metadata pointer for all the metadata files. For more details on Iceberg format versions, refer to Format Versioning.
Pushing down column predicate filters to Kudu allows for optimized execution by skipping reading column values for filtered out rows and reducing network IO between a client, like the distributed query engine Apache Impala, and Kudu. See the references section below for details on the table schema, loading process, and queries that were run.
Despite these capabilities, data lakes are not databases, and object storage does not provide support for ACID processing semantics, which you may require to effectively optimize and manage your data at scale across hundreds or thousands of users using a multitude of different technologies.
BMW Group uses 4,500 AWS Cloud accounts across the entire organization but is faced with the challenge of reducing unnecessary costs, optimizing spend, and having a central place to monitor costs. For more information on this foundation, refer to A Detailed Overview of the Cost Intelligence Dashboard.
Apache Iceberg is an open table format for very large analytic datasets, which captures metadata information on the state of datasets as they evolve and change over time. Apache Iceberg addresses customer needs by capturing rich metadata information about the dataset at the time the individual data files are created.
The individual pieces of data within these streams are often referred to as records. Store large records in Amazon S3 with a reference in Kinesis Data Streams A useful approach for storing large records involves utilizing an alternative storage solution while employing a reference within Kinesis Data Streams.
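A hedged sketch of that pattern with boto3: the full payload goes to Amazon S3 and only a small JSON reference is put on the stream; the bucket, stream, and key names are placeholders.

```python
# Hedged sketch of the S3-reference pattern: store the large payload in Amazon
# S3 and put only a small pointer record on the Kinesis data stream.
import json
import uuid
import boto3

s3 = boto3.client("s3")
kinesis = boto3.client("kinesis")

def put_large_record(stream_name: str, bucket: str, payload: bytes, partition_key: str):
    key = f"large-records/{uuid.uuid4()}.bin"
    s3.put_object(Bucket=bucket, Key=key, Body=payload)   # full payload in S3
    reference = {"s3_bucket": bucket, "s3_key": key, "size": len(payload)}
    kinesis.put_record(                                   # small reference on the stream
        StreamName=stream_name,
        Data=json.dumps(reference).encode("utf-8"),
        PartitionKey=partition_key,
    )

put_large_record("clickstream", "my-record-overflow-bucket", b"example-payload" * 1000, "user-42")
```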
Through their unique position in ports, at sea, and on roads, they optimize global cargo flows and create sustainable customer value. To share the datasets, they needed a way to share access to the data and access to catalog metadata in the form of tables and views. An AWS Glue job (metadata exporter) runs daily on the source account.
KGs bring the Semantic Web paradigm to enterprises, introducing semantic metadata to drive data management and content management to new levels of efficiency and breaking silos so they can synergize with the various forms of knowledge management used across different systems in the enterprise. Take this restaurant, for example.