Writing SQL queries requires not just remembering SQL syntax rules, but also knowledge of table metadata: the table schemas, relationships among tables, and possible column values. Although LLMs can generate syntactically correct SQL queries, they still need table metadata to write accurate ones.
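A minimal sketch of supplying that metadata to a model, assuming a hypothetical llm_complete() client and an illustrative two-table schema; the prompt format is not any specific product's API:

```python
# Hypothetical sketch: grounding an LLM text-to-SQL request with table metadata.
# The schema, the prompt format, and llm_complete() are all illustrative.
TABLE_METADATA = """
Table: orders(order_id INT PK, customer_id INT FK->customers.customer_id,
              order_date DATE, total_amount DECIMAL(10,2))
Table: customers(customer_id INT PK, name VARCHAR, region VARCHAR)
"""

def build_sql_prompt(question: str) -> str:
    # Including schemas, keys, and relationships lets the model join correctly
    # instead of guessing column names.
    return (
        "Given the following table metadata:\n"
        f"{TABLE_METADATA}\n"
        f"Write a syntactically correct SQL query that answers: {question}\n"
        "Return only the SQL."
    )

prompt = build_sql_prompt("Total order amount per region for 2024")
# sql = llm_complete(prompt)  # llm_complete() is a placeholder for your model client
```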
We will also cover this pattern with automatic compaction through AWS Glue Data Catalog table optimization. Consider a streaming pipeline ingesting real-time event data while a scheduled compaction job runs to optimize file sizes. Each writer must load the table's latest metadata and determine which metadata version to use as the base for its update, as sketched below.
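A toy Python sketch of that optimistic-concurrency protocol; the in-memory catalog and lock are stand-ins for a real metadata store and its atomic compare-and-swap, so only the retry logic is the point:

```python
# Minimal sketch of the optimistic-concurrency commit used by table formats
# like Iceberg. The dict below is a toy metadata store, not a real catalog.
import threading

_catalog = {"version": 0, "snapshots": []}   # toy metadata store
_lock = threading.Lock()                     # stands in for an atomic CAS

def commit_with_retry(update, max_attempts=5):
    for _ in range(max_attempts):
        base_version = _catalog["version"]                # 1. load latest metadata
        new_snapshots = _catalog["snapshots"] + [update]  # 2. rewrite against that base
        with _lock:                                       # 3. atomic compare-and-swap
            if _catalog["version"] == base_version:       # nobody committed meanwhile
                _catalog["version"] = base_version + 1
                _catalog["snapshots"] = new_snapshots
                return base_version + 1
        # Another writer (e.g., the compaction job) won the race;
        # retry on top of the newer metadata version.
    raise RuntimeError("commit failed: too many concurrent writers")

print(commit_with_retry({"op": "append", "files": ["f1.parquet"]}))
```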
Iceberg offers distinct advantages over Parquet through its metadata layer, such as improved data management, performance optimization, and integration with various query engines. Iceberg's table format separates data files from metadata files, enabling efficient data modifications without full dataset rewrites.
Amazon Q generative SQL for Amazon Redshift uses generative AI to analyze user intent, query patterns, and schema metadata to identify common SQL query patterns directly within Amazon Redshift, accelerating the query authoring process for users and reducing the time required to derive actionable data insights.
The adoption of open table formats is a crucial consideration for organizations looking to optimize their data management practices and extract maximum value from their data. For more details, refer to Iceberg Release 1.6.1. An Iceberg table’s metadata stores a history of snapshots, which are updated with each transaction.
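For example, assuming a Spark session already configured with the Iceberg runtime and a catalog named glue_catalog (db.events is a placeholder table), that snapshot history can be inspected through Iceberg's snapshots metadata table:

```python
from pyspark.sql import SparkSession

# Assumes Spark was launched with the Iceberg runtime and an Iceberg catalog
# named `glue_catalog`; `db.events` is a placeholder table.
spark = SparkSession.builder.appName("snapshot-history").getOrCreate()

# Iceberg exposes each table's snapshot history as a metadata table.
spark.sql("""
    SELECT snapshot_id, parent_id, committed_at, operation
    FROM glue_catalog.db.events.snapshots
    ORDER BY committed_at DESC
""").show(truncate=False)
```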
Relational databases benefit from decades of tweaks and optimizations to deliver performance. This is a graph of millions of edges and vertices; in enterprise data management terms, it is a giant piece of master/reference data. Not Every Graph is a Knowledge Graph: Schemas and Semantic Metadata Matter.
Impala Optimizations for Small Queries. We’ll discuss the various phases Impala takes a query through and how small query optimizations are incorporated into the design of each phase. For a more in-depth description of these phases, please refer to Impala: A Modern, Open-Source SQL Engine for Hadoop. Query Planner Design.
The AWS Glue Data Catalog now enhances managed table optimization of Apache Iceberg tables by automatically removing data files that are no longer needed. Along with the Glue Data Catalog’s automated compaction feature, these storage optimizations can help you reduce metadata overhead, control storage costs, and improve query performance.
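A hedged sketch of enabling one of these optimizers programmatically through the Glue CreateTableOptimizer API in boto3; the account ID, role, and table names are placeholders, and the configuration shape is abbreviated, so verify it against the current boto3 documentation:

```python
# Hedged sketch: enabling orphan file deletion on an Iceberg table via the
# AWS Glue CreateTableOptimizer API. All identifiers are placeholders.
import boto3

glue = boto3.client("glue")

glue.create_table_optimizer(
    CatalogId="111122223333",          # your AWS account ID
    DatabaseName="analytics_db",
    TableName="events",
    Type="orphan_file_deletion",
    TableOptimizerConfiguration={
        "roleArn": "arn:aws:iam::111122223333:role/GlueOptimizerRole",
        "enabled": True,
    },
)
```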
Customers maintain multiple MWAA environments to separate development stages, optimize resources, manage versions, enhance security, ensure redundancy, customize settings, improve scalability, and facilitate experimentation. Refer to Amazon Managed Workflows for Apache Airflow Pricing for rates and more details.
First query response times for dashboard queries have improved significantly through optimized code execution and reduced compilation overhead. We have enhanced our autonomics algorithms to generate and implement smarter, faster recommendations for optimal data layouts, including distribution and sort keys, further improving performance.
Whether you’re a data analyst seeking a specific metric or a data steward validating metadata compliance, this update delivers a more precise, governed, and intuitive search experience. This supports data hygiene and infrastructure cost optimization.
This workload imbalance presents a challenge for customers seeking to optimize their resource utilization and stream processing efficiency. The new release reduces the Amazon DynamoDB cost associated with KCL by optimizing read operations on the DynamoDB table storing metadata. For more on AWS SDK for Java 2.x benefits, refer to Use features of the AWS SDK for Java 2.x.
Iceberg tables store metadata in manifest files. As the number of data files increases, the amount of metadata stored in these manifest files also increases, leading to longer query planning times. Query runtime also increases because it’s proportional to the number of data or metadata file read operations.
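Reusing the Spark session from the earlier sketch, Iceberg's built-in maintenance procedures can compact both manifests and small data files; catalog and table names remain placeholders:

```python
# Compact manifest metadata so query planning reads fewer, larger manifests.
spark.sql("CALL glue_catalog.system.rewrite_manifests('db.events')")

# Compacting small data files also curbs manifest growth over time
# (134217728 bytes = 128 MB target files).
spark.sql("""
    CALL glue_catalog.system.rewrite_data_files(
        table => 'db.events',
        options => map('target-file-size-bytes', '134217728')
    )
""")
```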
Starting today, the Athena SQL engine uses a cost-based optimizer (CBO), a new feature that uses table and column statistics stored in the AWS Glue Data Catalog as part of the table’s metadata. Let’s discuss some of the cost-based optimization techniques that contributed to improved query performance.
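A sketch of populating those statistics with the Glue StartColumnStatisticsTaskRun API via boto3 so the CBO has something to work with; all names are placeholders, and the SampleSize semantics should be checked against the boto3 docs:

```python
# Hedged sketch: generating Data Catalog column statistics that Athena's
# cost-based optimizer can consume. All identifiers are placeholders.
import boto3

glue = boto3.client("glue")

glue.start_column_statistics_task_run(
    DatabaseName="analytics_db",
    TableName="events",
    Role="arn:aws:iam::111122223333:role/GlueStatsRole",
    SampleSize=100.0,   # assumed to be the percentage of rows to sample
)
```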
BladeBridge provides a configurable framework to seamlessly convert legacy metadata and code into more modern services such as Amazon Redshift. For more details, refer to the BladeBridge Analyzer Demo. Refer to this BladeBridge documentation to get more details on SQL and expression conversion.
We’re excited to announce a new feature in Amazon DataZone that offers enhanced metadata governance for your subscription approval process. With this update, domain owners can define and enforce metadata requirements for data consumers when they request access to data assets. Key benefits: the feature benefits multiple stakeholders.
Organizations with particularly deep data stores might need a data catalog with advanced capabilities, such as automated metadata harvesting to speed up the data preparation process. The most efficient and streamlined way to achieve this is by using a data catalog, which can provide a first stop for users ahead of working in BI platforms.
We won’t be writing code to optimize scheduling in a manufacturing plant; we’ll be training ML algorithms to find optimum performance based on historical data. With machine learning, the challenge isn’t writing the code; the algorithms are implemented in a number of well-known and highly optimized libraries.
Let’s briefly describe the capabilities of the AWS services we referred to above: AWS Glue is a fully managed, serverless, and scalable extract, transform, and load (ETL) service that simplifies the process of discovering, preparing, and loading data for analytics. As stated earlier, the first step involves data ingestion.
Unfiltered Table Metadata: This tab displays the response of the AWS Glue GetUnfilteredTableMetadata API for the selected table. Retrieve table data and metadata for this user to see how Lake Formation permissions are enforced and how the two users see different data (on the Authorized Data tab).
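A short boto3 sketch of that call, with placeholder account, database, and table names:

```python
# Sketch: calling GetUnfilteredTableMetadata to see what Lake Formation
# permits a caller to read. All identifiers are placeholders.
import boto3

glue = boto3.client("glue")

resp = glue.get_unfiltered_table_metadata(
    CatalogId="111122223333",
    DatabaseName="analytics_db",
    Name="events",
    SupportedPermissionTypes=["COLUMN_PERMISSION"],
)
# The authorized columns in the response reflect the caller's Lake Formation
# permissions, which is why two users can see different results.
print(resp.get("AuthorizedColumns"), resp.get("IsRegisteredWithLakeFormation"))
```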
When you use Trino on Amazon EMR or Athena, you get the latest open source community innovations along with proprietary, AWS-developed optimizations. Starting from Amazon EMR 6.8.0 and Athena engine version 2, AWS has been developing query plan and engine behavior optimizations that improve query performance on Trino.
For instructions, refer to How to Set Up a MongoDB Cluster. Choose the table to view the schema and other metadata. Conclusion In this post, we showed how to set up an AWS Glue crawler to crawl over a MongoDB Atlas collection, gathering metadata and creating table records in the AWS Glue Data Catalog.
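For reference, a hedged boto3 sketch of defining such a crawler programmatically rather than through the console; the connection name, role, and path are placeholders:

```python
# Hedged sketch: a Glue crawler over a MongoDB Atlas collection. The
# MongoDBTargets shape follows the Glue CreateCrawler API; names are placeholders.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="mongodb-atlas-crawler",
    Role="arn:aws:iam::111122223333:role/GlueCrawlerRole",
    DatabaseName="mongodb_catalog_db",   # target Data Catalog database
    Targets={
        "MongoDBTargets": [
            {
                "ConnectionName": "mongodb-atlas-connection",
                "Path": "sample_db/sample_collection",  # database/collection
            }
        ]
    },
)
glue.start_crawler(Name="mongodb-atlas-crawler")
```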
Over the last year, Amazon Redshift added several performance optimizations for data lake queries across multiple areas of the query engine, such as rewrites, planning, scan execution, and consumption of AWS Glue Data Catalog column statistics. Some of the queries in our benchmark experienced up to a 12x speedup.
Metadata management performs a critical role within the modern data management stack. However, as data volumes continue to grow, manual approaches to metadata management are suboptimal and can result in missed opportunities. This puts into perspective the role of active metadata management, which can improve data discovery.
Each file arrives paired with a tail metadata file in CSV format containing the data file’s size and name. This metadata file is later used to read source file names during processing into the staging layer. These files follow the same naming pattern, with a daily system-generated timestamp appended to each file name.
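An illustrative Python sketch of reading such a tail metadata file; the column names (file_name, file_size) and the pairing convention are assumptions, not the original pipeline's exact layout:

```python
# Illustrative sketch only: column names and the "_tail" pairing convention
# are assumed for this example.
import csv

def read_tail_metadata(metadata_path: str) -> list[dict]:
    """Return the (name, size) entries recorded for the arriving data file."""
    with open(metadata_path, newline="") as f:
        return [
            {"file_name": row["file_name"], "file_size": int(row["file_size"])}
            for row in csv.DictReader(f)
        ]

# e.g., data file "sales_20240101.csv" paired with "sales_20240101_tail.csv"
for entry in read_tail_metadata("sales_20240101_tail.csv"):
    print(entry["file_name"], entry["file_size"])
```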
Although we don’t cover optimizing your jobs for cost in this post, you can refer to Monitor and optimize cost on AWS Glue for Apache Spark to learn how to fine-tune your AWS Glue jobs for performance, efficiency, and cost optimization. Let’s dive in! If the tables don’t exist, Athena creates them.
Each storage format implements this functionality in slightly different ways; for a comparison, refer to Choosing an open table format for your transactional data lake on AWS. For more information, refer to Amazon S3: Allows read and write access to objects in an S3 Bucket.
Data quality refers to the assessment of the information you have, relative to its purpose and its ability to serve that purpose. While the digital age has been successful in prompting innovation far and wide, it has also facilitated what is referred to as the “data crisis” – low-quality data. 2 – Data profiling.
Ontotext’s approach is to optimize models and algorithms through human contribution and benchmarking in order to create better and more accurate AI. To be able to annotate the specified content consistently and unambiguously, these experts usually follow a set of specific conventions, which are referred to as “annotation guidelines”.
They understand data modeling, including conceptualization and database optimization, and demonstrate a commitment to continuing education. According to Dataversity , good data architects have a solid understanding of the cloud, databases, and the applications and programs used by those databases.
We group the new capabilities into four categories: discover and secure, connect with data sharing, scale and optimize, and audit and monitor. Let’s dive deeper and discuss the new capabilities introduced in 2023. To learn more about DataZone, refer to the User Guide. This enhancement simplifies many use cases to avoid metadata duplication.
While data management has become a common term for the discipline, it is sometimes referred to as data resource management or enterprise information management (EIM). Programs must support proactive and reactive change management activities for reference data values and the structure/use of master data and metadata.
To learn more about Express brokers, refer to Introducing Express brokers for Amazon MSK to deliver high throughput and faster scaling for your Kafka clusters. However, you can use Amazon MSK Replicator to copy all data and metadata from your existing MSK cluster to a new cluster composed of Express brokers.
Pushing column predicate filters down to Kudu enables optimized execution by skipping reads of column values for filtered-out rows and reducing network I/O between a client, such as the distributed query engine Apache Impala, and Kudu. See the references section below for details on the table schema, loading process, and queries that were run.
Despite these capabilities, data lakes are not databases, and object storage does not provide support for ACID processing semantics, which you may require to effectively optimize and manage your data at scale across hundreds or thousands of users using a multitude of different technologies.
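This is the gap open table formats fill. For instance, a sketch of an atomic MERGE with Spark on Iceberg, reusing the earlier Spark session; table, view, and column names are placeholders:

```python
# `updates_df` is a placeholder DataFrame of incoming changes.
updates_df.createOrReplaceTempView("updates")

# Atomic upsert: readers see the table before or after this commit, never a
# half-applied update -- semantics plain Parquet on object storage lacks.
spark.sql("""
    MERGE INTO glue_catalog.db.customers AS t
    USING updates AS s
      ON t.customer_id = s.customer_id
    WHEN MATCHED THEN UPDATE SET t.region = s.region
    WHEN NOT MATCHED THEN INSERT *
""")
```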
Through their unique position in ports, at sea, and on roads, they optimize global cargo flows and create sustainable customer value. To share the datasets, they needed a way to share access to the data and access to catalog metadata in the form of tables and views. An AWS Glue job (metadata exporter) runs daily on the source account.
Running Apache Airflow at scale puts proportionally greater load on the Airflow metadata database, sometimes leading to CPU and memory issues on the underlying Amazon Relational Database Service (Amazon RDS) cluster. A resource-starved metadata database may lead to dropped connections from your workers, causing tasks to fail prematurely.
KGs bring the Semantic Web paradigm to enterprises by introducing semantic metadata to drive data management and content management to new levels of efficiency, breaking silos and letting them synergize with various forms of knowledge management. Take this restaurant, for example.
BMW Group uses 4,500 AWS Cloud accounts across the entire organization but is faced with the challenge of reducing unnecessary costs, optimizing spend, and having a central place to monitor costs. For more information on this foundation, refer to A Detailed Overview of the Cost Intelligence Dashboard.
Many of these go slightly (but not very far) beyond your initial expectations: you can ask it to generate a list of terms for search engine optimization, you can ask it to generate a reading list on topics that you’re interested in. In the second, “it” refers to the pitcher. It was not optimized to provide correct responses.
In other words, using metadata about data science work to generate code. SQL optimization provides helpful analogies, given how SQL queries get translated into query graphs internally , then the real smarts of a SQL engine work over that graph. On deck this time ’round the Moon: program synthesis. SQL and Spark.
The Iceberg specification allows seamless table evolution, such as schema and partition evolution, and its design is optimized for usage on Amazon S3. Iceberg stores a pointer to the table’s current metadata file, which in turn tracks all the other metadata files. For more details on Iceberg format versions, refer to Format Versioning.
Apache Iceberg is an open table format for very large analytic datasets, which captures metadata information on the state of datasets as they evolve and change over time. Apache Iceberg addresses customer needs by capturing rich metadata information about the dataset at the time the individual data files are created.
The individual pieces of data within these streams are often referred to as records. Store large records in Amazon S3 with a reference in Kinesis Data Streams: a useful approach for storing large records involves utilizing an alternative storage solution while employing a reference within Kinesis Data Streams.
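A minimal boto3 sketch of that reference (claim-check) pattern, with placeholder bucket and stream names:

```python
# Sketch: store the large payload in Amazon S3 and put only a small pointer
# on the stream. Bucket, stream, and key prefixes are placeholders.
import json
import uuid
import boto3

s3 = boto3.client("s3")
kinesis = boto3.client("kinesis")

def put_large_record(payload: bytes, bucket="large-records-bucket",
                     stream="events-stream"):
    key = f"records/{uuid.uuid4()}"
    s3.put_object(Bucket=bucket, Key=key, Body=payload)   # store the body in S3
    reference = json.dumps({"s3_bucket": bucket, "s3_key": key,
                            "size_bytes": len(payload)})
    # The stream record stays far below the 1 MB Kinesis record limit.
    kinesis.put_record(StreamName=stream, Data=reference, PartitionKey=key)
```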