Managing metadata across tools and teams is a growing challenge for organizations building modern data and AI platforms. As data volumes grow and generative AI becomes more central to business strategy, teams need a consistent way to define, discover, and govern their datasets, features, and models.
Amazon Athena is an interactive analytics service for analyzing data in Amazon Simple Storage Service (Amazon S3). Amazon Redshift is used to analyze structured and semi-structured data across data warehouses, operational databases, and data lakes. Table metadata is fetched from the AWS Glue Data Catalog.
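For context, here is a minimal, hedged sketch of querying S3-resident data through Athena from Python with boto3. The database, table, and result-bucket names are hypothetical placeholders, not values from the post.

```python
# Run an Athena query and print the first page of results.
# Database, table, and output location are illustrative assumptions.
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="SELECT * FROM sales LIMIT 10",
    QueryExecutionContext={"Database": "analytics_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
query_id = response["QueryExecutionId"]

# Poll until the query finishes.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```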
The landscape of big data management has been transformed by the rising popularity of open table formats such as Apache Iceberg, Apache Hudi, and Linux Foundation Delta Lake. These formats, designed to address the limitations of traditional data storage systems, have become essential in modern data architectures.
How RFS works: OpenSearch and Elasticsearch snapshots are a directory tree that contains both data and metadata. The raw data for a given shard is stored in its corresponding shard subdirectory as a collection of Lucene files, which OpenSearch and Elasticsearch lightly obfuscate. source cluster containing 5 TiB (3.9
Onboard key data products – The team identified the key data products that enabled these two use cases and aligned on onboarding them into the data solution. These data products belonged to data domains such as production, finance, and logistics. The solution highlights the guardrails that enable ease of access to quality data.
To achieve this, they aimed to break down data silos and centralize data from various business units and countries into the BMW Cloud Data Hub (CDH). Consumer accounts: Used by data consumers to implement use cases, derive insights, and build applications tailored to their business needs.
We have enhanced data sharing performance with improved metadata handling, resulting in first query execution on a data share that is up to four times faster when the data sharing producer's data is being updated. Industry-leading price-performance: Amazon Redshift launches RA3.large
Open table formats are emerging in the rapidly evolving domain of big data management, fundamentally altering the landscape of data storage and analysis. They are useful for flexible data lifecycle management. An Iceberg table's metadata stores a history of snapshots, which are updated with each transaction.
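As an illustration of that snapshot history, here is a short PySpark sketch that reads an Iceberg table's snapshots metadata table. It assumes a SparkSession already configured with an Iceberg catalog named glue_catalog; the database and table names are hypothetical.

```python
# Inspect an Iceberg table's snapshot history via its "snapshots" metadata table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-snapshots").getOrCreate()

# Iceberg exposes metadata tables alongside the data table; "snapshots"
# lists every snapshot produced by past transactions.
snapshots = spark.sql("""
    SELECT snapshot_id, parent_id, committed_at, operation
    FROM glue_catalog.analytics_db.orders.snapshots
    ORDER BY committed_at DESC
""")
snapshots.show(truncate=False)
```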
SageMaker brings together widely adopted AWS ML and analytics capabilities—virtually all of the components you need for data exploration, preparation, and integration; petabyte-scale big data processing; fast SQL analytics; model development and training; governance; and generative AI development.
After you create the asset, you can add glossaries or metadata forms, but it's not necessary for this post. You can publish the data asset so it's now discoverable within the Amazon DataZone portal. Create it as a JSON file on your workstation (for this post, we call it blog-sub-target.json). Enter a name for the asset.
This approach simplifies your data journey and helps you meet your security requirements. The SageMaker Lakehouse data connection testing capability boosts your confidence in established connections. About the Authors: Chiho Sugimoto is a Cloud Support Engineer on the AWS Big Data Support team.
Run the following commands:
export PROJ_NAME=lfappblog
aws s3 cp s3://aws-blogs-artifacts-public/BDB-3934/schema.graphql ~/${PROJ_NAME}/amplify/backend/api/${PROJ_NAME}/schema.graphql
In the schema.graphql file, you can see that the lf-app-lambda-engine function is set as the data source for the GraphQL queries.
Organizations commonly choose Apache Avro as their data serialization format for IoT data due to its compact binary format, built-in schema evolution support, and compatibility with big data processing frameworks. The schema literal serves as a form of metadata, providing a clear description of your data structure.
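To make the schema-as-metadata point concrete, here is a minimal sketch of writing and reading IoT readings with Avro using the fastavro library. The schema, field names, and file path are illustrative assumptions, not taken from the post.

```python
# Serialize sample IoT readings to a compact binary Avro file and read them back.
from fastavro import parse_schema, writer, reader

schema = parse_schema({
    "type": "record",
    "name": "SensorReading",
    "namespace": "example.iot",
    "fields": [
        {"name": "device_id", "type": "string"},
        {"name": "timestamp", "type": "long"},
        {"name": "temperature", "type": "double"},
        # A nullable field with a default supports schema evolution for older readers.
        {"name": "humidity", "type": ["null", "double"], "default": None},
    ],
})

records = [
    {"device_id": "sensor-001", "timestamp": 1700000000000, "temperature": 21.4, "humidity": 0.43},
    {"device_id": "sensor-002", "timestamp": 1700000000500, "temperature": 19.8, "humidity": None},
]

# The schema travels with the data in the file header.
with open("readings.avro", "wb") as out:
    writer(out, schema, records)

with open("readings.avro", "rb") as fo:
    for rec in reader(fo):
        print(rec)
```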
Business analysts enhance the data with business metadata/glossaries and publish them as data assets or data products. The data security officer sets permissions in Amazon DataZone to allow users to access the data portal. Amazon Athena is used to query and explore the data.
This blog post explores how zero-ETL capabilities combined with new application connectors are transforming the way businesses integrate and analyze data from popular platforms such as ServiceNow, Salesforce, Zendesk, SAP, and others. The data is also registered in the AWS Glue Data Catalog, a metadata repository.
Data lakes were originally designed to store large volumes of raw, unstructured, or semi-structured data at a low cost, primarily serving big data and analytics use cases. By using features like Iceberg's compaction, OTFs streamline maintenance, making it straightforward to manage object and metadata versioning at scale.
Under Data sources, select Amazon S3. Select the Amazon S3 source node and enter the following values: S3 URI: s3://aws-bigdata-blog/generated_synthetic_reviews/data/product_category=Apparel/; Format: Parquet. Select Update node. About the authors: Chiho Sugimoto is a Cloud Support Engineer on the AWS Big Data Support team.
You can use sample data to extract information from a specific category, update partition metadata, and display query results in the notebook using Python code. To use the sample data provided in this blog post, your domain should be in the us-east-1 Region. Choose the plus sign, and under Data sources, choose Amazon S3.
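As one hedged sketch of that notebook step, the snippet below refreshes partition metadata and displays query results using the awswrangler (AWS SDK for pandas) library rather than the post's exact code; the database and table names are hypothetical placeholders.

```python
# Refresh partition metadata for a partitioned table, then query one category.
import awswrangler as wr

# Register any partitions added under the table's S3 location so queries can see them.
wr.athena.start_query_execution(
    sql="MSCK REPAIR TABLE product_reviews",
    database="reviews_db",
    wait=True,
)

# Display the results as a pandas DataFrame in the notebook.
df = wr.athena.read_sql_query(
    sql="SELECT * FROM product_reviews WHERE product_category = 'Apparel' LIMIT 20",
    database="reviews_db",
)
df.head()
```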
Publish data assets – As the data producer from the retail team, you must ingest individual data assets into Amazon DataZone. For this use case, create a data source and import the technical metadata of the data assets—customers, order_items, orders, products, reviews, and shipments—from the AWS Glue Data Catalog.
This feature will be discussed in detail later in this blog. The raw metadata is assumed to be no more than 100 GB. However, the recently introduced disk-based vector search feature eliminates the need for external vector quantization. For detailed implementation steps, refer to the OpenSearch documentation.
However, you can use Amazon MSK Replicator to copy all data and metadata from your existing MSK cluster to a new cluster composed of Express brokers. MSK Replicator offers a built-in replication capability to seamlessly replicate data from one cluster to another. It doesn't explicitly copy the write ACLs except the deny ones.
Iceberg's branching feature: Iceberg offers a branching feature for data lifecycle management, which is particularly useful for efficiently implementing the write-audit-publish (WAP) pattern. The metadata of an Iceberg table stores a history of snapshots. He is based in Tokyo, Japan.
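The sketch below shows one way the WAP pattern can look with Iceberg branching from PySpark. The catalog, database, table, branch names, and source path are hypothetical, and an Iceberg-enabled SparkSession (with the Iceberg SQL extensions) is assumed; it is not the post's exact implementation.

```python
# Write-audit-publish with an Iceberg branch: stage writes on a branch,
# validate them, then fast-forward main to publish.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-wap").getOrCreate()

catalog = "glue_catalog"           # assumption: Iceberg catalog configured on the session
table = "analytics_db.orders"      # hypothetical database.table
full_name = f"{catalog}.{table}"

# 1. Write: create an audit branch and route subsequent writes to it.
spark.sql(f"ALTER TABLE {full_name} CREATE BRANCH audit")
spark.conf.set("spark.wap.branch", "audit")
new_data = spark.read.parquet("s3://example-bucket/src-data/current/")
new_data.writeTo(full_name).append()

# 2. Audit: validate the staged data on the branch before exposing it.
staged = spark.sql(f"SELECT COUNT(*) AS c FROM {full_name} VERSION AS OF 'audit'").first()["c"]
assert staged > 0, "audit failed: no rows staged on the audit branch"

# 3. Publish: fast-forward main to the audited branch's latest snapshot.
spark.sql(f"CALL {catalog}.system.fast_forward('{table}', 'main', 'audit')")
```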
Many enterprises have heterogeneous data platforms and technology stacks across different business units or data domains. For decades, they have been struggling with the scale, speed, and correctness required to derive timely, meaningful, and actionable insights from vast and diverse big data environments.
Data processing and SQL analytics: Analyze, prepare, and integrate data for analytics and AI using Amazon Athena, Amazon EMR, AWS Glue, and Amazon Redshift. Data and AI governance: Publish your data products to the catalog with glossaries and metadata forms. Big Data Architect. option("sep", ",").load("s3://aws-blogs-artifacts-public/artifacts/BDB-4798/data/venue.csv")
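The excerpt above ends with a truncated Spark read call. Below is a hedged reconstruction of what a complete read of that file might look like in PySpark; the leading reader options (format and header handling) are assumptions, since only the tail of the original statement is shown.

```python
# Load the sample venue CSV referenced above into a DataFrame.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("venue-load").getOrCreate()

venue_df = (
    spark.read.format("csv")
    .option("header", "false")   # assumption: the sample file has no header row
    .option("sep", ",")
    .load("s3://aws-blogs-artifacts-public/artifacts/BDB-4798/data/venue.csv")
)
venue_df.show(5)
```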
Each product record contains rich metadata, including title, detailed description, category, color, and price. For more insights, best practices and architectures, and industry trends, refer to Amazon OpenSearch Service blog posts and hands-on workshops at AWS Workshops. For an exhaustive list, refer to Search features.
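To show what such a product record might look like when indexed for search, here is a minimal sketch using the opensearch-py client. The endpoint, credentials, index name, document id, and field values are hypothetical placeholders.

```python
# Index a product document with the metadata fields described above, then search it.
from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=[{"host": "search-example-domain.us-east-1.es.amazonaws.com", "port": 443}],
    http_auth=("user", "password"),
    use_ssl=True,
)

product = {
    "title": "Classic denim jacket",
    "description": "Mid-weight denim jacket with button front and two chest pockets.",
    "category": "Apparel",
    "color": "blue",
    "price": 79.99,
}

client.index(index="products", id="B000123", body=product, refresh=True)

# Simple full-text query over title and description.
results = client.search(
    index="products",
    body={"query": {"multi_match": {"query": "denim jacket", "fields": ["title", "description"]}}},
)
print(results["hits"]["total"])
```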
Are you incurring significant cross-Availability Zone traffic costs when running an Apache Kafka client in containerized environments on Amazon Elastic Kubernetes Service (Amazon EKS) that consumes data from Amazon Managed Streaming for Apache Kafka (Amazon MSK) topics? An Apache Kafka consumer registers to read from a topic.
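One common way to reduce that cross-AZ traffic is rack-aware fetching: setting client.rack lets a consumer read from the replica in its own Availability Zone, provided fetch-from-follower is enabled on the cluster. The sketch below uses the confluent-kafka Python client; the broker address, group id, topic, and AZ id are hypothetical, and this is an illustrative approach rather than the post's exact solution.

```python
# Consume from the closest replica by advertising the client's rack (AZ id).
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "b-1.example-msk.amazonaws.com:9092",
    "group.id": "analytics-consumers",
    # Match this to the Availability Zone id of the pod/node running the client.
    "client.rack": "use1-az1",
    "auto.offset.reset": "earliest",
})

consumer.subscribe(["clickstream"])
try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            print(f"consumer error: {msg.error()}")
            continue
        print(msg.topic(), msg.partition(), msg.value())
finally:
    consumer.close()
```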
More details related to baggage operational database modernization can be found at Enhance the reliability of airlines’ mission-critical baggage handling using Amazon DynamoDB in the AWS Database Blog. Amazon QuickSight can be configured to use Amazon Athena to read the data catalog.
For example, radiology requires high storage capacity and bandwidth for large medical imaging files, along with specialized indexing for metadata searches. For details, see this blog post: Workload management in OpenSearch-based multi-tenant centralized logging platforms. Transition to cold storage for years 3–7.
Traditionally, answering this question would involve multiple data exports, complex extract, transform, and load (ETL) processes, and careful data synchronization across systems. The table metadata is managed by Data Catalog. You can use SageMaker Lakehouse to unify the data across different data sources.
When an organization’s data governance and metadata management programs work in harmony, everything is easier. Data governance is a complex but critical practice. There’s always more data to handle, much of it unstructured; more data sources, like IoT; more points of integration; and more regulatory compliance requirements.
In the era of big data, data lakes have emerged as a cornerstone for storing vast amounts of raw data in its native format. They support structured, semi-structured, and unstructured data, offering a flexible and scalable environment for data ingestion from multiple sources.
What Is Metadata? Metadata is information about data. A clothing catalog and a dictionary are both examples of metadata repositories. Indeed, a popular online catalog, like Amazon, offers rich metadata around products to guide shoppers: ratings, reviews, and product details are all examples of metadata.
Metadata management is key to wringing all the value possible from data assets. However, most organizations don’t use all the data at their disposal to reach deeper conclusions about how to drive revenue, achieve regulatory compliance, or accomplish other strategic objectives.
Untapped data, if mined, represents tremendous potential for your organization. While there has been a lot of talk about big data over the years, the real hero in unlocking the value of enterprise data is metadata, or the data about the data. Metadata Is the Heart of Data Intelligence.
We’re excited to announce a new feature in Amazon DataZone that offers enhanced metadata governance for your subscription approval process. With this update, domain owners can define and enforce metadata requirements for data consumers when they request access to data assets.
Metadata is an important part of data governance, and as a result, most nascent data governance programs are rife with project plans for assessing and documenting metadata. But in many scenarios, it seems that the underlying driver of metadata collection projects is that it’s just something you do for data governance.
What you have just experienced is a plethora of heteronyms. If you include the title of this blog, you were just presented with 13 examples of heteronyms in the preceding paragraphs. Smart content includes labeled (tagged, annotated) metadata (TAM). This is accomplished through tags, annotations, and metadata (TAM).
Aptly named, metadata management is the process by which BI and analytics teams manage metadata, which is the data that describes other data. In other words, data is the content and metadata is the context. Without metadata, BI teams are unable to understand the data’s full story.
We have identified the top ten sites, videos, or podcasts online that deal with data lineage. Our list of Top 10 Data Lineage Podcasts, Blogs, and Websites To Follow in 2021. Data Engineering Podcast. This podcast centers around data management and investigates a different aspect of this field each week.
There are countless examples of big data transforming many different industries. There is no disputing the fact that the collection and analysis of massive amounts of unstructured data have been a huge breakthrough. This is something that you can learn more about in just about any technology blog.
Apache Iceberg is an open table format for very large analytic datasets, which captures metadata information on the state of datasets as they evolve and change over time. Iceberg has become very popular for its support for ACID transactions in data lakes and features like schema and partition evolution, time travel, and rollback.
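As a small illustration of the time travel feature mentioned above, here is a hedged PySpark sketch. The catalog, database, table names, timestamp, and snapshot id are hypothetical placeholders, and an Iceberg-enabled SparkSession is assumed.

```python
# Query an Iceberg table as of a past point in time, and by snapshot id.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-time-travel").getOrCreate()

# SQL time travel: read the table state as of a timestamp.
historical = spark.sql("""
    SELECT COUNT(*) AS row_count
    FROM glue_catalog.analytics_db.orders
    TIMESTAMP AS OF '2024-01-01 00:00:00'
""")
historical.show()

# DataFrame API equivalent using a snapshot id taken from the table's
# snapshots metadata table.
df = (
    spark.read.option("snapshot-id", 1234567890123456789)
    .table("glue_catalog.analytics_db.orders")
)
```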
This blog post is co-written with Hardeep Randhawa and Abhay Kumar from HPE. Each file arrives as a pair with a tail metadata file in CSV format containing the size and name of the file. This metadata file is later used to read source file names during processing into the staging layer.
This is a guest blog post co-written with Sumesh M R from Cargotec and Tero Karttunen from Knowit Finland. They chose AWS Glue as their preferred data integration tool due to its serverless nature, low maintenance, ability to control compute resources in advance, and scale when needed.
When the pandemic first hit, there was some negative impact on big data and analytics spending. Digital transformation was accelerated, and budgets for spending on big data and analytics increased. Technical metadata is what makes up database schema and table definitions.
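As an illustration of technical metadata, the sketch below pulls a table's schema and definition from the AWS Glue Data Catalog with boto3. The database and table names are hypothetical placeholders.

```python
# Inspect technical metadata (schema, location, column types) for a Glue table.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

table = glue.get_table(DatabaseName="analytics_db", Name="orders")["Table"]

print(table["Name"], table["StorageDescriptor"]["Location"])
for column in table["StorageDescriptor"]["Columns"]:
    print(f'{column["Name"]}: {column["Type"]}')
```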