Amazon Athena provides an interactive analytics service for analyzing data in Amazon Simple Storage Service (Amazon S3). Amazon Redshift is used to analyze structured and semi-structured data across data warehouses, operational databases, and data lakes. Table metadata is fetched from AWS Glue.
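As a rough illustration of this pattern, here is a minimal boto3 sketch of querying S3-resident data through Athena, with the schema resolved from the Glue Data Catalog. The database, table, and results-bucket names are hypothetical placeholders.

```python
import time
import boto3

# Hedged sketch: run an ad hoc Athena query over S3-backed data whose schema
# is registered in the AWS Glue Data Catalog. "analytics_db", "web_logs",
# and the results bucket are placeholder names.
athena = boto3.client("athena")

query_id = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) FROM web_logs GROUP BY status",
    QueryExecutionContext={"Database": "analytics_db"},   # Glue database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)[
        "QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)[
            "ResultSet"]["Rows"]:
        print(row["Data"])
```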
However, commits can still fail if the latest metadata is updated after the base metadata version is established. Iceberg uses a layered architecture to manage table state and data. The catalog layer maintains a pointer to the current table metadata file, serving as the single source of truth for table state.
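To make the catalog-pointer idea concrete, here is a minimal PyIceberg sketch (assuming `pyiceberg[glue]` is installed; the table name is a placeholder):

```python
from pyiceberg.catalog import load_catalog

# Hedged sketch of the catalog layer described above, using PyIceberg with
# the AWS Glue catalog. "analytics_db.web_logs" is a placeholder table.
catalog = load_catalog("glue", type="glue")
table = catalog.load_table("analytics_db.web_logs")

# The catalog stores only a pointer to the current metadata file; an
# optimistic commit succeeds only if that pointer still matches the base
# metadata version the writer started from.
print(table.metadata_location)    # e.g. s3://.../metadata/00003-....metadata.json
print(table.current_snapshot())   # the snapshot the pointer currently resolves to
```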
The landscape of big data management has been transformed by the rising popularity of open table formats such as Apache Iceberg, Apache Hudi, and Linux Foundation Delta Lake. These formats, designed to address the limitations of traditional data storage systems, have become essential in modern data architectures.
Eventually, transactional data lakes emerged to add the transactional consistency and performance of a data warehouse to the data lake. Central to a transactional data lake are open table formats (OTFs) such as Apache Hudi, Apache Iceberg, and Delta Lake, which act as a metadata layer over columnar formats.
Open table formats are emerging in the rapidly evolving domain of big data management, fundamentally altering the landscape of data storage and analysis. For more details, refer to Iceberg Release 1.6.1. These are useful for flexible data lifecycle management. For more details, refer to Delta Lake Release 3.2.1.
Iceberg offers distinct advantages through its metadata layer over Parquet, such as improved data management, performance optimization, and integration with various query engines. Iceberg's table format separates data files from metadata files, enabling efficient data modifications without full dataset rewrites.
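A hedged sketch of such a row-level modification through Spark SQL follows. It assumes the Iceberg Spark runtime and AWS bundle jars are available on the classpath; the catalog name, warehouse path, and table names are placeholders.

```python
from pyspark.sql import SparkSession

# Hedged sketch of a row-level change on an Iceberg table via Spark SQL.
# "demo", the warehouse path, and the table name are placeholder values.
spark = (
    SparkSession.builder
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.demo.warehouse", "s3://my-lake/warehouse/")
    .getOrCreate()
)

# Iceberg rewrites only the data files that contain matching rows, then
# commits a new metadata file and snapshot; no full-table rewrite occurs.
spark.sql("DELETE FROM demo.analytics_db.web_logs WHERE status = 500")
spark.sql("UPDATE demo.analytics_db.web_logs SET status = 200 WHERE request_id = 42")
```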
Amazon Q generative SQL for Amazon Redshift uses generative AI to analyze user intent, query patterns, and schema metadata to identify common SQL query patterns directly within Amazon Redshift. This accelerates the query authoring process for users and reduces the time required to derive actionable data insights.
Whether you're a data analyst seeking a specific metric or a data steward validating metadata compliance, this update delivers a more precise, governed, and intuitive search experience. Start using this enhanced search capability today and experience the difference it brings to your data discovery journey.
What Is Metadata? Metadata is information about data. A clothing catalog and a dictionary are both examples of metadata repositories. Indeed, a popular online catalog, like Amazon, offers rich metadata around products to guide shoppers: ratings, reviews, and product details are all examples of metadata.
Solution overview By combining the powerful vector search capabilities of OpenSearch Service with the access control features provided by Amazon Cognito, this solution enables organizations to manage access controls based on custom user attributes and document metadata. Refer to Service Quotas for more details.
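As one hedged illustration of the idea, a k-NN vector query can be filtered on a document-metadata attribute that mirrors a Cognito user attribute. The host, index, field, and attribute names below are assumptions, not values from the original post.

```python
from opensearchpy import OpenSearch

# Hedged sketch: vector search with a metadata filter. In a full solution,
# the "department" value would come from the caller's Cognito attributes.
client = OpenSearch(
    hosts=[{"host": "my-domain.example.com", "port": 443}],  # placeholder host
    use_ssl=True,
)

query = {
    "size": 5,
    "query": {
        "knn": {
            "embedding": {                       # placeholder vector field
                "vector": [0.12, -0.03, 0.88],   # query embedding (truncated)
                "k": 5,
                # Efficient k-NN filtering: only documents whose metadata
                # matches the user's attribute are candidates.
                "filter": {"term": {"department": "finance"}},
            }
        }
    },
}
print(client.search(index="documents", body=query))
```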
BladeBridge offers a comprehensive suite of tools that automate much of the complex conversion work, allowing organizations to quickly and reliably transition their data analytics capabilities to the scalable Amazon Redshift data warehouse. Amazon Redshift is a fully managed data warehouse service offered by Amazon Web Services (AWS).
If you’re a mystery lover, I’m sure you’ve read that classic tale: Sherlock Holmes and the Case of the Deceptive Data, and you know how a metadata catalog was a key plot element. In The Case of the Deceptive Data, Holmes is approached by B.I. He goes on to explain: Reasons for inaccurate data. Big data is BIG.
The Eightfold Talent Intelligence Platform integrates with Amazon Redshift metadata security to control the visibility of data catalog listings: the names of databases, schemas, tables, views, stored procedures, and functions in Amazon Redshift. This post discusses restricting the listing of data catalog metadata according to granted permissions.
Pricing and availability Amazon MWAA pricing dimensions remain unchanged, and you only pay for what you use: the environment class and the metadata database storage consumed. Metadata database storage pricing remains the same. Refer to Amazon Managed Workflows for Apache Airflow Pricing for rates and more details.
We have enhanced data sharing performance with improved metadata handling, resulting in data sharing first-query execution that is up to four times faster when the data sharing producer's data is being updated. Industry-leading price-performance: Amazon Redshift launches RA3.large
We’re excited to announce a new feature in Amazon DataZone that offers enhanced metadata governance for your subscription approval process. With this update, domain owners can define and enforce metadata requirements for data consumers when they request access to data assets.
Row type – Enable this option to display only rows that have at least one cell with authorized data. Unfiltered Table Metadata – This tab displays the response of the AWS Glue GetUnfilteredTableMetadata API for the selected table. About the Authors Stefano Sandonà is a Senior Big Data Specialist Solutions Architect at AWS.
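For reference, a minimal boto3 sketch of calling that API directly might look like the following; the account ID, database, and table names are placeholders.

```python
import boto3

# Hedged sketch: GetUnfilteredTableMetadata returns the table definition
# together with the Lake Formation column authorizations and cell filters
# that apply to the caller.
glue = boto3.client("glue")

resp = glue.get_unfiltered_table_metadata(
    CatalogId="123456789012",          # placeholder account ID
    DatabaseName="analytics_db",       # placeholder database
    Name="customers",                  # placeholder table
    SupportedPermissionTypes=["COLUMN_PERMISSION", "CELL_FILTER_PERMISSION"],
)
print(resp["Table"]["Name"])
print(resp.get("AuthorizedColumns"))   # columns the caller is allowed to see
print(resp.get("CellFilters"))         # row/cell-level filters, if any
```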
KCL 3.x reduces the Amazon DynamoDB cost associated with KCL by optimizing read operations on the DynamoDB table storing metadata. KCL uses DynamoDB to store metadata such as shard-worker mappings and checkpoints. For AWS SDK for Java 2.x benefits, refer to Use features of the AWS SDK for Java 2.x. Refer to Step 4 of Migrating from KCL 2.x to KCL 3.x.
Because reporting is part of an effective DQM program, we will also go through some data quality metrics examples you can use to assess your efforts in the matter. But first, let's define what data quality actually is and why you need data quality management.
The Graph Algorithms book is awesome, an absolute must-have reference volume, and it is free (for now, downloadable from Neo4j). Any node and its relationship to a particular node becomes a type of contextual metadata for that node. Context may include time, location, related events, nearby entities, and more.
For instructions, refer to How to Set Up a MongoDB Cluster. Configure PrivateLink by following the steps described in Connecting Applications Securely to a MongoDB Atlas Data Plane with AWS PrivateLink. Choose the table to view the schema and other metadata. Sandeep Adwankar is a Senior Technical Product Manager at AWS.
The following are the key components and steps in the integration process: Zero-ETL extracts and loads the data into Amazon S3, a highly scalable object storage service. The data is also registered in the Glue Data Catalog, a metadata repository. Big Data and ETL Solutions Architect, Amazon MWAA and AWS Glue ETL expert.
Data poisoning attacks. Data poisoning refers to someone systematically changing your training data to manipulate your model’s predictions. (Data poisoning attacks have also been called “causative” attacks.) To poison data, an attacker must have access to some or all of your training data.
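A toy sketch makes the mechanics concrete: flipping a fraction of training labels measurably degrades a simple classifier. The data is synthetic and the 10% poisoning rate is an arbitrary assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy label-flipping poisoning attack on synthetic data: an "attacker"
# with access to part of the training set flips labels to degrade the model.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clean = LogisticRegression().fit(X_train, y_train)

poisoned = y_train.copy()
idx = rng.choice(len(poisoned), size=len(poisoned) // 10, replace=False)
poisoned[idx] = 1 - poisoned[idx]          # systematically flip 10% of labels
dirty = LogisticRegression().fit(X_train, poisoned)

print("clean accuracy:   ", clean.score(X_test, y_test))
print("poisoned accuracy:", dirty.score(X_test, y_test))
```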
Governed Tables metadata will continue to exist within the AWS Glue Data Catalog, and the Governed Tables data will remain in your S3 buckets. In this case, refer to Use CTAS and INSERT INTO to work around the 100 partition limit. About the author Mert Hocanin is a Principal Big Data Architect with AWS Lake Formation.
Cloud data architect: The cloud data architect designs and implements data architecture for cloud-based platforms such as AWS, Azure, and Google Cloud Platform. Data security architect: The data security architect works closely with security teams and IT teams to design data security architectures.
SageMaker brings together widely adopted AWS ML and analytics capabilities—virtually all of the components you need for data exploration, preparation, and integration; petabyte-scale big data processing; fast SQL analytics; model development and training; governance; and generative AI development.
We use AWS Glue, a fully managed, serverless ETL (extract, transform, and load) service, and the Google BigQuery Connector for AWS Glue (for more information, refer to Migrating data from Google BigQuery to Amazon S3 using AWS Glue custom connectors). If you don’t have one, refer to Amazon Redshift Serverless. An S3 bucket.
Since its inception, Apache Kafka has depended on Apache ZooKeeper for storing and replicating the metadata of Kafka brokers and topics. More recently, the Kafka community has adopted KRaft (Apache Kafka on Raft), a consensus protocol, to replace Kafka’s dependency on ZooKeeper for metadata management. For Metadata mode, select KRaft.
The AWS Glue Data Catalog now enhances managed table optimization of Apache Iceberg tables by automatically removing data files that are no longer needed. Iceberg creates a new version called a snapshot for every change to the data in the table. The customer table data and metadata are stored in the S3 bucket.
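For context, the equivalent manual operations in Iceberg are the `expire_snapshots` and `remove_orphan_files` Spark procedures, which is roughly what the managed optimizer automates. The sketch below assumes the Iceberg-enabled `spark` session from the earlier sketch; the catalog and table names are placeholders.

```python
# Hedged sketch of manual Iceberg maintenance; "spark" is an Iceberg-enabled
# SparkSession (see the earlier sketch) and names are placeholders.

# Expire snapshots older than a cutoff, keeping the 5 most recent; the data
# files referenced only by expired snapshots become eligible for deletion.
spark.sql("""
    CALL demo.system.expire_snapshots(
        table => 'analytics_db.customers',
        older_than => TIMESTAMP '2024-01-01 00:00:00',
        retain_last => 5
    )
""")

# Remove files in the table location that no snapshot references at all.
spark.sql("CALL demo.system.remove_orphan_files(table => 'analytics_db.customers')")
```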
But most important of all, the assumed dormant value in the unstructured data is a question mark, which can only be answered after these sophisticated techniques have been applied. Therefore, there is a need to be able to analyze and extract value from the data economically and flexibly. The solution integrates data in three tiers.
Let’s briefly describe the capabilities of the AWS services we referred to above: AWS Glue is a fully managed, serverless, and scalable extract, transform, and load (ETL) service that simplifies the process of discovering, preparing, and loading data for analytics. Amazon Athena is used to query and explore the data.
This approach streamlines data access while ensuring proper governance. To learn more about working with events using EventBridge, refer to Events via Amazon EventBridge default bus. After you create the asset, you can add glossaries or metadata forms, but it's not necessary for this post. Enter a name for the asset.
Understanding the data governance trends for the year ahead will give business leaders and data professionals a competitive edge … Happy New Year! Regulatory compliance and data breaches have driven the data governance narrative during the past few years.
It shows a call center streaming data source that sends the latest call center feed every 15 seconds. The second streaming data source constitutes metadata information about the call center organization and agents that gets refreshed throughout the day. We use two datasets in this post.
In our previous post Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes, we discussed how you can implement solutions to improve operational efficiencies of your Amazon Simple Storage Service (Amazon S3) data lake that is using the Apache Iceberg open table format and running on the Amazon EMR big data platform.
Metadata management performs a critical role within the modern data management stack. It helps break down data silos, and empowers data and analytics teams to better understand the context and quality of data. This, in turn, builds trust in data and the decision-making to follow. Improve data discovery.
Benchmark setup In our testing, we used the 3 TB dataset stored in Amazon S3 in compressed Parquet format, and metadata for databases and tables is stored in the AWS Glue Data Catalog. This benchmark uses the unmodified TPC-DS data schema and table relationships. He has been focusing on the big data analytics space since 2014.
The Business Application Research Center (BARC) warns that data governance is a highly complex, ongoing program, not a “big bang initiative,” and it runs the risk of participants losing trust and interest over time. The program must introduce and support standardization of enterprise data.
This approach simplifies your data journey and helps you meet your security requirements. The SageMaker Lakehouse data connection testing capability boosts your confidence in established connections. To learn more, refer to Amazon SageMaker Unified Studio.
They chose AWS Glue as their preferred data integration tool due to its serverless nature, low maintenance, ability to control compute resources in advance, and scale when needed. To share the datasets, they needed a way to share access to the data and access to catalog metadata in the form of tables and views.
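One possible mechanism for sharing access to catalog metadata is a Lake Formation grant on the catalog table. The boto3 sketch below is only one way to wire this up; the account IDs, database, and table names are placeholders.

```python
import boto3

# Hedged sketch: grant a consumer account SELECT/DESCRIBE on a producer's
# Glue Data Catalog table via Lake Formation. All IDs and names are
# placeholder values.
lf = boto3.client("lakeformation")

lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "111122223333"},  # consumer account
    Resource={
        "Table": {
            "CatalogId": "444455556666",      # producer account's catalog
            "DatabaseName": "analytics_db",
            "Name": "customers",
        }
    },
    Permissions=["SELECT", "DESCRIBE"],
)
```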
aoss:UpdateSecurityConfig – Modify a given SAML provider configuration, including the XML metadata. For more details, refer to Considerations. Copy the IdP metadata by choosing the link on the Realm settings page. You need this metadata when you create the SAML provider in OpenSearch Serverless. Choose Create SAML provider.
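A minimal boto3 sketch of that creation step might look like the following, assuming the IdP metadata XML was saved to a local file; the config name, file name, and attribute values are placeholders for your identity provider's values.

```python
import boto3

# Hedged sketch: create the SAML provider (security config) in OpenSearch
# Serverless from the IdP metadata XML copied above.
aoss = boto3.client("opensearchserverless")

with open("idp-metadata.xml") as f:      # placeholder file with the IdP XML
    idp_metadata = f.read()

aoss.create_security_config(
    name="my-saml-provider",             # placeholder config name
    type="saml",
    samlOptions={
        "metadata": idp_metadata,        # XML exported from the IdP realm
        "groupAttribute": "groups",      # assumed attribute carrying group names
        "sessionTimeout": 60,            # session timeout in minutes
    },
)
```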
These formats enable ACID (atomicity, consistency, isolation, durability) transactions, upserts, and deletes, and advanced features such as time travel and snapshots that were previously only available in data warehouses. For more information, refer to Amazon S3: Allows read and write access to objects in an S3 Bucket.
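As a hedged illustration of time travel, Iceberg's Spark SQL syntax lets you query a table as it existed at a past timestamp or at a specific snapshot. This reuses the Iceberg-enabled `spark` session from the earlier sketch; the table name, timestamp, and snapshot ID are placeholders.

```python
# Query the table as of a past point in time...
spark.sql("""
    SELECT * FROM demo.analytics_db.web_logs
    TIMESTAMP AS OF '2024-06-01 00:00:00'
""").show()

# ...or as of a specific snapshot ID recorded in the table metadata.
spark.sql("""
    SELECT * FROM demo.analytics_db.web_logs
    VERSION AS OF 4358109269898121000
""").show()
```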
Backup and restore architecture The backup and restore strategy involves periodically backing up Amazon MWAA metadata to Amazon Simple Storage Service (Amazon S3) buckets in the primary Region. Refer to the detailed deployment steps in the README file to deploy it in your own accounts. The steps are as follows: [1.a]
Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a managed orchestration service for Apache Airflow that you can use to set up and operate data pipelines in the cloud at scale. Apache Airflow is an open source tool used to programmatically author, schedule, and monitor sequences of processes and tasks, referred to as workflows.
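A minimal Airflow 2.x DAG sketch illustrates what such a workflow looks like; the DAG ID, schedule, and task commands are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hedged sketch of a workflow: a DAG of tasks that Airflow (or Amazon MWAA)
# schedules and monitors. The bash commands are stand-ins for real work.
with DAG(
    dag_id="daily_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    load = BashOperator(task_id="load", bash_command="echo load")

    extract >> load  # load runs only after extract succeeds
```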