Introduction: AWS Glue helps data engineers prepare data for other data consumers through the extract, transform, and load (ETL) process. The managed service offers a simple and cost-effective method of categorizing and managing big data in an enterprise. It provides organizations with […].
However, commits can still fail if the latest metadata is updated after the base metadata version is established. Iceberg's concurrency model and conflict types: Before diving into specific implementation patterns, it's essential to understand how Iceberg manages concurrent writes through its table architecture and transaction model.
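To make that model concrete, here is a minimal sketch in plain Python (not the actual Iceberg API) of the optimistic compare-and-swap commit loop described above: a writer records the base metadata version, prepares its changes, and retries if another commit advanced the pointer in the meantime.

```python
# Illustrative sketch of Iceberg-style optimistic concurrency; names are hypothetical.
import threading

class TableMetadataPointer:
    """Stands in for the catalog's pointer to the current table metadata file."""
    def __init__(self):
        self.version = 0
        self._lock = threading.Lock()

    def compare_and_swap(self, expected_version: int) -> bool:
        """Advance the pointer only if no other writer committed first."""
        with self._lock:
            if self.version != expected_version:
                return False  # conflict: the base metadata is stale
            self.version += 1
            return True

def commit_with_retries(pointer: TableMetadataPointer, max_attempts: int = 5) -> int:
    for _ in range(max_attempts):
        base = pointer.version            # establish the base metadata version
        # ... write new data and metadata files against `base` here ...
        if pointer.compare_and_swap(base):
            return pointer.version        # commit succeeded
        # another writer won the race; re-read metadata and retry
    raise RuntimeError("commit failed after retries")
```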
Amazon Athena provides an interactive analytics service for analyzing data in Amazon Simple Storage Service (Amazon S3). Amazon Redshift is used to analyze structured and semi-structured data across data warehouses, operational databases, and data lakes. Table metadata is fetched from AWS Glue.
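As an illustration of that flow, the following boto3 sketch runs an Athena query against an S3-backed table whose schema lives in the Glue Data Catalog; the database, table, region, and results bucket names are hypothetical.

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Hypothetical database/table/bucket names for illustration.
response = athena.start_query_execution(
    QueryString="SELECT * FROM sales LIMIT 10",
    QueryExecutionContext={"Database": "my_glue_database"},  # schema resolved via the Glue Data Catalog
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)
print(response["QueryExecutionId"])
```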
The landscape of big data management has been transformed by the rising popularity of open table formats such as Apache Iceberg, Apache Hudi, and Linux Foundation Delta Lake. These formats, designed to address the limitations of traditional data storage systems, have become essential in modern data architectures.
With all the data in and around the enterprise, users would say that they have a lot of information but need more insights to assist them in producing better and more informative content. This is where we dispel an old “big data” notion (heard a decade ago) that was expressed like this: “we need our data to run at the speed of business.”
The permission mechanism has to be secure, built on top of built-in security features, and scalable for manageability when the user base scales out. In this post, we show you how to manage user access to enterprise documents in generative AI-powered tools according to the access you assign to each persona.
In this post, we focus on data management implementation options such as accessing data directly in Amazon Simple Storage Service (Amazon S3), using popular data formats like Parquet, or using open table formats like Iceberg. Data management is the foundation of quantitative research.
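For example, reading a Parquet dataset directly from Amazon S3 can be as short as the following sketch; the bucket path is hypothetical, and reading from s3:// paths with pandas assumes s3fs (or a compatible filesystem layer) is installed.

```python
import pandas as pd

# Hypothetical S3 path; pandas delegates s3:// access to s3fs/pyarrow.
df = pd.read_parquet("s3://my-research-bucket/prices/")
print(df.head())
```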
Whether you're a data analyst seeking a specific metric or a data steward validating metadata compliance, this update delivers a more precise, governed, and intuitive search experience. Start using this enhanced search capability today and experience the difference it brings to your data discovery journey.
In the era of big data, data lakes have emerged as a cornerstone for storing vast amounts of raw data in its native format. They support structured, semi-structured, and unstructured data, offering a flexible and scalable environment for data ingestion from multiple sources.
Customer relationship management (CRM) platforms are very reliant on big data. As these platforms become more widely used, some of the data resources they depend on become more stretched. CRM providers need to find ways to address the technical debt problem they are facing through new big data initiatives.
It is appealing to migrate from self-managed OpenSearch and Elasticsearch clusters on legacy versions to Amazon OpenSearch Service to enjoy the ease of use, native integration with AWS services, and rich features from the open-source environment (OpenSearch is now part of the Linux Foundation).
Their terminal operations rely heavily on seamless data flows and the management of vast volumes of data. With the addition of these technologies alongside existing systems like terminal operating systems (TOS) and SAP, the number of data producers has grown substantially. This process is shown in the following figure.
Open table formats are emerging in the rapidly evolving domain of big data management, fundamentally altering the landscape of data storage and analysis. Their ability to resolve critical issues such as data consistency, query efficiency, and governance renders them indispensable for data-driven organizations.
1) What Is Data Quality Management? 4) Data Quality Best Practices. 5) How Do You Measure Data Quality? 6) Data Quality Metrics Examples. 7) Data Quality Control: Use Case. 8) The Consequences Of Bad Data Quality. 9) 3 Sources Of Low-Quality Data. 10) Data Quality Solutions: Key Attributes.
Recognizing this paradigm shift, ANZ Institutional Division has embarked on a transformative journey to redefine its approach to data management and utilization, and to extract significant business value from data insights. This enables global discoverability and collaboration without centralizing ownership or operations.
Amazon Redshift is a fully managed, AI-powered cloud data warehouse that delivers the best price-performance for your analytics workloads at any scale. It enables you to get insights faster without extensive knowledge of your organization’s complex database schema and metadata. Your data is not shared across accounts.
Additionally, we show how to use AWS AI/ML services for analyzing unstructured data. Why it’s challenging to process and manage unstructured data: Unstructured data makes up a large proportion of the data in the enterprise and can’t be stored in traditional relational database management systems (RDBMS).
Amazon OpenSearch Service is a fully managed service for search and analytics. It allows organizations to secure data, perform searches, analyze logs, monitor applications in real time, and explore interactive log analytics. You can use an existing domain or create a new domain. Make sure the Python version is later than 2.7.0:
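The code block for that check is not reproduced in the excerpt; a minimal version gate might look like this:

```python
import sys

# Fail fast if the interpreter is older than the minimum the post calls for.
assert sys.version_info >= (2, 7, 0), "Python 2.7.0+ required, found %s" % sys.version
```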
We’re excited to announce a new feature in Amazon DataZone that offers enhanced metadata governance for your subscription approval process. With this update, domain owners can define and enforce metadata requirements for data consumers when they request access to data assets.
To achieve this, they aimed to break down data silos and centralize data from various business units and countries into the BMW Cloud Data Hub (CDH). However, the initial version of CDH supported only coarse-grained access control to entire data assets, and hence it was not possible to scope access to data asset subsets.
The construction of big data applications based on open source software has become increasingly straightforward since the advent of projects like Data on EKS, an open source project from AWS that provides blueprints for building data and machine learning (ML) applications on Amazon Elastic Kubernetes Service (Amazon EKS).
What is zero-ETL? Zero-ETL is a set of fully managed integrations by AWS that minimizes the need to build ETL data pipelines and provides service-managed replication. By directly integrating with Lakehouse, all the data is automatically cataloged and can be secured through fine-grained permissions in Lake Formation.
Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a managed Apache Airflow service used to extract business insights across an organization by combining, enriching, and transforming data through a series of tasks called a workflow. This approach offers greater flexibility and control over workflow management.
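As a small illustration of the workflow concept, here is a minimal Airflow DAG chaining two tasks; the DAG id and task bodies are placeholders, and the imports assume Airflow 2.4 or later.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull data from the source system")

def transform():
    print("enrich and reshape the data")

with DAG(
    dag_id="example_etl_workflow",   # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    extract_task >> transform_task   # tasks chained into a workflow
```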
In this post, we focus on the use case for centralizing log aggregation for an organization that has a compliance need to archive and retain its log data. Kinesis Data Streams is a fully managed, serverless data streaming service that stores and ingests various streaming data in real time at any scale.
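For illustration, publishing a single log event to a stream with boto3 looks like the following sketch; the stream name and payload are hypothetical.

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# Hypothetical stream name; one log event published as a record.
kinesis.put_record(
    StreamName="central-log-stream",
    Data=json.dumps({"service": "checkout", "level": "INFO", "msg": "order placed"}).encode("utf-8"),
    PartitionKey="checkout",  # records with the same key land on the same shard
)
```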
It encompasses the people, processes, and technologies required to manage and protect data assets. The Data Management Association (DAMA) International defines it as the “planning, oversight, and control over management of data and the use of data and data-related sources.”
The key to success is to start enhancing and augmenting content management systems (CMS) with additional features: semantic content and context. This is accomplished through tags, annotations, and metadata (TAM). TAM management, like content management, begins with business strategy. Collect, curate, and catalog (i.e.,
Organizations with legacy, on-premises, near-real-time analytics solutions typically rely on self-managed relational databases as their data store for analytics workloads. Near-real-time streaming analytics captures the value of operational data and metrics to provide new insights to create business opportunities.
From Talent Acquisition to Talent Management and talent insights, Eightfold offers a single AI platform that does it all. It delivers analytics and enhanced insights about the customer’s Talent Acquisition, Talent Management pipelines, and much more. Customers can also implement their own custom dashboards in QuickSight.
Let’s briefly describe the capabilities of the AWS services referred to above: AWS Glue is a fully managed, serverless, and scalable extract, transform, and load (ETL) service that simplifies the process of discovering, preparing, and loading data for analytics. As stated earlier, the first step involves data ingestion.
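A skeleton Glue job (PySpark) showing that ingestion step might look like the following; the catalog database, table, and output bucket are hypothetical placeholders.

```python
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Hypothetical catalog database/table; read from the catalog, write as Parquet.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="events"
)
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-curated-bucket/events/"},
    format="parquet",
)
job.commit()
```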
Ali Tore, Senior Vice President of Advanced Analytics at Salesforce, highlighting the value of this integration, says, “We’re excited to partner with Amazon to bring Tableau’s powerful data exploration and AI-driven analytics capabilities to customers managing data across organizational boundaries with Amazon DataZone.”
For each service, you need to learn the supported authorization and authentication methods, data access APIs, and framework to onboard and test data sources. This fragmented, repetitive, and error-prone experience for data connectivity is a significant obstacle to data integration, analysis, and machine learning (ML) initiatives.
SageMaker brings together widely adopted AWS ML and analytics capabilities—virtually all of the components you need for data exploration, preparation, and integration; petabyte-scale big data processing; fast SQL analytics; model development and training; governance; and generative AI development.
We have enhanced data sharing performance with improved metadata handling, resulting in first query execution over shared data that is up to four times faster when the data sharing producer’s data is being updated. You can also create new data lake tables using Redshift Managed Storage (RMS) as a native storage option.
Kinesis Data Streams not only offers the flexibility to use many out-of-box integrations to process the data published to the streams, but also provides the capability to build custom stream processing applications that can be deployed on your compute fleet. This is where Kinesis Client Library (KCL) comes in.
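KCL itself handles lease coordination, checkpointing, and worker failover; as a simplified illustration of the raw consumption step it automates, here is a single-shard polling loop using plain boto3 (not KCL), with a hypothetical stream name.

```python
import time
import boto3

kinesis = boto3.client("kinesis")
stream = "central-log-stream"  # hypothetical stream name

# Read from the first shard only; KCL would distribute shards across workers.
shard_id = kinesis.describe_stream(StreamName=stream)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=stream, ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
)["ShardIterator"]

while iterator:
    result = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in result["Records"]:
        print(record["Data"])           # raw bytes of each published record
    iterator = result.get("NextShardIterator")
    time.sleep(1)                       # stay under per-shard read throughput limits
```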
As organizations increasingly adopt cloud-based solutions and centralized identity management, the need for seamless and secure access to data warehouses like Amazon Redshift becomes crucial. Federation allows users to access the AWS Management Console, and from there, the user can access the Redshift Query Editor V2.
Organizations are adopting Apache Kafka and Amazon Managed Streaming for Apache Kafka (Amazon MSK) to capture and analyze data in real time. Since its inception, Apache Kafka has depended on Apache ZooKeeper for storing and replicating the metadata of Kafka brokers and topics. Starting from Apache Kafka version 3.3, KRaft, Kafka's built-in Raft-based metadata quorum, is production ready and removes the ZooKeeper dependency.
Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a fully managed service that builds upon Apache Airflow, offering its benefits while eliminating the need for you to set up, operate, and maintain the underlying infrastructure, reducing operational overhead while increasing security and resilience.
BladeBridge offers a comprehensive suite of tools that automate much of the complex conversion work, allowing organizations to quickly and reliably transition their data analytics capabilities to the scalable Amazon Redshift data warehouse. Amazon Redshift is a fully managed data warehouse service offered by Amazon Web Services (AWS).
Amazon SageMaker Lakehouse now supports attribute-based access control (ABAC) with AWS Lake Formation , using AWS Identity and Access Management (IAM) principals and session tags to simplify data access, grant creation, and maintenance.
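A sketch of the session-tag side of that flow with boto3: assuming a hypothetical role and tag, the tag is attached at AssumeRole time so an ABAC grant can match on it rather than on individual principals.

```python
import boto3

sts = boto3.client("sts")

# Hypothetical role ARN and tag values; the session tag travels with the
# credentials, so ABAC grants can reference it instead of per-user grants.
creds = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/AnalystRole",
    RoleSessionName="analyst-session",
    Tags=[{"Key": "team", "Value": "quant-research"}],
)["Credentials"]

# Subsequent data access uses the tagged session's temporary credentials.
session = boto3.Session(
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
```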
Organizational data is often fragmented across multiple lines of business, leading to inconsistent and sometimes duplicate datasets. This fragmentation can delay decision-making and erode trust in available data. This approach streamlines data access while ensuring proper governance. Enter a name for the asset.
Amazon MSK Connect is a fully managed service for Apache Kafka Connect. With a few clicks, MSK Connect allows you to deploy connectors that move data between Apache Kafka and external systems. Together, these new capabilities make it straightforward to manage your MSK Connect resources and automate deployments through CI/CD pipelines.
The Zurich Cyber Fusion Center management team faced similar challenges, such as balancing ingestion licensing costs against long-term retention requirements for both business application logs and security logs within the existing SIEM architecture. Previously, P2 logs were ingested into the SIEM.
The AWS Glue Data Catalog now enhances managed table optimization of Apache Iceberg tables by automatically removing data files that are no longer needed. Iceberg creates a new version called a snapshot for every change to the data in the table. The customer table data and metadata are stored in the S3 bucket.
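The managed optimization automates this cleanup; the manual equivalent is Iceberg's built-in expire_snapshots Spark procedure, sketched below with hypothetical catalog, database, and table names.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Expire old snapshots so their unreferenced data files can be removed;
# glue_catalog, analytics_db, and customer are placeholder names.
spark.sql("""
    CALL glue_catalog.system.expire_snapshots(
        table => 'analytics_db.customer',
        older_than => TIMESTAMP '2024-01-01 00:00:00',
        retain_last => 5
    )
""")
```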
Apache Iceberg offers integrations with popular data processing frameworks such as Apache Spark, Apache Flink, Apache Hive, Presto, and more. AWS provides integrations for various AWS services with Iceberg tables as well, including AWS Glue Data Catalog for tracking table metadata.
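For instance, a Spark session can be pointed at the Glue Data Catalog using Iceberg's documented catalog properties; the warehouse bucket below is a placeholder, and the session assumes the iceberg-spark-runtime and AWS bundle jars are on the classpath.

```python
from pyspark.sql import SparkSession

# Register an Iceberg catalog named glue_catalog backed by AWS Glue and S3.
spark = (
    SparkSession.builder
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://my-lake-bucket/warehouse/")
    .getOrCreate()
)

spark.sql("SELECT * FROM glue_catalog.analytics_db.customer LIMIT 10").show()
```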