In modern data architectures, Apache Iceberg has emerged as a popular table format for data lakes, offering key features including ACID transactions and concurrent write support. We also cover this pattern with automatic compaction through AWS Glue Data Catalog table optimization, which rewrites small data files and generates new metadata files.
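Enabling that automatic compaction programmatically looks roughly like the following sketch, which assumes the Glue table optimizer API available in recent boto3 versions; the account ID, database, table, and role names are all hypothetical.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Sketch: turn on automatic compaction for an existing Iceberg table
# registered in the Glue Data Catalog. Glue periodically rewrites small
# data files and commits new snapshots (and metadata files) as it goes.
glue.create_table_optimizer(
    CatalogId="123456789012",        # hypothetical AWS account ID
    DatabaseName="analytics_db",     # hypothetical database
    TableName="orders_iceberg",      # hypothetical Iceberg table
    Type="compaction",
    TableOptimizerConfiguration={
        # Role Glue assumes to rewrite files; it needs read/write access
        # to the table's S3 location and to the Data Catalog.
        "roleArn": "arn:aws:iam::123456789012:role/GlueOptimizerRole",
        "enabled": True,
    },
)
```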
We also examine how centralized, hybrid, and decentralized data architectures support scalable, trustworthy ecosystems. As data-centric AI, automated metadata management, and privacy-aware data sharing mature, the opportunity to embed data quality into the enterprise's core has never been more significant.
This post describes how HPE Aruba automated their supply chain management pipeline, and re-architected and deployed their data solution by adopting a modern data architecture on AWS. Each file arrives as a pair with a tail metadata file in CSV format containing the size and name of the file.
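As an illustration of that arrival contract, here is a minimal sketch (hypothetical bucket, key, and column names, not the post's actual code) that reads a tail metadata CSV and verifies the paired data file landed with the expected size:

```python
import csv
import io

import boto3

s3 = boto3.client("s3")

def validate_pair(bucket: str, tail_key: str) -> bool:
    """Read a tail metadata CSV (columns assumed: file_name, size_bytes)
    and confirm the paired data file exists with the expected size."""
    body = s3.get_object(Bucket=bucket, Key=tail_key)["Body"].read().decode("utf-8")
    row = next(csv.DictReader(io.StringIO(body)))  # single-row tail file assumed
    head = s3.head_object(Bucket=bucket, Key=row["file_name"])
    return head["ContentLength"] == int(row["size_bytes"])
```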
For container terminal operators, data-driven decision-making and efficient data sharing are vital to optimizing operations and boosting supply chain efficiency. From here, the metadata is published to Amazon DataZone by using AWS Glue Data Catalog. This process is shown in the following figure.
To improve the way they model and manage risk, institutions must modernize their data management and data governance practices. Implementing a modern data architecture makes it possible for financial institutions to break down legacy data silos, simplifying data management, governance, and integration, and driving down costs.
Then there's unstructured data with no contextual framework to govern data flows across the enterprise, not to mention time-consuming manual data preparation and limited views of data lineage. Here's why data modeling is so critical to data governance. erwin Data Modeler: Where the Magic Happens.
Data architecture is a complex and varied field, and different organizations and industries have unique needs when it comes to their data architects. Solutions data architect: These individuals design and implement data solutions for specific business needs, including data warehouses, data marts, and data lakes.
Amazon SageMaker Lakehouse provides an open data architecture that reduces data silos and unifies data across Amazon Simple Storage Service (Amazon S3) data lakes, Redshift data warehouses, and third-party and federated data sources.
Cloudinary is a cloud-based media management platform that provides a comprehensive set of tools and services for managing, optimizing, and delivering images, videos, and other media assets on websites and mobile applications. This concept makes Iceberg extremely versatile.
Over the past decade, the successful deployment of large-scale data platforms at our customers has acted as a big data flywheel, driving demand to bring in even more data, apply more sophisticated analytics, and onboard many new data practitioners, from business analysts to data scientists.
Several factors determine the quality of your enterprise data: accuracy, completeness, and consistency, to name a few. But there's another factor of data quality that doesn't get the recognition it deserves: your data architecture. How the right data architecture improves data quality.
This solution only replicates metadata in the Data Catalog, not the actual underlying data. To have a redundant data lake using Lake Formation and AWS Glue in an additional Region, we recommend replicating the Amazon S3-based storage using S3 replication, S3 sync, aws-s3-copy-sync-using-batch, or the S3 Batch replication process.
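For the S3 replication option, a minimal sketch of enabling bucket-level replication with boto3 might look like the following; the bucket names and IAM role are hypothetical, and both buckets must already have versioning enabled.

```python
import boto3

s3 = boto3.client("s3")

# Sketch: replicate a data lake bucket to a bucket in a second Region
# using S3 Replication.
s3.put_bucket_replication(
    Bucket="datalake-primary",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/S3ReplicationRole",
        "Rules": [
            {
                "ID": "replicate-datalake",
                "Status": "Enabled",
                "Priority": 1,
                "Filter": {},  # empty filter = replicate the whole bucket
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": "arn:aws:s3:::datalake-secondary"},
            }
        ],
    },
)
```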
This is part two of a three-part series where we show how to build a data lake on AWS using a modern data architecture. This post shows how to load data from a legacy database (SQL Server) into a transactional data lake (Apache Iceberg) using AWS Glue.
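The heart of such a job is a JDBC read followed by an Iceberg write. The following is a sketch rather than the post's actual job: the connection details, catalog names, and S3 paths are assumptions, and it presumes a Glue 4.0 Spark job with Iceberg support enabled (for example via --datalake-formats=iceberg) and the SQL Server JDBC driver on the classpath.

```python
from pyspark.sql import SparkSession

# Configure an Iceberg catalog backed by the Glue Data Catalog.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://datalake-bucket/warehouse/")
    .getOrCreate()
)

# Read the legacy table over JDBC.
src = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://legacy-host:1433;databaseName=sales")
    .option("dbtable", "dbo.orders")
    .option("user", "etl_user")
    .option("password", "***")  # use AWS Secrets Manager in a real job
    .load()
)

# Write (or replace) the Iceberg table registered in the Glue Data Catalog.
src.writeTo("glue_catalog.analytics_db.orders_iceberg").createOrReplace()
```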
BladeBridge offers a comprehensive suite of tools that automate much of the complex conversion work, allowing organizations to quickly and reliably transition their data analytics capabilities to the scalable Amazon Redshift data warehouse. Amazon Redshift is a fully managed data warehouse service offered by Amazon Web Services (AWS).
The external data catalog can be the AWS Glue Data Catalog, the data catalog that comes with Amazon Athena, or your own Apache Hive metastore. To get the best performance on data lake queries with Redshift, you can use the AWS Glue Data Catalog's column statistics feature to collect statistics on data lake tables.
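Collecting those statistics can be kicked off programmatically. A sketch with hypothetical database, table, and role names, assuming the column statistics task API available in recent boto3 versions:

```python
import boto3

glue = boto3.client("glue")

# Sketch: trigger a column statistics collection run on a data lake
# table so the query optimizer can use the stats.
glue.start_column_statistics_task_run(
    DatabaseName="analytics_db",
    TableName="orders_iceberg",
    Role="arn:aws:iam::123456789012:role/GlueStatsRole",
    SampleSize=100.0,  # percentage of rows to sample
)
```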
Data governance principles According to the Data Governance Institute, eight principles are at the center of all successful data governance and stewardship programs: All participants must have integrity in their dealings with each other. The program must introduce and support standardization of enterprise data.
In fact, we recently announced the integration with our cloud ecosystem, bringing the benefits of Iceberg to enterprises as they make their journey to the public cloud, and as they adopt more converged architectures like the Lakehouse. 1: Multi-function analytics.
Through their unique position in ports, at sea, and on roads, they optimize global cargo flows and create sustainable customer value. Cargotec captures terabytes of IoT telemetry data from their machinery operated by numerous customers across the globe. An AWS Glue job (metadata exporter) runs daily on the source account.
A modern data architecture enables companies to ingest virtually any type of data through automated pipelines into a data lake, which provides highly durable and cost-effective object storage at petabyte or exabyte scale.
While traditional extract, transform, and load (ETL) processes have long been a staple of data integration due to their flexibility, for common use cases such as replication and ingestion they often prove time-consuming, complex, and less adaptable to the fast-changing demands of modern data architectures.
Today, the way businesses use data is much more fluid; data-literate employees use data across hundreds of apps, analyze data for better decision-making, and access data from numerous locations. It uses knowledge graphs, semantics, and AI/ML technology to discover patterns in various types of metadata.
Those decentralization efforts appeared under different monikers through time, e.g., data marts versus data warehousing implementations (a popular architectural debate in the era of structured data), then enterprise-wide data lakes versus smaller, typically BU-specific, "data ponds".
They conveniently store data in a flat architecture that can be queried in aggregate and offer the speed and lower cost required for big data analytics. On the other hand, they don't support transactions or enforce data quality. If only there were a best-of-both-worlds compromise.
BMW Group uses 4,500 AWS Cloud accounts across the entire organization but is faced with the challenge of reducing unnecessary costs, optimizing spend, and having a central place to monitor costs. The ultimate goal is to raise awareness of cloud efficiency and optimize cloud utilization in a cost-effective and sustainable manner.
The Iceberg specification allows seamless table evolution such as schema and partition evolution, and its design is optimized for usage on Amazon S3. Iceberg stores a pointer to the current metadata file, which in turn tracks the table's snapshots and data files. In this post, we use the Yellow taxi public dataset from NYC Taxi & Limousine Commission as our source data.
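One way to see those metadata files at work is to query Iceberg's metadata tables through Athena, which exposes them under a "$" suffix. A sketch with hypothetical database, table, and results-bucket names:

```python
import boto3

athena = boto3.client("athena")

# Sketch: list the data files tracked by an Iceberg table's current
# snapshot via Athena's "$files" metadata table.
resp = athena.start_query_execution(
    QueryString='SELECT * FROM "analytics_db"."yellow_taxi$files" LIMIT 10',
    ResultConfiguration={"OutputLocation": "s3://athena-results-bucket/"},
)
print(resp["QueryExecutionId"])  # poll get_query_execution() for completion
```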
A well-designed data architecture should support business intelligence and analysis, automation, and AI, all of which can help organizations to quickly seize market opportunities, build customer value, drive major efficiencies, and respond to risks such as supply chain disruptions.
It also used device data to develop Lenovo Device Intelligence, which uses AI-driven predictive analytics to help customers understand and proactively prevent and solve potential IT issues. Lenovo Device Intelligence can also help to optimize IT support costs, reduce employee downtime, and improve the user experience, the company says.
Architecture overview The following diagram illustrates the solution architecture. The solution uses AWS serverless analytics services such as AWS Glue to optimize data layout by partitioning and formatting the server access logs to be consumed by other services.
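A stripped-down sketch of that layout-optimization step, using plain PySpark rather than the post's actual job; the paths are hypothetical and only the timestamp field is parsed here:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# One raw S3 server access log record per line.
logs = spark.read.text("s3://raw-access-logs/")

# Extract the bracketed timestamp (e.g. [06/Feb/2019:00:00:38 +0000])
# and derive a date column to partition on. A real job would parse all fields.
parsed = (
    logs.select(
        F.regexp_extract("value", r"\[([^\]]+)\]", 1).alias("ts_raw"),
        F.col("value").alias("raw_record"),
    )
    .withColumn("dt", F.to_date(F.to_timestamp("ts_raw", "dd/MMM/yyyy:HH:mm:ss Z")))
)

# Partitioning by date keeps downstream scans cheap.
parsed.write.partitionBy("dt").mode("append").parquet("s3://curated-access-logs/")
```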
About the Authors Yuzhu Xiao is a Senior Data Development Engineer at Amber Group with extensive experience in cloud data platform architecture. Xin Zhang is an AWS Solutions Architect, responsible for solution consulting and design based on the AWS Cloud platform.
Amazon Redshift powers data-driven decisions for tens of thousands of customers every day with a fully managed, AI-powered cloud data warehouse, delivering the best price-performance for your analytics workloads. In this session, learn about the new AI-driven scaling and optimization functionality in Redshift Serverless.
At the same time, they need to optimize operational costs to unlock the value of this data for timely insights, and do so with consistent performance. With this massive data growth, data proliferation across your data stores, data warehouse, and data lakes can become equally challenging.
Modernizing analytics for scale, performance, and reliability "Our migration from a legacy on-premises platform to Amazon Redshift allows us to ingest data 88% faster, query data 3x faster, and load daily data to the cloud 6x faster."
Relying on the past for future insights, with data that is outdated due to changing customer preferences, a hyper-competitive world, and the emphasis on environmental, social, and governance factors, produces irrelevant insights and suboptimal returns. Quality data needs to be the normalizing factor.
The new approach would need to offer the flexibility to integrate new technologies such as machine learning (ML), the scalability to handle long-term retention at forecasted growth levels, and options for cost optimization. Athena supports a variety of compression formats for reading and writing data.
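For example, an Athena CTAS statement can rewrite raw data as compressed Parquet. A sketch issued through boto3, with hypothetical table and bucket names:

```python
import boto3

athena = boto3.client("athena")

# Sketch: rewrite a raw table as Snappy-compressed Parquet, one of the
# format/compression combinations Athena supports.
ctas = """
CREATE TABLE analytics_db.events_parquet
WITH (
    format = 'PARQUET',
    write_compression = 'SNAPPY',
    external_location = 's3://curated-bucket/events_parquet/'
) AS
SELECT * FROM analytics_db.events_raw
"""

athena.start_query_execution(
    QueryString=ctas,
    ResultConfiguration={"OutputLocation": "s3://athena-results-bucket/"},
)
```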
To create and manage the data products, smava uses Amazon Redshift , a cloud data warehouse. In this post, we show how smava optimized their data platform by using Amazon Redshift Serverless and Amazon Redshift data sharing to overcome right-sizing challenges for unpredictable workloads and further improve price-performance.
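On the SQL side, Redshift data sharing is set up with a few DDL statements on the producer. A sketch using the Redshift Data API; the workgroup, schema, and consumer namespace GUID are hypothetical:

```python
import boto3

rsd = boto3.client("redshift-data")

# Sketch: create a datashare, add a schema and its tables, and grant it
# to a consumer namespace.
statements = [
    "CREATE DATASHARE sales_share",
    "ALTER DATASHARE sales_share ADD SCHEMA sales",
    "ALTER DATASHARE sales_share ADD ALL TABLES IN SCHEMA sales",
    "GRANT USAGE ON DATASHARE sales_share "
    "TO NAMESPACE 'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee'",
]

for sql in statements:
    rsd.execute_statement(
        WorkgroupName="producer-workgroup",  # hypothetical Serverless workgroup
        Database="dev",
        Sql=sql,
    )
```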
It is a replicated, highly available service that is responsible for managing the metadata for all objects stored in Ozone. As Ozone scales to exabytes of data, it is important to ensure that Ozone Manager can perform at scale. Cisco has multiple reference architectures for running Ozone.
With data becoming the driving force behind many industries today, having a modern data architecture is pivotal for organizations to be successful. The following diagram illustrates the solution architecture. This post is co-written with Eliad Gat and Oded Lifshiz from Orca Security. Orca addressed this in several ways.
These inputs reinforced the need for a unified data strategy across the FinOps teams. We decided to build a scalable data management product based on the best practices of modern data architecture. Our source system and domain teams were mapped as data producers, and they would have ownership of the datasets.
Data has become an invaluable asset for businesses, offering critical insights to drive strategic decision-making and operational optimization. The business end-users were given a tool to discover data assets produced within the mesh and seamlessly self-serve on their data sharing needs.
To meet this need, AWS offers Amazon Kinesis Data Streams , a powerful and scalable real-time data streaming service. With Kinesis Data Streams, you can effortlessly collect, process, and analyze streaming data in real time at any scale. This optimization is achieved by storing just the URL within Kinesis Data Streams.
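A minimal sketch of that pattern (often called a claim check): park the large payload in S3 and put only its URL on the stream. All names here are hypothetical.

```python
import json
import uuid

import boto3

s3 = boto3.client("s3")
kinesis = boto3.client("kinesis")

def publish_large_event(payload: bytes, stream: str, bucket: str) -> None:
    """Store the payload in S3, then put just a reference on the stream."""
    key = f"events/{uuid.uuid4()}.bin"
    s3.put_object(Bucket=bucket, Key=key, Body=payload)

    record = {"payload_url": f"s3://{bucket}/{key}"}
    kinesis.put_record(
        StreamName=stream,
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=key,  # spreads records across shards
    )
```

Consumers then resolve the URL and fetch the payload from S3, keeping each Kinesis record small regardless of payload size.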
Remote runtime data integration as-a-service execution capabilities for on-premises and multi-cloud execution. Multi-directional data movement topology with high-volume and low-latency integration. Support for data governance. Metadata exchange with third-party metadata management and governance tools.
The RDV organizes data into three key types of tables: Hubs – This type of table represents a core business entity such as a customer. Each record in a hub table is paired with metadata that identifies the record's creation time, originating source system, and unique business key.
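To make the hub structure concrete, here is an illustrative sketch of a customer hub record in Python; the field names follow common Data Vault conventions rather than anything prescribed in the post.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class CustomerHubRecord:
    """One row of a hypothetical 'customer' hub table."""
    hub_customer_key: str  # hash of the business key, used for joins
    customer_id: str       # the unique business key itself
    load_dts: datetime     # metadata: when the record was created
    record_source: str     # metadata: originating source system

def make_hub_record(customer_id: str, source: str) -> CustomerHubRecord:
    key = hashlib.md5(customer_id.encode("utf-8")).hexdigest()
    return CustomerHubRecord(key, customer_id, datetime.now(timezone.utc), source)
```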
Queue priorities needed to be reconfigured for optimal performance. Transition from Navigator by migrating the business metadata (tags, entity names, custom properties, descriptions) and technical metadata (Hive, Spark, HDFS, Impala) to Atlas. Sentry Hive/HDFS ACL sync is not included in CDP-DC 7.1 (on the roadmap).
These topics include federation with the Swisscom identity provider (IdP), JDBC connections, detective controls using AWS Config rules and remediation actions, cost optimization using the Redshift scheduler, and audit logging. The following high-level architecture diagram shows ODP with different layers of the modern data architecture.
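As an illustration of the scheduler piece, a pause/resume scheduled action for a provisioned Redshift cluster can be created as sketched below; the cluster identifier, role, and cron expression are hypothetical.

```python
import boto3

redshift = boto3.client("redshift")

# Sketch: pause a provisioned cluster every weekday evening to cut costs;
# a matching ResumeCluster action would bring it back each morning.
redshift.create_scheduled_action(
    ScheduledActionName="pause-odp-nightly",
    TargetAction={"PauseCluster": {"ClusterIdentifier": "odp-cluster"}},
    Schedule="cron(0 20 ? * MON-FRI *)",
    IamRole="arn:aws:iam::123456789012:role/RedshiftSchedulerRole",
    Enable=True,
)
```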