Data Architecture, Data Processing and Metadata

Data Architecture

Data Processing

Metadata

Modernize your legacy databases with AWS data lakes, Part 2: Build a data lake using AWS DMS data on Apache Iceberg

AWS Big Data

OCTOBER 30, 2024

This is part two of a three-part series where we show how to build a data lake on AWS using a modern data architecture. This post shows how to load data from a legacy database (SQL Server) into a transactional data lake ( Apache Iceberg ) using AWS Glue. To start the job, choose Run. format(dbname)).config("spark.sql.catalog.glue_catalog.catalog-impl",

Data Lake

Data Lake Data Processing Optimization Machine Learning

How EUROGATE established a data mesh architecture using Amazon DataZone

AWS Big Data

JANUARY 15, 2025

Need for a data mesh architecture Because entities in the EUROGATE group generate vast amounts of data from various sourcesacross departments, locations, and technologiesthe traditional centralized data architecture struggles to keep up with the demands for real-time insights, agility, and scalability.

IoT

IoT Machine Learning Metadata Data-driven

Join 42,000+

Insiders

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Webinars

Automation, Evolved: Your New Playbook for Smarter Knowledge Work

Data Talks, CFOs Listen: Why Analytics Are Key To Better Spend Management

Mastering Apache Airflow® 3.0: What’s New (and What’s Next) for Data Orchestration

MORE WEBINARS

Trending Sources

The Struggle Between Data Dark Ages and LLM Accuracy

Cloudera

DECEMBER 6, 2024

The AI Forecast: Data and AI in the Cloud Era , sponsored by Cloudera, aims to take an objective look at the impact of AI on business, industry, and the world at large. It could be metadata that you weren’t capturing before. And the value of the 10% is as much as the 85% and as much as the next 5% to get to 95%.

Manufacturing

Manufacturing Forecasting Metadata Data Processing

Webinars

Automation, Evolved: Your New Playbook for Smarter Knowledge Work

Data Talks, CFOs Listen: Why Analytics Are Key To Better Spend Management

Mastering Apache Airflow® 3.0: What’s New (and What’s Next) for Data Orchestration

MORE WEBINARS

The Top Three Entangled Trends in Data Architectures: Data Mesh, Data Fabric, and Hybrid Architectures

Cloudera

SEPTEMBER 29, 2022

Each of these trends claim to be complete models for their data architectures to solve the “everything everywhere all at once” problem. Data teams are confused as to whether they should get on the bandwagon of just one of these trends or pick a combination. First, we describe how data mesh and data fabric could be related.

Data Architecture

Data Architecture Data Warehouse Metadata Sales

Design a data mesh pattern for Amazon EMR-based data lakes using AWS Lake Formation with Hive metastore federation

AWS Big Data

JUNE 10, 2024

Organizations have multiple Hive data warehouses across EMR clusters, where the metadata gets generated. To address this challenge, organizations can deploy a data mesh using AWS Lake Formation that connects the multiple EMR clusters. An entity can act both as a producer of data assets and as a consumer of data assets.

Data Lake

Data Lake Metadata Data Warehouse Data Processing

Migrate an existing data lake to a transactional data lake using Apache Iceberg

AWS Big Data

OCTOBER 3, 2023

Over the years, data lakes on Amazon Simple Storage Service (Amazon S3) have become the default repository for enterprise data and are a common choice for a large set of users who query data for a variety of analytics and machine leaning use cases. Analytics use cases on data lakes are always evolving. Choose ETL Jobs.

Data Lake

Data Lake Metadata Snapshot Recreation/Entertainment

How Cargotec uses metadata replication to enable cross-account data sharing

AWS Big Data

JUNE 7, 2023

They chose AWS Glue as their preferred data integration tool due to its serverless nature, low maintenance, ability to control compute resources in advance, and scale when needed. To share the datasets, they needed a way to share access to the data and access to catalog metadata in the form of tables and views.

Metadata

Metadata Data Lake Machine Learning Big Data

How Cloudera Data Flow Enables Successful Data Mesh Architectures

Cloudera

OCTOBER 7, 2021

Those decentralization efforts appeared under different monikers through time, e.g., data marts versus data warehousing implementations (a popular architectural debate in the era of structured data) then enterprise-wide data lakes versus smaller, typically BU-Specific, “data ponds”.

Metadata

Metadata Cost-Benefit Enterprise Interactive

Build SAML identity federation for Amazon OpenSearch Service domains within a VPC

AWS Big Data

FEBRUARY 7, 2024

Create an Amazon Route 53 public hosted zone such as mydomain.com to be used for routing internet traffic to your domain. For instructions, refer to Creating a public hosted zone. Request an AWS Certificate Manager (ACM) public certificate for the hosted zone. hosted_zone_id – The Route 53 public hosted zone ID.

Dashboards

Dashboards Data Processing Metadata Consulting

SAP enhances Datasphere and SAC for AI-driven transformation

CIO Business Intelligence

MARCH 6, 2024

SAP announced today a host of new AI copilot and AI governance features for SAP Datasphere and SAP Analytics Cloud (SAC). The company is expanding its partnership with Collibra to integrate Collibra’s AI Governance platform with SAP data assets to facilitate data governance for non-SAP data assets in customer environments. “We

Unstructured Data

Unstructured Data Dashboards Business Intelligence Data Governance

How Amazon Finance Automation built a data mesh to support distributed data ownership and centralize governance

AWS Big Data

JULY 14, 2023

These inputs reinforced the need of a unified data strategy across the FinOps teams. We decided to build a scalable data management product that is based on the best practices of modern data architecture. Our source system and domain teams were mapped as data producers, and they would have ownership of the datasets.

Finance

Finance Metadata Big Data Recreation/Entertainment

HEMA accelerates their data governance journey with Amazon DataZone

AWS Big Data

DECEMBER 19, 2024

HEMA has a bespoke enterprise architecture, built around the concept of services. Each service is hosted in a dedicated AWS account and is built and maintained by a product owner and a development team, as illustrated in the following figure. Tommaso is the Head of Data & Cloud Platforms at HEMA.

Data Governance

Data Governance Publishing Data-driven Metadata

How Swisscom automated Amazon Redshift as part of their One Data Platform solution using AWS CDK – Part 1

AWS Big Data

JUNE 12, 2024

One Data Platform The ODP architecture is based on the AWS Well Architected Framework Analytics Lens and follows the pattern of having raw, standardized, conformed, and enriched layers as described in Modern data architecture. See the following admin user code: admin_secret_kms_key_options = KmsKeyOptions(.

Data Architecture

Data Architecture Cost-Benefit Data-driven Experimentation

Petabyte-scale log analytics with Amazon S3, Amazon OpenSearch Service, and Amazon OpenSearch Ingestion

AWS Big Data

MARCH 7, 2024

At the same time, they need to optimize operational costs to unlock the value of this data for timely insights and do so with a consistent performance. With this massive data growth, data proliferation across your data stores, data warehouse, and data lakes can become equally challenging.

Data Lake

Data Lake Analytics Dashboards Metrics

Boosting Object Storage Performance with Ozone Manager

Cloudera

JULY 19, 2023

It is a replicated, highly-available service that is responsible for managing the metadata for all objects stored in Ozone. As Ozone scales to exabytes of data, it is important to ensure that Ozone Manager can perform at scale. The tool reads only the metadata for objects in a cluster with around 100 million keys.

Management

Management Metadata Metrics Optimization

Habib Bank manages data at scale with Cloudera Data Platform

Cloudera

NOVEMBER 17, 2022

While Cloudera CDH was already a success story at HBL, in 2022, HBL identified the need to move its customer data centre environment from Cloudera’s CDH to Cloudera Data Platform (CDP) Private Cloud to accommodate growing volumes of data. ” Prior to the upgrade, HBL’s 27 node cluster ran on CDH 6.1

Management

Management Data Lake Consulting Unstructured Data

Build a serverless transactional data lake with Apache Iceberg, Amazon EMR Serverless, and Amazon Athena

AWS Big Data

MARCH 10, 2023

Apache Iceberg overview Iceberg is an open-source table format that brings the power of SQL tables to big data files. It enables ACID transactions on tables, allowing for concurrent data ingestion, updates, and queries, all while using familiar SQL. Data can be organized into three different zones, as shown in the following figure.

Data Lake

Data Lake Sales Data Warehouse Snapshot

Design a data mesh on AWS that reflects the envisioned organization

AWS Big Data

JANUARY 22, 2024

Cost and resource efficiency – This is an area where Acast observed a reduction in data duplication, and therefore cost reduction (in some accounts, removing the copy of data 100%), by reading data across accounts while enabling scaling. In this approach, teams responsible for generating data are referred to as producers.

Data-driven

Data-driven Advertising Metadata Data Architecture

How smava makes loans transparent and affordable using Amazon Redshift Serverless

AWS Big Data

DECEMBER 21, 2023

Overview of solution As a data-driven company, smava relies on the AWS Cloud to power their analytics use cases. smava ingests data from various external and internal data sources into a landing stage on the data lake based on Amazon Simple Storage Service (Amazon S3).

Data Lake

Data Lake Data Warehouse Data-driven B2B

KGF 2023: Bikes To The Moon, Datastrophies, Abstract Art And A Knowledge Graph Forum To Embrace Them All

Ontotext

DECEMBER 1, 2023

Atanas Kiryakov presenting at KGF 2023 about Where Shall and Enterprise Start their Knowledge Graph Journey Only data integration through semantic metadata can drive business efficiency as “it’s the glue that turns knowledge graphs into hubs of metadata and content”.

Metadata

Metadata Sales Machine Learning Consulting

Empowering data mesh: The tools to deliver BI excellence

erwin

APRIL 16, 2024

The data mesh framework In the dynamic landscape of data management, the search for agility, scalability, and efficiency has led organizations to explore new, innovative approaches. One such innovation gaining traction is the data mesh framework. This empowers individual teams to own and manage their data.

Metadata

Metadata Data Quality Data Governance Modeling

Orca Security’s journey to a petabyte-scale data lake with Apache Iceberg and AWS Analytics

AWS Big Data

JULY 20, 2023

With data becoming the driving force behind many industries today, having a modern data architecture is pivotal for organizations to be successful. This data is sent to Apache Kafka, which is hosted on Amazon Managed Streaming for Apache Kafka (Amazon MSK).

Data Lake

Data Lake Analytics Snapshot Data Quality

Create an end-to-end data strategy for Customer 360 on AWS

AWS Big Data

MARCH 26, 2024

Profile aggregation – When you’ve uniquely identified a customer, you can build applications in Managed Service for Apache Flink to consolidate all their metadata, from name to interaction history. Then, you transform this data into a concise format.

Data Strategy

Data Strategy Strategy Data Warehouse Prescriptive Analytics

How Zurich Insurance Group built a log management solution on AWS

AWS Big Data

JULY 16, 2024

Priority 2 logs, such as operating system security logs, firewall, identity provider (IdP), email metadata, and AWS CloudTrail , are ingested into Amazon OpenSearch Service to enable the following capabilities. She currently serves as the Global Head of Cyber Data Management at Zurich Group.

Insurance

Insurance Management Cost-Benefit Optimization

CIOs rise to the ESG reporting challenge

CIO Business Intelligence

JANUARY 30, 2024

“Always the gatekeepers of much of the data necessary for ESG reporting, CIOs are finding that companies are even more dependent on them,” says Nancy Mentesana, ESG executive director at Labrador US, a global communications firm focused on corporate disclosure documents. The complexity is at a much higher level.”

Reporting

Reporting Data Quality Strategy Data-driven

Announcing Alation 4.0 with Alation Connect

Alation

FEBRUARY 20, 2020

But it is the contextual usage information of data that is critical to answer data consumers’ and stewards’ questions. Cataloging both data and queries provides insight into. How recently the data was used. How recently the data was updated. What the mapping is of technical metadata to business descriptions.

Metadata

Metadata Enterprise Data Processing Data Architecture

Build incremental data pipelines to load transactional data changes using AWS DMS, Delta 2.0, and Amazon EMR Serverless

AWS Big Data

MARCH 3, 2023

The Delta tables created by the EMR Serverless application are exposed through the AWS Glue Data Catalog and can be queried through Amazon Athena. Solution overview The following diagram shows the overall architecture of the solution that we implement in this post. Monjumi Sarma is a Data Lab Solutions Architect at AWS.

Data Lake

Data Lake Dashboards Metrics Metadata

How Novo Nordisk built distributed data governance and control at scale

AWS Big Data

APRIL 28, 2023

When building a scalable data architecture on AWS, giving autonomy and ownership to the data domains are crucial for the success of the platform. Solution overview In the first post of this series, we explained how Novo Nordisk and AWS Professional Services built a modern data architecture based on data mesh tenets.

Data Governance

Data Governance Management Data-driven Analytics

The Gartner 2022 Leadership Vision for Data and Analytics Leaders Questions and Answers

Andrew White

JANUARY 9, 2022

On Thursday January 6th I hosted Gartner’s 2022 Leadership Vision for Data and Analytics webinar. Most of D&A concerns and activities are done within EA in the Info/Data architecture domain/phases. Where these efforts break down is in the data that goes into the connection at one end and comes out the other.

Analytics

Analytics Measurement Data-driven Modeling

What Is Embedded Analytics?

Jet Global

MAY 1, 2023

Data Environment First off, the solutions you consider should be compatible with your current data architecture. We have outlined the requirements that most providers ask for: Data Sources Strategic Objective Use native connectivity optimized for the data source. addresses). Do what you expect your customers to do.

Analytics

Analytics Cost-Benefit Visualization Dashboards

Design patterns for implementing Hive Metastore for Amazon EMR on EKS

AWS Big Data

FEBRUARY 28, 2025

In modern data architectures, the need to manage and query vast datasets efficiently, consistently, and accurately is paramount. For organizations that deal with big data processing, managing metadata becomes a critical concern. Any reference to HMS refers to a Standalone Hive Metastore.

Metadata

Metadata Data Lake Data Processing Data Architecture

Access your existing data and resources through Amazon SageMaker Unified Studio, Part 1: AWS Glue Data Catalog and Amazon Redshift

AWS Big Data

APRIL 28, 2025

Under Add a data source , choose Add connection , then choose Amazon Redshift. Enter the following parameters in the connection details, and choose Add data. Host : Enter the Amazon Redshift managed VPC endpoint. Choose AWS Glue (Lakehouse) for Data source type. Choose RUN to import the metadata.

Metadata

Metadata Data Lake Big Data Publishing

Build end-to-end Apache Spark pipelines with Amazon MWAA, Batch Processing Gateway, and Amazon EMR on EKS clusters

AWS Big Data

MAY 1, 2025

The operator typically performs the following steps: Initialize job BPGOperator prepares the job payload, including input parameters, configurations, connection details, and other metadata required by BPG. About the Authors Suvojit Dasgupta is a Principal Data Architect at AWS. You can find the code base in the GitHub repo.

Cost-Benefit

Cost-Benefit Interactive Management Data Processing

Data Leaders Brief

Modernize your legacy databases with AWS data lakes, Part 2: Build a data lake using AWS DMS data on Apache Iceberg

How EUROGATE established a data mesh architecture using Amazon DataZone

Webinars

Trending Sources

The Struggle Between Data Dark Ages and LLM Accuracy

Webinars

The Top Three Entangled Trends in Data Architectures: Data Mesh, Data Fabric, and Hybrid Architectures

Design a data mesh pattern for Amazon EMR-based data lakes using AWS Lake Formation with Hive metastore federation

Migrate an existing data lake to a transactional data lake using Apache Iceberg

How Cargotec uses metadata replication to enable cross-account data sharing

How Cloudera Data Flow Enables Successful Data Mesh Architectures

Build SAML identity federation for Amazon OpenSearch Service domains within a VPC

SAP enhances Datasphere and SAC for AI-driven transformation

How Amazon Finance Automation built a data mesh to support distributed data ownership and centralize governance

HEMA accelerates their data governance journey with Amazon DataZone

How Swisscom automated Amazon Redshift as part of their One Data Platform solution using AWS CDK – Part 1

Petabyte-scale log analytics with Amazon S3, Amazon OpenSearch Service, and Amazon OpenSearch Ingestion

Boosting Object Storage Performance with Ozone Manager

Habib Bank manages data at scale with Cloudera Data Platform

Build a serverless transactional data lake with Apache Iceberg, Amazon EMR Serverless, and Amazon Athena

Design a data mesh on AWS that reflects the envisioned organization

How smava makes loans transparent and affordable using Amazon Redshift Serverless

KGF 2023: Bikes To The Moon, Datastrophies, Abstract Art And A Knowledge Graph Forum To Embrace Them All

Empowering data mesh: The tools to deliver BI excellence

Orca Security’s journey to a petabyte-scale data lake with Apache Iceberg and AWS Analytics

Create an end-to-end data strategy for Customer 360 on AWS

How Zurich Insurance Group built a log management solution on AWS

CIOs rise to the ESG reporting challenge

Announcing Alation 4.0 with Alation Connect

Build incremental data pipelines to load transactional data changes using AWS DMS, Delta 2.0, and Amazon EMR Serverless

How Novo Nordisk built distributed data governance and control at scale

The Gartner 2022 Leadership Vision for Data and Analytics Leaders Questions and Answers

What Is Embedded Analytics?

Design patterns for implementing Hive Metastore for Amazon EMR on EKS

Access your existing data and resources through Amazon SageMaker Unified Studio, Part 1: AWS Glue Data Catalog and Amazon Redshift

Build end-to-end Apache Spark pipelines with Amazon MWAA, Batch Processing Gateway, and Amazon EMR on EKS clusters

Stay Connected