This is part two of a three-part series showing how to build a data lake on AWS using a modern data architecture. This post shows how to load data from a legacy database (SQL Server) into a transactional data lake (Apache Iceberg) using AWS Glue.
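The excerpt includes a fragment of the Glue job's Spark configuration; a minimal sketch of what that session setup might look like follows, with the database name and S3 warehouse path as placeholder assumptions:

```python
# Sketch of a Glue job Spark session configured for Iceberg tables in the
# AWS Glue Data Catalog. dbname and the S3 bucket are hypothetical values.
from pyspark.sql import SparkSession

dbname = "legacy_sqlserver_db"  # hypothetical target database name

spark = (
    SparkSession.builder
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://my-bucket/{}/".format(dbname))
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .getOrCreate()
)
```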
Need for a data mesh architecture: Because entities in the EUROGATE group generate vast amounts of data from various sources across departments, locations, and technologies, the traditional centralized data architecture struggles to keep up with the demands for real-time insights, agility, and scalability.
In this post, we show you how Stifel implemented a modern data platform using AWS services and open data standards, building an event-driven architecture for domain data products while centralizing the metadata to facilitate discovery and sharing of data products.
HEMA has a bespoke enterprise architecture built around the concept of services. Each service is hosted in a dedicated AWS account and is built and maintained by a product owner and a development team.
In this post, we delve into the key aspects of using Amazon EMR for modern data management, covering topics such as data governance, data mesh deployment, and streamlined data discovery. Organizations often have multiple Hive data warehouses across EMR clusters, where the metadata is generated.
These inputs reinforced the need for a unified data strategy across the FinOps teams. We decided to build a scalable data management product based on the best practices of modern data architecture. Our source system and domain teams were mapped as data producers, and they would have ownership of the datasets.
Over the years, data lakes on Amazon Simple Storage Service (Amazon S3) have become the default repository for enterprise data and are a common choice for a large set of users who query data for a variety of analytics and machine learning use cases. Analytics use cases on data lakes are always evolving.
They chose AWS Glue as their preferred data integration tool due to its serverless nature, low maintenance, ability to control compute resources in advance, and scale when needed. To share the datasets, they needed a way to share access to the data and access to catalog metadata in the form of tables and views.
Create an Amazon Route 53 public hosted zone, such as mydomain.com, to be used for routing internet traffic to your domain (for instructions, refer to Creating a public hosted zone), then request an AWS Certificate Manager (ACM) public certificate for the hosted zone. The hosted_zone_id parameter refers to the Route 53 public hosted zone ID.
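For readers who prefer scripting these prerequisites, a hedged sketch of the two steps with boto3 follows; the domain is the post's mydomain.com placeholder, and the caller reference scheme is an assumption:

```python
# Sketch: create the hosted zone and request a DNS-validated ACM certificate.
import time
import boto3

route53 = boto3.client("route53")
acm = boto3.client("acm")

# Create the public hosted zone; CallerReference must be unique per request.
zone = route53.create_hosted_zone(
    Name="mydomain.com",
    CallerReference=str(time.time()),
)
hosted_zone_id = zone["HostedZone"]["Id"]  # value for the hosted_zone_id parameter

# Request a public certificate for the domain, validated via DNS records.
cert = acm.request_certificate(
    DomainName="mydomain.com",
    ValidationMethod="DNS",
)
print(hosted_zone_id, cert["CertificateArn"])
```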
Since the deluge of big data over a decade ago, many organizations have learned to build applications to process and analyze petabytes of data. Data lakes have served as a central repository to store structured and unstructured data at any scale and in various formats.
With data becoming the driving force behind many industries today, having a modern data architecture is pivotal for organizations to be successful. By decoupling storage and compute, data lakes promote cost-effective storage and processing of big data. Why did Orca choose Apache Iceberg?
Cost and resource efficiency – This is an area where Acast observed a reduction in data duplication, and therefore a cost reduction (in some accounts, eliminating copied data entirely), by reading data across accounts while enabling scaling. In this approach, teams responsible for generating data are referred to as producers.
At the same time, they need to optimize operational costs to unlock the value of this data for timely insights, and do so with consistent performance. With this massive data growth, data proliferation across your data stores, data warehouse, and data lakes can become equally challenging.
One Data Platform: The ODP architecture is based on the AWS Well-Architected Framework Analytics Lens and follows the pattern of having raw, standardized, conformed, and enriched layers, as described in Modern data architecture. See the following admin user code: admin_secret_kms_key_options = KmsKeyOptions(...)
Overview of solution As a data-driven company, smava relies on the AWS Cloud to power their analytics use cases. smava ingests data from various external and internal data sources into a landing stage on the data lake based on Amazon Simple Storage Service (Amazon S3).
Profile aggregation – When you’ve uniquely identified a customer, you can build applications in Managed Service for Apache Flink to consolidate all their metadata, from name to interaction history. Then, you transform this data into a concise format.
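As an illustration of that consolidation step, here is a minimal, hypothetical sketch using the PyFlink Table API (the original post targets Amazon Managed Service for Apache Flink; the field names and sample events are assumptions):

```python
# Sketch: consolidate per-customer metadata into one concise row per profile.
from pyflink.table import EnvironmentSettings, TableEnvironment
from pyflink.table.expressions import col

t_env = TableEnvironment.create(EnvironmentSettings.in_batch_mode())

# Hypothetical interaction events for an already-identified customer.
events = t_env.from_elements(
    [(42, "Jane Doe", "page_view"), (42, "Jane Doe", "purchase")],
    ["customer_id", "name", "interaction"],
)

# Group by the unique customer key and summarize the interaction history.
profiles = (
    events.group_by(col("customer_id"), col("name"))
          .select(col("customer_id"), col("name"),
                  col("interaction").count.alias("interaction_count"))
)
profiles.execute().print()
```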
Priority 2 logs, such as operating system security logs, firewall, identity provider (IdP), email metadata, and AWS CloudTrail, are ingested into Amazon OpenSearch Service.
You can then apply transformations and store data in Delta format for managing inserts, updates, and deletes. Amazon EMR Serverless is a serverless option in Amazon EMR that makes it easy for data analysts and engineers to run open-source big data analytics frameworks without configuring, managing, and scaling clusters or servers.
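A minimal sketch of that insert/update/delete handling with Delta Lake's MERGE in PySpark follows; the table location, schema, and change-record format are assumptions, not details from the post:

```python
# Sketch: apply change records (insert/update/delete) to a Delta table.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Hypothetical change records: op is 'I' (insert), 'U' (update), or 'D' (delete).
changes = spark.createDataFrame(
    [(1, "alice", "U"), (2, "bob", "I"), (3, "carol", "D")],
    ["id", "name", "op"],
)
changes.createOrReplaceTempView("updates")

# Merge into an existing Delta table (placeholder S3 path).
spark.sql("""
    MERGE INTO delta.`s3://my-bucket/target/` AS t
    USING updates AS u
    ON t.id = u.id
    WHEN MATCHED AND u.op = 'D' THEN DELETE
    WHEN MATCHED THEN UPDATE SET t.name = u.name
    WHEN NOT MATCHED AND u.op != 'D' THEN INSERT (id, name) VALUES (u.id, u.name)
""")
```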
When building a scalable data architecture on AWS, giving autonomy and ownership to the data domains is crucial for the success of the platform. Solution overview: In the first post of this series, we explained how Novo Nordisk and AWS Professional Services built a modern data architecture based on data mesh tenets.
Data Environment: First off, the solutions you consider should be compatible with your current data architecture. We have outlined the requirements that most providers ask for. For data sources, the strategic objective is to use native connectivity optimized for the data source.
In modern data architectures, the need to manage and query vast datasets efficiently, consistently, and accurately is paramount. For organizations that deal with big data processing, managing metadata becomes a critical concern. Any reference to HMS refers to a Standalone Hive Metastore.
Under Add a data source, choose Add connection, then choose Amazon Redshift. Enter the connection details: for Host, enter the Amazon Redshift managed VPC endpoint, and choose AWS Glue (Lakehouse) for Data source type. Choose Add data, then choose RUN to import the metadata.
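If you would rather script the equivalent Glue connection instead of using the console, a hedged boto3 sketch follows; the connection name, endpoint, and credentials are placeholders:

```python
# Sketch: register a JDBC connection to Amazon Redshift in AWS Glue.
import boto3

glue = boto3.client("glue")
glue.create_connection(
    ConnectionInput={
        "Name": "redshift-connection",  # hypothetical connection name
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            # The host portion is the Amazon Redshift managed VPC endpoint.
            "JDBC_CONNECTION_URL": "jdbc:redshift://<managed-vpc-endpoint>:5439/dev",
            "USERNAME": "awsuser",       # placeholder credentials
            "PASSWORD": "<password>",
        },
    }
)
```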
The operator typically performs the following steps. Initialize job: BPGOperator prepares the job payload, including input parameters, configurations, connection details, and other metadata required by BPG. You can find the code base in the GitHub repo.
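To make that initialization step concrete, here is a purely hypothetical skeleton of such an Airflow operator; the payload shape, client call, and endpoint path are assumptions rather than the actual BPG API (see the post's GitHub repo for the real code base):

```python
# Hypothetical skeleton of a BPG-style Airflow operator.
from airflow.models.baseoperator import BaseOperator
import requests


class BPGOperator(BaseOperator):
    def __init__(self, job_name: str, entry_point: str, conf: dict,
                 bpg_endpoint: str, **kwargs):
        super().__init__(**kwargs)
        self.job_name = job_name
        self.entry_point = entry_point
        self.conf = conf
        self.bpg_endpoint = bpg_endpoint

    def execute(self, context):
        # Initialize job: assemble the payload the gateway expects (shape assumed).
        payload = {
            "name": self.job_name,
            "mainApplicationFile": self.entry_point,
            "sparkConf": self.conf,
        }
        # Submit the job to the gateway (hypothetical endpoint path).
        response = requests.post(f"{self.bpg_endpoint}/jobs", json=payload, timeout=30)
        response.raise_for_status()
        return response.json()
```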