In modern data architectures, Apache Iceberg has emerged as a popular table format for data lakes, offering key features including ACID transactions and concurrent write support. The AWS Glue Data Catalog provides this functionality as the Iceberg catalog: each write determines the changes in a transaction and writes new data files.
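As a minimal sketch of that setup, here is Spark configured to use the Glue Data Catalog as its Iceberg catalog; the catalog name, S3 warehouse bucket, and table names are purely illustrative:

```python
# Sketch: Spark with the AWS Glue Data Catalog as the Iceberg catalog.
# Requires the Iceberg Spark runtime and AWS bundle JARs on the classpath.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://example-bucket/warehouse/")  # placeholder bucket
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .getOrCreate()
)

# Each commit is atomic: Iceberg writes new data files, then swaps the
# table metadata pointer, so concurrent writers retry safely.
spark.sql("""
    CREATE TABLE IF NOT EXISTS glue_catalog.analytics.events (
        event_id BIGINT, event_ts TIMESTAMP, payload STRING)
    USING iceberg
""")
spark.sql("INSERT INTO glue_catalog.analytics.events VALUES (1, current_timestamp(), 'hello')")
```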
This post was co-written with Dipankar Mazumdar, Staff Data Engineering Advocate with AWS Partner OneHouse. Data architecture has evolved significantly to handle growing data volumes and diverse workloads. For more examples and references to other posts, refer to the following GitHub repository.
They understand that a one-size-fits-all approach no longer works, and recognize the value in adopting scalable, flexible tools and open data formats to support interoperability in a modern data architecture to accelerate the delivery of new solutions. Snowflake can query across Iceberg and Snowflake table formats.
This means that if data is moved from a bucket in the source Region to another bucket in the target Region, the data access permissions need to be reapplied in the target Region. The AWS Glue Data Catalog is a central repository of metadata about data stored in your data lake.
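For a sketch of how that catalog metadata is read programmatically, here is a boto3 snippet; the database name and Region are assumptions, and any Lake Formation grants must exist in each Region since permissions do not replicate:

```python
# Sketch: listing tables and their S3 locations from the Glue Data Catalog.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

for page in glue.get_paginator("get_tables").paginate(DatabaseName="sales_db"):  # hypothetical database
    for table in page["TableList"]:
        print(table["Name"], table["StorageDescriptor"]["Location"])
```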
While traditional extract, transform, and load (ETL) processes have long been a staple of data integration due to their flexibility, for common use cases such as replication and ingestion they often prove time-consuming, complex, and less adaptable to the fast-changing demands of modern data architectures.
Kinesis Data Streams has native integrations with other AWS services such as AWS Glue and Amazon EventBridge to build real-time streaming applications on AWS. Refer to Amazon Kinesis Data Streams integrations for additional details. To access your data from Timestream, you need to install the Timestream plugin for Grafana.
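As an illustration of the producing side that such integrations consume, here is a hedged boto3 sketch; the stream name and payload are made up:

```python
# Sketch: writing a record to a Kinesis data stream so that downstream
# consumers (e.g., an AWS Glue streaming job or an EventBridge pipe) can process it.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

kinesis.put_record(
    StreamName="clickstream-events",  # hypothetical stream
    Data=json.dumps({"user_id": 42, "action": "page_view"}).encode("utf-8"),
    PartitionKey="42",                # controls shard assignment
)
```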
The Analytics specialty practice of AWS Professional Services (AWS ProServe) helps customers across the globe with modern data architecture implementations on the AWS Cloud. The company wanted the ability to continue processing operational data in the secondary Region in the rare event of primary Region failure.
For comprehensive instructions, refer to Running Spark jobs with the Spark operator. For official guidance, refer to Create a VPC. Refer to create-db-subnet-group for more details. Refer to create-db-cluster for more details.
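The create-db-subnet-group and create-db-cluster CLI steps referenced above can also be scripted. Here is a hedged boto3 sketch; the subnet IDs, identifiers, and password are placeholders (the engine version matches the one in the excerpt):

```python
# Sketch: creating a DB subnet group and an Aurora MySQL cluster with boto3.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.create_db_subnet_group(
    DBSubnetGroupName="aurora-subnet-group",
    DBSubnetGroupDescription="Subnets for the Aurora MySQL cluster",
    SubnetIds=["subnet-0abc1234", "subnet-0def5678"],  # from the VPC created earlier
)

rds.create_db_cluster(
    DBClusterIdentifier="demo-aurora-cluster",
    Engine="aurora-mysql",
    EngineVersion="8.0.mysql_aurora.3.06.1",
    MasterUsername="admin",
    MasterUserPassword="REPLACE_ME",  # resolve from AWS Secrets Manager in practice
    DBSubnetGroupName="aurora-subnet-group",
)
```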
Data lakes and data warehouses are two of the most important data storage and management technologies in a modern data architecture. Data lakes store all of an organization's data, regardless of its format or structure. AWS Glue supports the Redshift MERGE SQL command within its data integration jobs.
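To show the kind of MERGE (upsert) statement involved, here is a hedged sketch; the table and column names are hypothetical, and it is run here through the redshift_connector driver for illustration rather than inside a Glue job:

```python
# Sketch: an upsert against Redshift using the MERGE SQL command.
import redshift_connector

conn = redshift_connector.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
    database="dev",
    user="awsuser",
    password="REPLACE_ME",
)

cur = conn.cursor()
cur.execute("""
    MERGE INTO dim_customer
    USING staging_customer s ON dim_customer.customer_id = s.customer_id
    WHEN MATCHED THEN UPDATE SET name = s.name, city = s.city
    WHEN NOT MATCHED THEN INSERT (customer_id, name, city)
        VALUES (s.customer_id, s.name, s.city)
""")
conn.commit()
```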
In fact, we recently announced the integration with our cloud ecosystem, bringing the benefits of Iceberg to enterprises as they make their journey to the public cloud and adopt more converged architectures like the Lakehouse. Use cases include multi-function analytics, financial regulation, reproducibility for MLOps, and the *Any*-house.
Furthermore, data events are filtered, enriched, and transformed to a consumable format using a stream processor. The result is made available to the application by querying the latest snapshot. For more information, refer to Notions of Time: Event Time and Processing Time. For more information, refer to Dynamic Tables.
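Here is a sketch of such a filter/enrich/transform pipeline using event-time semantics. It assumes a Kafka source and illustrative field names; the actual stream processor in the post may differ:

```python
# Sketch: a stream processor that filters, enriches, and transforms events,
# keyed on event time (the "event_ts" field) rather than processing time.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("enrich-events").getOrCreate()

raw = (spark.readStream.format("kafka")                 # hypothetical source
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "events")
       .load())

events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(F.from_json("json", "event_ts TIMESTAMP, user_id INT, amount DOUBLE").alias("e"))
          .select("e.*")
          .filter(F.col("amount") > 0)                  # filter
          .withColumn("amount_usd", F.col("amount") * 1.1)  # enrich (illustrative rate)
          .withWatermark("event_ts", "10 minutes"))     # event-time watermark

# The latest snapshot is made queryable by the application.
query = (events.writeStream.format("parquet")
         .option("path", "s3://example-bucket/curated/")           # placeholder
         .option("checkpointLocation", "s3://example-bucket/ckpt/")
         .start())
```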
A modern data architecture enables companies to ingest virtually any type of data through automated pipelines into a data lake, which provides highly durable and cost-effective object storage at petabyte or exabyte scale. Tables in the lake can also be clustered for better data colocation using z-ordering.
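As a minimal sketch of z-ordering, here is Delta Lake's OPTIMIZE command; the table and columns are hypothetical, and Apache Iceberg offers a comparable rewrite_data_files procedure with a zorder sort strategy:

```python
# Sketch: clustering a Delta table so rows with similar key values
# land in the same files, improving data skipping on those columns.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

spark.sql("OPTIMIZE sales_db.orders ZORDER BY (customer_id, order_date)")
```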
By analyzing the historical report snapshot, you can identify areas for improvement, implement changes, and measure the effectiveness of those changes. For instructions, refer to Amazon DataZone quickstart with AWS Glue data. To learn more about Amazon DataZone, refer to the Amazon DataZone User Guide.
In this series, we talk about Swisscom's journey of automating Amazon Redshift provisioning as part of the Swisscom One Data Platform (ODP) solution using the AWS Cloud Development Kit (AWS CDK), and we provide code snippets and other useful references. For example:

outputs:
  - Name: ClusterStatus
    Selector: '$.Clusters[0].ClusterStatus'
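In the same spirit, here is a hedged CDK sketch of provisioning a Redshift cluster in Python; the identifiers and sizing are placeholders, not Swisscom's actual constructs:

```python
# Sketch: a CDK stack that provisions an Amazon Redshift cluster.
from aws_cdk import Stack
from aws_cdk import aws_redshift as redshift
from constructs import Construct

class RedshiftStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        redshift.CfnCluster(
            self, "OdpCluster",
            cluster_type="multi-node",
            number_of_nodes=2,
            node_type="ra3.xlplus",
            db_name="odp",
            master_username="admin",
            master_user_password="REPLACE_ME",  # resolve from AWS Secrets Manager in practice
        )
```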
Success criteria alignment by all stakeholders (producers, consumers, operators, auditors) is key for a successful transition to a new Amazon Redshift modern data architecture. The success criteria are the key performance indicators (KPIs) for each component of the data workflow. The following figure shows a daily usage KPI.
This post is designed to be implemented for a real customer use case, where you get full snapshot data on a daily basis. The dataset represents employee details such as ID, name, address, phone number, contractor or not, and more. You can also maintain the Delta table by compacting the small files.
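A small sketch of that compaction step with the Delta Lake Python API, assuming a Delta-enabled Spark session and a hypothetical table name:

```python
# Sketch: bin-packing small files in a Delta table after daily snapshot loads.
# Requires the delta-spark package.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes a Delta-enabled session

employees = DeltaTable.forName(spark, "hr_db.employee_snapshots")  # hypothetical table
employees.optimize().executeCompaction()    # rewrites small files into larger ones
```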
To create it, refer to Tutorial: Get started with Amazon EC2 Windows instances. To download and install AWS SCT on the EC2 instance that you created, refer to Installing, verifying, and updating AWS SCT. For more information about bucket names, refer to Bucket naming rules. Deselect Create final snapshot.
The data architecture diagram below shows an example of how you could use AWS services to calculate and visualize an organization's estimated carbon footprint. Customers have the flexibility to choose the services in each stage of the data pipeline based on their use case.
Incremental data is generated in the PostgreSQL table by running custom SQL scripts. Data ingestion – Steps 1 and 2 use AWS DMS, which connects to the source database and moves full and incremental data (CDC) to Amazon S3 in Parquet format. Let's refer to this S3 bucket as the raw layer.
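To give a flavor of working with the raw layer, here is a sketch that reads the DMS output with Spark; the bucket path is a placeholder, and the "Op" column is the change marker DMS adds to CDC files:

```python
# Sketch: inspecting the full-load and CDC Parquet files DMS writes to S3.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

cdc = spark.read.parquet("s3://example-raw-bucket/hr/employee/")  # placeholder path

# DMS CDC records carry an "Op" column: I = insert, U = update, D = delete.
inserts = cdc.filter(cdc["Op"] == "I")
updates = cdc.filter(cdc["Op"] == "U")
deletes = cdc.filter(cdc["Op"] == "D")
```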
We also decided to use Amazon EMR managed scaling to scale the core and task nodes (for scaling scenarios, refer to Node allocation scenarios). He is also the author of the book Simplify Big Data Analytics with Amazon EMR.
Apache Iceberg, together with the REST Catalog, dramatically simplifies the enterprise data architecture, reducing the Time to Value, Time to Market, and overall TCO, and driving greater ROI. Refer to the user documentation for installation and configuration of Cloudera Public Cloud.
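For orientation, here is a hedged sketch of pointing Spark at an Iceberg REST Catalog endpoint; the catalog name, URI, and authentication settings are placeholders, so consult the vendor documentation for actual values:

```python
# Sketch: configuring an Iceberg REST Catalog in Spark.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.catalog.rest_cat", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.rest_cat.type", "rest")
    .config("spark.sql.catalog.rest_cat.uri", "https://catalog.example.com/api/catalog")  # placeholder
    .getOrCreate()
)

spark.sql("SHOW NAMESPACES IN rest_cat").show()
```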
It is important to have additional tools and processes in place to understand the impact of data errors and to minimize their effect on the data pipeline and downstream systems. These operations can include data movement, validation, cleaning, transformation, aggregation, analysis, and more. Are there problems with the data tests?
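As one small example of such a validation step, here is a sketch of simple data tests in PySpark that surface errors before they propagate downstream; the table, columns, and checks are illustrative:

```python
# Sketch: lightweight data tests run as part of a pipeline.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.table("sales_db.orders")  # hypothetical table

null_ids = df.filter(F.col("order_id").isNull()).count()
negative_amounts = df.filter(F.col("amount") < 0).count()

assert null_ids == 0, f"{null_ids} rows are missing order_id"
assert negative_amounts == 0, f"{negative_amounts} rows have a negative amount"
```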