Data lakes and data warehouses are two of the most important data storage and management technologies in a modern data architecture. Data lakes store all of an organization's data, regardless of its format or structure.
Iceberg has become very popular for its support for ACID transactions in data lakes and for features like schema and partition evolution, time travel, and rollback. AWS Glue 3.0 and later supports the Apache Iceberg framework for data lakes. In Iceberg's metadata tree, each snapshot points to a manifest list.
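Time travel and rollback can be exercised directly from Spark SQL (Spark 3.3 or later for the VERSION AS OF clause). A minimal sketch, assuming a hypothetical table glue_catalog.db.orders and a made-up snapshot ID; rollback_to_snapshot is one of Iceberg's Spark stored procedures:

    # Query the table as of an earlier snapshot (time travel).
    spark.sql("SELECT * FROM glue_catalog.db.orders VERSION AS OF 5781947118336215154").show()

    # Roll the table back to that snapshot if a bad write needs to be undone.
    spark.sql("CALL glue_catalog.system.rollback_to_snapshot('db.orders', 5781947118336215154)")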
Solving the small file problem and improving query performance: in modern data architectures, stream processing engines such as Amazon EMR are often used to ingest continuous streams of data into data lakes using Apache Iceberg. Old snapshots can be expired to keep metadata manageable; note that expireOlderThan() takes an epoch timestamp in milliseconds, not a duration:

    // Expire snapshots older than 7 days.
    SparkActions.get()
        .expireSnapshots(iceTable)
        .expireOlderThan(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7))
        .execute()
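The small file problem itself is usually addressed by compacting data files. A minimal sketch using Iceberg's rewrite_data_files Spark procedure, with a hypothetical table name and a 128 MB target file size:

    # Bin-pack small files into ~128 MB files to reduce per-file overhead at query time.
    spark.sql("""
        CALL glue_catalog.system.rewrite_data_files(
            table => 'db.orders',
            strategy => 'binpack',
            options => map('target-file-size-bytes', '134217728'))
    """)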
Apache Iceberg is an Apache-licensed, 100% open-source data table format that helps simplify data processing on large datasets stored in data lakes. Data engineers use Apache Iceberg because it's fast, efficient, and reliable at any scale, and it keeps records of how datasets change over time.
Since the deluge of big data over a decade ago, many organizations have learned to build applications to process and analyze petabytes of data. Data lakes have served as a central repository to store structured and unstructured data at any scale and in various formats.
A modern data architecture enables companies to ingest virtually any type of data through automated pipelines into a data lake, which provides highly durable and cost-effective object storage at petabyte or exabyte scale.
Amazon Redshift Serverless makes it simple to run and scale analytics without having to manage your data warehouse infrastructure. For Filter by resource type, you can filter by Workgroup, Namespace, Snapshot, and Recovery Point. For more details on tagging, refer to Tagging resources overview.
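Tags can also be applied programmatically. A minimal sketch, assuming boto3 and a placeholder workgroup ARN; the Redshift Serverless API expects lowercase key/value fields in each tag:

    import boto3

    client = boto3.client("redshift-serverless")
    # Attach a cost-allocation tag to a workgroup (ARN is a placeholder).
    client.tag_resource(
        resourceArn="arn:aws:redshift-serverless:us-east-1:123456789012:workgroup/wg-example",
        tags=[{"key": "environment", "value": "production"}],
    )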
When you build your transactional data lake using Apache Iceberg to solve your functional use cases, you also need to focus on operational use cases for your S3 data lake to optimize the production environment. Sorting data within partitions before appending improves file layout; here df is assumed to be a DataFrame already loaded from the source Parquet data:

    df.sortWithinPartitions("review_date").writeTo("dev.db.amazon_reviews_iceberg").append()
A modern data architecture is an evolutionary architecture pattern designed to integrate a data lake, data warehouse, and purpose-built stores with a unified governance model. Of those tables, some are larger (in terms of record volume, for example) than others, and some are updated more frequently than others.
In this blog, we share in detail how Cloudera integrates core compute engines, including Apache Hive and Apache Impala, in Cloudera Data Warehouse with Iceberg. We will publish follow-up blogs for other data services. It allows us to independently upgrade the Virtual Warehouses and Database Catalogs.
These types of queries are suited for a data warehouse. The goal of a data warehouse is to enable businesses to analyze their data fast; this is important because it means they are able to gain valuable insights in a timely manner. Amazon Redshift is a fully managed, scalable cloud data warehouse.
Amazon Athena supports the MERGE command on Apache Iceberg tables, which allows you to perform inserts, updates, and deletes in your data lake at scale using familiar SQL statements that are compliant with ACID (Atomic, Consistent, Isolated, Durable).
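A minimal sketch of the pattern, with hypothetical table, column, and workgroup names; the MERGE statement is submitted through Athena's StartQueryExecution API:

    import boto3

    athena = boto3.client("athena")
    merge_sql = """
        MERGE INTO lakedb.customers t
        USING lakedb.customer_updates s
          ON t.customer_id = s.customer_id
        WHEN MATCHED AND s.op = 'D' THEN DELETE
        WHEN MATCHED THEN UPDATE SET email = s.email
        WHEN NOT MATCHED THEN INSERT (customer_id, email) VALUES (s.customer_id, s.email)
    """
    # Runs asynchronously; poll GetQueryExecution for completion in real code.
    athena.start_query_execution(QueryString=merge_sql, WorkGroup="primary")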
ML use cases rarely dictate the master data management solution, so the ML stack needs to integrate with existing data warehouses. To manage the dynamism, we can resort to taking snapshots that represent immutable points in time: of models, of data, of code, and of internal state. Versioning.
AWS-powered data lakes, supported by the unmatched availability of Amazon Simple Storage Service (Amazon S3), can handle the scale, agility, and flexibility required to combine different data and analytics approaches. Snapshot expiration will never remove files that are still required by a non-expired snapshot.
About Redshift and some relevant features for the use case: Amazon Redshift is a fully managed, petabyte-scale, massively parallel data warehouse that offers simple operations and high performance. It makes it fast, simple, and cost-effective to analyze all your data using standard SQL and your existing business intelligence (BI) tools.
It combines the flexibility and scalability of data lake storage with the data analytics, data governance, and data management functionality of the data warehouse. Table cleanup: as tables grow, they often accumulate unused data files, manifest files, and snapshots that aren't needed anymore.
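For Iceberg tables, that cleanup is typically a pair of maintenance calls. A minimal sketch with a hypothetical table name, using Iceberg's Spark stored procedures:

    # Drop snapshots (and the metadata that references them) past the retention window.
    spark.sql("CALL glue_catalog.system.expire_snapshots(table => 'db.orders', retain_last => 5)")

    # Delete files on storage that no table snapshot references anymore.
    spark.sql("CALL glue_catalog.system.remove_orphan_files(table => 'db.orders')")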
Data architecture has evolved significantly to handle growing data volumes and diverse workloads. Initially, data warehouses were the go-to solution for structured data and analytical workloads but were limited by proprietary storage formats and their inability to handle unstructured data.
In a data warehouse, a dimension is a structure that categorizes facts and measures in order to enable users to answer business questions. As organizations across the globe modernize their data platforms with data lakes on Amazon Simple Storage Service (Amazon S3), handling slowly changing dimensions (SCDs) in data lakes can be challenging.
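A minimal SCD Type 2 sketch on an Iceberg table, assuming hypothetical dim_customer and stg_customer tables with customer_id, address, effective_from, effective_to, and is_current columns:

    # Step 1: close out current rows whose tracked attribute changed.
    spark.sql("""
        MERGE INTO glue_catalog.db.dim_customer t
        USING glue_catalog.db.stg_customer s
          ON t.customer_id = s.customer_id AND t.is_current = true
        WHEN MATCHED AND t.address <> s.address THEN
          UPDATE SET is_current = false, effective_to = current_date()
    """)

    # Step 2: insert a fresh current row for changed and brand-new customers
    # (anything left without a current row after step 1).
    spark.sql("""
        INSERT INTO glue_catalog.db.dim_customer
        SELECT s.customer_id, s.address, current_date(), DATE '9999-12-31', true
        FROM glue_catalog.db.stg_customer s
        LEFT JOIN glue_catalog.db.dim_customer t
          ON t.customer_id = s.customer_id AND t.is_current = true
        WHERE t.customer_id IS NULL
    """)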
These processes retrieve data from around 90 different data sources, resulting in updates to roughly 2,000 tables in the data warehouse and 3,000 external tables in Parquet format, accessed through Amazon Redshift Spectrum and a data lake on Amazon Simple Storage Service (Amazon S3).
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more. This enables you to use your data to acquire new insights for your business and customers. For additional details, refer to Automated snapshots.
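Beyond the automated snapshots Redshift takes on its own, you can trigger manual ones. A minimal sketch, assuming boto3 and hypothetical cluster and snapshot identifiers:

    import boto3

    redshift = boto3.client("redshift")
    # Create an on-demand snapshot, e.g. before a risky schema migration.
    redshift.create_cluster_snapshot(
        ClusterIdentifier="analytics-cluster",
        SnapshotIdentifier="analytics-cluster-pre-migration",
    )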
Apache Hudi is an open table format that brings database and data warehouse capabilities to data lakes. Apache Hudi helps data engineers manage complex challenges, such as managing continuously evolving datasets with transactions while maintaining query performance.
Amazon Redshift is a popular cloud data warehouse, offering a fully managed cloud-based service that seamlessly integrates with an organization's Amazon Simple Storage Service (Amazon S3) data lake, real-time streams, machine learning (ML) workflows, transactional workflows, and much more, all while providing up to 7.9x better price-performance.
Today, many customers build data quality validation pipelines using AWS Glue Data Quality's Data Quality Definition Language (DQDL) because, with static rules, dynamic rules, and anomaly detection capability, it's fairly straightforward. One of Iceberg's key features is the ability to manage data using branches.
This integration expands the possibilities for AWS analytics and machine learning (ML) solutions, making the data warehouse accessible to a broader range of applications. Your applications can seamlessly read from and write to your Amazon Redshift data warehouse while maintaining optimal performance and transactional consistency.
With Amazon EMR 6.15, we launched AWS Lake Formation based fine-grained access controls (FGAC) on Open Table Formats (OTFs), including Apache Hudi, Apache Iceberg, and Delta Lake. Many large enterprise companies seek to use their transactional data lake to gain insights and improve decision-making.
Cloudera Data Warehouse (CDW) running Hive has previously supported creating materialized views against Hive ACID source tables. Starting with a recent release and the matching CDW Private Cloud Data Services release, Hive also supports creating, using, and rebuilding materialized views for the Iceberg table format.
Complex queries, on the other hand, refer to large-scale data processing and in-depth analysis based on petabyte-level data warehouses in massive data scenarios. In this post, we use dbt for data modeling on both Amazon Athena and Amazon Redshift; here, the data modeling runs on Amazon Redshift.
Large-scale data warehouse migration to the cloud is a complex and challenging endeavor that many organizations undertake to modernize their data infrastructure, enhance data management capabilities, and unlock new business opportunities. This ensures the new data platform can meet current and future business goals.
This solution only replicates metadata in the Data Catalog, not the actual underlying data. To have a redundant data lake using Lake Formation and AWS Glue in an additional Region, we recommend replicating the Amazon S3-based storage using S3 replication, S3 sync, aws-s3-copy-sync-using-batch, or the S3 Batch Replication process.
A CDC-based approach captures data changes and makes them available in data warehouses for further analytics in real time; the target (usually a data warehouse) needs to reflect those changes in near real time. This post showcases how to use streaming ingestion to bring data to Amazon Redshift.
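Redshift streaming ingestion is configured with two SQL statements: an external schema mapped to the stream and an auto-refreshing materialized view. A minimal sketch with hypothetical stream and connection details, submitted here through the redshift_connector Python driver:

    import redshift_connector

    conn = redshift_connector.connect(
        host="workgroup.123456789012.us-east-1.redshift-serverless.amazonaws.com",
        database="dev", user="admin", password="...")
    cur = conn.cursor()

    # Map a Kinesis data stream into Redshift via an external schema.
    cur.execute("CREATE EXTERNAL SCHEMA kds FROM KINESIS IAM_ROLE default")

    # Materialized view that Redshift keeps refreshed from the stream.
    cur.execute("""
        CREATE MATERIALIZED VIEW orders_stream_mv AUTO REFRESH YES AS
        SELECT approximate_arrival_timestamp,
               json_parse(from_varbyte(kinesis_data, 'utf-8')) AS payload
        FROM kds."orders-stream"
    """)
    conn.commit()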
Furthermore, data events are filtered, enriched, and transformed to a consumable format using a stream processor. The result is made available to the application by querying the latest snapshot. For more details, refer to Create a low-latency source-to-data lake pipeline using Amazon MSK Connect, Apache Flink, and Apache Hudi.
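The filter/enrich/transform step looks roughly the same in any stream processor. A minimal Structured Streaming sketch in PySpark (the referenced post uses Apache Flink), with hypothetical broker, topic, and field names:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("enrich-events").getOrCreate()

    raw = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")
           .option("subscribe", "events")
           .load())

    schema = "event_type STRING, user_id STRING, amount DOUBLE"
    events = (raw.selectExpr("CAST(value AS STRING) AS json")
              .select(F.from_json("json", schema).alias("e"))
              .select("e.*"))

    # Filter to the events of interest, then enrich with processing metadata.
    out = (events.filter(F.col("event_type") == "purchase")
           .withColumn("processed_at", F.current_timestamp()))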
Data Science works best with a high degree of data granularity, when the data offers the closest possible representation of what happened during actual events, as in financial transactions, medical consultations, or marketing campaign results.
With this new functionality, customers can create up-to-date replicas of their data from applications such as Salesforce, ServiceNow, and Zendesk in an Amazon SageMaker Lakehouse and Amazon Redshift. SageMaker Lakehouse gives you the flexibility to access and query your data in place with all Apache Iceberg-compatible tools and engines.
Load generic address data to Amazon Redshift: Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. Redshift Serverless makes it straightforward to run analytics workloads of any size without having to manage data warehouse infrastructure.
Introduction: Apache Iceberg has recently grown in popularity because it adds data warehouse-like capabilities to your data lake, making it easier to analyze all your data, structured and unstructured. The problem with too many snapshots: every time a write operation occurs on an Iceberg table, a new snapshot is created.
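You can see how many snapshots a table has accumulated through its metadata tables. A minimal sketch with a hypothetical table name:

    # Each row is one snapshot; a long-running streaming writer can produce thousands.
    spark.sql("""
        SELECT snapshot_id, committed_at, operation
        FROM glue_catalog.db.orders.snapshots
        ORDER BY committed_at DESC
    """).show()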
It automatically provisions and intelligently scales data warehouse compute capacity to deliver fast performance, and you pay only for what you use. Just load your data and start querying right away in the Amazon Redshift Query Editor or in your favorite business intelligence (BI) tool.
Improve performance and overall manageability of Iceberg tables using the new table maintenance capabilities, such as expiring old snapshots and removing their metadata, and compaction to combine small files for more efficient data processing. Read why the future of data lakehouses is open. The ORC open file format is also supported.
This approach has been widely used in data warehouses to track changes in various dimensions such as customer information, product details, and employee data. It enables point-in-time analysis, provides detailed audit trails, aids in data quality management, and helps meet compliance requirements by preserving historical data.
Then when there is a breach, it comes as a shock: "wow, I didn't even know that application had access to so much sensitive data." Step one in any data security program should be to discover and classify sensitive datasets, know where that data resides, and understand who really needs it to do their jobs.
Organizations must comply with these requests provided that there are no legitimate grounds for retaining the personal data, such as legal obligations or contractual requirements. Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. Amazon Redshift offers backups and snapshots of the data.
A data lakehouse that enables multiple engines to run on the same data improves speed to market and the productivity of users. Cloudera has supported data lakehouses for over five years. Applying the Iceberg table format to all of the organization's data in the data lake makes it more performant and usable at scale.
It can receive events from an input Kinesis data stream and route the resulting stream to an output data stream. State snapshot in Amazon S3: you can store the state snapshot in Amazon S3 for tracking. You can create a stateful functions cluster with Apache Flink based on your application business logic.
With data volumes exhibiting double-digit percentage growth year over year, and the COVID pandemic disrupting global logistics in 2021, it became more critical to scale and generate near-real-time data. You can visually create, run, and monitor extract, transform, and load (ETL) pipelines to load data into your data lakes.
A Better Way Forward: Cloudera's Open Data Lakehouse. Cloudera offers a solution to these challenges with its open data lakehouse, which combines the flexibility and scalability of data lake storage with data warehouse functionality to unify and simplify the management of cyber log data.