This post focuses on introducing an active-passive approach using a snapshot and restore strategy.

Snapshot and restore in OpenSearch Service

The snapshot and restore strategy in OpenSearch Service involves creating point-in-time backups, known as snapshots, of your OpenSearch domain.
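As a rough illustration of what a manual snapshot looks like, here is a minimal Python sketch against the OpenSearch _snapshot REST API. The domain endpoint, Region, repository name, bucket, and IAM role are hypothetical placeholders, and the signing setup assumes the requests and requests-aws4auth packages; treat this as a sketch of the general mechanism, not the post's exact procedure.

    import boto3
    import requests
    from requests_aws4auth import AWS4Auth

    # Hypothetical values: replace with your domain endpoint, bucket, and role.
    host = "https://my-domain.us-east-1.es.amazonaws.com"
    repo = "my-snapshot-repo"

    credentials = boto3.Session().get_credentials()
    awsauth = AWS4Auth(credentials.access_key, credentials.secret_key,
                       "us-east-1", "es", session_token=credentials.token)

    # Register an S3 bucket as a manual snapshot repository.
    requests.put(f"{host}/_snapshot/{repo}", auth=awsauth, json={
        "type": "s3",
        "settings": {
            "bucket": "my-snapshot-bucket",
            "region": "us-east-1",
            "role_arn": "arn:aws:iam::123456789012:role/SnapshotRole",
        },
    })

    # Take a point-in-time snapshot of the domain's indexes.
    requests.put(f"{host}/_snapshot/{repo}/snapshot-1", auth=awsauth)

    # On the passive domain, restore from the same repository:
    # requests.post(f"{host}/_snapshot/{repo}/snapshot-1/_restore", auth=awsauth)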
Iceberg has become very popular for its support for ACID transactions in data lakes and features like schema and partition evolution, time travel, and rollback. Apache Iceberg integration is supported by AWS analytics services including Amazon EMR, Amazon Athena, and AWS Glue. Each Iceberg snapshot points to a manifest list, which records the manifest files (and, through them, the data files) that make up the table at that point in time.
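To see that relationship concretely, Iceberg exposes a snapshots metadata table you can query from Spark. A minimal PySpark sketch, assuming an Iceberg catalog named glue_catalog and a hypothetical table name:

    from pyspark.sql import SparkSession

    # Assumes a Spark session already configured with an Iceberg catalog
    # named "glue_catalog" (hypothetical); adjust to your environment.
    spark = SparkSession.builder.appName("iceberg-snapshots").getOrCreate()

    # Each row is one snapshot; manifest_list is the file the snapshot points to.
    spark.sql("""
        SELECT committed_at, snapshot_id, operation, manifest_list
        FROM glue_catalog.db.sample_table.snapshots
        ORDER BY committed_at DESC
    """).show(truncate=False)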
In this blog post, we dive into different data aspects and how Cloudinary addresses the two concerns of vendor lock-in and cost-efficient data analytics by using Apache Iceberg, Amazon Simple Storage Service (Amazon S3), Amazon Athena, Amazon EMR, and AWS Glue.

    SparkActions.get().expireSnapshots(iceTable).expireOlderThan(TimeUnit.DAYS.toMillis(7)).execute()
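The fragment above is Iceberg's Java/Scala SparkActions API for expiring snapshots older than seven days. The same maintenance is also exposed as an Iceberg Spark SQL procedure; a minimal PySpark sketch, again assuming a hypothetical glue_catalog catalog and table name:

    from datetime import datetime, timedelta
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("expire-snapshots").getOrCreate()

    # Expire snapshots older than 7 days; data files no longer referenced
    # by any remaining snapshot become eligible for deletion.
    cutoff = (datetime.utcnow() - timedelta(days=7)).strftime("%Y-%m-%d %H:%M:%S")
    spark.sql(f"""
        CALL glue_catalog.system.expire_snapshots(
            table => 'db.sample_table',
            older_than => TIMESTAMP '{cutoff}'
        )
    """)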
A modern data architecture enables companies to ingest virtually any type of data through automated pipelines into a data lake, which provides highly durable and cost-effective object storage at petabyte or exabyte scale.
One-time and complex queries are two common scenarios in enterprise data analytics. Complex queries refer to large-scale data processing and in-depth analysis based on petabyte-level data warehouses in massive data scenarios.
With this new functionality, customers can create up-to-date replicas of their data from applications such as Salesforce, ServiceNow, and Zendesk in an Amazon SageMaker Lakehouse and Amazon Redshift. SageMaker Lakehouse gives you the flexibility to access and query your data in place with all Apache Iceberg-compatible tools and engines.
With data becoming the driving force behind many industries today, having a modern data architecture is pivotal for organizations to be successful. In this post, we describe Orca's journey building a transactional data lake using Amazon Simple Storage Service (Amazon S3), Apache Iceberg, and AWS Analytics.
As organizations across the globe are modernizing their data platforms with data lakes on Amazon Simple Storage Service (Amazon S3), handling slowly changing dimensions (SCDs) in data lakes can be challenging.
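As one illustration of why SCDs are tricky on a data lake, here is a minimal PySpark sketch of a Type 2 pattern using Iceberg's MERGE INTO: close out the current row for changed keys, then insert the new versions. The catalog, table, and column names (customer_id, address, is_current, valid_from, valid_to) are hypothetical, and this is one common pattern rather than the post's exact solution; it also assumes the Iceberg Spark SQL extensions are enabled.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("scd2-merge").getOrCreate()

    # Step 1: expire the current version of any row whose tracked attribute changed.
    spark.sql("""
        MERGE INTO glue_catalog.db.dim_customer AS t
        USING updates AS s
        ON t.customer_id = s.customer_id AND t.is_current = true
        WHEN MATCHED AND t.address <> s.address THEN
            UPDATE SET t.is_current = false, t.valid_to = current_timestamp()
    """)

    # Step 2: insert a fresh current version for new keys and for keys whose
    # old version was just expired (they no longer have a current row).
    spark.sql("""
        INSERT INTO glue_catalog.db.dim_customer
        SELECT s.customer_id, s.address, true AS is_current,
               current_timestamp() AS valid_from,
               CAST(NULL AS TIMESTAMP) AS valid_to
        FROM updates s
        LEFT JOIN glue_catalog.db.dim_customer t
          ON t.customer_id = s.customer_id AND t.is_current = true
        WHERE t.customer_id IS NULL
    """)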
AWS-powered data lakes, supported by the unmatched availability of Amazon Simple Storage Service (Amazon S3), can handle the scale, agility, and flexibility required to combine different data and analytics approaches. Iceberg's snapshot expiration will never remove files that are still required by a non-expired snapshot.
Customers have been using data warehousing solutions to perform their traditional analytics tasks. Traditional batch ingestion and processing pipelines that involve operations such as data cleaning and joining with reference data are straightforward to create and cost-efficient to maintain.

    options(**additional_options).mode("append").save(s3_output_folder)
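The fragment above is the tail of a PySpark DataFrame write. A minimal sketch of what the full call could look like; the original's source, format, and options are elided, so the values below (including the parquet format) are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("append-write").getOrCreate()

    df = spark.read.json("s3://source-bucket/raw/")     # hypothetical source
    s3_output_folder = "s3://target-bucket/curated/"    # hypothetical target
    additional_options = {}                             # format-specific writer options

    # Append this batch to the existing dataset at the output location.
    (df.write.format("parquet")   # the original's format is elided; parquet is a placeholder
       .options(**additional_options)
       .mode("append")
       .save(s3_output_folder))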
With the ever-increasing volume of data available, Dafiti faces the challenge of effectively managing and extracting valuable insights from this vast pool of information to gain a competitive edge and make data-driven decisions that align with company business objectives. We started with 115 dc2.large nodes.
Presto is an open source distributed SQL query engine for data analytics and the data lakehouse, designed for running interactive analytic queries against datasets of all sizes, from gigabytes to petabytes. Presto achieved this level of scalability by completely separating analytical compute from data storage.
This solution only replicates metadata in the Data Catalog, not the actual underlying data. To have a redundant data lake using Lake Formation and AWS Glue in an additional Region, we recommend replicating the Amazon S3-based storage using S3 Replication, S3 sync, the aws-s3-copy-sync-using-batch solution, or S3 Batch Replication.
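As a rough sketch of the data-side copy outside of bucket-level S3 Replication, here is a minimal boto3 loop that copies objects from a source bucket to a bucket in another Region. Bucket names, Regions, and the prefix are hypothetical, and at real data lake scale S3 Replication or S3 Batch Replication would be the better fit:

    import boto3

    SRC_BUCKET = "primary-datalake-bucket"    # hypothetical, in the primary Region
    DST_BUCKET = "secondary-datalake-bucket"  # hypothetical, in the DR Region

    src = boto3.client("s3", region_name="us-east-1")
    dst = boto3.client("s3", region_name="us-west-2")

    # Copy every object under the lake prefix into the secondary Region.
    paginator = src.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=SRC_BUCKET, Prefix="lake/"):
        for obj in page.get("Contents", []):
            dst.copy({"Bucket": SRC_BUCKET, "Key": obj["Key"]},
                     DST_BUCKET, obj["Key"])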
Furthermore, data events are filtered, enriched, and transformed to a consumable format using a stream processor. The result is made available to the application by querying the latest snapshot. For more details, refer to Create a low-latency source-to-data lake pipeline using Amazon MSK Connect, Apache Flink, and Apache Hudi.
This is the first post in a blog series that offers common architectural patterns for building real-time data streaming infrastructures using Kinesis Data Streams for a wide range of use cases. In this post, we will review the common architectural patterns of two use cases: Time Series Data Analysis and Event-Driven Microservices.
Using existing analytics tools such as Amazon Athena and Amazon QuickSight, an organization can gain insight into its estimated carbon footprint. The data architecture diagram below shows an example of how you could use AWS services to calculate and visualize an organization's estimated carbon footprint.
Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.
Building data lakes from continuously changing transactional data of databases and keeping data lakes up to date is a complex task and can be an operational challenge. You can then apply transformations and store data in Delta format for managing inserts, updates, and deletes.
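To make the inserts/updates/deletes point concrete, here is a minimal PySpark sketch of a Delta Lake merge that applies a batch of change-data-capture records. The paths, the order_id key, and the op flag ('I'/'U'/'D') are hypothetical, and it assumes the delta-spark package is available in the session:

    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cdc-merge").getOrCreate()

    changes = spark.read.json("s3://staging-bucket/cdc/")            # hypothetical CDC batch
    target = DeltaTable.forPath(spark, "s3://lake-bucket/orders/")   # hypothetical table

    # Apply deletes, updates, and inserts from the change feed in one merge.
    (target.alias("t")
        .merge(changes.alias("s"), "t.order_id = s.order_id")
        .whenMatchedDelete(condition="s.op = 'D'")
        .whenMatchedUpdateAll(condition="s.op = 'U'")
        .whenNotMatchedInsertAll(condition="s.op = 'I'")
        .execute())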
However, there is a fundamental challenge standing in the way of being successful: data. By breaking down data silos and integrating log data from multiple sources, Cloudera empowers defenders with real-time analytics to respond to threats swiftly.
In this post, we share how Poshmark improved CX and accelerated revenue growth by using a real-time analytics solution. High-level challenge: the need for real-time analytics. Previous efforts at Poshmark for improving CX through analytics were based on batch processing of analytics data and using it on a daily basis to improve CX.
From detailed design to a beta release, Tricentis had customers expecting to consume data from a data lake specific to only their data, as well as all of the data that had been generated for over a decade.

Data export

As stated earlier, some customers want to get an export of their test data and create their own data lake.
We can determine the following are needed: an open data format ingestion architecture processing the source dataset and refining the data in the S3 data lake. This requires a dedicated team of 3–7 members building a serverless data lake for all data sources.
Organizations across the world are increasingly relying on streaming data, and there is a growing need for real-time data analytics, considering the growing velocity and volume of data being collected.

    $ REGISTRY_NAME={VAL_OF_GlueSchemaRegistryName - Ref. Step 6}
    $ SCHEMA_NAME={VAL_OF_SchemaName - Ref. Step 6}
Apache Iceberg is an open table format for very large analytic datasets. Iceberg manages large collections of files as tables, and it supports modern analytical data lake operations such as record-level insert, update, delete, and time travel queries. We use a sample JSON file as input to Amazon DynamoDB.
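For the time travel piece specifically, a minimal PySpark sketch of querying an Iceberg table as of a past point in time; the catalog, table name, timestamp, and snapshot ID are hypothetical placeholders (the snapshot ID would come from the snapshots metadata table shown earlier):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("iceberg-time-travel").getOrCreate()

    # Query the table as it existed at a timestamp in the past...
    spark.sql("""
        SELECT * FROM glue_catalog.db.sample_table
        TIMESTAMP AS OF '2024-01-01 00:00:00'
    """).show()

    # ...or pin the query to an exact snapshot ID.
    spark.sql("""
        SELECT * FROM glue_catalog.db.sample_table
        VERSION AS OF 5937117119577207079
    """).show()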
Ahead of the Chief Data Analytics Officers & Influencers, Insurance event, we caught up with Dominic Sartorio, Senior Vice President for Products & Development, Protegrity, to discuss how the industry is evolving. In data-driven organizations, data is flowing.
It combines the flexibility and scalability of data lake storage with the data analytics, data governance, and data management functionality of the data warehouse. Table cleanup: as tables grow, they often accumulate unused data files, manifest files, and snapshots that aren't needed anymore.
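Iceberg ships maintenance routines for exactly this cleanup. Alongside the expire_snapshots procedure shown earlier, the remove_orphan_files procedure deletes files that no snapshot references at all; a minimal PySpark sketch, with hypothetical catalog and table names:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("table-cleanup").getOrCreate()

    # Delete files in the table location that no snapshot references
    # (e.g., leftovers from failed writes). By default only files older
    # than 3 days are considered, to avoid racing in-flight commits.
    spark.sql("""
        CALL glue_catalog.system.remove_orphan_files(
            table => 'db.sample_table'
        )
    """).show()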
Data lakes were originally designed to store large volumes of raw, unstructured, or semi-structured data at a low cost, primarily serving big data and analytics use cases. Enabling automatic compaction on Iceberg tables reduces metadata overhead and improves query performance.
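On AWS, automatic compaction for Iceberg tables registered in the Glue Data Catalog can be turned on through the table optimizer API. A minimal boto3 sketch, with the account ID, database, table, and role ARN as hypothetical placeholders; the IAM role must be allowed to read and rewrite the table's data files:

    import boto3

    glue = boto3.client("glue", region_name="us-east-1")

    # Enable managed compaction for one Iceberg table in the Data Catalog.
    glue.create_table_optimizer(
        CatalogId="123456789012",   # hypothetical account ID
        DatabaseName="db",
        TableName="sample_table",
        Type="compaction",
        TableOptimizerConfiguration={
            "roleArn": "arn:aws:iam::123456789012:role/GlueOptimizerRole",
            "enabled": True,
        },
    )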
To optimize their security operations, organizations are adopting modern approaches that combine real-time monitoring with scalable data analytics. They are using data lake architectures and Apache Iceberg to efficiently process large volumes of security data while minimizing operational overhead.
Second, because traditional data warehousing approaches are unable to keep up with the volume, velocity, and variety of data, engineering teams are building data lakes and adopting open data formats such as Parquet and Apache Iceberg to store their data.
Cloudera's open data lakehouse, powered by Apache Iceberg, solves the real-world big data challenges mentioned above by providing a unified, curated, shareable, and interoperable data lake that is accessible by a wide array of Iceberg-compatible compute engines and tools.