Open table formats are emerging in the rapidly evolving domain of big data management, fundamentally altering the landscape of data storage and analysis. By providing a standardized framework for data representation, open table formats break down data silos, enhance data quality, and accelerate analytics at scale.
Data management is the foundation of quantitative research. In this post, we focus on data management implementation options such as accessing data directly in Amazon Simple Storage Service (Amazon S3), using popular data formats like Parquet, or using open table formats like Apache Iceberg.
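To make these options concrete, here is a minimal PySpark sketch contrasting direct Parquet reads from S3 with reads through an Iceberg table; the bucket, catalog, and table names are hypothetical, and the catalog wiring is an assumption about the environment.

```python
from pyspark.sql import SparkSession

# A minimal session with an assumed Iceberg catalog named "glue";
# the exact catalog configuration depends on your environment.
spark = (
    SparkSession.builder
    .config("spark.sql.catalog.glue", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .getOrCreate()
)

# Option 1: read Parquet files directly from Amazon S3.
raw_df = spark.read.parquet("s3://my-research-bucket/trades/2024/")

# Option 2: read the same data through an Iceberg table, which adds
# schema evolution, snapshot isolation, and ACID semantics on top of S3.
iceberg_df = spark.table("glue.research.trades")
```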
Enterprises and organizations across the globe want to harness the power of data to make better decisions by putting data at the center of every decision-making process. Historically, however, data services have held dominion over their customers’ data.
Businesses are constantly evolving, and data leaders are challenged every day to meet new requirements. Customers are using AWS and Snowflake to develop purpose-built data architectures that provide the performance required for modern analytics and artificial intelligence (AI) use cases.
With the growing emphasis on data, organizations are constantly seeking more efficient and agile ways to integrate their data, especially from a wide variety of applications. In addition, as organizations rely on an increasingly diverse array of digital systems, data fragmentation has become a significant challenge.
In the era of big data, data lakes have emerged as a cornerstone for storing vast amounts of raw data in its native format. They support structured, semi-structured, and unstructured data, offering a flexible and scalable environment for data ingestion from multiple sources.
With data becoming the driving force behind many industries today, having a modern data architecture is pivotal for organizations to be successful. In this post, we describe Orca’s journey building a transactional data lake using Amazon Simple Storage Service (Amazon S3), Apache Iceberg, and AWS Analytics.
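For a flavor of what transactional behavior adds, here is a hedged Spark SQL sketch of an Iceberg table on S3 with a row-level MERGE; the schema and names are illustrative, not Orca’s actual design.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Iceberg catalog config assumed (see earlier sketch)

# Create a partitioned Iceberg table backed by S3.
spark.sql("""
    CREATE TABLE IF NOT EXISTS glue.sales.orders (
        order_id   STRING,
        customer   STRING,
        order_date DATE,
        amount     DECIMAL(12,2)
    )
    USING iceberg
    PARTITIONED BY (months(order_date))
    LOCATION 's3://my-datalake-bucket/warehouse/sales/orders'
""")

# Transactional row-level upsert, something plain Parquet files cannot do.
# order_updates is an assumed staging view of changed rows.
spark.sql("""
    MERGE INTO glue.sales.orders t
    USING order_updates u ON t.order_id = u.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```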
The Analytics specialty practice of AWS Professional Services (AWS ProServe) helps customers across the globe with modern data architecture implementations on the AWS Cloud. In this post, we discuss a common use case in relation to operational data processing and the solution we built using Apache Hudi and AWS Glue.
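As an illustration of the operational data processing pattern, a Hudi upsert from a Spark job (such as one run on AWS Glue) might look like the following sketch; the table name, keys, and S3 paths are assumptions, not the solution from the post.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Hudi Spark bundle assumed on the classpath
incoming_df = spark.read.parquet("s3://my-staging-bucket/orders/")  # hypothetical input

hudi_options = {
    "hoodie.table.name": "customer_orders",
    "hoodie.datasource.write.recordkey.field": "order_id",     # dedup key
    "hoodie.datasource.write.precombine.field": "updated_at",  # latest record wins
    "hoodie.datasource.write.partitionpath.field": "order_date",
    "hoodie.datasource.write.operation": "upsert",
}

(
    incoming_df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://my-datalake-bucket/hudi/customer_orders/")
)
```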
As with many burgeoning fields and disciplines, we don’t yet have a shared canonical infrastructure stack or best practices for developing and deploying data-intensive applications. Why: Data Makes It Different. Not only is data larger, but models—deep learning models in particular—are much larger than before.
In today’s data-driven world, tracking and analyzing changes over time has become essential. As organizations process vast amounts of data, maintaining an accurate historical record is crucial. History management in data systems is fundamental for compliance, business intelligence, data quality, and time-based analysis.
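One concrete mechanism for this kind of time-based analysis is open table format time travel. Below is a sketch using Apache Iceberg from Spark, reusing the hypothetical glue.sales.orders table from the earlier sketch; the timestamp and snapshot ID are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Iceberg catalog config assumed

# Query the table as it existed at a point in time (Spark 3.3+ syntax).
df_then = spark.sql("""
    SELECT * FROM glue.sales.orders TIMESTAMP AS OF '2024-06-01 00:00:00'
""")

# Or pin an exact snapshot taken from the table's snapshot history.
df_snapshot = (
    spark.read.format("iceberg")
    .option("snapshot-id", 5937117119577207000)  # hypothetical snapshot ID
    .load("glue.sales.orders")
)
```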
As businesses strive to make informed decisions, the amount of data being generated and required for analysis is growing exponentially. Dafiti, an ecommerce company that recognizes the importance of using data to drive strategic decision-making, is no exception to this trend. We started with 115 dc2.large
With Amazon EMR 6.15, we launched AWS Lake Formation based fine-grained access controls (FGAC) on Open Table Formats (OTFs), including Apache Hudi, Apache Iceberg, and Delta Lake. Many large enterprise companies seek to use their transactional data lake to gain insights and improve decision-making.
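A fine-grained grant in Lake Formation can be issued programmatically; here is a hedged boto3 sketch that gives an analyst role SELECT on just two columns of a table. The account ID, role, database, and table names are placeholders.

```python
import boto3

lf = boto3.client("lakeformation")

# Column-level grant: the analyst role may SELECT only these two columns.
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/AnalystRole"
    },
    Resource={
        "TableWithColumns": {
            "DatabaseName": "sales",
            "Name": "orders",
            "ColumnNames": ["order_id", "order_date"],
        }
    },
    Permissions=["SELECT"],
)
```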
Gupshup wanted to build a messaging analytics platform that provided detailed insights, data, and reports about WhatsApp/SMS campaigns, and that tracked the success of every text message sent by its end customers. Additionally, extract, load, and transform (ELT) data processing is sped up and made easier.
We’re living in the age of real-time data and insights, driven by low-latency data streaming applications. The volume of time-sensitive data produced is increasing rapidly, with different formats of data being introduced across new businesses and customer use cases.
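A minimal producer for such a low-latency stream, sketched with boto3 and a hypothetical stream name and record shape:

```python
import json
import boto3

kinesis = boto3.client("kinesis")

event = {"device_id": "sensor-42", "temperature": 21.7, "ts": 1718000000}

kinesis.put_record(
    StreamName="clickstream-events",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["device_id"],  # keeps one device's events ordered within a shard
)
```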
Ahead of the Chief Data Analytics Officers & Influencers, Insurance event, we caught up with Dominic Sartorio, Senior Vice President for Products & Development at Protegrity, to discuss how the industry is evolving. The last decade or so has seen insurance become as data-driven as any vertical industry.
Foundation models (FMs) are multimodal; they work with different data types such as text, video, audio, and images. Large language models (LLMs) are a type of FM and are pre-trained on vast amounts of text data and typically have application uses such as text generation, intelligent chatbots, or summarization.
Customers have been using data warehousing solutions to perform their traditional analytics tasks. Recently, data lakes have gained a lot of traction to become the foundation for analytical solutions, because they come with benefits such as scalability, fault tolerance, and support for structured, semi-structured, and unstructured datasets.
It sounds straightforward: you just need data and the means to analyze it. The data is there, in spades; data volumes have been growing for years and are predicted to reach 175 ZB by 2025. But organizations have a tough time getting their arms around their data.
This data is used in procuring device inventory to meet Amazon customers’ demands. With data volumes exhibiting double-digit percentage growth year over year and the COVID-19 pandemic disrupting global logistics in 2021, it became more critical to scale and generate near-real-time data.
In fact, according to the Identity Theft Resource Center (ITRC) Annual Data Breach Report, there were 2,365 cyber attacks in 2023 with more than 300 million victims, and a 72% increase in data breaches since 2021. However, there is a fundamental challenge standing in the way of being successful: data.
It always pays to know more about your customers, and AWS Data Exchange makes it straightforward to use publicly available census data to enrich your customer dataset. The United States Census Bureau conducts the US census every 10 years and gathers household survey data. Subscribe to census data on AWS Data Exchange.
Amazon Redshift is a fully managed, petabyte-scale cloud data warehouse that is used by tens of thousands of customers to process exabytes of data every day to power their analytics workloads. Structuring your data, measuring business processes, and getting valuable insights quickly can all be done by using a dimensional model.
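To illustrate, here is a hedged sketch of a star-schema fact table issued through the Redshift Data API; the workgroup, schema, and column names are assumptions, not a model from the post.

```python
import boto3

rsd = boto3.client("redshift-data")

# A fact table at the center of a star schema; each *_key column joins
# to a dimension table (dim_date, dim_customer, dim_product).
ddl = """
CREATE TABLE IF NOT EXISTS mart.fact_sales (
    sale_id      BIGINT,
    date_key     INTEGER,
    customer_key INTEGER,
    product_key  INTEGER,
    quantity     INTEGER,
    amount       DECIMAL(12,2)
)
DISTKEY (customer_key)
SORTKEY (date_key);
"""

rsd.execute_statement(
    WorkgroupName="analytics-wg",  # or ClusterIdentifier for provisioned clusters
    Database="dev",
    Sql=ddl,
)
```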
In this post, we share how the AWS Data Lab helped Tricentis to improve their software as a service (SaaS) Tricentis Analytics platform with insights powered by Amazon Redshift. Although Tricentis has amassed such data over a decade, the data remains untapped for valuable insights.
We discuss how to create such a solution using Amazon Kinesis Data Streams, Amazon Managed Streaming for Apache Kafka (Amazon MSK), and Amazon Kinesis Data Analytics for Apache Flink; the design decisions that went into the architecture; and the business benefits observed by Poshmark.
To build a data-driven business, it is important to democratize enterprise data assets in a data catalog. With a unified data catalog, you can quickly search datasets and figure out data schema, data format, and location. Flink’s GenericInMemoryCatalog, for example, stores the catalog data in memory.
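A minimal PyFlink sketch of this behavior, with illustrative names; by default the Table API session uses an in-memory catalog, so anything registered here disappears when the session ends.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# The built-in default catalog is backed by GenericInMemoryCatalog.
print(t_env.list_catalogs())  # e.g. ['default_catalog']

# Register a table; it lives only in the in-memory catalog for this session.
t_env.execute_sql("""
    CREATE TABLE clicks (
        user_id STRING,
        url     STRING
    ) WITH (
        'connector' = 'datagen'
    )
""")
print(t_env.list_tables())  # ['clicks']
```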
The magic behind Uber’s data-driven success: Uber, the ride-hailing giant, is a household name worldwide. But what most people don’t realize is that behind the scenes, Uber is not just a transportation service; it’s a data and analytics powerhouse. Consider the magnitude of Uber’s footprint.
Amazon Redshift is a petabyte-scale, enterprise-grade cloud data warehouse service delivering the best price-performance. Today, tens of thousands of customers run business-critical workloads on Amazon Redshift to cost-effectively and quickly analyze their data using standard SQL and existing business intelligence (BI) tools.
In this post, we share part of the journey that Jumia took with AWS Professional Services to modernize its data platform, which ran on a Hadoop distribution, to AWS serverless-based solutions. These phases are data orchestration, data migration, data ingestion, data processing, and data maintenance.
Data lakes were originally designed to store large volumes of raw, unstructured, or semi-structured data at a low cost, primarily serving big data and analytics use cases. By using features like Iceberg’s compaction, OTFs streamline maintenance, making it straightforward to manage object and metadata versioning at scale.
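Iceberg exposes this maintenance as Spark stored procedures; a sketch, assuming a catalog named glue and a table sales.orders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Iceberg catalog config assumed

# Compact small files into larger ones to speed up scans.
spark.sql("""
    CALL glue.system.rewrite_data_files(
        table => 'sales.orders',
        options => map('target-file-size-bytes', '134217728')  -- 128 MiB
    )
""")

# Related maintenance: expire old snapshots to bound metadata growth.
spark.sql("""
    CALL glue.system.expire_snapshots(
        table => 'sales.orders',
        older_than => TIMESTAMP '2024-06-01 00:00:00'
    )
""")
```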
To optimize their security operations, organizations are adopting modern approaches that combine real-time monitoring with scalable data analytics. They are using data lake architectures and Apache Iceberg to efficiently process large volumes of security data while minimizing operational overhead.
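As an example of the pattern, a security team might run an hourly failed-login rollup over an Iceberg table with Amazon Athena; this boto3 sketch uses hypothetical database, table, and output locations.

```python
import boto3

athena = boto3.client("athena")

# Top sources of failed logins in the last hour; table and columns are assumed.
resp = athena.start_query_execution(
    QueryString="""
        SELECT source_ip, count(*) AS failed_logins
        FROM security_lake.auth_events
        WHERE event_time > current_timestamp - interval '1' hour
          AND outcome = 'FAILURE'
        GROUP BY source_ip
        ORDER BY failed_logins DESC
        LIMIT 20
    """,
    QueryExecutionContext={"Database": "security_lake"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(resp["QueryExecutionId"])
```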