Data management is the foundation of quantitative research. In this post, we focus on data management implementation options such as accessing data directly in Amazon Simple Storage Service (Amazon S3), using popular data formats like Parquet, or using open table formats like Iceberg.
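As a hedged illustration of the first option, here is a minimal sketch of reading Parquet data directly from Amazon S3 with pyarrow; the bucket, prefix, column, and filter names are hypothetical placeholders, not details from the post.

```python
# Minimal sketch: read Parquet directly from S3 with pyarrow.
# Bucket, prefix, and column names below are illustrative only.
import pyarrow.dataset as ds

# Point a dataset at a Parquet prefix in S3; pyarrow discovers the
# files and reads only the columns and row groups a query needs.
dataset = ds.dataset("s3://example-research-bucket/trades/", format="parquet")

# Project two columns and filter by date without downloading whole objects.
table = dataset.to_table(
    columns=["ticker", "close_price"],
    filter=ds.field("trade_date") == "2024-01-02",
)
print(table.num_rows)
```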
Open table formats are emerging in the rapidly evolving domain of big data management, fundamentally altering the landscape of data storage and analysis. By providing a standardized framework for data representation, open table formats break down data silos, enhance data quality, and accelerate analytics at scale.
In the era of big data, data lakes have emerged as a cornerstone for storing vast amounts of raw data in its native format. They support structured, semi-structured, and unstructured data, offering a flexible and scalable environment for data ingestion from multiple sources.
The next phase of this transformation requires an intelligent data infrastructure that can bring AI closer to enterprise data. When I speak with our customers, the challenges they describe involve integrating their data with their enterprise AI workflows.
With this new instance family, OpenSearch Service uses OpenSearch innovation and AWS technologies to reimagine how data is indexed and stored in the cloud. Today, customers widely use OpenSearch Service for operational analytics because of its ability to ingest high volumes of data while also providing rich and interactive analytics.
Customer relationship management (CRM) platforms rely heavily on big data. As these platforms become more widely used, some of the data resources they depend on become more stretched. CRM providers need to find ways to address the technical debt problem they are facing through new big data initiatives.
Enterprises and organizations across the globe want to harness the power of data to make better decisions by putting data at the center of every decision-making process. However, throughout history, data services have held dominion over their customers’ data.
With the growing emphasis on data, organizations are constantly seeking more efficient and agile ways to integrate their data, especially from a wide variety of applications. In addition, as organizations rely on an increasingly diverse array of digital systems, data fragmentation has become a significant challenge.
Businesses are constantly evolving, and data leaders are challenged every day to meet new requirements. Customers are using AWS and Snowflake to develop purpose-built data architectures that provide the performance required for modern analytics and artificial intelligence (AI) use cases.
In today's data-driven world, tracking and analyzing changes over time has become essential. As organizations process vast amounts of data, maintaining an accurate historical record is crucial. History management in data systems is fundamental for compliance, business intelligence, data quality, and time-based analysis.
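To make history management concrete, here is a minimal, library-free sketch of Type 2-style change tracking, where each change closes the previous record and opens a new one; all names are illustrative and not from the post.

```python
from datetime import datetime, timezone

def apply_change(history: list, key: str, new_value: str) -> None:
    """Record a new value for `key`, preserving the full change history
    (a simplified slowly-changing-dimension Type 2 pattern)."""
    now = datetime.now(timezone.utc)
    for row in history:
        # Close the currently open record for this key, if any.
        if row["key"] == key and row["valid_to"] is None:
            row["valid_to"] = now
    # Open a new record that is current as of now.
    history.append({"key": key, "value": new_value,
                    "valid_from": now, "valid_to": None})

history = []
apply_change(history, "customer-42", "Berlin")
apply_change(history, "customer-42", "Lisbon")
# Both rows are retained, so time-based analysis and audits can
# reconstruct what the value was at any point in time.
```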
Near-real-time streaming analytics captures the value of operational data and metrics to provide new insights and create business opportunities. Organizations with legacy, on-premises near-real-time analytics solutions typically rely on self-managed relational databases as their data store for analytics workloads.
Previously, if you needed to preserve the history of DAG runs, you had to take a backup of your metadata database and then restore that backup on the newly created environment. Now, Amazon MWAA manages the entire upgrade process, from provisioning new Apache Airflow versions to upgrading the metadata database.
We’re living in the age of real-time data and insights, driven by low-latency data streaming applications. The volume of time-sensitive data produced is increasing rapidly, with different formats of data being introduced across new businesses and customer use cases.
Best practice blends the application of advanced data models with the experience, intuition, and knowledge of sales management to deeply understand the sales pipeline. For this partnership to work, it requires sales leaders who really care about data and are open to analysts' advice about how to use the Salesforce data they generate.
In OpenSearch Service, you can deploy data nodes to store your data and respond to indexing and search requests, and you can also deploy dedicated cluster manager nodes to manage and orchestrate the cluster. When you use OpenSearch Service, you create indexes to hold your data and specify partitioning and replication for those indexes.
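A minimal sketch of what specifying partitioning and replication at index creation looks like, assuming the opensearch-py client; the endpoint, index name, and shard/replica counts are illustrative placeholders.

```python
from opensearchpy import OpenSearch

# Hypothetical domain endpoint; in practice this would be your
# OpenSearch Service domain and its authentication settings.
client = OpenSearch(hosts=[{"host": "my-domain.example.com", "port": 443}],
                    use_ssl=True)

# Partitioning (shards) and replication (replicas) are fixed at index
# creation time and determine how data is spread across data nodes.
client.indices.create(
    index="application-logs",
    body={
        "settings": {
            "index": {
                "number_of_shards": 3,     # partitions across data nodes
                "number_of_replicas": 1,   # one extra copy of each shard
            }
        }
    },
)
```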
With data becoming the driving force behind many industries today, having a modern data architecture is pivotal for organizations to be successful. In this post, we describe Orca’s journey building a transactional data lake using Amazon Simple Storage Service (Amazon S3), Apache Iceberg, and AWS Analytics.
INSITE applications are, in general, data intensive. They ingest and transform large volumes of data in different formats and processing patterns (such as batch and near real time) from various sources internal and external to Amazon. To meet these requirements, GTTS built its own data platform.
This is a guest post by Miguel Chin, Data Engineering Manager at OLX Group, and David Greenshtein, Specialist Solutions Architect for Analytics, AWS. We live in a data-producing world, and as companies want to become data-driven, there is a need to analyze more and more data.
The Analytics specialty practice of AWS Professional Services (AWS ProServe) helps customers across the globe with modern data architecture implementations on the AWS Cloud. In this post, we discuss a common use case in relation to operational data processing and the solution we built using Apache Hudi and AWS Glue.
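As a hedged sketch of the kind of operation such a solution performs, here is a minimal Apache Hudi upsert from a PySpark job (such as an AWS Glue job); the table name, key fields, sample data, and S3 path are hypothetical, and the Spark session is assumed to be configured with the Hudi bundle.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stand-in for the latest operational change records.
changes_df = spark.createDataFrame(
    [("o-1", "shipped", "2024-05-01T10:00:00")],
    ["order_id", "status", "updated_at"],
)

hudi_options = {
    "hoodie.table.name": "orders_cdc",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
}

# Hudi merges the changes into the existing table keyed by order_id,
# which is the core of an operational data processing pattern.
(changes_df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://example-lake/bronze/orders_cdc/"))
```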
In fact, according to the Identity Theft Resource Center (ITRC) Annual Data Breach Report, there were 2,365 cyber attacks in 2023 with more than 300 million victims, and a 72% increase in data breaches since 2021. However, there is a fundamental challenge standing in the way of being successful: data.
This data is used in procuring devices’ inventory to meet Amazon customers’ demands. With data volumes exhibiting a double-digit percentage growth rate year on year and the COVID pandemic disrupting global logistics in 2021, it became more critical to scale and generate near-real-time data.
In this post, I will demonstrate how to use the Cloudera Data Platform (CDP) and its streaming solutions to set up reliable data exchange in modern applications between high-scale microservices, and ensure that the internal state will stay consistent even under the highest load.
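As a hedged sketch of one reliability-focused building block such an exchange relies on, here is a minimal idempotent Kafka producer using the confluent_kafka client; the broker address, topic, and payload are placeholders, not the post's exact setup.

```python
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "broker-1:9092",  # placeholder broker
    "enable.idempotence": True,  # no duplicates on broker-side retries
    "acks": "all",               # wait for all in-sync replicas
})

def on_delivery(err, msg):
    # Surface failures instead of silently losing state changes,
    # which is what keeps microservice state consistent under load.
    if err is not None:
        print(f"delivery failed: {err}")

producer.produce("order-events", key="order-42",
                 value=b'{"state": "PAID"}', callback=on_delivery)
producer.flush()  # block until outstanding messages are delivered
```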
Many large enterprise companies seek to use their transactional data lake to gain insights and improve decision-making. Providing access controls with table-, column-, and row-level permissions in your Apache Spark jobs allows you to simplify security and governance over transactional data lakes.
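A minimal sketch of the Spark-side experience when fine-grained permissions are enforced by the lake's governance layer (for example, AWS Lake Formation); the database, table, and column names are hypothetical, and the point is that the job's code stays unchanged while enforcement happens outside it.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The job simply queries the table; the governance layer decides which
# rows and columns this job's role may see. A role without access to a
# restricted column gets an error or a filtered result, not raw data.
visible = spark.sql("SELECT customer_id, region, amount FROM sales.orders")
visible.show()
```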
How dbt Core helps data teams test, validate, and monitor complex data transformations and conversions. dbt Core, an open-source framework for developing, testing, and documenting SQL-based data transformations, has become a must-have tool for modern data teams as the complexity of data pipelines grows.
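As a hedged sketch of how testing fits into a pipeline, here is a programmatic invocation of dbt Core's test command; this assumes dbt-core 1.5 or later (which exposes dbtRunner) and a project containing a model named stg_orders, both of which are assumptions rather than details from the article.

```python
from dbt.cli.main import dbtRunner

runner = dbtRunner()

# Equivalent to `dbt test --select stg_orders` on the command line:
# run the schema and data tests defined for that (assumed) model.
result = runner.invoke(["test", "--select", "stg_orders"])
print("tests passed" if result.success else "tests failed")
```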
Foundation models (FMs) are multimodal; they work with different data types such as text, video, audio, and images. Large language models (LLMs) are a type of FM and are pre-trained on vast amounts of text data; typical applications include text generation, intelligent chatbots, and summarization.
There was a time when most CIOs would never consider putting their crown jewels — AKA customer data and associated analytics — into the cloud. And what must organizations overcome to succeed at cloud data warehousing? What are the biggest drivers of cloud data warehousing? The cloud is no longer synonymous with risk.
To build a data-driven business, it is important to democratize enterprise data assets in a data catalog. With a unified data catalog, you can quickly search datasets and determine data schema, data format, and location. For metadata reads and writes, Flink provides the catalog interface.
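A minimal sketch of registering and using a catalog from PyFlink so table metadata survives across jobs; the catalog name, type, and configuration path are illustrative (a Hive Metastore-backed catalog is one common choice and requires the matching Hive connector dependencies).

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Register an external catalog; Flink then reads and writes table
# metadata through the catalog interface instead of keeping it only
# in the current session.
t_env.execute_sql("""
    CREATE CATALOG shared_catalog WITH (
        'type' = 'hive',
        'hive-conf-dir' = '/etc/hive/conf'
    )
""")
t_env.execute_sql("USE CATALOG shared_catalog")

# Tables created now are visible to any other job using the catalog.
t_env.execute_sql("SHOW TABLES").print()
```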
In this post, we share part of the journey that Jumia took with AWS Professional Services to modernize its data platform, moving from a Hadoop distribution to AWS serverless-based solutions. The modernization covered these phases: data orchestration, data migration, data ingestion, data processing, and data maintenance.
What Is Data Intelligence? Data intelligence is a system to deliver trustworthy, reliable data. It includes intelligence about data, or metadata. IDC coined the term, stating, “data intelligence helps organizations answer six fundamental questions about data.” Why do we have data?
To optimize their security operations, organizations are adopting modern approaches that combine real-time monitoring with scalable data analytics. They are using data lake architectures and Apache Iceberg to efficiently process large volumes of security data while minimizing operational overhead.
Data lakes were originally designed to store large volumes of raw, unstructured, or semi-structured data at a low cost, primarily serving big data and analytics use cases. By using features like Iceberg's compaction, open table formats (OTFs) streamline maintenance, making it straightforward to manage object and metadata versioning at scale.
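A minimal sketch of triggering Iceberg's compaction from Spark using the rewrite_data_files procedure; the catalog and table names are placeholders, and a Spark session already configured with an Iceberg catalog and the Iceberg SQL extensions is assumed.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Merge many small data files into fewer, larger ones; Iceberg commits
# the rewrite as a new table snapshot, so readers are never disrupted.
spark.sql("""
    CALL my_catalog.system.rewrite_data_files(
        table => 'analytics.security_events'
    )
""")

# Optionally prune old snapshots to keep metadata versioning manageable.
spark.sql("""
    CALL my_catalog.system.expire_snapshots(
        table => 'analytics.security_events',
        retain_last => 10
    )
""")
```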
What is data lineage? Data lineage traces data's origin, history, and movement through various processing, storage, and analysis stages. It is used to understand data provenance, see how data is transformed, and identify potential errors or issues. What about DataOps Observability? How does it compare?