In this blog, we will share with you in detail how Cloudera integrates core compute engines, including Apache Hive and Apache Impala, in Cloudera Data Warehouse with Iceberg. We will publish follow-up blogs for other data services. Iceberg basics: Iceberg is an open table format designed for large analytic workloads.
It's costly and time-consuming to manage on-premises data warehouses, while modern cloud data architectures can deliver business agility and innovation. However, CIOs report that agility, innovation, security, adopting new capabilities, and time to value (never cost) are the top drivers for cloud data warehousing.
The general availability covers Iceberg running within some of the key data services in CDP, including Cloudera Data Warehouse (CDW), Cloudera Data Engineering (CDE, Spark 3 with Airflow enabled), and Cloudera Machine Learning (CML).
Overview: This blog post describes support for materialized views on the Iceberg table format. Iceberg brings the reliability and simplicity of SQL tables to big data while enabling engines like Hive, Impala, Spark, Trino, Flink, and Presto to work with the same tables at the same time. The feature is available starting from the CDW Public Cloud DWX-1.6.1 release.
At the top of the hierarchy is the metadata file, which stores information about the table's schema, partition information, and snapshots. Whenever there is an update to the Iceberg table, a new snapshot of the table is created, and the metadata pointer is updated to point to the current table metadata file. The examples in this post use a cluster named iceberg-blog-cluster.
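To make the snapshot mechanics concrete, here is a minimal sketch (not from the excerpted post) that loads an Iceberg table through the Java API and walks its snapshot history; the table location, name, and class wrapper are assumptions:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.iceberg.Snapshot;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.hadoop.HadoopTables;

    public class InspectSnapshots {
        public static void main(String[] args) {
            // Load a filesystem table; the location is hypothetical.
            Table table = new HadoopTables(new Configuration())
                .load("s3://my-bucket/warehouse/db/events");

            // The current snapshot is the one the metadata pointer references.
            Snapshot current = table.currentSnapshot();
            System.out.println("current snapshot: " + current.snapshotId());

            // Every committed write adds an entry to the snapshot history.
            for (Snapshot s : table.snapshots()) {
                System.out.println(s.timestampMillis() + " -> "
                    + s.snapshotId() + " (" + s.operation() + ")");
            }
        }
    }

Each entry in the history corresponds to one committed write, which is why frequent small writes can accumulate many snapshots.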
In this blog post, we dive into different data aspects and how Cloudinary addresses the twin concerns of vendor lock-in and cost-efficient data analytics by using Apache Iceberg, Amazon Simple Storage Service (Amazon S3), Amazon Athena, Amazon EMR, and AWS Glue. For example, snapshots older than seven days can be expired with (note that expireOlderThan takes an absolute timestamp, so the seven-day offset must be subtracted from the current time):

    SparkActions.get()
        .expireSnapshots(iceTable)
        .expireOlderThan(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7))
        .execute()
With this new functionality, customers can create up-to-date replicas of their data from applications such as Salesforce, ServiceNow, and Zendesk in an Amazon SageMaker Lakehouse and Amazon Redshift. SageMaker Lakehouse gives you the flexibility to access and query your data in place with all Apache Iceberg-compatible tools and engines.
Best practice blends the application of advanced data models with the experience, intuition, and knowledge of sales management to deeply understand the sales pipeline. In this blog, we share ideas on how to best use data to manage sales pipelines and give you access to the fundamental data models that enable this process.
In this example, we use a Hive catalog, but we can switch to the AWS Glue Data Catalog with the following configuration: spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog. Before you run this step, create an S3 bucket and an iceberg folder in your AWS account with the naming convention /iceberg/.
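For context, the same catalog switch can be made when building the Spark session. This is a hedged sketch, and the catalog name my_catalog, the application name, and the warehouse bucket placeholder are assumptions rather than values from the post:

    import org.apache.spark.sql.SparkSession;

    public class GlueCatalogSession {
        public static void main(String[] args) {
            // Register an Iceberg catalog named "my_catalog" backed by AWS Glue.
            // The warehouse path below is a placeholder, not a value from the post.
            SparkSession spark = SparkSession.builder()
                .appName("iceberg-glue-example")
                .config("spark.sql.catalog.my_catalog",
                    "org.apache.iceberg.spark.SparkCatalog")
                .config("spark.sql.catalog.my_catalog.catalog-impl",
                    "org.apache.iceberg.aws.glue.GlueCatalog")
                .config("spark.sql.catalog.my_catalog.warehouse",
                    "s3://<your-bucket>/iceberg/")
                .config("spark.sql.catalog.my_catalog.io-impl",
                    "org.apache.iceberg.aws.s3.S3FileIO")
                .getOrCreate();

            // Tables are then addressed as my_catalog.<database>.<table>.
            spark.sql("SHOW NAMESPACES IN my_catalog").show();
        }
    }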
The extract, transform, and load (ETL) process has been a common pattern for moving data from an operational database to an analytics data warehouse. ELT, by contrast, loads the extracted data as is into the target first and then transforms it. Both ETL and ELT pipelines can be expensive to build and complex to manage.
Apache Hudi is an open table format that brings database and data warehouse capabilities to data lakes. Apache Hudi helps data engineers manage complex challenges, such as managing continuously evolving datasets with transactions while maintaining query performance.
Question: Can Amazon RDS for Db2 be used for running data warehousing workloads? Answer: Yes, Amazon RDS for Db2 can support analytics workloads, but it is not a data warehouse. Question: At what level are Amazon RDS snapshot-based backups taken? Answer: Snapshots are taken at the database instance level; you can also take manual snapshots as needed.
Depending on the size and usage patterns of the data, several different strategies could be pursued to achieve a successful migration. In this blog, I will describe a few strategies one could undertake for various use cases; others will be discussed in a later blog. An in-place approach is relatively fast because the underlying data files are kept where they are.
dbt is an open source, SQL-first templating engine that lets you write repeatable and extensible data transforms in Python and SQL. dbt is predominantly used by data warehouse customers (such as Amazon Redshift customers) who want to keep their data transform logic separate from storage and engine.
The takeaway: businesses need control over all their data in order to achieve AI at scale and digital business transformation. The challenge for AI is how to handle data in all its complexity: volume, variety, velocity. But it isn't just about aggregating data for models; data needs to be prepared and analyzed.
Data Modeling with erwin Data Modeler: George H., a technology manager, uses erwin Data Modeler (erwin DM) at a pharma/biotech company with more than 10,000 employees for their enterprise data warehouse. Traditional diagrams are static snapshots of a design at some point in time; "This is live and dynamic," he says.
The root of the problem comes down to trusted data. Pockets and silos of disparate data can accumulate across an enterprise, or legacy data warehouses may not be equipped to properly manage a sea of structured and unstructured data at scale. Open Data Lakehouse also offers expanded support for Python 3.10.
Introduction: Apache Iceberg has recently grown in popularity because it adds data warehouse-like capabilities to your data lake, making it easier to analyze all your data, structured and unstructured. The problem with too many snapshots: every time a write operation occurs on an Iceberg table, a new snapshot is created.
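One common mitigation, sketched below under the assumption that a Table handle is already loaded, is to expire old snapshots on a schedule through the core Iceberg Java API; the seven-day cutoff and retainLast(10) values are illustrative, not the post's:

    import java.util.concurrent.TimeUnit;
    import org.apache.iceberg.Table;

    public class SnapshotCleanup {
        // Expire snapshots older than seven days, always retaining the last 10.
        // The cutoff and retention count are illustrative assumptions.
        static void expireOldSnapshots(Table table) {
            long cutoff = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7);
            table.expireSnapshots()
                .expireOlderThan(cutoff)
                .retainLast(10)
                .commit();
        }
    }

Expiring a snapshot removes it from the table history and lets Iceberg delete the data and metadata files that only that snapshot referenced.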
We live in a world of data: there's more of it than ever before, in a ceaselessly expanding array of forms and locations. Dealing with Data is your window into the ways data teams are tackling the challenges of this era of Big Data to help their companies and their customers thrive.
Improve performance and overall manageability of Iceberg tables using the new table maintenance capabilities, such as expiring old snapshots (and removing their metadata) and compaction to combine small files for more efficient data processing. Read why the future of data lakehouses is open. ORC open file format support is also included.
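As a rough illustration of the compaction capability, the Iceberg Spark actions API can rewrite small files into larger ones. This sketch assumes a SparkSession and a loaded Table handle, and the 512 MB target size is an assumption, not a value from the post:

    import org.apache.iceberg.Table;
    import org.apache.iceberg.spark.actions.SparkActions;
    import org.apache.spark.sql.SparkSession;

    public class Compaction {
        // Rewrite small data files into files of roughly the target size.
        static void compactSmallFiles(SparkSession spark, Table table) {
            SparkActions.get(spark)
                .rewriteDataFiles(table)
                .option("target-file-size-bytes", String.valueOf(512L * 1024 * 1024))
                .execute();
        }
    }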
A range of Iceberg table analysis tasks, such as listing a table's data files, selecting a table snapshot, partition filtering, and predicate filtering, can be delegated to the Iceberg Java API instead, obviating the need for each query engine to implement them itself. The data files and metadata files in Iceberg format are immutable.
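A hedged sketch of that delegation, assuming a loaded Table handle, a known snapshot ID, and a hypothetical event_date column: the engine asks the Iceberg Java API to plan the scan (snapshot selection plus predicate filtering) and gets back only the matching data files:

    import java.io.IOException;
    import org.apache.iceberg.FileScanTask;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.TableScan;
    import org.apache.iceberg.expressions.Expressions;
    import org.apache.iceberg.io.CloseableIterable;

    public class ScanPlanning {
        // Plan a scan against a chosen snapshot with a predicate, then list
        // the matching data files. The snapshot ID and the "event_date"
        // column are hypothetical.
        static void listMatchingFiles(Table table, long snapshotId) throws IOException {
            TableScan scan = table.newScan()
                .useSnapshot(snapshotId)
                .filter(Expressions.equal("event_date", "2024-01-01"));
            try (CloseableIterable<FileScanTask> tasks = scan.planFiles()) {
                for (FileScanTask task : tasks) {
                    System.out.println(task.file().path());
                }
            }
        }
    }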
Therefore, it is critical for organizations to embrace a low-latency, scalable, and reliable data streaming infrastructure to deliver real-time business applications and better customer experiences. The application can receive events from an input Kinesis data stream and route the resulting stream to an output data stream.
Since my last blog, What you need to know to begin your journey to CDP, we have received many requests for a tool from Cloudera to analyze workloads and help upgrade or migrate to Cloudera Data Platform (CDP). After moving to CDP, take a snapshot to use as a CDP baseline.
David is one of those super-smart, funny, and nice people. His blog, Non-line Blogging, is a favourite of mine. Look 'em up. Take a snapshot of your customer database for the past 2 years and it may look like this: That is an average. We've really only just scratched the surface of LTV in this blog post.
To achieve this, they combine their CRM data with a wealth of information already available in their data warehouse, enterprise systems, or other software as a service (SaaS) applications. One widely used approach is getting the CRM data into your data warehouse and keeping it up to date through frequent data synchronization.
Amazon Redshift is a fully managed, petabyte-scale cloud data warehouse used by tens of thousands of customers to process exabytes of data every day to power their analytics workloads. Using a dimensional model, you can structure your data, measure business processes, and quickly get valuable insights.
But what most people don't realize is that behind the scenes, Uber is not just a transportation service; it's a data and analytics powerhouse. Every day, millions of riders use the Uber app, unwittingly contributing to a complex web of data-driven decisions. Uber's data platform ingests data in snapshots from operational systems.
A data lakehouse that enables multiple engines to run on the same data improves speed to market and the productivity of users. Cloudera has supported data lakehouses for over five years. Organizations need the two architectures, the data lake and the data warehouse, working together in harmony to drive value and insight from ever more data, faster.
The Analytics specialty practice of AWS Professional Services (AWS ProServe) helps customers across the globe with modern data architecture implementations on the AWS Cloud. Table data storage mode offers two options; in the Historical option, the table in the data lake stores historical updates to records (always append).
Because DE is fully integrated with the Cloudera Shared Data Experience (SDX), every stakeholder across your business gains end-to-end operational visibility, with comprehensive security and governance throughout. The admin overview page provides a snapshot of all the workloads across multi-cloud environments.
Apache Hudi is an open source transactional data lake framework that greatly simplifies incremental data processing and the development of data pipelines. Although we demonstrate with Hudi here, we will follow up with blogs on other open table formats (OTFs). For Stack name, enter a stack name (for example, rsv2-emr-hudi-blog).
A Better Way Forward: Cloudera's Open Data Lakehouse. Cloudera offers a solution to these challenges with its open data lakehouse, which combines the flexibility and scalability of data lake storage with data warehouse functionality to unify and simplify the management of cyber log data.
With CDSW, organizations can research and experiment faster, deploy models easily and with confidence, as well as rely on the wider Cloudera platform to reduce the risks and costs of data science projects. The post Now Available: Cloudera Data Science Workbench Release 1.4 appeared first on Cloudera Blog.
In a data warehouse, a dimension is a structure that categorizes facts and measures in order to enable users to answer business questions. This post is designed around a real customer use case in which you receive full snapshot data on a daily basis.
It eliminates false positives and unnecessary Dev requests by allowing data security teams to securely validate data samples with data owners prior to fix escalation, without further exposing any data unnecessarily. The platform never removes data from the customer’s environment.
Deploy your resources: To provision the resources needed for the solution, complete the following steps: Choose Launch Stack. For Stack name, enter emr-serverless-deltalake-blog. Run the EMR Serverless Spark application to load data into Delta tables: We use EMR Studio to manage and submit jobs in an EMR Serverless application.
Today, BI represents a $23 billion market and an umbrella term that describes a system for data-driven decision-making. BI leverages and synthesizes data from analytics, data mining, and visualization tools to deliver quick snapshots of business health to key stakeholders and to empower those people to make better choices.
Iceberg's branching feature: Iceberg offers a branching feature for data lifecycle management, which is particularly useful for efficiently implementing the WAP (write-audit-publish) pattern. The metadata of an Iceberg table stores a history of snapshots.
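A minimal WAP sketch using those branches, assuming a loaded Table handle; the branch name audit and the fast-forward publish step are illustrative, not necessarily how the excerpted post implements it:

    import org.apache.iceberg.Table;

    public class WriteAuditPublish {
        // Stage writes on an "audit" branch, then publish by fast-forwarding
        // main to it. Branch names and the publish strategy are assumptions.
        static void publishAfterAudit(Table table) {
            long currentId = table.currentSnapshot().snapshotId();
            table.manageSnapshots().createBranch("audit", currentId).commit();

            // ... write to and validate the audit branch here, e.g. by routing
            // Spark writes to it with spark.wap.branch=audit ...

            table.manageSnapshots().fastForwardBranch("main", "audit").commit();
        }
    }

Until the fast-forward commit, readers of main never see the unaudited data, which is the point of the pattern.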
Organizations across all industries have complex data processing requirements for their analytical use cases across different analytics systems, such as data lakes on AWS, data warehouses (Amazon Redshift), search (Amazon OpenSearch Service), NoSQL (Amazon DynamoDB), machine learning (Amazon SageMaker), and more.
The open data lakehouse is quickly becoming the standard architecture for unified multifunction analytics on large volumes of data. It combines the flexibility and scalability of data lake storage with the data analytics, data governance, and data management functionality of the data warehouse.
Time Travel: reproduce a query as of a given time or snapshot ID, which can be used, for example, for historical audits, validating ML models, and rolling back erroneous operations. The solution covered by this blog describes how Cloudera shares data with an Amazon Athena notebook. You will see the two carrier records in the table.
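For illustration, here is a hedged sketch of Iceberg time travel through the Spark DataFrame reader; the table name, snapshot ID, and epoch-millisecond timestamp are placeholders, not values from the post:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class TimeTravel {
        // Read an Iceberg table as of a snapshot ID or a point in time.
        static void readHistorical(SparkSession spark) {
            Dataset<Row> bySnapshot = spark.read()
                .option("snapshot-id", 5897183625154550000L)
                .table("my_catalog.db.flights");
            bySnapshot.show();

            Dataset<Row> byTime = spark.read()
                .option("as-of-timestamp", String.valueOf(1672531200000L))
                .table("my_catalog.db.flights");
            byTime.show();
        }
    }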
“Today’s CIOs inherit highly customized ERPs and struggle to lead change management efforts, especially with systems that [are the] backbone of all the enterprise’s operations,” wrote Isaac Sacolick, founder and president of StarCIO, a digital transformation consultancy, in a recent blog post.