A data lake is a centralized repository that you can use to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and then run different types of analytics for better business insights. Choose Next to create your stack.
Amazon DataZone now supports authentication through the Amazon Athena JDBC driver, allowing data users to seamlessly query their subscribed data lake assets via popular business intelligence (BI) and analytics tools like Tableau, Power BI, Excel, SQL Workbench, DBeaver, and more.
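As a rough illustration of what a client does with that driver, the sketch below opens a JDBC connection to Athena from Python. The driver class, JAR path, bucket, table names, and connection properties are all assumptions; the exact property names vary by driver version, and credential configuration (omitted here) depends on your DataZone and IAM setup.

```python
import jaydebeapi

# Assumed Simba Athena JDBC driver class and JAR location; adjust to match
# the driver you downloaded. Credentials are left to the driver's configured
# credentials provider in this sketch.
conn = jaydebeapi.connect(
    "com.simba.athena.jdbc.Driver",
    "jdbc:awsathena://AwsRegion=us-east-1;"
    "S3OutputLocation=s3://example-athena-results/;"  # hypothetical bucket
    "Workgroup=primary;",
    jars="/opt/jars/AthenaJDBC42.jar",  # hypothetical path
)
cursor = conn.cursor()
cursor.execute('SELECT * FROM "example_db"."example_table" LIMIT 10')
print(cursor.fetchall())
cursor.close()
conn.close()
```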
This led to inefficiencies in data governance and access control. AWS Lake Formation is a service that streamlines and centralizes the data lake creation and management process. The Solution: How BMW CDH solved data duplication The CDH is a company-wide data lake built on Amazon Simple Storage Service (Amazon S3).
A domain has an important job and a dedicated team (five to nine members) who develop an intimate knowledge of data sources, data consumers, and functional nuances. They deal, for example, with managing ordered data dependencies, inter-domain communication, shared infrastructure, and incoherent workflows.
With data becoming the driving force behind many industries today, having a modern data architecture is pivotal for organizations to be successful. In this post, we describe Orca’s journey building a transactional data lake using Amazon Simple Storage Service (Amazon S3), Apache Iceberg, and AWS Analytics.
However, enterprises often encounter challenges with data silos, insufficient access controls, poor governance, and quality issues. Embracing data as a product is the key to addressing these challenges and fostering a data-driven culture. To incorporate this third-party data, AWS Data Exchange is the logical choice.
Customers often want to augment and enrich SAP source data with other non-SAP source data. Such analytic use cases can be enabled by building a data warehouse or data lake. Customers can now use the AWS Glue SAP OData connector to extract data from SAP.
Data analytics on operational data in near-real time is becoming a common need. Due to the exponential growth of data volume, it has become common practice to replace read replicas with data lakes to gain better scalability and performance. For more information, see Changing the default settings for your data lake.
On your project, in the navigation pane, choose Data. For Add data source, choose Add connection. For Host, enter the host name of your Aurora PostgreSQL database cluster. The excerpt then trails off into a code fragment that builds a JDBC URL from connection properties and writes a DataFrame over JDBC; a reconstruction follows below.
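A minimal sketch of what that truncated fragment likely does, assuming a connection_properties dictionary with HOST, PORT, and DATABASE keys as shown in the fragment (the USER and PASSWORD keys, target table, and sample DataFrame are hypothetical additions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-ins; the original post reads these from its stored connection.
connection_properties = {
    "HOST": "example-cluster.cluster-abc123.us-east-1.rds.amazonaws.com",
    "PORT": "5432",
    "DATABASE": "postgres",
    "USER": "admin",
    "PASSWORD": "example-password",
}
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Build the PostgreSQL JDBC URL from the connection properties.
jdbc_url = "jdbc:postgresql://{}:{}/{}".format(
    connection_properties["HOST"],
    connection_properties["PORT"],
    connection_properties["DATABASE"],
)

# Write the DataFrame over JDBC; the table name here is hypothetical.
(df.write.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "public.example_table")
    .option("user", connection_properties["USER"])
    .option("password", connection_properties["PASSWORD"])
    .option("driver", "org.postgresql.Driver")
    .mode("append")
    .save())
```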
Data storage and databases. Your SaaS company can store and protect any amount of data using Amazon Simple Storage Service (Amazon S3), which is ideal for data lakes, cloud-native applications, and mobile apps. This blog post has demonstrated how AWS can greatly benefit your SaaS company on multiple levels. Easy to use.
This involves creating VPC endpoints in both the AWS and Snowflake VPCs, making sure data transfer remains within the AWS network. Use Amazon Route 53 to create a private hosted zone that resolves the Snowflake endpoint within your VPC. Open the secret blog-glue-snowflake-credentials. Choose Edit. Choose Next.
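If you prefer to script the private hosted zone rather than click through the console, here is a sketch with boto3; the zone name, Region, and VPC ID are hypothetical, and you should use the PrivateLink endpoint name Snowflake provides for your account:

```python
import uuid
import boto3

route53 = boto3.client("route53")

response = route53.create_hosted_zone(
    Name="privatelink.snowflakecomputing.com",  # assumed Snowflake endpoint domain
    VPC={"VPCRegion": "us-east-1", "VPCId": "vpc-0123456789abcdef0"},
    CallerReference=str(uuid.uuid4()),  # must be unique per request
    HostedZoneConfig={
        "Comment": "Resolves the Snowflake endpoint inside the VPC",
        "PrivateZone": True,
    },
)
print(response["HostedZone"]["Id"])
```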
You need to determine whether you are going with an on-premises or cloud-hosted strategy. For example, you can track the amount of business information fed into a data lake weekly, and therefore have the advantage of reacting immediately if issues arise. Then, you need to choose and set up the right BI solution for your organization!
In addition to AKS and the load balancers mentioned above, this includes VNET, Data Lake Storage, Azure Database for PostgreSQL, and more. By default, Azure Data Lake Storage, Azure Database for PostgreSQL, and Virtual Machines are accessible over public endpoints. Additional Aspects of a Private CDW Environment on Azure. Next Steps.
This blog post outlines detailed step-by-step instructions to perform Hive replication from an on-premises CDH cluster to a CDP Public Cloud Data Lake. CDP Data Lake cluster versions: CM 7.4.0. Pre-Check: Data Lake Cluster. Understanding Ranger Policies in the Data Lake Cluster.
The Solution: CDP Private Cloud brings a next-generation hybrid architecture with cloud-native benefits to HBL’s data platform. HBL started its data journey in 2019, when a data lake initiative was launched to consolidate complex data sources and enable the bank to use a single version of truth for decision making.
The technological linchpin of its digital transformation has been its Enterprise Data Architecture & Governance platform. It hosts over 150 big data analytics sandboxes across the region with over 200 users utilizing the sandbox for data discovery.
Data lakes are designed for storing vast amounts of raw, unstructured, or semi-structured data at a low cost, and organizations share those datasets across multiple departments and teams. The queries on these large datasets read vast amounts of data and can perform complex join operations on multiple datasets.
With AWS Glue, you can discover and connect to hundreds of diverse data sources and manage your data in a centralized data catalog. It enables you to visually create, run, and monitor extract, transform, and load (ETL) pipelines to load data into your data lakes. Choose Store a new secret.
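The "Store a new secret" console step can also be scripted. A minimal sketch with boto3 and AWS Secrets Manager, where the secret name and key names are hypothetical and should match whatever your Glue connection or job expects to read:

```python
import json
import boto3

secretsmanager = boto3.client("secretsmanager")

# Hypothetical name and payload for the source credentials.
secretsmanager.create_secret(
    Name="example/glue/source-credentials",
    SecretString=json.dumps({
        "username": "admin",
        "password": "example-password",
    }),
)
```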
In modern enterprises, the exponential growth of data means organizational knowledge is distributed across multiple formats, ranging from structured data stores such as data warehouses to multi-format data stores like data lakes. This makes gathering information for decision making a challenge.
Over the past decade, deep learning arose from a seismic collision of data availability and sheer compute power, enabling a host of impressive AI capabilities. All watsonx.ai models are trained on IBM’s curated, enterprise-focused data lake, on our custom-designed, cloud-native AI supercomputer, Vela.
Cloudera’s Data Warehouse service allows raw data to be stored in the cloud storage of your choice (S3, ADLSg2). Data is stored in your own namespace, and you are not forced to move it into someone else’s proprietary file formats or hosted storage. Proprietary file formats mean no one else is invited in! Separate compute.
It comprises commodity cloud object storage, open data and open table formats, and high-performance open-source query engines. To help organizations scale AI workloads, we recently announced IBM watsonx.data, a data store built on an open data lakehouse architecture and part of the watsonx AI and data platform.
The attack targeted a host of public and private sector organizations (18,000 customers) including NASA, the Justice Department, and Homeland Security, and it is believed the attackers persisted on SolarWinds systems for 14 months prior to discovery. All with the integrated security and governance technologies required for compliance.
In addition, data pipelines include more and more stages, making it difficult for data engineers to compile, manage, and troubleshoot those analytical workloads. Those incremental costs arise for a variety of reasons: increased data processing costs associated with legacy deployment types (e.g., CRM platforms).
Furthermore, TDC Digital had not used any cloud storage solution and experienced latency and downtime while hosting the application in its data center. TDC Digital is excited about its plans to host its IT infrastructure in IBM data centers, offering better scalability, performance and security.
Two private subnets are used to set up the Amazon MWAA environment, and the third private subnet is used to host the AWS Lambda authorizer function. For Bucket name, enter a name for your bucket (for this post, mwaa-sso-blog- ). For Stack name, enter a name (for example, sso-blog-mwaa-infra-stack). Choose Create bucket.
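The bucket-creation step can equally be scripted. A minimal boto3 sketch follows; the post truncates the bucket name, so the suffix here is a hypothetical stand-in (S3 bucket names must be globally unique):

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# "example-suffix" is a placeholder for the truncated part of the name.
s3.create_bucket(Bucket="mwaa-sso-blog-example-suffix")
```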
With data growing at a staggering rate, managing and structuring it is vital to your survival. In this piece, we detail the Israeli debut of Periscope Data. Driving startup growth with the power of data. It’s why Sisense, having merged with Periscope Data in May 2019, chose to host this event in Tel Aviv.
Introduction In the previous blog post, we covered the high availability feature of Cloudera Operational Database (COD) on AWS. We will not repeat ourselves, so it’s assumed that technologies and concepts like HA, Multi-AZ, and operational databases are already known to the reader from the previous blog post.
In this blog, I will demonstrate the value of Cloudera DataFlow (CDF), the edge-to-cloud streaming data platform available on the Cloudera Data Platform (CDP), as a data integration and democratization fabric.
It supports both data quality at rest and data quality in AWS Glue extract, transform, and load (ETL) pipelines. Data quality at rest focuses on validating the data stored in data lakes, databases, or data warehouses. It ensures that the data meets specific quality standards before it is consumed.
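As a rough sketch of the at-rest flavor, a ruleset can be registered against a Data Catalog table with boto3; the database name, table name, and DQDL rules below are hypothetical:

```python
import boto3

glue = boto3.client("glue")

# Illustrative DQDL rules for a hypothetical orders table.
glue.create_data_quality_ruleset(
    Name="example-orders-ruleset",
    Ruleset="""Rules = [
        IsComplete "order_id",
        IsUnique "order_id",
        ColumnValues "status" in ["PENDING", "SHIPPED", "DELIVERED"]
    ]""",
    TargetTable={
        "DatabaseName": "example_db",
        "TableName": "orders",
    },
)
```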
It takes in three arguments, the first being the Amazon S3 location of the data file that is read in by the Spark job. The input_full_path is s3://aws-blogs-artifacts-public/artifacts/BDB-2997/sample-data/input/part-00000-a0885743-e0cb-48b1-bc2b-05eb748ab898-c000.snappy.parquet. He has worked in the data and analytics field for over 14 years.
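A minimal sketch of how the job might consume that first argument; the other two arguments are not described in the excerpt, so they are left out:

```python
import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# First job argument: the S3 location of the input file.
input_full_path = sys.argv[1]

df = spark.read.parquet(input_full_path)
df.show(5)
```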
Well, firstly, if the main data warehouses, repositories, or application databases that BusinessObjects accesses are on premises, it makes no sense to move BusinessObjects to the cloud until you move its data sources to the cloud. You also have the option of hosting with a third party.
And knowing the business purpose translates into actively governing personal data against potential privacy and security violations. Do You Know Where Your Sensitive Data Is? Data is a valuable asset used to operate, manage and grow a business.
On May 3, 2023, Cloudera kicked off a contest called “Best in Flow” for NiFi developers to compete to build the best data pipelines. This blog post congratulates our winner and reviews the top submissions. RK built some simple flows to pull streaming data into Google Cloud Storage and Snowflake.
Additionally, lines of business (LOBs) are able to gain access to a shared data lake that is secured and governed by the use of Cloudera Shared Data Experience (SDX). According to 451 Research’s Voice of the Enterprise: Cloud, Hosting & Managed Services study, 58% of enterprises are moving towards a hybrid IT environment.
Building data lakes from the continuously changing transactional data in databases, and keeping them up to date, is a complex task and can be an operational challenge. You can then apply transformations and store data in Delta format to manage inserts, updates, and deletes. For Type, choose Spark.
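A minimal sketch of that upsert pattern with the Delta Lake API, assuming a CDC batch keyed on order_id with an op column marking deletes; the paths, column names, and Spark configs are hypothetical:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Hypothetical locations for the Delta table and the incoming CDC batch.
target = DeltaTable.forPath(spark, "s3://example-bucket/delta/orders/")
updates = spark.read.parquet("s3://example-bucket/cdc/orders/")

(target.alias("t")
    .merge(updates.alias("s"), "t.order_id = s.order_id")
    .whenMatchedDelete(condition="s.op = 'D'")  # apply deletes
    .whenMatchedUpdateAll()                     # apply updates
    .whenNotMatchedInsertAll()                  # apply inserts
    .execute())
```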
Previously, there were three types of data structures in telco: entity data sets (i.e., marketing data lakes). More global telco companies like LG Uplus, Deutsche Telekom, and Vodafone Automotive continue to build their data futures on Cloudera. Read their stories and more on cloudera.com/telco.
The typical Cloudera Enterprise Data Hub Cluster starts with a few dozen nodes in the customer’s data center hosting a variety of distributed services. Over time, workloads start processing more data, tenants start onboarding more workloads, and administrators (admins) start onboarding more tenants. Cloudera Manager (CM) 6.2
2020 saw us hosting our first ever fully digital Data Impact Awards ceremony, and it certainly was one of the highlights of our year. We saw a record number of entries and incredible examples of how customers were using Cloudera’s platform and services to unlock the power of data. Show us what is possible!
The data lakehouse is gaining in popularity because it enables a single platform for all your enterprise data with the flexibility to run any analytic and machine learning (ML) use case. Cloud data lakehouses provide significant scaling, agility, and cost advantages compared to cloud data lakes and cloud data warehouses.
This is a guest blog post co-written with Sumesh M R from Cargotec and Tero Karttunen from Knowit Finland. Cargotec captures terabytes of IoT telemetry data from their machinery operated by numerous customers across the globe. The source code for the application is hosted on the AWS Glue GitHub.
Digging into quantitative data · Why is quantitative data important? · What are the problems with quantitative data? · Exploring qualitative data · Qualitative data benefits · Getting the most from qualitative data · Better together. First, data isn’t created in a uniform, consistent format.
At Stitch Fix, we have been powered by data science since our founding and rely on many modern data lake and data processing technologies. In our infrastructure, Apache Kafka has emerged as a powerful tool for managing event streams and facilitating real-time data processing.
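For flavor, here is a minimal Kafka consumer in Python using confluent-kafka; the broker address, group ID, and topic name are hypothetical, not Stitch Fix's actual setup:

```python
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # hypothetical broker
    "group.id": "example-event-processors",  # hypothetical consumer group
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["example-events"])        # hypothetical topic

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue
        if msg.error():
            print(f"Consumer error: {msg.error()}")
            continue
        print(f"{msg.topic()}[{msg.partition()}] @ {msg.offset()}: {msg.value()!r}")
finally:
    consumer.close()
```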
Data lakes, while useful in helping you to capture all of your data, are only the first step in extracting the value of that data.