“Big data is at the foundation of all the megatrends that are happening.” – Chris Lynch, big data expert. We live in a world saturated with data. At present, around 2.7 zettabytes of data are floating around in our digital universe, just waiting to be analyzed and explored, according to AnalyticsWeek. In 2013, less than 0.5% of that data had ever been analyzed.
DataOps automation provides a way to boost innovation and improve collaboration related to data in pharmaceutical research and development (R&D). A typical R&D organization has many independent teams, and each team chooses a different technology platform. Mastery of Heterogeneous Tools.
In this blog series, we will discuss each of these deployments and the deployment choices made, along with how they impact reliability. In Part 1, the discussion covers serial and parallel systems reliability as a concept, Kafka clusters with and without co-located Apache ZooKeeper, and Kafka clusters deployed on VMs.
Dean Wampler provides a distilled overview of Ray, an open-source system for scaling Python applications from single machines to large clusters, in this post on the Ray project blog.
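Ray's core pattern, submitting ordinary Python functions as parallel tasks and gathering their results later, can be mimicked on a single machine with only the standard library. This sketch uses concurrent.futures as a stand-in and is not Ray's actual API; slow_square is an invented example function.

```python
from concurrent.futures import ThreadPoolExecutor

def slow_square(x):
    """Stand-in for work you would scale out across a cluster."""
    return x * x

# Submit tasks, then gather results: the same submit/get pattern
# that Ray generalizes from one machine to a whole cluster.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(slow_square, i) for i in range(8)]
    results = [f.result() for f in futures]
# results == [0, 1, 4, 9, 16, 25, 36, 49]
```

Ray replaces the executor with a cluster scheduler, but the programming model the excerpt describes is essentially this.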
This combination enables the comparison of multivariate data across multiple classes or clusters simultaneously. This visualisation uses radar polygons that can be compared based on their shape and thickness, providing insights into data variability and similarities among classes or clusters.
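One way to make the shape-and-thickness comparison concrete is to compute each class's radar-polygon area with the shoelace formula. This is a minimal pure-Python sketch; the axis values for the two classes are invented.

```python
import math

def radar_polygon_area(values):
    """Area of the radar polygon for one class: place each axis value
    at an equal angle around the circle, then apply the shoelace formula."""
    n = len(values)
    pts = [
        (v * math.cos(2 * math.pi * i / n), v * math.sin(2 * math.pi * i / n))
        for i, v in enumerate(values)
    ]
    return 0.5 * abs(sum(
        x1 * y2 - x2 * y1
        for (x1, y1), (x2, y2) in zip(pts, pts[1:] + pts[:1])
    ))

# Two classes measured on the same four axes; the larger polygon
# ("thicker" radar shape) has the larger area.
class_a = radar_polygon_area([3, 3, 3, 3])
class_b = radar_polygon_area([1, 1, 1, 1])
```

A single scalar per class obviously loses the shape information the visualisation preserves, but it is a quick numeric companion to the visual comparison.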
Read the complete blog below for a more detailed description of the vendors and their capabilities. Download the 2021 DataOps Vendor Landscape here. DataOps is a hot topic in 2021. This is not surprising given that DataOps enables enterprise data teams to generate significant business value from their data. Meta-Orchestration.
As a visualisation method, Tanglegrams are often implemented to compare and display the concordance (similarity of traits) between two datasets of hierarchical clustering. Tanglegram comparing dendrograms between volume and site index by the ADA and GADA approach, based on hierarchical clustering, in clonal teak (Tectona grandis Linn F.)
Cluster Analysis is an important problem in data analysis. Data scientists use clustering to identify malfunctioning servers, group genes with similar expression patterns, and perform various other applications. There are many families of data clustering algorithms, and you may be familiar with the most popular one: k-means.
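As a rough illustration of the k-means family mentioned above, here is a minimal pure-Python sketch; the sample points are invented, and real work would use an optimized library implementation.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
            )
            clusters[nearest].append(p)
        # Recompute each centroid; keep the old one if its cluster emptied.
        centroids = [
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
    return centroids, clusters

# Two obvious groups: points near (0, 0) and points near (10, 10).
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(pts, k=2)
```

With two well-separated groups, the algorithm recovers a 3/3 split regardless of which points seed the centroids.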
Fourteen years later, there are quite a number of Hadoop clusters in operation across many companies, though fewer companies are probably creating new Hadoop clusters.
In this blog we show how the behavior of data changes in high dimensions, where the number of features far exceeds the number of samples (P >> N). In our next blog we discuss how we try to avoid these problems in applied analysis of high-dimensional data. Each property is discussed below with R code so the reader can test it themselves.
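One such high-dimensional property, the concentration of pairwise distances, can be demonstrated without any libraries. This is a pure-Python sketch with made-up uniform random data (the original post works in R):

```python
import math
import random

def spread_of_distances(dim, n=50, seed=1):
    """Relative spread (max - min) / min of pairwise distances
    between n uniform random points in the unit cube [0, 1]^dim."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n)]
    dists = [
        math.dist(pts[i], pts[j])
        for i in range(n) for j in range(i + 1, n)
    ]
    return (max(dists) - min(dists)) / min(dists)

# As dimension grows, all pairwise distances become nearly equal,
# so the relative spread shrinks dramatically.
low_dim_spread = spread_of_distances(2)
high_dim_spread = spread_of_distances(1000)
```

This is the effect that undermines nearest-neighbor reasoning (and distance-based clustering) when P >> N.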
A Circular Dendrogram is a variation of a Dendrogram that visualises the structure of hierarchical clustering on a polar (radial) layout. Clades closer to the edges of the diagram show individual entities, while the more central clades represent groups of entities clustered together. Tools: D3.js Graph Gallery, Observable (D3.js).
The current Amazon EMR pricing page shows the estimated cost of the cluster. In this post, we share a chargeback model that you can use to track and allocate the costs of Spark workloads running on Amazon EMR on EC2 clusters. It can help you identify cost optimizations and improve the cost-efficiency of your EMR clusters.
Ray is an open-source framework that offers a simple API for scaling applications from a single computer to large clusters. The library contains an assortment of tools for machine learning and statistical modeling, including classification, regression, clustering, dimensionality reduction, and predictive data analysis.
A Dendrogram is a variation of a Tree Diagram that illustrates the arrangement of clusters formed by hierarchical clustering. Clades closer to the bottom of the diagram show individual entities, while higher clades represent groups of entities that have been clustered together. Each leaf represents an individual entity.
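The clades a dendrogram draws correspond to a sequence of cluster merges. A naive single-linkage sketch in pure Python over toy 1-D data (not the optimized algorithm real libraries use) makes that explicit:

```python
def single_linkage_merges(points):
    """Repeatedly merge the two closest clusters; each merge is one
    internal clade of the dendrogram. Returns the merge history as
    (left_cluster, right_cluster, merge_distance) tuples."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between clusters is the
                # smallest distance between any pair of their members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i][:], clusters[j][:], d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

history = single_linkage_merges([1.0, 1.1, 5.0, 5.2, 9.0])
# n leaves always produce n - 1 merges (internal clades),
# ordered from the closest pair to the final root.
```

The merge heights in `history` are exactly what a dendrogram plots on its vertical axis.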
In the previous blog post in this series, we walked through the steps for leveraging Deep Learning in your Cloudera Machine Learning (CML) projects. In this tutorial, we will illustrate how RAPIDS can be used to tackle the Kaggle Home Credit Default Risk challenge. Get the Dataset. pip install -r requirements.txt.
Text data is proliferating at a staggering rate, and only advanced coding languages like Python and R will be able to pull insights out of these datasets at scale. People looking into data science languages are usually confused about which language they should learn first: R or Python? R: Analytics powerhouse.
This approach empowers administrators to grant access directly based on existing user and group memberships federated from external IdPs, rather than relying on IAM users or roles. Solution overview Let’s consider a fictional company, OkTank. The user identities are managed externally in an external IdP: Okta.
At the time of publishing this blog post, these subscription filters support delivering logs to Amazon OpenSearch Service provisioned clusters only. In this blog post, we will show how to use Amazon OpenSearch Ingestion to deliver CloudWatch logs to OpenSearch Serverless in near real-time.
This post is co-written with Julien Lafaye from CFM. Capital Fund Management ( CFM ) is an alternative investment management company based in Paris with staff in New York City and London. CFM takes a scientific approach to finance, using quantitative and systematic techniques to develop the best investment strategies.
By automating cluster deployment this way, you reduce the risk of misconfiguration, promote consistent deployments across multiple clusters in your environment, and help to deliver business value more quickly. This blog will walk through how to deploy a secure Private Cloud Base cluster with a minimum of human interaction.
Additionally, with version 6.15, Amazon EMR introduces access control protection for its application web interfaces such as the on-cluster Spark History Server, YARN Timeline Server, and YARN Resource Manager UI. Besides the Hudi demonstration here, we will follow up with blogs covering other open table formats (OTFs).
Replication ( covered in this previous blog article ) has been released for a while and is among the most used features of Apache HBase. That means any pre-existing data on all clusters involved in the replication deployment will still need to get copied between the peers in some other way. HashTable/SyncTable in a nutshell.
Main benefits of COD include auto-scaling based on the workload utilization of the cluster, with the ability to scale the cluster up and down coming soon. In this blog, I will demonstrate how COD can easily be used as a backend system to store data and images for a simple web application. import phoenixdb.cursor.
Amazon Redshift RSQL is a native command-line client for interacting with Amazon Redshift clusters and databases. You can connect to an Amazon Redshift cluster, describe database objects, query data, and view query results in various output formats. The RSQL job performs ETL and ELT operations on the Amazon Redshift cluster.
Our previous Domino blog on the Curse of Dimensionality [2] describes weird behaviors that emerge in data when P >> N: points move far away from each other. High throughput screening technologies have been developed to measure all the molecules of interest in a sample in a single experiment. The 12 are listed in Table 1.
Amazon EMR Studio is an integrated development environment (IDE) that makes it straightforward for data scientists and data engineers to develop, visualize, and debug data engineering and data science applications written in R, Python, Scala, and PySpark. The jobs on the EMR cluster will use this runtime role to access AWS resources.
Possible permissions (zero or more letters from the set “RWXCA”): Read (R) – can read data at the given scope. Admin (A) – can perform cluster operations such as balancing the cluster or assigning regions at the given scope. The user who runs HBase on your cluster is a superuser.
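A toy model of the scope-and-permission check described above; all names here are hypothetical, and this is not HBase's implementation:

```python
# Each grant maps (user, scope) -> a subset of the letters "RWXCA".
GRANTS = {
    ("alice", "table:orders"): set("RW"),
    ("admin", "cluster"): set("RWXCA"),
}

def allowed(user, scope, perm, superuser=None):
    """Check one permission letter at one scope.
    The superuser (the account that runs the service) always passes."""
    if user == superuser:
        return True
    return perm in GRANTS.get((user, scope), set())

checks = [
    allowed("alice", "table:orders", "R"),                # granted: Read
    allowed("alice", "table:orders", "A"),                # not granted: Admin
    allowed("hbase", "cluster", "A", superuser="hbase"),  # superuser bypass
]
```

The real system resolves scopes hierarchically (cluster, namespace, table, column family); this sketch only shows the flat lookup plus the superuser rule.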
Using an Alluvial Diagram provided the ability to track a field of interest and which cluster it belonged to over a time period. Most of the research listed below focuses on the applications of Alluvial Diagrams, but in recent years more research has been done on the construction and design of Alluvial Diagrams. By Martin Rosvall, Carl T.
In previous blog posts the Four Paths to CDP and Choosing your Upgrade or Migration Path , we covered the overall business and technical issues that go into moving your legacy platform to CDP. In this blog we shift our focus to a specific area that should be given some special attention while upgrading or migrating from CDH to CDP.
Cloudera Data Science Workbench is a web-based application that allows data scientists to use their favorite open source libraries and languages — including R, Python, and Scala — directly in secure environments, accelerating analytics projects from research to production. Add it to an existing HDP cluster, and it just works.
We covered HBOSS in this previous blog post. Unfortunately, when running the HBOSS solution against larger workloads and datasets spanning over thousands of regions and tens of terabytes, lock contentions induced by HBOSS would severely hamper cluster performance. You can access COD from your CDP console. HBase on S3 review.
This is a guest blog post co-written with SangSu Park and JaeHong Ahn from SOCAR. As companies continue to expand their digital footprint, the importance of real-time data processing and analysis cannot be overstated. SOCAR is the leading Korean mobility company with strong competitiveness in car-sharing.
In the previous blog posts, we looked at application development concepts and how Cloudera Operational Database (COD) interacts with other CDP services. In this blog post, let us see how easy it is to create a COD instance and deploy a sample application that runs on that COD instance. Quick start to deploy your application.
In this blog we are going to demonstrate how these audit events can be streamed to a third-party SIEM platform via syslog, or written to a local file where existing SIEM agents may be able to pick them up. According to research by The Ponemon Institute, the average global cost of insider threats rose 31% in two years.
All of these techniques center around product clustering, where product lines or SKUs that are “closer” or more similar to each other are clustered and modeled together. In this blog post, we describe these strategies. Clustering by product group. The most intuitive way of clustering SKUs is by their product group.
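Computationally, clustering SKUs by product group is just a group-by; a minimal sketch with invented SKU data:

```python
from collections import defaultdict

# Invented SKUs: (sku_id, product_group)
skus = [
    ("SKU-1", "beverages"),
    ("SKU-2", "snacks"),
    ("SKU-3", "beverages"),
    ("SKU-4", "snacks"),
]

def cluster_by_group(skus):
    """Group SKUs so each product group can be modeled together."""
    clusters = defaultdict(list)
    for sku_id, group in skus:
        clusters[group].append(sku_id)
    return dict(clusters)

clusters = cluster_by_group(skus)
# {'beverages': ['SKU-1', 'SKU-3'], 'snacks': ['SKU-2', 'SKU-4']}
```

The similarity-based strategies the excerpt alludes to replace the fixed `group` key with a learned cluster assignment, but the downstream "model each cluster together" step is the same.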
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud that delivers powerful and secure insights on all your data with the best price-performance. With Amazon Redshift, you can analyze your data to derive holistic insights about your business and your customers. The AWS Region used for this post is us-east-1.
Through a marriage of traditional statistics with fast-paced, code-first computer science doctrine and business acumen, data science teams can solve problems with more accuracy and precision than ever before, especially when combined with soft skills in creativity and communication. 3 Components of Data Science Skills. Math and Statistics Expertise.
Solution overview The following diagram illustrates the architecture that you implement through this blog post. In the current industry landscape, data lakes have become a cornerstone of modern data architecture, serving as repositories for vast amounts of structured and unstructured data.
In this blog post, we are going to take a look at some of the OpDB-related security features of a CDP Private Cloud Base deployment. We are going to talk about auditing, different security levels, security features of Data Catalog, and client considerations. You can find part 1 of this series here.
However, any user with HDFS admin or root access on cluster nodes would be able to impersonate the “hdfs” user and access sensitive data in clear text. The capability increases security and protects sensitive data from various kinds of attack that could be internal or external to the platform.
The audit log for the above operations, with details like time, user, path, operation, client IP address, cluster name, and Ranger policy that authorized the access, are interactively available in Apache Ranger console. In addition, Apache Ranger enables policy-based dynamic column-masking and row-filtering. Cloudera Data Platform 7.2.1
Your users can access: Time Series Forecasting, Regression Techniques, Classification, Association, Correlation, Clustering, Hypothesis Testing, and Descriptive Statistics. ‘Assisted predictive modeling can take the guesswork out of analytics, by helping users to choose the right techniques to analyze the type and volume of data they use to analyze.’
A collaboration between the Met Office and EUMETSAT, detailed in Data Proximate Computation on a Dask Cluster Distributed Between Data Centres , highlights the growing need to develop a sustainable, efficient, and scalable data science solution. This solution was inspired by work with a key AWS customer, the UK Met Office.
The team can allow enterprise access to VPC resources like Virtual Service Instances running applications or VPC Red Hat OpenShift IBM Cloud clusters. All the tests should pass: pip install -r requirements.txt && pytest. A note on enterprise-to-transit cross-zone routing: the initial design worked well for enterprise <> spokes.