Which columns are problematic? What's the overall data quality score? Most data scientists spend 15-30 minutes manually exploring each new dataset: loading it into pandas, running .info(), .describe(), and .isnull().sum(), then creating visualizations to understand missing data patterns.
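A minimal sketch of that manual pass, assuming a hypothetical `customers.csv`; the file name and columns are placeholders:

```python
import pandas as pd

# Hypothetical file; replace with your own dataset.
df = pd.read_csv("customers.csv")

# Structure, dtypes, and memory footprint.
df.info()

# Summary statistics for numeric and categorical columns.
print(df.describe(include="all"))

# Missing values per column, worst offenders first.
missing = df.isnull().sum().sort_values(ascending=False)
print(missing[missing > 0])

# A rough overall "data quality" figure: share of non-null cells.
quality_score = 1 - df.isnull().to_numpy().mean()
print(f"Overall completeness: {quality_score:.1%}")
```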
When we decided to build our own data platform to meet our data needs, such as supporting reporting, business intelligence (BI), and decision-making, the main challenge—and also a strict requirement—was to make sure it wouldn’t block or delay our product development. For this, we used Debezium along with a Kafka cluster.
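As a rough sketch of that kind of setup (not necessarily how the authors configured it), a Debezium source connector can be registered with Kafka Connect's REST API; the hostnames, credentials, and table list below are placeholders:

```python
import json
import requests

# Hypothetical Kafka Connect endpoint and database details.
connector = {
    "name": "orders-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "db.internal.example.com",
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "********",
        "database.dbname": "orders",
        "topic.prefix": "app",                 # topics become app.<schema>.<table>
        "table.include.list": "public.orders",
        "plugin.name": "pgoutput",
    },
}

# Register the connector; Debezium then streams row-level changes into Kafka
# without touching the application's write path.
resp = requests.post(
    "http://connect.internal.example.com:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
```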
By Cornellius Yudha Wijaya, KDnuggets Technical Content Specialist, on July 25, 2025 in Data Engineering. Machine learning has become an integral part of many companies, and businesses that don't utilize it risk being left behind. Download the data and store it somewhere for now.
Organizations collect vast amounts of data from diverse sensor devices monitoring everything from industrial equipment to smart buildings. As a result, the data structure (schema) of the information transmitted by these devices evolves continuously.
Enterprise data is brought into data lakes and data warehouses to carry out analytical, reporting, and data science use cases using AWS analytical services like Amazon Athena, Amazon Redshift, Amazon EMR, and so on. The metadata also has foreign key constraint details.
In modern data architectures, Apache Iceberg has emerged as a popular table format for data lakes, offering key features including ACID transactions and concurrent write support. Consider a common scenario: A streaming pipeline continuously writes data to an Iceberg table while scheduled maintenance jobs perform compaction operations.
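For reference, a compaction pass like the one described is typically triggered with Iceberg's `rewrite_data_files` Spark procedure; the catalog, table name, and option value here are illustrative:

```python
from pyspark.sql import SparkSession

# Assumes a Spark session already configured with an Iceberg catalog named "glue_catalog".
spark = SparkSession.builder.appName("iceberg-maintenance").getOrCreate()

# Compact the small files written by the streaming pipeline into larger ones.
spark.sql("""
    CALL glue_catalog.system.rewrite_data_files(
        table => 'analytics.events',
        options => map('min-input-files', '5')
    )
""").show()
```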
In an era where data drives innovation and decision-making, organizations are increasingly focused on not only accumulating data but on maintaining its quality and reliability. By using AWS Glue Data Quality , you can measure and monitor the quality of your data. With this, you can make confident business decisions.
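A small sketch of defining such checks with boto3, assuming a hypothetical `sales.orders` table already registered in the Data Catalog; the DQDL rules are illustrative:

```python
import boto3

glue = boto3.client("glue")

# Illustrative DQDL rules: completeness, uniqueness, and a value-range check.
ruleset = """
Rules = [
    IsComplete "order_id",
    IsUnique "order_id",
    ColumnValues "amount" > 0
]
"""

glue.create_data_quality_ruleset(
    Name="orders-basic-quality",
    Ruleset=ruleset,
    TargetTable={"DatabaseName": "sales", "TableName": "orders"},
)
```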
4xlarge instances, providing observable gains for data processing tasks. The fact tables used the default partitioning by the date column, with the number of partitions varying from 200 to 2,100. The workload read less data from Amazon S3 and ran 4.1 times faster than Apache Spark 3.5.1. Create Iceberg tables from the TPC-DS source data.
We’re excited to announce AWS Glue Data Catalog usage metrics. This feature provides you with immediate visibility into your AWS Glue Data Catalog API usage patterns and trends. AWS Glue Data Catalog is a centralized repository that stores metadata about your organization’s datasets.
Apache Iceberg, a high-performance open table format (OTF), has gained widespread adoption among organizations managing large-scale analytic tables and data volumes. Parquet is one of the most common and fastest-growing data types in Amazon S3. ORC was specifically designed for the Hadoop ecosystem and optimized for Hive.
Organizations face significant challenges managing their big data analytics workloads. Data teams struggle with fragmented development environments, complex resource management, inconsistent monitoring, and cumbersome manual scheduling processes. Choose Data processing, then choose Add compute.
In today's data-driven world, tracking and analyzing changes over time has become essential. As organizations process vast amounts of data, maintaining an accurate historical record is crucial. History management in data systems is fundamental for compliance, business intelligence, data quality, and time-based analysis.
Data lakes were originally designed to store large volumes of raw, unstructured, or semi-structured data at a low cost, primarily serving big data and analytics use cases. By using features like Iceberg's compaction, OTFs streamline maintenance, making it straightforward to manage object and metadata versioning at scale.
In today's data-driven world, securely accessing, visualizing, and analyzing data is essential for making informed business decisions. The Amazon Redshift Data API simplifies access to your Amazon Redshift data warehouse by removing the need to manage database drivers, connections, network configurations, data buffering, and more.
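A minimal sketch with boto3, assuming a hypothetical Redshift Serverless workgroup named `analytics-wg` and a database called `dev`:

```python
import time
import boto3

client = boto3.client("redshift-data")

# Submit a query without managing drivers or connections.
resp = client.execute_statement(
    WorkgroupName="analytics-wg",
    Database="dev",
    Sql="SELECT event_type, COUNT(*) FROM events GROUP BY event_type;",
)
statement_id = resp["Id"]

# Poll until the statement finishes, then fetch the rows.
while client.describe_statement(Id=statement_id)["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
    time.sleep(1)

for row in client.get_statement_result(Id=statement_id)["Records"]:
    print(row)
```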
“We really liked [NetSuite’s] architecture and that it’s in the cloud, and it hit the vast majority of our business requirements,” Shannon notes. The ERP modernization mandate: ERP modernization is both a big undertaking and a big mandate for CIOs, and not one most relish having to do.
By Natassha Selvaraj, KDnuggets Technical Content Specialist At-Large, on June 27, 2025 in Data Science. Data analytics has changed.
Amazon SageMaker Lakehouse is a unified, open, and secure data lakehouse that now seamlessly integrates with Amazon S3 Tables, the first cloud object store with built-in Apache Iceberg support. You can then query, analyze, and join the data using Redshift, Amazon Athena, Amazon EMR, and AWS Glue.
The ability for organizations to quickly analyze data across multiple sources is crucial for maintaining a competitive advantage. Traditionally, answering this question would involve multiple data exports, complex extract, transform, and load (ETL) processes, and careful data synchronization across systems.
Language understanding benefits from every part of the fast-improving ABC of software: AI (freely available deep learning libraries like PyText and language models like BERT), big data (Hadoop, Spark, and Spark NLP), and cloud (GPUs on demand and NLP-as-a-service from all the major cloud providers).
This year, we expanded our partnership with NVIDIA , enabling your data teams to dramatically speed up compute processes for data engineering and data science workloads with no code changes using RAPIDS AI. As a machine learning problem, it is a classification task with tabular data, a perfect fit for RAPIDS.
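A rough sketch of that kind of tabular classification on GPU, assuming RAPIDS is installed and a hypothetical `transactions.csv` with a `label` column; this is illustrative, not the vendors' pipeline:

```python
import cudf
from cuml.ensemble import RandomForestClassifier
from cuml.model_selection import train_test_split

# Load the table straight into GPU memory.
df = cudf.read_csv("transactions.csv")

X = df.drop(columns=["label"]).astype("float32")
y = df["label"].astype("int32")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# GPU-accelerated random forest; the API mirrors scikit-learn.
clf = RandomForestClassifier(n_estimators=100, max_depth=12)
clf.fit(X_train, y_train)
print("Accuracy:", clf.score(X_test, y_test))
```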
2) Charts And Graphs Categories; 3) 20 Different Types Of Graphs And Charts; 4) How To Choose The Right Chart Type. Data and statistics are all around us. That is because graphical representations of data make it easier to convey important information to different audiences. Let’s start this journey by looking at a definition.
Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing ETL (extract, transform, and load), business intelligence (BI), and reporting tools.
Athena provides a simplified, flexible way to analyze petabytes of data where it lives. You can analyze data or build applications from an Amazon Simple Storage Service (Amazon S3) data lake and 30 data sources, including on-premises data sources or other cloud systems using SQL or Python.
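As an example (a sketch, with placeholder database, query, and S3 output location), such a query can be submitted from Python with boto3:

```python
import time
import boto3

athena = boto3.client("athena")

# Placeholder database, table, and results bucket.
resp = athena.start_query_execution(
    QueryString="SELECT region, COUNT(*) AS orders FROM sales_lake.orders GROUP BY region",
    QueryExecutionContext={"Database": "sales_lake"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/queries/"},
)
query_id = resp["QueryExecutionId"]

# Wait for the query to finish, then read the result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([field.get("VarCharValue") for field in row["Data"]])
```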
This integration reduces the overall time spent writing data integration and extract, transform, and load (ETL) logic. AWS Glue Studio notebooks allow you to author data integration jobs in a web-based serverless notebook interface. They also help beginner-level programmers write their first lines of code.
One such type of research paper that college students may have to write is a research paper on big data. If you have to write a research paper on big data as a college student, the first thing to note is that it’s not something you’re familiar with if you don’t major in data science or computer science.
With the exponential growth of data, companies are handling huge volumes and a wide variety of data including personally identifiable information (PII). Identifying and protecting sensitive data at scale has become increasingly complex, expensive, and time-consuming. For our solution, we use Amazon Redshift to store the data.
You can ingest and integrate data from multiple Internet of Things (IoT) sensors to get insights. However, you may have to integrate data from multiple IoT sensor devices to derive analytics like equipment health information from all the sensors based on common data elements.
In today’s world, customers manage vast amounts of data in their Amazon Simple Storage Service (Amazon S3) data lakes, which requires convoluted data pipelines to continuously understand the changes in the data layout and make them available to consuming systems. Note down values of DatabaseName and GlueCrawlerName.
Z-order is an ordering for multi-dimensional data, e.g. rows in a database table. Once data is in Z-order it is possible to efficiently search against more columns. But the version of page index filtering that we described could only search efficiently against a limited number of columns. Which are those columns?
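To make the idea concrete, here is a small illustrative sketch of computing a Z-order (Morton) value by interleaving the bits of two column values; rows sorted by this value keep nearby (x, y) pairs close together, so filters on either column can skip whole blocks:

```python
def z_order(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into a single Morton code."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)        # even bit positions come from x
        z |= ((y >> i) & 1) << (2 * i + 1)    # odd bit positions come from y
    return z

# Rows close in both dimensions end up close in Z-order.
rows = [(3, 7), (3, 6), (100, 2), (98, 3)]
for x, y in sorted(rows, key=lambda r: z_order(*r)):
    print(x, y, bin(z_order(x, y)))
```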
This blog is centered around creating incredible digital experiences powered by qualitative and quantitative data insights. Every post is about unleashing the power of digital analytics (the potent combination of data, systems, software and people). Analysts: Put up or shut up time! Isn't it amazing? 400% ROI, not bad.
NOTE: This page is a WIP (Work In Progress). AGI (Artificial General Intelligence). AI (Artificial Intelligence): application of machine learning algorithms to robotics and machines (including bots), focused on taking actions based on sensory inputs (data). 4) Credit Card Fraud Alerts. 5) Chatbots (Conversational AI). See [link].
This allows you to simplify security and governance over transactional data lakes by providing access controls at table-, column-, and row-level permissions with your Apache Spark jobs. Many large enterprise companies seek to use their transactional data lake to gain insights and improve decision-making.
Amazon Redshift is a fast, fully managed petabyte-scale cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing business intelligence (BI) tools. Amazon Redshift also supports querying nested data with complex data types such as struct, array, and map.
Data lakes have been gaining popularity for storing vast amounts of data from diverse sources in a scalable and cost-effective way. As the number of data consumers grows, data lake administrators often need to implement fine-grained access controls for different user profiles.
Data governance is the process of ensuring the integrity, availability, usability, and security of an organization’s data. Due to the volume, velocity, and variety of data being ingested in data lakes, it can get challenging to develop and maintain policies and procedures to ensure data governance at scale for your data lake.
Data-driven organizations treat data as an asset and use it across different lines of business (LOBs) to drive timely insights and better business decisions. This leads to having data across many instances of data warehouses and data lakes using a modern data architecture in separate AWS accounts.
DataBrew is a visual data preparation tool that enables you to clean and normalize data without writing any code. In DataBrew, a recipe is a set of data transformation steps that you can author interactively in its intuitive visual interface. Create a DataBrew recipe: start by registering the data store for the claims file.
Events and many other security data types are stored in Imperva’s Threat Research Multi-Region data lake. Imperva harnesses data to improve their business outcomes. As part of their solution, they are using Amazon QuickSight to unlock insights from their data.
One of the most effective ways to improve performance and minimize cost in database systems today is by avoiding unnecessary work, such as data reads from the storage layer (e.g., disks, remote storage), transfers over the network, or even data materialization during query execution. CDP Runtime 7.2.9.
Cloudera delivers an enterprise data cloud that enables companies to build end-to-end data pipelines for hybrid cloud, spanning edge devices to public or private cloud, with integrated security and governance underpinning it to protect customers' data. Dynamic row filtering & column masking. Ranger 2.0.
Apache Flink and Apache Spark are both open-source, distributed data processing frameworks used widely for big data processing and analytics. Spark is known for its ease of use, high-level APIs, and the ability to process large amounts of data. Flink also allows seamless transition and switching across these APIs.
Data is essential for businesses to make informed decisions, improve operations, and innovate. Integrating data from different sources can be a complex and time-consuming process. AWS Glue provides different authoring experiences for you to build data integration jobs. One of the most common options is the notebook.
Extracting time series on given geographical coordinates from satellite or Numerical Weather Prediction data can be challenging because of the volume of data and of its multidimensional nature (time, latitude, longitude, height, multiple parameters). Note that Lambda is a general purpose serverless engine.
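As an illustrative sketch (file name, variable, and coordinates are placeholders), xarray makes this kind of point extraction from a gridded dataset reasonably compact:

```python
import xarray as xr

# Hypothetical NetCDF file with dimensions (time, latitude, longitude).
ds = xr.open_dataset("forecast.nc")

# Pull the full time series of the "t2m" variable at the grid point nearest Berlin.
ts = ds["t2m"].sel(latitude=52.52, longitude=13.40, method="nearest")

# Convert to a pandas Series for plotting or further analysis.
series = ts.to_series()
print(series.head())
```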
There are few things more complicated in analytics (all analytics, big data and huge data!) than multi-channel attribution modeling. There is lots of missing data. And as if that were not enough, there is lots of unknowable data. Look at the last column: Assisted/Last Click or Direct Conversions.