The CDH is used to create, discover, and consume data products through a central metadata catalog, while enforcing permission policies and tightly integrating data engineering, analytics, and machine learning services to streamline the user journey from data to insight.
These three libraries work seamlessly together to transform static datasets into responsive, visually engaging applications — all without needing a background in web development. This shift from the notebook environment to script-based development opens up new possibilities for sharing and deploying your data applications.
Data is typically organized into project-specific schemas optimized for business intelligence (BI) applications, advanced analytics, and machine learning. Whether it’s customer analytics, product quality assessments, or inventory insights, the Gold layer is tailored to support specific analytical use cases.
Quants can also gain deeper insights into current market trends and correlate them with historical patterns. Without such a system, applications risk exceeding Amazon S3 API quotas when accessing specific partitions. In this case the data was not sorted on any column, which is the default behavior; the excerpt's code fragment counts the distinct day values in the dataset (reconstructed in the sketch below).
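As a rough illustration, the sketch below rebuilds that fragment into a runnable PySpark check that lists and counts the distinct day values. The DataFrame source path and the ts timestamp column are assumptions, not details from the original article.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, col

spark = SparkSession.builder.appName("distinct-days").getOrCreate()

# Hypothetical input location and timestamp column name.
df = spark.read.parquet("s3://example-bucket/events/")

# Derive a "day" column and inspect how many distinct days the data spans.
days = df.select(to_date(col("ts")).alias("day")).distinct()
days.show(truncate=False)   # list the distinct day values
print(days.count())         # number of distinct days
```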
These improvements enhanced price-performance, enabled data lakehouse architectures by blurring the boundaries between data lakes and data warehouses, simplified ingestion and accelerated near real-time analytics, and incorporated generative AI capabilities to build natural language-based applications and boost user productivity.
Extracting valuable insights from massive datasets is essential for businesses striving to gain a competitive edge. Writing SQL queries requires not just remembering SQL syntax rules, but also knowledge of the table metadata: data about table schemas, relationships among the tables, and possible column values.
This is an important data transformation process in various real-world scenarios and industries like image processing, finance, genetics, and machine learning applications where data contains many features that need to be analyzed more efficiently. Is this a good result?
With data lineage captured at the table, column, and job level, data producers can conduct impact analysis of changes in their data pipelines and respond to data issues when needed, for example, when a column in the resulting dataset is missing the quality required by the business.
A modern data architecture needs to eliminate departmental data silos and give all stakeholders a complete view of the company: 360 degrees of customer insights and the ability to correlate valuable data signals from all business functions, like manufacturing and logistics. Application programming interfaces. Cloud computing.
Let us show you how to implement full-coverage automatic data checks on every table, column, tool, and step in your delivery process. Test Coverage Measurement: Effective test coverage measurement requires systematic application across all database levels and zones; building that coverage by hand can take months of full-time effort for a trained data engineer.
The fact tables used the default partitioning by the date column, with the number of partitions varying from roughly 200 to 2,100. For additional insights, we also examine the cost aspect. This benchmark application is built from the branch tpcds-v2.13_iceberg. Upload the benchmark application JAR file to Amazon S3.
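A minimal sketch of that upload step using boto3; the local file name, bucket, and key are placeholders, not values from the original post.

```python
import boto3

s3 = boto3.client("s3")

# Placeholder JAR name and bucket/key for the benchmark application.
s3.upload_file(
    Filename="spark-benchmark-assembly.jar",
    Bucket="example-benchmark-bucket",
    Key="jars/spark-benchmark-assembly.jar",
)
```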
To build such applications, engineering teams are increasingly adopting two trends. First, they’re replacing batch data processing pipelines with real-time streaming, so applications can derive insight and take action within seconds instead of waiting for daily or hourly batch extract, transform, and load (ETL) jobs.
Generative AI, particularly through the use of large language models (LLMs), has become a focal point for creating intelligent applications that deliver personalized experiences. For example, businesses can use generative AI for sentiment analysis of customer reviews, transforming vast amounts of feedback into actionable insights.
In a relative sense, different domains and applications require different levels of data cleaning. Not all columns are equal, so you need to prioritize cleaning the data features that matter to your model and your business outcomes. “It can end up, at best, wasting a lot of time and effort.”
AWS Glue 5.0 upgrades the Spark engines to Apache Spark 3.5.2 and Python 3.11, giving you newer Spark and Python releases so you can develop, run, and scale your data integration workloads and get insights faster. He enjoys helping customers uncover insights and make better decisions using their data with AWS Analytics services.
Parquet also provides excellent compression and efficient I/O by enabling selective column reads, reducing the amount of data scanned during queries. Please refer to section “Query and Join data from these S3 Tables to build insights” for query details. Take note of the application-id to use later for launching the jobs.
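A minimal sketch of the selective column reads described above: because Parquet is columnar, only the projected columns are read from storage. The file path, column names, and filter are assumptions for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-column-pruning").getOrCreate()

# Hypothetical Parquet location; only the three selected columns are scanned,
# which reduces I/O compared with reading the full table.
orders = spark.read.parquet("s3://example-bucket/gold/orders/")

(
    orders.select("order_id", "order_date", "total_amount")
    .filter("order_date >= '2024-01-01'")
    .show(10)
)
```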
The next generation of Amazon SageMaker with Amazon EMR in Amazon SageMaker Unified Studio addresses these pain points through an integrated development environment (IDE) where data workers can develop, test, and refine Spark applications in one consistent environment. Create and configure a Spark application.
Over time, as organizations began to explore broader applications, data lakes have become essential for various data-driven processes beyond just reporting and analytics. The Data Catalog also now supports heavily nested complex data and supports schema evolution as you reorder or rename columns.
Data Visualization: A drag-and-drop smart visualization engine allows the user to select the best-fit and most appropriate options to visualize a particular dataset based on data columns, types, data volume, and other factors.
An organization’s data can come from various sources, including cloud-based pipelines, partner ecosystems, open table formats like Apache Iceberg, software as a service (SaaS) platforms, and internal applications. We use large language models (LLMs) in Amazon Bedrock to automatically generate key elements for custom structured assets.
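As an illustrative sketch of calling an LLM in Amazon Bedrock from Python to draft such metadata: the model ID, prompt, and response handling are assumptions, not details from the excerpt, and the chosen model must be enabled in your account.

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Hypothetical prompt asking the model to draft a description for a custom structured asset.
prompt = "Write a one-sentence business description for a table named customer_churn."

# The Converse API gives a model-agnostic request/response shape; the model ID is a placeholder.
response = bedrock.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    messages=[{"role": "user", "content": [{"text": prompt}]}],
    inferenceConfig={"maxTokens": 200},
)

print(response["output"]["message"]["content"][0]["text"])
```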
Organizations run millions of Apache Spark applications each month to prepare, move, and process their data for analytics and machine learning (ML). Building and maintaining these Spark applications is an iterative process, where developers spend significant time testing and troubleshooting their code.
With the newly released feature of Amazon Redshift Data API support for single sign-on and trusted identity propagation , you can build data visualization applications that integrate single sign-on (SSO) and role-based access control (RBAC), simplifying user management while enforcing appropriate access to sensitive information.
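A minimal sketch of calling the Amazon Redshift Data API from Python; the workgroup, database, and SQL text are placeholders, and the single sign-on and trusted identity propagation setup described in the excerpt is assumed to be configured separately.

```python
import time
import boto3

client = boto3.client("redshift-data", region_name="us-east-1")

# Placeholder workgroup/database/SQL; with trusted identity propagation configured,
# access to sensitive columns is enforced based on the calling user's identity.
response = client.execute_statement(
    WorkgroupName="example-workgroup",
    Database="dev",
    Sql="SELECT region, SUM(sales) AS total_sales FROM public.orders GROUP BY region;",
)

statement_id = response["Id"]

# Poll until the statement finishes, then fetch the result rows.
while client.describe_statement(Id=statement_id)["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
    time.sleep(1)

result = client.get_statement_result(Id=statement_id)
for row in result["Records"]:
    print(row)
```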
Unified data quality management: The WAP (write-audit-publish) pattern separates the audit and publish logic from the writer applications. By configuring the write with .option("merge-schema", "true") before .append(), new columns from the source are added to the target table with NULL values for existing rows when schema changes occur (reconstructed in the sketch below).
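A hedged reconstruction of that write as a complete snippet, assuming an Apache Iceberg target table registered in the session's catalog; the table and path names are placeholders, and the table is assumed to allow schema merging (for example via the write.spark.accept-any-schema table property).

```python
from pyspark.sql import SparkSession

# Iceberg catalog configuration is environment-specific and omitted here.
spark = SparkSession.builder.appName("iceberg-schema-merge").getOrCreate()

# Hypothetical source DataFrame whose schema may contain new columns.
incoming = spark.read.parquet("s3://example-bucket/staging/customer_churn/")

# The merge-schema write option adds new source columns to the target table,
# backfilling NULL for existing rows.
(
    incoming.writeTo("glue_catalog.db.customer_churn")
    .option("merge-schema", "true")
    .append()
)
```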
(no blank cells or mixed formats); use data validation to create dropdowns for categories or statuses; include a timestamp column if you plan to track trends over time. Be sure to include all necessary columns that represent your axes or variables. Google Sheets will insert a chart onto your sheet, initially as a blank canvas.
Without automated tracking, unmapped data flows move between pipelines, APIs, and third-party applications without oversight, leading to shadow data: redundant, outdated, and unstructured datasets that exist outside official repositories, creating compliance blind spots.
Data lakes are a powerful architecture to organize data for analytical processing, because they let builders use efficient analytical columnar formats like Apache Parquet , while letting them continue to modify the shape of their data as their applications evolve with open table formats like Apache Iceberg.
This flexibility accelerates insights and improves resource utilization across the analytics stack. We then use LF-Tags to share restricted columns of this view with the downstream engineering team. On the EMR Studio dashboard, choose Create application. You will be directed to the Create application page on EMR Studio.
This capability enables Data Manipulation Language (DML) operations including CREATE , ALTER , DELETE , UPDATE , and MERGE INTO statements on Apache Hive and Iceberg tables from within the same Apache Spark application. To do this, follow the steps in Application integration for full table access. Migrate an AWS Glue 4.0
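For illustration, a minimal MERGE INTO statement of the kind the excerpt describes, issued through spark.sql; the table and column names are placeholders, and the target is assumed to be an Iceberg table reachable through the session's configured catalog.

```python
from pyspark.sql import SparkSession

# Catalog configuration for Iceberg/Hive is environment-specific and omitted here.
spark = SparkSession.builder.appName("merge-into-example").getOrCreate()

# Upsert rows from a staging table into the target using placeholder names.
spark.sql("""
    MERGE INTO analytics.customers AS t
    USING staging.customer_updates AS s
    ON t.customer_id = s.customer_id
    WHEN MATCHED THEN UPDATE SET t.email = s.email
    WHEN NOT MATCHED THEN INSERT *
""")
```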
With AWS Glue, organizations can discover, prepare, and combine data for analytics, machine learning (ML), AI, and application development. This was mainly caused by heavy shuffling on a specific column. To use the G.12X or G.16X worker types on an AWS Glue Studio notebook or interactive sessions, set the worker type accordingly. This is the same pricing as the existing worker types.
Low-Code Development: Low-code development allows programmers and developers to quickly and easily create applications using tools that simplify the development process with drag-and-drop components, enabling the team to add features without writing code from scratch.
For decades, they have been struggling with scale, speed, and correctness required to derive timely, meaningful, and actionable insights from vast and diverse big data environments. When you run Apache Spark applications on Athena, you submit Spark code for processing and receive the results directly.
Under the “what you need” column, consider functional and non-functional lenses, Briggs advises. “I like the way Oracle is embedding AI into the ERP and SCM Fusion applications, and I like the opportunity for the smaller, quarterly updates to the apps versus the big-bang-every-five-years upgrades to other systems,” Neumeier explains.
Example Retail's leadership is interested in understanding customer and business insights across thousands of customer touchpoints for millions of customers, insights that will help them build sales, marketing, and investment plans. Now grant the project role access to a subset of columns from the customer_churn dataset. Choose Grant.
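The console grant above can also be expressed through the AWS SDK. A hedged sketch using Lake Formation's grant_permissions call, with a placeholder role ARN, database name, and column list:

```python
import boto3

lf = boto3.client("lakeformation", region_name="us-east-1")

# Placeholder principal, database, table, and columns: grants the project role
# SELECT on only a subset of columns from the customer_churn dataset.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/example-project-role"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "customer_db",
            "Name": "customer_churn",
            "ColumnNames": ["customer_id", "churn_score", "last_purchase_date"],
        }
    },
    Permissions=["SELECT"],
)
```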
Analytics remained one of the key focus areas this year, with significant updates and innovations aimed at helping businesses harness their data more efficiently and accelerate insights. This zero-ETL integration reduces the complexity and operational burden of data replication to let you focus on deriving insights from your data.
For the age column, fill any remaining missing values with the median age. For the category column, fill any missing values with the string unknown. Instead of asking the AI to build an entire application at once, guide it through the process. Lacking these insights, what good is having a chunk of AI-generated code?
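A minimal pandas sketch of those two fill rules; the input file and column names are placeholders for whatever dataset the article is working with.

```python
import pandas as pd

# Hypothetical input file with "age" and "category" columns.
df = pd.read_csv("customers.csv")

# Fill remaining missing ages with the median age.
df["age"] = df["age"].fillna(df["age"].median())

# Fill missing categories with the string "unknown".
df["category"] = df["category"].fillna("unknown")
```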
On the CloudWatch console, choose Logs in the navigation pane, then choose Log Insights. Filter by the Lambda UDF and use the following query to identify the number of Lambda invocations. This helps track usage patterns and execution frequency.
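The same count can be pulled programmatically with the CloudWatch Logs Insights API; a sketch using boto3, with a placeholder log group name and an illustrative query string that counts invocation report lines over the last 24 hours.

```python
import time
import boto3

logs = boto3.client("logs", region_name="us-east-1")

# Placeholder log group for the Lambda UDF; the query counts REPORT lines,
# one of which is emitted per invocation.
query = logs.start_query(
    logGroupName="/aws/lambda/example-udf",
    startTime=int(time.time()) - 24 * 3600,
    endTime=int(time.time()),
    queryString="filter @type = 'REPORT' | stats count(*) as invocations",
)

# Poll until the query completes, then print the result rows.
while True:
    results = logs.get_query_results(queryId=query["queryId"])
    if results["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

print(results["results"])
```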
Enhanced Capabilities: Here are some key capabilities of translytical write-back: add multiple columns to the edit form; use textboxes, dropdowns, or buttons as controls; easily implement validation; create visually appealing custom interfaces; build your own versioning logic (yes, it’s possible!). Instantly share insights.
If you rely on IT or external consultants to make custom reporting changes – adding columns, adding data sources, and more – this causes delays that eat into the time you have available for analysis. This means you have more time to analyze the data and generate insights, rather than wasting time on the creation of the report.
Figure 1: Enterprise Data Catalogs interact with AI in two ways These regulations require organizations to document and control both traditional and generative AI models, whether they build them or incorporate them into their own applications, thus driving demand for data catalogs that support compliance.
I wrote an extensive piece on the power of graph databases, linked data, graph algorithms, and various significant graph analytics applications. As you read this, just remember the most important message: the natural data structure of the world is not rows and columns, but a graph.
Spreadsheets finally took a backseat to actionable and insightful data visualizations and interactive business dashboards. A survey conducted by the Business Application Research Center identified data quality management as the most important trend in 2020 (source: Business Application Research Center). Agile and flexible.
“The goal is to turn data into information, and information into insight.” – Carly Fiorina, former executive and president of HP. Digital data is all around us. In fact, we create around 2.5 quintillion bytes of data every single day, with 90% of the world’s digital insights generated in the last two years alone, according to Forbes.
We have already covered many types of graphs and charts , including bar charts , column charts , area charts , line charts , and more. A table graph is a type of data visualization that uses rows and columns to organize and display numerical or textual data. That is, if columns and rows are arranged correctly.