Language understanding benefits from every part of the fast-improving ABC of software: AI (freely available deep learning libraries like PyText and language models like BERT), big data (Hadoop, Spark, and Spark NLP), and cloud (GPUs on demand and NLP-as-a-service from all the major cloud providers). NLP Pipeline APIs.
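To make the "freely available" point concrete, here is a minimal sketch (assuming the Hugging Face transformers package; PyText or Spark NLP would serve similarly) that downloads a pretrained BERT model and uses it for masked-word prediction:

```python
# A minimal sketch of "AI on demand": loading a pretrained BERT model with
# the Hugging Face transformers library (assumed installed via
# `pip install transformers torch`).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the masked token from the surrounding context.
for prediction in fill_mask("Language understanding benefits from [MASK] learning."):
    print(prediction["token_str"], round(prediction["score"], 3))
```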
Third, any commitment to a disruptive technology (including data-intensive and AI implementations) must start with a business strategy. These changes may include requirements drift, data drift, model drift, or concept drift. I suggest that the simplest business strategy starts with answering three basic questions: What?
Through a visual designer, you can configure custom AI search flows: a series of AI-driven data enrichments performed during ingestion and search. Flows are a pipeline of processor resources. Ingest flows are created to enrich data as it's added to an index. They consist of a data sample of the documents you want to index.
Enterprise data is brought into data lakes and data warehouses to carry out analytical, reporting, and data science use cases using AWS analytical services like Amazon Athena, Amazon Redshift, Amazon EMR, and so on. These instructions are included in the prompt sent to the Bedrock model.
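A minimal sketch of sending such instructions in a prompt to a Bedrock model, assuming boto3 and an illustrative model ID, region, and prompt text:

```python
# Hypothetical sketch: sending a prompt (with embedded instructions) to an
# Amazon Bedrock model via boto3. Model ID, region, and prompt are assumptions.
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

prompt = "You are a SQL assistant. Generate an Athena query that ..."
response = bedrock.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # illustrative model ID
    contentType="application/json",
    accept="application/json",
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "messages": [{"role": "user", "content": prompt}],
    }),
)
print(json.loads(response["body"].read()))
```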
In recent years, data lakes have become a mainstream architecture, and data quality validation is a critical factor in improving the reusability and consistency of the data. In this post, we provide benchmark results of running increasingly complex data quality rulesets over a predefined test dataset.
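As a hedged illustration of what such a ruleset might look like, here is a sketch that registers a simple DQDL ruleset with AWS Glue Data Quality via boto3; the table, database, and rules are assumptions:

```python
# Hypothetical sketch: registering a simple DQDL ruleset against a Glue
# Data Catalog table with boto3. Table, database, and rules are assumptions.
import boto3

glue = boto3.client("glue")

ruleset = """
Rules = [
    IsComplete "order_id",
    ColumnValues "amount" > 0,
    RowCount > 1000
]
"""

glue.create_data_quality_ruleset(
    Name="orders-baseline-rules",
    Ruleset=ruleset,
    TargetTable={"DatabaseName": "sales_db", "TableName": "orders"},
)
```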
Introduction. In the real world, obtaining high-quality annotated data remains a challenge. Therefore we explored how GenAI could automate several stages of the graph-building pipeline. [Benchmark table residue: accuracy and latency figures for Llama and for GPT-4o with a CoT prompt.]
Data governance is the process of ensuring the integrity, availability, usability, and security of an organization’s data. Due to the volume, velocity, and variety of data being ingested in data lakes, it can get challenging to develop and maintain policies and procedures to ensure data governance at scale for your data lake.
With a few clicks, MSK Connect allows you to deploy connectors that move data between Apache Kafka and external systems. MSK Connect now supports the ability to delete MSK Connect worker configurations, tag resources, and manage worker configurations and custom plugins using AWS CloudFormation. Provide a name and optional description.
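A sketch of what these management operations might look like from boto3's kafkaconnect client; the ARNs and tag values are placeholders:

```python
# Hypothetical sketch of the newer MSK Connect management operations via
# boto3's kafkaconnect client. ARN values below are placeholders.
import boto3

kc = boto3.client("kafkaconnect")

# Tag an existing connector resource.
kc.tag_resource(
    resourceArn="arn:aws:kafkaconnect:us-east-1:123456789012:connector/my-connector",
    tags={"team": "data-platform"},
)

# Delete a worker configuration that is no longer in use.
kc.delete_worker_configuration(
    workerConfigurationArn=(
        "arn:aws:kafkaconnect:us-east-1:123456789012:worker-configuration/my-config"
    )
)
```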
Backtesting is a process used in quantitative finance to evaluate trading strategies using historical data. We specifically explore how Amazon EMR and the newly developed Apache Iceberg branching and tagging feature can address the challenge of look-ahead bias in backtesting.
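A minimal sketch of the idea, assuming an Iceberg table registered in a Glue catalog and illustrative table and tag names: tagging an end-of-day snapshot lets a backtest query exactly that state and nothing later.

```python
# Hypothetical sketch: using an Iceberg tag from Spark SQL to pin the data a
# backtest may see, avoiding look-ahead bias. Table and tag names are assumed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("backtest").getOrCreate()

# Tag the snapshot that represents end-of-day state.
spark.sql("ALTER TABLE glue.trading.prices CREATE TAG `eod_2023_12_29`")

# The backtest reads only the tagged snapshot, never data written afterwards.
eod_prices = spark.sql(
    "SELECT * FROM glue.trading.prices VERSION AS OF 'eod_2023_12_29'"
)
eod_prices.show()
```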
We have the tools to create data analytics workflows that address AI bias. When our work processes for creating and monitoring analytics contain built-in controls against bias, data analytics organizations will no longer be dependent on individual social awareness or heroism. What Is AI Bias?
As enterprises collect increasing amounts of data from various sources, the structure and organization of that data often need to change over time to meet evolving analytical needs. Schema evolution enables adding, deleting, renaming, or modifying columns without needing to rewrite existing data. Query the data using Athena.
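A hedged sketch of both steps, evolving the schema and then querying with Athena through boto3; the table, database, and S3 output location are assumptions:

```python
# Hypothetical sketch: evolving an Iceberg table schema and querying it with
# Athena via boto3. Table, database, and S3 output location are assumptions.
import boto3

athena = boto3.client("athena")

def run_query(sql: str) -> str:
    """Submit a query and return its execution ID."""
    result = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "analytics_db"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    return result["QueryExecutionId"]

# Add a column without rewriting existing data (schema evolution).
run_query("ALTER TABLE customer_events ADD COLUMNS (session_id string)")

# Existing rows simply return NULL for the new column.
run_query("SELECT event_id, session_id FROM customer_events LIMIT 10")
```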
AWS Data Pipeline helps customers automate the movement and transformation of data. With Data Pipeline, customers can define data-driven workflows, so that tasks can be dependent on the successful completion of previous tasks. The option you choose depends on your current workload on Data Pipeline.
Data is a key enabler for your business. Many AWS customers have integrated their data across multiple data sources using AWS Glue , a serverless data integration service, in order to make data-driven business decisions. Are there recommended approaches to provisioning components for data integration?
Amazon Redshift is a cloud data warehousing service that provides high-performance analytical processing based on a massively parallel processing (MPP) architecture. Building and maintaining data pipelines is a common challenge for all enterprises. Macros – These are pieces of code that can be reused multiple times.
In May 2021 at the CDO & Data Leaders Global Summit, DataKitchen sat down with the following data leaders to learn how to use DataOps to drive agility and business value. Kurt Zimmer, Head of Data Engineering for Data Enablement at AstraZeneca. Jim Tyo, Chief Data Officer, Invesco. Data takes a long journey.
In this paper, we showcase how to easily deploy a banking application on both IBM Cloud for Financial Services and Satellite , using automated CI/CD/CC pipelines in a common and consistent manner. To achieve the deployment on Satellite the CI/CC pipelines were reused, and a new CD pipeline was created.
To simplify data access and empower users to leverage trusted information, organizations need a better approach, one that delivers insights and business outcomes faster without sacrificing data access controls. There are many different approaches, but you'll want an architecture that can be used regardless of your data estate.
Cloudera delivers an enterprise data cloud that enables companies to build end-to-end data pipelines for hybrid cloud, spanning edge devices to public or private cloud, with integrated security and governance underpinning it to protect customers' data. Data Science and machine learning workloads using CDSW.
Modak, a leading provider of modern data engineering solutions, is now a certified solution partner with Cloudera. Customers can now seamlessly automate migration to Cloudera’s Hybrid Data Platform — Cloudera Data Platform (CDP) to dynamically auto-scale cloud services with Cloudera Data Engineering (CDE) integration with Modak Nabu.
Here are some tips and tricks of the trade to prevent well-intended yet inappropriate data engineering and data science activities from cluttering or crashing the cluster. For data engineering and data science teams, CDSW is highly effective as a comprehensive platform that trains, develops, and deploys machine learning models.
This encompasses tasks such as integrating diverse data from various sources with distinct formats and structures, optimizing the user experience for performance and security, providing multilingual support, and optimizing for cost, operations, and reliability.
Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a managed orchestration service for Apache Airflow that you can use to set up and operate data pipelines in the cloud at scale. The data engineering team wants to separate the raw data into its own AWS account (Account B in the diagram) for increased security and control.
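A minimal sketch of an Airflow DAG that could run on Amazon MWAA and copy raw objects into a bucket owned by Account B; bucket names and keys are assumptions, and cross-account access would additionally require an appropriate bucket policy:

```python
# Hypothetical sketch of an MWAA-runnable Airflow DAG that copies raw objects
# into a bucket owned by a separate account. Names and keys are assumptions.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.s3 import S3CopyObjectOperator

with DAG(
    dag_id="raw_data_to_account_b",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    copy_raw = S3CopyObjectOperator(
        task_id="copy_raw_to_account_b",
        source_bucket_name="shared-ingest-bucket",
        source_bucket_key="raw/{{ ds }}/events.json",
        dest_bucket_name="account-b-raw-bucket",
        dest_bucket_key="raw/{{ ds }}/events.json",
    )
```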
With in-place version upgrades, upgrading your application runtime version can be achieved simply, statefully, and without incurring data loss or additional orchestration in your workload. In addition, logs, metrics, application tags, application configurations, VPCs, and other settings are retained between version upgrades.
Data transformation plays a pivotal role in providing the necessary data insights for businesses in any organization, small and large. Picture a scenario where you, the VP of Data and Analytics, are in charge of your data and analytics environments and workloads running on AWS where you manage a team of data engineers and analysts.
As data analytics use cases grow, factors of scalability and concurrency become crucial for businesses. Your analytic solution architecture should be able to handle large data volumes at high concurrency and without compromising speed, thereby delivering a scalable high-performance analytics environment. Enter the endpoint name.
This Domino Data Science Field Note provides highlights and excerpted slides from Chloe Mawer's "The Ingredients of a Reproducible Machine Learning Model" talk at a recent WiMLDS meetup. Mawer is a Principal Data Scientist at Lineage Logistics as well as an Adjunct Lecturer at Northwestern University.
Proceeding with caution. While H&R Block's leadership and board were enticed by the possibilities of gen AI, Lowden notes he had to address some concerns before they fully bought into the project, especially with regard to safety and data privacy. The first was safety and data privacy testing. The third was guardrails.
Metadata management is key to wringing all the value possible from data assets. However, most organizations don’t use all the data at their disposal to reach deeper conclusions about how to drive revenue, achieve regulatory compliance or accomplish other strategic objectives. Quite simply, metadata is data about data.
Stage 1: Development automation. Infrastructure automation (IaC) and pipeline automation are self-contained within the development team, which makes automation a great place to start. Build AuthN/AuthZ integration patterns that abstract nuances and standardize authentication and authorization of applications, data and services.
Over the years, organizations have invested in creating purpose-built, cloud-based data lakes that are siloed from one another. A major challenge is enabling cross-organization discovery and access to data across these multiple data lakes, each built on different technology stacks.
Amazon EMR provides a managed Hadoop framework that makes it straightforward, fast, and cost-effective to process vast amounts of data using EC2 instances. Amazon EMR with Spot Instances allows you to reduce costs for running your big data workloads on AWS. The following diagram illustrates this architecture.
How dbt Core helps data teams test, validate, and monitor complex data transformations and conversions. dbt Core, an open-source framework for developing, testing, and documenting SQL-based data transformations, has become a must-have tool for modern data teams as the complexity of data pipelines grows.
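A small sketch of how such transformations can be built and tested from Python, assuming dbt Core 1.5+ (which exposes the programmatic dbtRunner) and a hypothetical project directory:

```python
# Hypothetical sketch: invoking dbt Core programmatically to run and test
# SQL transformations. Requires dbt Core 1.5+; the project path is assumed.
from dbt.cli.main import dbtRunner, dbtRunnerResult

dbt = dbtRunner()

# Build the models, then run the tests defined in the project's YAML files.
for args in (["run"], ["test"]):
    result: dbtRunnerResult = dbt.invoke(args + ["--project-dir", "./my_dbt_project"])
    if not result.success:
        raise RuntimeError(f"dbt {args[0]} failed: {result.exception}")
```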
However, various challenges arise in the QA domain that affect test case inventory, test case automation, and defect volume. Managing test case inventory can become problematic due to the sheer volume of cases, which leads to inefficiencies and resource constraints.
Customers face a challenge when distributing cloud resources between different teams running workloads such as development, testing, or production. In this post, we show how to define per-team resource limits for big data workloads using EMR Serverless, for example when you need to test the same workload on Amazon EMR 6.10.0.
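A hedged sketch of one way to express a per-team limit: a dedicated EMR Serverless application per team with a hard capacity ceiling. The application name and capacity values are assumptions:

```python
# Hypothetical sketch: creating one EMR Serverless application per team with a
# maximum capacity ceiling, so no team can exceed its share. Values are assumed.
import boto3

emr = boto3.client("emr-serverless")

response = emr.create_application(
    name="team-analytics-dev",
    releaseLabel="emr-6.10.0",
    type="SPARK",
    maximumCapacity={  # per-team resource limit
        "cpu": "100 vCPU",
        "memory": "512 GB",
    },
)
print(response["applicationId"])
```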
In this post, we walk through creating a new PySpark project that analyzes weather data from the NOAA Global Surface Summary of Day open dataset. The project runs and is tested on macOS, Linux, and Windows. One use case for this is with CI/CD pipelines, since most CI/CD pipelines allow you to access the git tag.
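A minimal sketch of the kind of analysis such a project might contain, assuming the NOAA GSOD open data bucket layout and column names:

```python
# Hypothetical sketch: reading a NOAA GSOD extract with PySpark and computing
# mean temperature per station. The input path and columns are assumptions
# about the open dataset's layout.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("noaa-gsod").getOrCreate()

daily = spark.read.csv("s3://noaa-gsod-pds/2023/", header=True, inferSchema=True)

(daily
    .groupBy("STATION")
    .agg(F.avg("TEMP").alias("mean_temp_f"))
    .orderBy(F.desc("mean_temp_f"))
    .show(10))
```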
Today, we are pleased to announce that Amazon DataZone is now able to present data quality information for data assets. Other organizations monitor the quality of their data through third-party solutions. Amazon DataZone now integrates directly with AWS Glue to display data quality scores for AWS Glue Data Catalog assets.
In this post, you can learn about the Managed Service for Apache Flink cost model, areas to save on cost in your Apache Flink applications, and overall gain a better understanding of your data processing pipelines. An additional KPU per application is also charged for orchestration and not directly used for data processing.
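Putting the orchestration charge into numbers, here is a back-of-the-envelope sketch; the hourly KPU price is an assumption and varies by region:

```python
# Hypothetical back-of-the-envelope KPU cost estimate. The hourly KPU price
# below is an assumption for illustration; check current regional pricing.
KPU_PRICE_PER_HOUR = 0.11   # assumed US East list price
HOURS_PER_MONTH = 730

def monthly_cost(parallelism_kpus: int) -> float:
    # The service bills the KPUs your application uses plus one extra KPU
    # per application for orchestration.
    billable_kpus = parallelism_kpus + 1
    return billable_kpus * KPU_PRICE_PER_HOUR * HOURS_PER_MONTH

print(f"4-KPU app: ${monthly_cost(4):,.2f}/month")  # 5 billable KPUs
```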
Guest post by Jeff Melching, Distinguished Engineer / Chief Architect Data & Analytics. We’ve developed a model-driven software platform, called Climate FieldView , that captures, visualizes, and analyzes a vast array of data for farmers and provides new insight and personalized recommendations to maximize crop yield.
As organizations that have turned to Google Analytics (GA) as a digital analytics solution mature, they discover a more pressing need to integrate this data silo with the rest of their organization's data to enable better analytics and the resulting product development and fraud detection.
In a previous blog of this series, Turning Streams Into Data Products, we talked about the increased need for reducing the latency between data generation/ingestion and producing analytical results and insights from this data, including data that may have to be used to enrich the streaming data.
Rapidly deploying applications to cloud requires not just development acceleration with continuous integration, deployment, and testing (CI/CD/CT). It also requires supply chain lifecycle acceleration, which involves multiple other groups such as governance, risk and compliance (GRC), change management, operations, resiliency, and reliability.
Airflow is a perfect tool to orchestrate stages of the DataRobot machine learning (ML) pipeline, because it provides an easy but powerful solution to integrate DataRobot capabilities into bigger pipelines, combine them with other services, clean your data, and store or publish the results. DataRobot Provider Modules.
Episode 4: Unlocking the Value of Enterprise AI with Data Engineering Capabilities. They discuss how the data engineering team is instrumental in easing collaboration between analysts, data scientists, and ML engineers to build enterprise AI solutions.
It’s official – Cloudera and Hortonworks have merged, and today I’m excited to announce the availability of Cloudera Data Science Workbench (CDSW) for Hortonworks Data Platform (HDP). Trusted by large data science teams across hundreds of enterprises. Sound familiar? What is CDSW? Install any library or framework (e.g.