Amazon SageMaker Unified Studio (preview) provides an integrated data and AI development environment within Amazon SageMaker. From the Unified Studio, you can collaborate and build faster using familiar AWS tools for model development, generative AI, data processing, and SQL analytics.
With the growing emphasis on data, organizations are constantly seeking more efficient and agile ways to integrate their data, especially from a wide variety of applications. In addition, as organizations rely on an increasingly diverse array of digital systems, data fragmentation has become a significant challenge.
Data is the most significant asset of any organization. However, enterprises often encounter challenges with data silos, insufficient access controls, poor governance, and quality issues. Embracing data as a product is key to addressing these challenges and fostering a data-driven culture.
If you include the title of this blog, you were just presented with 13 examples of heteronyms in the preceding paragraphs. In the modern era of massive data collections and exploding content repositories, we can no longer rely on keyword searches alone. Data catalogs are very useful and important.
To achieve this, they aimed to break down data silos and centralize data from various business units and countries into the BMW Cloud Data Hub (CDH). However, the initial version of CDH supported only coarse-grained access control to entire data assets, and hence it was not possible to scope access to data asset subsets.
This week on the keynote stages at AWS re:Invent 2024, you heard Matt Garman, CEO of AWS, and Swami Sivasubramanian, VP of AI and Data at AWS, speak about the next generation of Amazon SageMaker, the center for all of your data, analytics, and AI. The relationship between analytics and AI is rapidly evolving.
Read the complete blog below for a more detailed description of the vendors and their capabilities. This is not surprising given that DataOps enables enterprise data teams to generate significant business value from their data. Testing and data observability. Reflow: a system for incremental data processing in the cloud.
The need to integrate diverse data sources has grown exponentially, but there are several common challenges when integrating and analyzing data from multiple sources, services, and applications. First, you need to create and maintain independent connections to the same data source for different services.
Third, any commitment to a disruptive technology (including data-intensive and AI implementations) must start with a business strategy. These changes may include requirements drift, data drift, model drift, or concept drift. I suggest that the simplest business strategy starts with answering three basic questions: What?
Customers often want to augment and enrich SAP source data with other non-SAP source data. Such analytic use cases can be enabled by building a data warehouse or data lake. Customers can now use the AWS Glue SAP OData connector to extract data from SAP.
Amazon Redshift, launched in 2013, has undergone significant evolution since its inception, allowing customers to expand the horizons of data warehousing and SQL analytics. Industry-leading price-performance: Amazon Redshift offers up to three times better price-performance than alternative cloud data warehouses.
Key Features of a Machine Learning Data Catalog. Data intelligence is crucial for the development of data catalogs. At the center of this innovation are machine learning data catalogs (MLDCs). Key features include data stewardship, data governance streamlining, and a business glossary.
Enterprise data is brought into data lakes and data warehouses to carry out analytical, reporting, and data science use cases using AWS analytical services like Amazon Athena, Amazon Redshift, Amazon EMR, and so on. Navigate to the AWS Service Catalog console and choose Amazon Bedrock. Choose Notebook instances.
The data mesh design pattern breaks giant, monolithic enterprise data architectures into subsystems or domains, each managed by a dedicated team. DataOps helps the data mesh deliver greater business agility by enabling decentralized domains to work in concert. But first, let's define the data mesh design pattern.
Python is used extensively among data engineers and data scientists to solve all sorts of problems, from ETL/ELT pipelines to building machine learning models. Apache HBase is an effective data storage system for many workflows, but accessing this data specifically through Python can be a struggle.
LLMs deployed as internal enterprise-specific agents can help employees find internal documentation, data, and other company information to help organizations easily extract and summarize important internal content. Given some example data, LLMs can quickly learn new content that wasn’t available during the initial training of the base model.
When it comes to using AI and machine learning across your organization, there are many good reasons to provide your data and analytics community with an intelligent data foundation. For instance, large language models (LLMs) are known to ultimately perform better when data is structured.
Here at Cloudera, we're committed to helping make the lives of data practitioners as painless as possible. For data scientists, we continue to provide new Applied Machine Learning Prototypes (AMPs), which are open source and available on GitHub. Video footage constitutes a significant portion of all data in the world.
Thousands of organizations build data integration pipelines to extract and transform data. They establish data quality rules to ensure the extracted data is of high quality for accurate business decisions. These rules commonly assess the data based on fixed criteria reflecting the current business state.
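The idea of fixed-criteria quality rules applied to extracted data can be sketched in a few lines. The rule names, thresholds, and record layout below are illustrative assumptions, not taken from any specific product:

```python
# Hypothetical sketch: fixed-criteria data quality rules checked against
# extracted rows before they feed business decisions.

def completeness(rows, column):
    """Fraction of rows where `column` is present and non-empty."""
    filled = sum(1 for r in rows if r.get(column) not in (None, ""))
    return filled / len(rows) if rows else 0.0

def evaluate_rules(rows, rules):
    """Each rule is (name, check) where check(rows) -> bool; returns failing rule names."""
    return [name for name, check in rules if not check(rows)]

rows = [
    {"order_id": 1, "amount": 120.0, "country": "DE"},
    {"order_id": 2, "amount": 75.5, "country": ""},
    {"order_id": 3, "amount": -10.0, "country": "US"},
]

rules = [
    ("country_completeness", lambda rs: completeness(rs, "country") >= 0.9),
    ("amount_non_negative", lambda rs: all(r["amount"] >= 0 for r in rs)),
]

print(evaluate_rules(rows, rules))  # both rules fail on this sample
```

Because the criteria are fixed (the 0.9 threshold, the non-negativity check), rules like these reflect the business state at the time they were written, which is exactly the limitation the excerpt points at.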
In today’s data-driven world, the ability to seamlessly integrate and utilize diverse data sources is critical for gaining actionable insights and driving innovation. Use case: Consider a large ecommerce company that relies heavily on data-driven insights to optimize its operations, marketing strategies, and customer experiences.
Public health organizations need access to data insights that they can quickly act upon, especially in times of health emergencies, when data needs to be updated multiple times daily. Instead, they rely on up-to-date dashboards that help them visualize data insights to make informed decisions quickly.
What Is Metadata? Metadata is information about data. A clothing catalog or a dictionary are both examples of metadata repositories. Indeed, a popular online catalog like Amazon offers rich metadata around products to guide shoppers: ratings, reviews, and product details are all examples of metadata. Why Is Metadata Important?
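The distinction between data and metadata can be sketched as a minimal catalog entry; the field names below are assumptions for illustration only:

```python
# Illustrative only: a catalog entry holds metadata *about* a dataset,
# kept separate from the data itself.

dataset = [("2024-01-01", 120.0), ("2024-01-02", 75.5)]  # the data

metadata = {  # information about the data
    "name": "daily_sales",
    "owner": "analytics-team",
    "columns": ["date", "amount"],
    "row_count": len(dataset),
    "description": "Daily aggregated sales totals",
}

print(metadata["name"], metadata["row_count"])
```

A data catalog is, in essence, a searchable collection of entries like this one, enriched with ratings, lineage, and usage information.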
In the era of big data, data lakes have emerged as a cornerstone for storing vast amounts of raw data in its native format. They support structured, semi-structured, and unstructured data, offering a flexible and scalable environment for data ingestion from multiple sources.
In a previous blog post on CDW performance, we compared Azure HDInsight to CDW. In this blog post, we compare Cloudera Data Warehouse (CDW) on Cloudera Data Platform (CDP) using Apache Hive-LLAP to EMR 6.0 (also powered by Apache Hive-LLAP) on Amazon, using the TPC-DS 2.9 benchmark. More on this later in the blog.
With the complexity of data growing across the enterprise and emerging approaches to machine learning and AI use cases, data scientists and machine learning engineers have needed more versatile and efficient ways of enabling data access, faster processing, and better, more customizable resource management across their machine learning projects.
To simplify data access and empower users to leverage trusted information, organizations need a better approach that provides better insights and business outcomes faster, without sacrificing data access controls. There are many different approaches, but you’ll want an architecture that can be used regardless of your data estate.
AWS Data Pipeline helps customers automate the movement and transformation of data. With Data Pipeline, customers can define data-driven workflows, so that tasks can be dependent on the successful completion of previous tasks. Some customers want a deeper level of control and specificity than possible using Data Pipeline.
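The core idea of a data-driven workflow, where a task runs only after its upstream tasks complete successfully, can be sketched as dependency-ordered execution. This is a conceptual illustration, not the AWS Data Pipeline API; the task names are assumptions:

```python
# Conceptual sketch: tasks declare their dependencies, and the runner
# executes them in an order where every dependency finishes first.

from graphlib import TopologicalSorter

tasks = {
    "extract": [],
    "transform": ["extract"],
    "load": ["transform"],
    "report": ["load", "transform"],
}

def run_pipeline(tasks):
    """Return tasks in an execution order that respects all dependencies."""
    order = list(TopologicalSorter(tasks).static_order())
    completed = []
    for task in order:
        # a task runs only after every dependency has completed successfully
        assert all(dep in completed for dep in tasks[task])
        completed.append(task)
    return completed

print(run_pipeline(tasks))
```

Managed services add scheduling, retries, and failure handling on top of this basic ordering, which is the "deeper level of control" some customers want to own themselves.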
Machine learning (ML) has become a critical component of many organizations’ digital transformation strategy. The answer lies in the data used to train these models and how that data is derived.
Performance is one of the key deciding criteria, if not the most important, in choosing a cloud data warehouse service. In today’s fast-changing world, enterprises have to make data-driven decisions quickly, and for that they rely heavily on their data warehouse service. Cloudera Data Warehouse vs. HDInsight.
Data lakes have been gaining popularity for storing vast amounts of data from diverse sources in a scalable and cost-effective way. As the number of data consumers grows, data lake administrators often need to implement fine-grained access controls for different user profiles.
Many data catalog initiatives fail. How can prospective buyers ensure they partner with the right catalog to drive success? According to the latest report from Eckerson Group, Deep Dive on Data Catalogs, shoppers must match the goals of their organizations to the capabilities of their chosen catalog.
Apache Hudi is an open table format that brings database and data warehouse capabilities to data lakes. Apache Hudi helps data engineers manage complex challenges, such as managing continuously evolving datasets with transactions while maintaining query performance.
The third installment of the quarterly Alation State of Data Culture Report was recently released, highlighting the data challenges enterprises face as they continue investing in artificial intelligence (AI). AI fails when it’s fed bad data, resulting in inaccurate or unfair results.
Cloudera has been named a Leader in The Forrester Wave: Notebook-Based Predictive Analytics and Machine Learning, Q3 2020. For enterprise machine learning teams, this means having the right platform, tools, and processes that streamline end-to-end ML to tackle once-impossible business challenges effectively and at scale.
In today’s data-driven world, the ability to effortlessly move and analyze data across diverse platforms is essential. Amazon AppFlow , a fully managed data integration service, has been at the forefront of streamlining data transfer between AWS services, software as a service (SaaS) applications, and now Google BigQuery.
This is a guest blog post co-authored with Atul Khare and Bhupender Panwar from Salesforce. The platform ingests more than 1 PB of data per day, more than 10 million events per second, and more than 200 different log types. The data lake consumers then use Apache Presto running on an Amazon EMR cluster to perform one-time queries.
Capable of understanding and generating human-like responses and content, these assistants are revolutionizing the way humans and machines collaborate. LLMs are trained on vast amounts of data and can be used across endless applications. In addition, each API contains fields of varying data types.
Director of Product, Salesforce Data Cloud. In today’s ever-evolving business landscape, organizations must harness and act on data to fuel analytics, generate insights, and make informed decisions to deliver exceptional customer experiences. What is Salesforce Data Cloud? What is Amazon Redshift?
Over the years, organizations have invested in creating purpose-built, cloud-based data lakes that are siloed from one another. A major challenge is enabling cross-organization discovery and access to data across these multiple data lakes, each built on different technology stacks.
Within the vehicle, current electronics and wiring infrastructures were not designed for this complex data wrangling capability. In addition, moving outside the vehicle, existing fragmented approaches to data management associated with the machine learning lifecycle are limiting the ability to deploy new use cases at scale.
Several weeks ago (prior to the Omicron wave), I got to attend my first conference in roughly two years: Dataversity’s Data Quality and Information Quality Conference. Ryan Doupe, Chief Data Officer of American Fidelity, held a thought-provoking session that resonated with me. Instead, data quality rules promote awareness and trust.
This week I was talking to a data practitioner at a global systems integrator. The practitioner asked me to add something to a presentation for his organization: the value of data governance for things other than data compliance and data security. Now to be honest, I immediately jumped onto data quality.
Amazon DataZone enables customers to discover, access, share, and govern data at scale across organizational boundaries, reducing the undifferentiated heavy lifting of making data and analytics tools accessible to everyone in the organization. This is challenging because access to data is managed differently by each of the tools.
For data-driven enterprises, data governance is no longer an option; it’s a necessity. Businesses are growing more dependent on data governance to manage data policies, compliance, and quality. For these reasons, a business’ data governance approach is essential. Data Democratization.