Data Lake, Machine Learning and Testing

Incremental refresh for Amazon Redshift materialized views on data lake tables

AWS Big Data

NOVEMBER 8, 2024

Amazon Redshift is a fast, fully managed cloud data warehouse that makes it cost-effective to analyze your data using standard SQL and business intelligence tools. Customers use data lake tables to achieve cost effective storage and interoperability with other tools. The sample files are ‘|’ delimited text files.

Data Lake

Data Lake Data Warehouse Optimization Testing

5 things on our data and AI radar for 2021

O'Reilly on Data

FEBRUARY 19, 2021

MLOps attempts to bridge the gap between Machine Learning (ML) applications and the CI/CD pipelines that have become standard practice. The Time Is Now to Adopt Responsible Machine Learning. Data use is no longer a “wild west” in which anything goes; there are legal and reputational consequences for using data improperly.

Data Lake

Data Lake Machine Learning Data Warehouse Modeling

NVIDIA RAPIDS in Cloudera Machine Learning

Cloudera

MAY 19, 2021

In the previous blog post in this series, we walked through the steps for leveraging Deep Learning in your Cloudera Machine Learning (CML) projects. As a machine learning problem, it is a classification task with tabular data, a perfect fit for RAPIDS. Data Ingestion. Ingest Data. Write Data.

Machine Learning

Machine Learning Data Science Data Lake Modeling

Webinars

Automation, Evolved: Your New Playbook For Smarter Knowledge Work

MORE WEBINARS

MLOps and DevOps: Why Data Makes It Different

O'Reilly on Data

OCTOBER 19, 2021

Much has been written about struggles of deploying machine learning projects to production. As with many burgeoning fields and disciplines, we don’t yet have a shared canonical infrastructure stack or best practices for developing and deploying data-intensive applications. An Overarching Concern: Correctness and Testing.

IT

IT Testing Experimentation Software

Eight Top DataOps Trends for 2022

DataKitchen

NOVEMBER 29, 2021

In 2022, data organizations will institute robust automated processes around their AI systems to make them more accountable to stakeholders. Model developers will test for AI bias as part of their pre-deployment testing. Quality test suites will enforce “equity,” like any other performance metric.

Testing

Testing Data Lake Data Architecture Manufacturing

Run Apache XTable in AWS Lambda for background conversion of open table formats

AWS Big Data

NOVEMBER 26, 2024

Initially, data warehouses were the go-to solution for structured data and analytical workloads but were limited by proprietary storage formats and their inability to handle unstructured data. Eventually, transactional data lakes emerged to add transactional consistency and performance of a data warehouse to the data lake.

Metadata

Metadata Data Lake Snapshot Data Warehouse

Using AWS AppSync and AWS Lake Formation to access a secure data lake through a GraphQL API

AWS Big Data

OCTOBER 9, 2023

Data lakes have been gaining popularity for storing vast amounts of data from diverse sources in a scalable and cost-effective way. As the number of data consumers grows, data lake administrators often need to implement fine-grained access controls for different user profiles.

Data Lake

Data Lake Testing Big Data Management

How Cloudinary transformed their petabyte scale streaming data lake with Apache Iceberg and AWS Analytics

AWS Big Data

JUNE 10, 2024

This amalgamation empowers vendors with authority over a diverse range of workloads by virtue of owning the data. This authority extends across realms such as business intelligence, data engineering, and machine learning thus limiting the tools and capabilities that can be used.

Data Lake

Data Lake Metadata Snapshot Analytics

How Volkswagen streamlined access to data across multiple data lakes using Amazon DataZone – Part 1

AWS Big Data

JULY 18, 2024

Over the years, organizations have invested in creating purpose-built, cloud-based data lakes that are siloed from one another. A major challenge is enabling cross-organization discovery and access to data across these multiple data lakes, each built on different technology stacks.

Data Lake

Data Lake Publishing Metadata Data-driven

Accelerate Amazon Redshift Data Lake queries with AWS Glue Data Catalog Column Statistics

AWS Big Data

OCTOBER 1, 2024

Amazon Redshift enables you to efficiently query and retrieve structured and semi-structured data from open format files in Amazon S3 data lake without having to load the data into Amazon Redshift tables. Amazon Redshift extends SQL capabilities to your data lake, enabling you to run analytical queries.

Data Lake

Data Lake Statistics Broadcasting Optimization

Monitor data pipelines in a serverless data lake

AWS Big Data

AUGUST 9, 2023

The combination of a data lake in a serverless paradigm brings significant cost and performance benefits. By monitoring application logs, you can gain insights into job execution, troubleshoot issues promptly to ensure the overall health and reliability of data pipelines.

Data Lake

Data Lake Metrics Testing Cost-Benefit

Interview with: Sankar Narayanan, Chief Practice Officer at Fractal Analytics

Corinium

JUNE 6, 2019

Some of the work is very foundational, such as building an enterprise data lake and migrating it to the cloud, which enables other more direct value-added activities such as self-service. There is usually a steep learning curve in terms of “doing AI right”, which is invaluable.

Insurance

Insurance Analytics Forecasting Deep Learning

Choosing an open table format for your transactional data lake on AWS

AWS Big Data

JUNE 9, 2023

A modern data architecture enables companies to ingest virtually any type of data through automated pipelines into a data lake, which provides highly durable and cost-effective object storage at petabyte or exabyte scale.

Data Lake

Data Lake Metadata Statistics Optimization

Recap of Amazon Redshift key product announcements in 2024

AWS Big Data

DECEMBER 17, 2024

Today, Amazon Redshift is used by customers across all industries for a variety of use cases, including data warehouse migration and modernization, near real-time analytics, self-service analytics, data lake analytics, machine learning (ML), and data monetization.

Data Lake

Data Lake Data Warehouse Data-driven Optimization

Navigating Data Entities, BYOD, and Data Lakes in Microsoft Dynamics

Jet Global

SEPTEMBER 4, 2020

There is an established body of practice around creating, managing, and accessing OLAP data (known as “cubes”). Data Lakes. There has been a lot of talk over the past year or two in the D365F&SCM world about “data lakes.” Traditional databases and data warehouses do not lend themselves to that task.

Data Lake

Data Lake OLAP Data Warehouse Unstructured Data

Simplify operational data processing in data lakes using AWS Glue and Apache Hudi

AWS Big Data

SEPTEMBER 13, 2023

A modern data architecture is an evolutionary architecture pattern designed to integrate a data lake, data warehouse, and purpose-built stores with a unified governance model. The company wanted the ability to continue processing operational data in the secondary Region in the rare event of primary Region failure.

Data Lake

Data Lake Data Processing Metadata Snapshot

Build a high-performance quant research platform with Apache Iceberg

AWS Big Data

JANUARY 9, 2025

These features allow efficient data corrections, gap-filling in time series, and historical data updates without disrupting ongoing analyses or compromising data integrity. Unlike direct Amazon S3 access, Iceberg supports these operations on petabyte-scale data lakes without requiring complex custom code.

Metadata

Metadata Snapshot Cost-Benefit Optimization

What is a Data Mesh?

DataKitchen

AUGUST 3, 2021

First-generation – expensive, proprietary enterprise data warehouse and business intelligence platforms maintained by a specialized team drowning in technical debt. Second-generation – gigantic, complex data lake maintained by a specialized team drowning in technical debt.

Data Architecture

Data Architecture Data Lake Cost-Benefit Data Warehouse

Design a data mesh pattern for Amazon EMR-based data lakes using AWS Lake Formation with Hive metastore federation

AWS Big Data

JUNE 10, 2024

Use cases for Hive metastore federation for Amazon EMR Hive metastore federation for Amazon EMR is applicable to the following use cases: Governance of Amazon EMR-based data lakes – Producers generate data within their AWS accounts using an Amazon EMR-based data lake supported by EMRFS on Amazon Simple Storage Service (Amazon S3)and HBase.

Data Lake

Data Lake Metadata Data Warehouse Data Processing

Data Quality Power Moves: Scorecards & Data Checks for Organizational Impact

DataKitchen

SEPTEMBER 18, 2024

These leaders are expected to influence organizational behavior without direct authority, leading to what DataKitchen CEO Christopher Bergh described as “data nags”—individuals who know what’s wrong but struggle to get others to act. Who should make the change (data engineers, system owners, or data quality professionals).

Scorecard

Scorecard Data Quality Measurement Testing

Of Muffins and Machine Learning Models

Cloudera

FEBRUARY 16, 2022

In this example, the Machine Learning (ML) model struggles to differentiate between a chihuahua and a muffin. We will learn what it is, why it is important and how Cloudera Machine Learning (CML) is helping organisations tackle this challenge as part of the broader objective of achieving Ethical AI.

Machine Learning

Machine Learning Modeling Metadata Recreation/Entertainment

Data transformation takes flight at Atlanta’s Hartsfield-Jackson airport

CIO Business Intelligence

AUGUST 9, 2024

At Atlanta’s Hartsfield-Jackson International Airport, an IT pilot has led to a wholesale data journey destined to transform operations at the world’s busiest airport, fueled by machine learning and generative AI. Data integrity presented a major challenge for the team, as there were many instances of duplicate data.

Data Transformation

Data Transformation Machine Learning Data Lake Dashboards

At AstraZeneca, data and AI are more than game changers – they are life changers

CIO Business Intelligence

OCTOBER 11, 2022

The goal, she explained, is to knock down data silos between those groups, using multiple data lakes supported by strong security and governance, to drive positive impact across the supply chain, manufacturing, and the clinical trials of new drugs. . Four ways to improve data-driven business transformation . “You

Machine Learning

Machine Learning Data Science Data-driven Testing

Regeneron turns to IT to accelerate drug discovery

CIO Business Intelligence

NOVEMBER 4, 2022

Rigid requirements to ensure the accuracy of data and veracity of scientific formulas as well as machine learning algorithms and data tools are common in modern laboratories. When Bob McCowan was promoted to CIO at Regeneron Pharmaceuticals in 2018, he had previously run the data center infrastructure for the $81.5

Data Lake

Data Lake IT Experimentation Data-driven

Data’s dark secret: Why poor quality cripples AI and growth

CIO Business Intelligence

APRIL 8, 2025

Comparison of modern data architectures : Architecture Definition Strengths Weaknesses Best used when Data warehouse Centralized, structured and curated data repository. Inflexible schema, poor for unstructured or real-time data. Data lake Raw storage for all types of structured and unstructured data.

Data Quality

Data Quality Data-driven Key Performance Indicator Metadata

Introducing generative AI upgrades for Apache Spark in AWS Glue (preview)

AWS Big Data

NOVEMBER 22, 2024

Organizations run millions of Apache Spark applications each month on AWS, moving, processing, and preparing data for analytics and machine learning. Data practitioners need to upgrade to the latest Spark releases to benefit from performance improvements, new features, bug fixes, and security enhancements. Python 3.7)

Cost-Benefit

Cost-Benefit Data-driven Software Testing

Rocket Mortgage lays foundation for generative AI success

CIO Business Intelligence

MARCH 29, 2024

That’s why Rocket Mortgage has been a vigorous implementor of machine learning and AI technologies — and why CIO Brian Woodring emphasizes a “human in the loop” AI strategy that will not be pinned down to any one generative AI model. Today, 60% to 70% of Rocket’s workloads run on the cloud, with more than 95% of those workloads in AWS.

Data Lake

Data Lake Machine Learning Data Warehouse Unstructured Data

Carhartt turns to data under new CIO

CIO Business Intelligence

NOVEMBER 25, 2022

As part of that transformation, Agusti has plans to integrate a data lake into the company’s data architecture and expects two AI proofs of concept (POCs) to be ready to move into production within the quarter. Today, we backflush our data lake through our data warehouse.

Data Lake

Data Lake Data Warehouse Unstructured Data Data Architecture

The Ten Standard Tools To Develop Data Pipelines In Microsoft Azure

DataKitchen

JULY 27, 2023

You can use it for big data analytics and machine learning workloads. Azure Databricks Delta Live Table s: These provide a more straightforward way to build and manage Data Pipelines for the latest, high-quality data in Delta Lake. Azure Blob Storage serves as the data lake to store raw data.

Machine Learning

Machine Learning Cost-Benefit Data Transformation Testing

10 Things AWS Can Do for Your SaaS Company

Smart Data Collective

FEBRUARY 20, 2022

Data storage databases. Your SaaS company can store and protect any amount of data using Amazon Simple Storage Service (S3), which is ideal for data lakes, cloud-native applications, and mobile apps. AWS also offers developers the technology to develop smart apps using machine learning and complex algorithms.

Cost-Benefit

Cost-Benefit Data Lake Software Machine Learning

Enriching metadata for accurate text-to-SQL generation for Amazon Athena

AWS Big Data

OCTOBER 14, 2024

Enterprise data is brought into data lakes and data warehouses to carry out analytical, reporting, and data science use cases using AWS analytical services like Amazon Athena , Amazon Redshift , Amazon EMR , and so on. A question arises on what level of details we need to include in the table metadata.

Metadata

Metadata Data Lake Modeling Data Warehouse

Introducing a new unified data connection experience with Amazon SageMaker Lakehouse unified data connectivity

AWS Big Data

DECEMBER 16, 2024

For each service, you need to learn the supported authorization and authentication methods, data access APIs, and framework to onboard and test data sources. To solve for these challenges, we launched Amazon SageMaker Lakehouse unified data connectivity. To learn more, refer to Amazon SageMaker Unified Studio.

Visualization

Visualization Data Processing Testing Publishing

The Future Of The Telco Industry And Impact Of 5G & IoT – Part II

Cloudera

AUGUST 28, 2020

In order for edge analytics to be successful, you still need the cloud or a centralized data hub, when you can land petabytes of live or test data in the cloud, or a centralized data cluster and then use it to, train, test, and iterate on machine learning models using all that data.

IoT

IoT Machine Learning B2B Testing

Use AWS Glue ETL to perform merge, partition evolution, and schema evolution on Apache Iceberg

AWS Big Data

MARCH 4, 2024

As enterprises collect increasing amounts of data from various sources, the structure and organization of that data often need to change over time to meet evolving analytical needs. Schema evolution enables adding, deleting, renaming, or modifying columns without needing to rewrite existing data.

Snapshot

Snapshot Data Lake Metadata Recreation/Entertainment

What is Data Pipeline? A Detailed Explanation

Smart Data Collective

OCTOBER 17, 2022

A point of data entry in a given pipeline. Examples of an origin include storage systems like data lakes, data warehouses and data sources that include IoT devices, transaction processing applications, APIs or social media. The final point to which the data has to be eventually transferred is a destination.

Data Warehouse

Data Warehouse Data Lake Visualization Big Data

Amazon Redshift announcements at AWS re:Invent 2023 to enable analytics on all your data

AWS Big Data

NOVEMBER 29, 2023

This cloud service was a significant leap from the traditional data warehousing solutions, which were expensive, not elastic, and required significant expertise to tune and operate. Amazon Redshift Serverless, generally available since 2021, allows you to run and scale analytics without having to provision and manage the data warehouse.

Data Warehouse

Data Warehouse Analytics Data Lake Machine Learning

TransUnion transforms its business model with IT

CIO Business Intelligence

APRIL 26, 2024

billion acquisition of data and analytics company Neustar in 2021, TransUnion has expanded into other services such as marketing, fraud detection and prevention, and robust analytical services. The platform approach to AI TransUnion has been developing, deploying, and continuously modifying machine learning models for some time.

Modeling

Modeling IT Machine Learning Data Governance

Top analytics announcements of AWS re:Invent 2024

AWS Big Data

FEBRUARY 26, 2025

Analytics remained one of the key focus areas this year, with significant updates and innovations aimed at helping businesses harness their data more efficiently and accelerate insights. From enhancing data lakes to empowering AI-driven analytics, AWS unveiled new tools and services that are set to shape the future of data and analytics.

Analytics

Analytics Data Lake Metadata Data Warehouse

Making the most of MLOps

CIO Business Intelligence

MAY 26, 2022

When companies first start deploying artificial intelligence and building machine learning projects, the focus tends to be on theory. But the tools that data scientists use to create these proofs of concept often don’t translate well into production systems. You need traceability of your code, data, and models.”.

Machine Learning

Machine Learning Data-driven Modeling Dashboards

Making the most of MLOps

CIO Business Intelligence

MAY 28, 2022

When companies first start deploying artificial intelligence and building machine learning projects, the focus tends to be on theory. But the tools that data scientists use to create these proofs of concept often don’t translate well into production systems. You need traceability of your code, data, and models.”.

Machine Learning

Machine Learning Data-driven Modeling Dashboards

Simplifying data processing at Capitec with Amazon Redshift integration for Apache Spark

AWS Big Data

NOVEMBER 10, 2023

Amazon Redshift offers seamless integration with Apache Spark, allowing you to easily access your Redshift data on both Amazon Redshift provisioned clusters and Amazon Redshift Serverless. These tables are then joined with tables from the Enterprise Data Lake (EDL) at runtime. options(**read_config).option("query",

Data Processing

Data Processing Data Lake Data Warehouse Optimization

Access Amazon Redshift data from Salesforce Data Cloud with Zero Copy Data Federation

AWS Big Data

JUNE 25, 2024

In today’s data-driven business landscape, organizations collect a wealth of data across various touch points and unify it in a central data warehouse or a data lake to deliver business insights. Connection to Amazon Redshift is established by deploying a data stream in Salesforce Data Cloud.

Data Lake

Data Lake Cost-Benefit Data-driven Data Warehouse

Supercharge Your Data Lakehouse with Apache Iceberg in Cloudera Data Platform

Cloudera

JUNE 30, 2022

Today’s general availability announcement covers Iceberg running within key data services in the Cloudera Data Platform (CDP) — including Cloudera Data Warehousing ( CDW ), Cloudera Data Engineering ( CDE ), and Cloudera Machine Learning ( CML ).

Data Lake

Data Lake Data Warehouse Data Architecture Metadata

Happy Birthday, CDP Public Cloud

Cloudera

OCTOBER 13, 2020

In the beginning, CDP ran only on AWS with a set of services that supported a handful of use cases and workload types: CDP Data Warehouse: a kubernetes-based service that allows business analysts to deploy data warehouses with secure, self-service access to enterprise data. Learn More, Keep in Touch. This is Now.

Data Warehouse

Data Warehouse Machine Learning Visualization Data Lake

Incremental refresh for Amazon Redshift materialized views on data lake tables

5 things on our data and AI radar for 2021

Webinars

Trending Sources

NVIDIA RAPIDS in Cloudera Machine Learning

Webinars

MLOps and DevOps: Why Data Makes It Different

Eight Top DataOps Trends for 2022

Run Apache XTable in AWS Lambda for background conversion of open table formats

Using AWS AppSync and AWS Lake Formation to access a secure data lake through a GraphQL API

How Cloudinary transformed their petabyte scale streaming data lake with Apache Iceberg and AWS Analytics

How Volkswagen streamlined access to data across multiple data lakes using Amazon DataZone – Part 1

Accelerate Amazon Redshift Data Lake queries with AWS Glue Data Catalog Column Statistics

Monitor data pipelines in a serverless data lake

Interview with: Sankar Narayanan, Chief Practice Officer at Fractal Analytics

Choosing an open table format for your transactional data lake on AWS

Recap of Amazon Redshift key product announcements in 2024

Navigating Data Entities, BYOD, and Data Lakes in Microsoft Dynamics

Simplify operational data processing in data lakes using AWS Glue and Apache Hudi

Build a high-performance quant research platform with Apache Iceberg

What is a Data Mesh?

Design a data mesh pattern for Amazon EMR-based data lakes using AWS Lake Formation with Hive metastore federation

Data Quality Power Moves: Scorecards & Data Checks for Organizational Impact

Of Muffins and Machine Learning Models

Data transformation takes flight at Atlanta’s Hartsfield-Jackson airport

At AstraZeneca, data and AI are more than game changers – they are life changers

Regeneron turns to IT to accelerate drug discovery

Data’s dark secret: Why poor quality cripples AI and growth

Introducing generative AI upgrades for Apache Spark in AWS Glue (preview)

Rocket Mortgage lays foundation for generative AI success

Carhartt turns to data under new CIO

The Ten Standard Tools To Develop Data Pipelines In Microsoft Azure

10 Things AWS Can Do for Your SaaS Company

Enriching metadata for accurate text-to-SQL generation for Amazon Athena

Introducing a new unified data connection experience with Amazon SageMaker Lakehouse unified data connectivity

The Future Of The Telco Industry And Impact Of 5G & IoT – Part II

Use AWS Glue ETL to perform merge, partition evolution, and schema evolution on Apache Iceberg

What is Data Pipeline? A Detailed Explanation

Amazon Redshift announcements at AWS re:Invent 2023 to enable analytics on all your data

TransUnion transforms its business model with IT

Top analytics announcements of AWS re:Invent 2024

Making the most of MLOps

Making the most of MLOps

Simplifying data processing at Capitec with Amazon Redshift integration for Apache Spark

Access Amazon Redshift data from Salesforce Data Cloud with Zero Copy Data Federation

Supercharge Your Data Lakehouse with Apache Iceberg in Cloudera Data Platform

Happy Birthday, CDP Public Cloud

Stay Connected