Data Lake, Machine Learning and Modeling

Modernize your legacy databases with AWS data lakes, Part 2: Build a data lake using AWS DMS data on Apache Iceberg

AWS Big Data

OCTOBER 30, 2024

This is part two of a three-part series where we show how to build a data lake on AWS using a modern data architecture. This post shows how to load data from a legacy database (SQL Server) into a transactional data lake ( Apache Iceberg ) using AWS Glue. Delete the bucket.

Data Lake

Data Lake Data Processing Optimization Machine Learning

5 things on our data and AI radar for 2021

O'Reilly on Data

FEBRUARY 19, 2021

MLOps attempts to bridge the gap between Machine Learning (ML) applications and the CI/CD pipelines that have become standard practice. The Time Is Now to Adopt Responsible Machine Learning. Data use is no longer a “wild west” in which anything goes; there are legal and reputational consequences for using data improperly.

Data Lake

Data Lake Machine Learning Data Warehouse Modeling

Rapidminer Platform Supports Entire Data Science Lifecycle

David Menninger's Analyst Perspectives

SEPTEMBER 16, 2021

Rapidminer is a visual enterprise data science platform that includes data extraction, data mining, deep learning, artificial intelligence and machine learning (AI/ML) and predictive analytics. Rapidminer Studio is its visual workflow designer for the creation of predictive models.

Data Science

Data Science Data Lake Data mining Deep Learning

Webinars

How to Achieve High-Accuracy Results When Using LLMs

MORE WEBINARS

Understanding the Differences Between Data Lakes and Data Warehouses

Smart Data Collective

AUGUST 28, 2021

Data lakes and data warehouses are probably the two most widely used structures for storing data. Data Warehouses and Data Lakes in a Nutshell. A data warehouse is used as a central storage space for large amounts of structured data coming from various sources. Data Type and Processing.

Data Lake

Data Lake Data Warehouse Unstructured Data Structured Data

Databricks Lakehouse Platform Streamlines Big Data Processing

David Menninger's Analyst Perspectives

OCTOBER 26, 2021

Databricks is a data engineering and analytics cloud platform built on top of Apache Spark that processes and transforms huge volumes of data and offers data exploration capabilities through machine learning models. The platform supports streaming data, SQL queries, graph processing and machine learning.

Big Data

Big Data Data Processing Machine Learning Modeling

How EUROGATE established a data mesh architecture using Amazon DataZone

AWS Big Data

JANUARY 15, 2025

The following requirements were essential to decide for adopting a modern data mesh architecture: Domain-oriented ownership and data-as-a-product : EUROGATE aims to: Enable scalable and straightforward data sharing across organizational boundaries. Eliminate centralized bottlenecks and complex data pipelines.

IoT

IoT Machine Learning Metadata Data-driven

NVIDIA RAPIDS in Cloudera Machine Learning

Cloudera

MAY 19, 2021

In the previous blog post in this series, we walked through the steps for leveraging Deep Learning in your Cloudera Machine Learning (CML) projects. RAPIDS on the Cloudera Data Platform comes pre-configured with all the necessary libraries and dependencies to bring the power of RAPIDS to your projects. Data Ingestion.

Machine Learning

Machine Learning Data Science Data Lake Modeling

The next generation of Amazon SageMaker: The center for all your data, analytics, and AI

AWS Big Data

DECEMBER 4, 2024

They’re taking data they’ve historically used for analytics or business reporting and putting it to work in machine learning (ML) models and AI-powered applications. This innovation drives an important change: you’ll no longer have to copy or move data between data lake and data warehouses.

Data Analytics

Data Analytics Analytics Data Lake Data Quality

How BMW streamlined data access using AWS Lake Formation fine-grained access control

AWS Big Data

OCTOBER 29, 2024

The CDH is used to create, discover, and consume data products through a central metadata catalog, while enforcing permission policies and tightly integrating data engineering, analytics, and machine learning services to streamline the user journey from data to insight.

Data Lake

Data Lake Sales Metadata Machine Learning

Recap of Amazon Redshift key product announcements in 2024

AWS Big Data

DECEMBER 17, 2024

Today, Amazon Redshift is used by customers across all industries for a variety of use cases, including data warehouse migration and modernization, near real-time analytics, self-service analytics, data lake analytics, machine learning (ML), and data monetization.

Data Lake

Data Lake Data Warehouse Data-driven Optimization

Accelerate analytics and AI innovation with the next generation of Amazon SageMaker

AWS Big Data

MARCH 13, 2025

At AWS re:Invent 2024, we announced the next generation of Amazon SageMaker , the center for all your data, analytics, and AI. Governance features including fine-grained access control are built into SageMaker Unified Studio using Amazon SageMaker Catalog to help you meet enterprise security requirements across your entire data estate.

Analytics

Analytics Data Lake Data Warehouse Data-driven

What is data architecture? A framework to manage data

CIO Business Intelligence

DECEMBER 20, 2024

Data architecture definition Data architecture describes the structure of an organizations logical and physical data assets, and data management resources, according to The Open Group Architecture Framework (TOGAF). An organizations data architecture is the purview of data architects. Curate the data.

Data Architecture

Data Architecture Management Consulting Internet of Things

Run Apache XTable in AWS Lambda for background conversion of open table formats

AWS Big Data

NOVEMBER 26, 2024

Initially, data warehouses were the go-to solution for structured data and analytical workloads but were limited by proprietary storage formats and their inability to handle unstructured data. Eventually, transactional data lakes emerged to add transactional consistency and performance of a data warehouse to the data lake.

Metadata

Metadata Data Lake Snapshot Data Warehouse

MLOps and DevOps: Why Data Makes It Different

O'Reilly on Data

OCTOBER 19, 2021

Much has been written about struggles of deploying machine learning projects to production. As with many burgeoning fields and disciplines, we don’t yet have a shared canonical infrastructure stack or best practices for developing and deploying data-intensive applications. However, the concept is quite abstract.

IT

IT Testing Experimentation Software

Of Muffins and Machine Learning Models

Cloudera

FEBRUARY 16, 2022

In this example, the Machine Learning (ML) model struggles to differentiate between a chihuahua and a muffin. Will the model correctly determine it is a muffin or get confused and think it is a chihuahua? The extent to which we can predict how the model will classify an image given a change input (e.g.

Machine Learning

Machine Learning Modeling Metadata Recreation/Entertainment

Build a serverless transactional data lake with Apache Iceberg, Amazon EMR Serverless, and Amazon Athena

AWS Big Data

MARCH 10, 2023

Since the deluge of big data over a decade ago, many organizations have learned to build applications to process and analyze petabytes of data. Data lakes have served as a central repository to store structured and unstructured data at any scale and in various formats.

Data Lake

Data Lake Sales Data Warehouse Snapshot

Choosing an open table format for your transactional data lake on AWS

AWS Big Data

JUNE 9, 2023

A modern data architecture enables companies to ingest virtually any type of data through automated pipelines into a data lake, which provides highly durable and cost-effective object storage at petabyte or exabyte scale.

Data Lake

Data Lake Metadata Statistics Optimization

Navigating Data Entities, BYOD, and Data Lakes in Microsoft Dynamics

Jet Global

SEPTEMBER 4, 2020

Consultants and developers familiar with the AX data model could query the database using any number of different tools, including a myriad of different report writers. Data entities are more secure and arguably easier to master than the relational database model, but one downside is there are lots of them! Data Lakes.

Data Lake

Data Lake OLAP Data Warehouse Unstructured Data

Federated Learning, Machine Learning, Decentralized Data

Cloudera

DECEMBER 8, 2020

Federated Learning is a paradigm in which machine learning models are trained on decentralized data. Instead of collecting data on a single server or data lake, it remains in place — on smartphones, industrial sensing equipment, and other edge devices — and models are trained on-device.

Machine Learning

Machine Learning Data Lake Reporting Data Collection

Complexity Drives Costs: A Look Inside BYOD and Azure Data Lakes

Jet Global

NOVEMBER 5, 2020

That stands for “bring your own database,” and it refers to a model in which core ERP data are replicated to a separate standalone database used exclusively for reporting. OLAP reporting based on a data warehouse model is a well-proven solution for companies with robust reporting requirements. Option 3: Azure Data Lakes.

Data Lake

Data Lake OLAP Data Warehouse Unstructured Data

Design a data mesh pattern for Amazon EMR-based data lakes using AWS Lake Formation with Hive metastore federation

AWS Big Data

JUNE 10, 2024

Use cases for Hive metastore federation for Amazon EMR Hive metastore federation for Amazon EMR is applicable to the following use cases: Governance of Amazon EMR-based data lakes – Producers generate data within their AWS accounts using an Amazon EMR-based data lake supported by EMRFS on Amazon Simple Storage Service (Amazon S3)and HBase.

Data Lake

Data Lake Metadata Data Warehouse Data Processing

Deriving Value from Data Lakes with AI

Sisense

DECEMBER 23, 2019

Artificial Intelligence and machine learning are the future of every industry, especially data and analytics. Let’s talk about AI and machine learning (ML). AI and ML are the only ways to derive value from massive data lakes, cloud-native data warehouses, and other huge stores of information.

Data Lake

Data Lake Machine Learning Data Warehouse Digital Transformation

Build Write-Audit-Publish pattern with Apache Iceberg branching and AWS Glue Data Quality

AWS Big Data

DECEMBER 9, 2024

Given the importance of data in the world today, organizations face the dual challenges of managing large-scale, continuously incoming data while vetting its quality and reliability. One of its key features is the ability to manage data using branches. The data is visualized using matplotlib for interactive data analysis.

Data Quality

Data Quality Publishing Snapshot Data Lake

Orca Security’s journey to a petabyte-scale data lake with Apache Iceberg and AWS Analytics

AWS Big Data

JULY 20, 2023

With data becoming the driving force behind many industries today, having a modern data architecture is pivotal for organizations to be successful. In this post, we describe Orca’s journey building a transactional data lake using Amazon Simple Storage Service (Amazon S3), Apache Iceberg, and AWS Analytics.

Data Lake

Data Lake Analytics Snapshot Data Quality

Azure Data Sources for Data Science and Machine Learning

Jen Stirrup

MAY 5, 2020

Azure allows you to protect your enterprise data assets, using Azure Active Directory and setting up your virtual network. Other technologies, such as Azure Data Factory, can help process large amounts of data around in the cloud. The data is also distributed. So, Azure Databricks connects to many different data sources.

Machine Learning

Machine Learning Data Science Data Lake Big Data

TransUnion transforms its business model with IT

CIO Business Intelligence

APRIL 26, 2024

billion acquisition of data and analytics company Neustar in 2021, TransUnion has expanded into other services such as marketing, fraud detection and prevention, and robust analytical services. The multilayered data platform will enable TransUnion’s customers to perform deep analytics and build complex AI models.

Modeling

Modeling IT Machine Learning Data Governance

Enriching metadata for accurate text-to-SQL generation for Amazon Athena

AWS Big Data

OCTOBER 14, 2024

Enterprise data is brought into data lakes and data warehouses to carry out analytical, reporting, and data science use cases using AWS analytical services like Amazon Athena , Amazon Redshift , Amazon EMR , and so on. foundation model (FM) in Amazon Bedrock as the LLM. Can it also help write SQL queries?

Metadata

Metadata Data Lake Modeling Data Warehouse

Simplify operational data processing in data lakes using AWS Glue and Apache Hudi

AWS Big Data

SEPTEMBER 13, 2023

A modern data architecture is an evolutionary architecture pattern designed to integrate a data lake, data warehouse, and purpose-built stores with a unified governance model. Of those tables, some are larger (such as in terms of record volume) than others, and some are updated more frequently than others.

Data Lake

Data Lake Data Processing Metadata Snapshot

Eight Top DataOps Trends for 2022

DataKitchen

NOVEMBER 29, 2021

In 2022, data organizations will institute robust automated processes around their AI systems to make them more accountable to stakeholders. Model developers will test for AI bias as part of their pre-deployment testing. Continuous testing, monitoring and observability will prevent biased models from deploying or continuing to operate.

Testing

Testing Data Lake Data Architecture Manufacturing

Outdated business apps can cloud your AI vision

CIO Business Intelligence

FEBRUARY 20, 2025

When building a machine-learning-powered tool to predict the maintenance needs of its customers, Ensono found that its customers used multiple old apps to collect incident tickets, but those apps stored incident data in very different formats, with inconsistent types of data collected, he says.

Insurance

Insurance Cost-Benefit Unstructured Data Data Lake

What is a Data Mesh?

DataKitchen

AUGUST 3, 2021

First-generation – expensive, proprietary enterprise data warehouse and business intelligence platforms maintained by a specialized team drowning in technical debt. Second-generation – gigantic, complex data lake maintained by a specialized team drowning in technical debt.

Data Architecture

Data Architecture Data Lake Cost-Benefit Data Warehouse

Data Lakes: What Are They and Who Needs Them?

Jet Global

JULY 2, 2019

To address the flood of data and the needs of enterprise businesses to store, sort, and analyze that data, a new storage solution has evolved: the data lake. What’s in a Data Lake? Data warehouses do a great job of standardizing data from disparate sources for analysis. Taking a Dip.

Data Lake

Data Lake Data Warehouse Big Data Machine Learning

MongoDB Enhances Developer Data Platform

David Menninger's Analyst Perspectives

JANUARY 21, 2025

I assert that through 2026, almost all enterprises developing applications based on GenAI will explore vector search and retrieval-augmented generation (RAG) to complement foundation models with proprietary data and content.

Data Lake

Data Lake IoT Cost-Benefit Enterprise

Enable business users to analyze large datasets in your data lake with Amazon QuickSight

AWS Big Data

JUNE 23, 2023

Events and many other security data types are stored in Imperva’s Threat Research Multi-Region data lake. Imperva harnesses data to improve their business outcomes. As part of their solution, they are using Amazon QuickSight to unlock insights from their data.

Data Lake

Data Lake Dashboards Cost-Benefit Data Warehouse

Interview with: Sankar Narayanan, Chief Practice Officer at Fractal Analytics

Corinium

JUNE 6, 2019

Some of the work is very foundational, such as building an enterprise data lake and migrating it to the cloud, which enables other more direct value-added activities such as self-service. It is also important to have a strong test and learn culture to encourage rapid experimentation.

Insurance

Insurance Analytics Forecasting Deep Learning

Data transformation takes flight at Atlanta’s Hartsfield-Jackson airport

CIO Business Intelligence

AUGUST 9, 2024

At Atlanta’s Hartsfield-Jackson International Airport, an IT pilot has led to a wholesale data journey destined to transform operations at the world’s busiest airport, fueled by machine learning and generative AI. Data integrity presented a major challenge for the team, as there were many instances of duplicate data.

Data Transformation

Data Transformation Machine Learning Data Lake Dashboards

Build a high-performance quant research platform with Apache Iceberg

AWS Big Data

JANUARY 9, 2025

These features allow efficient data corrections, gap-filling in time series, and historical data updates without disrupting ongoing analyses or compromising data integrity. Unlike direct Amazon S3 access, Iceberg supports these operations on petabyte-scale data lakes without requiring complex custom code.

Metadata

Metadata Snapshot Cost-Benefit Optimization

Monitoring Apache Iceberg metadata layer using AWS Lambda, AWS Glue, and AWS CloudWatch

AWS Big Data

JULY 29, 2024

In the era of big data, data lakes have emerged as a cornerstone for storing vast amounts of raw data in its native format. They support structured, semi-structured, and unstructured data, offering a flexible and scalable environment for data ingestion from multiple sources.

Metadata

Metadata Snapshot Data Lake Metrics

Build a transactional data lake using Apache Iceberg, AWS Glue, and cross-account data shares using AWS Lake Formation and Amazon Athena

AWS Big Data

APRIL 24, 2023

Building a data lake on Amazon Simple Storage Service (Amazon S3) provides numerous benefits for an organization. However, many use cases, like performing change data capture (CDC) from an upstream relational database to an Amazon S3-based data lake, require handling data at a record level.

Data Lake

Data Lake Data Governance Machine Learning Cost-Benefit

Rocket Mortgage lays foundation for generative AI success

CIO Business Intelligence

MARCH 29, 2024

That’s why Rocket Mortgage has been a vigorous implementor of machine learning and AI technologies — and why CIO Brian Woodring emphasizes a “human in the loop” AI strategy that will not be pinned down to any one generative AI model. The rest are on premises.

Data Lake

Data Lake Machine Learning Data Warehouse Unstructured Data

Data Quality Power Moves: Scorecards & Data Checks for Organizational Impact

DataKitchen

SEPTEMBER 18, 2024

These leaders are expected to influence organizational behavior without direct authority, leading to what DataKitchen CEO Christopher Bergh described as “data nags”—individuals who know what’s wrong but struggle to get others to act. Who should make the change (data engineers, system owners, or data quality professionals).

Scorecard

Scorecard Data Quality Measurement Testing

Building a vision for real-time artificial intelligence

CIO Business Intelligence

APRIL 12, 2023

Real-time AI involves processing data for making decisions within a given time frame. Real-time AI brings together streaming data and machine learning algorithms to make fast and automated decisions; examples include recommendations, fraud detection, security monitoring, and chatbots. It isn’t easy.

Machine Learning

Machine Learning Cost-Benefit Data-driven Strategy

Accelerate data science feature engineering on transactional data lakes using Amazon Athena with Apache Iceberg

AWS Big Data

JUNE 20, 2023

It manages large collections of files as tables, and it supports modern analytical data lake operations such as record-level insert, update, delete, and time travel queries. Data labeling is required for various use cases, including forecasting, computer vision, natural language processing, and speech recognition.

Data Lake

Data Lake Data Science Recreation/Entertainment Experimentation

7 key Microsoft Azure analytics services (plus one extra)

CIO Business Intelligence

JUNE 29, 2022

Taking the broadest possible interpretation of data analytics , Azure offers more than a dozen services — and that’s before you include Power BI, with its AI-powered analysis and new datamart option , or governance-oriented approaches such as Microsoft Purview. Azure Data Factory. Azure Data Lake Analytics.

Data Lake

Data Lake Analytics Data Warehouse Machine Learning

Modernize your legacy databases with AWS data lakes, Part 2: Build a data lake using AWS DMS data on Apache Iceberg

5 things on our data and AI radar for 2021

Webinars

Trending Sources

Rapidminer Platform Supports Entire Data Science Lifecycle

Webinars

Understanding the Differences Between Data Lakes and Data Warehouses

Databricks Lakehouse Platform Streamlines Big Data Processing

How EUROGATE established a data mesh architecture using Amazon DataZone

NVIDIA RAPIDS in Cloudera Machine Learning

The next generation of Amazon SageMaker: The center for all your data, analytics, and AI

How BMW streamlined data access using AWS Lake Formation fine-grained access control

Recap of Amazon Redshift key product announcements in 2024

Accelerate analytics and AI innovation with the next generation of Amazon SageMaker

What is data architecture? A framework to manage data

Run Apache XTable in AWS Lambda for background conversion of open table formats

MLOps and DevOps: Why Data Makes It Different

Of Muffins and Machine Learning Models

Build a serverless transactional data lake with Apache Iceberg, Amazon EMR Serverless, and Amazon Athena

Choosing an open table format for your transactional data lake on AWS

Navigating Data Entities, BYOD, and Data Lakes in Microsoft Dynamics

Federated Learning, Machine Learning, Decentralized Data

Complexity Drives Costs: A Look Inside BYOD and Azure Data Lakes

Design a data mesh pattern for Amazon EMR-based data lakes using AWS Lake Formation with Hive metastore federation

Deriving Value from Data Lakes with AI

Build Write-Audit-Publish pattern with Apache Iceberg branching and AWS Glue Data Quality

Orca Security’s journey to a petabyte-scale data lake with Apache Iceberg and AWS Analytics

Azure Data Sources for Data Science and Machine Learning

TransUnion transforms its business model with IT

Enriching metadata for accurate text-to-SQL generation for Amazon Athena

Simplify operational data processing in data lakes using AWS Glue and Apache Hudi

Eight Top DataOps Trends for 2022

Outdated business apps can cloud your AI vision

What is a Data Mesh?

Data Lakes: What Are They and Who Needs Them?

MongoDB Enhances Developer Data Platform

Enable business users to analyze large datasets in your data lake with Amazon QuickSight

Interview with: Sankar Narayanan, Chief Practice Officer at Fractal Analytics

Data transformation takes flight at Atlanta’s Hartsfield-Jackson airport

Build a high-performance quant research platform with Apache Iceberg

Monitoring Apache Iceberg metadata layer using AWS Lambda, AWS Glue, and AWS CloudWatch

Build a transactional data lake using Apache Iceberg, AWS Glue, and cross-account data shares using AWS Lake Formation and Amazon Athena

Rocket Mortgage lays foundation for generative AI success

Data Quality Power Moves: Scorecards & Data Checks for Organizational Impact

Building a vision for real-time artificial intelligence

Accelerate data science feature engineering on transactional data lakes using Amazon Athena with Apache Iceberg

7 key Microsoft Azure analytics services (plus one extra)

Stay Connected