This is part two of a three-part series showing how to build a data lake on AWS using a modern data architecture. This post shows how to load data from a legacy database (SQL Server) into a transactional data lake (Apache Iceberg) using AWS Glue.
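A minimal sketch of that load path, assuming an AWS Glue 4.0 PySpark job with Iceberg support enabled; the JDBC URL, credentials, table names, and catalog name are hypothetical placeholders, not values from the post.

```python
# Hedged sketch: read a SQL Server table over JDBC, write it to an Apache
# Iceberg table in the Glue Data Catalog. Assumes the Glue job was created
# with Iceberg enabled (--datalake-formats iceberg) and a Spark catalog
# named "glue_catalog" configured; all connection values are placeholders.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Read a table from the legacy SQL Server database over JDBC.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:sqlserver://my-server:1433;databaseName=sales")  # placeholder
      .option("dbtable", "dbo.orders")                                      # placeholder
      .option("user", "etl_user")                                           # placeholder
      .option("password", "********")
      .load())

# Write into an Iceberg table registered in the Glue Data Catalog.
df.writeTo("glue_catalog.lakehouse.orders").createOrReplace()
```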
Amazon Redshift enables you to efficiently query and retrieve structured and semi-structured data from open-format files in an Amazon S3 data lake without having to load the data into Amazon Redshift tables. Amazon Redshift extends SQL capabilities to your data lake, enabling you to run analytical queries.
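A minimal sketch of querying open-format files in S3 in place via the Redshift Data API; the workgroup, database, external schema, and table names are hypothetical placeholders.

```python
# Hedged sketch: run a query against an external (Spectrum) schema that maps
# to files in S3, without loading them into Redshift tables.
import boto3

client = boto3.client("redshift-data")
resp = client.execute_statement(
    WorkgroupName="my-serverless-workgroup",  # placeholder; provisioned clusters use ClusterIdentifier
    Database="dev",
    Sql="SELECT event_type, COUNT(*) FROM spectrum.clickstream GROUP BY event_type;",
)
print(resp["Id"])  # statement id; poll describe_statement / get_statement_result for rows
```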
MongoDB was founded in 2007 and has established itself as one of the most prominent NoSQL database providers with its document-oriented database and associated cloud services. MongoDB has benefited from a focus on the needs of development teams to deliver innovation through the development of data-driven applications.
A modern data architecture enables companies to ingest virtually any type of data through automated pipelines into a data lake, which provides highly durable and cost-effective object storage at petabyte or exabyte scale.
Enterprise data is brought into data lakes and data warehouses to carry out analytical, reporting, and data science use cases using AWS analytical services like Amazon Athena, Amazon Redshift, Amazon EMR, and so on. The AWS Glue crawler then populates the additional metadata in the AWS Glue Data Catalog.
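A hedged sketch of that crawler step: create and start an AWS Glue crawler that populates the Data Catalog from an S3 prefix. The role ARN, database, and path are hypothetical placeholders.

```python
import boto3

glue = boto3.client("glue")
# Create a crawler pointed at a raw-zone S3 prefix (all names are placeholders).
glue.create_crawler(
    Name="lake-raw-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="raw_zone",
    Targets={"S3Targets": [{"Path": "s3://my-data-lake/raw/"}]},
)
# Run it; discovered tables and schemas land in the Glue Data Catalog.
glue.start_crawler(Name="lake-raw-crawler")
```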
You can secure and centrally manage your data in the lakehouse by defining fine-grained permissions with Lake Formation that are consistently applied across all analytics and machine learning (ML) tools and engines. Set up a data lake admin. For instructions, see Create a data lake administrator.
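A minimal sketch of one such fine-grained grant, assuming the data lake administrator is already set up; the principal ARN, database, and table names are hypothetical placeholders.

```python
import boto3

lf = boto3.client("lakeformation")
# Grant an analyst role SELECT on a single catalog table (placeholders throughout).
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/AnalystRole"},
    Resource={"Table": {"DatabaseName": "lakehouse", "Name": "orders"}},
    Permissions=["SELECT"],
)
```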
In the era of big data, data lakes have emerged as a cornerstone for storing vast amounts of raw data in its native format. They support structured, semi-structured, and unstructured data, offering a flexible and scalable environment for data ingestion from multiple sources.
Comparison of modern data architectures (columns: Architecture, Definition, Strengths, Weaknesses, Best used when):
- Data warehouse: a centralized, structured, and curated data repository. Weaknesses: inflexible schema; poor for unstructured or real-time data.
- Data lake: raw storage for all types of structured and unstructured data.
In today’s data-driven world, organizations are constantly seeking efficient ways to process and analyze vast amounts of information across data lakes and warehouses. This post will showcase how this data can also be queried by other data teams using Amazon Athena. Verify that you have Python version 3.7
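A hedged sketch of that cross-team Athena access: run a query against a Data Catalog table and direct results to an S3 bucket. The database, table, and output location are hypothetical placeholders.

```python
import boto3

athena = boto3.client("athena")
resp = athena.start_query_execution(
    QueryString="SELECT * FROM lakehouse.orders LIMIT 10;",
    QueryExecutionContext={"Database": "lakehouse"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder bucket
)
print(resp["QueryExecutionId"])  # poll get_query_execution until SUCCEEDED, then fetch results
```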
In today’s data-driven business environment, organizations face the challenge of efficiently preparing and transforming large amounts of data for analytics and data science purposes. Businesses need to build data warehouses and data lakes based on operational data.
When it was no longer a hard requirement that a physical data model be created upon the ingestion of data, there was a resulting drop in richness of the description and consistency of the data stored in Hadoop. You did not have to understand or prepare the data to get it into Hadoop, so people rarely did.
Organizations run millions of Apache Spark applications each month on AWS, moving, processing, and preparing data for analytics and machine learning. Data practitioners need to upgrade to the latest Spark releases to benefit from performance improvements, new features, bug fixes, and security enhancements.
At the same time, they need to optimize operational costs to unlock the value of this data for timely insights, and do so with consistent performance. With this massive data growth, data proliferation across your data stores, data warehouse, and data lakes can become equally challenging.
It now also supports PDF documents. Azure Data Factory preserves metadata during file copy: when performing a file copy between Amazon S3, Azure Blob, and Azure Data Lake Gen 2, the metadata is copied as well. Azure Tips and Tricks: make your data searchable, a quick video demonstrating Azure Search.
Foundation models (FMs) are large machine learning (ML) models trained on a broad spectrum of unlabeled and generalized datasets. To learn more about RAG, refer to Question answering using Retrieval Augmented Generation with foundation models in Amazon SageMaker JumpStart.
Analytics remained one of the key focus areas this year, with significant updates and innovations aimed at helping businesses harness their data more efficiently and accelerate insights. From enhancing datalakes to empowering AI-driven analytics, AWS unveiled new tools and services that are set to shape the future of data and analytics.
They’re also implementing a cloud-based datalake and analytics solution that will provide what Tandon calls a single source of truth, and drive self-service analytics and data-backed decision-making to help them operate more efficiently.
Text, images, audio, and videos are common examples of unstructured data. Most companies produce and consume unstructured data such as documents, emails, web pages, engagement center phone calls, and social media. The steps of the workflow are as follows: integrated AI services extract information from the unstructured data.
The need for an end-to-end strategy for data management and data governance at every step of the journey—from ingesting, storing, and querying data to analyzing, visualizing, and running artificial intelligence (AI) and machine learning (ML) models—continues to be of paramount importance for enterprises.
Today, we are pleased to announce new AWS Glue connectors for Azure Blob Storage and Azure Data Lake Storage that allow you to move data bi-directionally between Azure Blob Storage, Azure Data Lake Storage, and Amazon Simple Storage Service (Amazon S3). Learn more in the README. The excerpt's truncated read example is reconstructed below.
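A reconstruction of that truncated fragment: read a CSV from Azure Blob Storage and land it in S3. The `.option("header","true").load("wasbs://...")` call and the blob URL come from the excerpt; everything around it (session setup, format, the S3 write) is an assumption, not code from the post.

```python
# Assumes the Glue/Spark environment is configured with the Azure connector
# and credentials for the wasbs:// path; the S3 target bucket is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read.format("csv")
      .option("header", "true")
      .load("wasbs://yourblob@youraccountname.blob.core.windows.net/loadingtest-input/100mb"))

df.write.mode("overwrite").parquet("s3://my-target-bucket/loadingtest-output/")
```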
Deploying new data types for machine learning. Mai-Lan Tomsen-Bukovec, vice president of foundational data services at AWS, sees the cloud giant's enterprise customers deploying more unstructured data, as well as wider varieties of data sets, to inform the accuracy and training of ML models of late.
In total, it took the CIO’s team and agency a little over two years to convert 160 million documents into a transformed, revamped, and people-centric system, built on the Salesforce CRM, that tells their stories and focuses on people outcomes, not case outcomes.
For NoSQL, data lakes, and data lakehouses, data modeling of both structured and unstructured data is somewhat novel and thorny. This blog is an introduction to some advanced NoSQL and data lake database design techniques, while avoiding common pitfalls.
The following diagram depicts the end-to-end process involved for sharing Salesforce Data Cloud data with Amazon Redshift in the same Region using a Zero Copy architecture. This architecture follows the pattern documented in Cross-account data sharing best practices and considerations.
Amazon Redshift is a fast, scalable, and fully managed cloud data warehouse that allows you to process and run your complex SQL analytics workloads on structured and semi-structured data. It also helps you securely access your data in operational databases, data lakes, or third-party datasets with minimal movement or copying of data.
Previously head of cybersecurity at Ingersoll-Rand, Melby started developing neural networks and machine learning models more than a decade ago. I was literally just waiting for commercial availability [of LLMs] but [services] like Azure Machine Learning made it so you could easily apply it to your data.
LLMs could automate the extraction and summarization of key information from these documents, enabling analysts to query the LLM and receive reliable summaries. This would allow analysts to process the documents and develop investment recommendations faster and more efficiently.
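A hedged sketch of that summarization flow using the Amazon Bedrock Converse API; the model ID and document text are placeholders, and the original article may use a different model or service entirely.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")
document_text = "..."  # extracted text of a filing or report (placeholder)

resp = bedrock.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # placeholder model choice
    messages=[{
        "role": "user",
        "content": [{"text": f"Summarize the key figures and risks in:\n{document_text}"}],
    }],
)
print(resp["output"]["message"]["content"][0]["text"])
```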
Cloudera Data Warehouse to perform ETL operations. Cloudera Machine Learning to train and serve machine learning and AI models. The following image shows how COD integrates within your enterprise data lifecycle, alongside Cloudera Shared Data Experience (SDX) and Cloudera Machine Learning.
Leaders rely less on data mart deployment than on lean, flexible architectures and usable data based on cloud services, a complementary data lake, data governance, data hubs, and data catalogs. They are opting for cloud data services more frequently. (BARC Report: Modernizing the Data Warehouse.)
It outlines a scenario in which “recently married people might want to change their names on their driver’s licenses or other documentation. That should be easy, but when agencies don’t share data or applications, they don’t have a unified view of people.” (Towards Data Science; Forrester.)
At the core, digital at Dow is about changing how we work, which includes how we interact with systems, data, and each other to be more productive and to grow. Data is at the heart of everything we do today, from AI to machine learning or generative AI. A significant Copilot use case has been finding documents.
After some impressive advances over the past decade, largely thanks to the techniques of Machine Learning (ML) and Deep Learning, the technology seems to have taken a sudden leap forward. A data store built on open lakehouse architecture, it runs both on premises and across multi-cloud environments. Watsonx.ai.
Have you ever considered how much data a single person generates in a day? Every web document, scanned document, email, social media post, and media download? One estimate states that “on average, people will produce 463 exabytes of data per day by 2025.” Now consider that the federal government has approximately 2.8
The document and key-value data models give you the flexibility to adjust the schema of the conversation state over time. You can store that data in relational databases like Amazon Aurora, NoSQL databases, or graph databases like Amazon Neptune.
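A minimal sketch of that flexibility with a key-value store, here Amazon DynamoDB; the table and attribute names are hypothetical. New fields can be added per item without a schema migration, which is exactly the adjustability the excerpt describes.

```python
import boto3

table = boto3.resource("dynamodb").Table("conversation_state")  # placeholder table
table.put_item(Item={
    "session_id": "abc-123",
    "turn": 4,
    "history": [
        {"role": "user", "text": "What's my order status?"},
        {"role": "assistant", "text": "Order 1001 shipped yesterday."},
    ],
    "sentiment": "neutral",  # a later-added field; no schema change needed
})
```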
Despite the worldwide chaos, UAE national airline Etihad has managed to generate productivity gains and cost savings from insights using data science. Etihad began its data science journey with the Cloudera Data Platform and moved its data to the cloud to set up a data lake.
By adopting a custom developed application based on the Cloudera ecosystem, Carrefour has combined the legacy systems into one platform which provides access to customer data in a single data lake. The solution implemented the Cloudera Data Platform (CDP) to better support high-performance computing.
Data architect Armando Vázquez identifies eight common types of data architects: Enterprise data architect: These data architects oversee an organization’s overall data architecture, defining data architecture strategy and designing and implementing architectures.
In modern enterprises, the exponential growth of data means organizational knowledge is distributed across multiple formats, ranging from structured data stores such as data warehouses to multi-format data stores like data lakes. This makes gathering information for decision making a challenge.
Using these adapters, Cloudera customers can use dbt to collaborate, test, deploy, and document their data transformation and analytic pipelines on CDP Public Cloud, CDP One, and CDP Private Cloud. CDP Public Cloud via Cloudera Machine Learning. CDP Private Cloud via Cloudera Data Science Workbench. dbt-impala.
The trend has been toward using cloud-based applications and tools for different functions, such as Salesforce for sales, Marketo for marketing automation, and large-scale data storage like AWS, or data lakes such as Amazon S3, Hadoop, and Microsoft Azure. Sisense provides instant access to your cloud data warehouses.
Storing data in a proprietary, single-workload solution also recreates dangerous data silos all over again, as it locks out other types of workloads over the same shared data. The Data Lake service in Cloudera’s Data Platform provides a central place to understand, manage, secure, and govern data assets across the enterprise.
By using AWS Glue to integrate data from Snowflake, Amazon S3, and SaaS applications, organizations can unlock new opportunities in generative artificial intelligence (AI), machine learning (ML), business intelligence (BI), and self-service analytics, or feed data to underlying applications.
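A hedged sketch of the Snowflake side of that integration, using Glue's Snowflake connection type. The connection name, database, schema, table, and output path are placeholders, and the exact option keys should be checked against the AWS Glue Snowflake connector documentation for your Glue version.

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read a Snowflake table through a pre-created Glue connection (all names
# are placeholders; option keys are assumptions based on the Glue connector).
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="snowflake",
    connection_options={
        "connectionName": "my-snowflake-connection",
        "dbtable": "ORDERS",
        "sfDatabase": "SALES",
        "sfSchema": "PUBLIC",
    },
)

# Land the data in the S3 data lake for downstream BI/ML consumers.
dyf.toDF().write.parquet("s3://my-data-lake/snowflake/orders/")
```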
Many CIOs argue the rise of big data pushed people to use data more proactively for business decision-making. Big data got “more leaders and people in the organization to use data, analytics, and machine learning in their decision making,” says former CIO Isaac Sacolick. But big data can grow too big, too fast.
The release of intellectual property and non-public information Generative AI tools can make it easy for well-meaning users to leak sensitive and confidential data. Once shared, this data can be fed into the data lakes used to train large language models (LLMs) and can be discovered by other users.