Data Lake and Modeling - Data Leaders Brief

Modernize your legacy databases with AWS data lakes, Part 2: Build a data lake using AWS DMS data on Apache Iceberg

AWS Big Data

OCTOBER 30, 2024

This is part two of a three-part series where we show how to build a data lake on AWS using a modern data architecture. This post shows how to load data from a legacy database (SQL Server) into a transactional data lake ( Apache Iceberg ) using AWS Glue. Delete the bucket.

Data Lake

Data Lake Data Processing Optimization Machine Learning

How a Delta Lake is Process with Azure Synapse Analytics

Analytics Vidhya

JULY 29, 2022

This article was published as a part of the Data Science Blogathon. Introduction We are all pretty much familiar with the common modern cloud data warehouse model, which essentially provides a platform comprising a data lake (based on a cloud storage account such as Azure Data Lake Storage Gen2) AND a data warehouse compute engine […].

Data Lake

Data Lake Data Warehouse Analytics Data Science

From data lakes to insights: dbt adapter for Amazon Athena now supported in dbt Cloud

AWS Big Data

NOVEMBER 22, 2024

The need for streamlined data transformations As organizations increasingly adopt cloud-based data lakes and warehouses, the demand for efficient data transformation tools has grown. This saves time and effort, especially for teams looking to minimize infrastructure management and focus solely on data modeling.

Data Lake

Data Lake Data Warehouse Cost-Benefit Data Transformation

Webinars

Going Beyond Chatbots: Connecting AI to Your Tools, Systems, & Data

Automation, Evolved: Your New Playbook for Smarter Knowledge Work

MORE WEBINARS

5 things on our data and AI radar for 2021

O'Reilly on Data

FEBRUARY 19, 2021

The data that powers ML applications is as important as code, making version control difficult; outputs are probabilistic rather than deterministic, making testing difficult; training a model is processor intensive and time consuming, making rapid build/deploy cycles difficult. A Wave of Cloud-Native, Distributed Data Frameworks.

Data Lake

Data Lake Machine Learning Data Warehouse Modeling

Rapidminer Platform Supports Entire Data Science Lifecycle

David Menninger's Analyst Perspectives

SEPTEMBER 16, 2021

Rapidminer is a visual enterprise data science platform that includes data extraction, data mining, deep learning, artificial intelligence and machine learning (AI/ML) and predictive analytics. It can support AI/ML processes with data preparation, model validation, results visualization and model optimization.

Data Science

Data Science Data Lake Data mining Deep Learning

Understanding the Differences Between Data Lakes and Data Warehouses

Smart Data Collective

AUGUST 28, 2021

Data lakes and data warehouses are probably the two most widely used structures for storing data. Data Warehouses and Data Lakes in a Nutshell. A data warehouse is used as a central storage space for large amounts of structured data coming from various sources. Data Type and Processing.

Data Lake

Data Lake Data Warehouse Unstructured Data Structured Data

Unleash deeper insights with Amazon Redshift data sharing for data lake tables

AWS Big Data

OCTOBER 10, 2024

Over the years, this customer-centric approach has led to the introduction of groundbreaking features such as zero-ETL , data sharing , streaming ingestion , data lake integration , Amazon Redshift ML , Amazon Q generative SQL , and transactional data lake capabilities.

Data Lake

Data Lake Data Warehouse Recreation/Entertainment Data-driven

Recap of Amazon Redshift key product announcements in 2024

AWS Big Data

DECEMBER 17, 2024

Today, Amazon Redshift is used by customers across all industries for a variety of use cases, including data warehouse migration and modernization, near real-time analytics, self-service analytics, data lake analytics, machine learning (ML), and data monetization.

Data Lake

Data Lake Data Warehouse Data-driven Optimization

Choosing an open table format for your transactional data lake on AWS

AWS Big Data

JUNE 9, 2023

A modern data architecture enables companies to ingest virtually any type of data through automated pipelines into a data lake, which provides highly durable and cost-effective object storage at petabyte or exabyte scale.

Data Lake

Data Lake Metadata Statistics Optimization

Build a serverless transactional data lake with Apache Iceberg, Amazon EMR Serverless, and Amazon Athena

AWS Big Data

MARCH 10, 2023

Since the deluge of big data over a decade ago, many organizations have learned to build applications to process and analyze petabytes of data. Data lakes have served as a central repository to store structured and unstructured data at any scale and in various formats.

Data Lake

Data Lake Sales Data Warehouse Snapshot

Enrich your serverless data lake with Amazon Bedrock

AWS Big Data

SEPTEMBER 26, 2024

For many organizations, this centralized data store follows a data lake architecture. Although data lakes provide a centralized repository, making sense of this data and extracting valuable insights can be challenging. We recommend testing your use case and data with different models.

Data Lake

Data Lake Cost-Benefit Unstructured Data Modeling

Seamless integration of data lake and data warehouse using Amazon Redshift Spectrum and Amazon DataZone

AWS Big Data

AUGUST 15, 2024

Unlocking the true value of data often gets impeded by siloed information. Traditional data management—wherein each business unit ingests raw data in separate data lakes or warehouses—hinders visibility and cross-functional analysis. Amazon DataZone natively supports data sharing for Amazon Redshift data assets.

Data Lake

Data Lake Data Warehouse Data Governance Publishing

The success of GenAI models lies in your data management strategy

CIO Business Intelligence

OCTOBER 9, 2024

Depending on your needs, large language models (LLMs) may not be necessary for your operations, since they are trained on massive amounts of text and are largely for general use. As a result, they may not be the most cost-efficient AI model to adopt, as they can be extremely compute-intensive.

Strategy

Strategy Modeling Management Data Lake

Accelerate analytics and AI innovation with the next generation of Amazon SageMaker

AWS Big Data

MARCH 13, 2025

From within the unified studio, you can discover data and AI assets from across your organization, then work together in projects to securely build and share analytics and AI artifacts, including data, models, and generative AI applications.

Analytics

Analytics Data Lake Data Warehouse Data-driven

United Airlines sets its flight plan for gen AI success

CIO Business Intelligence

DECEMBER 20, 2024

With the core architectural backbone of the airlines gen AI roadmap in place, including United Data Hub and an AI and ML platform dubbed Mars, Birnbaum has released a handful of models into production use for employees and customers alike.

IT

IT Unstructured Data Experimentation Data Lake

The next generation of Amazon SageMaker: The center for all your data, analytics, and AI

AWS Big Data

DECEMBER 4, 2024

They’re taking data they’ve historically used for analytics or business reporting and putting it to work in machine learning (ML) models and AI-powered applications. Amazon SageMaker Unified Studio (Preview) solves this challenge by providing an integrated authoring experience to use all your data and tools for analytics and AI.

Data Analytics

Data Analytics Analytics Data Lake Data Quality

Run Apache XTable in AWS Lambda for background conversion of open table formats

AWS Big Data

NOVEMBER 26, 2024

Initially, data warehouses were the go-to solution for structured data and analytical workloads but were limited by proprietary storage formats and their inability to handle unstructured data. Eventually, transactional data lakes emerged to add transactional consistency and performance of a data warehouse to the data lake.

Metadata

Metadata Data Lake Snapshot Data Warehouse

How BMW streamlined data access using AWS Lake Formation fine-grained access control

AWS Big Data

OCTOBER 29, 2024

This led to inefficiencies in data governance and access control. AWS Lake Formation is a service that streamlines and centralizes the data lake creation and management process. The Solution: How BMW CDH solved data duplication The CDH is a company-wide data lake built on Amazon Simple Storage Service (Amazon S3).

Data Lake

Data Lake Sales Metadata Machine Learning

What is data architecture? A framework to manage data

CIO Business Intelligence

DECEMBER 20, 2024

Data architecture definition Data architecture describes the structure of an organizations logical and physical data assets, and data management resources, according to The Open Group Architecture Framework (TOGAF). An organizations data architecture is the purview of data architects. Curate the data.

Data Architecture

Data Architecture Management Consulting Internet of Things

Bridging the gap between mainframe data and hybrid cloud environments

CIO Business Intelligence

FEBRUARY 27, 2025

Data professionals need to access and work with this information for businesses to run efficiently, and to make strategic forecasting decisions through AI-powered data models. Without integrating mainframe data, it is likely that AI models and analytics initiatives will have blind spots.

Metadata

Metadata Data Lake Cost-Benefit Forecasting

Simplify operational data processing in data lakes using AWS Glue and Apache Hudi

AWS Big Data

SEPTEMBER 13, 2023

A modern data architecture is an evolutionary architecture pattern designed to integrate a data lake, data warehouse, and purpose-built stores with a unified governance model. Of those tables, some are larger (such as in terms of record volume) than others, and some are updated more frequently than others.

Data Lake

Data Lake Data Processing Metadata Snapshot

Design a data mesh pattern for Amazon EMR-based data lakes using AWS Lake Formation with Hive metastore federation

AWS Big Data

JUNE 10, 2024

Use cases for Hive metastore federation for Amazon EMR Hive metastore federation for Amazon EMR is applicable to the following use cases: Governance of Amazon EMR-based data lakes – Producers generate data within their AWS accounts using an Amazon EMR-based data lake supported by EMRFS on Amazon Simple Storage Service (Amazon S3)and HBase.

Data Lake

Data Lake Metadata Data Warehouse Data Processing

Orca Security’s journey to a petabyte-scale data lake with Apache Iceberg and AWS Analytics

AWS Big Data

JULY 20, 2023

With data becoming the driving force behind many industries today, having a modern data architecture is pivotal for organizations to be successful. In this post, we describe Orca’s journey building a transactional data lake using Amazon Simple Storage Service (Amazon S3), Apache Iceberg, and AWS Analytics.

Data Lake

Data Lake Analytics Snapshot Data Quality

MLOps and DevOps: Why Data Makes It Different

O'Reilly on Data

OCTOBER 19, 2021

Let’s start by considering the job of a non-ML software engineer: writing traditional software deals with well-defined, narrowly-scoped inputs, which the engineer can exhaustively and cleanly model in the code. Not only is data larger, but models—deep learning models in particular—are much larger than before.

IT

IT Testing Experimentation Software

Eight Top DataOps Trends for 2022

DataKitchen

NOVEMBER 29, 2021

In 2022, data organizations will institute robust automated processes around their AI systems to make them more accountable to stakeholders. Model developers will test for AI bias as part of their pre-deployment testing. Continuous testing, monitoring and observability will prevent biased models from deploying or continuing to operate.

Testing

Testing Data Lake Data Architecture Manufacturing

Implementing a Pharma Data Mesh using DataOps

DataKitchen

AUGUST 19, 2021

Figure 3 shows an example processing architecture with data flowing in from internal and external sources. Each data source is updated on its own schedule, for example, daily, weekly or monthly. The data scientists and analysts have what they need to build analytics for the user. The new Recipes run, and BOOM! Conclusion.

Data Warehouse

Data Warehouse Data Lake Manufacturing Testing

Enriching metadata for accurate text-to-SQL generation for Amazon Athena

AWS Big Data

OCTOBER 14, 2024

Enterprise data is brought into data lakes and data warehouses to carry out analytical, reporting, and data science use cases using AWS analytical services like Amazon Athena , Amazon Redshift , Amazon EMR , and so on. foundation model (FM) in Amazon Bedrock as the LLM. Can it also help write SQL queries?

Metadata

Metadata Data Lake Modeling Data Warehouse

MongoDB Enhances Developer Data Platform

David Menninger's Analyst Perspectives

JANUARY 21, 2025

I assert that through 2026, almost all enterprises developing applications based on GenAI will explore vector search and retrieval-augmented generation (RAG) to complement foundation models with proprietary data and content.

Data Lake

Data Lake IoT Cost-Benefit Enterprise

How EUROGATE established a data mesh architecture using Amazon DataZone

AWS Big Data

JANUARY 15, 2025

In addition to real-time analytics and visualization, the data needs to be shared for long-term data analytics and machine learning applications. To achieve this, EUROGATE designed an architecture that uses Amazon DataZone to publish specific digital twin data sets, enabling access to them with SageMaker in a separate AWS account.

IoT

IoT Machine Learning Metadata Data-driven

Empower your Jira data in a data lake with Amazon AppFlow and AWS Glue

AWS Big Data

AUGUST 1, 2023

Although Jira Cloud provides reporting capability, loading this data into a data lake will facilitate enrichment with other business data, as well as support the use of business intelligence (BI) tools and artificial intelligence (AI) and machine learning (ML) applications. Search for the Jira Cloud connector.

Data Lake

Data Lake Data Transformation Data-driven Cost-Benefit

TransUnion transforms its business model with IT

CIO Business Intelligence

APRIL 26, 2024

Once completed within two years, the platform, OneTru, will give TransUnion and its customers access to TransUnion’s behemoth trove of consumer data to fuel next-generation analytical services, machine learning models and generative AI applications, says Achanta, who is driving the effort, and held similar posts at Neustar and Walmart.

Modeling

Modeling IT Machine Learning Data Governance

Implement tag-based access control for your data lake and Amazon Redshift data sharing with AWS Lake Formation

AWS Big Data

JULY 21, 2023

Data-driven organizations treat data as an asset and use it across different lines of business (LOBs) to drive timely insights and better business decisions. This leads to having data across many instances of data warehouses and data lakes using a modern data architecture in separate AWS accounts.

Data Lake

Data Lake Data Warehouse Marketing Management

What is a Data Mesh?

DataKitchen

AUGUST 3, 2021

First-generation – expensive, proprietary enterprise data warehouse and business intelligence platforms maintained by a specialized team drowning in technical debt. Second-generation – gigantic, complex data lake maintained by a specialized team drowning in technical debt.

Data Architecture

Data Architecture Data Lake Cost-Benefit Data Warehouse

Outdated business apps can cloud your AI vision

CIO Business Intelligence

FEBRUARY 20, 2025

This allows for the extraction and integration of data into AI models without overhauling entire platforms, Erolin says. CIOs should also use data lakes to aggregate information from multiple sources, he adds. AI models can then access the data they need without direct reliance on outdated apps.

Insurance

Insurance Cost-Benefit Unstructured Data Data Lake

Amazon Web Services named a Leader in the 2024 Gartner Magic Quadrant for Data Integration Tools

AWS Big Data

FEBRUARY 26, 2025

This evaluation, we feel, critically examines vendors capabilities to address key service needs, including data engineering, operational data integration, modern data architecture delivery, and enabling less-technical data integration across various deployment models.

Data Integration

Data Integration Data Lake Data Warehouse Unstructured Data

SAP customers on Business Suite: New strategy, same old concerns

CIO Business Intelligence

APRIL 10, 2025

To achieve this, SAP must continuously harmonize its product landscape and consistently implement uniform standards, for example, in data models and identity and security services, said DSAG CTO Sebastian Westphal. Above all, SAP also intends to assume operational responsibility for the integration and provision of uniform data models.

Strategy

Strategy Cost-Benefit Data Lake Modeling

Build a transactional data lake using Apache Iceberg, AWS Glue, and cross-account data shares using AWS Lake Formation and Amazon Athena

AWS Big Data

APRIL 24, 2023

Building a data lake on Amazon Simple Storage Service (Amazon S3) provides numerous benefits for an organization. However, many use cases, like performing change data capture (CDC) from an upstream relational database to an Amazon S3-based data lake, require handling data at a record level.

Data Lake

Data Lake Data Governance Machine Learning Cost-Benefit

Manage concurrent write conflicts in Apache Iceberg on the AWS Glue Data Catalog

AWS Big Data

APRIL 8, 2025

In modern data architectures, Apache Iceberg has emerged as a popular table format for data lakes, offering key features including ACID transactions and concurrent write support. When combined with well-timed maintenance operations, these patterns help build resilient data pipelines that can handle concurrent writes reliably.

Snapshot

Snapshot Management Metadata Big Data

Einstein Studio 1: What it is and what to expect

CIO Business Intelligence

JULY 31, 2024

This year, however, Salesforce has accelerated its agenda, integrating much of its recent work with large language models (LLMs) and machine learning into a low-code tool called Einstein 1 Studio. Einstein 1 Studio is a set of low-code tools to create, customize, and embed AI models in Salesforce workflows. What is Einstein 1 Studio?

Data Lake

Data Lake IT Sales Experimentation

Building end-to-end data lineage for one-time and complex queries using Amazon Athena, Amazon Redshift, Amazon Neptune and dbt

AWS Big Data

DECEMBER 12, 2024

In the context of comprehensive data governance, Amazon DataZone offers organization-wide data lineage visualization using Amazon Web Services (AWS) services, while dbt provides project-level lineage through model analysis and supports cross-project integration between data lakes and warehouses.

Snapshot

Snapshot Recreation/Entertainment Experimentation Data Lake

Data’s dark secret: Why poor quality cripples AI and growth

CIO Business Intelligence

APRIL 8, 2025

These strategies, such as investing in AI-powered cleansing tools and adopting federated governance models, not only address the current data quality challenges but also pave the way for improved decision-making, operational efficiency and customer satisfaction. When financial data is inconsistent, reporting becomes unreliable.

Data Quality

Data Quality Data-driven Key Performance Indicator Metadata

Databricks Lakehouse Platform Streamlines Big Data Processing

David Menninger's Analyst Perspectives

OCTOBER 26, 2021

Databricks is a data engineering and analytics cloud platform built on top of Apache Spark that processes and transforms huge volumes of data and offers data exploration capabilities through machine learning models. The platform supports streaming data, SQL queries, graph processing and machine learning.

Big Data

Big Data Data Processing Machine Learning Modeling

Monitoring Apache Iceberg metadata layer using AWS Lambda, AWS Glue, and AWS CloudWatch

AWS Big Data

JULY 29, 2024

In the era of big data, data lakes have emerged as a cornerstone for storing vast amounts of raw data in its native format. They support structured, semi-structured, and unstructured data, offering a flexible and scalable environment for data ingestion from multiple sources.

Metadata

Metadata Snapshot Data Lake Metrics

2021 Gift Giving Guide for Data Nerds

DataKitchen

DECEMBER 7, 2021

In the book, author Zhamak Dehghani reveals that, despite the time, money, and effort poured into them, data warehouses and data lakes fail when applied at the scale and speed of today’s organizations. A distributed data mesh is a better choice. The book will be available from O’Reilly Media here.

Data-driven

Data-driven Data Governance Big Data Data Science

Modernize your legacy databases with AWS data lakes, Part 2: Build a data lake using AWS DMS data on Apache Iceberg

How a Delta Lake is Process with Azure Synapse Analytics

Webinars

Trending Sources

From data lakes to insights: dbt adapter for Amazon Athena now supported in dbt Cloud

Webinars

5 things on our data and AI radar for 2021

Rapidminer Platform Supports Entire Data Science Lifecycle

Understanding the Differences Between Data Lakes and Data Warehouses

Unleash deeper insights with Amazon Redshift data sharing for data lake tables

Recap of Amazon Redshift key product announcements in 2024

Choosing an open table format for your transactional data lake on AWS

Build a serverless transactional data lake with Apache Iceberg, Amazon EMR Serverless, and Amazon Athena

Enrich your serverless data lake with Amazon Bedrock

Seamless integration of data lake and data warehouse using Amazon Redshift Spectrum and Amazon DataZone

The success of GenAI models lies in your data management strategy

Accelerate analytics and AI innovation with the next generation of Amazon SageMaker

United Airlines sets its flight plan for gen AI success

The next generation of Amazon SageMaker: The center for all your data, analytics, and AI

Run Apache XTable in AWS Lambda for background conversion of open table formats

How BMW streamlined data access using AWS Lake Formation fine-grained access control

What is data architecture? A framework to manage data

Bridging the gap between mainframe data and hybrid cloud environments

Simplify operational data processing in data lakes using AWS Glue and Apache Hudi

Design a data mesh pattern for Amazon EMR-based data lakes using AWS Lake Formation with Hive metastore federation

Orca Security’s journey to a petabyte-scale data lake with Apache Iceberg and AWS Analytics

MLOps and DevOps: Why Data Makes It Different

Eight Top DataOps Trends for 2022

Implementing a Pharma Data Mesh using DataOps

Enriching metadata for accurate text-to-SQL generation for Amazon Athena

MongoDB Enhances Developer Data Platform

How EUROGATE established a data mesh architecture using Amazon DataZone

Empower your Jira data in a data lake with Amazon AppFlow and AWS Glue

TransUnion transforms its business model with IT

Implement tag-based access control for your data lake and Amazon Redshift data sharing with AWS Lake Formation

What is a Data Mesh?

Outdated business apps can cloud your AI vision

Amazon Web Services named a Leader in the 2024 Gartner Magic Quadrant for Data Integration Tools

SAP customers on Business Suite: New strategy, same old concerns

Build a transactional data lake using Apache Iceberg, AWS Glue, and cross-account data shares using AWS Lake Formation and Amazon Athena

Manage concurrent write conflicts in Apache Iceberg on the AWS Glue Data Catalog

Einstein Studio 1: What it is and what to expect

Building end-to-end data lineage for one-time and complex queries using Amazon Athena, Amazon Redshift, Amazon Neptune and dbt

Data’s dark secret: Why poor quality cripples AI and growth

Databricks Lakehouse Platform Streamlines Big Data Processing

Monitoring Apache Iceberg metadata layer using AWS Lambda, AWS Glue, and AWS CloudWatch

2021 Gift Giving Guide for Data Nerds

Stay Connected