Data Lake, Metadata and Software

Run Apache XTable in AWS Lambda for background conversion of open table formats

AWS Big Data

NOVEMBER 26, 2024

Initially, data warehouses were the go-to solution for structured data and analytical workloads, but they were limited by proprietary storage formats and an inability to handle unstructured data. Eventually, transactional data lakes emerged to bring the transactional consistency and performance of a data warehouse to the data lake.

Metadata

Metadata Data Lake Snapshot Data Warehouse

Understanding the Differences Between Data Lakes and Data Warehouses

Smart Data Collective

AUGUST 28, 2021

Data lakes and data warehouses are probably the two most widely used structures for storing data. A data warehouse is used as a central storage space for large amounts of structured data coming from various sources.

Data Lake

Data Lake Data Warehouse Unstructured Data Structured Data

Collibra Brings Effective Data Governance to Line-of-Business

David Menninger's Analyst Perspectives

SEPTEMBER 28, 2021

Collibra is a data governance software company that offers tools for metadata management and data cataloging. The software enables organizations to find data quickly, identify its source and assure its integrity.

Data Governance

Data Governance Metadata Software Management

Use Apache Iceberg in a data lake to support incremental data processing

AWS Big Data

MARCH 2, 2023

Apache Iceberg is an open table format for very large analytic datasets, which captures metadata information on the state of datasets as they evolve and change over time. Iceberg has become very popular for its support for ACID transactions in data lakes and features like schema and partition evolution, time travel, and rollback.

Data Lake

Data Lake Data Processing Metadata Snapshot

Bridging the gap between mainframe data and hybrid cloud environments

CIO Business Intelligence

FEBRUARY 27, 2025

A high hurdle many enterprises have yet to overcome is accessing mainframe data via the cloud. Giving the mobile workforce access to this data via the cloud allows them to be productive from anywhere, fosters collaboration, and improves overall strategic decision-making.

Metadata

Metadata Data Lake Cost-Benefit Forecasting

Expand data access through Apache Iceberg using Delta Lake UniForm on AWS

AWS Big Data

NOVEMBER 14, 2024

Under the hood, UniForm generates Iceberg metadata files (including metadata and manifest files) that are required for Iceberg clients to access the underlying data files in Delta Lake tables. Both Delta Lake and Iceberg metadata files reference the same data files.

Metadata

Metadata Data Warehouse Big Data Data Lake

How Cloudinary transformed their petabyte scale streaming data lake with Apache Iceberg and AWS Analytics

AWS Big Data

JUNE 10, 2024

When evolving such a partition definition, the data in the table prior to the change is unaffected, as is its metadata. Only data that is written to the table after the evolution is partitioned with the new definition, and the metadata for this new set of data is kept separately.

Data Lake

Data Lake Metadata Snapshot Analytics

Use Apache Iceberg in your data lake with Amazon S3, AWS Glue, and Snowflake

AWS Big Data

APRIL 3, 2024

licensed, 100% open-source data table format that helps simplify data processing on large datasets stored in data lakes. Data engineers use Apache Iceberg because it’s fast, efficient, and reliable at any scale and keeps records of how datasets change over time.

Data Lake

Data Lake Snapshot Metadata Data Architecture

Choosing an open table format for your transactional data lake on AWS

AWS Big Data

JUNE 9, 2023

A modern data architecture enables companies to ingest virtually any type of data through automated pipelines into a data lake, which provides highly durable and cost-effective object storage at petabyte or exabyte scale.

Data Lake

Data Lake Metadata Statistics Optimization

Accelerate Amazon Redshift Data Lake queries with AWS Glue Data Catalog Column Statistics

AWS Big Data

OCTOBER 1, 2024

Amazon Redshift enables you to efficiently query and retrieve structured and semi-structured data from open format files in an Amazon S3 data lake without having to load the data into Amazon Redshift tables. Amazon Redshift extends SQL capabilities to your data lake, enabling you to run analytical queries.

Data Lake

Data Lake Statistics Broadcasting Optimization

Build Write-Audit-Publish pattern with Apache Iceberg branching and AWS Glue Data Quality

AWS Big Data

DECEMBER 9, 2024

Today, many customers build data quality validation pipelines using AWS Glue Data Quality's Data Quality Definition Language (DQDL) because, with static rules, dynamic rules, and anomaly detection capability, it's fairly straightforward. One of Apache Iceberg's key features is the ability to manage data using branches. Sotaro Hikita is a Solutions Architect.

Data Quality

Data Quality Publishing Snapshot Data Lake

Migrate Delta tables from Azure Data Lake Storage to Amazon S3 using AWS Glue

AWS Big Data

SEPTEMBER 10, 2024

We often see requests from customers who have started their data journey by building data lakes on Microsoft Azure, to extend access to the data to AWS services. In such scenarios, data engineers face challenges in connecting and extracting data from storage containers on Microsoft Azure.

Data Lake

Data Lake Metadata Management Software

Use open table format libraries on AWS Glue 5.0 for Apache Spark

AWS Big Data

DECEMBER 4, 2024

Open table formats are emerging in the rapidly evolving domain of big data management, fundamentally altering the landscape of data storage and analysis. By providing a standardized framework for data representation, open table formats break down data silos, enhance data quality, and accelerate analytics at scale.

Snapshot

Snapshot Metadata Data Lake Optimization

What is a Data Mesh?

DataKitchen

AUGUST 3, 2021

First-generation – expensive, proprietary enterprise data warehouse and business intelligence platforms maintained by a specialized team drowning in technical debt. Second-generation – gigantic, complex data lake maintained by a specialized team drowning in technical debt. Decentralization promotes creativity and empowerment.

Data Architecture

Data Architecture Data Lake Cost-Benefit Data Warehouse

The next generation of Amazon SageMaker: The center for all your data, analytics, and AI

AWS Big Data

DECEMBER 4, 2024

Collaborate and build faster using familiar AWS tools for model development, generative AI, data processing, and SQL analytics with Amazon Q Developer, the most capable generative AI assistant for software development, helping you along the way. Having confidence in your data is key.

Data Analytics

Data Analytics Analytics Data Lake Data Quality

Write queries faster with Amazon Q generative SQL for Amazon Redshift

AWS Big Data

NOVEMBER 7, 2024

Amazon Q generative SQL for Amazon Redshift uses generative AI to analyze user intent, query patterns, and schema metadata to identify common SQL query patterns directly within Amazon Redshift, accelerating the query authoring process for users and reducing the time required to derive actionable data insights.

Metadata

Metadata Sales Data Warehouse Optimization

Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes

AWS Big Data

MAY 24, 2023

When you build your transactional data lake using Apache Iceberg to solve your functional use cases, you need to focus on operational use cases for your S3 data lake to optimize the production environment. The snapshots that have expired show the latest snapshot ID as null.

Data Lake

Data Lake Snapshot Metadata Optimization

Build a real-time GDPR-aligned Apache Iceberg data lake

AWS Big Data

FEBRUARY 24, 2023

Data lakes are a popular choice for today’s organizations to store data about their business activities. As a best practice in data lake design, data should be immutable once stored. A data lake built on AWS uses Amazon Simple Storage Service (Amazon S3) as its primary storage environment.

Data Lake

Data Lake Metadata Testing Data Warehouse

Orca Security’s journey to a petabyte-scale data lake with Apache Iceberg and AWS Analytics

AWS Big Data

JULY 20, 2023

With data becoming the driving force behind many industries today, having a modern data architecture is pivotal for organizations to be successful. In this post, we describe Orca’s journey building a transactional data lake using Amazon Simple Storage Service (Amazon S3), Apache Iceberg, and AWS Analytics.

Data Lake

Data Lake Analytics Snapshot Data Quality

The Increasing Importance of Open Table Formats

David Menninger's Analyst Perspectives

OCTOBER 31, 2024

I previously wrote about the importance of open table formats to the evolution of data lakes into data lakehouses. The concept of the data lake was initially proposed as a single environment where data could be combined from multiple sources to be stored and processed to enable analysis by multiple users for multiple purposes.

Data Lake

Data Lake Unstructured Data Data Warehouse Software

Data Lakes: What Are They and Who Needs Them?

Jet Global

JULY 2, 2019

To address the flood of data and the needs of enterprise businesses to store, sort, and analyze that data, a new storage solution has evolved: the data lake. What’s in a Data Lake? All the while, your marketing team is relying on marketing automation or CRM software they find the most productive.

Data Lake

Data Lake Data Warehouse Big Data Machine Learning

Doing Cloud Migration and Data Governance Right the First Time

erwin

OCTOBER 8, 2020

These tools range from enterprise service bus (ESB) products, data integration tools, and extract, transform and load (ETL) tools to procedural code, application program interfaces (APIs), file transfer protocol (FTP) processes, and even business intelligence (BI) reports that further aggregate and transform data.

Data Governance

Data Governance Metadata Testing Data Lake

Introducing Apache Hudi support with AWS Glue crawlers

AWS Big Data

NOVEMBER 22, 2023

Apache Hudi is an open table format that brings database and data warehouse capabilities to data lakes. Apache Hudi helps data engineers manage complex challenges, such as managing continuously evolving datasets with transactions while maintaining query performance.

Data Lake

Data Lake Snapshot Metadata Optimization

Building a Beautiful Data Lakehouse

CIO Business Intelligence

MARCH 9, 2022

However, they do contain effective data management, organization, and integrity capabilities. As a result, users can easily find what they need, and organizations avoid the operational and cost burdens of storing unneeded or duplicate data copies. Warehouse, data lake convergence. Meet the data lakehouse.

Data Lake

Data Lake Unstructured Data Data Warehouse Big Data

Informatica’s new data management clouds target health, finance services

CIO Business Intelligence

MAY 24, 2022

The company said that IDMC for Financial Services has built-in metadata scanners that can help extract lineage, technical, business, operational, and usage metadata from over 50,000 systems (including data warehouses and data lakes) and applications including business intelligence, data science, CRM, and ERP software.

Finance

Finance Management Metadata Machine Learning

Apache Iceberg optimization: Solving the small files problem in Amazon EMR

AWS Big Data

OCTOBER 3, 2023

In our previous post, Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes, we discussed how you can implement solutions to improve operational efficiencies of your Amazon Simple Storage Service (Amazon S3) data lake that is using the Apache Iceberg open table format and running on the Amazon EMR big data platform.

Optimization

Optimization Snapshot Data Lake Metadata

Unstructured data management and governance using AWS AI/ML and analytics services

AWS Big Data

OCTOBER 25, 2023

After decades of digitizing everything in your enterprise, you may have an enormous amount of data, but with dormant value. However, with the help of AI and machine learning (ML), new software tools are now available to unearth the value of unstructured data. The solution integrates data in three tiers.

Unstructured Data

Unstructured Data Metadata Management Analytics

Introducing a new unified data connection experience with Amazon SageMaker Lakehouse unified data connectivity

AWS Big Data

DECEMBER 16, 2024

This approach simplifies your data journey and helps you meet your security requirements. The SageMaker Lakehouse data connection testing capability boosts your confidence in established connections. About the Authors Chiho Sugimoto is a Cloud Support Engineer on the AWS Big Data Support team.

Visualization

Visualization Data Processing Testing Publishing

Driving Business Value and ROI from a Hybrid Cloud Data Lake

Alation

FEBRUARY 20, 2020

For many enterprises, a hybrid cloud data lake is no longer a trend but is becoming a reality. Performance is more reliable, and there is a wide array of mature software products at the enterprises’ disposal. Due to these needs, hybrid cloud data lakes emerged as a logical middle ground between the two consumption models.

Data Lake

Data Lake ROI Metadata Cost-Benefit

Top 15 data management platforms

CIO Business Intelligence

JUNE 9, 2022

All this data arrives by the terabyte, and a data management platform can help marketers make sense of it all. Marketing-focused or not, DMPs excel at negotiating with a wide array of databases, data lakes, or data warehouses, ingesting their streams of data and then cleaning, sorting, and unifying the information therein.

Management

Management Advertising Data Lake Sales

Accelerate SQL code migration from Google BigQuery to Amazon Redshift using BladeBridge

AWS Big Data

NOVEMBER 7, 2024

BladeBridge offers a comprehensive suite of tools that automate much of the complex conversion work, allowing organizations to quickly and reliably transition their data analytics capabilities to the scalable Amazon Redshift data warehouse. Amazon Redshift is a fully managed data warehouse service offered by Amazon Web Services (AWS).

Data Warehouse

Data Warehouse Reporting Big Data Data Lake

Petabyte-scale log analytics with Amazon S3, Amazon OpenSearch Service, and Amazon OpenSearch Ingestion

AWS Big Data

MARCH 7, 2024

At the same time, they need to optimize operational costs to unlock the value of this data for timely insights and do so with consistent performance. With this massive data growth, data proliferation across your data stores, data warehouses, and data lakes can become equally challenging.

Data Lake

Data Lake Analytics Dashboards Metrics

Denodo Provides a Logical Approach to Data Management

David Menninger's Analyst Perspectives

OCTOBER 24, 2024

Data fabric and data mesh are also both related to logical data management, which is the approach of providing virtualized access to data across an enterprise without the requirement to first extract and load it into a central repository.

Management

Management Data-driven Data Governance Data Lake

How Cargotec uses metadata replication to enable cross-account data sharing

AWS Big Data

JUNE 7, 2023

Cargotec captures terabytes of IoT telemetry data from their machinery operated by numerous customers across the globe. This data needs to be ingested into a data lake, transformed, and made available for analytics, machine learning (ML), and visualization. The target accounts read data from the source account S3 buckets.

Metadata

Metadata Data Lake Machine Learning Big Data

Don’t Fear Artificial Intelligence; Embrace it Through Data Governance

CIO Business Intelligence

APRIL 29, 2022

This first article emphasizes data as the ‘foundation-stone’ of AI-based initiatives. Establishing a Data Foundation. The shift away from ‘Software 1.0’ where applications have been based on hard-coded rules has begun and the ‘Software 2.0’ era is upon us.

Data Governance

Data Governance IT Risk Data Lake

Data governance in the age of generative AI

AWS Big Data

FEBRUARY 29, 2024

To provide a response that includes the enterprise context, each user prompt needs to be augmented with a combination of insights from structured data from the data warehouse and unstructured data from the enterprise data lake.

Data Governance

Data Governance Unstructured Data Metadata Data Lake

Apache Ozone and Dense Data Nodes

Cloudera

APRIL 22, 2021

Apache Ozone is one of the major innovations introduced in CDP, which provides the next generation storage architecture for Big Data applications, where data blocks are organized in storage containers for larger scale and to handle small objects. Collects and aggregates metadata from components and presents cluster state.

Data Lake

Data Lake Cost-Benefit Metadata Testing

Salesforce readies Einstein Copilot to unleash generative AI across its offerings

CIO Business Intelligence

SEPTEMBER 12, 2023

The hype around generative AI since ChatGPT’s launch in November 2022 has driven some software vendors to rush to incorporate the technology into their applications. To that end, Salesforce is leveraging Data Cloud as a central data hub for enterprise implementations of Einstein Copilot.

IT

IT Metadata Data Lake Cost-Benefit

Integrating Data Governance and Enterprise Architecture

erwin

SEPTEMBER 3, 2020

To better understand and align data governance and enterprise architecture, let’s look at data at rest and data in motion and why they both have to be documented. Documenting data at rest involves looking at where data is stored, such as in databases, data lakes, data warehouses and flat files.

Data Governance

Data Governance Enterprise Risk Data Lake

Query your Apache Hive metastore with AWS Lake Formation permissions

AWS Big Data

JULY 20, 2023

The Hive metastore is a repository of metadata about the SQL tables, such as database names, table names, schema, serialization and deserialization information, data location, and partition details of each table. Apache Hive, Apache Spark, Presto, and Trino can all use a Hive Metastore to retrieve metadata to run queries.

Data Lake

Data Lake Metadata Data Processing Big Data

Amazon Redshift announcements at AWS re:Invent 2023 to enable analytics on all your data

AWS Big Data

NOVEMBER 29, 2023

Zero-ETL integration also enables you to load and analyze data from multiple operational database clusters in a new or existing Amazon Redshift instance to derive holistic insights across many applications. Use one click to access your data lake tables using auto-mounted AWS Glue data catalogs on Amazon Redshift for a simplified experience.

Data Warehouse

Data Warehouse Analytics Data Lake Machine Learning

Cross-account data collaboration with Amazon DataZone and AWS analytical tools

AWS Big Data

MARCH 5, 2025

Quick setup enables two default blueprints and creates the default environment profiles for the data lake and data warehouse default blueprints. You will then publish the data assets from these data sources. Add an AWS Glue data source to publish the new AWS Glue table. Review and choose Create.

Analytics

Analytics Publishing Metadata Sales

How Morningstar used tag-based access controls in AWS Lake Formation to manage permissions for an Amazon Redshift data warehouse

AWS Big Data

APRIL 6, 2023

In this post, Morningstar’s Data Lake Team Leads discuss how they utilized tag-based access control in their data lake with AWS Lake Formation and enabled similar controls in Amazon Redshift. This way, our existing data lake consumers could easily transition to Amazon Redshift.

Data Warehouse

Data Warehouse Data Lake Management Data-driven

What is an Information Steward, and Why You Should Care

Grooper

MARCH 5, 2020

If your organization has any kind of data and analytics initiative, then chances are you have people, maybe even an entire department, dedicated to managing and integrating data for (and between) software applications to achieve some sort of business outcome. Is a Power-User or a Data Scientist an Information Steward?

Data Lake

Data Lake Metadata Data Quality Software
