Data Lake, IT and Metadata - Data Leaders Brief

Data Lake

Metadata

Modernize your legacy databases with AWS data lakes, Part 2: Build a data lake using AWS DMS data on Apache Iceberg

AWS Big Data

OCTOBER 30, 2024

This is part two of a three-part series where we show how to build a data lake on AWS using a modern data architecture. This post shows how to load data from a legacy database (SQL Server) into a transactional data lake ( Apache Iceberg ) using AWS Glue. To start the job, choose Run. format(dbname)).config("spark.sql.catalog.glue_catalog.catalog-impl",

Data Lake

Data Lake Data Processing Optimization Machine Learning

Enriching metadata for accurate text-to-SQL generation for Amazon Athena

AWS Big Data

OCTOBER 14, 2024

Enterprise data is brought into data lakes and data warehouses to carry out analytical, reporting, and data science use cases using AWS analytical services like Amazon Athena , Amazon Redshift , Amazon EMR , and so on. Table metadata is fetched from AWS Glue. Can it also help write SQL queries?

Metadata

Metadata Data Lake Modeling Data Warehouse

Join 42,000+

Insiders

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Webinars

How to Streamline Payment Applications & Lien Waivers Through Innovative Construction Technology

Airflow Best Practices for ETL/ELT Pipelines

MORE WEBINARS

Trending Sources

Jumia builds a next-generation data platform with metadata-driven specification frameworks

AWS Big Data

DECEMBER 20, 2024

In this post, we share part of the journey that Jumia took with AWS Professional Services to modernize its data platform that ran under a Hadoop distribution to AWS serverless based solutions. These phases are: data orchestration, data migration, data ingestion, data processing, and data maintenance.

Metadata

Metadata Data-driven Snapshot Data Lake

Webinars

How to Streamline Payment Applications & Lien Waivers Through Innovative Construction Technology

Airflow Best Practices for ETL/ELT Pipelines

MORE WEBINARS

Run Apache XTable in AWS Lambda for background conversion of open table formats

AWS Big Data

NOVEMBER 26, 2024

Initially, data warehouses were the go-to solution for structured data and analytical workloads but were limited by proprietary storage formats and their inability to handle unstructured data. Eventually, transactional data lakes emerged to add transactional consistency and performance of a data warehouse to the data lake.

Metadata

Metadata Data Lake Snapshot Data Warehouse

How BMW streamlined data access using AWS Lake Formation fine-grained access control

AWS Big Data

OCTOBER 29, 2024

To achieve this, they aimed to break down data silos and centralize data from various business units and countries into the BMW Cloud Data Hub (CDH). However, the initial version of CDH supported only coarse-grained access control to entire data assets, and hence it was not possible to scope access to data asset subsets.

Data Lake

Data Lake Sales Metadata Machine Learning

Manage concurrent write conflicts in Apache Iceberg on the AWS Glue Data Catalog

AWS Big Data

APRIL 8, 2025

In modern data architectures, Apache Iceberg has emerged as a popular table format for data lakes, offering key features including ACID transactions and concurrent write support. However, commits can still fail if the latest metadata is updated after the base metadata version is established.

Snapshot

Snapshot Management Metadata Big Data

Build a high-performance quant research platform with Apache Iceberg

AWS Big Data

JANUARY 9, 2025

Iceberg offers distinct advantages through its metadata layer over Parquet, such as improved data management, performance optimization, and integration with various query engines. As mentioned earlier, 80% of quantitative research work is attributed to data management tasks.

Metadata

Metadata Snapshot Cost-Benefit Optimization

Recap of Amazon Redshift key product announcements in 2024

AWS Big Data

DECEMBER 17, 2024

Amazon Redshift , launched in 2013, has undergone significant evolution since its inception, allowing customers to expand the horizons of data warehousing and SQL analytics. Industry-leading price-performance Amazon Redshift offers up to three times better price-performance than alternative cloud data warehouses.

Data Lake

Data Lake Data Warehouse Data-driven Optimization

Expand data access through Apache Iceberg using Delta Lake UniForm on AWS

AWS Big Data

NOVEMBER 14, 2024

This interoperability is crucial for enabling seamless data access, reducing data silos, and fostering a more flexible and efficient data ecosystem. Delta Lake UniForm is an open table format extension designed to provide a universal data representation that can be efficiently read by different processing engines.

Metadata

Metadata Data Warehouse Big Data Data Lake

How EUROGATE established a data mesh architecture using Amazon DataZone

AWS Big Data

JANUARY 15, 2025

Their terminal operations rely heavily on seamless data flows and the management of vast volumes of data. Recently, EUROGATE has developed a digital twin for its container terminal Hamburg (CTH), generating millions of data points every second from Internet of Things (IoT)devices attached to its container handling equipment (CHE).

IoT

IoT Machine Learning Metadata Data-driven

Unifying data insights with Amazon QuickSight and Amazon SageMaker

AWS Big Data

JULY 18, 2025

Amazon SageMaker has announced an integration with Amazon QuickSight , bringing together data in SageMaker seamlessly with QuickSight capabilities like interactive dashboards, pixel perfect reports and generative business intelligence (BI)—all in a governed and automated manner. Select the Blueprints tab.

Dashboards

Dashboards Metadata KPI Publishing

Cloudera Lakehouse Optimizer Makes it Easier Than Ever to Deliver High-Performance Iceberg Tables

Cloudera

OCTOBER 10, 2024

The open data lakehouse is quickly becoming the standard architecture for unified multifunction analytics on large volumes of data. It combines the flexibility and scalability of data lake storage with the data analytics, data governance, and data management functionality of the data warehouse.

Optimization

Optimization Snapshot Data Lake Cost-Benefit

Write queries faster with Amazon Q generative SQL for Amazon Redshift

AWS Big Data

NOVEMBER 7, 2024

Amazon Q generative SQL for Amazon Redshift uses generative AI to analyze user intent, query patterns, and schema metadata to identify common SQL query patterns directly within Amazon Redshift, accelerating the query authoring process for users and reducing the time required to derive actionable data insights.

Metadata

Metadata Sales Data Warehouse Optimization

Key takeaways for CIOs from AWS re:Invent 2024

CIO Business Intelligence

DECEMBER 9, 2024

Streamlining workflows and boosting productivity The next set of re:Invent announcements focused on streamlining workflows for enterprises and helping businesses boost the productivity of developers and data professionals. On the storage front, AWS unveiled S3 Table Buckets and the S3 Metadata features.

Metadata

Metadata Unstructured Data Data Lake Data-driven

The next generation of Amazon SageMaker: The center for all your data, analytics, and AI

AWS Big Data

DECEMBER 4, 2024

They’re taking data they’ve historically used for analytics or business reporting and putting it to work in machine learning (ML) models and AI-powered applications. This innovation drives an important change: you’ll no longer have to copy or move data between data lake and data warehouses.

Data Analytics

Data Analytics Analytics Data Lake Data Quality

Stream real-time data into Apache Iceberg tables in Amazon S3 using Amazon Data Firehose

AWS Big Data

NOVEMBER 6, 2024

As businesses generate more data from a variety of sources, they need systems to effectively manage that data and use it for business outcomes—such as providing better customer experiences or reducing costs. Iceberg brings the reliability and simplicity of SQL tables to Amazon Simple Storage Service (Amazon S3) data lakes.

Metadata

Metadata Data Lake Internet of Things Management

Accelerate queries on Apache Iceberg tables through AWS Glue auto compaction

AWS Big Data

DECEMBER 19, 2024

Data lakes were originally designed to store large volumes of raw, unstructured, or semi-structured data at a low cost, primarily serving big data and analytics use cases. By using features like Icebergs compaction, OTFs streamline maintenance, making it straightforward to manage object and metadata versioning at scale.

Data Lake

Data Lake IoT Metadata Testing

Simplify data integration with AWS Glue and zero-ETL to Amazon SageMaker Lakehouse

AWS Big Data

DECEMBER 4, 2024

With the growing emphasis on data, organizations are constantly seeking more efficient and agile ways to integrate their data, especially from a wide variety of applications. SageMaker Lakehouse gives you the flexibility to access and query your data in-place with all Apache Iceberg compatible tools and engines.

Data Integration

Data Integration Data Lake Statistics Data-driven

Introducing simplified interaction with the Airflow REST API in Amazon MWAA

AWS Big Data

OCTOBER 23, 2024

It’s a set of HTTP endpoints to perform operations such as invoking Directed Acyclic Graphs (DAGs), checking task statuses, retrieving metadata about workflows, managing connections and variables, and even initiating dataset-related events, without directly accessing the Airflow web interface or command line tools.

Interactive

Interactive Testing Data-driven Data Lake

Reduce time to access your transactional data for analytical processing using the power of Amazon SageMaker Lakehouse and zero-ETL

AWS Big Data

JUNE 16, 2025

This shift requires shorter time to value and tighter collaboration among data analysts, data scientists, machine learning (ML) engineers, and application developers. However, the reality of scattered data across various systems—from data lakes to data warehouses and applications—makes it difficult to access and use data efficiently.

Data Lake

Data Lake Analytics Data Warehouse Metadata

Bridging the gap between mainframe data and hybrid cloud environments

CIO Business Intelligence

FEBRUARY 27, 2025

A high hurdle many enterprises have yet to overcome is accessing mainframe data via the cloud. Mainframes hold an enormous amount of critical and sensitive business data including transactional information, healthcare records, customer data, and inventory metrics.

Metadata

Metadata Data Lake Cost-Benefit Forecasting

Use open table format libraries on AWS Glue 5.0 for Apache Spark

AWS Big Data

DECEMBER 4, 2024

Open table formats are emerging in the rapidly evolving domain of big data management, fundamentally altering the landscape of data storage and analysis. By providing a standardized framework for data representation, open table formats break down data silos, enhance data quality, and accelerate analytics at scale.

Snapshot

Snapshot Metadata Data Lake Optimization

Demystify data sharing and collaboration patterns on AWS: Choosing the right tool for the job

AWS Big Data

OCTOBER 21, 2024

However, enterprises often encounter challenges with data silos, insufficient access controls, poor governance, and quality issues. Embracing data as a product is the key to address these challenges and foster a data-driven culture. These controls are designed to grant access with the right level of privileges and context.

Sales

Sales Data-driven Data Processing Key Performance Indicator

Data’s dark secret: Why poor quality cripples AI and growth

CIO Business Intelligence

APRIL 8, 2025

Data quality is no longer a back-office concern. We also examine how centralized, hybrid and decentralized data architectures support scalable, trustworthy ecosystems. One thing is clear for leaders aiming to drive trusted AI, resilient operations and informed decisions at scale: transformation starts with data you can trust.

Data Quality

Data Quality Data-driven Key Performance Indicator Metadata

Expanding data analysis and visualization options: Amazon DataZone now integrates with Tableau, Power BI, and more

AWS Big Data

OCTOBER 30, 2024

Amazon DataZone now launched authentication supports through the Amazon Athena JDBC driver, allowing data users to seamlessly query their subscribed data lake assets via popular business intelligence (BI) and analytics tools like Tableau, Power BI, Excel, SQL Workbench, DBeaver, and more.

Visualization

Visualization Data Lake Testing Data Governance

Achieve the best price-performance in Amazon Redshift with elastic histograms for selectivity estimation

AWS Big Data

OCTOBER 25, 2024

Amazon Redshift is a fast, scalable, and fully managed cloud data warehouse that allows you to process and run your complex SQL analytics workloads on structured and semi-structured data. It also helps you securely access your data in operational databases, data lakes, or third-party datasets with minimal movement or copying of data.

Statistics

Statistics Data Warehouse Metadata Data Lake

Stream data from Amazon MSK to Apache Iceberg tables in Amazon S3 and Amazon S3 Tables using Amazon Data Firehose

AWS Big Data

JUNE 20, 2025

Increasingly, organizations are adopting Apache Iceberg, an open source table format that simplifies data processing on large datasets stored in data lakes. It seamlessly integrates with popular open source big data processing frameworks Apache Spark , Apache Hive , Apache Flink , Presto , and Trino.

Data Lake

Data Lake Big Data Data-driven Internet of Things

Introducing a new unified data connection experience with Amazon SageMaker Lakehouse unified data connectivity

AWS Big Data

DECEMBER 16, 2024

This approach simplifies your data journey and helps you meet your security requirements. The SageMaker Lakehouse data connection testing capability boosts your confidence in established connections. Choose Add data. Prerequisites Before you begin, make sure you have the followings: An AWS account. Choose Save changes.

Visualization

Visualization Data Processing Testing Publishing

How Stifel built a modern data platform using AWS Glue and an event-driven domain architecture

AWS Big Data

JULY 7, 2025

In this post, we show you how Stifel implemented a modern data platform using AWS services and open data standards, building an event-driven architecture for domain data products while centralizing the metadata to facilitate discovery and sharing of data products.

Data-driven

Data-driven Metadata Digital Transformation Data Lake

Accelerate SQL code migration from Google BigQuery to Amazon Redshift using BladeBridge

AWS Big Data

NOVEMBER 7, 2024

BladeBridge offers a comprehensive suite of tools that automate much of the complex conversion work, allowing organizations to quickly and reliably transition their data analytics capabilities to the scalable Amazon Redshift data warehouse. Amazon Redshift is a fully managed data warehouse service offered by Amazon Web Services (AWS).

Data Warehouse

Data Warehouse Reporting Big Data Data Lake

Accelerating development with the AWS Data Processing MCP Server and Agent

AWS Big Data

JULY 22, 2025

To address these challenges and enhance developer productivity, we’re excited to introduce the AWS Data Processing MCP Server , an open-source tool that uses the Model Context Protocol (MCP) to simplify analytics environment setup on AWS. The MCP server interprets this request and automatically constructs the appropriate analytical workflow.

Data Processing

Data Processing Metadata Interactive Data-driven

Amazon SageMaker Lakehouse now supports attribute-based access control

AWS Big Data

APRIL 24, 2025

You can secure and centrally manage your data in the lakehouse by defining fine-grained permissions with Lake Formation that are consistently applied across all analytics and machine learning(ML) tools and engines. These tags are assigned to IAM users or roles and can be used to define or restrict access to specific resources or data.

Sales

Sales Data Lake Management Data-driven

Implement historical record lookup and Slowly Changing Dimensions Type-2 using Apache Iceberg

AWS Big Data

DECEMBER 9, 2024

As organizations process vast amounts of data, maintaining an accurate historical record is crucial. History management in data systems is fundamental for compliance, business intelligence, data quality, and time-based analysis. Common use cases for historical record management in CDC scenarios span various domains.

Snapshot

Snapshot Data Warehouse Data Lake Data Quality

Introducing generative AI troubleshooting for Apache Spark in AWS Glue (preview)

AWS Big Data

NOVEMBER 22, 2024

During development, data engineers often spend hours sifting through log files, analyzing execution plans, and making configuration changes to resolve issues. Troubleshooting these production issues requires extensive analysis of logs and metrics, often leading to extended downtimes and delayed insights from critical data pipelines.

Metrics

Metrics Data Lake Software Optimization

Introducing Jobs in Amazon SageMaker

AWS Big Data

JULY 15, 2025

The next generation of Amazon SageMaker is the center for all your data, analytics, and AI. Amazon SageMaker Unified Studio is a single data and AI development environment where you can find and access all of the data in your organization and act on it using the best tools across any use case. First, create a Notebook.

Visualization

Visualization Data Processing Metrics Big Data

Realizing ocean data democratization: Furuno Electric’s initiatives using Amazon DataZone

AWS Big Data

JULY 10, 2025

Like many manufacturing companies, Furuno Electric faced significant changes in revenue structure and technical architecture as they transitioned from traditional business to data-driven business. To succeed in this transformation, it was essential to build a foundation that promotes data utilization across the entire organization.

Manufacturing

Manufacturing IoT Data Lake Digital Transformation

HEMA accelerates their data governance journey with Amazon DataZone

AWS Big Data

DECEMBER 19, 2024

However, many companies today still struggle to effectively harness and use their data due to challenges such as data silos, lack of discoverability, poor data quality, and a lack of data literacy and analytical capabilities to quickly access and use data across the organization.

Data Governance

Data Governance Publishing Data-driven Metadata

Empower financial analytics by creating structured knowledge bases using Amazon Bedrock and Amazon Redshift

AWS Big Data

MAY 20, 2025

Create an Amazon Bedrock knowledge base referencing structured data in Amazon Redshift Amazon Bedrock Knowledge Bases uses Amazon Redshift as the query engine to query your data. It reads metadata from your structured data store to generate SQL queries. For this demo, we use Amazon Bedrock to access the Amazon Nova FMs.

Structured Data

Structured Data Data Warehouse Analytics Finance

Cross-account data collaboration with Amazon DataZone and AWS analytical tools

AWS Big Data

MARCH 5, 2025

Quick setup enables two default blueprints and creates the default environment profiles for the data lake and data warehouse default blueprints. The script creates a table with sample marketing and sales data. You will then publish the data assets from these data sources. AS wholesale_cost, 45.0

Analytics

Analytics Publishing Metadata Sales

AWS Glue Data Catalog supports automatic optimization of Apache Iceberg tables through your Amazon VPC

AWS Big Data

NOVEMBER 21, 2024

Similarly, the orphan file deletion process scans the table metadata and the actual data files, identifies the unreferenced files, and deletes them to reclaim storage space. These storage optimizations can help you reduce metadata overhead, control storage costs, and improve query performance.

Optimization

Optimization Snapshot Metadata Software

Build Write-Audit-Publish pattern with Apache Iceberg branching and AWS Glue Data Quality

AWS Big Data

DECEMBER 9, 2024

Given the importance of data in the world today, organizations face the dual challenges of managing large-scale, continuously incoming data while vetting its quality and reliability. One of its key features is the ability to manage data using branches. One of its key features is the ability to manage data using branches.

Data Quality

Data Quality Publishing Snapshot Data Lake

How ANZ Institutional Division built a federated data platform to enable their domain teams to build data products to support business outcomes

AWS Big Data

DECEMBER 4, 2024

In today’s rapidly evolving financial landscape, data is the bedrock of innovation, enhancing customer and employee experiences and securing a competitive edge. Like many large financial institutions, ANZ Institutional Division operated with siloed data practices and centralized data management teams.

Metadata

Metadata Data Governance Data Quality Data-driven

An integrated experience for all your data and AI with Amazon SageMaker Unified Studio (preview)

AWS Big Data

DECEMBER 11, 2024

Data engineers use data warehouses, data lakes, and analytics tools to load, transform, clean, and aggregate data. Data scientists use notebook environments (such as JupyterLab) to create predictive models for different target segments.

Data Lake

Data Lake Data Warehouse Data-driven Big Data

Secure Data Sharing and Interoperability Powered by Iceberg REST Catalog

Cloudera

DECEMBER 3, 2024

Cloudera’s open data lakehouse, powered by Apache Iceberg, solves the real-world big data challenges mentioned above by providing a unified, curated, shareable, and interoperable data lake that is accessible by a wide array of Iceberg-compatible compute engines and tools. spark.sql(SELECT * FROM airlines_data.carriers).show()

Metadata

Metadata Data Warehouse ROI Snapshot

Modernize your legacy databases with AWS data lakes, Part 2: Build a data lake using AWS DMS data on Apache Iceberg

Enriching metadata for accurate text-to-SQL generation for Amazon Athena

Webinars

Trending Sources

Jumia builds a next-generation data platform with metadata-driven specification frameworks

Webinars

Run Apache XTable in AWS Lambda for background conversion of open table formats

How BMW streamlined data access using AWS Lake Formation fine-grained access control

Manage concurrent write conflicts in Apache Iceberg on the AWS Glue Data Catalog

Build a high-performance quant research platform with Apache Iceberg

Recap of Amazon Redshift key product announcements in 2024

Expand data access through Apache Iceberg using Delta Lake UniForm on AWS

How EUROGATE established a data mesh architecture using Amazon DataZone

Unifying data insights with Amazon QuickSight and Amazon SageMaker

Cloudera Lakehouse Optimizer Makes it Easier Than Ever to Deliver High-Performance Iceberg Tables

Write queries faster with Amazon Q generative SQL for Amazon Redshift

Key takeaways for CIOs from AWS re:Invent 2024

The next generation of Amazon SageMaker: The center for all your data, analytics, and AI

Stream real-time data into Apache Iceberg tables in Amazon S3 using Amazon Data Firehose

Accelerate queries on Apache Iceberg tables through AWS Glue auto compaction

Simplify data integration with AWS Glue and zero-ETL to Amazon SageMaker Lakehouse

Introducing simplified interaction with the Airflow REST API in Amazon MWAA

Reduce time to access your transactional data for analytical processing using the power of Amazon SageMaker Lakehouse and zero-ETL

Bridging the gap between mainframe data and hybrid cloud environments

Use open table format libraries on AWS Glue 5.0 for Apache Spark

Demystify data sharing and collaboration patterns on AWS: Choosing the right tool for the job

Data’s dark secret: Why poor quality cripples AI and growth

Expanding data analysis and visualization options: Amazon DataZone now integrates with Tableau, Power BI, and more

Achieve the best price-performance in Amazon Redshift with elastic histograms for selectivity estimation

Stream data from Amazon MSK to Apache Iceberg tables in Amazon S3 and Amazon S3 Tables using Amazon Data Firehose

Introducing a new unified data connection experience with Amazon SageMaker Lakehouse unified data connectivity

How Stifel built a modern data platform using AWS Glue and an event-driven domain architecture

Accelerate SQL code migration from Google BigQuery to Amazon Redshift using BladeBridge

Accelerating development with the AWS Data Processing MCP Server and Agent

Amazon SageMaker Lakehouse now supports attribute-based access control

Implement historical record lookup and Slowly Changing Dimensions Type-2 using Apache Iceberg

Introducing generative AI troubleshooting for Apache Spark in AWS Glue (preview)

Introducing Jobs in Amazon SageMaker

Realizing ocean data democratization: Furuno Electric’s initiatives using Amazon DataZone

HEMA accelerates their data governance journey with Amazon DataZone

Empower financial analytics by creating structured knowledge bases using Amazon Bedrock and Amazon Redshift

Cross-account data collaboration with Amazon DataZone and AWS analytical tools

AWS Glue Data Catalog supports automatic optimization of Apache Iceberg tables through your Amazon VPC

Build Write-Audit-Publish pattern with Apache Iceberg branching and AWS Glue Data Quality

How ANZ Institutional Division built a federated data platform to enable their domain teams to build data products to support business outcomes

An integrated experience for all your data and AI with Amazon SageMaker Unified Studio (preview)

Secure Data Sharing and Interoperability Powered by Iceberg REST Catalog

Stay Connected