2023, Data Lake and Metadata - Data Leaders Brief

Run Apache XTable in AWS Lambda for background conversion of open table formats

AWS Big Data

NOVEMBER 26, 2024

Initially, data warehouses were the go-to solution for structured data and analytical workloads but were limited by proprietary storage formats and their inability to handle unstructured data. Eventually, transactional data lakes emerged to add transactional consistency and performance of a data warehouse to the data lake.

Metadata

Metadata Data Lake Snapshot Data Warehouse

Migrate an existing data lake to a transactional data lake using Apache Iceberg

AWS Big Data

OCTOBER 3, 2023

A data lake is a centralized repository that you can use to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data and then run different types of analytics for better business insights.

Data Lake

Data Lake Metadata Snapshot Recreation/Entertainment

Build a high-performance quant research platform with Apache Iceberg

AWS Big Data

JANUARY 9, 2025

Iceberg offers distinct advantages through its metadata layer over Parquet, such as improved data management, performance optimization, and integration with various query engines. Unlike direct Amazon S3 access, Iceberg supports these operations on petabyte-scale data lakes without requiring complex custom code.

Metadata

Metadata Snapshot Cost-Benefit Optimization

Webinars

Automation, Evolved: Your New Playbook For Smarter Knowledge Work

MORE WEBINARS

Use Apache Iceberg in a data lake to support incremental data processing

AWS Big Data

MARCH 2, 2023

Apache Iceberg is an open table format for very large analytic datasets, which captures metadata information on the state of datasets as they evolve and change over time. Iceberg has become very popular for its support for ACID transactions in data lakes and features like schema and partition evolution, time travel, and rollback.

Data Lake

Data Lake Data Processing Metadata Snapshot

AWS Lake Formation 2023 year in review

AWS Big Data

JANUARY 18, 2024

AWS Lake Formation and the AWS Glue Data Catalog form an integral part of a data governance solution for data lakes built on Amazon Simple Storage Service (Amazon S3) with multiple AWS analytics services integrating with them. In 2023, we released several updates to AWS Glue crawlers. Crawlers, salut!

Data Lake

Data Lake Metadata Data Governance Statistics

Accelerate Amazon Redshift Data Lake queries with AWS Glue Data Catalog Column Statistics

AWS Big Data

OCTOBER 1, 2024

Amazon Redshift enables you to efficiently query and retrieve structured and semi-structured data from open format files in Amazon S3 data lake without having to load the data into Amazon Redshift tables. Amazon Redshift extends SQL capabilities to your data lake, enabling you to run analytical queries.

Data Lake

Data Lake Statistics Broadcasting Optimization

Build a serverless transactional data lake with Apache Iceberg, Amazon EMR Serverless, and Amazon Athena

AWS Big Data

MARCH 10, 2023

Since the deluge of big data over a decade ago, many organizations have learned to build applications to process and analyze petabytes of data. Data lakes have served as a central repository to store structured and unstructured data at any scale and in various formats.

Data Lake

Data Lake Sales Data Warehouse Snapshot

Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes

AWS Big Data

MAY 24, 2023

When you build your transactional data lake using Apache Iceberg to solve your functional use cases, you need to focus on operational use cases for your S3 data lake to optimize the production environment. availability. parquet") df.sortWithinPartitions("review_date").writeTo("dev.db.amazon_reviews_iceberg").append()

Data Lake

Data Lake Snapshot Metadata Optimization

Write queries faster with Amazon Q generative SQL for Amazon Redshift

AWS Big Data

NOVEMBER 7, 2024

Amazon Q generative SQL for Amazon Redshift was launched in preview during AWS re:Invent 2023. Custom context enhances the AI model’s understanding of your specific data model, business logic, and query patterns, allowing it to generate more relevant and accurate SQL recommendations. Your data is not shared across accounts.

Metadata

Metadata Sales Data Warehouse Optimization

Amazon Redshift announcements at AWS re:Invent 2023 to enable analytics on all your data

AWS Big Data

NOVEMBER 29, 2023

These announcements drive forward the AWS Zero-ETL vision to unify all your data, enabling you to better maximize the value of your data with comprehensive analytics and ML capabilities, and innovate faster with secure data collaboration within and across organizations.

Data Warehouse

Data Warehouse Analytics Data Lake Machine Learning

The Increasing Importance of Open Table Formats

David Menninger's Analyst Perspectives

OCTOBER 31, 2024

I previously wrote about the importance of open table formats to the evolution of data lakes into data lakehouses. The concept of the data lake was initially proposed as a single environment where data could be combined from multiple sources to be stored and processed to enable analysis by multiple users for multiple purposes.

Data Lake

Data Lake Unstructured Data Data Warehouse Software

AWS re:Invent 2023 Amazon Redshift Sessions Recap

AWS Big Data

DECEMBER 18, 2023

Sessions ANT203 | What’s new in Amazon Redshift Watch this session to learn about the newest innovations within Amazon Redshift—the petabyte-scale AWS Cloud data warehousing solution. Easily build and train machine learning models using SQL within Amazon Redshift to generate predictive analytics and propel data-driven decision-making.

Data Warehouse

Data Warehouse Machine Learning Data-driven Data Lake

Top Opportunities for SAP Partners in 2023

Timo Elliott

NOVEMBER 30, 2022

My role was to talk about the trends and opportunities for 2023, for customers, SAP, and our partners. Because of technology limitations, we have always had to start by ripping information from the business systems and moving it to a different platform—a data warehouse, data lake, data lakehouse, data cloud.

Recreation/Entertainment

Recreation/Entertainment Metadata Data Warehouse Cost-Benefit

Implement tag-based access control for your data lake and Amazon Redshift data sharing with AWS Lake Formation

AWS Big Data

JULY 21, 2023

Data-driven organizations treat data as an asset and use it across different lines of business (LOBs) to drive timely insights and better business decisions. This leads to having data across many instances of data warehouses and data lakes using a modern data architecture in separate AWS accounts.

Data Lake

Data Lake Data Warehouse Marketing Management

Use Amazon Athena with Spark SQL for your open-source transactional table formats

AWS Big Data

JANUARY 24, 2024

AWS-powered data lakes, supported by the unmatched availability of Amazon Simple Storage Service (Amazon S3), can handle the scale, agility, and flexibility required to combine different data and analytics approaches. It will never remove files that are still required by a non-expired snapshot.

Snapshot

Snapshot Data Lake Metadata Optimization

Apache Iceberg optimization: Solving the small files problem in Amazon EMR

AWS Big Data

OCTOBER 3, 2023

In our previous post Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes , we discussed how you can implement solutions to improve operational efficiencies of your Amazon Simple Storage Service (Amazon S3) data lake that is using the Apache Iceberg open table format and running on the Amazon EMR big data platform.

Optimization

Optimization Snapshot Data Lake Metadata

Petabyte-scale log analytics with Amazon S3, Amazon OpenSearch Service, and Amazon OpenSearch Ingestion

AWS Big Data

MARCH 7, 2024

At the same time, they need to optimize operational costs to unlock the value of this data for timely insights and do so with a consistent performance. With this massive data growth, data proliferation across your data stores, data warehouse, and data lakes can become equally challenging.

Data Lake

Data Lake Analytics Dashboards Metrics

Denodo Provides a Logical Approach to Data Management

David Menninger's Analyst Perspectives

OCTOBER 24, 2024

Data silos are a perennial data management problem for enterprises, with almost three-quarters (73%) of participants in ISG Research’s Data Governance Benchmark Research citing disparate data sources and systems as a data governance challenge.

Management

Management Data-driven Data Governance Data Lake

How Fujitsu implemented a global data mesh architecture and democratized data

AWS Big Data

MAY 1, 2024

Currently, we have approximately 120,000 employees worldwide (as of March 2023), including group companies. To achieve data-driven management, we built OneData, a data utilization platform used in the four global AWS Regions, which started operation in April 2022. It is crucial in data governance and data management.

Dashboards

Dashboards Publishing Data-driven Cost-Benefit

My Understanding of the Gartner® Hype Cycle™ for Finance Data and Analytics Governance, 2023

Data Virtualization

MARCH 28, 2024

As noted in the Gartner Hype Cycle for Finance Data and Analytics Governance, 2023, “Through. The post My Understanding of the Gartner® Hype Cycle™ for Finance Data and Analytics Governance, 2023 appeared first on Data Management Blog - Data Integration and Modern Data Management Articles, Analysis and Information.

Finance

Finance Digital Transformation Analytics Data Integration

Educating ChatGPT on Data Lakehouse

Cloudera

MARCH 17, 2023

I took the free version of ChatGPT on a test drive (in March 2023) and asked some simple questions on data lakehouse and its components. Hopefully this blog will give ChatGPT an opportunity to learn and correct itself while counting towards my 2023 contribution to social good. I thought this was a fairly comprehensive list.

Unstructured Data

Unstructured Data Data Lake Data Warehouse Machine Learning

Materialized Views in Hive for Iceberg Table Format

Cloudera

FEBRUARY 8, 2024

The snapshotId of the source tables involved in the materialized view are also maintained in the metadata. Incremental and full rebuild of materialized view We will insert rows into the base table and examine how the materialized view can be updated to reflect the new data. Furthermore, it is partitioned on the d_year column.

Snapshot

Snapshot Metadata Cost-Benefit Data Warehouse

Introducing watsonx: The future of AI for business

IBM Big Data Hub

MAY 9, 2023

Optimized for all data, analytics and AI workloads, watsonx.data combines the flexibility of a data lake with the performance of a data warehouse, helping businesses to scale data analytics and AI anywhere their data resides. Savings may vary depending on configurations, workloads and vendors.

Data Warehouse

Data Warehouse Machine Learning Cost-Benefit Metadata

Enhance monitoring and debugging for AWS Glue jobs using new job observability metrics, Part 3: Visualization and trend analysis using Amazon QuickSight

AWS Big Data

MARCH 29, 2024

Data Firehose uses an AWS Lambda function to transform data and ingest the transformed records into an Amazon Simple Storage Service (Amazon S3) bucket. An AWS Glue crawler scans data on the S3 bucket and populates table metadata on the AWS Glue Data Catalog. Let’s drill down into details.

Metrics

Metrics Visualization Dashboards Interactive

The Enduring Significance of Data Modeling in the Modern Data-Driven Enterprise

erwin

AUGUST 31, 2023

It delivers the ability to capture and unify the business and technical perspectives of data assets, enables effective collaboration between a variety of stakeholders, and delivers metadata-driven automation to accelerate the creation and maintenance of data sources on virtually any data management platform. by Quest ®.

Data-driven

Data-driven Modeling Enterprise Structured Data

Use your corporate identities for analytics with Amazon EMR and AWS IAM Identity Center

AWS Big Data

APRIL 26, 2024

Set up EMR Studio In this step, we demonstrate the actions needed from the data lake administrator to set up EMR Studio enabled for trusted identity propagation and with IAM Identity Center integration. On the Lake Formation console, choose Data lake permissions under Permissions in the navigation pane.

Analytics

Analytics Data Lake Management Enterprise

Empower Your Cyber Defenders with Real-Time Analytics

Cloudera

NOVEMBER 15, 2024

In fact, according to the Identity Theft Resource Center (ITRC) Annual Data Breach Report , there were 2,365 cyber attacks in 2023 with more than 300 million victims, and a 72% increase in data breaches since 2021. Real-Time Threat Detection with Iceberg Cyber log data is massive and constantly evolving.

Analytics

Analytics Metadata Snapshot Data-driven

CIOs rise to the ESG reporting challenge

CIO Business Intelligence

JANUARY 30, 2024

“Always the gatekeepers of much of the data necessary for ESG reporting, CIOs are finding that companies are even more dependent on them,” says Nancy Mentesana, ESG executive director at Labrador US, a global communications firm focused on corporate disclosure documents. The complexity is at a much higher level.”

Reporting

Reporting Data Quality Strategy Data-driven

Optimize data layout by bucketing with Amazon Athena and AWS Glue to accelerate downstream queries

AWS Big Data

APRIL 25, 2024

In the era of data, organizations are increasingly using data lakes to store and analyze vast amounts of structured and unstructured data. Data lakes provide a centralized repository for data from various sources, enabling organizations to unlock valuable insights and drive data-driven decision-making.

Optimization

Optimization Data Lake Cost-Benefit Reporting

Tackling AI’s data challenges with IBM databases on AWS

IBM Big Data Hub

MARCH 14, 2024

This involves unifying and sharing a single copy of data and metadata across IBM® watsonx.data ™, IBM® Db2 ®, IBM® Db2® Warehouse and IBM® Netezza ®, using native integrations and supporting open formats, all without the need for migration or recataloging.

Cost-Benefit

Cost-Benefit Metadata Optimization Management

Exploring the AI and data capabilities of watsonx

IBM Big Data Hub

JULY 17, 2023

Watsonx.data is built on 3 core integrated components: multiple query engines, a catalog that keeps track of metadata, and storage and relational data sources which the query engines directly access. Easy to use, integrated data console: Bring your own data and stay in control of your data.

Machine Learning

Machine Learning Data Warehouse Modeling Cost-Benefit

Implement a serverless CDC process with Apache Iceberg using Amazon DynamoDB and Amazon Athena

AWS Big Data

AUGUST 16, 2023

Iceberg manages large collections of files as tables, and it supports modern analytical data lake operations such as record-level insert, update, delete, and time travel queries. Iceberg also helps guarantee data correctness under concurrent write scenarios. We fetch the metadata of the users_xxxxxx table from Athena.

Data Lake

Data Lake Metadata Testing Snapshot

Use the Amazon Redshift Data API to interact with Amazon Redshift Serverless

AWS Big Data

APRIL 28, 2023

describe-table Describes the detailed information about a table including column metadata. The result set contains the complete result set and the column metadata. If you want to get help on a specific command, run the following command: aws redshift-data list-tables help Now we look at how you can use these commands.

Interactive

Interactive Data Warehouse Metadata Data-driven

How Cloudera Supports Zero Trust for Data

Cloudera

JUNE 7, 2023

Subsequent to the ZTMM release, CISA issued a request for comment, which has led to the revised version 2 of the ZTMM in April 2023 , as “commenters requested additional guidance and space to evolve along the maturity model,” according to CISA. Understanding your data is critical to protecting the data.

Metadata

Metadata Data Lake Optimization Modeling

Do the Benefits of Cloud Outweigh the Costs?

Jet Global

SEPTEMBER 19, 2023

What are the best practices for analyzing cloud ERP data? Data Management How do we create a data warehouse or data lake in the cloud using our cloud ERP? How do I access the legacy data from my previous ERP? Self-service BI How can we rapidly build BI reports on cloud ERP data without any help from IT?

Cost-Benefit

Cost-Benefit Data Warehouse Reporting Enterprise

Accelerate queries on Apache Iceberg tables through AWS Glue auto compaction

AWS Big Data

DECEMBER 19, 2024

Data lakes were originally designed to store large volumes of raw, unstructured, or semi-structured data at a low cost, primarily serving big data and analytics use cases. By using features like Icebergs compaction, OTFs streamline maintenance, making it straightforward to manage object and metadata versioning at scale.

Data Lake

Data Lake IoT Metadata Testing

Empower Your Cyber Defenders with Real-Time Analytics Author: Carolyn Duby, Field CTO

Cloudera

NOVEMBER 15, 2024

In fact, according to the Identity Theft Resource Center (ITRC) Annual Data Breach Report , there were 2,365 cyber attacks in 2023 with more than 300 million victims, and a 72% increase in data breaches since 2021. Real-Time Threat Detection with Iceberg Cyber log data is massive and constantly evolving.

Analytics

Analytics Metadata Snapshot Data-driven

Redefining enterprise transformation in the age of intelligent ecosystems

CIO Business Intelligence

JANUARY 16, 2025

The mega-vendor era By 2020, the basis of competition for what are now referred to as mega-vendors was interoperability, automation and intra-ecosystem participation and unlocking access to data to drive business capabilities, value and manage risk. The new new moats How do systems of intelligence fit in?

Enterprise

Enterprise Digital Transformation Scorecard Interactive

Run Apache XTable in AWS Lambda for background conversion of open table formats

Migrate an existing data lake to a transactional data lake using Apache Iceberg

Webinars

Trending Sources

Build a high-performance quant research platform with Apache Iceberg

Webinars

Use Apache Iceberg in a data lake to support incremental data processing

AWS Lake Formation 2023 year in review

Accelerate Amazon Redshift Data Lake queries with AWS Glue Data Catalog Column Statistics

Build a serverless transactional data lake with Apache Iceberg, Amazon EMR Serverless, and Amazon Athena

Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes

Write queries faster with Amazon Q generative SQL for Amazon Redshift

Amazon Redshift announcements at AWS re:Invent 2023 to enable analytics on all your data

The Increasing Importance of Open Table Formats

AWS re:Invent 2023 Amazon Redshift Sessions Recap

Top Opportunities for SAP Partners in 2023

Implement tag-based access control for your data lake and Amazon Redshift data sharing with AWS Lake Formation

Use Amazon Athena with Spark SQL for your open-source transactional table formats

Apache Iceberg optimization: Solving the small files problem in Amazon EMR

Petabyte-scale log analytics with Amazon S3, Amazon OpenSearch Service, and Amazon OpenSearch Ingestion

Denodo Provides a Logical Approach to Data Management

How Fujitsu implemented a global data mesh architecture and democratized data

My Understanding of the Gartner® Hype Cycle™ for Finance Data and Analytics Governance, 2023

Educating ChatGPT on Data Lakehouse

Materialized Views in Hive for Iceberg Table Format

Introducing watsonx: The future of AI for business

Enhance monitoring and debugging for AWS Glue jobs using new job observability metrics, Part 3: Visualization and trend analysis using Amazon QuickSight

The Enduring Significance of Data Modeling in the Modern Data-Driven Enterprise

Use your corporate identities for analytics with Amazon EMR and AWS IAM Identity Center

Empower Your Cyber Defenders with Real-Time Analytics

CIOs rise to the ESG reporting challenge

Optimize data layout by bucketing with Amazon Athena and AWS Glue to accelerate downstream queries

Tackling AI’s data challenges with IBM databases on AWS

Exploring the AI and data capabilities of watsonx

Implement a serverless CDC process with Apache Iceberg using Amazon DynamoDB and Amazon Athena

Use the Amazon Redshift Data API to interact with Amazon Redshift Serverless

How Cloudera Supports Zero Trust for Data

Do the Benefits of Cloud Outweigh the Costs?

Accelerate queries on Apache Iceberg tables through AWS Glue auto compaction

Empower Your Cyber Defenders with Real-Time Analytics Author: Carolyn Duby, Field CTO

Redefining enterprise transformation in the age of intelligent ecosystems

Stay Connected