Blog, Metadata and Optimization - Data Leaders Brief

Enriching metadata for accurate text-to-SQL generation for Amazon Athena

AWS Big Data

OCTOBER 14, 2024

Writing SQL queries requires not just remembering SQL syntax rules, but also knowledge of the table metadata, which is data about table schemas, relationships among the tables, and possible column values. Although LLMs can generate syntactically correct SQL queries, they still need the table metadata to write accurate SQL queries.
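To make the idea concrete, here is a minimal sketch of how the three kinds of metadata named above (schemas, relationships, and possible column values) could be rendered into the text of an LLM prompt. It is an illustration only, not the method from the AWS post: the class names, `render_metadata`, `build_prompt`, and the `orders` table are all hypothetical, and no LLM or Athena call is made.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnInfo:
    name: str
    data_type: str
    description: str = ""
    sample_values: list[str] = field(default_factory=list)


@dataclass
class TableInfo:
    name: str
    columns: list[ColumnInfo]
    # Relationships as plain text, e.g. "orders.customer_id -> customers.id"
    relationships: list[str] = field(default_factory=list)


def render_metadata(tables: list[TableInfo]) -> str:
    """Render table metadata as plain text suitable for inclusion in a prompt."""
    lines = []
    for table in tables:
        lines.append(f"Table {table.name}:")
        for col in table.columns:
            line = f"  - {col.name} ({col.data_type})"
            if col.description:
                line += f": {col.description}"
            if col.sample_values:
                line += f" [possible values: {', '.join(col.sample_values)}]"
            lines.append(line)
        for rel in table.relationships:
            lines.append(f"  relationship: {rel}")
    return "\n".join(lines)


def build_prompt(question: str, tables: list[TableInfo]) -> str:
    """Combine the enriched metadata and the user question into one prompt string."""
    return (
        "Write a SQL query for Amazon Athena.\n\n"
        f"Metadata:\n{render_metadata(tables)}\n\n"
        f"Question: {question}\nSQL:"
    )


# Hypothetical example table, invented for illustration.
orders = TableInfo(
    name="orders",
    columns=[
        ColumnInfo("order_id", "bigint", "Unique order identifier"),
        ColumnInfo("status", "string", "Fulfilment state", ["shipped", "pending"]),
        ColumnInfo("customer_id", "bigint"),
    ],
    relationships=["orders.customer_id -> customers.id"],
)

prompt = build_prompt("How many orders are still pending?", [orders])
print(prompt)
```

Listing the possible values for `status` is the part a bare schema would not provide: without it, a model has to guess whether the filter should be `'pending'`, `'PENDING'`, or something else.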

Metadata

Metadata Data Lake Modeling Data Warehouse

Octopai Acquisition Enhances Metadata Management to Trust Data Across Entire Data Estate

Cloudera

NOVEMBER 13, 2024

It leverages knowledge graphs to keep track of all the data sources and data flows, using AI to fill the gaps so you have the most comprehensive metadata management solution. Together, Cloudera and Octopai will help reinvent how customers manage their metadata and track lineage across all their data sources.

Metadata

Metadata Management Data Governance Data-driven

Announcing Open Source DataOps Data Quality TestGen 3.0

DataKitchen

FEBRUARY 20, 2025

Better Metadata Management: Add Descriptions and Data Product tags to tables and columns in the Data Catalog for improved governance. With the updated TestGen 3.0, you have the power to score, monitor, and optimize your data quality like never before. DataOps just got more intelligent.

Data Quality

Data Quality Scorecard Testing Dashboards

Monitoring Apache Iceberg metadata layer using AWS Lambda, AWS Glue, and AWS CloudWatch

AWS Big Data

JULY 29, 2024

Despite their advantages, traditional data lake architectures often grapple with challenges such as understanding deviations from the most optimal state of the table over time, identifying issues in data pipelines, and monitoring a large number of tables. It is essential for optimizing read and write performance.

Metadata

Metadata Snapshot Data Lake Metrics

Recap of Amazon Redshift key product announcements in 2024

AWS Big Data

DECEMBER 17, 2024

First query response times for dashboard queries have significantly improved by optimizing code execution and reducing compilation overhead. We have enhanced autonomics algorithms to generate and implement smarter and quicker optimal data layout recommendations for distribution and sort keys, further optimizing performance.

Data Lake

Data Lake Data Warehouse Data-driven Optimization

Keeping Small Queries Fast – Short query optimizations in Apache Impala

Cloudera

NOVEMBER 13, 2020

This is part of our series of blog posts on recent enhancements to Impala. Impala Optimizations for Small Queries. We’ll discuss the various phases Impala takes a query through and how small query optimizations are incorporated into the design of each phase. The entire collection is available here. Query Planner Design.

Optimization

Optimization Metadata Statistics Cost-Benefit

Accelerate your migration to Amazon OpenSearch Service with Reindexing-from-Snapshot

AWS Big Data

NOVEMBER 22, 2024

How RFS works: OpenSearch and Elasticsearch snapshots are a directory tree that contains both data and metadata. Metadata files exist in the snapshot to provide details about the snapshot as a whole, the source cluster’s global metadata and settings, each index in the snapshot, and each shard in the snapshot.

Snapshot

Snapshot Metadata Recreation/Entertainment Data Processing

RDF-Star: Metadata Complexity Simplified

Ontotext

JUNE 10, 2021

Relational databases benefit from decades of tweaks and optimizations to deliver performance. Not Every Graph is a Knowledge Graph: Schemas and Semantic Metadata Matter. This metadata should then be represented, along with its intricate relationships, in a connected knowledge graph model that can be understood by the business teams”.

Metadata

Metadata Cost-Benefit OLAP Modeling

Use open table format libraries on AWS Glue 5.0 for Apache Spark

AWS Big Data

DECEMBER 4, 2024

The adoption of open table formats is a crucial consideration for organizations looking to optimize their data management practices and extract maximum value from their data. An Iceberg table’s metadata stores a history of snapshots, which are updated with each transaction. In earlier posts, we discussed AWS Glue 5.0 for Apache Spark.

Snapshot

Snapshot Metadata Data Lake Optimization

Apache Iceberg optimization: Solving the small files problem in Amazon EMR

AWS Big Data

OCTOBER 3, 2023

Iceberg tables store metadata in manifest files. As the number of data files increases, the amount of metadata stored in these manifest files also increases, leading to longer query planning time. The query runtime also increases because it’s proportional to the number of data or metadata file read operations.
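Because the cost described above grows with the number of files, a common remedy is to rewrite many small files into fewer large ones (compaction). The sketch below only plans such a rewrite with a simple greedy grouping; it is not Iceberg's own algorithm, and `plan_compaction`, the 128 MB target, and the file sizes are assumptions made for illustration.

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Group files greedily so each group's total size stays at or below target_mb.

    Files are taken largest first; a group is closed as soon as the next file
    would push it over the target. A single file larger than the target ends
    up in a group of its own.
    """
    groups = []
    current, current_size = [], 0
    for size in sorted(file_sizes_mb, reverse=True):
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Hypothetical table: 64 data files of 8 MB each.
small_files = [8] * 64
groups = plan_compaction(small_files, target_mb=128)

print(f"files before: {len(small_files)}, files after: {len(groups)}")
```

Each group would become one output file, so the number of file read operations a query has to perform drops from the number of input files to the number of groups.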

Optimization

Optimization Snapshot Data Lake Metadata

Enhance data governance with enforced metadata rules in Amazon DataZone

AWS Big Data

NOVEMBER 20, 2024

We’re excited to announce a new feature in Amazon DataZone that offers enhanced metadata governance for your subscription approval process. With this update, domain owners can define and enforce metadata requirements for data consumers when they request access to data assets. Key benefits: The feature benefits multiple stakeholders.

Metadata

Metadata Data Governance Metrics Marketing

Integrate custom applications with AWS Lake Formation – Part 2

AWS Big Data

NOVEMBER 19, 2024

Run the following commands:

export PROJ_NAME=lfappblog
aws s3 cp s3://aws-blogs-artifacts-public/BDB-3934/schema.graphql ~/${PROJ_NAME}/amplify/backend/api/${PROJ_NAME}/schema.graphql

In the schema.graphql file, you can see that the lf-app-lambda-engine function is set as the data source for the GraphQL queries.

Data Processing

Data Processing Metadata Publishing Testing

How HPE Aruba Supply Chain optimized cost and performance by migrating to an AWS modern data architecture

AWS Big Data

SEPTEMBER 11, 2024

This blog post is co-written with Hardeep Randhawa and Abhay Kumar from HPE. Each file arrives as a pair with a tail metadata file in CSV format containing the size and name of the file. This metadata file is later used to read source file names during processing into the staging layer.

Data Architecture

Data Architecture Optimization Data Warehouse Metadata

5 Ways Data Modeling Is Critical to Data Governance

erwin

JANUARY 9, 2020

For decades, data modeling has been the optimal way to design and deploy new relational databases with high-quality data sources and support application development. That’s because it’s the best way to visualize metadata, and metadata is now the heart of enterprise data management and data governance/intelligence efforts.

Data Governance

Data Governance Modeling Metadata Unstructured Data

Demystify data sharing and collaboration patterns on AWS: Choosing the right tool for the job

AWS Big Data

OCTOBER 21, 2024

In this context, Amazon DataZone is the optimal choice for managing the enterprise data platform. Business analysts enhance the data with business metadata/glossaries and publish the same as data assets or data products. As stated earlier, the first step involves data ingestion. Amazon Athena is used to query and explore the data.

Sales

Sales Data-driven Data Processing Key Performance Indicator

Doing Cloud Migration and Data Governance Right the First Time

erwin

OCTOBER 8, 2020

With all these diverse metadata sources, it is difficult to understand the complicated web they form, much less get a simple visual flow of data lineage and impact analysis. The metadata-driven suite automatically finds, models, ingests, catalogs and governs cloud data assets. Subscribe to the erwin Expert Blog.

Data Governance

Data Governance Metadata Testing Data Lake

Do I Need a Data Catalog?

erwin

JUNE 26, 2020

Organizations with particularly deep data stores might need a data catalog with advanced capabilities, such as automated metadata harvesting to speed up the data preparation process. The most optimal and streamlined way to achieve this is by using a data catalog, which can provide a first stop for users ahead of working in BI platforms.

Metadata

Metadata Cost-Benefit Measurement Data-driven

Run Trino queries 2.7 times faster with Amazon EMR 6.15.0

AWS Big Data

MARCH 22, 2024

When you use Trino on Amazon EMR or Athena, you get the latest open source community innovations along with proprietary, AWS-developed optimizations. Starting from Amazon EMR 6.8.0 and Athena engine version 2, AWS has been developing query plan and engine behavior optimizations that improve query performance on Trino.

Metadata

Metadata Statistics Broadcasting Optimization

How Cloudinary transformed their petabyte scale streaming data lake with Apache Iceberg and AWS Analytics

AWS Big Data

JUNE 10, 2024

Cloudinary is a cloud-based media management platform that provides a comprehensive set of tools and services for managing, optimizing, and delivering images, videos, and other media assets on websites and mobile applications. This concept makes Iceberg extremely versatile.

Data Lake

Data Lake Metadata Snapshot Analytics

Use Apache Iceberg in a data lake to support incremental data processing

AWS Big Data

MARCH 2, 2023

Apache Iceberg is an open table format for very large analytic datasets, which captures metadata information on the state of datasets as they evolve and change over time. Apache Iceberg addresses customer needs by capturing rich metadata information about the dataset at the time the individual data files are created.

Data Lake

Data Lake Data Processing Metadata Snapshot

Simplify data integration with AWS Glue and zero-ETL to Amazon SageMaker Lakehouse

AWS Big Data

DECEMBER 4, 2024

This blog post will explore how zero-ETL capabilities combined with its new application connectors are transforming the way businesses integrate and analyze their data from popular platforms such as ServiceNow, Salesforce, Zendesk, SAP and others. The data is also registered in the Glue Data Catalog, a metadata repository.

Data Integration

Data Integration Data Lake Statistics Data-driven

Maximize your data dividends with active metadata

IBM Big Data Hub

NOVEMBER 28, 2022

Metadata management performs a critical role within the modern data management stack. However, as data volumes continue to grow, manual approaches to metadata management are sub-optimal and can result in missed opportunities. This puts into perspective the role of active metadata management. Improve data discovery.

Metadata

Metadata Data Quality Data-driven Data Governance

Building Your Human Benchmark with Ontotext Metadata Studio

Ontotext

FEBRUARY 16, 2023

Ontotext’s approach is to optimize models and algorithms through human contribution and benchmarking in order to create better and more accurate AI. You can read more about it in this blog post. What Are The Benefits Of Using Ontotext Metadata Studio? Ontotext Metadata Studio addresses all of these problems head on.

Metadata

Metadata Measurement Metrics Modeling

Introducing Cloudera Observability Premium

Cloudera

JULY 10, 2024

In the public cloud, these cost management issues are compounded by consumption rates, where compute is often overused due to a lack of visibility into optimization opportunities. The data temperature feature lets us see whether hot or cold data sets are deployed optimally, including the underlying file sizes and partitioning styles.

Cost-Benefit

Cost-Benefit Metadata Optimization Measurement

How Cargotec uses metadata replication to enable cross-account data sharing

AWS Big Data

JUNE 7, 2023

This is a guest blog post co-written with Sumesh M R from Cargotec and Tero Karttunen from Knowit Finland. Through their unique position in ports, at sea, and on roads, they optimize global cargo flows and create sustainable customer value. An AWS Glue job (metadata exporter) runs daily on the source account.

Metadata

Metadata Data Lake Machine Learning Big Data

How Far We Can Go with GenAI as an Information Extraction Tool

Ontotext

JANUARY 10, 2025

This blog post summarizes our findings, focusing on NER as a first-step key task for knowledge extraction. We also experimented with prompt optimization tools; however, these experiments did not yield promising results. In many cases, prompt optimizers were removing crucial entity-specific information and oversimplifying.

Informatics

Informatics Modeling Metadata Experimentation

How to optimize application performance with NS1 traffic steering

IBM Big Data Hub

DECEMBER 18, 2023

Achieving consistently high performance requires an efficient routing system, optimizing traffic between the services your application depends on. In summary, IBM NS1 Connect offers a range of traffic steering options to meet diverse business needs to help ensure optimal application performance in the “now” era.

Optimization

Optimization Metadata Metrics Management

Why You Need End-to-End Data Lineage

erwin

SEPTEMBER 10, 2020

In a previous blog, I explained that data lineage is basically the history of data, including a data set’s origin, characteristics, quality and movement over time. This information is critical to regulatory compliance, change management and data governance, not to mention delivering an optimal customer experience.

Data Governance

Data Governance Key Performance Indicator Metadata Digital Transformation

Introducing Apache Iceberg in Cloudera Data Platform

Cloudera

FEBRUARY 22, 2022

Companies such as Adobe, Expedia, LinkedIn, Tencent, and Netflix have published blogs about their Apache Iceberg adoption for processing their large-scale analytics datasets. Users do not need to know how the table is partitioned to optimize the SQL query performance. Multi-function analytics.

Snapshot

Snapshot Metadata Cost-Benefit Data Architecture

A Look Back at the Gartner Data and Analytics Summit

Cloudera

APRIL 18, 2024

Artificial intelligence (AI) is something that, by its very nature, can be surrounded by a sea of skepticism but also excitement and optimism when it comes to harnessing its power. Preparing For an AI-powered Future There’s plenty of optimism and interest surrounding GenAI and AI more broadly.

Analytics

Analytics Metadata Data Strategy Optimization

What is Active Metadata & Why it Matters: Key Insights from Gartner’s Market Guide

Alation

MARCH 2, 2023

With lots of data comes yet more calls for automation, optimization, and productivity initiatives to put that data to good use. Analysis, however, requires enterprises to find and collect metadata. Download Gartner’s “Market Guide for Active Metadata Management” to learn more, or read on for a summary of the firm’s outlook.

Metadata

Metadata Marketing IT Data Quality

6 Case Studies on The Benefits of Business Intelligence And Analytics

datapine

JANUARY 31, 2022

This benefit goes directly in hand with the fact that analytics provide businesses with technologies to spot trends and patterns that will lead to the optimization of resources and processes. It is important to optimize processes, increase operational efficiency, drive new revenue, and improve the decision-making of the company.

Business Intelligence

Business Intelligence Analytics Cost-Benefit ROI

Optimized joins & filtering with Bloom filter predicate in Kudu

Cloudera

JANUARY 15, 2021

Pushing down column predicate filters to Kudu allows for optimized execution by skipping reading column values for filtered out rows and reducing network IO between a client, like the distributed query engine Apache Impala, and Kudu. One of the ways Apache Kudu achieves this is by supporting column predicates with scanners. Join Queries.
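The idea of a Bloom filter predicate can be sketched in a few lines: build a compact filter from the join keys of the small table, then use it on the scan side to drop rows from the large table before they reach the join. The code below is a self-contained toy, not Kudu's or Impala's API; the `BloomFilter` class, its sizing, and both tables are assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, a small chance of false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


# Hypothetical tables: a small dimension table and a large fact table.
small_table = [(1, "gold"), (7, "silver")]
big_table = [(i, f"order-{i}") for i in range(1000)]

# Build side: add every join key of the small table to the filter.
bloom = BloomFilter()
for key, _ in small_table:
    bloom.add(key)

# Scan side: rows whose key is definitely absent are skipped before the join.
candidates = [row for row in big_table if bloom.might_contain(row[0])]

# The join still checks exact equality, so false positives cannot change the result.
tiers = dict(small_table)
joined = [(key, order, tiers[key]) for key, order in candidates if key in tiers]

print(f"rows scanned past the filter: {len(candidates)} of {len(big_table)}")
print(joined)
```

In a distributed setting the benefit comes from where the filter runs: if it is evaluated next to the storage, the rows it rejects never cross the network to the query engine.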

Optimization

Optimization Broadcasting Testing Metadata

Biggest Trends in Data Visualization Taking Shape in 2022

Smart Data Collective

OCTOBER 13, 2021

This is something that you can learn more about in just about any technology blog. How is Data Virtualization performance optimized? The study and analysis of data makes it possible to improve the automation of processes, optimize sales strategies, and improve business efficiency. for scalable performance in demanding environments.

Visualization

Visualization Cost-Benefit Big Data Prescriptive Analytics

Data Intelligence in the Next Normal; Why, Who and When?

erwin

JANUARY 14, 2021

As the economy slowed, they focused on cost optimization. Even if you don’t have a formal data intelligence program in place, there is a good possibility your organization has intelligence about its data, because it is difficult for data to exist without some form of associated metadata.

Digital Transformation

Digital Transformation Metadata Big Data Data-driven

Benefits of AI-Driven Mobile App Development in E-Commerce

Smart Data Collective

MAY 11, 2023

Bhaval Patel of Space-O Technologies wrote a blog post about the growing importance of AI for mobile apps. In this blog post, we will explore how AI-driven app development strategies can help your e-commerce business stay ahead in the mobile-first world. AI has been invaluable for e-commerce brands.

Cost-Benefit

Cost-Benefit Data-driven Optimization Machine Learning

Optimization Strategies for Iceberg Tables

Cloudera

FEBRUARY 14, 2024

This blog discusses a few problems that you might encounter with Iceberg tables and offers strategies on how to optimize them in each of those scenarios. A bloated metadata.json file could increase both read and write times because a large metadata file needs to be read or written every time. Iceberg doesn’t delete the old data files.

Optimization

Optimization Strategy Snapshot Metadata

Introducing Apache Hudi support with AWS Glue crawlers

AWS Big Data

NOVEMBER 22, 2023

Hudi provides tables, transactions, efficient upserts and deletes, advanced indexes, streaming ingestion services, data clustering and compaction optimizations, and concurrency control, all while keeping your data in open source file formats. Read optimized queries – For MoR tables, queries see the latest data compacted.

Data Lake

Data Lake Snapshot Metadata Optimization

How Cloudera Data Flow Enables Successful Data Mesh Architectures

Cloudera

OCTOBER 7, 2021

In this blog, I will demonstrate the value of Cloudera DataFlow (CDF), the edge-to-cloud streaming data platform available on the Cloudera Data Platform (CDP), as a Data integration and Democratization fabric. Data and Metadata: Data inputs and data outputs produced based on the application logic. Introduction.

Metadata

Metadata Cost-Benefit Enterprise Interactive

Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes

AWS Big Data

MAY 24, 2023

When you build your transactional data lake using Apache Iceberg to solve your functional use cases, you need to focus on operational use cases for your S3 data lake to optimize the production environment. Update your-iceberg-storage-blog in the following configuration with the bucket that you created to test this example.

Data Lake

Data Lake Snapshot Metadata Optimization

Introducing Amazon EMR on EKS with Apache Flink: A scalable, reliable, and efficient data processing platform

AWS Big Data

MAY 28, 2024

It’s the preferred choice to run big data workloads because it helps improve throughput and optimize Amazon EC2 spend. and higher support using the Data Catalog as a metadata store for streaming and batch SQL workflows. Amazon EMR on EKS natively integrates tools and functionalities to enable these—and more.

Data Processing

Data Processing Cost-Benefit Metadata Optimization

Materialized Views in Hive for Iceberg Table Format

Cloudera

FEBRUARY 8, 2024

Overview: This blog post describes support for materialized views for the Iceberg table format. Queries containing joins, filters, projections, group-by, or aggregations without group-by can be transparently rewritten by the Hive optimizer to use one or more eligible materialized views.

Snapshot

Snapshot Metadata Cost-Benefit Data Warehouse

2024 Gartner Market Guide To DataOps

DataKitchen

AUGUST 16, 2024

Data Pipeline Observability: Optimizes pipelines by monitoring data quality, detecting issues, tracing data lineage, and identifying anomalies using live and historical metadata. At DataKitchen, we think of this as a ‘meta-orchestration’ of the code and tools acting upon the data.

Marketing

Marketing Data Quality Testing Metadata

3x better performance with CDP Data Warehouse compared to EMR in TPC-DS benchmark

Cloudera

DECEMBER 11, 2020

In a previous blog post on CDW performance, we compared Azure HDInsight to CDW. In this blog post, we compare Cloudera Data Warehouse (CDW) on Cloudera Data Platform (CDP) using Apache Hive-LLAP to EMR 6.0 (also powered by Apache Hive-LLAP) on Amazon using the TPC-DS 2.9 More on this later in the blog.

Data Warehouse

Data Warehouse Metadata Machine Learning Measurement
