Blog, Metadata and Testing - Data Leaders Brief

Announcing Open Source DataOps Data Quality TestGen 3.0

DataKitchen

FEBRUARY 20, 2025

Now With Actionable, Automatic, Data Quality Dashboards. Imagine a tool that can point at any dataset, learn from your data, screen for typical data quality issues, and then automatically generate and perform powerful tests, analyzing and scoring your data to pinpoint issues before they snowball. DataOps just got more intelligent.

Data Quality

Data Quality Scorecard Testing Dashboards

Enriching metadata for accurate text-to-SQL generation for Amazon Athena

AWS Big Data

OCTOBER 14, 2024

Writing SQL queries requires not just remembering the SQL syntax rules, but also knowledge of the table metadata, which is data about table schemas, relationships among the tables, and possible column values. Although LLMs can generate syntactically correct SQL queries, they still need the table metadata to write accurate SQL queries.

Metadata

Metadata Data Lake Modeling Data Warehouse
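
As a rough illustration of the idea in the Athena teaser above (giving an LLM table schemas, relationships, and possible column values alongside the question), here is a minimal Python sketch. The `Column` and `Table` classes, the `render_metadata` and `build_prompt` helpers, and the prompt wording are all hypothetical and are not taken from the AWS post; the sketch only assembles prompt text and does not call any model or Athena API.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    # Hypothetical metadata record for one column.
    name: str
    dtype: str
    description: str = ""
    sample_values: list = field(default_factory=list)


@dataclass
class Table:
    # Hypothetical metadata record for one table.
    name: str
    columns: list
    foreign_keys: dict = field(default_factory=dict)  # column -> "table.column"


def render_metadata(tables):
    """Flatten table metadata into plain text an LLM prompt can carry."""
    lines = []
    for t in tables:
        lines.append(f"TABLE {t.name}")
        for c in t.columns:
            line = f"  {c.name} {c.dtype}"
            if c.description:
                line += f" -- {c.description}"
            if c.sample_values:
                vals = ", ".join(repr(v) for v in c.sample_values)
                line += f" (values: {vals})"
            lines.append(line)
        for col, target in t.foreign_keys.items():
            lines.append(f"  FK {col} -> {target}")
    return "\n".join(lines)


def build_prompt(question, tables):
    """Combine the metadata block and the user's question into one prompt."""
    return (
        "Use only the tables and columns described below.\n\n"
        f"{render_metadata(tables)}\n\n"
        f"Question: {question}\nSQL:"
    )


orders = Table(
    name="orders",
    columns=[
        Column("order_id", "bigint"),
        Column("status", "string", "order state", ["open", "shipped"]),
        Column("customer_id", "bigint"),
    ],
    foreign_keys={"customer_id": "customers.customer_id"},
)
prompt = build_prompt("How many shipped orders are there?", [orders])
```

Listing sample values such as 'shipped' is what lets a model pick the right literal in a WHERE clause, which is the kind of accuracy gain the teaser attributes to enriched metadata.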

7 Benefits of Metadata Management

erwin

FEBRUARY 19, 2021

Metadata management is key to wringing all the value possible from data assets. What Is Metadata? Analyst firm Gartner defines metadata as “information that describes various facets of an information asset to improve its usability throughout its life cycle. It is metadata that turns information into an asset.”

Metadata

Metadata Management Data Quality Cost-Benefit

Apache Ozone Metadata Explained

Cloudera

JUNE 2, 2021

As an important part of achieving better scalability, Ozone separates the metadata management among different services: Ozone Manager (OM) service manages the metadata of the namespace such as volume, bucket and keys. Datanode service manages the metadata of blocks, containers and pipelines running on the datanode.

Metadata

Metadata Snapshot Testing Management

Addressing Data Mesh Technical Challenges with DataOps

DataKitchen

AUGUST 9, 2021

The domain requires a team that creates/updates/runs the domain, and we can’t forget metadata: catalogs, lineage, test results, processing history, etc. … It can orchestrate a hierarchy of directed acyclic graphs (DAGs) that span domains and integrates testing at each step of processing.

Testing

Testing Data Lake Metadata Publishing

Business Strategies for Deploying Disruptive Tech: Generative AI and ChatGPT

Rocket-Powered Data Science

FEBRUARY 15, 2023

These rules are not necessarily “Rocket Science” (despite the name of this blog site), but they are common business sense for most business-disruptive technology implementations in enterprises. Keep it agile, with short design, develop, test, release, and feedback cycles: keep it lean, and build on incremental changes.

Strategy

Strategy Experimentation Uncertainty Machine Learning

Four Use Cases Proving the Benefits of Metadata-Driven Automation

erwin

FEBRUARY 7, 2019

Organizations cannot hope to make the most of a data-driven strategy without at least some degree of metadata-driven automation. Metadata-Driven Automation in the BFSI Industry. Metadata-Driven Automation in the Pharmaceutical Industry. Metadata-Driven Automation in the Insurance Industry.

Metadata

Metadata Insurance Data-driven Cost-Benefit

Integrate custom applications with AWS Lake Formation – Part 2

AWS Big Data

NOVEMBER 19, 2024

You can now test the newly created application by running the following command: npm run dev. By default, the application is available on port 5173 on your local machine. Unfiltered Table Metadata: This tab displays the response of the AWS Glue API GetUnfilteredTableMetadata policies for the selected table.

Data Processing

Data Processing Metadata Publishing Testing

Recap of Amazon Redshift key product announcements in 2024

AWS Big Data

DECEMBER 17, 2024

We have enhanced data sharing performance with improved metadata handling, resulting in data sharing first query execution that is up to four times faster when the data sharing producer’s data is being updated. In internal tests, AI-driven scaling and optimizations showcased up to 10 times price-performance improvements for variable workloads.

Data Lake

Data Lake Data Warehouse Data-driven Optimization

Doing Cloud Migration and Data Governance Right the First Time

erwin

OCTOBER 8, 2020

With all these diverse metadata sources, it is difficult to understand the complicated web they form, much less get a simple visual flow of data lineage and impact analysis. The metadata-driven suite automatically finds, models, ingests, catalogs and governs cloud data assets.

Data Governance

Data Governance Metadata Testing Data Lake

Enhance data governance with enforced metadata rules in Amazon DataZone

AWS Big Data

NOVEMBER 20, 2024

We’re excited to announce a new feature in Amazon DataZone that offers enhanced metadata governance for your subscription approval process. With this update, domain owners can define and enforce metadata requirements for data consumers when they request access to data assets. Key benefits: The feature benefits multiple stakeholders.

Metadata

Metadata Data Governance Metrics Marketing

Expanding data analysis and visualization options: Amazon DataZone now integrates with Tableau, Power BI, and more

AWS Big Data

OCTOBER 30, 2024

Collaborating closely with our partners, we have tested and validated Amazon DataZone authentication via the Athena JDBC connection, providing an intuitive and secure connection experience for users. Choose Test connection. See the Amazon DataZone and Tableau blog post for step-by-step instructions.

Visualization

Visualization Data Lake Testing Data Governance

What is a Data Mesh?

DataKitchen

AUGUST 3, 2021

A five- to nine-person team owns the dev, test, deployment, monitoring and maintenance of a domain. Discoverable – users have access to a catalog or metadata management tool which renders the domain discoverable and accessible. We’ll cover some of the potential challenges facing data mesh enterprise architectures in our next blog.

Data Architecture

Data Architecture Data Lake Cost-Benefit Data Warehouse

Use Apache Iceberg in a data lake to support incremental data processing

AWS Big Data

MARCH 2, 2023

Apache Iceberg is an open table format for very large analytic datasets, which captures metadata information on the state of datasets as they evolve and change over time. Apache Iceberg addresses customer needs by capturing rich metadata information about the dataset at the time the individual data files are created.

Data Lake

Data Lake Data Processing Metadata Snapshot

A Day in the Life of a DataOps Engineer

DataKitchen

OCTOBER 11, 2021

There are no automated tests, so errors frequently pass through the pipeline. There is no process to spin up an isolated dev environment to quickly add a feature, test it with actual data and deploy it to production. The pipeline has automated tests at each step, making sure that each step completes successfully.

Testing

Testing Metadata Dashboards Statistics
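
The last sentence of the DataOps Engineer teaser above (automated tests at each step of the pipeline) can be pictured with a small Python sketch. The `run_pipeline` helper, the step names, and the checks are hypothetical illustrations, not code from the DataKitchen post or any DataKitchen product.

```python
def run_pipeline(data, steps):
    """Run each transform in order and test its output before moving on.

    `steps` is a list of (name, transform, check) tuples. A failed check
    stops the run immediately so a bad intermediate result never reaches
    the next step.
    """
    for name, transform, check in steps:
        data = transform(data)
        if not check(data):
            raise ValueError(f"test failed after step '{name}'")
    return data


# Hypothetical two-step pipeline: remove missing rows, then double values.
example_steps = [
    (
        "drop_missing",
        lambda rows: [r for r in rows if r is not None],
        lambda rows: all(r is not None for r in rows),
    ),
    (
        "double",
        lambda rows: [r * 2 for r in rows],
        lambda rows: all(r % 2 == 0 for r in rows),
    ),
]

result = run_pipeline([1, None, 2], example_steps)
```

Pairing each transform with its own check keeps a failure attributable to a single named step instead of surfacing later in the final output.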

Using the metadata service to identify disks in your VSI with IBM Cloud VPC

IBM Big Data Hub

JUNE 12, 2023

If we log in to the VSI, we can see the volume disks: [root@test-metadata ~]# ls -la /dev/disk/by-id … If we want to find the data volume named test-metadata-volume, we see that it is the vdd disk. Recently, IBM Cloud VPC introduced the metadata service.

Metadata

Metadata Testing Software IT

A Data Prediction for 2025

DataKitchen

FEBRUARY 2, 2023

DataOps Automation (Orchestration, Environment Management, Deployment Automation); DataOps Observability (Monitoring, Test Automation); Data Governance (Catalogs, Lineage, Stewardship); Data Privacy (Access and Compliance); Data Team Management (Projects, Tickets, Documentation, Value Stream Management). What are the drivers of this consolidation?

Metadata

Metadata Testing Data Science Risk

5 Ways Data Modeling Is Critical to Data Governance

erwin

JANUARY 9, 2020

That’s because it’s the best way to visualize metadata, and metadata is now the heart of enterprise data management and data governance/intelligence efforts. erwin DM 2020 is an essential source of metadata and a critical enabler of data governance and intelligence efforts. Click here to test drive the new erwin DM.

Data Governance

Data Governance Modeling Metadata Unstructured Data

How Cloudinary transformed their petabyte scale streaming data lake with Apache Iceberg and AWS Analytics

AWS Big Data

JUNE 10, 2024

In this blog post, we dive into different data aspects and how Cloudinary breaks the two concerns of vendor locking and cost-efficient data analytics by using Apache Iceberg, Amazon Simple Storage Service (Amazon S3), Amazon Athena, Amazon EMR, and AWS Glue. This concept makes Iceberg extremely versatile.

Data Lake

Data Lake Metadata Snapshot Analytics

Migrate an existing data lake to a transactional data lake using Apache Iceberg

AWS Big Data

OCTOBER 3, 2023

This means the data files in the data lake aren’t modified during the migration and all Apache Iceberg metadata files (manifests, manifest files, and table metadata files) are generated outside the purview of the data. In this method, the metadata are recreated in an isolated environment and colocated with the existing data files.

Data Lake

Data Lake Metadata Snapshot Recreation/Entertainment

Bridging the Gap: How ‘Data in Place’ and ‘Data in Use’ Define Complete Data Observability

DataKitchen

SEPTEMBER 21, 2023

In the context of Data in Place, validating data quality automatically with Business Domain Tests is imperative for ensuring the trustworthiness of your data assets. Running these automated tests as part of your DataOps and Data Observability strategy allows for early detection of discrepancies or errors.

Testing

Testing Data Quality Predictive Modeling Metrics

DataOps Facilitates Remote Work

DataKitchen

JANUARY 5, 2021

Data Governance/Catalog (Metadata management) Workflow – Alation, Collibra, Wikis. Observability – Testing inputs, outputs, and business logic at each stage of the data analytics pipeline. Tests catch potential errors and warnings before they are released, so the quality remains high.

Testing

Testing Data Governance Metadata Visualization

Why data observability is essential to AI governance

erwin

DECEMBER 9, 2024

Metadata is the basis of trust for data forensics as we answer the questions of fact or fiction when it comes to the data we see. Being that AI is comprised of more data than code, it is now more essential than ever to combine data with metadata in near real-time.

Metadata

Metadata Data Quality Sales Modeling

Apache Iceberg optimization: Solving the small files problem in Amazon EMR

AWS Big Data

OCTOBER 3, 2023

Iceberg tables store metadata in manifest files. As the number of data files increases, the amount of metadata stored in these manifest files also increases, leading to longer query planning time. The query runtime also increases because it’s proportional to the number of data or metadata file read operations. … with Spark 3.3.2, …

Optimization

Optimization Snapshot Data Lake Metadata

Metadata enrichment – highly scalable data classification and data discovery

IBM Big Data Hub

JULY 28, 2022

Metadata enrichment is about scaling the onboarding of new data into a governed data landscape by taking data and applying the appropriate business terms, data classes and quality assessments so it can be discovered, governed and utilized effectively. Scalability and elasticity.

Metadata

Metadata Machine Learning Data Quality Statistics

2024 Gartner Market Guide To DataOps

DataKitchen

AUGUST 16, 2024

Data Pipeline Observability: Optimizes pipelines by monitoring data quality, detecting issues, tracing data lineage, and identifying anomalies using live and historical metadata. This capability includes monitoring, logging, and business-rule detection.

Marketing

Marketing Data Quality Testing Metadata

Implement a custom subscription workflow for unmanaged Amazon S3 assets published with Amazon DataZone

AWS Big Data

DECEMBER 19, 2024

After you create the asset, you can add glossaries or metadata forms, but it’s not necessary for this post. Create it as a JSON file on your workstation (for this post, we call it blog-sub-target.json). Enter a name for the asset. For Asset type, choose S3 object collection. For S3 location ARN, enter the ARN of the S3 prefix.

Publishing

Publishing Unstructured Data Metadata Data-driven

Upgrade Journey: The Path from CDH to CDP Private Cloud

Cloudera

SEPTEMBER 28, 2020

They have dev, test, and production clusters running critical workloads and want to upgrade their clusters to CDP Private Cloud Base. Customer Environment: The customer has three environments: development, test, and production. Test and QA. Let’s take a look at one customer’s upgrade journey.

Testing

Testing Metadata Risk Data Science

The Ultimate Guide to Modern Data Quality Management (DQM) For An Effective Data Quality Control Driven by The Right Metrics

datapine

SEPTEMBER 29, 2022

It involves: reviewing data in detail; comparing and contrasting the data to its own metadata; running statistical models; data quality reports. Also known as data validation, integrity refers to the structural testing of data to ensure that the data complies with procedures. Your Chance: Want to test a professional analytics software?

Data Quality

Data Quality Metrics Data-driven Management

Introducing a new unified data connection experience with Amazon SageMaker Lakehouse unified data connectivity

AWS Big Data

DECEMBER 16, 2024

For each service, you need to learn the supported authorization and authentication methods, data access APIs, and framework to onboard and test data sources. The SageMaker Lakehouse data connection testing capability boosts your confidence in established connections.

Visualization

Visualization Data Processing Testing Publishing

Keeping Small Queries Fast – Short query optimizations in Apache Impala

Cloudera

NOVEMBER 13, 2020

This is part of our series of blog posts on recent enhancements to Impala. Metadata Caching. As Impala’s adoption grew, the catalog service started to experience these growing pains, so we recently introduced two new features to alleviate the stress: On-demand Metadata and Zero Touch Metadata. More on this below.

Optimization

Optimization Metadata Statistics Cost-Benefit

Introducing Apache Iceberg in Cloudera Data Platform

Cloudera

FEBRUARY 22, 2022

Companies such as Adobe, Expedia, LinkedIn, Tencent, and Netflix have published blogs about their Apache Iceberg adoption for processing their large scale analytics datasets. In CDP we enable Iceberg tables side-by-side with the Hive table types, both of which are part of our SDX metadata and security framework. What’s Next.

Snapshot

Snapshot Metadata Cost-Benefit Data Architecture

How Far We Can Go with GenAI as an Information Extraction Tool

Ontotext

JANUARY 10, 2025

This blog post summarizes our findings, focusing on NER as a first-step key task for knowledge extraction. Our goal is to test whether GenAI can handle diverse domains effectively and determine if it’s a viable tool for domain-specific graph-building tasks.

Informatics

Informatics Modeling Metadata Experimentation

6 Case Studies on The Benefits of Business Intelligence And Analytics

datapine

JANUARY 31, 2022

Everything is being tested, and then the campaigns that succeed get more money put into them, while the others aren’t repeated. This methodology of “test, look at the data, adjust” is at the heart and soul of business intelligence. Your Chance: Want to try a professional BI analytics software?

Business Intelligence

Business Intelligence Analytics Cost-Benefit ROI

What Are ChatGPT and Its Friends?

O'Reilly on Data

MARCH 23, 2023

But Transformers have some other important advantages: Transformers don’t require training data to be labeled; that is, you don’t need metadata that specifies what each sentence in the training data means. It’s by far the most convincing example of a conversation with a machine; it has certainly passed the Turing test.

IT

IT Modeling Testing Risk

NVIDIA RAPIDS in Cloudera Machine Learning

Cloudera

MAY 19, 2021

In the previous blog post in this series, we walked through the steps for leveraging Deep Learning in your Cloudera Machine Learning (CML) projects. Parquet also stores type metadata which makes reading back and processing the files later slightly easier. This notebook goes through loading just the train and test datasets.

Machine Learning

Machine Learning Data Science Data Lake Modeling

How Cloudera Data Flow Enables Successful Data Mesh Architectures

Cloudera

OCTOBER 7, 2021

In this blog, I will demonstrate the value of Cloudera DataFlow (CDF), the edge-to-cloud streaming data platform available on the Cloudera Data Platform (CDP), as a Data integration and Democratization fabric. Data and Metadata: Data inputs and data outputs produced based on the application logic. Introduction.

Metadata

Metadata Cost-Benefit Enterprise Interactive

The Need For Personalized Data Journeys for Your Data Consumers

DataKitchen

OCTOBER 20, 2023

Payload DJs facilitate capturing metadata, lineage, and test results at each phase, enhancing tracking efficiency and reducing the risk of data loss. Example 3: Insurance Card Tracking In the pharmaceutical industry, disjointed business processes can cause data loss as customer information navigates through different systems.

Insurance

Insurance Metadata Data-driven Data Quality

Apache Ozone and Dense Data Nodes

Cloudera

APRIL 22, 2021

Cloudera and Cisco have tested together with dense storage nodes to make this a reality. Can support billions of files (tested up to 10 billion files) in contrast with HDFS which runs into scalability thresholds at 400 million files. Collects and aggregates metadata from components and presents cluster state. Failure Handling.

Data Lake

Data Lake Cost-Benefit Metadata Testing

Run Trino queries 2.7 times faster with Amazon EMR 6.15.0

AWS Big Data

MARCH 22, 2024

Benchmark setup: In our testing, we used the 3 TB dataset stored in Amazon S3 in compressed Parquet format, with metadata for databases and tables stored in the AWS Glue Data Catalog. When statistics aren’t available, Amazon EMR and Athena use S3 file metadata to optimize query plans. With Amazon EMR 6.10.0 …

Metadata

Metadata Statistics Broadcasting Optimization

3x better performance with CDP Data Warehouse compared to EMR in TPC-DS benchmark

Cloudera

DECEMBER 11, 2020

In a previous blog post on CDW performance, we compared Azure HDInsight to CDW. In this blog post, we compare Cloudera Data Warehouse (CDW) on Cloudera Data Platform (CDP) using Apache Hive-LLAP to EMR 6.0 (also powered by Apache Hive-LLAP) on Amazon using the TPC-DS 2.9 … More on this later in the blog.

Data Warehouse

Data Warehouse Metadata Machine Learning Measurement

How Volkswagen streamlined access to data across multiple data lakes using Amazon DataZone – Part 1

AWS Big Data

JULY 18, 2024

This blog post introduces Amazon DataZone and explores how VW used it to build their data mesh to enable streamlined data access across multiple data lakes. This populates the technical metadata in the business data catalog for each data asset. Producers control what to share, for how long, and how consumers interact with it.

Data Lake

Data Lake Publishing Metadata Data-driven

Extracting key insights from Amazon S3 access logs with AWS Glue for Ray

AWS Big Data

SEPTEMBER 7, 2023

This blog post presents an architecture solution that allows customers to extract key insights from Amazon S3 access logs at scale. AWS Glue Data Catalog stores information as metadata tables, where each table specifies a single data store. With exponential growth in data volume, centralized monitoring becomes challenging.

Metadata

Metadata Dashboards Metrics Visualization

erwin Recognized as a March 2020 Gartner Peer Insights Customers’ Choice for Metadata Management Solutions

erwin

APRIL 16, 2020

We’re excited about our recognition as a March 2020 Gartner Peer Insights Customers’ Choice for Metadata Management Solutions. Metadata management is key to sustainable data governance and any other organizational effort that is data-driven. “Critical Application for Information Governance” - Information Scientist, Healthcare Industry.

Metadata

Metadata Management Data Governance Digital Transformation
