Now With Actionable, Automatic Data Quality Dashboards. Imagine a tool that you can point at any dataset, which learns from your data, screens for typical data quality issues, and then automatically generates and performs powerful tests, analyzing and scoring your data to pinpoint issues before they snowball. DataOps just got more intelligent.
Writing SQL queries requires not just remembering SQL syntax rules, but also knowledge of table metadata: data about table schemas, relationships among the tables, and possible column values. Although LLMs can generate syntactically correct SQL queries, they still need that table metadata to write accurate ones.
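A minimal sketch of what supplying that metadata to an LLM might look like; the tables, schema comments, and prompt shape below are invented placeholders, not any specific product's API:

```python
# Sketch: assemble table metadata into a text-to-SQL prompt.
# The schema below is a hypothetical example for illustration.

TABLE_METADATA = """
CREATE TABLE orders (
    order_id BIGINT,        -- primary key
    customer_id BIGINT,     -- foreign key -> customers.customer_id
    status VARCHAR,         -- one of: 'pending', 'shipped', 'delivered'
    order_date DATE
);
CREATE TABLE customers (
    customer_id BIGINT,     -- primary key
    region VARCHAR          -- e.g. 'EMEA', 'APAC', 'AMER'
);
"""

def build_prompt(question: str) -> str:
    """Combine schemas, relationships, and column values with the question."""
    return (
        "Given these tables:\n"
        f"{TABLE_METADATA}\n"
        "Write a syntactically correct SQL query that answers:\n"
        f"{question}\n"
        "Use only the tables and columns above."
    )

# The resulting string would be sent to an LLM of your choice.
print(build_prompt("How many delivered orders per region last month?"))
```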
Iceberg offers distinct advantages over Parquet through its metadata layer, such as improved data management, performance optimization, and integration with various query engines. Iceberg's table format separates data files from metadata files, enabling efficient data modifications without full dataset rewrites.
Customers maintain multiple MWAA environments to separate development stages, optimize resources, manage versions, enhance security, ensure redundancy, customize settings, improve scalability, and facilitate experimentation. If you run a smaller environment class such as mw1.micro, remember to monitor its performance using the recommended metrics to maintain optimal operation.
Impala Optimizations for Small Queries. We'll discuss the various phases Impala takes a query through and how small-query optimizations are incorporated into the design of each phase. Query optimization in databases is a long-standing area of research, with much emphasis on finding near-optimal query plans.
Amazon Q generative SQL for Amazon Redshift uses generative AI to analyze user intent, query patterns, and schema metadata to identify common SQL query patterns directly within Amazon Redshift, accelerating the query authoring process for users and reducing the time required to derive actionable data insights.
Iceberg tables store metadata in manifest files. As the number of data files increases, the amount of metadata stored in these manifest files also increases, leading to longer query planning times. Query runtime also increases because it's proportional to the number of data or metadata file read operations.
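Iceberg ships a Spark procedure that compacts manifests to counter exactly this; a minimal PySpark sketch, assuming an Iceberg-enabled session with a catalog named my_catalog and a table db.events (both hypothetical names):

```python
from pyspark.sql import SparkSession

# Assumes Spark is configured with an Iceberg catalog named "my_catalog".
spark = SparkSession.builder.appName("compact-manifests").getOrCreate()

# Iceberg's rewrite_manifests procedure consolidates small manifest files,
# which shortens query planning as the table accumulates data files.
spark.sql("CALL my_catalog.system.rewrite_manifests('db.events')").show()
```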
We won’t be writing code to optimize scheduling in a manufacturing plant; we’ll be training ML algorithms to find optimum performance based on historical data. Software developers have a large body of tools to choose from: IDEs, CI/CD tools, automated testing tools, and so on. If humans are no longer needed to write enterprise applications, what do we do?
Starting today, the Athena SQL engine uses a cost-based optimizer (CBO), a new feature that uses table and column statistics stored in the AWS Glue Data Catalog as part of the table’s metadata. Let’s discuss some of the cost-based optimization techniques that contributed to improved query performance.
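The statistics the CBO relies on live in the Glue Data Catalog and can be inspected programmatically; a hedged boto3 sketch (the database, table, and column names are placeholders):

```python
import boto3

glue = boto3.client("glue")

# Fetch the stored column statistics the optimizer can draw on.
# Database/table/column names here are illustrative placeholders.
resp = glue.get_column_statistics_for_table(
    DatabaseName="sales_db",
    TableName="orders",
    ColumnNames=["order_id", "status"],
)
for col in resp["ColumnStatisticsList"]:
    print(col["ColumnName"], col["StatisticsData"]["Type"])
```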
In a previous post, we noted some key attributes that distinguish a machine learning project: unlike traditional software, where the goal is to meet a functional specification, in ML the goal is to optimize a metric. A catalog or a database that lists models, including when they were tested, trained, and deployed.
As data-centric AI, automated metadata management, and privacy-aware data sharing mature, the opportunity to embed data quality into the enterprise's core has never been more significant. Data fabric: a metadata-rich integration layer across distributed systems; its challenges are implementation complexity and reliance on robust metadata management.
As the use of Hydro grows within REA, it’s crucial to perform capacity planning to meet user demands while maintaining optimal performance and cost-efficiency. To address this, we used the AWS performance testing framework for Apache Kafka to evaluate the theoretical performance limits.
First query response times for dashboard queries have improved significantly through optimized code execution and reduced compilation overhead. We have enhanced autonomics algorithms to generate and implement smarter, quicker data layout recommendations for distribution and sort keys, further optimizing performance.
For decades, data modeling has been the optimal way to design and deploy new relational databases with high-quality data sources and support application development. That's because it's the best way to visualize metadata, and metadata is now the heart of enterprise data management and data governance/intelligence efforts.
Cloudinary is a cloud-based media management platform that provides a comprehensive set of tools and services for managing, optimizing, and delivering images, videos, and other media assets on websites and mobile applications. This concept makes Iceberg extremely versatile.
With all these diverse metadata sources, it is difficult to understand the complicated web they form, much less get a simple visual flow of data lineage and impact analysis. The metadata-driven suite automatically finds, models, ingests, catalogs, and governs cloud data assets, supporting compliance mandates (e.g., GDPR, CCPA, HIPAA, SOX, PCI DSS).
We’re excited to announce a new feature in Amazon DataZone that offers enhanced metadata governance for your subscription approval process. With this update, domain owners can define and enforce metadata requirements for data consumers when they request access to data assets. Key benefits: the feature benefits multiple stakeholders.
Data Pipeline Observability: Optimizes pipelines by monitoring data quality, detecting issues, tracing data lineage, and identifying anomalies using live and historical metadata. This capability includes monitoring, logging, and business-rule detection.
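As a toy illustration of the anomaly-detection idea applied to historical pipeline metadata, the row counts and the 3-sigma threshold below are invented for the sketch:

```python
import statistics

# Historical daily row counts from pipeline metadata (illustrative values).
history = [10_120, 9_980, 10_340, 10_050, 9_870, 10_210]
today = 6_400

mean = statistics.mean(history)
stdev = statistics.stdev(history)
z = (today - mean) / stdev

# Flag counts more than 3 standard deviations from the recent mean.
if abs(z) > 3:
    print(f"Anomaly: today's count {today} deviates (z={z:.1f}) from mean {mean:.0f}")
```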
Data Governance/Catalog (metadata management) workflow – Alation, Collibra, wikis. Tools influence their optimal iteration cycle time, e.g., months/weeks/days. Observability – testing inputs, outputs, and business logic at each stage of the data analytics pipeline. Tools determine their approach to solving problems.
You can now test the newly created application by running the following command: npm run dev. By default, the application is available on port 5173 on your local machine. Unfiltered Table Metadata: this tab displays the response of the AWS Glue GetUnfilteredTableMetadata API for the selected table.
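For reference, a hedged boto3 sketch of calling that API directly; the account ID, database, and table names are placeholders:

```python
import boto3

glue = boto3.client("glue")

# GetUnfilteredTableMetadata returns the table definition plus any
# Lake Formation column/cell filters that apply to the caller.
# CatalogId, DatabaseName, and Name below are placeholders.
resp = glue.get_unfiltered_table_metadata(
    CatalogId="111122223333",
    DatabaseName="analytics_db",
    Name="customer_events",
    SupportedPermissionTypes=["COLUMN_PERMISSION", "CELL_FILTER_PERMISSION"],
)
print(resp["Table"]["Name"], resp.get("AuthorizedColumns"))
```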
S3 Tables are specifically optimized for analytics workloads, resulting in up to 3 times faster query throughput and up to 10 times higher transactions per second compared to self-managed tables. These metadata tables are stored in S3 Tables, the new S3 storage offering optimized for tabular data.
Apache Iceberg is an open table format for very large analytic datasets, which captures metadata information on the state of datasets as they evolve and change over time. Apache Iceberg addresses customer needs by capturing rich metadata information about the dataset at the time the individual data files are created.
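That captured state is directly queryable through Iceberg's metadata tables; a minimal PySpark sketch reading a table's snapshot history (the catalog and table names are placeholders):

```python
from pyspark.sql import SparkSession

# Assumes an Iceberg-enabled Spark session with a catalog named "my_catalog".
spark = SparkSession.builder.appName("inspect-iceberg").getOrCreate()

# Every Iceberg table exposes metadata tables such as .snapshots,
# .manifests, and .files alongside the data itself.
spark.sql(
    "SELECT snapshot_id, committed_at, operation "
    "FROM my_catalog.db.events.snapshots"
).show(truncate=False)
```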
L1 is usually the raw, unprocessed data ingested directly from various sources; L2 is an intermediate layer featuring data that has undergone some form of transformation or cleaning; and L3 contains highly processed, optimized data that is typically ready for analytics and decision-making processes.
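A compact pandas sketch of those three layers; the columns, cleaning rules, and values are invented for illustration:

```python
import pandas as pd

# L1: raw, unprocessed records as ingested (illustrative data).
l1 = pd.DataFrame({
    "user_id": [1, 2, 2, None],
    "amount": ["10.5", "7.0", "7.0", "3.2"],
})

# L2: intermediate layer -- typed, deduplicated, nulls dropped.
l2 = (
    l1.dropna(subset=["user_id"])
      .drop_duplicates()
      .assign(amount=lambda df: df["amount"].astype(float))
)

# L3: highly processed, analytics-ready aggregate.
l3 = l2.groupby("user_id", as_index=False)["amount"].sum()
print(l3)
```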
Everything is being tested, and then the campaigns that succeed get more money put into them, while the others aren’t repeated. This methodology of “test, look at the data, adjust” is at the heart and soul of business intelligence. Your Chance: Want to try professional BI analytics software? Let’s see it with a real-world example.
Running Apache Airflow at scale puts proportionally greater load on the Airflow metadata database, sometimes leading to CPU and memory issues on the underlying Amazon Relational Database Service (Amazon RDS) cluster. A resource-starved metadata database may lead to dropped connections from your workers and tasks failing prematurely.
Many of these go slightly (but not very far) beyond your initial expectations: you can ask it to generate a list of terms for search engine optimization, or a reading list on topics that you’re interested in. It was not optimized to provide correct responses. It has helped to write a book.
A catalog or a database that lists models, including when they were tested, trained, and deployed. Metadata and artifacts needed for audits. In particular, auditing and testing machine learning systems will rely on many of the tools I’ve described above. There are real, not just theoretical, risks and considerations.
Over the last year, Amazon Redshift added several performance optimizations for data lake queries across multiple areas of the query engine, such as rewrite, planning, scan execution, and consumption of AWS Glue Data Catalog column statistics. Performance was tested on a Redshift Serverless data warehouse with 128 RPUs.
We dive into the various optimization techniques AppsFlyer employed, such as partition projection, sorting, parallel query runs, and query result reuse. Additionally, we discuss the thorough testing, monitoring, and rollout process that resulted in a successful transition to the new Athena architecture.
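One of those techniques, query result reuse, can be requested per query; a hedged boto3 sketch, where the workgroup, output location, and query text are placeholders:

```python
import boto3

athena = boto3.client("athena")

# Ask Athena to reuse a previous result for an identical query if it is
# younger than 60 minutes. Workgroup and S3 path are placeholders.
resp = athena.start_query_execution(
    QueryString="SELECT region, COUNT(*) FROM events GROUP BY region",
    WorkGroup="primary",
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    ResultReuseConfiguration={
        "ResultReuseByAgeConfiguration": {"Enabled": True, "MaxAgeInMinutes": 60}
    },
)
print(resp["QueryExecutionId"])
```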
Each file arrives as a pair with a tail metadata file in CSV format containing the size and name of the file. This metadata file is later used to read source file names during processing into the staging layer. These files follow the same naming pattern, with a daily system-generated timestamp appended to each file name.
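A minimal sketch of consuming such a tail file; the file name and the two-column layout (file name, size) are assumptions made for illustration:

```python
import csv

# Hypothetical tail metadata file listing the paired data file's
# name and size; the column layout is assumed for illustration.
with open("orders_20240101.csv.meta", newline="") as f:
    for row in csv.DictReader(f, fieldnames=["file_name", "size_bytes"]):
        print(f"stage {row['file_name']} ({row['size_bytes']} bytes)")
```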
With its scalability, reliability, and ease of use, Amazon OpenSearch Service helps businesses optimize data-driven decisions and improve operational efficiency. Jenkins retrieves JSON files from the GitHub repository and performs validation.
When you use Trino on Amazon EMR or Athena, you get the latest open source community innovations along with proprietary, AWS-developed optimizations. Starting from Amazon EMR 6.8.0 and Athena engine version 2, AWS has been developing query plan and engine behavior optimizations that improve query performance on Trino.
Pushing column predicate filters down to Kudu allows for optimized execution by skipping reads of column values for filtered-out rows and reducing network IO between a client, like the distributed query engine Apache Impala, and Kudu. The following test was performed on a 6-node cluster with CDP Runtime 7.1.5.
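The same pushdown is visible from the Kudu Python client, where predicates attached to a scanner are evaluated server-side; a sketch assuming the kudu-python package, a master at kudu-master:7051, and a table named metrics (all hypothetical):

```python
import kudu

# Connect to a (hypothetical) Kudu master and open a table.
client = kudu.connect(host="kudu-master", port=7051)
table = client.table("metrics")

# Predicates added to the scanner are pushed down to the tablet servers,
# so filtered-out rows are never read or sent over the network.
scanner = table.scanner()
scanner.add_predicate(table["host"] == "web-01")
scanner.open()
for row in scanner.read_all_tuples():
    print(row)
```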
Sometimes, we escape the clutches of this suboptimal existence and do pick good metrics or engage in simple A/B testing. You're choosing only one metric because you want to optimize it. Testing out a new feature. Identify, hypothesize, test, react. But it is not routine. So, how do we fix this problem?
However, these two processes are essentially distinct, and their testing needs differ in many ways. As enterprises extend their data pipelines, high-quality, automated testing for both transformations and conversions is critical to assuring data integrity, performance, and compliance across many platforms.
By optimizing the various CDP Data Services, including CDW, CDE, and Cloudera Machine Learning (CML), with Iceberg, Cloudera customers can define and manipulate datasets with SQL commands, build complex data pipelines using features like Time Travel operations, and deploy machine learning models built from Iceberg tables.
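For instance, a Time Travel read can pin a query to an earlier snapshot; a minimal PySpark sketch under the assumption of an Iceberg catalog named my_catalog, a table db.events, and an arbitrary timestamp:

```python
from pyspark.sql import SparkSession

# Assumes an Iceberg-enabled Spark session with a catalog named "my_catalog".
spark = SparkSession.builder.appName("time-travel").getOrCreate()

# Read the table as it existed at a point in time; Iceberg resolves the
# millisecond timestamp to the matching snapshot.
df = (
    spark.read.format("iceberg")
    .option("as-of-timestamp", "1704067200000")  # 2024-01-01T00:00:00Z
    .load("my_catalog.db.events")
)
df.show()
```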
The Amazon EMR runtime for Apache Spark is a performance-optimized runtime that is 100% API compatible with open source Apache Spark. Amazon EMR on EC2, Amazon EMR Serverless, Amazon EMR on Amazon EKS, and Amazon EMR on AWS Outposts all use this optimized runtime, which, as of EMR 7.1, is up to 4.5 times faster than open source Apache Spark 3.5.1.
They have dev, test, and production clusters running critical workloads and want to upgrade their clusters to CDP Private Cloud Base. Let’s take a look at one customer’s upgrade journey. Customer environment: the customer has three environments, development, test/QA, and production.
Despite these capabilities, data lakes are not databases, and object storage does not provide support for ACID processing semantics, which you may require to effectively optimize and manage your data at scale across hundreds or thousands of users using a multitude of different technologies.
They understand data modeling, including conceptualization and database optimization, and demonstrate a commitment to continuing education. According to Dataversity , good data architects have a solid understanding of the cloud, databases, and the applications and programs used by those databases.
On deck this time ’round the Moon: program synthesis. In other words, using metadata about data science work to generate code. SQL optimization provides helpful analogies, given how SQL queries get translated into query graphs internally, and then the real smarts of a SQL engine work over that graph.
It involves: reviewing data in detail, comparing and contrasting the data with its own metadata, running statistical models, and producing data quality reports. Also known as data validation, integrity refers to the structural testing of data to ensure that the data complies with procedures. Your Chance: Want to test professional analytics software?
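In pandas, a first pass at those checks takes only a few lines; a sketch where the data frame and the expected-schema dict are invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, None], "amount": [10.5, 7.0, 3.2]})

# Review the data in detail and run basic statistics.
print(df.describe(include="all"))
print(df.isna().sum())

# Compare the data against its expected metadata (an invented schema).
expected = {"id": "float64", "amount": "float64"}
for col, dtype in expected.items():
    assert str(df[col].dtype) == dtype, f"{col}: unexpected type {df[col].dtype}"
```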
Our goal is to test whether GenAI can handle diverse domains effectively and determine if it’s a viable tool for domain-specific graph-building tasks. We also experimented with prompt optimization tools; however, these experiments did not yield promising results.