Data Integration, Metadata and Testing

Build a high-performance quant research platform with Apache Iceberg

AWS Big Data

JANUARY 9, 2025

Iceberg offers distinct advantages through its metadata layer over Parquet, such as improved data management, performance optimization, and integration with various query engines. Unlike direct Amazon S3 access, Iceberg supports these operations on petabyte-scale data lakes without requiring complex custom code.

Metadata

Metadata Snapshot Cost-Benefit Optimization

7 Benefits of Metadata Management

erwin

FEBRUARY 19, 2021

Metadata management is key to wringing all the value possible from data assets. However, most organizations don’t use all the data at their disposal to reach deeper conclusions about how to drive revenue, achieve regulatory compliance or accomplish other strategic objectives. What Is Metadata? Harvest data.

Metadata

Metadata Management Data Quality Cost-Benefit

Four Use Cases Proving the Benefits of Metadata-Driven Automation

erwin

FEBRUARY 7, 2019

Organization’s cannot hope to make the most out of a data-driven strategy, without at least some degree of metadata-driven automation. The volume and variety of data has snowballed, and so has its velocity. As such, traditional – and mostly manual – processes associated with data management and data governance have broken down.

Metadata

Metadata Insurance Data-driven Cost-Benefit

Webinars

Automation, Evolved: Your New Playbook For Smarter Knowledge Work

MORE WEBINARS

Data’s dark secret: Why poor quality cripples AI and growth

CIO Business Intelligence

APRIL 8, 2025

We also examine how centralized, hybrid and decentralized data architectures support scalable, trustworthy ecosystems. As data-centric AI, automated metadata management and privacy-aware data sharing mature, the opportunity to embed data quality into the enterprises core has never been more significant.

Data Quality

Data Quality Data-driven Key Performance Indicator Metadata

Deep automation in machine learning

O'Reilly on Data

DECEMBER 19, 2018

have a large body of tools to choose from: IDEs, CI/CD tools, automated testing tools, and so on. are only starting to exist; one big task over the next two years is developing the IDEs for machine learning, plus other tools for data management, pipeline management, data cleaning, data provenance, and data lineage.

Machine Learning

Machine Learning Software Metadata Testing

5 Ways Data Modeling Is Critical to Data Governance

erwin

JANUARY 9, 2020

That’s because it’s the best way to visualize metadata , and metadata is now the heart of enterprise data management and data governance/ intelligence efforts. So here’s why data modeling is so critical to data governance. erwin Data Modeler: Where the Magic Happens.

Data Governance

Data Governance Modeling Metadata Unstructured Data

Doing Cloud Migration and Data Governance Right the First Time

erwin

OCTOBER 8, 2020

These tools range from enterprise service bus (ESB) products, data integration tools; extract, transform and load (ETL) tools, procedural code, application program interfaces (APIs), file transfer protocol (FTP) processes, and even business intelligence (BI) reports that further aggregate and transform data.

Data Governance

Data Governance Metadata Testing Data Lake

Becoming a machine learning company means investing in foundational technologies

O'Reilly on Data

MAY 21, 2019

Not surprisingly, data integration and ETL were among the top responses, with 60% currently building or evaluating solutions in this area. In an age of data-hungry algorithms, everything really begins with collecting and aggregating data. Metadata and artifacts needed for audits. and managed services in the cloud.

Machine Learning

Machine Learning Technology Deep Learning Data Science

Bridging the Gap: How ‘Data in Place’ and ‘Data in Use’ Define Complete Data Observability

DataKitchen

SEPTEMBER 21, 2023

In the context of Data in Place, validating data quality automatically with Business Domain Tests is imperative for ensuring the trustworthiness of your data assets. Running these automated tests as part of your DataOps and Data Observability strategy allows for early detection of discrepancies or errors.

Testing

Testing Data Quality Predictive Modeling Metrics

2024 Gartner Market Guide To DataOps

DataKitchen

AUGUST 16, 2024

Data Pipeline Observability: Optimizes pipelines by monitoring data quality, detecting issues, tracing data lineage, and identifying anomalies using live and historical metadata. This capability includes monitoring, logging, and business-rule detection.

Marketing

Marketing Data Quality Testing Metadata

Why data observability is essential to AI governance

erwin

DECEMBER 9, 2024

And if it isnt changing, its likely not being used within our organizations, so why would we use stagnant data to facilitate our use of AI? The key is understanding not IF, but HOW, our data fluctuates, and data observability can help us do just that.

Metadata

Metadata Data Quality Sales Modeling

The Need For Personalized Data Journeys for Your Data Consumers

DataKitchen

OCTOBER 20, 2023

Example 2: The Data Engineering Team Has Many Small, Valuable Files Where They Need Individual Source File Tracking In a typical data processing workflow, tracking individual files as they progress through various stages—from file delivery to data ingestion—is crucial.

Insurance

Insurance Metadata Data-driven Data Quality

Available Now! Automated Testing for Data Transformations

Wayne Yaddow

FEBRUARY 18, 2025

Introduction Data transformations and data conversions are crucial to ensure that raw data is organized, processed, and ready for useful analysis. However, these two processes are essentially distinct, and their testing needs differ in manyways.

Testing

Testing Data Transformation Data-driven Data Quality

Introducing a new unified data connection experience with Amazon SageMaker Lakehouse unified data connectivity

AWS Big Data

DECEMBER 16, 2024

For each service, you need to learn the supported authorization and authentication methods, data access APIs, and framework to onboard and test data sources. This approach simplifies your data journey and helps you meet your security requirements.

Visualization

Visualization Data Processing Testing Publishing

Recap of Amazon Redshift key product announcements in 2024

AWS Big Data

DECEMBER 17, 2024

We have enhanced data sharing performance with improved metadata handling, resulting in data sharing first query execution that is up to four times faster when the data sharing producers data is being updated.

Data Lake

Data Lake Data Warehouse Data-driven Optimization

How Cloudinary transformed their petabyte scale streaming data lake with Apache Iceberg and AWS Analytics

AWS Big Data

JUNE 10, 2024

Many of the tests to check performance and volumes of data scanned have used Athena because it provides a simple to use, fully serverless, cost effective, interface without the need to setup infrastructure. When evolving such a partition definition, the data in the table prior to the change is unaffected, as is its metadata.

Data Lake

Data Lake Metadata Snapshot Analytics

From Raw Inputs to Polished Outputs: The Art of Testing Data Transformations

Wayne Yaddow

MARCH 5, 2025

In this post, well see the fundamental procedures, tools, and techniques that data engineers, data scientists, and QA/testing teams use to ensure high-quality data as soon as its deployed. First, we look at how unit and integration tests uncover transformation errors at an early stage. PyTest, JUnit,NUnit).

Testing

Testing Data Transformation Statistics Metadata

What is data governance? Best practices for managing data assets

CIO Business Intelligence

MARCH 24, 2023

The program must introduce and support standardization of enterprise data. Programs must support proactive and reactive change management activities for reference data values and the structure/use of master data and metadata.

Data Governance

Data Governance Management Metadata Data Quality

Migrate an existing data lake to a transactional data lake using Apache Iceberg

AWS Big Data

OCTOBER 3, 2023

In-place data upgrade In an in-place data migration strategy, existing datasets are upgraded to Apache Iceberg format without first reprocessing or restating existing data. In this method, the metadata are recreated in an isolated environment and colocated with the existing data files. This can save time.

Data Lake

Data Lake Metadata Snapshot Recreation/Entertainment

What is a data fabric architecture?

IBM Big Data Hub

MARCH 25, 2022

A data fabric is an architectural approach that enables organizations to simplify data access and data governance across a hybrid multicloud landscape for better 360-degree views of the customer and enhanced MLOps and trustworthy AI. The post What is a data fabric architecture? appeared first on Journey to AI Blog.

Metadata

Metadata Data Quality Data Governance Data Integration

Extracting key insights from Amazon S3 access logs with AWS Glue for Ray

AWS Big Data

SEPTEMBER 7, 2023

We will partition and format the server access logs with Amazon Web Services (AWS) Glue , a serverless data integration service, to generate a catalog for access logs and create dashboards for insights. Both the user data and logs buckets must be in the same AWS Region and owned by the same account.

Metadata

Metadata Dashboards Metrics Visualization

Top analytics announcements of AWS re:Invent 2024

AWS Big Data

FEBRUARY 26, 2025

S3 Tables integration with the AWS Glue Data Catalog is in preview, allowing you to stream, query, and visualize dataincluding Amazon S3 Metadata tablesusing AWS analytics services such as Amazon Data Firehose , Amazon Athena , Amazon Redshift, Amazon EMR, and Amazon QuickSight. With AWS Glue 5.0, With AWS Glue 5.0,

Analytics

Analytics Data Lake Metadata Data Warehouse

There’s More to erwin Data Governance Automation Than Meets the AI

erwin

NOVEMBER 6, 2020

To better explain our vision for automating data governance, let’s look at some of the different aspects of how the erwin Data Intelligence Suite (erwin DI) incorporates automation. Data Cataloging: Catalog and sync metadata with data management and governance artifacts according to business requirements in real time.

Data Governance

Data Governance Metadata Data-driven Visualization

Five Benefits of an Automation Framework for Data Governance

erwin

JANUARY 24, 2019

In most companies, an incredible amount of data flows from multiple sources in a variety of formats and is constantly being moved and federated across a changing system landscape. With an automation framework, data professionals can meet these needs at a fraction of the cost of the traditional manual way. Governing metadata.

Data Governance

Data Governance Metadata Data-driven Cost-Benefit

Modernize your ETL platform with AWS Glue Studio: A case study from BMS

AWS Big Data

DECEMBER 13, 2023

In addition to using native managed AWS services that BMS didn’t need to worry about upgrading, BMS was looking to offer an ETL service to non-technical business users that could visually compose data transformation workflows and seamlessly run them on the AWS Glue Apache Spark-based serverless data integration engine.

Metadata

Metadata Data Lake Visualization Data Quality

How Cloudera Data Flow Enables Successful Data Mesh Architectures

Cloudera

OCTOBER 7, 2021

In this blog, I will demonstrate the value of Cloudera DataFlow (CDF) , the edge-to-cloud streaming data platform available on the Cloudera Data Platform (CDP) , as a Data integration and Democratization fabric. Data and Metadata: Data inputs and data outputs produced based on the application logic.

Metadata

Metadata Cost-Benefit Enterprise Interactive

What is a business intelligence analyst? A key role for data-driven decisions

CIO Business Intelligence

OCTOBER 26, 2023

To earn your CBIP certification, you’ll need two or more years of full-time experience in CIS, data modeling, data planning, data definitions, metadata systems development, enterprise resource planning, systems analysis, application development, and programming or IT management.

Business Intelligence

Business Intelligence Data-driven Statistics Data Warehouse

Design a data mesh pattern for Amazon EMR-based data lakes using AWS Lake Formation with Hive metastore federation

AWS Big Data

JUNE 10, 2024

Organizations have multiple Hive data warehouses across EMR clusters, where the metadata gets generated. To address this challenge, organizations can deploy a data mesh using AWS Lake Formation that connects the multiple EMR clusters. Test access using Athena queries in the consumer account.

Data Lake

Data Lake Metadata Data Warehouse Data Processing

10 master data management certifications that will pay off

CIO Business Intelligence

FEBRUARY 2, 2024

The Art of Service recommends candidates spend a minimum of 18 hours on the course to pass the certification test. It consists of three separate, 90-minute exams: the Information Systems (IS) Core exam, the Data Management Core exam, and the Specialty exam. How to prepare: The fee includes an online training program and PDF textbook.

Management

Management Data Governance Cost-Benefit Testing

At Center Stage IV: Ontotext Webinars About How GraphDB Levels the Field Between RDF and Property Graphs

Ontotext

NOVEMBER 4, 2021

As we’ve said again and again, we believe that knowledge graphs are the next generation tool for helping businesses make critical decisions, based on harmonized knowledge models and data derived from siloed source systems. But these tasks are only part of the story. standard to fully support all significant graph path search use cases.

Metadata

Metadata Visualization Modeling Enterprise

GraphDB: MongoDB Document Store Integration for Large-scale Metadata Management

Ontotext

JUNE 27, 2019

Ontotext’s GraphDB is an enterprise-ready semantic graph database (also called RDF triplestore as it stores data in RDF triples). It provides the core infrastructure for solutions where modeling agility, data integration, relationship exploration, cross-enterprise data publishing and consumption are critical.

Metadata

Metadata Management Enterprise Publishing

From Data Silos to Data Fabric with Knowledge Graphs

Ontotext

SEPTEMBER 15, 2020

Added to this is the increasing demands being made on our data from event-driven and real-time requirements, the rise of business-led use and understanding of data, and the move toward automation of data integration, data and service-level management. Knowledge Graphs are the Warp and Weft of a Data Fabric.

Metadata

Metadata Knowledge Discovery Data Quality Strategy

Access Amazon Redshift data from Salesforce Data Cloud with Zero Copy Data Federation

AWS Big Data

JUNE 25, 2024

It provides secure, real-time access to Redshift data without copying, keeping enterprise data in place. This eliminates replication overhead and ensures access to current information, enhancing data integration while maintaining data integrity and efficiency.

Data Lake

Data Lake Cost-Benefit Data-driven Data Warehouse

Top 15 data management platforms

CIO Business Intelligence

JUNE 9, 2022

Its Integrated Process Designer is a visual tool to create data flows that integrate data to produce concise reports. Along the way, metadata is collected, organized, and maintained to help debug and ensure data integrity. Agencies and ad buyers for large clients turn to Simpli.fi Survey CTO.

Management

Management Advertising Data Lake Sales

Improving Multi-tenancy with Virtual Private Clusters

Cloudera

JUNE 6, 2019

While this approach provides isolation, it creates another significant challenge: duplication of data, metadata, and security policies, or ‘split-brain’ data lake. Now the admins need to synchronize multiple copies of the data and metadata and ensure that users across the many clusters are not viewing stale information.

Metadata

Metadata Data Lake Optimization Strategy

How HPE Aruba Supply Chain optimized cost and performance by migrating to an AWS modern data architecture

AWS Big Data

SEPTEMBER 11, 2024

AWS Transfer Family seamlessly integrates with other AWS services, automates transfer, and makes sure data is protected with encryption and access controls. Each file arrives as a pair with a tail metadata file in CSV format containing the size and name of the file. 2 GB into the landing zone daily.

Data Architecture

Data Architecture Optimization Data Warehouse Metadata

Simplify and Improve Analytics with Self-Serve Data Prep!

Smarten

JANUARY 30, 2024

Business users cannot even hope to prepare data for analytics – at least not without the right tools. Gartner predicts that, ‘data preparation will be utilized in more than 70% of new data integration projects for analytics and data science.’ So, why is there so much attention paid to the task of data preparation?

Analytics

Analytics Visualization Data Quality Metadata

Introducing Amazon MWAA support for the Airflow REST API and web server auto scaling

AWS Big Data

MAY 16, 2024

With the new REST API, you can now invoke DAG runs, manage datasets, or get the status of Airflow’s metadata database, trigger, and scheduler—all without relying on the Airflow web UI or CLI. Trigger auto scaling programmatically After you configure auto scaling, you might want to test how it behaves under simulated conditions.

Testing

Testing Metrics Interactive Management

Amazon Redshift announcements at AWS re:Invent 2023 to enable analytics on all your data

AWS Big Data

NOVEMBER 29, 2023

Amazon Redshift Serverless, generally available since 2021, allows you to run and scale analytics without having to provision and manage the data warehouse. Amazon Redshift ML large language model (LLM) integration Amazon Redshift ML enables customers to create, train, and deploy machine learning models using familiar SQL commands.

Data Warehouse

Data Warehouse Analytics Data Lake Machine Learning

Ensuring Data Transformation Quality with dbt Core

Wayne Yaddow

MARCH 14, 2025

How dbt Core aids data teams test, validate, and monitor complex data transformations and conversions Photo by NASA on Unsplash Introduction dbt Core, an open-source framework for developing, testing, and documenting SQL-based data transformations, has become a must-have tool for modern data teams as the complexity of data pipelines grows.

Data Transformation

Data Transformation Testing Unstructured Data Data Quality

Accenture’s Smart Data Transition Toolkit Now Available for Cloudera Data Platform

Cloudera

AUGUST 31, 2021

Running on CDW is fully integrated with streaming, data engineering, and machine learning analytics. It has a consistent framework that secures and provides governance for all data and metadata on private clouds, multiple public clouds, or hybrid clouds. Smart DwH Mover helps in accelerating data warehouse migration.

Data Warehouse

Data Warehouse Cost-Benefit Metadata Data-driven

What Is a Data Fabric and How Does a Data Catalog Support It?

Alation

JANUARY 25, 2022

As a reminder, here’s Gartner’s definition of data fabric: “A design concept that serves as an integrated layer (fabric) of data and connecting processes. In this blog, we will focus on the “integrated layer” part of this definition by examining each of the key layers of a comprehensive data fabric in more detail.

Metadata

Metadata IT Data-driven Metrics

Build and manage your modern data stack using dbt and AWS Glue through dbt-glue, the new “trusted” dbt adapter

AWS Big Data

NOVEMBER 29, 2023

We have seen a strong customer demand to expand its scope to cloud-based data lakes because data lakes are increasingly the enterprise solution for large-scale data initiatives due to their power and capabilities. 05:34:22 Connection test: [OK connection ok] 05:34:22 All checks passed!

Data Lake

Data Lake Management Metrics Data Warehouse

Benchmark Results Position GraphDB As the Most Versatile Graph Database Engine

Ontotext

FEBRUARY 23, 2023

The engines must facilitate the advanced data integration and metadata data management scenarios where an EKG is used for data fabrics or otherwise serves as a data hub between diverse data and content management systems. GraphDB was audited to perform 12 operations/second on an AWS r6id.8xlarge

Publishing

Publishing Metadata Optimization Testing

Build a high-performance quant research platform with Apache Iceberg

7 Benefits of Metadata Management

Webinars

Trending Sources

Four Use Cases Proving the Benefits of Metadata-Driven Automation

Webinars

Data’s dark secret: Why poor quality cripples AI and growth

Deep automation in machine learning

5 Ways Data Modeling Is Critical to Data Governance

Doing Cloud Migration and Data Governance Right the First Time

Becoming a machine learning company means investing in foundational technologies

Bridging the Gap: How ‘Data in Place’ and ‘Data in Use’ Define Complete Data Observability

2024 Gartner Market Guide To DataOps

Why data observability is essential to AI governance

The Need For Personalized Data Journeys for Your Data Consumers

Available Now! Automated Testing for Data Transformations

Introducing a new unified data connection experience with Amazon SageMaker Lakehouse unified data connectivity

Recap of Amazon Redshift key product announcements in 2024

How Cloudinary transformed their petabyte scale streaming data lake with Apache Iceberg and AWS Analytics

From Raw Inputs to Polished Outputs: The Art of Testing Data Transformations

What is data governance? Best practices for managing data assets

Migrate an existing data lake to a transactional data lake using Apache Iceberg

What is a data fabric architecture?

Extracting key insights from Amazon S3 access logs with AWS Glue for Ray

Top analytics announcements of AWS re:Invent 2024

There’s More to erwin Data Governance Automation Than Meets the AI

Five Benefits of an Automation Framework for Data Governance

Modernize your ETL platform with AWS Glue Studio: A case study from BMS

How Cloudera Data Flow Enables Successful Data Mesh Architectures

What is a business intelligence analyst? A key role for data-driven decisions

Design a data mesh pattern for Amazon EMR-based data lakes using AWS Lake Formation with Hive metastore federation

10 master data management certifications that will pay off

At Center Stage IV: Ontotext Webinars About How GraphDB Levels the Field Between RDF and Property Graphs

GraphDB: MongoDB Document Store Integration for Large-scale Metadata Management

From Data Silos to Data Fabric with Knowledge Graphs

Access Amazon Redshift data from Salesforce Data Cloud with Zero Copy Data Federation

Top 15 data management platforms

Improving Multi-tenancy with Virtual Private Clusters

How HPE Aruba Supply Chain optimized cost and performance by migrating to an AWS modern data architecture

Simplify and Improve Analytics with Self-Serve Data Prep!

Introducing Amazon MWAA support for the Airflow REST API and web server auto scaling

Amazon Redshift announcements at AWS re:Invent 2023 to enable analytics on all your data

Ensuring Data Transformation Quality with dbt Core

Accenture’s Smart Data Transition Toolkit Now Available for Cloudera Data Platform

What Is a Data Fabric and How Does a Data Catalog Support It?

Build and manage your modern data stack using dbt and AWS Glue through dbt-glue, the new “trusted” dbt adapter

Benchmark Results Position GraphDB As the Most Versatile Graph Database Engine

Stay Connected