Data Architecture, Metadata and Testing

Run Apache XTable in AWS Lambda for background conversion of open table formats

AWS Big Data

NOVEMBER 26, 2024

This post was co-written with Dipankar Mazumdar, Staff Data Engineering Advocate with AWS Partner OneHouse. Data architecture has evolved significantly to handle growing data volumes and diverse workloads. This allows the existing data to be interpreted as if it were originally written in any of these formats.

Metadata

Metadata Data Lake Snapshot Data Warehouse

Data’s dark secret: Why poor quality cripples AI and growth

CIO Business Intelligence

APRIL 8, 2025

We also examine how centralized, hybrid and decentralized data architectures support scalable, trustworthy ecosystems. As data-centric AI, automated metadata management and privacy-aware data sharing mature, the opportunity to embed data quality into the enterprise’s core has never been more significant.

Data Quality

Data Quality Data-driven Key Performance Indicator Metadata

What is a Data Mesh?

DataKitchen

AUGUST 3, 2021

The data mesh design pattern breaks giant, monolithic enterprise data architectures into subsystems or domains, each managed by a dedicated team. The communication between business units and data professionals is usually incomplete and inconsistent. Introduction to Data Mesh. Source: Thoughtworks.

Data Architecture

Data Architecture Data Lake Cost-Benefit Data Warehouse

How to Manage Risk with Modern Data Architectures

Cloudera

JUNE 29, 2023

To ensure the stability of the US financial system, the implementation of advanced liquidity risk models and stress testing using ML/AI could potentially serve as a protective measure. To improve the way they model and manage risk, institutions must modernize their data management and data governance practices.

Data Architecture

Data Architecture Risk Management Risk Management

5 Ways Data Modeling Is Critical to Data Governance

erwin

JANUARY 9, 2020

That’s because it’s the best way to visualize metadata, and metadata is now the heart of enterprise data management and data governance/intelligence efforts. So here’s why data modeling is so critical to data governance. erwin Data Modeler: Where the Magic Happens.

Data Governance

Data Governance Modeling Metadata Unstructured Data

How HPE Aruba Supply Chain optimized cost and performance by migrating to an AWS modern data architecture

AWS Big Data

SEPTEMBER 11, 2024

This post describes how HPE Aruba automated their Supply Chain management pipeline, and re-architected and deployed their data solution by adopting a modern data architecture on AWS. Each file arrives as a pair with a tail metadata file in CSV format containing the size and name of the file.

Data Architecture

Data Architecture Optimization Data Warehouse Metadata

A Day in the Life of a DataOps Engineer

DataKitchen

OCTOBER 11, 2021

First, you must understand the existing challenges of the data team, including the data architecture and end-to-end toolchain. The data engineer then emails the BI Team, who refreshes a Tableau dashboard. Figure 1: Example data pipeline with manual processes. Adding Tests to Reduce Stress.

Testing

Testing Metadata Dashboards Statistics

What is a data architect? Skills, salaries, and how to become a data framework master

CIO Business Intelligence

OCTOBER 13, 2023

Data architecture is a complex and varied field, and different organizations and industries have unique needs when it comes to their data architects. Solutions data architect: These individuals design and implement data solutions for specific business needs, including data warehouses, data marts, and data lakes.

Data Architecture

Data Architecture Data Warehouse Statistics Visualization

How Cloudinary transformed their petabyte scale streaming data lake with Apache Iceberg and AWS Analytics

AWS Big Data

JUNE 10, 2024

Many of the tests to check performance and volumes of data scanned have used Athena because it provides a simple-to-use, fully serverless, cost-effective interface without the need to set up infrastructure. When evolving such a partition definition, the data in the table prior to the change is unaffected, as is its metadata.

Data Lake

Data Lake Metadata Snapshot Analytics

Introducing Apache Iceberg in Cloudera Data Platform

Cloudera

FEBRUARY 22, 2022

Over the past decade, the successful deployment of large scale data platforms at our customers has acted as a big data flywheel driving demand to bring in even more data, apply more sophisticated analytics, and on-board many new data practitioners from business analysts to data scientists.

Snapshot

Snapshot Metadata Cost-Benefit Data Architecture

Migrate an existing data lake to a transactional data lake using Apache Iceberg

AWS Big Data

OCTOBER 3, 2023

Over the years, data lakes on Amazon Simple Storage Service (Amazon S3) have become the default repository for enterprise data and are a common choice for a large set of users who query data for a variety of analytics and machine learning use cases. Analytics use cases on data lakes are always evolving.

Data Lake

Data Lake Metadata Snapshot Recreation/Entertainment

Upgrade Journey: The Path from CDH to CDP Private Cloud

Cloudera

SEPTEMBER 28, 2020

Cloudera has found that customers have spent many years investing in their big data assets and want to continue to build on that investment by moving towards a more modern architecture that helps leverage the multiple form factors. Customer Environment: The customer has three environments: development, test, and production.

Testing

Testing Metadata Risk Data Science

What is data governance? Best practices for managing data assets

CIO Business Intelligence

MARCH 24, 2023

The program must introduce and support standardization of enterprise data. Programs must support proactive and reactive change management activities for reference data values and the structure/use of master data and metadata.

Data Governance

Data Governance Management Metadata Data Quality

Top analytics announcements of AWS re:Invent 2024

AWS Big Data

FEBRUARY 26, 2025

Amazon SageMaker Lakehouse provides an open data architecture that reduces data silos and unifies data across Amazon Simple Storage Service (Amazon S3) data lakes, Redshift data warehouses, and third-party and federated data sources. connection testing, metadata retrieval, and data preview.

Analytics

Analytics Data Lake Metadata Data Warehouse

How Volkswagen streamlined access to data across multiple data lakes using Amazon DataZone – Part 1

AWS Big Data

JULY 18, 2024

Data domain producers publish data assets using datasource run to Amazon DataZone in the Central Governance account. This populates the technical metadata in the business data catalog for each data asset. Data ownership remains with the producer. Download the Online Retail.csv file from Kaggle dataset.

Data Lake

Data Lake Publishing Metadata Data-driven

How Cloudera Data Flow Enables Successful Data Mesh Architectures

Cloudera

OCTOBER 7, 2021

Those decentralization efforts appeared under different monikers through time, e.g., data marts versus data warehousing implementations (a popular architectural debate in the era of structured data) then enterprise-wide data lakes versus smaller, typically BU-Specific, “data ponds”.

Metadata

Metadata Cost-Benefit Enterprise Interactive

Extracting key insights from Amazon S3 access logs with AWS Glue for Ray

AWS Big Data

SEPTEMBER 7, 2023

AWS Glue Data Catalog stores information as metadata tables, where each table specifies a single data store. The AWS Glue crawler writes metadata to the Data Catalog by classifying the data to determine the format, schema, and associated properties of the data.

Metadata

Metadata Dashboards Metrics Visualization

Design a data mesh pattern for Amazon EMR-based data lakes using AWS Lake Formation with Hive metastore federation

AWS Big Data

JUNE 10, 2024

Organizations have multiple Hive data warehouses across EMR clusters, where the metadata gets generated. To address this challenge, organizations can deploy a data mesh using AWS Lake Formation that connects the multiple EMR clusters. Test access using Athena queries in the consumer account.

Data Lake

Data Lake Metadata Data Warehouse Data Processing

Supercharge Your Data Lakehouse with Apache Iceberg in Cloudera Data Platform

Cloudera

JUNE 30, 2022

With Cloudera’s vision of hybrid data, enterprises adopting an open data lakehouse can easily get application interoperability and portability to and from on-premises environments and any public cloud without worrying about data scaling. Why integrate Apache Iceberg with Cloudera Data Platform?

Data Lake

Data Lake Data Warehouse Data Architecture Metadata

Build streaming data pipelines with Amazon MSK Serverless and IAM authentication

AWS Big Data

SEPTEMBER 6, 2023

When the Lambda function is triggered, the data sent to the function includes an array of records from the Kafka topic—no need for direct contact with Amazon MSK. For testing, this post includes a sample AWS Cloud Development Kit (AWS CDK) application. Prerequisites The example has the following prerequisites: An AWS account.

Testing

Testing Metadata Cost-Benefit Internet of Things

Simplify operational data processing in data lakes using AWS Glue and Apache Hudi

AWS Big Data

SEPTEMBER 13, 2023

The Analytics specialty practice of AWS Professional Services (AWS ProServe) helps customers across the globe with modern data architecture implementations on the AWS Cloud. We begin with a Data lake reference architecture followed by an overview of operational data processing framework.

Data Lake

Data Lake Data Processing Metadata Snapshot

Processing large records with Amazon Kinesis Data Streams

AWS Big Data

OCTOBER 16, 2023

To meet this need, AWS offers Amazon Kinesis Data Streams, a powerful and scalable real-time data streaming service. With Kinesis Data Streams, you can effortlessly collect, process, and analyze streaming data in real time at any scale. Therefore, these functions need thorough testing to prevent any loss of data.

Cost-Benefit

Cost-Benefit Testing Optimization Strategy

Accelerate Amazon Redshift Data Lake queries with AWS Glue Data Catalog Column Statistics

AWS Big Data

OCTOBER 1, 2024

Performance was tested on a Redshift Serverless data warehouse with 128 RPU. In our testing, the dataset was stored in Amazon S3 in Parquet format and AWS Glue Data Catalog was used to manage external databases and tables. This can have a significant impact on overall query performance.

Data Lake

Data Lake Statistics Broadcasting Optimization

BMW Cloud Efficiency Analytics powered by Amazon QuickSight and Amazon Athena

AWS Big Data

NOVEMBER 15, 2023

It seamlessly consolidates data from various data sources within AWS, including AWS Cost Explorer (and forecasting with Cost Explorer), AWS Trusted Advisor, and AWS Compute Optimizer. Data providers and consumers are the two fundamental users of a CDH dataset.

Dashboards

Dashboards Analytics Metadata Data Warehouse

How Swisscom automated Amazon Redshift as part of their One Data Platform solution using AWS CDK – Part 1

AWS Big Data

JUNE 12, 2024

One Data Platform The ODP architecture is based on the AWS Well-Architected Framework Analytics Lens and follows the pattern of having raw, standardized, conformed, and enriched layers as described in Modern data architecture.

Data Architecture

Data Architecture Cost-Benefit Data-driven Experimentation

Choosing an open table format for your transactional data lake on AWS

AWS Big Data

JUNE 9, 2023

A modern data architecture enables companies to ingest virtually any type of data through automated pipelines into a data lake, which provides highly durable and cost-effective object storage at petabyte or exabyte scale. This post is not intended to provide detailed technical guidance (e.g.

Data Lake

Data Lake Metadata Statistics Optimization

HEMA accelerates their data governance journey with Amazon DataZone

AWS Big Data

DECEMBER 19, 2024

The business end-users were given a tool to discover data assets produced within the mesh and seamlessly self-serve on their data sharing needs. This separation means changes can be tested thoroughly before being deployed to live operations. Amazon DataZone is the central piece in this architecture.

Data Governance

Data Governance Publishing Data-driven Metadata

How smava makes loans transparent and affordable using Amazon Redshift Serverless

AWS Big Data

DECEMBER 21, 2023

Overview of solution As a data-driven company, smava relies on the AWS Cloud to power their analytics use cases. smava ingests data from various external and internal data sources into a landing stage on the data lake based on Amazon Simple Storage Service (Amazon S3).

Data Lake

Data Lake Data Warehouse Data-driven B2B

Amazon DataZone now integrates with AWS Glue Data Quality and external data quality solutions

AWS Big Data

APRIL 3, 2024

If the asset has AWS Glue Data Quality enabled, you can now quickly visualize the data quality score directly in the catalog search pane. By selecting the corresponding asset, you can understand its content through the readme, glossary terms, and technical and business metadata.

Data Quality

Data Quality Visualization Metadata Metrics

Build SAML identity federation for Amazon OpenSearch Service domains within a VPC

AWS Big Data

FEBRUARY 7, 2024

Download the IAM Identity Center SAML metadata file to use in a later step. This is for testing only; do not use this for production environments. Choose Import from XML file and import the IAM Identity Center SAML metadata file that you downloaded in an earlier step. Take note of the group ID. Create a new custom SAML 2.0

Dashboards

Dashboards Data Processing Metadata Consulting

The Cloud Connection: How Governance Supports Security

Alation

APRIL 14, 2022

In today’s AI/ML-driven world of data analytics, explainability needs a repository just as much as those doing the explaining need access to metadata, e.g., information about the data being used. The Cloud Data Migration Challenge. A useful feature for exposing patterns in the data. Visual Profiling. Scheduling.

Metadata

Metadata Data Governance Data-driven Modeling

Power enterprise-grade Data Vaults with Amazon Redshift – Part 1

AWS Big Data

NOVEMBER 16, 2023

Building a starter version of anything can often be straightforward, but building something with enterprise-grade scale, security, resiliency, and performance typically requires knowledge of and adherence to battle-tested best practices, and using the right tools and features in the right scenario.

Enterprise

Enterprise Data Warehouse Data Lake Optimization

Amazon Redshift announcements at AWS re:Invent 2023 to enable analytics on all your data

AWS Big Data

NOVEMBER 29, 2023

Amazon Redshift Serverless, generally available since 2021, allows you to run and scale analytics without having to provision and manage the data warehouse. Amazon Redshift ML large language model (LLM) integration Amazon Redshift ML enables customers to create, train, and deploy machine learning models using familiar SQL commands.

Data Warehouse

Data Warehouse Analytics Data Lake Machine Learning

Educating ChatGPT on Data Lakehouse

Cloudera

MARCH 17, 2023

As the use of ChatGPT becomes more prevalent, I frequently encounter customers and data users citing ChatGPT’s responses in their discussions. I love the enthusiasm surrounding ChatGPT and the eagerness to learn about modern data architectures such as data lakehouses, data meshes, and data fabrics.

Unstructured Data

Unstructured Data Data Lake Data Warehouse Machine Learning

What Is a Data Fabric and How Does a Data Catalog Support It?

Alation

JANUARY 25, 2022

As a reminder, here’s Gartner’s definition of data fabric: “A design concept that serves as an integrated layer (fabric) of data and connecting processes. In this blog, we will focus on the “integrated layer” part of this definition by examining each of the key layers of a comprehensive data fabric in more detail.

Metadata

Metadata IT Data-driven Metrics

How Zurich Insurance Group built a log management solution on AWS

AWS Big Data

JULY 16, 2024

Priority 2 logs, such as operating system security logs, firewall, identity provider (IdP), email metadata, and AWS CloudTrail, are ingested into Amazon OpenSearch Service to enable the following capabilities. Historic data analysis – Data stored in Amazon S3 can be queried to satisfy one-time audit or analysis tasks.

Insurance

Insurance Management Cost-Benefit Optimization

Perform data parity at scale for data modernization programs using AWS Glue Data Quality

AWS Big Data

OCTOBER 9, 2024

For Data quality transform output , specify your data to output: Original data – This output includes all rows and columns in original data. In addition, you can select Add new columns to indicate data quality errors. Arunabha Datta is a Senior Data Architect at AWS Professional Services.

Data Quality

Data Quality Data Lake Data Warehouse Metrics

Data platform trinity: Competitive or complementary?

IBM Big Data Hub

JANUARY 18, 2023

Furthermore, data warehouse storage cannot support workloads like Artificial Intelligence (AI) or Machine Learning (ML), which require huge amounts of data for model training. For these workloads, data lake vendors usually recommend extracting data into flat files to be used solely for model training and testing purposes.

Data Lake

Data Lake Data Warehouse Data-driven Metadata

The Cycle of Change

TDAN

MAY 31, 2022

Interactions between hardware and software are cautiously investigated, operating systems and network connections are carefully tested, […]. It is common to take great care in the selection and implementation of new technology.

Testing

Testing Interactive Strategy Software

Implement tag-based access control for your data lake and Amazon Redshift data sharing with AWS Lake Formation

AWS Big Data

JULY 21, 2023

This leads to having data across many instances of data warehouses and data lakes using a modern data architecture in separate AWS accounts. See Managing LF-Tags for metadata access control for more details. Many organizations have distributed tools and infrastructure across various business units.

Data Lake

Data Lake Data Warehouse Marketing Management

GraphDB Empowers Scientific Projects to Fight COVID-19 and Publish Knowledge Graphs

Ontotext

APRIL 15, 2020

As all this progresses, the scientific community races against time to respond to the pandemic by developing diagnostic tests, therapies, pre-clinical and clinical research and vaccines. Ontotext’s knowledge graph technology is at the core of Cochrane’s data architecture developed by our partners from Data Language.

Publishing

Publishing Metadata Data mining Data Architecture

What Is Embedded Analytics?

Jet Global

MAY 1, 2023

Data Environment First off, the solutions you consider should be compatible with your current data architecture. We have outlined the requirements that most providers ask for: Data Sources Strategic Objective Use native connectivity optimized for the data source. addresses). Build your first set of reports.

Analytics

Analytics Cost-Benefit Visualization Dashboards

Introducing the HubSpot connector for AWS Glue

AWS Big Data

DECEMBER 2, 2024

AWS Glue also supports the ability to apply complex data transformations, enabling efficient data integration and preparation to meet your needs. Schema and other metadata will be registered in the AWS Glue Data Catalog, a centralized metadata repository for all your data assets. Choose Next.

Data Lake

Data Lake Testing Data Integration Metadata

Secure Data Sharing and Interoperability Powered by Iceberg REST Catalog

Cloudera

DECEMBER 3, 2024

REST Catalog Value Proposition It provides open, metastore-agnostic APIs for Iceberg metadata operations, dramatically simplifying the Iceberg client and metastore/engine integration. It provides real-time metadata access by directly integrating with the Iceberg-compatible metastore. spark.sql("SELECT * FROM airlines_data.carriers").show()

Metadata

Metadata Data Warehouse ROI Snapshot
