The Airflow REST API facilitates a wide range of use cases, from centralizing and automating administrative tasks to building event-driven, data-aware data pipelines. Event-driven architectures – the enhanced API facilitates seamless integration with external systems, enabling Airflow DAGs to be triggered by outside events.
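As a minimal sketch of that pattern, the snippet below calls Airflow's stable REST API to start a DAG run when an external event arrives; the host, credentials, DAG ID, and payload are placeholder assumptions, not values from the post.

```python
import requests

AIRFLOW_URL = "http://localhost:8080/api/v1"  # placeholder host

def trigger_dag(dag_id: str, payload: dict) -> dict:
    """Trigger a DAG run via Airflow's stable REST API, passing the
    external event payload through as the run's conf."""
    resp = requests.post(
        f"{AIRFLOW_URL}/dags/{dag_id}/dagRuns",
        json={"conf": payload},
        auth=("admin", "admin"),  # placeholder credentials
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

# Example: trigger a run when an upstream system reports a new file.
trigger_dag("example_dag", {"s3_key": "incoming/orders.csv"})
```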
Iceberg offers distinct advantages through its metadata layer over Parquet, such as improved data management, performance optimization, and integration with various query engines. Iceberg's table format separates data files from metadata files, enabling efficient data modifications without full dataset rewrites.
Central to a transactional data lake are open table formats (OTFs) such as Apache Hudi, Apache Iceberg, and Delta Lake, which act as a metadata layer over columnar formats. XTable isn’t a new table format but provides abstractions and tools to translate the metadata associated with existing formats.
Know thy data: understand what it is (formats, types, sampling, who, what, when, where, why), encourage the use of data across the enterprise, and enrich your datasets with searchable (semantic and content-based) metadata (labels, annotations, tags). Do not covet thy data’s correlations: a random six-sigma event is closer to one in half a billion than one in a million, yet a large enough dataset will surface such coincidences anyway.
have a large body of tools to choose from: IDEs, CI/CD tools, automated testing tools, and so on. We have great tools for working with code: creating it, managing it, testing it, and deploying it. Metadata analysis makes it possible to build data catalogs, which in turn allow humans to discover data that’s relevant to their projects.
The proposed solution involves creating a custom subscription workflow that uses the event-driven architecture of Amazon DataZone. Amazon DataZone keeps you informed of key activities (events) within your data portal, such as subscription requests, updates, comments, and system events. Enter a name for the asset.
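A minimal sketch of wiring into those activity events, assuming Amazon DataZone publishes them to the default EventBridge bus under the source aws.datazone; the rule name and target Lambda ARN are placeholders.

```python
import boto3, json

events = boto3.client("events")

# Match Amazon DataZone activity events on the default event bus.
events.put_rule(
    Name="datazone-subscription-events",  # placeholder name
    EventPattern=json.dumps({"source": ["aws.datazone"]}),
    State="ENABLED",
)

# Route matching events to a Lambda that implements the custom workflow.
events.put_targets(
    Rule="datazone-subscription-events",
    Targets=[{
        "Id": "subscription-handler",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:handler",  # placeholder
    }],
)
```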
It offers a wealth of books, on-demand courses, live events, short-form posts, interactive labs, expert playlists, and more—formed from the proprietary content of thousands of independent authors, industry experts, and several of the largest education publishers in the world.
Upon successful authentication, the custom claims provider triggers the custom authentication extension's token issuance start event listener. The custom authentication extension calls an Azure function (your REST API endpoint) with information about the event, user profile, session data, and other context. Choose Test this application.
As data-centric AI, automated metadata management, and privacy-aware data sharing mature, the opportunity to embed data quality into the enterprise's core has never been more significant. Data fabric: a metadata-rich integration layer across distributed systems; its trade-offs are implementation complexity and reliance on robust metadata management.
We’re excited to announce a new feature in Amazon DataZone that offers enhanced metadata governance for your subscription approval process. With this update, domain owners can define and enforce metadata requirements for data consumers when they request access to data assets. Key benefits: the feature benefits multiple stakeholders.
If your organization uses Microsoft Azure Active Directory (Azure AD) for centralized authentication and uses its user attributes to organize users, you can enable federation across all QuickSight accounts and manage users and their group membership in QuickSight using events generated in the AWS platform.
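For illustration, a Lambda reacting to such an event might register the federated user in QuickSight with boto3; the account ID, role ARN, and user role below are placeholder assumptions.

```python
import boto3

quicksight = boto3.client("quicksight")

def register_federated_user(email: str, session_name: str) -> None:
    """Register an IAM-federated user in QuickSight."""
    quicksight.register_user(
        AwsAccountId="123456789012",  # placeholder account
        Namespace="default",
        IdentityType="IAM",
        IamArn="arn:aws:iam::123456789012:role/QuickSightFederationRole",  # placeholder
        SessionName=session_name,     # e.g. the Azure AD user principal name
        Email=email,
        UserRole="READER",            # placeholder role choice
    )
```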
There are no automated tests, so errors frequently pass through the pipeline. There is no process to spin up an isolated dev environment to quickly add a feature, test it with actual data, and deploy it to production. By contrast, a healthy pipeline has automated tests at each step, making sure that each step completes successfully.
Hydro is powered by Amazon MSK and other tools with which teams can move, transform, and publish data at low latency using event-driven architectures. To address this, we used the AWS performance testing framework for Apache Kafka to evaluate the theoretical performance limits.
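The AWS performance testing framework itself is out of scope here, but a crude throughput probe in the same spirit can be sketched with kafka-python; the broker address, topic, message count, and payload size are all placeholder assumptions.

```python
import time
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(bootstrap_servers="broker:9092")  # placeholder broker

N, payload = 100_000, b"x" * 1024  # 100k messages of 1 KiB each
start = time.monotonic()
for _ in range(N):
    producer.send("perf-test", payload)  # placeholder topic
producer.flush()  # wait until all buffered records are sent
elapsed = time.monotonic() - start
print(f"{N / elapsed:,.0f} msg/s, {N * len(payload) / elapsed / 2**20:.1f} MiB/s")
```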
Iceberg tables store metadata in manifest files. As the number of data files increases, the amount of metadata stored in these manifest files also increases, leading to longer query planning time. The query runtime also increases because it’s proportional to the number of data or metadata file read operations.
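You can observe this growth directly by querying Iceberg's metadata tables from Spark; a sketch, assuming an active SparkSession and a hypothetical db.events Iceberg table registered in the session catalog.

```python
# Inspect manifest growth for an Iceberg table (table name is a placeholder).
spark.sql("""
    SELECT path, added_data_files_count, existing_data_files_count
    FROM db.events.manifests
""").show(truncate=False)

# The files metadata table lists the per-file entries that planning must read.
spark.sql("SELECT COUNT(*) AS data_files FROM db.events.files").show()
```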
You can now test the newly created application by running the following command: npm run dev. By default, the application is available on port 5173 on your local machine. Unfiltered Table Metadata: this tab displays the response of the AWS Glue GetUnfilteredTableMetadata API for the selected table.
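Outside the sample app, the same response can be fetched directly with boto3; the catalog ID, database, and table names below are placeholders.

```python
import boto3

glue = boto3.client("glue")

# Ask the Data Catalog for table metadata without Lake Formation filtering.
resp = glue.get_unfiltered_table_metadata(
    CatalogId="123456789012",   # placeholder catalog (account) ID
    DatabaseName="sales_db",    # placeholder database
    Name="orders",              # placeholder table
    SupportedPermissionTypes=["COLUMN_PERMISSION"],
)
print(resp["Table"]["Name"], resp.get("IsRegisteredWithLakeFormation"))
```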
It assigns unique identifiers to each data item—referred to as ‘payloads’—related to each event. Payload DJs facilitate capturing metadata, lineage, and test results at each phase, enhancing tracking efficiency and reducing the risk of data loss.
Data-driven decisions lead to more effective responses to unexpected events, increase innovation, and allow organizations to create better experiences for their customers. Short overview of Cloudinary’s infrastructure: Cloudinary’s infrastructure handles over 20 billion requests daily, with every request generating event logs.
AppsFlyer develops a leading measurement solution focused on privacy, which enables marketers to gauge the effectiveness of their marketing activities and integrates them with the broader marketing world, managing a vast volume of 100 billion events every day. This led the team to examine partition indexing.
When it comes to near-real-time analysis of data as it arrives in Security Lake, and to responding to the security events your company cares about, Amazon OpenSearch Service provides the necessary tooling to help you make sense of the data found in Security Lake. Under Log and event sources, specify what the subscriber is authorized to ingest.
In the context of Data in Place, validating data quality automatically with Business Domain Tests is imperative for ensuring the trustworthiness of your data assets. Running these automated tests as part of your DataOps and Data Observability strategy allows for early detection of discrepancies or errors.
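As a generic illustration (not any specific product's API), a business domain test can be as simple as a pytest-style check against freshly loaded data; the table path and the rules themselves are hypothetical.

```python
import pandas as pd

def test_orders_business_rules():
    """Hypothetical business-domain checks on an orders extract."""
    df = pd.read_parquet("s3://bucket/orders/latest.parquet")  # placeholder path
    assert df["order_id"].is_unique, "duplicate order IDs"
    assert (df["amount"] > 0).all(), "non-positive order amounts"
    assert df["region"].isin(["NA", "EMEA", "APAC"]).all(), "unknown region code"
```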
This may require frequent truncation in certain tables to retain only the latest stream of events. The second streaming data source constitutes metadata information about the call center organization and agents that gets refreshed throughout the day. Agent states are reported in agent-state events.
This premier event showcased groundbreaking advancements, keynotes from AWS leadership, hands-on technical sessions, and exciting product launches. S3 Metadata is designed to automatically capture metadata from objects as they are uploaded into a bucket, and to make that metadata queryable in a read-only table.
This populates the technical metadata in the business data catalog for each data asset. The business metadata can be added by business users to provide business context, tags, and data classification for the datasets. The successful completion of the AWS Glue crawler generates an event in the default event bus of Amazon EventBridge.
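An EventBridge rule that reacts to that crawler event might look like the following sketch; the pattern matches the documented Glue Crawler State Change event, while the rule name is a placeholder.

```python
import boto3, json

events = boto3.client("events")

# Fire only when a Glue crawler finishes successfully.
events.put_rule(
    Name="glue-crawler-succeeded",  # placeholder name
    EventPattern=json.dumps({
        "source": ["aws.glue"],
        "detail-type": ["Glue Crawler State Change"],
        "detail": {"state": ["Succeeded"]},
    }),
    State="ENABLED",
)
```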
Maybe your AI model monitors sales data, and the data is spiking for one region of the country due to a world event. Metadata is the basis of trust for data forensics as we answer the questions of fact or fiction when it comes to the data we see. Let's give an example.
But Transformers have some other important advantages: Transformers don’t require training data to be labeled; that is, you don’t need metadata that specifies what each sentence in the training data means. It’s by far the most convincing example of a conversation with a machine; it has certainly passed the Turing test.
With the Amazon DataZone OpenLineage-compatible API, domain administrators and data producers can capture and store lineage events beyond what is available in Amazon DataZone, including transformations in Amazon Simple Storage Service (Amazon S3), AWS Glue , and other AWS services.
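With the openlineage-python client, emitting such a lineage event could look like the sketch below; the endpoint URL, job namespace, job name, and producer URI are placeholders, and the DataZone-side configuration is assumed to already be in place.

```python
import uuid
from datetime import datetime, timezone
from openlineage.client import OpenLineageClient
from openlineage.client.run import Job, Run, RunEvent, RunState

client = OpenLineageClient(url="https://example.com/lineage")  # placeholder endpoint

# Emit a COMPLETE run event for a hypothetical transformation job.
client.emit(RunEvent(
    eventType=RunState.COMPLETE,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=Run(runId=str(uuid.uuid4())),
    job=Job(namespace="etl", name="orders_transform"),  # placeholders
    producer="https://example.com/producer",            # placeholder
))
```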
Within Airflow, the metadata database is a core component storing configuration variables, roles, permissions, and DAG run histories. A healthy metadata database is therefore critical for your Airflow environment. The AWS Health Dashboard provides information about AWS Health events that can affect your account.
This post covers how you can implement data enrichment for near-real-time streaming events with Apache Flink and how you can optimize performance. To compare the performance of the enrichment patterns, we ran performance testing based on synthetic data. The result of this test is useful as a general reference.
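Independent of Flink's API, the core trade-off between per-record lookups and cached enrichment can be sketched in plain Python; fetch_reference, the cache size, and the event fields are hypothetical stand-ins, not the post's implementation.

```python
from functools import lru_cache

def fetch_reference(key: str) -> dict:
    """Stand-in for a remote lookup (database query, REST call, etc.)."""
    return {"customer_tier": "gold"}  # stubbed value for the sketch

@lru_cache(maxsize=10_000)  # cached enrichment: trades freshness for latency
def fetch_reference_cached(key: str) -> dict:
    return fetch_reference(key)

def enrich(event: dict) -> dict:
    # A synchronous per-record lookup pays the remote round trip on every
    # event; the cached variant only pays it on a cache miss.
    return {**event, **fetch_reference_cached(event["customer_id"])}

print(enrich({"customer_id": "c-42", "amount": 19.99}))
```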
Data and Metadata: data inputs and data outputs produced based on the application logic. Also included are the business and technical metadata, related to both data inputs and data outputs, that enable data discovery and cross-organizational consensus on the definitions of data assets.
The rising trend in today’s tech landscape is the use of streaming data and event-oriented structures. You can do this by setting Amazon MSK as an event source for a Lambda function. For testing, this post includes a sample AWS Cloud Development Kit (AWS CDK) application. API Gateway forwards the request to a Lambda function.
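Setting Amazon MSK as an event source can be done with a single boto3 call; the cluster ARN, topic, and function name are placeholders.

```python
import boto3

lambda_client = boto3.client("lambda")

# Poll an MSK topic and invoke the function with batches of records.
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:kafka:us-east-1:123456789012:cluster/demo/abc",  # placeholder
    FunctionName="process-events",  # placeholder function
    Topics=["clickstream"],         # placeholder topic
    StartingPosition="LATEST",
    BatchSize=100,
)
```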
These logs can track activity, such as data access patterns, lifecycle and management activity, and security events. AWS Glue Data Catalog stores information as metadata tables, where each table specifies a single data store. Running the crawler on a schedule updates AWS Glue Data Catalog with new partitions and metadata.
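A scheduled crawler can be created as in this sketch; the name, role, database, and S3 path are placeholders, and the cron expression runs nightly at 02:00 UTC.

```python
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="s3-logs-crawler",  # placeholder name
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",   # placeholder role
    DatabaseName="logs_db",  # placeholder database
    Targets={"S3Targets": [{"Path": "s3://bucket/logs/"}]},  # placeholder path
    Schedule="cron(0 2 * * ? *)",  # nightly at 02:00 UTC
)
```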
It involves reviewing data in detail, comparing and contrasting the data to its own metadata, running statistical models, and producing data quality reports. Also known as data validation, integrity refers to the structural testing of data to ensure that the data complies with procedures.
After all, it’s very likely that you are developing your flow against test systems, but in production it needs to run against production systems, meaning that your source and destination connection configuration has to be adjusted. To meet this need, we’ve introduced a new concept called test sessions with the DataFlow Designer.
Figure 1: Flow of actions for self-service analytics around data assets stored in relational databases. First, the data producer needs to capture and catalog the technical metadata of the data asset. Second, the data producer needs to consolidate the data asset’s metadata in the business catalog and enrich it with business metadata.
Everything is being tested, and then the campaigns that succeed get more money put into them, while the others aren’t repeated. This methodology of “test, look at the data, adjust” is at the heart and soul of business intelligence.
Programs must support proactive and reactive change management activities for reference data values and the structure/use of master data and metadata. The program must introduce and support standardization of enterprise data. IBM Data Governance IBM Data Governance leverages machine learning to collect and curate data assets.
Metadata Caching. This is used to provide very low latency access to table metadata and file locations in order to avoid making expensive remote RPCs to services like the Hive Metastore (HMS) or the HDFS Name Node, which can be busy with JVM garbage collection or handling requests for other high latency batch workloads.
With Lake Formation, you can centralize data security and governance using the AWS Glue Data Catalog, letting you manage metadata and data permissions in one place with familiar database-style features. glue:GetUnfilteredTableMetadata – Allows a third-party analytical engine to retrieve unfiltered table metadata from the Data Catalog.
Additionally, shard redistribution during failure events causes increased resource utilization, leading to increased latencies and overloaded nodes, further impacting availability and effectively defeating the purpose of fault-tolerant, multi-AZ clusters. This event is referred to as a zonal failover.
The company wanted the ability to continue processing operational data in the secondary Region in the rare event of primary Region failure. Finally, by testing the framework, we summarize how it meets the aforementioned requirements. The event rule forwards the object event notifications to the SQS queue as messages.
Allows them to iteratively develop processing logic and test with as little overhead as possible. With the general availability of DataFlow Designer, developers can now implement their data pipelines by building, testing, deploying, and monitoring data flows in one unified user interface that meets all their requirements.
This JSON file contains the migration metadata, namely the following: a list of Google BigQuery projects and datasets. The state machine iterates on the metadata from this DynamoDB table to run the table migrations in parallel, based on the maximum number of migration jobs, without exceeding limits or quotas on Google BigQuery.
If it’s a restart of an existing job, the job reads from the last-record metadata checkpoint in storage (for this post, DynamoDB) and ignores kinesis.startingPosition. At the end of each task, the corresponding executor process saves the metadata (checkpoint) about the last record read for each shard in the offset storage (for this post, DynamoDB).
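The checkpointing described above can be sketched as a small DynamoDB helper; the table name and key schema are assumptions for illustration.

```python
import boto3

table = boto3.resource("dynamodb").Table("kinesis-checkpoints")  # placeholder table

def save_checkpoint(shard_id: str, sequence_number: str) -> None:
    """Record the last sequence number read for a shard."""
    table.put_item(Item={"shardId": shard_id, "sequenceNumber": sequence_number})

def load_checkpoint(shard_id: str):
    """Return the stored sequence number, or None on a fresh start
    (in which case kinesis.startingPosition applies)."""
    item = table.get_item(Key={"shardId": shard_id}).get("Item")
    return item["sequenceNumber"] if item else None
```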