Writing SQL queries requires not just remembering SQL syntax rules, but also knowledge of the tables' metadata: data about table schemas, relationships among the tables, and possible column values. Although LLMs can generate syntactically correct SQL queries, they still need this table metadata to write accurate ones.
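As a hedged illustration (not any specific product's implementation; the schema text and build_prompt helper are hypothetical), the sketch below shows one way table metadata might be injected into an LLM prompt before asking for SQL:

import textwrap

# Hypothetical schema description; in practice this would be generated
# from the catalog (tables, relationships, and valid column values).
TABLE_METADATA = textwrap.dedent("""
    Table: orders
      order_id    INT      -- primary key
      customer_id INT      -- foreign key -> customers.customer_id
      status      VARCHAR  -- one of: 'pending', 'shipped', 'delivered'
      amount      DECIMAL(10,2)
    Table: customers
      customer_id INT      -- primary key
      region      VARCHAR
""")

def build_prompt(question: str) -> str:
    # Grounding the model in schema, relationships, and valid column
    # values steers it toward SQL that references real tables and columns.
    return (
        "You are a SQL assistant. Use only the tables below.\n"
        f"{TABLE_METADATA}\n"
        f"Question: {question}\n"
        "Return a single ANSI SQL query."
    )

print(build_prompt("Total delivered order amount per region"))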
Central to a transactional data lake are open table formats (OTFs) such as Apache Hudi, Apache Iceberg, and Delta Lake, which act as a metadata layer over columnar formats. For more examples and references to other posts, refer to the following GitHub repository. This post is one of multiple posts about XTable on AWS.
Iceberg offers distinct advantages through its metadata layer over Parquet, such as improved data management, performance optimization, and integration with various query engines. Iceberg's table format separates data files from metadata files, enabling efficient data modifications without full dataset rewrites.
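As a minimal sketch, assuming a Spark session already configured with an Iceberg catalog named demo (catalog, database, and table names are placeholders), this separation means table state can be inspected through Iceberg's metadata tables without scanning the data files themselves:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-metadata").getOrCreate()

# Snapshot history comes from metadata files, not from the data itself.
spark.sql("SELECT snapshot_id, committed_at, operation FROM demo.db.sales.snapshots").show()

# The files metadata table lists data files and their stats for planning.
spark.sql("SELECT file_path, record_count FROM demo.db.sales.files").show()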
Amazon Q generative SQL for Amazon Redshift uses generative AI to analyze user intent, query patterns, and schema metadata to identify common SQL query patterns directly within Amazon Redshift, accelerating the query authoring process for users and reducing the time required to derive actionable data insights.
Solution overview: By combining the powerful vector search capabilities of OpenSearch Service with the access control features provided by Amazon Cognito, this solution enables organizations to manage access controls based on custom user attributes and document metadata. Refer to Service Quotas for more details.
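As an illustrative sketch only (the index name, field names, and the "department" attribute taken from a Cognito token are assumptions, not the solution's exact schema), a k-NN query in OpenSearch can combine vector search with a metadata term filter:

from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=[{"host": "my-domain.example.com", "port": 443}],  # placeholder endpoint
    use_ssl=True,
)

query_vector = [0.12, 0.43, 0.98]  # embedding of the user's query (placeholder)

# k-NN search restricted by a metadata filter, so a user only retrieves
# documents whose metadata matches their Cognito attribute.
response = client.search(
    index="documents",
    body={
        "size": 5,
        "query": {
            "knn": {
                "embedding": {
                    "vector": query_vector,
                    "k": 5,
                    "filter": {"term": {"department": "finance"}},
                }
            }
        },
    },
)
for hit in response["hits"]["hits"]:
    print(hit["_source"].get("title"))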
These organizations often maintain multiple AWS accounts for development, testing, and production stages, leading to increased complexity and cost. This micro environment is particularly well-suited for development, testing, or small production workloads where resource optimization and cost-efficiency are primary concerns.
As an important part of achieving better scalability, Ozone separates metadata management among different services: the Ozone Manager (OM) service manages namespace metadata such as volumes, buckets, and keys, while the Datanode service manages the metadata of blocks, containers, and pipelines running on the datanode.
At the same time, Miso went about an in-depth chunking and metadata-mapping of every book in the O’Reilly catalog to generate enriched vector snippet embeddings of each work. If the original Answers release was an LLM-driven retrieval engine, today’s new version of Answers is an LLM-driven research engine (in the truest sense).
The Eightfold Talent Intelligence Platform integrates with Amazon Redshift metadata security to control the visibility of data catalog listings: the names of databases, schemas, tables, views, stored procedures, and functions in Amazon Redshift. This post discusses restricting the listing of data catalog metadata according to granted permissions.
We have enhanced data sharing performance with improved metadata handling, resulting in first-query execution on shared data that is up to four times faster while the data sharing producer's data is being updated. In internal tests, AI-driven scaling and optimizations showed up to 10 times price-performance improvements for variable workloads.
For customers to gain the maximum benefit from these features, Cloudera best practice reflects the success of thousands of customer deployments, combined with release testing to ensure customers can successfully deploy their environments and minimize risk.
Organizations cannot hope to make the most of a data-driven strategy without at least some degree of metadata-driven automation. Metadata-Driven Automation in the BFSI Industry. Metadata-Driven Automation in the Pharmaceutical Industry. Metadata-Driven Automation in the Insurance Industry.
We’re excited to announce a new feature in Amazon DataZone that offers enhanced metadata governance for your subscription approval process. With this update, domain owners can define and enforce metadata requirements for data consumers when they request access to data assets. Key benefits: the feature benefits multiple stakeholders.
You can now test the newly created application by running the following command: npm run dev. By default, the application is available on port 5173 on your local machine. The Unfiltered Table Metadata tab displays the response of the AWS Glue GetUnfilteredTableMetadata API for the selected table.
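A rough boto3 equivalent of what that tab displays might look like the following sketch (account ID, database, and table names are placeholders):

import boto3

glue = boto3.client("glue", region_name="us-east-1")

response = glue.get_unfiltered_table_metadata(
    CatalogId="123456789012",   # placeholder AWS account ID
    DatabaseName="demo_db",
    Name="demo_table",
    SupportedPermissionTypes=["COLUMN_PERMISSION", "CELL_FILTER_PERMISSION"],
)

# The response reflects how Lake Formation would filter the table for the
# caller: which columns are authorized and whether cell filters apply.
print(response["Table"]["Name"])
print(response.get("AuthorizedColumns"))
print(response.get("IsRegisteredWithLakeFormation"))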
Product Managers are responsible for the successful development, testing, release, and adoption of a product, and for leading the team that implements those milestones. Some of the best lessons are captured in Ron Kohavi, Diane Tang, and Ya Xu’s book Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing.
To learn more about this process, refer to Enabling SAML 2.0. Save the federation metadata XML file: you use the federation metadata file to configure the IAM IdP in a later step. In the Single sign-on section, under SAML Certificates, choose Download for Federation Metadata XML. Choose Test this application.
To learn more about working with events using EventBridge, refer to Events via Amazon EventBridge default bus. After you create the asset, you can add glossaries or metadata forms, but it's not necessary for this post. We refer to this role as the instance-role throughout the post. Enter a name for the asset.
There are no automated tests, so errors frequently pass through the pipeline, and there is no process to spin up an isolated dev environment to quickly add a feature, test it with actual data, and deploy it to production. In contrast, a well-built pipeline has automated tests at each step, making sure that each step completes successfully.
In this post, we'll see the fundamental procedures, tools, and techniques that data engineers, data scientists, and QA/testing teams use to ensure high-quality data as soon as it's deployed. First, we look at how unit and integration tests uncover transformation errors at an early stage. Key tools and processes: testing frameworks (e.g.,
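As a minimal sketch of such a unit test, assuming pytest and pandas (normalize_amounts is a hypothetical transformation, not one from the post):

import pandas as pd

def normalize_amounts(df: pd.DataFrame) -> pd.DataFrame:
    # Example transformation under test: drop negative rows, cents -> dollars.
    out = df[df["amount_cents"] >= 0].copy()
    out["amount_usd"] = out["amount_cents"] / 100
    return out

def test_normalize_amounts_drops_negatives_and_converts():
    df = pd.DataFrame({"amount_cents": [1250, -300, 0]})
    result = normalize_amounts(df)
    # The negative record is dropped; the rest are converted to dollars.
    assert list(result["amount_usd"]) == [12.5, 0.0]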
These include internet-scale web and mobile applications, low-latency metadata stores, high-traffic retail websites, Internet of Things (IoT) and time series data, online gaming, and more. Table metadata, such as column names and data types, is stored using the AWS Glue Data Catalog. To create an S3 bucket, refer to Creating a bucket.
Iceberg tables store metadata in manifest files. As the number of data files increases, the amount of metadata stored in these manifest files also increases, leading to longer query planning times. Query runtime also increases because it's proportional to the number of data or metadata file read operations. with Spark 3.3.2,
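One common mitigation is compacting manifests with Iceberg's rewrite_manifests procedure. A minimal sketch, assuming a Spark session with an Iceberg catalog named demo (the table name is a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-maintenance").getOrCreate()

# Consolidate small manifest files so query planning reads fewer
# metadata files as the table accumulates data files.
spark.sql("CALL demo.system.rewrite_manifests(table => 'db.sales')").show()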
Data quality refers to the assessment of the information you have, relative to its purpose and its ability to serve that purpose. While the digital age has been successful in prompting innovation far and wide, it has also facilitated what is referred to as the “data crisis” – low-quality data. 2 – Data profiling.
For each service, you need to learn the supported authorization and authentication methods, data access APIs, and frameworks to onboard and test data sources. The SageMaker Lakehouse data connection testing capability boosts your confidence in established connections. To learn more, refer to Amazon SageMaker Unified Studio.
The second streaming data source contains metadata about the call center organization and agents, which gets refreshed throughout the day. For the template and setup information, refer to Test Your Streaming Data Solution with the New Amazon Kinesis Data Generator. We use two datasets in this post.
Running Apache Airflow at scale puts proportionally greater load on the Airflow metadata database, sometimes leading to CPU and memory issues on the underlying Amazon Relational Database Service (Amazon RDS) cluster. A resource-starved metadata database may lead to dropped connections from your workers, failing tasks prematurely.
Apache Iceberg is an open table format for very large analytic datasets, which captures metadata information on the state of datasets as they evolve and change over time. Apache Iceberg addresses customer needs by capturing rich metadata information about the dataset at the time the individual data files are created.
There’s a very important difference between these two almost identical sentences: in the first, “it” refers to the cup; in the second, “it” refers to the pitcher. It’s by far the most convincing example of a conversation with a machine; it has certainly passed the Turing test. Ethan Mollick says that it is “only OK at search.”
Data in Place refers to the organized structuring and storage of data within a specific storage medium, be it a database, bucket store, files, or other storage platforms. In the context of Data in Place, validating data quality automatically with Business Domain Tests is imperative for ensuring the trustworthiness of your data assets.
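A hedged sketch of what such a Business Domain Test might look like in practice (the refund rule, table, and column names are illustrative assumptions, not the post's framework):

import pandas as pd

def check_refund_policy(orders: pd.DataFrame) -> list[str]:
    """Return human-readable violations of a domain rule:
    a refund must not exceed the original order amount."""
    violations = orders[orders["refund_usd"] > orders["amount_usd"]]
    return [f"order {row.order_id}: refund exceeds amount"
            for row in violations.itertuples()]

# Run the check directly against data in place (here, an in-memory sample).
orders = pd.DataFrame({
    "order_id": [1, 2],
    "amount_usd": [100.0, 50.0],
    "refund_usd": [20.0, 75.0],
})
assert check_refund_policy(orders) == ["order 2: refund exceeds amount"]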
They have dev, test, and production clusters running critical workloads and want to upgrade their clusters to CDP Private Cloud Base. Customer environment: the customer has three environments, development, test, and production. Let’s take a look at one customer’s upgrade journey.
We use AWS Glue, a fully managed, serverless ETL (extract, transform, and load) service, and the Google BigQuery Connector for AWS Glue (for more information, refer to Migrating data from Google BigQuery to Amazon S3 using AWS Glue custom connectors). If you don’t have one, refer to Amazon Redshift Serverless. An S3 bucket.
It assigns unique identifiers to each data item, referred to as ‘payloads’, related to each event. Payload DJs facilitate capturing metadata, lineage, and test results at each phase, enhancing tracking efficiency and reducing the risk of data loss.
Apache Airflow is an open source tool used to programmatically author, schedule, and monitor sequences of processes and tasks, referred to as workflows. For detailed steps to create an Amazon MWAA environment using the Amazon MWAA console, refer to Introducing Amazon Managed Workflows for Apache Airflow (MWAA). Add the constraints-3.11-updated.txt
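For orientation, here is a minimal sketch of such a workflow (a DAG), assuming a recent Apache Airflow 2.x installation; the task logic is a placeholder:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("extracting...")

def load():
    print("loading...")

with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",   # run once per day
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task  # extract must finish before load starts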
Benchmark setup: In our testing, we used the 3 TB dataset stored in Amazon S3 in compressed Parquet format, with metadata for the databases and tables stored in the AWS Glue Data Catalog. When statistics aren't available, Amazon EMR and Athena use S3 file metadata to optimize query plans. With Amazon EMR 6.10.0
Backup and restore architecture: The backup and restore strategy involves periodically backing up Amazon MWAA metadata to Amazon Simple Storage Service (Amazon S3) buckets in the primary Region. Refer to the detailed deployment steps in the README file to deploy it in your own accounts. The steps are as follows: [1.a]
According to Bob Lambert , analytics delivery lead at Anthem and former director of CapTech Consulting, important data architect skills include: A foundation in systems development: Data architects must understand the system development life cycle, project management approaches, and requirements, design, and test techniques.
The individual pieces of data within these streams are often referred to as records. The snippet below reconstructs the handler, adding the missing import, the client assignment, and a placeholder PartitionKey (required by put_record but cut from the original excerpt) so it runs as written:

import boto3
from botocore.exceptions import ClientError

client = boto3.client('kinesis', region_name='ap-southeast-2')

def lambda_handler(event, context):
    try:
        # Send a single record to the stream named 'test'.
        response = client.put_record(
            StreamName='test',
            Data=b'Sample 1 MB.',
            PartitionKey='pk-1',  # placeholder; not in the original snippet
        )
        return response
    except ClientError as err:
        # Records larger than the 1 MB payload limit are rejected.
        print(err)
        raise

To help you understand better, we experimented by trying to send a record of 1.5
While data management has become a common term for the discipline, it is sometimes referred to as data resource management or enterprise information management (EIM). Programs must support proactive and reactive change management activities for reference data values and the structure/use of master data and metadata.
If you have integrated IAM Identity Center with your Identity Provider (IdP), you can use existing users and groups mapped to your IdP for this test. Test your users in IAM Identity Center (to create users, refer to Add users). For more information, refer to SAML authentication for OpenSearch Dashboards.
Apache Iceberg manages these schema changes in a backward-compatible way through its innovative metadata table evolution architecture. With Lake Formation, you can manage fine-grained access control for your data lake data on Amazon S3 and its metadata in the Data Catalog. Iceberg maintains the table state in metadata files.
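For instance, a minimal sketch assuming a Spark session with an Iceberg catalog named demo (table and column names are placeholders) shows schema changes applied purely through metadata:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-schema-evolution").getOrCreate()

# Iceberg records each change in new metadata files; existing data files
# are untouched, and older snapshots still read with the older schema.
spark.sql("ALTER TABLE demo.db.sales ADD COLUMN discount_pct DOUBLE")
spark.sql("ALTER TABLE demo.db.sales RENAME COLUMN amount TO amount_usd")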
We split the solution into two primary components: generating Spark job metadata and running the SQL on Amazon EMR. The first component (metadata setup) consumes existing Hive job configurations and generates metadata such as the number of parameters, the number of actions (steps), and the file formats. sql_path: SQL file name.
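A loose sketch of the metadata-setup idea follows; the configuration layout and keys are hypothetical, not the post's actual format:

import json

def build_job_metadata(hive_job_config: dict) -> dict:
    # Derive the job metadata the runner needs from an existing Hive
    # job configuration (keys here are illustrative assumptions).
    steps = hive_job_config.get("steps", [])
    return {
        "num_parameters": len(hive_job_config.get("parameters", {})),
        "num_actions": len(steps),
        "file_formats": sorted({s.get("file_format", "parquet") for s in steps}),
        "sql_path": hive_job_config.get("sql_path"),  # SQL file name for the step
    }

config = json.loads('{"parameters": {"dt": "2024-01-01"}, '
                    '"steps": [{"file_format": "orc"}], "sql_path": "daily_agg.sql"}')
print(build_job_metadata(config))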
Pre-loading of reference data provides low latency and high throughput. For a general overview of data enrichment patterns, refer to Common streaming data enrichment patterns in Amazon Managed Service for Apache Flink. To compare the performance of the enrichment patterns, we ran performance testing based on synthetic data.
Figure 1: Flow of actions for self-service analytics around data assets stored in relational databases First, the data producer needs to capture and catalog the technical metadata of the data asset. Second, the data producer needs to consolidate the data asset’s metadata in the business catalog and enrich it with business metadata.
ZooKeeper or KRaft for cluster coordination – Kafka relies on Apache ZooKeeper or KRaft for cluster coordination and metadata management. To learn more about the core components of Amazon MSK tiered storage, refer to Deep dive on Amazon MSK tiered storage. To test it, we created a three-node cluster with the new m7g instance type.
You can list all the datasets available in the repository and see associated metadata:

from pycaret.datasets import get_data
all_datasets = get_data('index')

PyCaret also handles missing values (with various imputation methods available), splitting into train and test sets, as well as some aspects of feature engineering and training. Domino Reference Project.