In a previous post, we talked about applications of machine learning (ML) to software development, which included a tour through sample tools in data science and for managing data infrastructure. However, machine learning isn’t possible without data, and our tools for working with data aren’t adequate.
Amazon EMR provides a big data environment for data processing, interactive analysis, and machine learning using open source frameworks such as Apache Spark, Apache Hive, and Presto. Although LLMs can generate syntactically correct SQL queries, they still need the table metadata to write accurate SQL queries.
Central to a transactional data lake are open table formats (OTFs) such as Apache Hudi, Apache Iceberg, and Delta Lake, which act as a metadata layer over columnar formats. In practice, OTFs are used in a broad range of analytical workloads, from business intelligence to machine learning.
Iceberg offers distinct advantages through its metadata layer over Parquet, such as improved data management, performance optimization, and integration with various query engines. Iceberg’s table format separates data files from metadata files, enabling efficient data modifications without full dataset rewrites.
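As a minimal sketch of that separation (assuming a Spark session already configured with an Iceberg catalog named demo; the table name is hypothetical), adding a column touches only metadata:

from pyspark.sql import SparkSession

# Minimal sketch: the catalog "demo" and table "demo.db.events" are
# assumptions, not values from the excerpt above.
spark = SparkSession.builder.appName("iceberg-metadata-demo").getOrCreate()

# Create an Iceberg table; data files are Parquet, tracked by metadata files.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.db.events (
        id BIGINT,
        event_type STRING
    ) USING iceberg
""")

# Schema evolution is a metadata-only change; existing Parquet data files
# are not rewritten.
spark.sql("ALTER TABLE demo.db.events ADD COLUMN event_ts TIMESTAMP")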
Amazon Q generative SQL for Amazon Redshift uses generative AI to analyze user intent, query patterns, and schema metadata to identify common SQL query patterns directly within Amazon Redshift, accelerating the query authoring process for users and reducing the time required to derive actionable data insights.
If you’re already a software product manager (PM), you have a head start on becoming a PM for artificial intelligence (AI) or machine learning (ML). AI products are automated systems that collect and learn from data to make user-facing decisions. We won’t go into the mathematics or engineering of modern machine learning here.
What Is Metadata? Metadata is information about data. A clothing catalog and a dictionary are both examples of metadata repositories. Indeed, a popular online catalog, like Amazon, offers rich metadata around products to guide shoppers: ratings, reviews, and product details are all examples of metadata.
Apply fair and private models, white-hat and forensic model debugging, and common sense to protect machine learning models from malicious actors. Like many others, I’ve known for some time that machine learning models themselves could pose security risks. Data poisoning attacks. Watermark attacks.
The book is awesome, an absolute must-have reference volume, and it is free (for now, downloadable from Neo4j). Finally, in Chapter 8, the connection between graph algorithms and machine learning that was implicit throughout the book now becomes explicit. Graph Algorithms book.
Solution overview: By combining the powerful vector search capabilities of OpenSearch Service with the access control features provided by Amazon Cognito, this solution enables organizations to manage access controls based on custom user attributes and document metadata. Refer to Service Quotas for more details.
Today, Amazon Redshift is used by customers across all industries for a variety of use cases, including data warehouse migration and modernization, near real-time analytics, self-service analytics, data lake analytics, machine learning (ML), and data monetization. Industry-leading price-performance: Amazon Redshift launches RA3.large
We’re excited to announce a new feature in Amazon DataZone that offers enhanced metadata governance for your subscription approval process. With this update, domain owners can define and enforce metadata requirements for data consumers when they request access to data assets. Key benefits: The feature benefits multiple stakeholders.
You can secure and centrally manage your data in the lakehouse by defining fine-grained permissions with Lake Formation that are consistently applied across all analytics and machine learning (ML) tools and engines. For more details, refer to Tags for AWS Identity and Access Management resources and Pass session tags in AWS STS.
They’re taking data they’ve historically used for analytics or business reporting and putting it to work in machine learning (ML) models and AI-powered applications. They aren’t using analytics and AI tools in isolation. Having confidence in your data is key.
Cloudera Machine Learning (CML) is a cloud-native and hybrid-friendly machine learning platform. CML empowers organizations to build and deploy machine learning and AI capabilities for business at scale, efficiently and securely, anywhere they want. References: Cloudera Machine Learning.
Extract, transform, and load (ETL) is the process of combining, cleaning, and normalizing data from different sources to prepare it for analytics, artificial intelligence (AI), and machine learning (ML) workloads. The data is also registered in the Glue Data Catalog, a metadata repository. Kamen Sharlandjiev is a Sr.
This enables companies to directly access key metadata (tags, governance policies, and data quality indicators) from over 100 data sources in Data Cloud, it said. “Additional to that, we are also allowing the metadata inside of Alation to be read into these agents.” That work takes a lot of machine learning and AI to accomplish.
AWS Glue is a serverless data integration service that makes it simple to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development. For instructions, refer to How to Set Up a MongoDB Cluster. Choose the table to view the schema and other metadata.
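Once the table is cataloged, a hedged sketch of reading that schema back programmatically (the database and table names below are hypothetical) looks like:

import boto3

# Hypothetical names; the Glue database and table below are assumptions,
# not values from the post.
glue = boto3.client("glue")
table = glue.get_table(DatabaseName="mongodb_db", Name="customers")["Table"]

# Print the cataloged schema (column names and types)
for col in table["StorageDescriptor"]["Columns"]:
    print(col["Name"], col["Type"])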
Let’s briefly describe the capabilities of the AWS services we referred to above: AWS Glue is a fully managed, serverless, and scalable extract, transform, and load (ETL) service that simplifies the process of discovering, preparing, and loading data for analytics. Amazon Athena is used to query and explore the data.
If my explanation above is the correct interpretation of the high percentage, and if the statement refers to successfully deployed applications (i.e., One could say that sentinel analytics is more like unsupervised machine learning, while precursor analytics is more like supervised machine learning.
However, with the help of AI and machine learning (ML), new software tools are now available to unearth the value of unstructured data. But in the case of unstructured data, metadata discovery is challenging because the raw data isn’t easily readable. You can integrate different technologies or tools to build a solution.
Metadata management plays a critical role within the modern data management stack. However, as data volumes continue to grow, manual approaches to metadata management are suboptimal and can result in missed opportunities. This puts into perspective the role of active metadata management. What is active metadata management?
These include internet-scale web and mobile applications, low-latency metadata stores, high-traffic retail websites, Internet of Things (IoT) and time series data, online gaming, and more. Table metadata, such as column names and data types, is stored using the AWS Glue Data Catalog. To create an S3 bucket, refer to Creating a bucket.
Gartner even refers to them as “the new black in data management and analytics.” In addition, ethical artificial intelligence (AI) and machine learning (ML) applications will be used by organizations to ensure their training data sets are well-defined, consistent, and of high quality.
Iceberg tables maintain metadata to abstract large collections of files, providing data management features including time travel, rollback, data compaction, and full schema evolution, reducing management overhead. Snowflake writes Iceberg tables to Amazon S3 and updates metadata automatically with every transaction.
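For illustration only (the catalog and table names below are assumptions, and an Iceberg-enabled Spark session is presumed), time travel and snapshot inspection look like:

from pyspark.sql import SparkSession

# Hypothetical catalog/table names; assumes an Iceberg-enabled Spark session.
spark = SparkSession.builder.appName("iceberg-time-travel").getOrCreate()

# Query the table as of an earlier point in time
spark.sql(
    "SELECT * FROM demo.db.orders TIMESTAMP AS OF '2024-01-01 00:00:00'"
).show()

# Snapshots (used for rollback and time travel) are tracked in metadata files
spark.sql("SELECT committed_at, snapshot_id FROM demo.db.orders.snapshots").show()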
While data management has become a common term for the discipline, it is sometimes referred to as data resource management or enterprise information management (EIM). Programs must support proactive and reactive change management activities for reference data values and the structure/use of master data and metadata.
This data can then be easily analyzed to provide insights or used to train machine learning models. To be able to annotate the specified content consistently and unambiguously, these experts usually follow a set of specific conventions, which are referred to as “annotation guidelines”. What Is A Human Benchmark?
Enter metadata. Metadata describes data and includes information such as how old data is, where it was created, who owns it, and what concepts (or other data) it relates to. As a result, leveraging metadata has become a core capability for businesses trying to extract value from their data. Knowledge (metadata) layer.
This fragmented, repetitive, and error-prone experience for data connectivity is a significant obstacle to data integration, analysis, and machine learning (ML) initiatives. To learn more, refer to Amazon SageMaker Unified Studio. This approach simplifies your data journey and helps you meet your security requirements.
This data needs to be ingested into a data lake, transformed, and made available for analytics, machine learning (ML), and visualization. To share the datasets, they needed a way to share access to the data and access to catalog metadata in the form of tables and views.
In other words, using metadata about data science work to generate code. One of the longer-term trends that we’re seeing with Airflow, and so on, is to externalize graph-based metadata and leverage it beyond the lifecycle of a single SQL query, making our workflows smarter and more robust. BTW, videos for Rev2 are up: [link].
Analytics/data science architect: These data architects design and implement data architecture supporting advanced analytics and data science applications, including machine learning and artificial intelligence. Information/data governance architect: These individuals establish and enforce data governance policies and procedures.
The need for an end-to-end strategy for data management and data governance at every step of the journey—from ingesting, storing, and querying data to analyzing, visualizing, and running artificial intelligence (AI) and machine learning (ML) models—continues to be of paramount importance for enterprises.
Apache Iceberg manages these schema changes in a backward-compatible way through its innovative metadata table evolution architecture. Lake Formation helps you centrally manage, secure, and globally share data for analytics and machine learning. Iceberg maintains the table state in metadata files.
PyCaret is a convenient entrée into machine learning and a productivity tool for experienced practitioners. You can list all the datasets available in the repository and see associated metadata: all_datasets = pycaret.datasets.get_data('index'). Domino Reference Project. Image from github.com/pycaret.
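Made runnable, that snippet needs only an import (assuming PyCaret is installed, e.g. via pip install pycaret):

from pycaret.datasets import get_data

# 'index' returns a DataFrame listing the bundled datasets along with
# metadata such as the default task and target column.
all_datasets = get_data('index')
print(all_datasets.head())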
Figure 1: Flow of actions for self-service analytics around data assets stored in relational databases. First, the data producer needs to capture and catalog the technical metadata of the data asset. Second, the data producer needs to consolidate the data asset’s metadata in the business catalog and enrich it with business metadata.
For more information on this foundation, refer to A Detailed Overview of the Cost Intelligence Dashboard. Additionally, it incorporates BMW Group’s internal system to integrate essential metadata, offering a comprehensive view of the data across various dimensions, such as group, department, product, and applications.
Without the right metadata and documentation, data consumers overlook valuable datasets relevant to their use case or spend more time going back and forth with data producers to understand the data and its relevance for their use case—or worse, misuse the data for a purpose it was not intended for.
To learn more about what YuniKorn is, please read our previous articles: YuniKorn – a universal resources scheduler and Spark on Kubernetes – how YuniKorn helps. In the distributed computing world, this refers to the mechanism to schedule correlated tasks in an All or Nothing manner. What is Gang Scheduling?
To enable multimodal search across text, images, and combinations of the two, you generate embeddings for both text-based image metadata and the image itself. In addition, OpenSearch Service supports neural search, which provides out-of-the-box machine learning (ML) connectors. The OpenSearch version used is 2.13.
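As a hedged illustration (the host, index, vector field, and model ID below are placeholders), a neural query lets the deployed model generate the query embedding on the OpenSearch side:

from opensearchpy import OpenSearch

# All names below (host, index, field, model_id) are hypothetical placeholders.
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

query = {
    "query": {
        "neural": {
            "image_text_embedding": {
                "query_text": "red running shoes",
                "model_id": "<deployed-model-id>",
                "k": 10,
            }
        }
    }
}
response = client.search(index="product-images", body=query)
print(response["hits"]["hits"])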
For a deeper exploration of configuring and using streaming ingestion in Amazon Redshift, refer to Real-time analytics with Amazon Redshift streaming ingestion. For more information on using the SUPER data type, refer to Ingesting and querying semistructured data in Amazon Redshift.
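To make the SUPER idea concrete, here is a rough sketch (the workgroup, database, and column names are placeholders) of navigating a semistructured column through the Redshift Data API:

import boto3

# Placeholder workgroup/database/materialized-view names; not from the post.
rsd = boto3.client("redshift-data")

resp = rsd.execute_statement(
    WorkgroupName="my-serverless-workgroup",
    Database="dev",
    Sql="""
        -- dot/bracket notation navigates the SUPER (semistructured) column
        SELECT payload.order_id, payload.items[0].sku
        FROM kinesis_orders_mv
        LIMIT 10;
    """,
)
# Poll describe_statement / get_statement_result with resp["Id"] for the rows
print(resp["Id"])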
Customers use Amazon Redshift as a key component of their data architecture to drive use cases from typical dashboarding to self-service analytics, real-time analytics, machine learning (ML), data sharing and monetization, and more. In this session, learn about Redshift Serverless’s new AI-driven scaling and optimization functionality.
As shown in the following reference architecture, DynamoDB table data changes are streamed into Amazon Redshift through Kinesis Data Streams and Amazon Redshift streaming ingestion for near-real-time analytics dashboard visualization using Amazon QuickSight. For instructions, refer to Create a sample Amazon Redshift cluster.
AWS has invested in native service integration with Apache Hudi and published technical content to enable you to use Apache Hudi with AWS Glue (for example, refer to Introducing native support for Apache Hudi, Delta Lake, and Apache Iceberg on AWS Glue for Apache Spark, Part 1: Getting Started).
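As a rough sketch of what that looks like in a Glue Spark job (the bucket, key fields, and Hudi options below are assumptions, not taken from the referenced post):

from pyspark.sql import SparkSession

# Hypothetical data and paths; assumes a Glue/Spark job with Hudi support
# enabled as described in the referenced post.
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "alice", "2024-01-01")], ["customer_id", "name", "updated_at"]
)

hudi_options = {
    "hoodie.table.name": "customers",
    "hoodie.datasource.write.recordkey.field": "customer_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
}
df.write.format("hudi").options(**hudi_options).mode("append").save(
    "s3://my-bucket/hudi/customers/"
)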