If you’re already a software product manager (PM), you have a head start on becoming a PM for artificial intelligence (AI) or machine learning (ML). But there’s a host of new challenges when it comes to managing AI projects: more unknowns, non-deterministic outcomes, new infrastructures, new processes and new tools.
It is appealing to migrate from self-managed OpenSearch and Elasticsearch clusters running legacy versions to Amazon OpenSearch Service to enjoy its ease of use, native integration with AWS services, and rich features from the open-source environment (OpenSearch is now part of the Linux Foundation).
Amazon OpenSearch Service is a fully managed service for search and analytics. AWS handles the heavy lifting of managing the underlying infrastructure, including service installation, configuration, replication, and backups, so you can focus on the business side of your application. Make sure the Python version is later than 2.7.0:
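A minimal way to perform that version check from within Python itself (a sketch; the 2.7.0 floor is the one quoted above):

```python
import sys


def version_ok(minimum=(2, 7, 0)):
    """Return True if the running interpreter meets the minimum version."""
    return sys.version_info[:3] >= minimum


if __name__ == "__main__":
    major, minor, micro = sys.version_info[:3]
    print("Python %d.%d.%d meets minimum: %s" % (major, minor, micro, version_ok()))
```

The same comparison works for any floor you need, e.g. `version_ok((3, 8, 0))` before using features introduced in Python 3.8.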
The CDH is used to create, discover, and consume data products through a central metadata catalog, while enforcing permission policies and tightly integrating data engineering, analytics, and machine learning services to streamline the user journey from data to insight. This led to inefficiencies in data governance and access control.
Their terminal operations rely heavily on seamless data flows and the management of vast volumes of data. Thus, managing data at scale and establishing data-driven decision support across different companies and departments within the EUROGATE Group remains a challenge. This process is shown in the following figure.
Kinesis Data Streams is a fully managed, serverless data streaming service that stores and ingests various streaming data in real time at any scale. To create an OpenSearch domain, see Creating and managing Amazon OpenSearch domains. To create a Kinesis Data Stream, see Create a data stream.
However, with all good things come many challenges, and businesses often struggle to manage their information correctly. Enter data quality management (DQM).
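As a toy illustration of the kinds of checks DQM involves, here is a minimal sketch of profiling completeness and duplicates over a batch of records (field names and record shapes are hypothetical, not from any particular product):

```python
def profile_quality(records, required_fields):
    """Compute simple data-quality metrics for a list of dict records:
    per-field completeness and the number of duplicate rows."""
    if not records:
        return {"completeness": {}, "duplicates": 0}
    total = len(records)
    # Completeness: fraction of records where the field is present and non-empty.
    completeness = {
        f: sum(1 for r in records if r.get(f) not in (None, "")) / total
        for f in required_fields
    }
    # Duplicates: rows whose key fields repeat an earlier row exactly.
    seen, duplicates = set(), 0
    for r in records:
        key = tuple(r.get(f) for f in required_fields)
        if key in seen:
            duplicates += 1
        seen.add(key)
    return {"completeness": completeness, "duplicates": duplicates}
```

For example, `profile_quality(rows, ["id", "email"])` over a batch with one repeated row and one empty email would report `duplicates: 1` and an email completeness below 1.0.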
Organizations with legacy, on-premises, near-real-time analytics solutions typically rely on self-managed relational databases as their data store for analytics workloads. We introduce you to Amazon Managed Service for Apache Flink Studio and get started querying streaming data interactively using Amazon Kinesis Data Streams.
When building custom stream processing applications, developers typically face challenges with managing the distributed computing at scale required to process high-throughput data in real time. The Amazon DynamoDB cost associated with KCL is reduced by optimizing read operations on the DynamoDB table storing metadata.
A data management platform (DMP) is a group of tools designed to help organizations collect and manage data from a wide array of sources and to create reports that help explain what is happening in those data streams. All this data arrives by the terabyte, and a data management platform can help marketers make sense of it all.
Third, some services require you to set up and manage the compute resources used for federated connectivity, and capabilities like connection testing and data preview aren't available in all services. For Host, enter the host name of your Aurora PostgreSQL database cluster. On your project, in the navigation pane, choose Data.
As organizations increasingly adopt cloud-based solutions and centralized identity management, the need for seamless and secure access to data warehouses like Amazon Redshift becomes crucial. Federated users can access the AWS Management Console. Select the Consumption hosting plan, then choose Select.
Amazon DataZone, a data management service, helps you catalog, discover, share, and govern data stored across AWS, on-premises systems, and third-party sources. This Lambda function contains the logic to manage access policies for the subscribed unmanaged asset, automating the subscription process for unstructured S3 assets.
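A minimal sketch of what such a Lambda function's logic might look like (the event shape, ARNs, bucket, and prefix here are all hypothetical illustrations, not the actual implementation from the post):

```python
import json


def build_asset_policy_statement(role_arn, bucket, prefix):
    """Build an S3 bucket-policy statement granting a subscriber role
    read access to one unstructured asset prefix. Names are illustrative."""
    clean = prefix.strip("/")
    return {
        "Sid": "DataZoneSubscription-" + clean.replace("/", "-"),
        "Effect": "Allow",
        "Principal": {"AWS": role_arn},
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
            f"arn:aws:s3:::{bucket}",
            f"arn:aws:s3:::{bucket}/{clean}/*",
        ],
    }


def handler(event, context=None):
    """Hypothetical Lambda entry point for a subscription-granted event."""
    stmt = build_asset_policy_statement(
        event["subscriberRoleArn"], event["bucket"], event["assetPrefix"]
    )
    # A real deployment would merge this statement into the bucket policy
    # (s3:PutBucketPolicy); here we simply return it.
    return {"statusCode": 200, "body": json.dumps(stmt)}
```

The key design point is that the policy statement is derived entirely from the subscription event, so granting and revoking access can be automated per asset.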
Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a fully managed orchestration service that makes it straightforward to run data processing workflows at scale. The solution for this post is hosted on GitHub. The pipeline includes a DAG deployed to the DAGs S3 bucket, which performs backup of your Airflow metadata.
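The metadata-backup step such a DAG task might perform can be sketched in plain Python (a simplified, hypothetical illustration: a local directory stands in for the S3 bucket, and the table names are invented):

```python
import datetime
import json
import os


def backup_metadata(tables, backup_dir):
    """Write each metadata table (a list of dict rows) to a timestamped
    JSON file, mimicking what a backup task would upload to S3."""
    stamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    os.makedirs(backup_dir, exist_ok=True)
    written = []
    for name, rows in tables.items():
        path = os.path.join(backup_dir, f"{name}-{stamp}.json")
        with open(path, "w") as f:
            json.dump(rows, f)
        written.append(path)
    return written
```

Timestamping each export keeps successive backup runs from overwriting one another, so older snapshots remain available for point-in-time restore.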
Amazon Managed Streaming for Apache Kafka (Amazon MSK) is a fully managed service that enables you to build and run applications that use Apache Kafka to process streaming data. These end customers manage Kafka clients, which are deployed in AWS, in other managed cloud providers, or on premises.
Amazon DataZone is a fully managed data management service that makes it faster and easier for customers to catalog, discover, share, and govern data stored across Amazon Web Services (AWS), on premises, and on third-party sources. Note that a managed data asset is an asset for which Amazon DataZone can manage permissions.
And now that it’s established as the default table format, the REST catalog layer above – that is, the APIs that help define just how far and wide Iceberg can stretch, and what management capabilities data professionals will have – is becoming the new battleground. But the metadata turf war is just getting started.
The Zurich Cyber Fusion Center management team faced similar challenges, such as balancing licensing costs for ingestion against long-term retention requirements for both business application log and security log data within the existing SIEM architecture. Previously, P2 logs were ingested into the SIEM.
REA Group, a digital business that specializes in real estate property, solved this problem using Amazon Managed Streaming for Apache Kafka (Amazon MSK) and a data streaming platform called Hydro. In each environment, Hydro manages a single MSK cluster that hosts multiple tenants with differing workload requirements.
The transition to a clean energy grid requires advanced solutions for energy management and storage as well as power conversion. Leveraging data-driven insights can help utilities design, implement, and manage more efficient and reliable grids. To learn more about the solution, read the white paper or watch the video.
Designing for high throughput with 11 9s of durability: OpenSearch Service manages tens of thousands of OpenSearch clusters. The following diagram illustrates the recovery flow in OR1 instances. OR1 instances persist not only the data, but also cluster metadata like index mappings, templates, and settings in Amazon S3.
Data management platform definition A data management platform (DMP) is a suite of tools that helps organizations to collect and manage data from a wide array of first-, second-, and third-party sources and to create reports and build customer profiles as part of targeted personalization campaigns.
Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a managed orchestration service for Apache Airflow that you can use to set up and operate data pipelines in the cloud at scale. By using multiple AWS accounts, organizations can effectively scale their workloads and manage their complexity as they grow.
The Ozone Manager is a critical component of Ozone. It is a replicated, highly available service that is responsible for managing the metadata for all objects stored in Ozone. As Ozone scales to exabytes of data, it is important to ensure that Ozone Manager can perform at scale.
In this post, we delve into the key aspects of using Amazon EMR for modern data management, covering topics such as data governance, data mesh deployment, and streamlined data discovery. One of the key challenges in modern big data management is facilitating efficient data sharing and access control across multiple EMR clusters.
We needed a solution to manage our data at scale, to provide greater experiences to our customers. CDP Private Cloud’s new approach to data management and analytics would allow HBL to access powerful self-service analytics. The post Habib Bank manages data at scale with Cloudera Data Platform appeared first on Cloudera Blog.
Data governance is best defined as the strategic, ongoing and collaborative processes involved in managing data’s access, availability, usability, quality and security in line with established internal policies and relevant data regulations. Managed: Dedicated resources, managed and adjusted with KPIs.
Customers prefer to let the service manage its capacity automatically rather than having to manually provision capacity. OSI is a fully managed, serverless data collector that delivers real-time log, metric, and trace data to OpenSearch Service domains and OpenSearch Serverless collections.
With well governed data, organizations can get more out of their data by making it easier to manage, interpret and use. Better customer satisfaction/trust and reputation management: Use data to provide a consistent, efficient and personalized customer experience, while avoiding the pitfalls and scandals of breaches and non-compliance.
In this blog, we discuss the technical challenges faced by Cargotec in replicating their AWS Glue metadata across AWS accounts, and how they navigated these challenges successfully to enable cross-account data sharing. Solution overview Cargotec required a single catalog per account that contained metadata from their other AWS accounts.
Apache Iceberg enables transactions on data lakes and can simplify data storage, management, ingestion, and processing. This means the data files in the data lake aren’t modified during the migration and all Apache Iceberg metadata files (manifests, manifest files, and table metadata files) are generated outside the purview of the data.
In addition to using native managed AWS services that BMS didn’t need to worry about upgrading, BMS was looking to offer an ETL service to non-technical business users that could visually compose data transformation workflows and seamlessly run them on the AWS Glue Apache Spark-based serverless data integration engine.
In this article we discuss the various methods to replicate HBase data and explore why Replication Manager is the best choice for the job with the help of a use case. Cloudera Replication Manager is a key Cloudera Data Platform (CDP) service, designed to copy and migrate data between environments and infrastructures across hybrid clouds.
To better understand why a business may choose one cloud model over another, let’s look at the common types of cloud architectures: Public – on-demand computing services and infrastructure managed by a third-party provider and shared with multiple organizations using the public Internet.
The open source software ecosystem is dynamic and fast-changing, with regular feature improvements and security and performance fixes that Cloudera rolls up into regular product releases, deployable by Cloudera Manager as parcels. Utility nodes contain services that allow you to manage, monitor, and govern your cluster.
These products are user-facing applications that solve specific business problems across different transportation domains: network topology management, capacity management, and network monitoring. As of this writing, GTTS serves around 10,000 customers globally on a monthly basis, managing the outbound transportation network.
Amazon Managed Workflows for Apache Airflow (Amazon MWAA) provides a fully managed solution for orchestrating and automating complex workflows in the cloud. Also, the end-users don’t need to log in to the AWS Management Console to access the Airflow UI. Create an SSL certificate and upload it to AWS Certificate Manager (ACM).
Another notable item is that Streams Replication Manager (SRM) will now support multi-cluster monitoring patterns and aggregate replication metrics from multiple SRM deployments into a single viewable location in Streams Messaging Manager (SMM). An Atlas hook was provided that, once configured, allows Kafka metadata to be collected.
Focus on Scalability: Beyond all of the task-specific fine-tuning and data privacy controls that need to be incorporated into the process, it is also important to remember that the technology needs to be accessible, affordable, and able to evolve and grow as new technology is developed and as new challenges emerge.
With OpenSearch Serverless, you can search and analyze a large volume of data without having to worry about the underlying infrastructure and data management. Amazon API Gateway is a fully managed service that makes it straightforward for developers to create, publish, maintain, monitor, and secure APIs at any scale.
SAP announced today a host of new AI copilot and AI governance features for SAP Datasphere and SAP Analytics Cloud (SAC). “We have cataloging inside Datasphere: it allows you to catalog and manage metadata for all the SAP data assets we’re seeing,” said JG Chirapurath, chief marketing and solutions officer for SAP.
The CM Host field is only available in the CDP Public Cloud version of SSB because the streaming analytics cluster templates do not include Hive, so in order to work with Hive we will need another cluster in the same environment, which uses a template that has the Hive component.
With this feature, managed clusters can achieve 99.99% availability while remaining resilient to zonal infrastructure failures. During the query phase of a search request, the coordinator determines the shards to be queried and sends a request to the data node hosting the shard copy.
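The scatter-gather pattern described here can be sketched as a toy Python model (the shard layout, scoring, and merge policy are simplified illustrations, not OpenSearch internals):

```python
import heapq


def query_phase(shards, min_score, top_k=3):
    """Toy model of the query phase: for each shard the coordinator picks
    one copy, collects that copy's local top-k hits above min_score, then
    merges the per-shard results into a global top-k.

    `shards` maps shard id -> list of copies; each copy is a list of
    (doc_id, score) tuples.
    """
    per_shard_hits = []
    for shard_id, copies in shards.items():
        copy = copies[0]  # the coordinator routes to one copy per shard
        hits = [(score, doc_id) for doc_id, score in copy if score >= min_score]
        per_shard_hits.append(heapq.nlargest(top_k, hits))
    # Merge phase: global top-k across the shard-local top-k lists.
    merged = heapq.nlargest(top_k, (h for hits in per_shard_hits for h in hits))
    return [doc_id for score, doc_id in merged]
```

Because each shard only ever returns its local top-k, the coordinator merges at most `top_k × num_shards` candidates regardless of total index size, which is what keeps the fan-out cheap.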