Blog, Data Processing and Metadata

How Cargotec uses metadata replication to enable cross-account data sharing

AWS Big Data

JUNE 7, 2023

This is a guest blog post co-written with Sumesh M R from Cargotec and Tero Karttunen from Knowit Finland. In this blog, we discuss the technical challenges faced by Cargotec in replicating their AWS Glue metadata across AWS accounts, and how they navigated these challenges successfully to enable cross-account data sharing.

Metadata

Metadata Data Lake Machine Learning Big Data

How Volkswagen Autoeuropa built a data mesh to accelerate digital transformation using Amazon DataZone

AWS Big Data

OCTOBER 31, 2024

This is a joint blog post co-authored with Martin Mikoleizig from Volkswagen Autoeuropa. Absence of data catalog and metadata management – Data didn’t have any metadata associated with it, and so use cases couldn’t consume the data without further explanation from the data source owners and specialists.

Digital Transformation

Digital Transformation Metadata Manufacturing Data Quality

Top 10 Data Lineage Podcasts, Blogs, and Magazines

Octopai

JANUARY 31, 2021

Our list of Top 10 Data Lineage Podcasts, Blogs, and Websites To Follow in 2021. The host is Tobias Macey, an engineer with many years of experience. The particular episode we recommend looks at how WeWork struggled with understanding their data lineage so they created a metadata repository to increase visibility. Agile Data.

Data Governance

Data Governance Data Processing Data Quality Metadata

Accelerate your migration to Amazon OpenSearch Service with Reindexing-from-Snapshot

AWS Big Data

NOVEMBER 22, 2024

Each Lucene index (and, therefore, each OpenSearch shard) represents a completely independent search and storage capability hosted on a single machine. How RFS works OpenSearch and Elasticsearch snapshots are a directory tree that contains both data and metadata. The following is an example for the structure of an Elasticsearch 7.10

Snapshot

Snapshot Metadata Recreation/Entertainment Data Processing

Migrate an existing data lake to a transactional data lake using Apache Iceberg

AWS Big Data

OCTOBER 3, 2023

This means the data files in the data lake aren’t modified during the migration and all Apache Iceberg metadata files (manifest lists, manifest files, and table metadata files) are generated outside the purview of the data. In this method, the metadata are recreated in an isolated environment and colocated with the existing data files.

Data Lake

Data Lake Metadata Snapshot Recreation/Entertainment

Boosting Object Storage Performance with Ozone Manager

Cloudera

JULY 19, 2023

It is a replicated, highly-available service that is responsible for managing the metadata for all objects stored in Ozone. In this blog post, we will highlight the work done recently to improve the performance of Ozone Manager to scale to exabytes of data. The hardware specifications are included at the end of this blog.

Management

Management Metadata Metrics Optimization

Introducing erwin Data Intelligence 14: Dive into data quality, ensure data reliability and leverage new deployment flexibility

erwin

SEPTEMBER 2, 2024

Leveraging the metadata within the erwin Data Intelligence data catalog, erwin Data Quality automates data profiling and quality assessment and then leverages the resulting quality scoring to provide intelligence-integrated data quality visibility throughout erwin Data Intelligence. Register Now!

Data Quality

Data Quality Data Processing Measurement Metadata

Integrate Amazon MWAA with Microsoft Entra ID using SAML authentication

AWS Big Data

JULY 30, 2024

Two private subnets are used to set up the Amazon MWAA environment, and the third private subnet is used to host the AWS Lambda authorizer function. For Bucket name, enter a name for your bucket (for this post, mwaa-sso-blog- ). Review the metadata about your certificate and choose Import. Choose Create bucket. Choose Next.

Metadata

Metadata Enterprise Management Data Lake

Streaming Ingestion for Apache Iceberg With Cloudera Stream Processing

Cloudera

MARCH 2, 2023

In this blog post, we are going to share with you how Cloudera Stream Processing (CSP) is integrated with Apache Iceberg and how you can use the SQL Stream Builder (SSB) interface in CSP to create stateful stream processing jobs using SQL. To provide the CM host we can copy the FQDN of the node where Cloudera Manager is running.

Snapshot

Snapshot Data Processing Metadata Management

New Features in Cloudera Streams Messaging Public Cloud 7.2.12

Cloudera

OCTOBER 25, 2021

Cruise Control will automatically rebalance the partition replicas on the cluster, making use of the newly added brokers in the event of an upscale; a downscale will move replicas off the hosts that are targeted to be decommissioned. An Atlas hook was provided that, once configured, allows Kafka metadata to be collected.

Metrics

Metrics Data Processing Metadata Management

5G network rollout using DevOps: Myth or reality?

IBM Big Data Hub

JUNE 12, 2023

Public cloud support: Many CSPs use hyperscalers like AWS to host their 5G network functions, which requires automated deployment and lifecycle management. Hybrid cloud support: Some network functions must be hosted on a private data center, but that also requires the ability to automatically place network functions dynamically.

Testing

Testing Data Processing Metadata Management

A Reference Architecture for the Cloudera Private Cloud Base Data Platform

Cloudera

JULY 15, 2021

This blog post provides an overview of best practice for the design and deployment of clusters incorporating hardware and operating system configuration, along with guidance for networking and security as well as integration with existing enterprise infrastructure. Introduction and Rationale. Networking. Clocks must also be synchronized.

Data Processing

Data Processing Metadata Testing Management

How BMW streamlined data access using AWS Lake Formation fine-grained access control

AWS Big Data

OCTOBER 29, 2024

The CDH is used to create, discover, and consume data products through a central metadata catalog, while enforcing permission policies and tightly integrating data engineering, analytics, and machine learning services to streamline the user journey from data to insight.

Data Lake

Data Lake Sales Metadata Machine Learning

How ZS built a clinical knowledge repository for semantic search using Amazon OpenSearch Service and Amazon Neptune

AWS Big Data

SEPTEMBER 12, 2024

In this blog post, we will highlight how ZS Associates used multiple AWS services to build a highly scalable, highly performant, clinical document search platform. We developed and host several applications for our customers on Amazon Web Services (AWS). We use various chunking strategies to enhance text comprehension.

Unstructured Data

Unstructured Data Metadata Consulting Machine Learning

6 benefits of data lineage for financial services

IBM Big Data Hub

FEBRUARY 26, 2024

Download the Gartner® Market Guide for Active Metadata Management 1. Efficient cloud migrations McKinsey predicts that $8 out of every $10 for IT hosting will go toward the cloud by 2024. The post 6 benefits of data lineage for financial services appeared first on IBM Blog.

Cost-Benefit

Cost-Benefit Metadata Data Governance Reporting

Mastering Ingress in the UI: Elevating your app visibility

IBM Big Data Hub

NOVEMBER 3, 2023

v1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: ALB
  generation: 1
  name: echo-ingress
  namespace: echo-namespace
spec:
  rules:
  - host: techcorp.com // 1. Domain
    http:
      paths:
      - backend:
          service:
            name: echo-service
            port:
              number: 8080
        path: /echo
        pathType: Prefix
  tls:
  - hosts:
    - techcorp.com
    secretName: echo-secret // 3.

Data Processing

Data Processing Metadata Management Testing

Setting up and Getting Started with Cloudera’s New SQL AI Assistant

Cloudera

JANUARY 19, 2024

As described in our recent blog post, an SQL AI Assistant has been integrated into Hue with the capability to leverage the power of large language models (LLMs) for a number of SQL tasks. This blog post aims to help you understand what you can do to get started with generative AI assisted SQL using Hue image version 2023.0.16.0

Data Warehouse

Data Warehouse Data Processing Optimization Modeling

How Backstage streamlines software development and increases efficiency

IBM Big Data Hub

APRIL 1, 2024

GitOps for repo data: Backstage allows developers and teams to express the metadata about their projects in YAML files. Rather than paying a cloud to host that proxy for you, you can move that proxy into Backstage and present it as a single product. This is like APIGEE or APIM, but “in-house.”

Software

Software Advertising Data Processing Metadata

Build efficient, cross-Regional, I/O-intensive workloads with Dask on AWS

AWS Big Data

MAY 4, 2023

Amazon’s Open Data Sponsorship Program allows organizations to host free of charge on AWS. These datasets are distributed across the world and hosted for public use. Data scientists have access to the Jupyter notebook hosted on SageMaker. The OpenSearch Service domain stores metadata on the datasets connected at the Regions.

Data Processing

Data Processing Metadata Informatics Interactive

The importance of data ingestion and integration for enterprise AI

IBM Big Data Hub

JANUARY 9, 2024

Data ingestion must be done properly from the start, as mishandling it can lead to a host of new issues. 4 key components to ensure reliable data ingestion Data quality and governance: Data quality means ensuring the security of data sources, maintaining holistic data and providing clear metadata.

Enterprise

Enterprise Data Integration Data Quality Contextual Data

Upgrade Hortonworks Data Platform (HDP) to Cloudera Data Platform (CDP) Private Cloud Base

Cloudera

FEBRUARY 17, 2022

One of our previous blogs discussed the four paths to get from legacy platforms to CDP Private Cloud Base. In this blog and accompanying video, we deep dive into the mechanics of running an in-place upgrade from HDP3 to CDP Private Cloud Base. After Ambari has been upgraded, download the cluster blueprints with hosts.

Testing

Testing Data Processing Metadata Management

Data governance beyond SDX: Adding third party assets to Apache Atlas

Cloudera

MARCH 9, 2021

In this blog, we’ll highlight the key CDP aspects that provide data governance and lineage and show how they can be extended to incorporate metadata for non-CDP systems from across the enterprise. Extending Atlas’ metadata model. The example 1_typedef-server.json describes the server typedef used in this blog.

Data Governance

Data Governance Metadata Enterprise Data Processing

How Amazon GTTS runs large-scale ETL jobs on AWS using Amazon MWAA

AWS Big Data

AUGUST 6, 2024

At a high level, the core of Langley’s architecture is based on a set of Amazon Simple Queue Service (Amazon SQS) queues and AWS Lambda functions, and a dedicated RDS database to store ETL job data and metadata. Web UI Amazon MWAA comes with a managed web server that hosts the Airflow UI.

Cost-Benefit

Cost-Benefit Metadata Snapshot Metrics

A Lifetime of Data: Departments of Defense and Veterans Affairs Journey to Genesis

Cloudera

APRIL 21, 2022

The centerpiece of MHS Genesis is Cerner’s Millennium services management platform, which provides hosted software-as-a-service functionality in the cloud. A key reason for selecting Cerner, the DoD said, was the company’s data center allows direct access to proprietary data that it couldn’t obtain from a government-hosted environment.

Informatics

Informatics Metadata Insurance Data Processing

Migrate Hive data from CDH to CDP public cloud

Cloudera

JUNE 25, 2021

This blog post outlines detailed step by step instructions to perform Hive Replication from an on-prem CDH cluster to a CDP Public Cloud Data Lake. The Sentry service serves authorization metadata from the database backed storage; it does not handle actual privilege validation. This blog post is not a substitute for that.

Data Lake

Data Lake Metadata Unstructured Data Management

Habib Bank manages data at scale with Cloudera Data Platform

Cloudera

NOVEMBER 17, 2022

The platform’s capabilities in security, metadata, and governance will provide robust support to HBL’s focus on compliance and keeping data clean and safe in an increasingly complex regulatory and threat environment. The post Habib Bank manages data at scale with Cloudera Data Platform appeared first on Cloudera Blog.

Management

Management Data Lake Consulting Unstructured Data

Data Catalog: Part of the Solution – or Part of the Problem?

Alation

DECEMBER 13, 2022

Today a modern catalog hosts a wide range of users (like business leaders, data scientists and engineers) and supports an even wider set of use cases (like data governance, self-service, and cloud migration). Active governance learns from user behavior, captured in metadata. Casting a wide metadata net is important.

Metadata

Metadata Data Governance Enterprise Insurance

The Top Three Entangled Trends in Data Architectures: Data Mesh, Data Fabric, and Hybrid Architectures

Cloudera

SEPTEMBER 29, 2022

The data product is not just the data itself, but a bunch of metadata that surrounds it — the simple stuff like schema is a given. It is also agnostic to where the different domains are hosted. There are tons of blogs/videos etc about data mesh. This team or domain expert will be responsible for the data produced by the team.

Data Architecture

Data Architecture Data Warehouse Metadata Sales

Data Governance Maturity and Tracking Progress

erwin

APRIL 16, 2021

erwin recently hosted the third in its six-part webinar series on the practice of data governance and how to proactively deal with its complexities. This webinar will discuss how to answer critical questions through data catalogs and business glossaries, powered by effective metadata management. erwin Data Intelligence.

Data Governance

Data Governance Metadata Cost-Benefit Digital Transformation

From Data Silos to Data Fabric with Knowledge Graphs

Ontotext

SEPTEMBER 15, 2020

This means the creation of reusable data services, machine-readable semantic metadata and APIs that ensure the integration and orchestration of data across the organization and with third-party external data. This means having the ability to define and relate all types of metadata.

Metadata

Metadata Knowledge Discovery Data Quality Strategy

Apache HBase online migration to Amazon EMR

AWS Big Data

OCTOBER 23, 2024

HBase can run on Hadoop Distributed File System (HDFS) or Amazon Simple Storage Service (Amazon S3) , and can host very large tables with billions of rows and millions of columns. This blog post introduces a set of typical HBase migration solutions with best practices based on real-world customers’ migration case studies.

Snapshot

Snapshot Recreation/Entertainment Testing Data Processing

Do Large Language Models Dream of Knowledge Graphs – Impressions from Day 2 At SEMANTiCS 2023

Ontotext

OCTOBER 12, 2023

Both speakers talked about common metadata standards and adequate language resources as key enablers of efficient, interoperable, multilingual projects. Just like the typewriter in the hall hosting the Poster’s park, LLMs are yet another tool poised to change the way we work with language.

Modeling

Modeling Recreation/Entertainment Data Processing Metadata

Security Reference Architecture Summary for Cloudera Data Platform

Cloudera

JANUARY 21, 2022

This blog will summarise the security architecture of a CDP Private Cloud Base cluster. System metadata is reviewed and updated regularly. Similarly, Cloudera Manager Auto TLS enables per host certificates to be generated and signed by established certificate authorities. Sensitive data is encrypted.

Data Processing

Data Processing Management Finance Cost-Benefit

Combining the Flexibility of Knowledge Graphs with the Power of Semantic Tagging: The Enterprise PowerPack

Ontotext

JULY 12, 2024

This enables our customers to work with a rich, user-friendly toolset to manage a graph composed of billions of edges hosted in data centers around the world. This affected the metadata quality and the ability to publish data on time. As a result, enterprises can fully unlock the potential hidden knowledge that they already have.

Enterprise

Enterprise Cost-Benefit Metadata Data Integration

HDFS Data Encryption at Rest on Cloudera Data Platform

Cloudera

APRIL 23, 2021

To prevent the management of these keys (which can run in the millions) from becoming a performance bottleneck, the encryption key itself is stored in the file metadata. Each file will have an EDEK which is stored in the file’s metadata. Select hosts for Active and Passive KTS servers. Data in the file is encrypted with DEK.

Data Processing

Data Processing Metadata Testing Management

AI governance is rapidly evolving — Here’s how government agencies must prepare

IBM Big Data Hub

APRIL 11, 2024

We recommend that these hackathons be extended in scope to address the challenges of AI governance, through these steps: Step 1: Three months before the pilots are presented, have a candidate governance leader host a keynote on AI ethics to hackathon participants. We find that most are disincentivized because they have quotas to meet.

Risk

Risk Consulting Data Processing Modeling

KGF 2023: Bikes To The Moon, Datastrophies, Abstract Art And A Knowledge Graph Forum To Embrace Them All

Ontotext

DECEMBER 1, 2023

Atanas Kiryakov presenting at KGF 2023 about Where Shall an Enterprise Start Their Knowledge Graph Journey. Only data integration through semantic metadata can drive business efficiency, as “it’s the glue that turns knowledge graphs into hubs of metadata and content”.

Metadata

Metadata Sales Consulting Machine Learning

Build multimodal search with Amazon OpenSearch Service

AWS Big Data

JUNE 18, 2024

To enable multimodal search across text, images, and combinations of the two, you generate embeddings for both text-based image metadata and the image itself. This blog post provides a step-by-step guide for building a multimodal search solution using OpenSearch Service.

Dashboards

Dashboards Metadata Modeling Visualization

What Is Data Governance? (And Why Your Organization Needs It)

erwin

AUGUST 28, 2020

These include data catalog, data literacy and a host of built-in automation capabilities that take the pain out of data preparation. With the broadest set of metadata connectors, erwin DI combines data management and DG processes to fuel an automated, real-time, high-quality data pipeline.

Data Governance

Data Governance IT Cost-Benefit Metadata

Enhance query performance using AWS Glue Data Catalog column-level statistics

AWS Big Data

NOVEMBER 22, 2023

The workflow consists of the following high level steps: Cataloging the Amazon S3 Bucket: Utilize AWS Glue Crawler to crawl the designated Amazon S3 bucket, extracting metadata, and seamlessly storing it in the AWS Glue data catalog. We’ll query these tables using Amazon Athena and Amazon Redshift Spectrum. Keep the default option.

Statistics

Statistics Data Lake Optimization Data-driven

How Data Governance Protects Sensitive Data

erwin

APRIL 2, 2021

Protecting what traditionally has been considered personally identifiable information (PII), such as people’s names, addresses, and government identification numbers, that a business collects and hosts is just the beginning of GDPR mandates.

Data Governance

Data Governance Cost-Benefit Risk Metadata

The Ultimate Guide to Modern Data Quality Management (DQM) For An Effective Data Quality Control Driven by The Right Metrics

datapine

SEPTEMBER 29, 2022

It involves: reviewing data in detail; comparing and contrasting the data to its own metadata; running statistical models; data quality reports. Many companies use so-called “legacy systems” for their databases that are decades old, and when the inevitable transition time comes, there’s a whole host of problems to deal with.

Data Quality

Data Quality Metrics Data-driven Management

Announcing Alation 4.0 with Alation Connect

Alation

FEBRUARY 20, 2020

What the mapping is of technical metadata to business descriptions. Alation Connect synchronizes metadata, sample data, and query logs into the Alation Data Catalog. All connections allow for Alation Data Catalog to automatically inventory & catalog queries and these engines may be hosted and operated on-premise or in the cloud.

Metadata

Metadata Enterprise Data Processing Data Architecture

OpenTelemetry vs. Prometheus: You can’t fix what you can’t see

IBM Big Data Hub

MARCH 29, 2024

Benefits of OpenTelemetry The OpenTelemetry protocol (OTLP) simplifies observability by collecting telemetry data, like metrics, logs and traces, without changing code or metadata. Once integrated with a host, Prometheus gathers application metrics that are related to dedicated functions that DevOps teams want to monitor.

Metrics

Metrics Visualization Measurement Optimization
