Each Lucene index (and, therefore, each OpenSearch shard) represents a completely independent search and storage capability hosted on a single machine. How RFS works: OpenSearch and Elasticsearch snapshots are a directory tree that contains both data and metadata. The following is an example of the structure of an Elasticsearch 7.10 snapshot.
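The example tree itself did not survive the excerpt; as a rough, hedged sketch (all file names and UUIDs below are illustrative, not taken from the original post), an Elasticsearch 7.10 snapshot repository is laid out roughly like this:

```
repo/
├── index-11                   # repository metadata: list of snapshots
├── index.latest               # pointer to the current index-N generation
├── meta-XXXXXXXX.dat          # global cluster metadata for a snapshot
├── snap-XXXXXXXX.dat          # top-level snapshot metadata
└── indices/
    └── <index-uuid>/          # one directory per index
        ├── meta-XXXXXXXX.dat  # index metadata
        └── 0/                 # one directory per shard
            ├── __XXXXXXXX         # Lucene segment data blobs
            └── snap-XXXXXXXX.dat  # shard-level snapshot metadata
```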
Add Amplify hosting: Amplify can host applications using either the Amplify console or Amazon CloudFront and Amazon Simple Storage Service (Amazon S3), with the option for manual or continuous deployment. For simplicity, we use the Hosting with Amplify Console and Manual Deployment options.
Next, we focus on building the enterprise data platform where the accumulated data will be hosted. Business analysts enrich the data with business metadata/glossaries and publish it as data assets or data products. The enterprise data platform is used to host and analyze the sales data and identify customer demand.
Hosted weekly by Paul Muller, The AI Forecast speaks to experts in the space to understand the ins and outs of AI in the enterprise, the kinds of data architectures and infrastructures that support it, the guardrails that should be put in place, and the success stories to emulate or cautionary tales to learn from.
Our list of Top 10 Data Lineage Podcasts, Blogs, and Websites To Follow in 2021. The host is Tobias Macey, an engineer with many years of experience. The particular episode we recommend looks at how WeWork struggled with understanding their data lineage, so they created a metadata repository to increase visibility. Agile Data.
The CDH is used to create, discover, and consume data products through a central metadata catalog, while enforcing permission policies and tightly integrating data engineering, analytics, and machine learning services to streamline the user journey from data to insight.
With the ability to browse metadata, you can understand the structure and schema of the data source, identify relevant tables and fields, and discover useful data assets you may not be aware of. For Host, enter the hostname of your Aurora PostgreSQL database cluster. On your project, in the navigation pane, choose Data.
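To make the host entry concrete, here is a minimal sketch of verifying that connection with psycopg2; the endpoint, database name, and credentials are placeholders, not values from the post:

```python
# Hypothetical connectivity check for the Aurora PostgreSQL host entered above.
# Endpoint, database, user, and password below are all placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.cluster-abc123.us-east-1.rds.amazonaws.com",  # cluster endpoint
    port=5432,
    dbname="sales",
    user="datazone_user",
    password="example-password",
)
with conn, conn.cursor() as cur:
    # List user tables to confirm which schemas and fields are browsable.
    cur.execute(
        """
        SELECT table_schema, table_name
        FROM information_schema.tables
        WHERE table_type = 'BASE TABLE'
          AND table_schema NOT IN ('pg_catalog', 'information_schema')
        """
    )
    for schema, table in cur.fetchall():
        print(f"{schema}.{table}")
conn.close()
```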
This is a guest blog post co-written with Sumesh M R from Cargotec and Tero Karttunen from Knowit Finland. In this blog, we discuss the technical challenges faced by Cargotec in replicating their AWS Glue metadata across AWS accounts, and how they navigated these challenges successfully to enable cross-account data sharing.
In this blog post, we will ingest a real world dataset into Ozone, create a Hive table on top of it and analyze the data to study the correlation between new vaccinations and new cases per country using a Spark ML Jupyter notebook in CML. Learn more about the impacts of global data sharing in this blog, The Ethics of Data Exchange.
This means the data files in the data lake aren’t modified during the migration and all Apache Iceberg metadata files (manifest files, manifest lists, and table metadata files) are generated outside the purview of the data. In this method, the metadata is recreated in an isolated environment and colocated with the existing data files.
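As a hedged illustration of this approach, Apache Iceberg ships a migrate procedure that recreates table metadata around the existing data files without rewriting them; the Spark catalog configuration and table name below are assumptions for the sketch:

```python
# Sketch of an in-place Iceberg migration: the migrate procedure generates
# new metadata (manifest files, manifest lists, metadata files) pointing at
# the existing data files. Catalog and table names are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-in-place-migration")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.iceberg.spark.SparkSessionCatalog")
    .config("spark.sql.catalog.spark_catalog.type", "hive")
    .getOrCreate()
)

# Replaces the Hive table db.sales with an Iceberg table over the same
# data files; only metadata is created.
spark.sql("CALL spark_catalog.system.migrate('db.sales')")
```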
For this use case, create a data source and import the technical metadata of six data assets—customers, order_items, orders, products, reviews, and shipments—from the AWS Glue Data Catalog. See the Amazon DataZone and Tableau blog post for step-by-step instructions. OutputLocation: Amazon S3 path for storing query results.
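The OutputLocation setting can be illustrated with a small boto3 sketch; the database, table, and S3 path are placeholders rather than values from the post:

```python
# Hedged sketch: querying one of the imported assets with Amazon Athena.
# Database, query, and S3 output path are illustrative placeholders.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

resp = athena.start_query_execution(
    QueryString="SELECT * FROM orders LIMIT 10",
    QueryExecutionContext={"Database": "sales_db"},
    ResultConfiguration={
        # OutputLocation: Amazon S3 path for storing query results
        "OutputLocation": "s3://my-athena-results-bucket/queries/"
    },
)
print("Query execution ID:", resp["QueryExecutionId"])
```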
After you create the asset, you can add glossaries or metadata forms, but it’s not necessary for this post. Create it as a JSON file on your workstation (for this post, we call it blog-sub-target.json). Delete the S3 bucket that hosted the unstructured asset. Enter a name for the asset. Delete the Lambda function.
This blog post provides an overview of best practices for the design and deployment of clusters, incorporating hardware and operating system configuration along with guidance for networking and security as well as integration with existing enterprise infrastructure. Introduction and Rationale. Networking. Clocks must also be synchronized.
In this blog, I will demonstrate the value of Cloudera DataFlow (CDF), the edge-to-cloud streaming data platform available on the Cloudera Data Platform (CDP), as a data integration and democratization fabric. Data and Metadata: data inputs and data outputs produced based on the application logic. Introduction.
In this blog post, we are going to share with you how Cloudera Stream Processing (CSP) is integrated with Apache Iceberg and how you can use the SQL Stream Builder (SSB) interface in CSP to create stateful stream processing jobs using SQL. To provide the CM host, we can copy the FQDN of the node where Cloudera Manager is running.
Cruise Control will automatically rebalance the partition replicas on the cluster, making use of newly added brokers when scaling up; when scaling down, it will move replicas off the hosts targeted for decommissioning. An Atlas hook was provided that, once configured, allows Kafka metadata to be collected.
These include data catalog , data literacy and a host of built-in automation capabilities that take the pain out of data preparation. With the broadest set of metadata connectors, erwin DI combines data management and DG processes to fuel an automated, real-time, high-quality data pipeline.
It involves reviewing data in detail, comparing and contrasting the data to its own metadata, running statistical models, and producing data quality reports. Many companies use so-called “legacy systems” for their databases that are decades old, and when the inevitable transition time comes, there’s a whole host of problems to deal with.
erwin recently hosted the third in its six-part webinar series on the practice of data governance and how to proactively deal with its complexities. This webinar will discuss how to answer critical questions through data catalogs and business glossaries, powered by effective metadata management. erwin Data Intelligence.
Two private subnets are used to set up the Amazon MWAA environment, and the third private subnet is used to host the AWS Lambda authorizer function. For Bucket name, enter a name for your bucket (for this post, mwaa-sso-blog-). Review the metadata about your certificate and choose Import. Choose Create bucket. Choose Next.
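For context, a Lambda authorizer of the kind hosted in that subnet returns an IAM policy document allowing or denying the request. The following is a minimal sketch only, with a plain token comparison standing in for real SSO/JWT validation; all names and values are illustrative:

```python
# Minimal sketch of a request-based Lambda authorizer. The token check
# is a placeholder; a real authorizer would validate an SSO-issued JWT.
import os

def lambda_handler(event, context):
    token = (event.get("headers") or {}).get("authorization", "")
    effect = "Allow" if token == os.environ.get("EXPECTED_TOKEN") else "Deny"
    return {
        "principalId": "mwaa-user",
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [{
                "Action": "execute-api:Invoke",
                "Effect": effect,
                "Resource": event["methodArn"],
            }],
        },
    }
```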
In this blog, we’ll highlight the key CDP aspects that provide data governance and lineage and show how they can be extended to incorporate metadata for non-CDP systems from across the enterprise. Extending Atlas’ metadata model. The example 1_typedef-server.json describes the server typedef used in this blog.
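As a hedged sketch of what registering such a typedef looks like, Atlas exposes a REST endpoint for type definitions; the URL, credentials, and attribute names below are illustrative, not the blog's actual 1_typedef-server.json:

```python
# Hedged sketch: registering a custom "server" entity typedef with Atlas
# via its v2 REST API. Endpoint, auth, and attributes are placeholders.
import json
import requests

ATLAS_URL = "https://atlas.example.com:31443/api/atlas/v2/types/typedefs"

typedef = {
    "entityDefs": [{
        "name": "server",
        "superTypes": ["Referenceable"],
        "attributeDefs": [
            {"name": "hostName", "typeName": "string", "isOptional": False},
            {"name": "ipAddress", "typeName": "string", "isOptional": True},
        ],
    }]
}

resp = requests.post(
    ATLAS_URL,
    auth=("admin", "example-password"),
    headers={"Content-Type": "application/json"},
    data=json.dumps(typedef),
)
resp.raise_for_status()
print("Created typedefs:", [d["name"] for d in resp.json().get("entityDefs", [])])
```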
One of our previous blogs discussed the four paths to get from legacy platforms to CDP Private Cloud Base. In this blog and accompanying video, we deep dive into the mechanics of running an in-place upgrade from HDP3 to CDP Private Cloud Base. After Ambari has been upgraded, download the cluster blueprints with hosts.
Amazon’s Open Data Sponsorship Program allows organizations to host datasets free of charge on AWS. These datasets are distributed across the world and hosted for public use. Data scientists have access to the Jupyter notebook hosted on SageMaker. The OpenSearch Service domain stores metadata on the datasets connected across the Regions.
It is a replicated, highly-available service that is responsible for managing the metadata for all objects stored in Ozone. In this blog post, we will highlight the work done recently to improve the performance of Ozone Manager to scale to exabytes of data. The hardware specifications are included at the end of this blog.
This blog will summarise the security architecture of a CDP Private Cloud Base cluster. System metadata is reviewed and updated regularly. Similarly, Cloudera Manager Auto TLS enables per-host certificates to be generated and signed by established certificate authorities. Sensitive data is encrypted.
Protecting what traditionally has been considered personally identifiable information (PII) — people’s names, addresses, government identification numbers and so forth — that a business collects and hosts is just the beginning of GDPR mandates.
In this blog post, we will highlight how ZS Associates used multiple AWS services to build a highly scalable, highly performant, clinical document search platform. We developed and host several applications for our customers on Amazon Web Services (AWS). We use various chunking strategies to enhance text comprehension.
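As one illustration of a chunking strategy, a minimal fixed-size chunker with overlap might look like the sketch below; the sizes are arbitrary examples, not the strategy ZS Associates actually used:

```python
# Illustrative fixed-size chunking with overlap, one of several common
# chunking strategies for document search. Sizes are arbitrary examples.
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks so context isn't lost at boundaries."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

document = "A long clinical document... " * 100
for i, chunk in enumerate(chunk_text(document)):
    print(i, len(chunk))
```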
In this blog post, we’re revisiting the challenges that come with running Apache NiFi at scale before we take a closer look at the architecture and core features of CDF-PC. Users access the CDF-PC service through the hosted CDP Control Plane. This will create a JSON file containing the flow metadata.
The typical Cloudera Enterprise Data Hub Cluster starts with a few dozen nodes in the customer’s datacenter hosting a variety of distributed services. While this approach provides isolation, it creates another significant challenge: duplication of data, metadata, and security policies, or a ‘split-brain’ data lake.
Public cloud support: Many CSPs use hyperscalers like AWS to host their 5G network functions, which requires automated deployment and lifecycle management. Hybrid cloud support: Some network functions must be hosted on a private data center, but that also requires the ability to automatically place network functions dynamically.
This blog post outlines detailed step-by-step instructions to perform Hive replication from an on-prem CDH cluster to a CDP Public Cloud Data Lake. The Sentry service serves authorization metadata from database-backed storage; it does not handle actual privilege validation. This blog post is not a substitute for that.
Download the Gartner® Market Guide for Active Metadata Management. Efficient cloud migrations: McKinsey predicts that $8 out of every $10 for IT hosting will go toward the cloud by 2024. The post 6 benefits of data lineage for financial services appeared first on IBM Blog.
In this blog, we’ll delve into the critical role of governance and data modeling tools in supporting a seamless data mesh implementation and explore how erwin tools can be used in that role. erwin also provides data governance, metadata management and data lineage software called erwin Data Intelligence by Quest.
Limited flexibility to use more complex hosting models (e.g., Increased integration costs using different loose or tight coupling approaches between disparate analytical technologies and hosting environments. The post Addressing the Three Scalability Challenges in Modern Data Platforms appeared first on Cloudera Blog.
At a high level, the core of Langley’s architecture is based on a set of Amazon Simple Queue Service (Amazon SQS) queues and AWS Lambda functions, and a dedicated RDS database to store ETL job data and metadata. Web UI: Amazon MWAA comes with a managed web server that hosts the Airflow UI.
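A minimal sketch of one such worker, assuming an SQS-triggered Lambda function; the message fields and the database write are placeholders rather than Langley's actual schema:

```python
# Hedged sketch of a Langley-style worker: a Lambda function triggered by
# an SQS queue that records ETL job state. Fields are illustrative; the
# original architecture persists job data/metadata in a dedicated RDS DB.
import json

def lambda_handler(event, context):
    # An SQS-triggered Lambda receives a batch of messages in "Records".
    for record in event.get("Records", []):
        job = json.loads(record["body"])
        # Placeholder for writing job state/metadata to the RDS database.
        print(f"Processing ETL job {job.get('job_id')} with status {job.get('status')}")
    return {"batchItemFailures": []}  # report no partial-batch failures
```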
The platform’s capabilities in security, metadata, and governance will provide robust support to HBL’s focus on compliance and keeping data clean and safe in an increasingly complex regulatory and threat environment. The post Habib Bank manages data at scale with Cloudera Data Platform appeared first on Cloudera Blog.
This means the creation of reusable data services, machine-readable semantic metadata and APIs that ensure the integration and orchestration of data across the organization and with third-party external data. This means having the ability to define and relate all types of metadata.
To prevent the management of these keys (which can run into the millions) from becoming a performance bottleneck, the encryption key itself is stored in the file metadata. Each file will have an EDEK, which is stored in the file’s metadata. Select hosts for the Active and Passive KTS servers. Data in the file is encrypted with the DEK.
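Conceptually, this is envelope encryption. A minimal sketch using Fernet keys as stand-ins for the HDFS/KMS key material (the key handling here is illustrative, not how HDFS implements it):

```python
# Conceptual sketch of the DEK/EDEK scheme described above, using
# cryptography's Fernet as a stand-in for real KMS key material.
from cryptography.fernet import Fernet

# The key-encryption key (KEK) lives in the KMS, never with the data.
kek = Fernet(Fernet.generate_key())

# Per-file data-encryption key (DEK): encrypts the file contents.
dek_bytes = Fernet.generate_key()
ciphertext = Fernet(dek_bytes).encrypt(b"file contents")

# The DEK is itself encrypted with the KEK; the resulting EDEK is what
# gets stored in the file's metadata, keeping key management cheap.
edek = kek.encrypt(dek_bytes)

# To read the file: decrypt the EDEK with the KEK, then decrypt the data.
recovered_dek = kek.decrypt(edek)
plaintext = Fernet(recovered_dek).decrypt(ciphertext)
assert plaintext == b"file contents"
```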
Rajgopal adds that all customer data, metadata, and escalation data are kept on Indian soil at all times in an ironclad environment. For more perspectives on Sovereign Cloud solutions, read the latest partner blogs from AU Cloud , NxtGen , ThinkOn and Tieto. These are questions and thoughts for all CIOs to ponder.
One key part of the fault injection service is a very lightweight passthrough FUSE file system that is used by Ozone for storing all its persistent data and metadata. The APIs are generic enough that we could target both Ozone data and metadata for failure/corruption/delays. NetFilter Extension. Fault Injection Framework: GitHub.
The workflow consists of the following high-level steps: Cataloging the Amazon S3 bucket: use an AWS Glue crawler to crawl the designated Amazon S3 bucket, extracting metadata and storing it in the AWS Glue Data Catalog. We’ll query these tables using Amazon Athena and Amazon Redshift Spectrum. Keep the default option.
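A hedged boto3 sketch of that cataloging step; the crawler name, IAM role ARN, database, and S3 path are placeholders:

```python
# Hedged sketch: create and start a Glue crawler over the designated S3
# bucket. All names, the role ARN, and the path are placeholders.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_crawler(
    Name="s3-data-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder role
    DatabaseName="analytics_db",
    Targets={"S3Targets": [{"Path": "s3://my-data-bucket/raw/"}]},
)
glue.start_crawler(Name="s3-data-crawler")
# Once the crawler finishes, the extracted table metadata lands in the
# AWS Glue Data Catalog, queryable via Athena or Redshift Spectrum.
```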
Data ingestion must be done properly from the start, as mishandling it can lead to a host of new issues. 4 key components to ensure reliable data ingestion Data quality and governance: Data quality means ensuring the security of data sources, maintaining holistic data and providing clear metadata.
The data product is not just the data itself, but a bunch of metadata that surrounds it — the simple stuff like schema is a given. It is also agnostic to where the different domains are hosted. There are tons of blogs, videos, and other resources about data mesh. This team or domain expert will be responsible for the data produced by the team.
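As an illustration only (the field names are invented for this sketch, not drawn from any data mesh standard), a data product descriptor might look like:

```python
# Illustrative shape of a data product: the data's location plus the
# metadata that surrounds it. All field names here are invented.
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    name: str
    owner_domain: str                      # the team accountable for it
    schema: dict                           # column name -> type, the "given"
    location: str                          # host-agnostic URI to the data
    sla: str = "daily"                     # freshness guarantee
    tags: list[str] = field(default_factory=list)

orders = DataProduct(
    name="orders",
    owner_domain="sales",
    schema={"order_id": "string", "amount": "decimal", "placed_at": "timestamp"},
    location="s3://sales-domain/orders/",  # could equally live on another host
    tags=["pii:none", "tier:gold"],
)
print(orders)
```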