Data Processing, Document and Metadata

Use Amazon Kinesis Data Streams to deliver real-time data to Amazon OpenSearch Service domains with Amazon OpenSearch Ingestion

AWS Big Data

NOVEMBER 11, 2024

For agent-based solutions, see the agent-specific documentation for integration with OpenSearch Ingestion, such as Using an OpenSearch Ingestion pipeline with Fluent Bit. This includes adding common fields to associate metadata with the indexed documents, as well as parsing the log data to make data more searchable.

Metadata

Metadata Metrics Data Processing Analytics

How ZS built a clinical knowledge repository for semantic search using Amazon OpenSearch Service and Amazon Neptune

AWS Big Data

SEPTEMBER 12, 2024

In this blog post, we will highlight how ZS Associates used multiple AWS services to build a highly scalable, highly performant, clinical document search platform. We developed and host several applications for our customers on Amazon Web Services (AWS). The document processing layer supports document ingestion and orchestration.

Unstructured Data

Unstructured Data Metadata Consulting Machine Learning

Amazon OpenSearch Service Under the Hood : OpenSearch Optimized Instances(OR1)

AWS Big Data

APRIL 17, 2024

The following diagram illustrates an indexing flow involving a metadata update in OR1 During indexing operations, individual documents are indexed into Lucene and also appended to a write-ahead log also known as a translog. In the event of an infrastructure failure, an OpenSearch domain can end up losing one or more nodes.

Optimization

Optimization Snapshot Metadata Cost-Benefit

Webinars

Prepare Now: 2025s Must-Know Trends For Product And Data Leaders

Marketing Operations in 2025: A New Framework for Success

MORE WEBINARS

How to build a safe path to AI in Healthcare

CIO Business Intelligence

AUGUST 5, 2024

Some important steps that need to be taken to monitor and address these issues include specific communication and documentation regarding GenAI usage parameters, real-time input and output logging, and consistent evaluation against performance metrics and benchmarks.

Experimentation

Experimentation Risk Metadata Data-driven

Use Amazon OpenSearch Ingestion to migrate to Amazon OpenSearch Serverless

AWS Big Data

FEBRUARY 27, 2024

Migration of metadata such as security roles and dashboard objects will be covered in another subsequent post. Update the following information for the source: Uncomment hosts and specify the endpoint of the existing OpenSearch Service endpoint. For now, you can leave the default minimum as 1 and maximum as 4.

Metadata

Metadata Data Processing Dashboards IoT

Modernize your legacy databases with AWS data lakes, Part 2: Build a data lake using AWS DMS data on Apache Iceberg

AWS Big Data

OCTOBER 30, 2024

format(dbname, table_name)) except Exception as ex: print(ex) failed_table = {"table_name": table_name, "Reason": ex} unprocessed_tables.append(failed_table) def get_table_key(host, port, username, password, dbname): jdbc_url = "jdbc:sqlserver://{0}:{1};databaseName={2}".format(host, To start the job, choose Run. format(dbname)).config("spark.sql.catalog.glue_catalog.catalog-impl",

Data Lake

Data Lake Data Processing Optimization Machine Learning

AVB accelerates search in LINQ with Amazon OpenSearch Service

AWS Big Data

MAY 21, 2024

Initially, searches from Hub queried LINQ’s Microsoft SQL Server database hosted on Amazon Elastic Compute Cloud (Amazon EC2), with search times averaging 3 seconds, leading to reduced adoption and negative feedback. The LINQ team exposes access to the OpenSearch Service index through a search API hosted on Amazon EC2.

Manufacturing

Manufacturing Sales Optimization Data Processing

Enrich your serverless data lake with Amazon Bedrock

AWS Big Data

SEPTEMBER 26, 2024

Organizations are collecting and storing vast amounts of structured and unstructured data like reports, whitepapers, and research documents. End-users often struggle to find relevant information buried within extensive documents housed in data lakes, leading to inefficiencies and missed opportunities.

Data Lake

Data Lake Cost-Benefit Unstructured Data Modeling

Build SAML identity federation for Amazon OpenSearch Service domains within a VPC

AWS Big Data

FEBRUARY 7, 2024

Create an Amazon Route 53 public hosted zone such as mydomain.com to be used for routing internet traffic to your domain. For instructions, refer to Creating a public hosted zone. Request an AWS Certificate Manager (ACM) public certificate for the hosted zone. hosted_zone_id – The Route 53 public hosted zone ID.

Dashboards

Dashboards Data Processing Metadata Consulting

Hybrid Search with Amazon OpenSearch Service

AWS Big Data

MARCH 19, 2024

These traditional lexical search algorithms match user queries with exact words or phrases in your documents. After the scores are all on the same scale, they’re combined for every document. Reorder the documents based on the new combined score and render the documents as a response to the query.

Data Processing

Data Processing Modeling Machine Learning Metadata

Combining the Flexibility of Knowledge Graphs with the Power of Semantic Tagging: The Enterprise PowerPack

Ontotext

JULY 12, 2024

This enables our customers to work with a rich, user-friendly toolset to manage a graph composed of billions of edges hosted in data centers around the world. The bundle focuses on tagging documents from a single data source and makes it easy for customers to build smart applications or support existing systems and processes.

Enterprise

Enterprise Cost-Benefit Metadata Data Integration

SAP enhances Datasphere and SAC for AI-driven transformation

CIO Business Intelligence

MARCH 6, 2024

SAP announced today a host of new AI copilot and AI governance features for SAP Datasphere and SAP Analytics Cloud (SAC). We have cataloging inside Datasphere: It allows you to catalog, manage metadata, all the SAP data assets we’re seeing,” said JG Chirapurath, chief marketing and solutions officer for SAP. “We

Unstructured Data

Unstructured Data Dashboards Business Intelligence Data Governance

Build multimodal search with Amazon OpenSearch Service

AWS Big Data

JUNE 18, 2024

To enable multimodal search across text, images, and combinations of the two, you generate embeddings for both text-based image metadata and the image itself. Text embeddings capture document semantics, while image embeddings capture visual attributes that help you build rich image search applications.

Dashboards

Dashboards Metadata Modeling Visualization

6 benefits of data lineage for financial services

IBM Big Data Hub

FEBRUARY 26, 2024

Download the Gartner® Market Guide for Active Metadata Management 1. Efficient cloud migrations McKinsey predicts that $8 out of every $10 for IT hosting will go toward the cloud by 2024. With data lineage, every object in the migrated system is mapped and dependencies are documented.

Cost-Benefit

Cost-Benefit Metadata Data Governance Reporting

Achieve high availability in Amazon OpenSearch Multi-AZ with Standby enabled domains: A deep dive into failovers

AWS Big Data

JANUARY 10, 2024

During the query phase of a search request, the coordinator determines the shards to be queried and sends a request to the data node hosting the shard copy. The query is run locally on each shard and matched documents are returned to the coordinator node.

Metadata

Metadata Broadcasting Data Processing Modeling

Amazon OpenSearch Service search enhancements: 2023 roundup

AWS Big Data

JANUARY 9, 2024

Now users seek methods that allow them to get even more relevant results through semantic understanding or even search through image visual similarities instead of textual search of metadata. Lexical search In lexical search, the search engine compares the words in the search query to the words in the documents, matching word for word.

Visualization

Visualization Cost-Benefit Modeling Machine Learning

Upgrade Hortonworks Data Platform (HDP) to Cloudera Data Platform (CDP) Private Cloud Base

Cloudera

FEBRUARY 17, 2022

Before proceeding with the upgrade, review the CDP Private Cloud Base prerequisites as specified in the documentation. Finally we also recommend that you take a full backup of your cluster configurations, metadata, other supporting details, and backend databases. The end-to-end process is relatively straightforward and well documented.

Testing

Testing Data Processing Metadata Management

How Amazon GTTS runs large-scale ETL jobs on AWS using Amazon MWAA

AWS Big Data

AUGUST 6, 2024

At a high level, the core of Langley’s architecture is based on a set of Amazon Simple Queue Service (Amazon SQS) queues and AWS Lambda functions, and a dedicated RDS database to store ETL job data and metadata. Web UI Amazon MWAA comes with a managed web server that hosts the Airflow UI. For example, mw1.small

Cost-Benefit

Cost-Benefit Snapshot Metadata Metrics

How Backstage streamlines software development and increases efficiency

IBM Big Data Hub

APRIL 1, 2024

GitOps for repo data Backstage allows developers and teams to express the metadata about their projects from yaml files. This automation of updating documentation and advertising gives time back to the developers. Hosting an API proxy in Backstage will solve these issues for you, letting you focus more on development.

Software

Software Advertising Data Processing Metadata

The importance of data ingestion and integration for enterprise AI

IBM Big Data Hub

JANUARY 9, 2024

Data ingestion must be done properly from the start, as mishandling it can lead to a host of new issues. 4 key components to ensure reliable data ingestion Data quality and governance: Data quality means ensuring the security of data sources, maintaining holistic data and providing clear metadata.

Enterprise

Enterprise Data Integration Data Quality Contextual Data

Octopai Users Do More with Enhanced Data Lineage Capabilities + Complete BI Data Catalog

Octopai

AUGUST 30, 2020

Manually add objects and or links to represent metadata that wasn’t included in the extraction and document descriptions for user visualization. Download upper and column-to-column lineage to Excel/CSV in order to document, verify development and change requests. We call this feature: Expand. Column-to-column lineage.

OLAP

OLAP Metadata Visualization Data Processing

Copy and mask PII between Amazon RDS databases using visual ETL jobs in AWS Glue Studio

AWS Big Data

AUGUST 26, 2024

To follow along with this post, you should have the following prerequisites: Three AWS accounts as follows: Source account: Hosts the source Amazon RDS for PostgreSQL database. Crawlers explore data stores and auto-generate metadata to populate the Data Catalog, registering discovered tables in the Data Catalog.

Visualization

Visualization Metadata Data Transformation Testing

Implement a full stack serverless search application using AWS Amplify, Amazon Cognito, Amazon API Gateway, AWS Lambda, and Amazon OpenSearch Serverless

AWS Big Data

MAY 31, 2024

In this post, you use common methods for searching documents in OpenSearch Service that improve the search experience, such as request body searches using domain-specific language (DSL) for queries. The Lambda function queries OpenSearch Serverless and returns the metadata for the search. The user signs in with their credentials.

Metadata

Metadata Management Data-driven Testing

Data governance beyond SDX: Adding third party assets to Apache Atlas

Cloudera

MARCH 9, 2021

In this blog, we’ll highlight the key CDP aspects that provide data governance and lineage and show how they can be extended to incorporate metadata for non-CDP systems from across the enterprise. Atlas provides open metadata management and governance capabilities to build a catalog of all assets, and also classify and govern these assets.

Data Governance

Data Governance Metadata Enterprise Data Processing

A Reference Architecture for the Cloudera Private Cloud Base Data Platform

Cloudera

JULY 15, 2021

Further information and documentation [link] . All three will be quorums of Zookeepers and HDFS Journal nodes to track changes to HDFS Metadata stored on the Namenodes. CDP is particularly sensitive to host name resolution, therefore it’s vital that the DNS servers have been properly configured and hostnames are fully qualified.

Data Processing

Data Processing Metadata Testing Management

Preprocess and fine-tune LLMs quickly and cost-effectively using Amazon EMR Serverless and Amazon SageMaker

AWS Big Data

FEBRUARY 1, 2024

The Common Crawl corpus contains petabytes of data, regularly collected since 2008, and contains raw webpage data, metadata extracts, and text extracts. Common Crawl data The Common Crawl raw dataset includes three types of data files: raw webpage data (WARC), metadata (WAT), and text extraction (WET).

Metadata

Metadata Modeling Data Processing Unstructured Data

What Is Data Governance? (And Why Your Organization Needs It)

erwin

AUGUST 28, 2020

A greater ability to respond to compliance audits: Take the pain out of preparing reports and respond more quickly to audits with better documentation of data lineage. These include data catalog , data literacy and a host of built-in automation capabilities that take the pain out of data preparation.

Data Governance

Data Governance IT Cost-Benefit Metadata

Configure cross-Region table access with the AWS Glue Catalog and AWS Lake Formation

AWS Big Data

AUGUST 3, 2023

This feature lets users query AWS Glue databases and tables in one Region from another Region using resource links, without copying the metadata in the Data Catalog or the data in Amazon Simple Storage Service (Amazon S3). See the API documentation for GetTable() and GetDatabase( ) for additional details.

Data Lake

Data Lake Metadata Management Data Processing

The Top Three Entangled Trends in Data Architectures: Data Mesh, Data Fabric, and Hybrid Architectures

Cloudera

SEPTEMBER 29, 2022

Instead of having a central team that manages all the data for a company, the thinking is that the responsibility of generating, curating, documenting, updating, and managing data should be distributed across the company based on whichever team is best suited to produce and own that data. Data mesh conceptual hierarchy. Miro: [link].

Data Architecture

Data Architecture Data Warehouse Metadata Sales

Federate Amazon QuickSight access with open-source identity provider Keycloak

AWS Big Data

JUNE 13, 2023

Download the SAML metadata file. In the navigation pane under Clients , import the SAML metadata file. Insert your specific host domain name where the Keycloak application resides in the following URL: [link] /realms/aws-realm/protocol/saml/descriptor. Download the Keycloak IdP SAML metadata file from that URL location.

Metadata

Metadata Dashboards Business Intelligence Data Lake

How Data Governance Protects Sensitive Data

erwin

APRIL 2, 2021

Protecting what traditionally has been considered personally identifiable information (PII) — people’s names, addresses, government identification numbers and so forth — that a business collects, and hosts is just the beginning of GDPR mandates.

Data Governance

Data Governance Cost-Benefit Risk Metadata

The Ultimate Guide to Modern Data Quality Management (DQM) For An Effective Data Quality Control Driven by The Right Metrics

datapine

SEPTEMBER 29, 2022

It involves: Reviewing data in detail Comparing and contrasting the data to its own metadata Running statistical models Data quality reports. These processes could include reports, campaigns, or financial documentation. Accuracy should be measured through source documentation (i.e., 2 – Data profiling.

Data Quality

Data Quality Metrics Data-driven Management

Apache HBase online migration to Amazon EMR

AWS Big Data

OCTOBER 23, 2024

HBase can run on Hadoop Distributed File System (HDFS) or Amazon Simple Storage Service (Amazon S3) , and can host very large tables with billions of rows and millions of columns. hbase-site hbase.rs.prefetchblocksonopen TRUE Whether the server should asynchronously load all the blocks when a store file is opened (data, metadata, and index).

Snapshot

Snapshot Recreation/Entertainment Testing Data Processing

Migrate Hive data from CDH to CDP public cloud

Cloudera

JUNE 25, 2021

The Replication Manager support matrix is documented in our public docs. While using CDH on-premises cluster or CDP Private Cloud Base cluster, make sure that the following ports are open and accessible on the source hosts to allow communication between the source on-premise cluster and CDP Data Lake cluster.

Data Lake

Data Lake Metadata Unstructured Data Management

AI governance is rapidly evolving — Here’s how government agencies must prepare

IBM Big Data Hub

APRIL 11, 2024

We recommend that these hackathons be extended in scope to address the challenges of AI governance, through these steps: Step 1: Three months before the pilots are presented, have a candidate governance leader host a keynote on AI ethics to hackathon participants. We find that most are disincentivized because they have quotas to meet.

Risk

Risk Consulting Data Processing Modeling

HDFS Data Encryption at Rest on Cloudera Data Platform

Cloudera

APRIL 23, 2021

To prevent the management of these keys (which can run in the millions) from becoming a performance bottleneck, the encryption key itself is stored in the file metadata. Encryption & Decryption Flow: The way HDFS encrypts data is explained very well in Cloudera documentation and many articles. Check entropy using the command : . $

Data Processing

Data Processing Metadata Testing Management

Setting up and Getting Started with Cloudera’s New SQL AI Assistant

Cloudera

JANUARY 19, 2024

Please refer to the product documentation for more information about specific releases. Supported AI models and services The SQL AI Assistant is not bundled with a specific LLM; instead it supports various LLMs and hosting services. or higher on the public cloud. Both Hive and Impala dialects are supported.

Data Warehouse

Data Warehouse Data Processing Optimization Modeling

Providing fine-grained, trusted access to enterprise datasets with Okera and Domino

Domino Data Lab

OCTOBER 1, 2020

Additionally, Okera connects to a company’s existing technical and business metadata catalogs (such as Collibra), making it easy for data scientists to discover, access and utilize new, approved sources of information. For the compliance team, the combination of Okera and Domino Data Lab is extremely powerful.

Enterprise

Enterprise Metadata Cost-Benefit Data Processing

Introducing the vector engine for Amazon OpenSearch Serverless, now in preview

AWS Big Data

JULY 26, 2023

This enables you to process a user’s query to find the closest vectors and combine them with additional metadata without relying on external data sources or additional application code to integrate the results. You can choose to host your collection on a public endpoint or within a VPC. The vector index supports up to 1,000 fields.

Metadata

Metadata Cost-Benefit Testing Metrics

CIOs rise to the ESG reporting challenge

CIO Business Intelligence

JANUARY 30, 2024

“Always the gatekeepers of much of the data necessary for ESG reporting, CIOs are finding that companies are even more dependent on them,” says Nancy Mentesana, ESG executive director at Labrador US, a global communications firm focused on corporate disclosure documents. The complexity is at a much higher level.”

Reporting

Reporting Data Quality Strategy Data-driven

Implement disaster recovery with Amazon Redshift

AWS Big Data

JUNE 27, 2024

To develop your disaster recovery plan, you should complete the following tasks: Define your recovery objectives for downtime and data loss (RTO and RPO) for data and metadata. Document the entire disaster recovery process. Choose your hosted zone. On the Route 53 console, choose Hosted zones in the navigation pane.

Snapshot

Snapshot Data Warehouse Data Processing Strategy

Security Reference Architecture Summary for Cloudera Data Platform

Cloudera

JANUARY 21, 2022

System metadata is reviewed and updated regularly. For the purposes of this document we are going to focus on the most secure level 3 security. Similarly, Cloudera Manager Auto TLS enables per host certificates to be generated and signed by established certificate authorities. Sensitive data is encrypted.

Data Processing

Data Processing Management Finance Cost-Benefit

What’s new with Amazon MWAA support for Apache Airflow version 2.4.3

AWS Big Data

MAY 2, 2023

The workflow steps are as follows: The producer DAG makes an API call to a publicly hosted API to retrieve data. For detailed release documentation with sample code, visit the Apache Airflow v2.4.0 How dynamic task mapping works Let’s see an example using the reference code available in the Airflow documentation. Release Notes.

Testing

Testing Experimentation Management Metadata

Announcing Alation 4.0 with Alation Connect

Alation

FEBRUARY 20, 2020

Our approach was contrasted with the traditional manual wiki of notes and documentation and labeled as a modern data catalog. What the mapping is of technical metadata to business descriptions. Alation Connect synchronizes metadata, sample data, and query logs into the Alation Data Catalog. How recently the data was used.

Metadata

Metadata Enterprise Data Processing Data Architecture

Use Amazon Kinesis Data Streams to deliver real-time data to Amazon OpenSearch Service domains with Amazon OpenSearch Ingestion

How ZS built a clinical knowledge repository for semantic search using Amazon OpenSearch Service and Amazon Neptune

Webinars

Trending Sources

Amazon OpenSearch Service Under the Hood : OpenSearch Optimized Instances(OR1)

Webinars

How to build a safe path to AI in Healthcare

Use Amazon OpenSearch Ingestion to migrate to Amazon OpenSearch Serverless

Modernize your legacy databases with AWS data lakes, Part 2: Build a data lake using AWS DMS data on Apache Iceberg

AVB accelerates search in LINQ with Amazon OpenSearch Service

Enrich your serverless data lake with Amazon Bedrock

Build SAML identity federation for Amazon OpenSearch Service domains within a VPC

Hybrid Search with Amazon OpenSearch Service

Combining the Flexibility of Knowledge Graphs with the Power of Semantic Tagging: The Enterprise PowerPack

SAP enhances Datasphere and SAC for AI-driven transformation

Build multimodal search with Amazon OpenSearch Service

6 benefits of data lineage for financial services

Achieve high availability in Amazon OpenSearch Multi-AZ with Standby enabled domains: A deep dive into failovers

Amazon OpenSearch Service search enhancements: 2023 roundup

Upgrade Hortonworks Data Platform (HDP) to Cloudera Data Platform (CDP) Private Cloud Base

How Amazon GTTS runs large-scale ETL jobs on AWS using Amazon MWAA

How Backstage streamlines software development and increases efficiency

The importance of data ingestion and integration for enterprise AI

Octopai Users Do More with Enhanced Data Lineage Capabilities + Complete BI Data Catalog

Copy and mask PII between Amazon RDS databases using visual ETL jobs in AWS Glue Studio

Implement a full stack serverless search application using AWS Amplify, Amazon Cognito, Amazon API Gateway, AWS Lambda, and Amazon OpenSearch Serverless

Data governance beyond SDX: Adding third party assets to Apache Atlas

A Reference Architecture for the Cloudera Private Cloud Base Data Platform

Preprocess and fine-tune LLMs quickly and cost-effectively using Amazon EMR Serverless and Amazon SageMaker

What Is Data Governance? (And Why Your Organization Needs It)

Configure cross-Region table access with the AWS Glue Catalog and AWS Lake Formation

The Top Three Entangled Trends in Data Architectures: Data Mesh, Data Fabric, and Hybrid Architectures

Federate Amazon QuickSight access with open-source identity provider Keycloak

How Data Governance Protects Sensitive Data

The Ultimate Guide to Modern Data Quality Management (DQM) For An Effective Data Quality Control Driven by The Right Metrics

Apache HBase online migration to Amazon EMR

Migrate Hive data from CDH to CDP public cloud

AI governance is rapidly evolving — Here’s how government agencies must prepare

HDFS Data Encryption at Rest on Cloudera Data Platform

Setting up and Getting Started with Cloudera’s New SQL AI Assistant

Providing fine-grained, trusted access to enterprise datasets with Okera and Domino

Introducing the vector engine for Amazon OpenSearch Serverless, now in preview

CIOs rise to the ESG reporting challenge

Implement disaster recovery with Amazon Redshift

Security Reference Architecture Summary for Cloudera Data Platform

What’s new with Amazon MWAA support for Apache Airflow version 2.4.3

Announcing Alation 4.0 with Alation Connect

Stay Connected