EUROGATE's data science team aims to create machine learning models that integrate key data sources from various AWS accounts, allowing for training and deployment across different container terminals. From there, the metadata is published to Amazon DataZone by using the AWS Glue Data Catalog.
As a producer, you can also monetize your data through a subscription model using AWS Data Exchange. To achieve this, they plan to use machine learning (ML) models to extract insights from the data. Next, we focus on building the enterprise data platform where the accumulated data will be hosted.
The Institutional Data & AI platform adopts a federated approach to data while centralizing the metadata to facilitate simpler discovery and sharing of data products. This model balances node- or domain-level autonomy with enterprise-level oversight, creating a scalable and consistent framework across ANZ.
Generative artificial intelligence (genAI), and in particular large language models (LLMs), are changing the way companies develop and deliver software.

The commodity effect of LLMs over specialized ML models

One of the most notable transformations generative AI has brought to IT is the democratization of AI capabilities.
The CDH is used to create, discover, and consume data products through a central metadata catalog, while enforcing permission policies and tightly integrating data engineering, analytics, and machine learning services to streamline the user journey from data to insight.
But there’s a host of new challenges when it comes to managing AI projects: more unknowns, non-deterministic outcomes, new infrastructures, new processes and new tools. For machine learning systems used in consumer internet companies, models are often continuously retrained many times a day using billions of entirely new input-output pairs.
Hosted weekly by Paul Muller, The AI Forecast speaks to experts in the space to understand the ins and outs of AI in the enterprise, the kinds of data architectures and infrastructures that support it, the guardrails that should be put in place, and the success stories to emulate or cautionary tales to learn from.
The following diagram illustrates an indexing flow involving a metadata update in OR1. During indexing operations, individual documents are indexed into Lucene and also appended to a write-ahead log, also known as a translog. The replica copies subsequently download the newer segments and make them searchable.
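This replication work happens server-side; from a client's point of view, indexing is a single call. Here is a minimal sketch using the opensearch-py client, where the endpoint, index name, and document are hypothetical placeholders:

    from opensearchpy import OpenSearch

    # Hypothetical endpoint; replace with your OR1 domain's address.
    client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

    # On the primary, this document is written to Lucene and appended to
    # the translog; replicas later download the resulting segments.
    client.index(index="my-index", id="1", body={"title": "hello"})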
The solution for this post is hosted on GitHub.

Backup and restore architecture

The backup and restore strategy involves periodically backing up Amazon MWAA metadata to Amazon Simple Storage Service (Amazon S3) buckets in the primary Region. This is the bucket where you host all of your DAGs for your environment.
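As a minimal sketch of the backup step, assuming a metadata export file already exists locally and a hypothetical backup bucket name (this is not the post's actual implementation):

    import boto3

    s3 = boto3.client("s3")

    # Bucket and key names are illustrative placeholders.
    BACKUP_BUCKET = "mwaa-metadata-backup-primary"

    # Periodically upload the exported metadata dump to the backup
    # bucket in the primary Region.
    s3.upload_file("metadata_export.json", BACKUP_BUCKET,
                   "backups/metadata_export.json")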
“The data catalog is critical because it’s where business manages its metadata,” said Venkat Rajaji, Senior Vice President of Product Management at Cloudera. “But the metadata turf war is just getting started.” That put them in a better position to keep data under management – and possibly to host processing as well.
Without a way to define and measure data confidence, AI model training environments, data analytics systems, automation engines, and so on must simply trust that the data has not been simulated, corrupted, poisoned, or otherwise maliciously generated—increasing the risks of downtime and other disasters.
Log in with your Azure account credentials. Choose Create a resource. Select the Consumption hosting plan and then choose Select. In the Create function pane, provide the following information: For Programming Model, choose v2 Programming Model. For Select a template, choose the HTTP trigger template. Choose Next.
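For orientation, here is a minimal sketch of what an HTTP-triggered function looks like in the v2 Python programming model; the route name and response are illustrative, not part of the original walkthrough:

    import azure.functions as func

    app = func.FunctionApp(http_auth_level=func.AuthLevel.ANONYMOUS)

    @app.route(route="hello")  # HTTP trigger, as chosen in the template
    def hello(req: func.HttpRequest) -> func.HttpResponse:
        name = req.params.get("name", "world")
        return func.HttpResponse(f"Hello, {name}!")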
erwin recently hosted the third in its six-part webinar series on the practice of data governance and how to proactively deal with its complexities. This model has been dubbed the Medici maturity model – named after Romina Medici, head of data management and governance for global energy provider E.ON. erwin Data Intelligence.
Data has always been important to erwin; we've been a trusted data modeling brand for more than 30 years. These include data catalog, data literacy and a host of built-in automation capabilities that take the pain out of data preparation.

The Best Data Governance Solution
Data governance is a key enabler for teams adopting a data-driven culture and operational model to drive innovation with data. This post explains how you can extend the governance capabilities of Amazon DataZone to data assets hosted in relational databases based on MySQL, PostgreSQL, Oracle or SQL Server engines.
Did you know that, if you add “take a deep breath” to a prompt, chances are you will get more accurate results from Large Language Models (LLMs)? I didn't either.

Do Knowledge Graphs Dream of Large Language Models?

He shared the need for more research at the intersection of LLMs and knowledge graphs.
Before we jump into the data ingestion step, here is a quick overview of how Ozone manages its metadata namespace through volumes, buckets, and keys. If created using the Filesystem interface, the intermediate prefixes (application-1 and application-1/instance-1) are created as directories in the Ozone metadata store.
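The excerpt's code trails off at s3 = boto3.resource('s3', ...; a plausible completion, assuming access through Ozone's S3-compatible gateway (the endpoint URL, bucket, and key are placeholders):

    import boto3

    # Connect to Ozone through its S3-compatible gateway; the endpoint
    # URL below is a hypothetical placeholder.
    s3 = boto3.resource("s3", endpoint_url="http://ozone-s3g:9878")

    # Keys written this way live in a flat key namespace; prefixes such
    # as application-1/instance-1 only become directories in the Ozone
    # metadata store when created through the Filesystem interface.
    s3.Bucket("my-bucket").put_object(Key="application-1/instance-1/file1",
                                      Body=b"data")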
Paco Nathan's latest article covers program synthesis, AutoPandas, model-driven data queries, and more. In other words, using metadata about data science work to generate code. Using ML models to search more effectively brought the search space down to roughly 10^2, which can run on modest hardware.

Model-Driven Data Queries
To better understand why a business may choose one cloud model over another, let’s look at the common types of cloud architectures: Public – on-demand computing services and infrastructure managed by a third-party provider and shared with multiple organizations using the public Internet. Public clouds offer large scale at low cost.
SAP announced today a host of new AI copilot and AI governance features for SAP Datasphere and SAP Analytics Cloud (SAC). “We have cataloging inside Datasphere: It allows you to catalog, manage metadata, all the SAP data assets we’re seeing,” said JG Chirapurath, chief marketing and solutions officer for SAP.
Also, a data model that allows table truncations at a regular frequency (for example, every 15 seconds) to store only relevant data in tables can cause locking and performance issues. The second streaming data source constitutes metadata information about the call center organization and agents that gets refreshed throughout the day.
These needs are then quantified into data models for acquisition and delivery. It involves:
- Reviewing data in detail
- Comparing and contrasting the data to its own metadata
- Running statistical models
- Data quality reports
The captured data points should be modeled and defined based on specific characteristics.
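As a minimal pandas sketch of the metadata-comparison and statistical checks above, with a hypothetical input file and expected-metadata values:

    import pandas as pd

    df = pd.read_csv("orders.csv")  # hypothetical input file

    # Compare the data to its own metadata: expected dtype per column.
    expected_dtypes = {"order_id": "int64", "amount": "float64"}
    for col, dtype in expected_dtypes.items():
        assert str(df[col].dtype) == dtype, f"{col} has unexpected dtype"

    # Run simple statistics and emit a basic data quality report.
    print(df.describe(include="all"))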
Organizations have multiple Hive data warehouses across EMR clusters, where the metadata gets generated. The onboarding of producers is facilitated by sharing metadata, whereas the onboarding of consumers is based on granting permission to access this metadata. The producer account will host the EMR cluster and S3 buckets.
Defining and capturing a business capability model

If an enterprise doesn't have a system to capture the business capability model, consider defining and finding a way to capture the model for better insight and visibility, and then map it with digital assets like APIs. It keeps evolving with business requirements and usage.
We developed and host several applications for our customers on Amazon Web Services (AWS). ZS unlocked new value from unstructured data for evidence generation leads by applying large language models (LLMs) and generative artificial intelligence (AI) to power advanced semantic search on evidence protocols.
If data is the new oil, it's only useful once it's been refined. However, it's also a precious resource that must be safeguarded, and large language models (LLMs) have been known to compromise that safety. If a model isn't hosted on your infrastructure, you can't be as certain about its security posture.
difficulty achieving a cross-organizational governance model).
Data and Metadata: Data inputs and data outputs produced based on the application logic.
Infrastructure Environment: The infrastructure (including private cloud, public cloud or a combination of both) that hosts application logic and data.
In each environment, Hydro manages a single MSK cluster that hosts multiple tenants with differing workload requirements. In the future, we plan to profile workloads based on metadata, cross-check them with capacity metrics, and place them in the appropriate MSK cluster.
Amazon's Open Data Sponsorship Program allows organizations to host their data free of charge on AWS. After deployment, the user will have access to a Jupyter notebook, where they can interact with two datasets from ASDI on AWS: Coupled Model Intercomparison Project 6 (CMIP6) and ECMWF ERA5 Reanalysis.
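Open data buckets allow anonymous access, so a notebook can read them without credentials. A minimal sketch; the bucket name below is an assumption based on the AWS open data registry, so check the registry entry before relying on it:

    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    # Unsigned requests: no AWS credentials are needed for open data.
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

    # "cmip6-pds" is assumed to be the CMIP6 bucket name.
    resp = s3.list_objects_v2(Bucket="cmip6-pds", MaxKeys=5)
    for obj in resp.get("Contents", []):
        print(obj["Key"])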
During the query phase of a search request, the coordinator determines the shards to be queried and sends a request to the data node hosting the shard copy. One notable drawback of this replication model is its susceptibility to slowdowns in the event of any impairment in the write path.
With this tool, analysts are able to visualize complex data models in Python, SQL, and R. It offers data connectors, visualization layers, and hosting all in one package, making it ideal for data-driven teams with limited resources. It also comes with data caching capabilities that enable fast querying.
To enable multimodal search across text, images, and combinations of the two, you generate embeddings for both text-based image metadata and the image itself. Amazon Titan Multimodal Embeddings G1 is a multimodal embedding model that generates embeddings to facilitate multimodal search. Add model access in Amazon Bedrock.
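As a minimal sketch of generating one such embedding with Amazon Bedrock; the model ID matches Titan Multimodal Embeddings G1, while the image file and caption are hypothetical placeholders:

    import base64
    import json
    import boto3

    bedrock = boto3.client("bedrock-runtime")

    # Embed an image together with its text metadata; the file and
    # caption below are placeholders.
    with open("product.jpg", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    body = json.dumps({"inputText": "red running shoe",
                       "inputImage": image_b64})
    resp = bedrock.invoke_model(modelId="amazon.titan-embed-image-v1",
                                body=body)
    embedding = json.loads(resp["body"].read())["embedding"]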
Combining the power of Domino Data Lab with Okera, your data scientists only get access to the columns, rows, and cells they are allowed to see, easily removing or redacting sensitive data such as PII and PHI that is not relevant to training models. So what does this look like?
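The excerpt's code trails off at obj = s3.get_object(Bucket='clinical-trials', ...; a plausible completion (the object key is a hypothetical placeholder the excerpt does not show):

    import boto3

    s3 = boto3.client("s3")

    # The Key is a placeholder; only the bucket name appears in the
    # original snippet.
    obj = s3.get_object(Bucket="clinical-trials", Key="trial_results.csv")
    data = obj["Body"].read()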
In this blog, we’ll highlight the key CDP aspects that provide data governance and lineage and show how they can be extended to incorporate metadata for non-CDP systems from across the enterprise. Atlas provides open metadata management and governance capabilities to build a catalog of all assets, and also classify and govern these assets.
Large language models (LLMs) are becoming increasingly popular, with new use cases constantly being explored. This is where model fine-tuning can help. Before you can fine-tune a model, you need to find a task-specific dataset. Next, we use Amazon SageMaker JumpStart to fine-tune the Llama 2 model with the preprocessed dataset.
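A minimal sketch of the JumpStart fine-tuning call, assuming the 7B text-generation variant and a hypothetical S3 path for the preprocessed dataset (Llama 2 also requires accepting the EULA):

    from sagemaker.jumpstart.estimator import JumpStartEstimator

    # Model ID and dataset location are illustrative assumptions.
    estimator = JumpStartEstimator(
        model_id="meta-textgeneration-llama-2-7b",
        environment={"accept_eula": "true"},
    )
    estimator.fit({"training": "s3://my-bucket/task-dataset/"})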
The FinAuto team built AWS Cloud Development Kit (AWS CDK), AWS CloudFormation, and API tools to maintain a metadata store that ingests from domain owner catalogs into the global catalog. The global catalog is also periodically fully refreshed to resolve issues during metadata sync processes and to maintain resiliency.
Initially, searches from Hub queried LINQ’s Microsoft SQL Server database hosted on Amazon Elastic Compute Cloud (Amazon EC2), with search times averaging 3 seconds, leading to reduced adoption and negative feedback. The LINQ team exposes access to the OpenSearch Service index through a search API hosted on Amazon EC2.
Each service is hosted in a dedicated AWS account and is built and maintained by a product owner and a development team, as illustrated in the following figure. Delta tables technical metadata is stored in the Data Catalog, which is a native source for creating assets in the Amazon DataZone business catalog.
This means the creation of reusable data services, machine-readable semantic metadata and APIs that ensure the integration and orchestration of data across the organization and with third-party external data. This means having the ability to define and relate all types of metadata. Create a human AND machine-meaningful data model.
Open source frameworks such as Apache Impala, Apache Hive and Apache Spark offer a highly scalable programming model that is capable of processing massive volumes of structured and unstructured data by means of parallel execution on a large number of commodity computing nodes. Limited flexibility to use more complex hosting models.
The typical Cloudera Enterprise Data Hub Cluster starts with a few dozen nodes in the customer’s datacenter hosting a variety of distributed services. While this approach provides isolation, it creates another significant challenge: duplication of data, metadata, and security policies, or ‘split-brain’ data lake.
At a high level, the core of Langley's architecture is based on a set of Amazon Simple Queue Service (Amazon SQS) queues and AWS Lambda functions, and a dedicated RDS database to store ETL job data and metadata.

Web UI

Amazon MWAA comes with a managed web server that hosts the Airflow UI.
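Returning to the queue-driven core above, a minimal sketch of a Lambda handler consuming ETL job messages from SQS; the event shape is the standard SQS batch, and the job fields are hypothetical:

    import json

    def handler(event, context):
        # Lambda delivers a batch of SQS messages in event["Records"].
        for record in event["Records"]:
            job = json.loads(record["body"])
            # A real system would persist ETL job state and metadata
            # to the dedicated RDS database here.
            print(f"processing ETL job {job.get('job_id')}")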
You will learn how to prepare a multi-account environment to access the databases from AWS Glue, and how to model an ETL data flow that automatically masks PII as part of the transfer process, so that no sensitive information will be copied to the target database in its original form. See JDBC connections for further details.
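As a minimal PySpark sketch of masking a PII column during such a transfer; the column names and hashing choice are illustrative, and a real AWS Glue job would read from and write to the JDBC connections rather than in-memory data:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("mask-pii").getOrCreate()

    df = spark.createDataFrame(
        [("alice@example.com", 120.0)], ["email", "amount"]
    )

    # Replace the PII column with a one-way hash so no sensitive value
    # reaches the target database in its original form.
    masked = df.withColumn("email", F.sha2(F.col("email"), 256))
    masked.show()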