We show how to build data pipelines using AWS Glue jobs, optimize them for both cost and performance, and implement schema evolution to automate manual tasks. Because a CDC file can contain data for multiple tables, the job loops over the tables in a file and loads the table metadata from the source table (RDS column names).
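As a rough sketch of that loop (database name, table names, and the record layout here are assumptions, not the post's actual code), the job can group CDC records by table and pull the source column names from the Glue Data Catalog:

```python
import boto3

glue = boto3.client("glue")

def load_column_names(database, table_name):
    # Column names for the source table, as registered in the Data Catalog
    table = glue.get_table(DatabaseName=database, Name=table_name)
    return [c["Name"] for c in table["Table"]["StorageDescriptor"]["Columns"]]

def process_cdc_file(records, database="rds_source_db"):
    # A CDC file can mix tables, so bucket the records by table first
    by_table = {}
    for rec in records:
        by_table.setdefault(rec["table"], []).append(rec)
    for table_name, rows in by_table.items():
        columns = load_column_names(database, table_name)
        # ... apply the inserts/updates/deletes for this table using `columns`
```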
Amazon OpenSearch Service recently introduced the OpenSearch Optimized Instance family (OR1), which delivers up to 30% price-performance improvement over existing memory-optimized instances in internal benchmarks, and uses Amazon Simple Storage Service (Amazon S3) to provide 11 9s of durability.
Each Lucene index (and, therefore, each OpenSearch shard) represents a completely independent search and storage capability hosted on a single machine. How RFS works: OpenSearch and Elasticsearch snapshots are a directory tree that contains both data and metadata. The following is an example of the structure of an Elasticsearch 7.10 snapshot.
Load balancing challenges with operating custom stream processing applications: Customers processing real-time data streams typically use multiple compute hosts such as Amazon Elastic Compute Cloud (Amazon EC2) to handle the high throughput in parallel. KCL uses DynamoDB to store metadata such as shard-worker mapping and checkpoints.
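For illustration only (the lease table name is an assumption; KCL names the table after the application), the shard-worker mapping and checkpoints that KCL keeps in DynamoDB can be inspected directly:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
lease_table = dynamodb.Table("my-kcl-application")  # assumed application name

# Each item is a lease: which worker owns which shard, and its checkpoint
for lease in lease_table.scan()["Items"]:
    print(lease["leaseKey"], lease.get("leaseOwner"), lease.get("checkpoint"))
```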
For container terminal operators, data-driven decision-making and efficient data sharing are vital to optimizing operations and boosting supply chain efficiency. From here, the metadata is published to Amazon DataZone by using AWS Glue Data Catalog. This post is co-written by Dr. Leonard Heilig and Meliena Zlotos from EUROGATE.
Next, we focus on building the enterprise data platform where the accumulated data will be hosted. In this context, Amazon DataZone is the optimal choice for managing the enterprise data platform. Business analysts enhance the data with business metadata and glossaries and publish it as data assets or data products.
Within the ANZ enterprise data mesh strategy, aligning data mesh nodes with the ANZ Group’s divisional structure provides optimal alignment between data mesh principles and organizational structure, as shown in the following diagram. A data portal for consumers to discover data products and access associated metadata.
This can help you optimize long-term cost for high-throughput use cases. This includes adding common fields to associate metadata with the indexed documents, as well as parsing the log data to make data more searchable. In general, we recommend using one Kinesis data stream for your log aggregation workload.
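A minimal sketch of that enrichment step, assuming a stream named log-aggregation-stream and JSON-encoded records (both assumptions):

```python
import json
import time

import boto3

kinesis = boto3.client("kinesis")

def ship_log(raw_line, source_host):
    # Attach common metadata fields so the document is searchable downstream
    doc = {
        "message": raw_line,
        "host": source_host,
        "ingested_at_ms": int(time.time() * 1000),
    }
    kinesis.put_record(
        StreamName="log-aggregation-stream",  # assumed single aggregation stream
        Data=json.dumps(doc),
        PartitionKey=source_host,             # spreads hosts across shards
    )
```

Using the host as the partition key keeps one host's logs ordered within a shard while still distributing load across the stream.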
Add Amplify hosting: Amplify can host applications using either the Amplify console or Amazon CloudFront and Amazon Simple Storage Service (Amazon S3), with the option to have manual or continuous deployment. For simplicity, we use the Hosting with Amplify Console and Manual Deployment options.
With its scalability, reliability, and ease of use, Amazon OpenSearch Service helps businesses optimize data-driven decisions and improve operational efficiency. Launch an EC2 instance. Note: Make sure to deploy the EC2 instance for hosting Jenkins in the same VPC as the OpenSearch domain (for example, my-test-domain.us-east-1.es.amazonaws.com).
Optimizing device performance within the typical edge constraints of power, energy, latency, space, weight, and cost is essential. Specifically, the DCF captures metadata related to the application and compute stack. Addressing this complex issue requires a multi-pronged approach.
As the use of Hydro grows within REA, it’s crucial to perform capacity planning to meet user demands while maintaining optimal performance and cost-efficiency. In each environment, Hydro manages a single MSK cluster that hosts multiple tenants with differing workload requirements. Khizer Naeem is a Technical Account Manager at AWS.
For the client to resolve DNS queries for the custom domain, an Amazon Route 53 private hosted zone is used to host the DNS records, and is associated with the client’s VPC to enable DNS resolution from the Route 53 VPC resolver. The Kafka client uses the custom domain bootstrap address to send a get metadata request to the NLB.
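Assuming kafka.example.internal is a record in that private hosted zone resolving to the NLB (the name and port here are illustrative), the client needs nothing more than the custom bootstrap address, sketched with the kafka-python library:

```python
from kafka import KafkaProducer

# The custom domain resolves via the Route 53 private hosted zone to the
# NLB, which forwards the client's metadata request on to the MSK brokers.
producer = KafkaProducer(
    bootstrap_servers="kafka.example.internal:9094",  # assumed record and port
    security_protocol="SSL",
)
producer.send("demo-topic", b"hello")
producer.flush()
```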
One of the ways Apache Kudu achieves this is by supporting column predicates with scanners. Pushing column predicate filters down to Kudu allows for optimized execution by skipping the reading of column values for filtered-out rows and reducing network I/O between a client, such as the distributed query engine Apache Impala, and Kudu.
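A small sketch with the kudu-python client (table and column names are made up): the predicate is attached to the scanner, so filtering happens on the Kudu side rather than in the client:

```python
import kudu

client = kudu.connect(host="kudu-master.example.com", port=7051)
table = client.table("metrics")

scanner = table.scanner()
# The predicate is pushed down: rows failing it are skipped server-side,
# so their column values are never read or sent over the network.
scanner.add_predicates([table["value"] >= 100])
scanner.open()
rows = scanner.read_all_tuples()
```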
Through their unique position in ports, at sea, and on roads, they optimize global cargo flows and create sustainable customer value. To share the datasets, they needed a way to share access to the data and access to catalog metadata in the form of tables and views. An AWS Glue job (metadata exporter) runs daily on the source account.
In other words, using metadata about data science work to generate code. SQL optimization provides helpful analogies, given how SQL queries get translated into query graphs internally, then the real smarts of a SQL engine work over that graph. On deck this time ’round the Moon: program synthesis. SQL and Spark.
But there’s a host of new challenges when it comes to managing AI projects: more unknowns, non-deterministic outcomes, new infrastructures, new processes and new tools. You might have millions of short videos , with user ratings and limited metadata about the creators or content.
Tableau is the right tool for creating rich, in-depth analytics or dashboards that can be optimized for tablets, phones, and desktops. It offers data connectors, visualization layers, and hosting all in one package, making it ideal for data-driven teams with limited resources.
Initially, searches from Hub queried LINQ’s Microsoft SQL Server database hosted on Amazon Elastic Compute Cloud (Amazon EC2), with search times averaging 3 seconds, leading to reduced adoption and negative feedback. The LINQ team exposes access to the OpenSearch Service index through a search API hosted on Amazon EC2.
Data has become an invaluable asset for businesses, offering critical insights to drive strategic decision-making and operational optimization. Each service is hosted in a dedicated AWS account and is built and maintained by a product owner and a development team, as illustrated in the following figure.
Data and Metadata: Data inputs and data outputs produced based on the application logic. Also included are business and technical metadata, related to both data inputs and data outputs, that enable data discovery and help achieve cross-organizational consensus on the definitions of data assets.
The FinAuto team built AWS Cloud Development Kit (AWS CDK), AWS CloudFormation , and API tools to maintain a metadata store that ingests from domain owner catalogs into the global catalog. The global catalog is also periodically fully refreshed to resolve issues during metadata sync processes to maintain resiliency.
Additionally, it enables cost optimization by aligning resources with specific use cases, making sure that expenses are well controlled. In the second account, Amazon MWAA is hosted in one VPC and Redshift Serverless in a different VPC, which are connected through VPC peering; the connection details, including the redshift-serverless.amazonaws.com:5439 endpoint, are resolved through AWS Secrets Manager.
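As a hypothetical sketch (the secret name and keys are assumptions), the MWAA side can pull the connection details from Secrets Manager and assemble the Redshift Serverless endpoint:

```python
import json

import boto3

secrets = boto3.client("secretsmanager")
secret = json.loads(
    secrets.get_secret_value(SecretId="redshift/serverless/conn")["SecretString"]
)
# Redshift Serverless endpoints follow the pattern
# <workgroup>.<account-id>.<region>.redshift-serverless.amazonaws.com:5439
endpoint = (
    f"{secret['workgroup']}.{secret['account_id']}.{secret['region']}"
    ".redshift-serverless.amazonaws.com:5439"
)
print(endpoint)
```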
It involves reviewing data in detail, comparing and contrasting the data to its own metadata, running statistical models, and producing data quality reports. This is essential to measure and optimize this time, as it has many repercussions on the success of a business. Metadata management: Good data quality control starts with metadata management.
The new approach would need to offer the flexibility to integrate new technologies such as machine learning (ML), scalability to handle long-term retention at forecasted growth levels, and options for cost optimization. Previously, P2 logs were ingested into the SIEM. We discuss these key benefits in the following sections.
The typical Cloudera Enterprise Data Hub Cluster starts with a few dozen nodes in the customer’s data center hosting a variety of distributed services. While this approach provides isolation, it creates another significant challenge: duplication of data, metadata, and security policies, or ‘split-brain’ data lake.
It is a replicated, highly available service that is responsible for managing the metadata for all objects stored in Ozone. The hardware certification includes high-density nodes with close to 500 TB per node optimized for performance and TCO. Optimize the Ozone Client to Ozone Manager protocols for reduced network round trips.
We developed and host several applications for our customers on Amazon Web Services (AWS). These embeddings, along with metadata such as the document ID and page number, are stored in OpenSearch Service. Each search method complemented the other, leading to optimal results. We’re using different models for different use cases.
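For a sense of the document shape (the index name, field names, and endpoint are assumptions), indexing one embedding with opensearch-py might look like:

```python
from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],
    use_ssl=True,
)
client.index(
    index="documents",
    body={
        "embedding": [0.12, -0.03, 0.27],  # truncated example vector
        "document_id": "doc-001",          # metadata stored alongside it
        "page_number": 7,
    },
)
```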
Analyzing historical patterns allows you to optimize performance, identify issues proactively, and improve planning. An AWS Glue crawler scans the data in the S3 bucket and populates table metadata in the AWS Glue Data Catalog. All of the resources are defined in a sample AWS Cloud Development Kit (AWS CDK) template.
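That crawl can also be triggered programmatically; a minimal sketch, assuming a crawler named s3-history-crawler already exists:

```python
import boto3

glue = boto3.client("glue")

# The crawler scans the S3 bucket and writes/updates table definitions
# in the AWS Glue Data Catalog for downstream queries to use.
glue.start_crawler(Name="s3-history-crawler")
```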
At a high level, the core of Langley’s architecture is based on a set of Amazon Simple Queue Service (Amazon SQS) queues and AWS Lambda functions, and a dedicated RDS database to store ETL job data and metadata. Web UI: Amazon MWAA comes with a managed web server that hosts the Airflow UI.
This encompasses tasks such as integrating diverse data from various sources with distinct formats and structures, optimizing the user experience for performance and security, providing multilingual support, and optimizing for cost, operations, and reliability. Based on metadata, content is returned from Amazon S3 to the user.
By separating the compute, the metadata, and data storage, CDW dynamically adapts to changing workloads and resource requirements, speeding up deployment while effectively managing costs, while preserving a shared access and governance model. CDW implements a true win-win architecture for all stakeholders and all data involved.
Replication Manager: A mechanism that helps migrate workloads and metadata between clouds and form factors (CDP Private and Public Cloud) with minimal effort.
Limited flexibility to use more complex hosting models (e.g., public, private, hybrid cloud)? Increased integration costs using different loose or tight coupling approaches between disparate analytical technologies and hosting environments.
It integrates data across a wide range of sources to help optimize the value of ad dollar spending. Its cloud-hosted tool manages customer communications to deliver the right messages at times when they can be absorbed. Along the way, metadata is collected, organized, and maintained to help debug and ensure data integrity.
These statistics are now integrated with the cost-based optimizers (CBO) of Amazon Athena and Amazon Redshift Spectrum, resulting in improved query performance and potential cost savings. The data lake performance optimization is especially important for queries with multiple joins, which is where cost-based optimizers help the most.
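Generating those statistics can be kicked off through the AWS Glue API; a sketch under assumed database, table, and IAM role names:

```python
import boto3

glue = boto3.client("glue")

# Column statistics feed the cost-based optimizers of Athena and
# Redshift Spectrum; all names below are placeholders.
glue.start_column_statistics_task_run(
    DatabaseName="sales_db",
    TableName="orders",
    Role="arn:aws:iam::123456789012:role/GlueColumnStatsRole",
)
```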
In this post, we show how smava optimized their data platform by using Amazon Redshift Serverless and Amazon Redshift data sharing to overcome right-sizing challenges for unpredictable workloads and further improve price-performance. The following diagram shows the high-level data platform architecture before the optimizations.
Amazon SQS receives an Amazon S3 event notification as a JSON file with metadata such as the S3 bucket name, object key, and timestamp. Create an SQS queue: Amazon SQS offers a secure, durable, and available hosted queue that lets you integrate and decouple distributed software systems and components.
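Sketching a consumer of that queue (the queue URL is a placeholder), the bucket name, object key, and timestamp come out of the standard S3 event notification structure:

```python
import json

import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/s3-events"  # placeholder

resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=10)
for msg in resp.get("Messages", []):
    event = json.loads(msg["Body"])
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        timestamp = record["eventTime"]
        print(bucket, key, timestamp)
    # Delete only after the notification has been processed successfully
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```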
This data is sent to Apache Kafka, which is hosted on Amazon Managed Streaming for Apache Kafka (Amazon MSK). Determining optimal table partitioning: Determining optimal partitioning for each table is very important to optimize query performance and minimize the impact on teams querying the tables when partitioning changes.
At the same time, they need to optimize operational costs to unlock the value of this data for timely insights and do so with a consistent performance. Cold storage is optimized to store infrequently accessed or historical data. Organizations often need to manage a high volume of data that is growing at an extraordinary rate.
Burst to Cloud not only relieves pressure on your data center, but it also protects your VIP applications and users by giving them optimal performance without breaking the bank. Today, it is nearly impossible for IT departments to know whether a particular workload is optimal to move from on-premises to the cloud.
Users access the CDF-PC service through the hosted CDP Control Plane. The CDP Control Plane hosts critical components of CDF-PC like the Catalog, the Dashboard, and the ReadyFlow Gallery. This will create a JSON file containing the flow metadata (Figure 3: Export a NiFi process group from an existing cluster).
2020 saw us hosting our first ever fully digital Data Impact Awards ceremony, and it certainly was one of the highlights of our year. Use cases could include but are not limited to: predictive maintenance, log data pipeline optimization, connected vehicles, industrial IoT, fraud detection, patient monitoring, network monitoring, and more.
It can help you create, edit, optimize, fix, and succinctly summarize queries using natural language. This will expand the SQL AI toolbar with buttons to generate, edit, explain, optimize, and fix SQL statements. After using edit, optimize, or fix, a preview shows the differences between the original and modified queries.