Analytics, Data Processing and Metadata

Accelerate your migration to Amazon OpenSearch Service with Reindexing-from-Snapshot

AWS Big Data

NOVEMBER 22, 2024

Each Lucene index (and, therefore, each OpenSearch shard) represents a completely independent search and storage capability hosted on a single machine. How RFS works OpenSearch and Elasticsearch snapshots are a directory tree that contains both data and metadata. The following is an example for the structure of an Elasticsearch 7.10

Snapshot

Snapshot Metadata Recreation/Entertainment Data Processing

How EUROGATE established a data mesh architecture using Amazon DataZone

AWS Big Data

JANUARY 15, 2025

Internally, making data accessible and fostering cross-departmental processing through advanced analytics and data science enhances information use and decision-making, leading to better resource allocation, reduced bottlenecks, and improved operational performance. Eliminate centralized bottlenecks and complex data pipelines.

IoT

IoT Machine Learning Metadata Data-driven

Use Amazon Kinesis Data Streams to deliver real-time data to Amazon OpenSearch Service domains with Amazon OpenSearch Ingestion

AWS Big Data

NOVEMBER 11, 2024

You can use this approach for a variety of use cases, from real-time log analytics to integrating application messaging data for real-time search. This allows the log analytics pipeline to meet Well-Architected best practices for resilience ( REL04-BP02 ) and cost ( COST09-BP02 ).

Metadata

Metadata Metrics Analytics Data Processing

Webinars

Agent Tooling: Connecting AI to Your Tools, Systems & Data

Automation, Evolved: Your New Playbook for Smarter Knowledge Work

Data Talks, CFOs Listen: Why Analytics Are Key To Better Spend Management

Mastering Apache Airflow® 3.0: What’s New (and What’s Next) for Data Orchestration

MORE WEBINARS

Modernize your legacy databases with AWS data lakes, Part 2: Build a data lake using AWS DMS data on Apache Iceberg

AWS Big Data

OCTOBER 30, 2024

format(dbname, table_name)) except Exception as ex: print(ex) failed_table = {"table_name": table_name, "Reason": ex} unprocessed_tables.append(failed_table) def get_table_key(host, port, username, password, dbname): jdbc_url = "jdbc:sqlserver://{0}:{1};databaseName={2}".format(host, To start the job, choose Run. format(dbname)).config("spark.sql.catalog.glue_catalog.catalog-impl",

Data Lake

Data Lake Data Processing Optimization Machine Learning

Demystify data sharing and collaboration patterns on AWS: Choosing the right tool for the job

AWS Big Data

OCTOBER 21, 2024

Let’s briefly describe the capabilities of the AWS services we referred above: AWS Glue is a fully managed, serverless, and scalable extract, transform, and load (ETL) service that simplifies the process of discovering, preparing, and loading data for analytics. Amazon Athena is used to query, and explore the data.

Sales

Sales Data-driven Data Processing Key Performance Indicator

How ANZ Institutional Division built a federated data platform to enable their domain teams to build data products to support business outcomes

AWS Big Data

DECEMBER 4, 2024

These nodes can implement analytical platforms like data lake houses, data warehouses, or data marts, all united by producing data products. The Institutional Data & AI platform adopts a federated approach to data while centralizing the metadata to facilitate simpler discovery and sharing of data products.

Metadata

Metadata Data Governance Data Quality Data-driven

Integrate custom applications with AWS Lake Formation – Part 2

AWS Big Data

NOVEMBER 19, 2024

Add Amplify hosting Amplify can host applications using either the Amplify console or Amazon CloudFront and Amazon Simple Storage Service (Amazon S3) with the option to have manual or continuous deployment. For simplicity, we use the Hosting with Amplify Console and Manual Deployment options.

Data Processing

Data Processing Metadata Publishing Testing

How BMW streamlined data access using AWS Lake Formation fine-grained access control

AWS Big Data

OCTOBER 29, 2024

The CDH is used to create, discover, and consume data products through a central metadata catalog, while enforcing permission policies and tightly integrating data engineering, analytics, and machine learning services to streamline the user journey from data to insight.

Data Lake

Data Lake Sales Metadata Machine Learning

Gartner Data & Analytics Summit – May 12-13 in London

Octopai

MAY 4, 2025

Octopai, CEO will be hosting a session on leveraging Metadata to build knowledge hub for successful enterprise AI implementation. Octopai and Cloudera will be attending and sponsoring the Gartner D&A in London from May 12-13.

Data Analytics

Data Analytics Metadata Data Processing Analytics

Reduce your compute costs for stream processing applications with Kinesis Client Library 3.0

AWS Big Data

NOVEMBER 6, 2024

Load balancing challenges with operating custom stream processing applications Customers processing real-time data streams typically use multiple compute hosts such as Amazon Elastic Compute Cloud (Amazon EC2) to handle the high throughput in parallel. KCL uses DynamoDB to store metadata such as shard-worker mapping and checkpoints.

Cost-Benefit

Cost-Benefit Metadata Optimization Publishing

Modernize a legacy real-time analytics application with Amazon Managed Service for Apache Flink

AWS Big Data

OCTOBER 11, 2023

Organizations with legacy, on-premises, near-real-time analytics solutions typically rely on self-managed relational databases as their data store for analytics workloads. Near-real-time streaming analytics captures the value of operational data and metrics to provide new insights to create business opportunities.

Management

Management Metadata Analytics Dashboards

Manage Amazon OpenSearch Service Visualizations, Alerts, and More with GitHub and Jenkins

AWS Big Data

OCTOBER 24, 2024

Amazon OpenSearch Service is a fully managed service for search and analytics. It allows organizations to secure data, perform searches, analyze logs, monitor applications in real time, and explore interactive log analytics. es.amazonaws.com' # e.g. my-test-domain.us-east-1.es.amazonaws.com, 1)[0] data = open(path, 'r').read()

Visualization

Visualization Management Data Processing Testing

Expanding data analysis and visualization options: Amazon DataZone now integrates with Tableau, Power BI, and more

AWS Big Data

OCTOBER 30, 2024

Amazon DataZone now launched authentication supports through the Amazon Athena JDBC driver, allowing data users to seamlessly query their subscribed data lake assets via popular business intelligence (BI) and analytics tools like Tableau, Power BI, Excel, SQL Workbench, DBeaver, and more.

Visualization

Visualization Data Lake Testing Data Governance

Introducing a new unified data connection experience with Amazon SageMaker Lakehouse unified data connectivity

AWS Big Data

DECEMBER 16, 2024

With the ability to browse metadata, you can understand the structure and schema of the data source, identify relevant tables and fields, and discover useful data assets you may not be aware of. For Host , enter your host name of your Aurora PostgreSQL database cluster. On your project, in the navigation pane, choose Data.

Visualization

Visualization Data Processing Testing Publishing

Disaster recovery strategies for Amazon MWAA – Part 2

AWS Big Data

JUNE 17, 2024

The solution for this post is hosted on GitHub. Backup and restore architecture The backup and restore strategy involves periodically backing up Amazon MWAA metadata to Amazon Simple Storage Service (Amazon S3) buckets in the primary Region. This is the bucket where you host all of your DAGs for your environment. [1.b]

Strategy

Strategy Metadata Recreation/Entertainment Metrics

Federate to Amazon Redshift Query Editor v2 with Microsoft Entra ID

AWS Big Data

DECEMBER 10, 2024

Amazon Redshift is a fast, petabyte-scale, cloud data warehouse that tens of thousands of customers rely on to power their analytics workloads. With its massively parallel processing (MPP) architecture and columnar data storage, Amazon Redshift delivers high price-performance for complex analytical queries against large datasets.

Sales

Sales Metadata Enterprise Testing

Configure a custom domain name for your Amazon MSK cluster

AWS Big Data

JUNE 24, 2024

For the client to resolve DNS queries for the custom domain, an Amazon Route 53 private hosted zone is used to host the DNS records, and is associated with the client’s VPC to enable DNS resolution from the Route 53 VPC resolver. The Kafka client uses the custom domain bootstrap address to send a get metadata request to the NLB.

Advertising

Advertising Data Processing Metadata Management

CIOs are (still) closer than ever to their dream data lakehouse

CIO Business Intelligence

OCTOBER 15, 2024

AWS, which has integrated Iceberg into analytics services like AWS Glue and Amazon Athena, has been actively involved in Iceberg’s development for the past three years. The data catalog is critical because it’s where business manages its metadata,” said Venkat Rajaji, Senior Vice President of Product Management at Cloudera.

Metadata

Metadata Data Processing Uncertainty Data Warehouse

Data confidence begins at the edge

CIO Business Intelligence

SEPTEMBER 23, 2024

For sectors such as industrial manufacturing and energy distribution, metering, and storage, embracing artificial intelligence (AI) and generative AI (GenAI) along with real-time data analytics, instrumentation, automation, and other advanced technologies is the key to meeting the demands of an evolving marketplace, but it’s not without risks.

Manufacturing

Manufacturing Internet of Things Metadata Risk

Governing data in relational databases using Amazon DataZone

AWS Big Data

MAY 7, 2024

As you experience the benefits of consolidating your data governance strategy on top of Amazon DataZone, you may want to extend its coverage to new, diverse data repositories (either self-managed or as managed services) including relational databases, third-party data warehouses, analytic platforms and more.

Metadata

Metadata Data Lake Data Processing Data-driven

Implement a custom subscription workflow for unmanaged Amazon S3 assets published with Amazon DataZone

AWS Big Data

DECEMBER 19, 2024

After you create the asset, you can add glossaries or metadata forms, but its not necessary for this post. Delete the S3 bucket that hosted the unstructured asset. About the Authors Somdeb Bhattacharjee is a Senior Solutions Architect specializing on data and analytics. Enter a name for the asset. Delete the Lambda function.

Publishing

Publishing Unstructured Data Metadata Data-driven

Petabyte-scale log analytics with Amazon S3, Amazon OpenSearch Service, and Amazon OpenSearch Ingestion

AWS Big Data

MARCH 7, 2024

You can take all your data from various silos, aggregate that data in your data lake, and perform analytics and machine learning (ML) directly on top of that data. You can now analyze infrequently queried data in cloud object stores and simultaneously use the operational analytics and visualization capabilities of OpenSearch Service.

Data Lake

Data Lake Analytics Dashboards Metrics

Amazon OpenSearch Service Under the Hood : OpenSearch Optimized Instances(OR1)

AWS Big Data

APRIL 17, 2024

Today, customers widely use OpenSearch Service for operational analytics because of its ability to ingest high volumes of data while also providing rich and interactive analytics. As your operational analytics data velocity and volume of data grows, bottlenecks may emerge.

Optimization

Optimization Snapshot Metadata Cost-Benefit

What you need to know about product management for AI

O'Reilly on Data

MARCH 31, 2020

But there’s a host of new challenges when it comes to managing AI projects: more unknowns, non-deterministic outcomes, new infrastructures, new processes and new tools. You might have millions of short videos , with user ratings and limited metadata about the creators or content. If you can’t walk, you’re unlikely to run.

Management

Management Machine Learning Experimentation Metrics

Migrate an existing data lake to a transactional data lake using Apache Iceberg

AWS Big Data

OCTOBER 3, 2023

You can store your data as-is, without having to first structure the data and then run different types of analytics for better business insights. Analytics use cases on data lakes are always evolving. In this method, the metadata are recreated in an isolated environment and colocated with the existing data files.

Data Lake

Data Lake Metadata Snapshot Recreation/Entertainment

How Cargotec uses metadata replication to enable cross-account data sharing

AWS Big Data

JUNE 7, 2023

This data needs to be ingested into a data lake, transformed, and made available for analytics, machine learning (ML), and visualization. To share the datasets, they needed a way to share access to the data and access to catalog metadata in the form of tables and views. The target accounts read data from the source account S3 buckets.

Metadata

Metadata Data Lake Machine Learning Big Data

Design a data mesh pattern for Amazon EMR-based data lakes using AWS Lake Formation with Hive metastore federation

AWS Big Data

JUNE 10, 2024

Organizations have multiple Hive data warehouses across EMR clusters, where the metadata gets generated. The onboarding of producers is facilitated by sharing metadata, whereas the onboarding of consumers is based on granting permission to access this metadata. The producer account will host the EMR cluster and S3 buckets.

Data Lake

Data Lake Metadata Data Warehouse Data Processing

Gain insights from historical location data using Amazon Location Service and AWS analytics services

AWS Big Data

MARCH 13, 2024

Data analytics – Business analysts gather operational insights from multiple data sources, including the location data collected from the vehicles. The Data Catalog provides metadata that allows analytics applications using Athena to find, read, and process the location data stored in Amazon S3. Choose Run.

Analytics

Analytics IoT Metadata Internet of Things

Enhance your analytics embedding experience with the new Amazon QuickSight JavaScript SDK

AWS Big Data

MARCH 9, 2023

break; } } } const frameOptions = { url: ' ', container: document.getElementById("dashboardContainer"), width: "100%", height: "AutoFit", loadingHeight: "200px", withIframePlaceholder: true, onChange: (changeEvent, metadata) => { switch (changeEvent.eventName) { case 'ERROR': { document.getElementById("dashboardContainer").append('Unable

Slice and Dice

Slice and Dice Dashboards Analytics Interactive

What Is Data Governance? (And Why Your Organization Needs It)

erwin

AUGUST 28, 2020

However, since GDPR’s implementation better decision-making and analytics are their top drivers for investing in data governance. More accurate analytics and improved decision-making: Be more confident in the quality of your data and the decisions you make based on it.

Data Governance

Data Governance IT Cost-Benefit Metadata

Use Amazon OpenSearch Ingestion to migrate to Amazon OpenSearch Serverless

AWS Big Data

FEBRUARY 27, 2024

Migration of metadata such as security roles and dashboard objects will be covered in another subsequent post. Update the following information for the source: Uncomment hosts and specify the endpoint of the existing OpenSearch Service endpoint. For now, you can leave the default minimum as 1 and maximum as 4.

Metadata

Metadata Data Processing Dashboards IoT

Get Your Analytics Insights Instantly – Without Abandoning Central IT

Cloudera

JANUARY 21, 2021

But – you need those mission critical analytics services, and you need them now! . By separating the compute, the metadata, and data storage, CDW dynamically adapts to changing workloads and resource requirements, speeding up deployment while effectively managing costs – while preserving a shared access and governance model.

Data Warehouse

Data Warehouse Data Lake IT Analytics

Modernize your ETL platform with AWS Glue Studio: A case study from BMS

AWS Big Data

DECEMBER 13, 2023

BMS’s EDLS platform hosts over 5,000 jobs and is growing at 15% YoY (year over year). EDLS job steps and metadata Every EDLS job comprises one or more job steps chained together and run in a predefined order orchestrated by the custom ETL framework. It retrieves the specified files and available metadata to show on the UI.

Metadata

Metadata Data Lake Visualization Data Quality

The Ultimate Guide to Modern Data Quality Management (DQM) For An Effective Data Quality Control Driven by The Right Metrics

datapine

SEPTEMBER 29, 2022

quintillion bytes of data being produced on a daily basis and the wide range of online data analysis tools in the market, the use of data and analytics has never been more accessible. It involves: Reviewing data in detail Comparing and contrasting the data to its own metadata Running statistical models Data quality reports.

Data Quality

Data Quality Metrics Data-driven Management

Orchestrate an end-to-end ETL pipeline using Amazon S3, AWS Glue, and Amazon Redshift Serverless with Amazon MWAA

AWS Big Data

APRIL 25, 2024

In the second account, Amazon MWAA is hosted in one VPC and Redshift Serverless in a different VPC, which are connected through VPC peering. Otherwise, it will check the metadata database for the value and return that instead. Create an Airflow connection through the metadata database You can also create connections in the UI.

Metadata

Metadata Data Processing Management Testing

Build SAML identity federation for Amazon OpenSearch Service domains within a VPC

AWS Big Data

FEBRUARY 7, 2024

Amazon OpenSearch Service is a fully managed search and analytics service powered by the Apache Lucene search library that can be operated within a virtual private cloud (VPC). Create an Amazon Route 53 public hosted zone such as mydomain.com to be used for routing internet traffic to your domain. Take note of the group ID.

Dashboards

Dashboards Data Processing Metadata Consulting

How Amazon Finance Automation built a data mesh to support distributed data ownership and centralize governance

AWS Big Data

JULY 14, 2023

In this post, we discuss how the Amazon Finance Automation team used AWS Lake Formation and the AWS Glue Data Catalog to build a data mesh architecture that simplified data governance at scale and provided seamless data access for analytics, AI, and machine learning (ML) use cases. Challenges Amazon businesses have grown over the years.

Finance

Finance Metadata Big Data Recreation/Entertainment

How Cloudera Data Flow Enables Successful Data Mesh Architectures

Cloudera

OCTOBER 7, 2021

Application Logic: Application logic refers to the type of data processing, and can be anything from analytical or operational systems to data pipelines that ingest data inputs, apply transformations based on some business logic and produce data outputs. Data and Metadata: Data inputs and data outputs produced based on the application logic.

Metadata

Metadata Cost-Benefit Enterprise Interactive

Streaming Ingestion for Apache Iceberg With Cloudera Stream Processing

Cloudera

MARCH 2, 2023

Recently, we announced enhanced multi-function analytics support in Cloudera Data Platform (CDP) with Apache Iceberg. Iceberg is a high-performance open table format for huge analytic data sets. To provide the CM host we can copy the FQDN of the node where Cloudera Manager is running.

Snapshot

Snapshot Data Processing Metadata Data Processing

Business Intelligence for Fairs, Congresses and Exhibitions

Smart Data Collective

APRIL 14, 2021

The proper use of business intelligence and analytical data is what drives big brands in a competitive market. This is a self-service analytical platform for business users. Once your analytics team gets it up and running, it can be easy to use by anyone in your business. It comes with embedded dashboards privately and publicly.

Business Intelligence

Business Intelligence Dashboards Visualization Big Data

HEMA accelerates their data governance journey with Amazon DataZone

AWS Big Data

DECEMBER 19, 2024

However, many companies today still struggle to effectively harness and use their data due to challenges such as data silos, lack of discoverability, poor data quality, and a lack of data literacy and analytical capabilities to quickly access and use data across the organization. Tommaso is the Head of Data & Cloud Platforms at HEMA.

Data Governance

Data Governance Publishing Data-driven Metadata

Orca Security’s journey to a petabyte-scale data lake with Apache Iceberg and AWS Analytics

AWS Big Data

JULY 20, 2023

One key component that plays a central role in modern data architectures is the data lake, which allows organizations to store and analyze large amounts of data in a cost-effective manner and run advanced analytics and machine learning (ML) at scale. Moreover, running advanced analytics and ML on disparate data sources proved challenging.

Data Lake

Data Lake Analytics Snapshot Data Quality

Gartner Data & Analytics Summit 2022 in London: 3 Key Takeaways

Alation

MAY 19, 2022

Alation attended last week’s Gartner Data and Analytics Summit in London from May 9 – 11, 2022. Gartner Data & Analytics Summit 2022: Keynote Highlights. Active metadata gives you crucial context around what data you have and how to use it wisely. These are three areas in which analytics is rapidly advancing.

Metadata

Metadata Data Analytics Analytics Data Governance

Introducing AWS Glue crawler and create table support for Apache Iceberg format

AWS Big Data

AUGUST 16, 2023

Iceberg captures metadata information on the state of datasets as they evolve and change over time. AWS Glue crawlers will extract schema information and update the location of Iceberg metadata and schema updates in the Data Catalog.

Data Lake

Data Lake Metadata Snapshot Management

Accelerate your migration to Amazon OpenSearch Service with Reindexing-from-Snapshot

How EUROGATE established a data mesh architecture using Amazon DataZone

Webinars

Trending Sources

Use Amazon Kinesis Data Streams to deliver real-time data to Amazon OpenSearch Service domains with Amazon OpenSearch Ingestion

Webinars

Modernize your legacy databases with AWS data lakes, Part 2: Build a data lake using AWS DMS data on Apache Iceberg

Demystify data sharing and collaboration patterns on AWS: Choosing the right tool for the job

How ANZ Institutional Division built a federated data platform to enable their domain teams to build data products to support business outcomes

Integrate custom applications with AWS Lake Formation – Part 2

How BMW streamlined data access using AWS Lake Formation fine-grained access control

Gartner Data & Analytics Summit – May 12-13 in London

Reduce your compute costs for stream processing applications with Kinesis Client Library 3.0

Modernize a legacy real-time analytics application with Amazon Managed Service for Apache Flink

Manage Amazon OpenSearch Service Visualizations, Alerts, and More with GitHub and Jenkins

Expanding data analysis and visualization options: Amazon DataZone now integrates with Tableau, Power BI, and more

Introducing a new unified data connection experience with Amazon SageMaker Lakehouse unified data connectivity

Disaster recovery strategies for Amazon MWAA – Part 2

Federate to Amazon Redshift Query Editor v2 with Microsoft Entra ID

Configure a custom domain name for your Amazon MSK cluster

CIOs are (still) closer than ever to their dream data lakehouse

Data confidence begins at the edge

Governing data in relational databases using Amazon DataZone

Implement a custom subscription workflow for unmanaged Amazon S3 assets published with Amazon DataZone

Petabyte-scale log analytics with Amazon S3, Amazon OpenSearch Service, and Amazon OpenSearch Ingestion

Amazon OpenSearch Service Under the Hood : OpenSearch Optimized Instances(OR1)

What you need to know about product management for AI

Migrate an existing data lake to a transactional data lake using Apache Iceberg

How Cargotec uses metadata replication to enable cross-account data sharing

Design a data mesh pattern for Amazon EMR-based data lakes using AWS Lake Formation with Hive metastore federation

Gain insights from historical location data using Amazon Location Service and AWS analytics services

Enhance your analytics embedding experience with the new Amazon QuickSight JavaScript SDK

What Is Data Governance? (And Why Your Organization Needs It)

Use Amazon OpenSearch Ingestion to migrate to Amazon OpenSearch Serverless

Get Your Analytics Insights Instantly – Without Abandoning Central IT

Modernize your ETL platform with AWS Glue Studio: A case study from BMS

The Ultimate Guide to Modern Data Quality Management (DQM) For An Effective Data Quality Control Driven by The Right Metrics

Orchestrate an end-to-end ETL pipeline using Amazon S3, AWS Glue, and Amazon Redshift Serverless with Amazon MWAA

Build SAML identity federation for Amazon OpenSearch Service domains within a VPC

How Amazon Finance Automation built a data mesh to support distributed data ownership and centralize governance

How Cloudera Data Flow Enables Successful Data Mesh Architectures

Streaming Ingestion for Apache Iceberg With Cloudera Stream Processing

Business Intelligence for Fairs, Congresses and Exhibitions

HEMA accelerates their data governance journey with Amazon DataZone

Orca Security’s journey to a petabyte-scale data lake with Apache Iceberg and AWS Analytics

Gartner Data & Analytics Summit 2022 in London: 3 Key Takeaways

Introducing AWS Glue crawler and create table support for Apache Iceberg format

Stay Connected