Big Data, Data Processing and Metadata

Accelerate your migration to Amazon OpenSearch Service with Reindexing-from-Snapshot

AWS Big Data

NOVEMBER 22, 2024

Each Lucene index (and, therefore, each OpenSearch shard) represents a completely independent search and storage capability hosted on a single machine. As a backup strategy, snapshots can be created automatically in OpenSearch, or users can create a snapshot manually for restoring it on to a different domain or for data migration.

Snapshot

Snapshot Metadata Recreation/Entertainment Data Processing

Modernize your legacy databases with AWS data lakes, Part 2: Build a data lake using AWS DMS data on Apache Iceberg

AWS Big Data

OCTOBER 30, 2024

format(dbname, table_name)) except Exception as ex: print(ex) failed_table = {"table_name": table_name, "Reason": ex} unprocessed_tables.append(failed_table) def get_table_key(host, port, username, password, dbname): jdbc_url = "jdbc:sqlserver://{0}:{1};databaseName={2}".format(host, To start the job, choose Run. format(dbname)).config("spark.sql.catalog.glue_catalog.catalog-impl",

Data Lake

Data Lake Data Processing Optimization Machine Learning

Integrate custom applications with AWS Lake Formation – Part 2

AWS Big Data

NOVEMBER 19, 2024

Add Amplify hosting Amplify can host applications using either the Amplify console or Amazon CloudFront and Amazon Simple Storage Service (Amazon S3) with the option to have manual or continuous deployment. For simplicity, we use the Hosting with Amplify Console and Manual Deployment options.

Data Processing

Data Processing Metadata Publishing Testing

Webinars

Agent Tooling: Connecting AI to Your Tools, Systems & Data

Automation, Evolved: Your New Playbook for Smarter Knowledge Work

Data Talks, CFOs Listen: Why Analytics Are Key To Better Spend Management

Mastering Apache Airflow® 3.0: What’s New (and What’s Next) for Data Orchestration

MORE WEBINARS

How EUROGATE established a data mesh architecture using Amazon DataZone

AWS Big Data

JANUARY 15, 2025

An extract, transform, and load (ETL) process using AWS Glue is triggered once a day to extract the required data and transform it into the required format and quality, following the data product principle of data mesh architectures. From here, the metadata is published to Amazon DataZone by using AWS Glue Data Catalog.

IoT

IoT Machine Learning Metadata Data-driven

Introducing a new unified data connection experience with Amazon SageMaker Lakehouse unified data connectivity

AWS Big Data

DECEMBER 16, 2024

This approach simplifies your data journey and helps you meet your security requirements. The SageMaker Lakehouse data connection testing capability boosts your confidence in established connections. On your project, in the navigation pane, choose Data. For Add data source , choose Add connection. Choose the plus sign.

Visualization

Visualization Data Processing Testing Publishing

Use Amazon Kinesis Data Streams to deliver real-time data to Amazon OpenSearch Service domains with Amazon OpenSearch Ingestion

AWS Big Data

NOVEMBER 11, 2024

The IAM role ARN must be the same for both the OpenSearch Servicer sink definition and the Kinesis Data Streams source definition. You can control what data gets indexed in different indexes using the index definition in the sink.

Metadata

Metadata Metrics Analytics Data Processing

Demystify data sharing and collaboration patterns on AWS: Choosing the right tool for the job

AWS Big Data

OCTOBER 21, 2024

Given that we are dealing with a SaaS integration, AWS Glue is the logical choice for seamless data ingestion. Next, we focus on building the enterprise data platform where the accumulated data will be hosted. The data security officer sets permissions in Amazon DataZone to allow users to access the data portal.

Sales

Sales Data-driven Data Processing Key Performance Indicator

How BMW streamlined data access using AWS Lake Formation fine-grained access control

AWS Big Data

OCTOBER 29, 2024

To achieve this, they aimed to break down data silos and centralize data from various business units and countries into the BMW Cloud Data Hub (CDH). Consumer accounts : Used by data consumers to implement use cases insights and build applications tailored to their business needs.

Data Lake

Data Lake Sales Metadata Machine Learning

How ANZ Institutional Division built a federated data platform to enable their domain teams to build data products to support business outcomes

AWS Big Data

DECEMBER 4, 2024

Institutional Data & AI Platform architecture The Institutional Division has implemented a self-service data platform to enable the domain teams to build and manage data products autonomously. The following diagram illustrates the building blocks of the Institutional Data & AI Platform.

Metadata

Metadata Data Governance Data Quality Data-driven

Expanding data analysis and visualization options: Amazon DataZone now integrates with Tableau, Power BI, and more

AWS Big Data

OCTOBER 30, 2024

Publish data assets – As the data producer from the retail team, you must ingest individual data assets into Amazon DataZone. For this use case, create a data source and import the technical metadata of four data assets— customers , order_items , orders , products , reviews , and shipments —from AWS Glue Data Catalog.

Visualization

Visualization Data Lake Testing Data Governance

Reduce your compute costs for stream processing applications with Kinesis Client Library 3.0

AWS Big Data

NOVEMBER 6, 2024

Load balancing challenges with operating custom stream processing applications Customers processing real-time data streams typically use multiple compute hosts such as Amazon Elastic Compute Cloud (Amazon EC2) to handle the high throughput in parallel. KCL uses DynamoDB to store metadata such as shard-worker mapping and checkpoints.

Cost-Benefit

Cost-Benefit Metadata Optimization Publishing

Manage Amazon OpenSearch Service Visualizations, Alerts, and More with GitHub and Jenkins

AWS Big Data

OCTOBER 24, 2024

Launch an EC2 instance Note : Make sure to deploy the EC2 instance for hosting Jenkins in the same VPC as the OpenSearch domain. es.amazonaws.com, this will be different for VPC hosted domain region = 'us-east-1' # e.g. us-west-1 service = 'es' credentials = boto3.Session().get_credentials() 1)[0] data = open(path, 'r').read()

Visualization

Visualization Management Data Processing Testing

Disaster recovery strategies for Amazon MWAA – Part 2

AWS Big Data

JUNE 17, 2024

The solution for this post is hosted on GitHub. Backup and restore architecture The backup and restore strategy involves periodically backing up Amazon MWAA metadata to Amazon Simple Storage Service (Amazon S3) buckets in the primary Region. This is the bucket where you host all of your DAGs for your environment. [1.b]

Strategy

Strategy Metadata Recreation/Entertainment Metrics

Federate to Amazon Redshift Query Editor v2 with Microsoft Entra ID

AWS Big Data

DECEMBER 10, 2024

Select the Consumption hosting plan and then choose Select. Save the federation metadata XML file You use the federation metadata file to configure the IAM IdP in a later step. In the Single sign-on section , under SAML Certificates , choose Download for Federation Metadata XML. Log in with your Azure account credentials.

Sales

Sales Metadata Enterprise Testing

Implement a custom subscription workflow for unmanaged Amazon S3 assets published with Amazon DataZone

AWS Big Data

DECEMBER 19, 2024

After you create the asset, you can add glossaries or metadata forms, but its not necessary for this post. You can publish the data asset so its now discoverable within the Amazon DataZone portal. Delete the S3 bucket that hosted the unstructured asset. Enter a name for the asset. For Asset type , choose S3 object collection.

Publishing

Publishing Unstructured Data Metadata Data-driven

Configure a custom domain name for your Amazon MSK cluster

AWS Big Data

JUNE 24, 2024

For the client to resolve DNS queries for the custom domain, an Amazon Route 53 private hosted zone is used to host the DNS records, and is associated with the client’s VPC to enable DNS resolution from the Route 53 VPC resolver. The Kafka client uses the custom domain bootstrap address to send a get metadata request to the NLB.

Advertising

Advertising Data Processing Metadata Management

Governing data in relational databases using Amazon DataZone

AWS Big Data

MAY 7, 2024

As you experience the benefits of consolidating your data governance strategy on top of Amazon DataZone, you may want to extend its coverage to new, diverse data repositories (either self-managed or as managed services) including relational databases, third-party data warehouses, analytic platforms and more.

Metadata

Metadata Data Lake Data Processing Data-driven

Design a data mesh pattern for Amazon EMR-based data lakes using AWS Lake Formation with Hive metastore federation

AWS Big Data

JUNE 10, 2024

In this post, we delve into the key aspects of using Amazon EMR for modern data management, covering topics such as data governance, data mesh deployment, and streamlined data discovery. Organizations have multiple Hive data warehouses across EMR clusters, where the metadata gets generated.

Data Lake

Data Lake Metadata Data Warehouse Data Processing

How Cargotec uses metadata replication to enable cross-account data sharing

AWS Big Data

JUNE 7, 2023

They chose AWS Glue as their preferred data integration tool due to its serverless nature, low maintenance, ability to control compute resources in advance, and scale when needed. To share the datasets, they needed a way to share access to the data and access to catalog metadata in the form of tables and views.

Metadata

Metadata Data Lake Machine Learning Big Data

The Ultimate Guide to Modern Data Quality Management (DQM) For An Effective Data Quality Control Driven by The Right Metrics

datapine

SEPTEMBER 29, 2022

With quality data at their disposal, organizations can form data warehouses for the purposes of examining trends and establishing future-facing strategies. Industry-wide, the positive ROI on quality data is well understood. 2 – Data profiling. Data profiling is an essential process in the DQM lifecycle.

Data Quality

Data Quality Metrics Data-driven Management

Business Intelligence for Fairs, Congresses and Exhibitions

Smart Data Collective

APRIL 14, 2021

Advancement in big data technology has made the world of business even more competitive. The proper use of business intelligence and analytical data is what drives big brands in a competitive market. Formerly known as Periscope, Sisense is a business intelligence tool ideal for cloud data teams.

Business Intelligence

Business Intelligence Dashboards Visualization Big Data

How Amazon Finance Automation built a data mesh to support distributed data ownership and centralize governance

AWS Big Data

JULY 14, 2023

The FinAuto team built AWS Cloud Development Kit (AWS CDK), AWS CloudFormation , and API tools to maintain a metadata store that ingests from domain owner catalogs into the global catalog. This global catalog captures new or updated partitions from the data producer AWS Glue Data Catalogs.

Finance

Finance Metadata Big Data Recreation/Entertainment

Build efficient, cross-Regional, I/O-intensive workloads with Dask on AWS

AWS Big Data

MAY 4, 2023

The Amazon Sustainability Data Initiative (ASDI) uses the capabilities of Amazon S3 to provide a no-cost solution for you to store and share climate science workloads across the globe. Amazon’s Open Data Sponsorship Program allows organizations to host free of charge on AWS.

Data Processing

Data Processing Metadata Informatics Interactive

Orchestrate an end-to-end ETL pipeline using Amazon S3, AWS Glue, and Amazon Redshift Serverless with Amazon MWAA

AWS Big Data

APRIL 25, 2024

Cross-account access has been set up between S3 buckets in Account A with resources in Account B to be able to load and unload data. In the second account, Amazon MWAA is hosted in one VPC and Redshift Serverless in a different VPC, which are connected through VPC peering. secretsmanager ). redshift-serverless.amazonaws.com:5439?

Metadata

Metadata Data Processing Management Testing

Amazon OpenSearch Service Under the Hood : OpenSearch Optimized Instances(OR1)

AWS Big Data

APRIL 17, 2024

The following diagram illustrates an indexing flow involving a metadata update in OR1 During indexing operations, individual documents are indexed into Lucene and also appended to a write-ahead log also known as a translog. This makes snapshots more lightweight and faster because they don’t need to re-upload any additional data.

Optimization

Optimization Snapshot Metadata Cost-Benefit

Modernize a legacy real-time analytics application with Amazon Managed Service for Apache Flink

AWS Big Data

OCTOBER 11, 2023

It shows a call center streaming data source that sends the latest call center feed in every 15 seconds. The second streaming data source constitutes metadata information about the call center organization and agents that gets refreshed throughout the day. client("s3") S3_BUCKET = ' ' kinesis_client = boto3.client("kinesis")

Management

Management Metadata Analytics Dashboards

Build SAML identity federation for Amazon OpenSearch Service domains within a VPC

AWS Big Data

FEBRUARY 7, 2024

Create an Amazon Route 53 public hosted zone such as mydomain.com to be used for routing internet traffic to your domain. For instructions, refer to Creating a public hosted zone. Request an AWS Certificate Manager (ACM) public certificate for the hosted zone. hosted_zone_id – The Route 53 public hosted zone ID.

Dashboards

Dashboards Data Processing Metadata Consulting

Migrate an existing data lake to a transactional data lake using Apache Iceberg

AWS Big Data

OCTOBER 3, 2023

In-place data upgrade In an in-place data migration strategy, existing datasets are upgraded to Apache Iceberg format without first reprocessing or restating existing data. In this method, the metadata are recreated in an isolated environment and colocated with the existing data files. Open AWS Glue Studio.

Data Lake

Data Lake Metadata Snapshot Recreation/Entertainment

How REA Group approaches Amazon MSK cluster capacity planning

AWS Big Data

DECEMBER 5, 2024

In each environment, Hydro manages a single MSK cluster that hosts multiple tenants with differing workload requirements. In the future, we plan to profile workloads based on metadata, cross-check them with capacity metrics, and place them in the appropriate MSK cluster.

Metrics

Metrics Dashboards Testing Optimization

Apache Ozone Powers Data Science in CDP Private Cloud

Cloudera

AUGUST 26, 2021

This means that there is out of the box support for Ozone storage in services like Apache Hive , Apache Impala, Apache Spark, and Apache Nifi, as well as in Private Cloud experiences like Cloudera Machine Learning (CML) and Data Warehousing Experience (DWX). Data ingestion through ‘s3’. awsAccessKey=s3-spark-user/HOST@REALM.COM.

Data Science

Data Science Forecasting Metadata Machine Learning

Enhance monitoring and debugging for AWS Glue jobs using new job observability metrics, Part 3: Visualization and trend analysis using Amazon QuickSight

AWS Big Data

MARCH 29, 2024

Data Firehose uses an AWS Lambda function to transform data and ingest the transformed records into an Amazon Simple Storage Service (Amazon S3) bucket. An AWS Glue crawler scans data on the S3 bucket and populates table metadata on the AWS Glue Data Catalog.

Metrics

Metrics Visualization Dashboards Publishing

Query your Apache Hive metastore with AWS Lake Formation permissions

AWS Big Data

JULY 20, 2023

The Hive metastore is a repository of metadata about the SQL tables, such as database names, table names, schema, serialization and deserialization information, data location, and partition details of each table. Apache Hive, Apache Spark, Presto, and Trino can all use a Hive Metastore to retrieve metadata to run queries.

Data Lake

Data Lake Metadata Data Processing Big Data

Use Amazon OpenSearch Ingestion to migrate to Amazon OpenSearch Serverless

AWS Big Data

FEBRUARY 27, 2024

OSI is a fully managed, serverless data collector that delivers real-time log, metric, and trace data to OpenSearch Service domains and OpenSearch Serverless collections. In this post, we outline the steps to make migrate the data between provisioned OpenSearch Service domains and OpenSearch Serverless.

Metadata

Metadata Data Processing Dashboards IoT

Run Apache Hive workloads using Spark SQL with Amazon EMR on EKS

AWS Big Data

OCTOBER 18, 2023

To run HiveQL-based data workloads with Spark on Kubernetes mode, engineers must embed their SQL queries into programmatic code such as PySpark, which requires additional effort to manually change code. host') export PASSWORD=$(aws secretsmanager get-secret-value --secret-id $secret_name --query SecretString --output text | jq -r '.password')

Big Data

Big Data Data Processing Interactive Testing

How ZS built a clinical knowledge repository for semantic search using Amazon OpenSearch Service and Amazon Neptune

AWS Big Data

SEPTEMBER 12, 2024

We use leading-edge analytics, data, and science to help clients make intelligent decisions. We developed and host several applications for our customers on Amazon Web Services (AWS). These embeddings, along with metadata such as the document ID and page number, are stored in OpenSearch Service.

Unstructured Data

Unstructured Data Metadata Machine Learning Consulting

Modernize your ETL platform with AWS Glue Studio: A case study from BMS

AWS Big Data

DECEMBER 13, 2023

For the past 5 years, BMS has used a custom framework called Enterprise Data Lake Services (EDLS) to create ETL jobs for business users. BMS’s EDLS platform hosts over 5,000 jobs and is growing at 15% YoY (year over year). It retrieves the specified files and available metadata to show on the UI.

Metadata

Metadata Data Lake Visualization Data Quality

Integrate Amazon MWAA with Microsoft Entra ID using SAML authentication

AWS Big Data

JULY 30, 2024

Two private subnets are used to set up the Amazon MWAA environment, and the third private subnet is used to host the AWS Lambda authorizer function. Review the metadata about your certificate and choose Import. Note the values for App Federation Metadata Url and Login URL. Choose Next. Choose Review and import. Choose Save.

Metadata

Metadata Enterprise Management Data Lake

How Data Governance Protects Sensitive Data

erwin

APRIL 2, 2021

Protecting what traditionally has been considered personally identifiable information (PII) — people’s names, addresses, government identification numbers and so forth — that a business collects, and hosts is just the beginning of GDPR mandates.

Data Governance

Data Governance Cost-Benefit Metadata Risk

Introducing AWS Glue crawler and create table support for Apache Iceberg format

AWS Big Data

AUGUST 16, 2023

Iceberg has become very popular for its support for ACID transactions in data lakes and features like schema and partition evolution, time travel, and rollback. Iceberg captures metadata information on the state of datasets as they evolve and change over time. Choose Create.

Data Lake

Data Lake Metadata Snapshot Management

Foote Partners: bonus disparities reveal tech skills most in demand in Q3

CIO Business Intelligence

DECEMBER 16, 2022

The top-earning skills were big data analytics and Ethereum, with a pay premium of 20% of base salary, both up 5.3% Other non-certified skills attracting a pay premium of 19% included data engineering , the Zachman Framework , Azure Key Vault and site reliability engineering (SRE). in the previous six months. since March.

Testing

Testing Metadata Data Processing Machine Learning

Achieve high availability in Amazon OpenSearch Multi-AZ with Standby enabled domains: A deep dive into failovers

AWS Big Data

JANUARY 10, 2024

In a Multi-AZ with Standby domain, the data nodes in the active zone act as coordinators for search requests. During the query phase of a search request, the coordinator determines the shards to be queried and sends a request to the data node hosting the shard copy.

Metadata

Metadata Broadcasting Data Processing Modeling

Build a serverless transactional data lake with Apache Iceberg, Amazon EMR Serverless, and Amazon Athena

AWS Big Data

MARCH 10, 2023

Since the deluge of big data over a decade ago, many organizations have learned to build applications to process and analyze petabytes of data. Data lakes have served as a central repository to store structured and unstructured data at any scale and in various formats.

Data Lake

Data Lake Sales Data Warehouse Snapshot

Build a data lake with Apache Flink on Amazon EMR

AWS Big Data

JANUARY 27, 2023

With a unified data catalog, you can quickly search datasets and figure out data schema, data format, and location. The AWS Glue Data Catalog provides a uniform repository where disparate systems can store and find metadata to keep track of data in data silos. Refer to Catalogs for more information.

Data Lake

Data Lake Metadata Business Analysis Data-driven

HEMA accelerates their data governance journey with Amazon DataZone

AWS Big Data

DECEMBER 19, 2024

Each service is hosted in a dedicated AWS account and is built and maintained by a product owner and a development team, as illustrated in the following figure. The business end-users were given a tool to discover data assets produced within the mesh and seamlessly self-serve on their data sharing needs.

Data Governance

Data Governance Publishing Data-driven Metadata

Accelerate your migration to Amazon OpenSearch Service with Reindexing-from-Snapshot

Modernize your legacy databases with AWS data lakes, Part 2: Build a data lake using AWS DMS data on Apache Iceberg

Webinars

Trending Sources

Integrate custom applications with AWS Lake Formation – Part 2

Webinars

How EUROGATE established a data mesh architecture using Amazon DataZone

Introducing a new unified data connection experience with Amazon SageMaker Lakehouse unified data connectivity

Use Amazon Kinesis Data Streams to deliver real-time data to Amazon OpenSearch Service domains with Amazon OpenSearch Ingestion

Demystify data sharing and collaboration patterns on AWS: Choosing the right tool for the job

How BMW streamlined data access using AWS Lake Formation fine-grained access control

How ANZ Institutional Division built a federated data platform to enable their domain teams to build data products to support business outcomes

Expanding data analysis and visualization options: Amazon DataZone now integrates with Tableau, Power BI, and more

Reduce your compute costs for stream processing applications with Kinesis Client Library 3.0

Manage Amazon OpenSearch Service Visualizations, Alerts, and More with GitHub and Jenkins

Disaster recovery strategies for Amazon MWAA – Part 2

Federate to Amazon Redshift Query Editor v2 with Microsoft Entra ID

Implement a custom subscription workflow for unmanaged Amazon S3 assets published with Amazon DataZone

Configure a custom domain name for your Amazon MSK cluster

Governing data in relational databases using Amazon DataZone

Design a data mesh pattern for Amazon EMR-based data lakes using AWS Lake Formation with Hive metastore federation

How Cargotec uses metadata replication to enable cross-account data sharing

The Ultimate Guide to Modern Data Quality Management (DQM) For An Effective Data Quality Control Driven by The Right Metrics

Business Intelligence for Fairs, Congresses and Exhibitions

How Amazon Finance Automation built a data mesh to support distributed data ownership and centralize governance

Build efficient, cross-Regional, I/O-intensive workloads with Dask on AWS

Orchestrate an end-to-end ETL pipeline using Amazon S3, AWS Glue, and Amazon Redshift Serverless with Amazon MWAA

Amazon OpenSearch Service Under the Hood : OpenSearch Optimized Instances(OR1)

Modernize a legacy real-time analytics application with Amazon Managed Service for Apache Flink

Build SAML identity federation for Amazon OpenSearch Service domains within a VPC

Migrate an existing data lake to a transactional data lake using Apache Iceberg

How REA Group approaches Amazon MSK cluster capacity planning

Apache Ozone Powers Data Science in CDP Private Cloud

Enhance monitoring and debugging for AWS Glue jobs using new job observability metrics, Part 3: Visualization and trend analysis using Amazon QuickSight

Query your Apache Hive metastore with AWS Lake Formation permissions

Use Amazon OpenSearch Ingestion to migrate to Amazon OpenSearch Serverless

Run Apache Hive workloads using Spark SQL with Amazon EMR on EKS

How ZS built a clinical knowledge repository for semantic search using Amazon OpenSearch Service and Amazon Neptune

Modernize your ETL platform with AWS Glue Studio: A case study from BMS

Integrate Amazon MWAA with Microsoft Entra ID using SAML authentication

How Data Governance Protects Sensitive Data

Introducing AWS Glue crawler and create table support for Apache Iceberg format

Foote Partners: bonus disparities reveal tech skills most in demand in Q3

Achieve high availability in Amazon OpenSearch Multi-AZ with Standby enabled domains: A deep dive into failovers

Build a serverless transactional data lake with Apache Iceberg, Amazon EMR Serverless, and Amazon Athena

Build a data lake with Apache Flink on Amazon EMR

HEMA accelerates their data governance journey with Amazon DataZone

Stay Connected