2012 and Metadata - Data Leaders Brief

Enrich your AWS Glue Data Catalog with generative AI metadata using Amazon Bedrock

AWS Big Data

NOVEMBER 15, 2024

Metadata plays an important role in using data assets to make data-driven decisions. Generating metadata for your data assets is often a time-consuming and manual task. This post shows you how to enrich your AWS Glue Data Catalog with dynamic metadata using foundation models (FMs) on Amazon Bedrock and your data documentation.

Metadata

Metadata Modeling Data-driven Machine Learning

Jumia builds a next-generation data platform with metadata-driven specification frameworks

AWS Big Data

DECEMBER 20, 2024

Jumia is a technology company born in 2012, present in 14 African countries, with its main headquarters in Lagos, Nigeria. Solution overview: The basic concept of the modernization project is to create metadata-driven frameworks, which are reusable, scalable, and able to respond to the different phases of the modernization process.

Metadata

Metadata Data-driven Snapshot Data Lake

Use Amazon OpenSearch Ingestion to migrate to Amazon OpenSearch Serverless

AWS Big Data

FEBRUARY 27, 2024

Migration of metadata such as security roles and dashboard objects will be covered in a subsequent post. For index, you can leave it as default, which gets the metadata from the source index and writes to an index of the same name in the destination.

Metadata

Metadata Data Processing Dashboards IoT

Expand data access through Apache Iceberg using Delta Lake UniForm on AWS

AWS Big Data

NOVEMBER 14, 2024

Under the hood, UniForm generates Iceberg metadata files (including metadata and manifest files) that are required for Iceberg clients to access the underlying data files in Delta Lake tables. Both Delta Lake and Iceberg metadata files reference the same data files. The table is registered in AWS Glue Data Catalog.

Metadata

Metadata Data Warehouse Big Data Data Lake

Use Apache Iceberg in your data lake with Amazon S3, AWS Glue, and Snowflake

AWS Big Data

APRIL 3, 2024

Iceberg tables maintain metadata to abstract large collections of files, providing data management features including time travel, rollback, data compaction, and full schema evolution, reducing management overhead. Snowflake writes Iceberg tables to Amazon S3 and updates metadata automatically with every transaction.

Data Lake

Data Lake Snapshot Metadata Data Architecture

Orchestrate an end-to-end ETL pipeline using Amazon S3, AWS Glue, and Amazon Redshift Serverless with Amazon MWAA

AWS Big Data

APRIL 25, 2024

Add this policy to the AWS Glue role and Amazon MWAA role: { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "s3:GetObject", "s3:PutObject", "s3:PutObjectAcl" ], "Resource": "arn:aws:s3:::sample-inp-bucket-etl- /*" } ] } In Account B, create the IAM policy policy_for_roleB specifying Account A as a trusted entity.

Metadata

Metadata Data Processing Management Testing

How Volkswagen streamlined access to data across multiple data lakes using Amazon DataZone – Part 1

AWS Big Data

JULY 18, 2024

This populates the technical metadata in the business data catalog for each data asset. The business metadata can be added by business users to provide business context, tags, and data classification for the datasets. If new AWS Glue tables or metadata are created or updated, then it starts the data source sync job.

Data Lake

Data Lake Publishing Metadata Data-driven

Use AWS Glue Data Catalog views to analyze data

AWS Big Data

MAY 9, 2024

The objective is to create views in the Data Catalog so you can create a single common view schema and metadata object to use across engines (in this case, Athena). Solution overview: For this post, we use the Women’s E-Commerce Clothing Review. Doing so lets you use the same views across your data lakes to fit your use case.

Data Lake

Data Lake Metadata Management Big Data

Design a data mesh pattern for Amazon EMR-based data lakes using AWS Lake Formation with Hive metastore federation

AWS Big Data

JUNE 10, 2024

Organizations have multiple Hive data warehouses across EMR clusters, where the metadata gets generated. The onboarding of producers is facilitated by sharing metadata, whereas the onboarding of consumers is based on granting permission to access this metadata. compute.internal ). Choose Submit job run.

Data Lake

Data Lake Metadata Data Warehouse Data Processing

Use Amazon Kinesis Data Streams to deliver real-time data to Amazon OpenSearch Service domains with Amazon OpenSearch Ingestion

AWS Big Data

NOVEMBER 11, 2024

For example, you can use metadata about the Kinesis data stream name to index by data stream (${getMetadata("kinesis_stream_name")}), or you can use document fields to index data depending on the CloudWatch log group or other document data (${path/to/field/in/document}).

Metadata

Metadata Metrics Analytics Data Processing

Configure ADFS Identity Federation with Amazon QuickSight

AWS Big Data

FEBRUARY 23, 2023

The metadata document from your IdP. To download it, refer to Federation Metadata Explorer. For Metadata document, upload the metadata document you downloaded as a prerequisite. For Federation metadata address, enter [link]. An AD user with permissions to manage AD FS and AD group membership. Choose Add provider.

Metadata

Metadata Dashboards Management Enterprise

Petabyte-scale log analytics with Amazon S3, Amazon OpenSearch Service, and Amazon OpenSearch Ingestion

AWS Big Data

MARCH 7, 2024

After the table is cataloged in your AWS Glue metadata catalog, you can run queries directly on your data in your S3 data lake through OpenSearch Dashboards. You can audit connections to ensure that they are set up in a scalable, cost-efficient, and secure way. Solution overview: The following diagram illustrates the solution architecture.

Data Lake

Data Lake Analytics Dashboards Metrics

Accelerate HiveQL with Oozie to Spark SQL migration on Amazon EMR

AWS Big Data

APRIL 19, 2023

We split the solution into two primary components: generating Spark job metadata and running the SQL on Amazon EMR. The first component (metadata setup) consumes existing Hive job configurations and generates metadata such as number of parameters, number of actions (steps), and file formats. sql_path SQL file name.

Metadata

Metadata Data Lake Testing Consulting

Introducing hybrid access mode for AWS Glue Data Catalog to secure access using AWS Lake Formation and IAM and Amazon S3 policies

AWS Big Data

SEPTEMBER 26, 2023

With Lake Formation, you can manage access control for your data lake data in Amazon Simple Storage Service (Amazon S3) and its metadata in AWS Glue Data Catalog in one place with familiar database-style features. AWS Lake Formation helps you centrally govern, secure, and globally share data for analytics and machine learning.

Data Lake

Data Lake Metadata Management Modeling

Build streaming data pipelines with Amazon MSK Serverless and IAM authentication

AWS Big Data

SEPTEMBER 6, 2023

. // It serves as a simple API Gateway to Kafka Proxy, accepting requests and forwarding them to a Kafka topic. withBody("Message successfully pushed to kafka"); } catch (Exception e) { // In case of exception, log the error message and return a 500 status code log.error(e.getMessage(), e); return response.withBody(e.getMessage()).withStatusCode(500);

Testing

Testing Metadata Cost-Benefit Internet of Things

AWS Glue Data Catalog supports automatic optimization of Apache Iceberg tables through your Amazon VPC

AWS Big Data

NOVEMBER 21, 2024

Similarly, the orphan file deletion process scans the table metadata and the actual data files, identifies the unreferenced files, and deletes them to reclaim storage space. These storage optimizations can help you reduce metadata overhead, control storage costs, and improve query performance. Choose your S3 bucket. Choose Permissions.
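
The orphan file identification described in this snippet can be pictured as a set difference between the files the table metadata references and the files actually present in storage. The following is a minimal Python sketch under that assumption; the function name `find_orphan_files` and the file paths are hypothetical, and this is not how AWS Glue implements the optimizer.

```python
def find_orphan_files(referenced_files, stored_files):
    """Return stored files that no table metadata entry references."""
    referenced = set(referenced_files)
    return sorted(path for path in set(stored_files) if path not in referenced)


# Hypothetical inputs: paths listed in table metadata vs. paths found in storage.
referenced = ["data/a.parquet", "data/b.parquet"]
stored = ["data/a.parquet", "data/b.parquet", "data/tmp-123.parquet"]

orphans = find_orphan_files(referenced, stored)
print(orphans)
```

A real cleanup process would also need to protect files that are still being written, which this sketch does not attempt.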

Optimization

Optimization Snapshot Metadata Software

Use your corporate identities for analytics with Amazon EMR and AWS IAM Identity Center

AWS Big Data

APRIL 26, 2024

int '2' 'InstanceType': 'Ref': 'ClusterInstanceType' 'Market': 'ON_DEMAND' 'Name': 'Core' 'Outputs': 'ClusterId': 'Value': 'Ref': 'EmrCluster' 'Description': 'The ID of the EMR cluster' 'Metadata': 'AWS::CloudFormation::Designer': {} 'Rules': {} Trusted identity propagation is supported from Amazon EMR 6.15

Analytics

Analytics Data Lake Management Enterprise

Federate Amazon QuickSight access with open-source identity provider Keycloak

AWS Big Data

JUNE 13, 2023

Download the SAML metadata file. In the navigation pane under Clients, import the SAML metadata file. Download the Keycloak IdP SAML metadata file from that URL location. For Metadata document, upload the Keycloak IdP SAML metadata XML file you downloaded and saved to your local machine earlier. Choose Browse.

Metadata

Metadata Dashboards Business Intelligence Data Lake

Gain insights from historical location data using Amazon Location Service and AWS analytics services

AWS Big Data

MARCH 13, 2024

The Data Catalog provides metadata that allows analytics applications using Athena to find, read, and process the location data stored in Amazon S3. The crawlers will automatically classify the data into JSON format, group the records into tables and partitions, and commit associated metadata to the AWS Glue Data Catalog. Choose Run.

Analytics

Analytics IoT Metadata Internet of Things

10 Years Later: Who’s the GOAT of Data Catalogs?

Alation

DECEMBER 15, 2022

December 2012: Alation forms and goes to work creating the first enterprise data catalog. August 2017: Alation debuts as a leader in the Gartner MQ for Metadata Management Solutions. August 2018: Gartner names Alation a 2X Leader in the MQ for Metadata Management Solutions. June 2017: Yahoo Japan Corp.

Metadata

Metadata Data Governance Data Quality Marketing

Manage users and group memberships on Amazon QuickSight using SCIM events generated in IAM Identity Center with Azure AD

AWS Big Data

MARCH 22, 2023

The IdP metadata is displayed. In the SAML Certificates section, download the Federation Metadata XML file and the Certificate (Raw) file. For IdP SAML metadata under the Identity provider metadata section, choose Choose file. Choose the previously downloaded metadata file (IIC-QuickSight.xml). Choose Save.

Management

Management Metadata Enterprise Testing

Real-Real-World Programming with ChatGPT

O'Reilly on Data

JULY 25, 2023

To provide some coherence to the music, I decided to use Taylor Swift songs since her discography covers the time span of most papers that I typically read: Her main albums were released in 2006, 2008, 2010, 2012, 2014, 2017, 2019, 2020, and 2022. This choice also inspired me to call my project Swift Papers.

Consulting

Consulting Interactive Software Metadata

Amazon DataZone now integrates with AWS Glue Data Quality and external data quality solutions

AWS Big Data

APRIL 3, 2024

By selecting the corresponding asset, you can understand its content through the readme, glossary terms, and technical and business metadata. We use this data source to import metadata information related to our datasets. Use Amazon DataZone APIs through Boto3 to push custom data quality metadata.

Data Quality

Data Quality Visualization Metadata Metrics

Process and analyze highly nested and large XML files using AWS Glue and Amazon Athena

AWS Big Data

SEPTEMBER 29, 2023

To analyze XML files stored in Amazon S3 using AWS Glue and Athena, we complete the following high-level steps: Create an AWS Glue crawler to extract XML metadata and create a table in the AWS Glue Data Catalog. We use the AWS Glue crawler to extract XML file metadata. We also use a custom XML classifier in this solution.

Metadata

Metadata Visualization Data-driven Optimization

Best practices to implement near-real-time analytics using Amazon Redshift Streaming Ingestion with Amazon MSK

AWS Big Data

MARCH 11, 2024

ORDERTOPIC" WHERE CAN_JSON_PARSE(kafka_value); The metadata column kafka_value that arrives from Amazon MSK is stored in VARBYTE format in Amazon Redshift. For this post, you use the JSON_PARSE function to convert kafka_value to a SUPER data type.

Analytics

Analytics Data Warehouse Optimization Metrics

Why We Started the Data Intelligence Project

Alation

JULY 7, 2022

In 2013 I joined American Family Insurance as a metadata analyst. I had always been fascinated by how people find, organize, and access information, so a metadata management role after school was a natural choice. The use cases for metadata are boundless, offering opportunities for innovation in every sector. The data scientist.

Metadata

Metadata Data-driven Insurance Statistics

Ingest and analyze your data using Amazon OpenSearch Service with Amazon OpenSearch Ingestion

AWS Big Data

JUNE 12, 2024

Amazon SQS receives an Amazon S3 event notification as a JSON file with metadata such as the S3 bucket name, object key, and timestamp. The OpenSearch Ingestion pipeline receives the message from Amazon SQS, loads the files from Amazon S3, and parses the CSV data from the message into columns.
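
To make the metadata in that notification concrete, here is a minimal Python sketch that reads the bucket name, object key, and timestamp out of an S3 event notification body. The helper name `extract_s3_details` and the sample values are hypothetical; the sketch only illustrates the message shape and is not part of the OpenSearch Ingestion pipeline itself.

```python
import json
from urllib.parse import unquote_plus


def extract_s3_details(message_body):
    """Pull bucket name, object key, and event time from an S3 event notification."""
    event = json.loads(message_body)
    details = []
    for record in event.get("Records", []):
        details.append({
            "bucket": record["s3"]["bucket"]["name"],
            # Object keys arrive URL-encoded in S3 event notifications.
            "key": unquote_plus(record["s3"]["object"]["key"]),
            "timestamp": record["eventTime"],
        })
    return details


# Hypothetical notification body, trimmed to the fields used above.
sample_body = json.dumps({
    "Records": [{
        "eventTime": "2024-06-12T00:00:00.000Z",
        "s3": {
            "bucket": {"name": "example-bucket"},
            "object": {"key": "logs/sales+data.csv"},
        },
    }]
})

details = extract_s3_details(sample_body)
print(details)
```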

Dashboards

Dashboards Visualization Sales IoT

Seamless integration of data lake and data warehouse using Amazon Redshift Spectrum and Amazon DataZone

AWS Big Data

AUGUST 15, 2024

Eliminating dependency on business units – Redshift Spectrum uses a metadata layer to directly query the data residing in S3 data lakes, eliminating the need for data copying or relying on individual business units to initiate the copy jobs. Similarly, individual business units produce their own domain-specific data.

Data Lake

Data Lake Data Warehouse Data Governance Publishing

Federate to Amazon Redshift Query Editor v2 with Microsoft Entra ID

AWS Big Data

DECEMBER 10, 2024

Save the federation metadata XML file: You use the federation metadata file to configure the IAM IdP in a later step. In the Single sign-on section, under SAML Certificates, choose Download for Federation Metadata XML. Complete the following steps to download the file: Navigate back to your SAML-based sign-in page.

Sales

Sales Metadata Enterprise Testing

Single sign-on with Amazon Redshift Serverless with Okta using Amazon Redshift Query Editor v2 and third-party SQL clients

AWS Big Data

MAY 4, 2023

Use the IdP metadata in block 4 and save the metadata file in .xml format (for example, metadata.xml). Choose Choose file and upload the metadata file (.xml). Collect Okta information: To gather your Okta information, complete the following steps: On the Sign On tab, choose View SAML setup instructions. Choose Add provider.

Data Warehouse

Data Warehouse Finance Sales Metadata

Design a data mesh on AWS that reflects the envisioned organization

AWS Big Data

JANUARY 22, 2024

Data as a product: Treating data as a product entails three key components: the data itself, the metadata, and the associated code and infrastructure. For orchestration, they use the AWS Cloud Development Kit (AWS CDK) for infrastructure as code (IaC) and AWS Glue Data Catalogs for metadata management.

Data-driven

Data-driven Advertising Metadata Data Architecture

Convergent Evolution

Peter James Thomas

AUGUST 18, 2018

Overlapping with the above, from around 2012, I also began to get involved in designing and implementing Big Data Architectures; initially for narrow purposes and later Data Lakes spanning entire enterprises. This required additional investments in metadata. Of course some architectures featured both paradigms as well.

Data Lake

Data Lake Data Warehouse Data mining Statistics

Manage Amazon OpenSearch Service Visualizations, Alerts, and More with GitHub and Jenkins

AWS Big Data

OCTOBER 24, 2024

In the trust policy, specify that Amazon Elastic Compute Cloud (Amazon EC2) can assume this role: { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Service": "ec2.amazonaws.com" }, "Action": "sts:AssumeRole" } ] } Make a note of the role ARN.
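
For readability, the trust policy from this snippet can be written out as a Python dictionary and serialized to JSON. This is a minimal sketch that assumes the service principal is `ec2.amazonaws.com`, in line with the statement that Amazon EC2 assumes the role; the variable names are illustrative.

```python
import json

# Trust policy allowing the EC2 service to assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "ec2.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# Serialize to the JSON document form that IAM expects.
trust_policy_json = json.dumps(trust_policy, indent=2)
print(trust_policy_json)
```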

Visualization

Visualization Management Data Processing Testing

How BMO improved data security with Amazon Redshift and AWS Lake Formation

AWS Big Data

MARCH 1, 2024

An AWS Glue Crawler scans the above files and catalogs metadata about them into the AWS Glue Data Catalog. Select Create database, as shown in the following screenshot. Repeat the steps to create other databases, such as lobmarket and hr.

Data Lake

Data Lake Data Warehouse Management Risk

Generate security insights from Amazon Security Lake data using Amazon OpenSearch Ingestion

AWS Big Data

AUGUST 28, 2023

An example is provided below ocsf-cuid-${/class_uid}-${/metadata/product/name}-${/class_name}-%{yyyy.MM.dd} Complete the following steps to install the index templates and dashboards for your data: Download the component_templates.zip and index_templates.zip files and unzip them on your local device. Set region as us-east-1.
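
One way to read the index name pattern in this snippet is as a template whose `${/path}` placeholders are filled from fields of each document and whose `%{yyyy.MM.dd}` placeholder is filled from a date. The Python sketch below imitates that substitution for illustration only; the helper names and the sample document values are hypothetical, and the actual resolution is performed by the ingestion pipeline, which may apply additional rules.

```python
import re
from datetime import datetime, timezone


def lookup(document, path):
    """Follow a slash-separated path such as /metadata/product/name."""
    value = document
    for part in path.strip("/").split("/"):
        value = value[part]
    return value


def resolve_index_name(template, document, now):
    """Fill ${/path} placeholders from the document and %{yyyy.MM.dd} from the date."""
    name = re.sub(
        r"\$\{(/[^}]+)\}",
        lambda match: str(lookup(document, match.group(1))),
        template,
    )
    return name.replace("%{yyyy.MM.dd}", now.strftime("%Y.%m.%d"))


template = "ocsf-cuid-${/class_uid}-${/metadata/product/name}-${/class_name}-%{yyyy.MM.dd}"

# Hypothetical document; the field values are made up for the example.
document = {
    "class_uid": 4001,
    "class_name": "network_activity",
    "metadata": {"product": {"name": "example_product"}},
}

index_name = resolve_index_name(
    template, document, datetime(2023, 8, 28, tzinfo=timezone.utc)
)
print(index_name)
```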

Dashboards

Dashboards Visualization Metadata Management

Set up cross-account AWS Glue Data Catalog access using AWS Lake Formation and AWS IAM Identity Center with Amazon Redshift and Amazon QuickSight

AWS Big Data

AUGUST 5, 2024

Member account 2 (Glue_Member_Account) is where metadata is cataloged in the Data Catalog and Lake Formation is enabled with IAM Identity Center integration. We integrate users and groups from the IdP with IAM Identity Center.

Data Lake

Data Lake Finance Sales Management

Introducing a new unified data connection experience with Amazon SageMaker Lakehouse unified data connectivity

AWS Big Data

DECEMBER 16, 2024

With the ability to browse metadata, you can understand the structure and schema of the data source, identify relevant tables and fields, and discover useful data assets you may not be aware of. This approach simplifies your data journey and helps you meet your security requirements. Choose the created IAM role.

Visualization

Visualization Data Processing Testing Publishing

Amazon DataZone announces custom blueprints for AWS services

AWS Big Data

JUNE 26, 2024

Use case 3: Amazon S3 file uploads. In addition to the download functionality, users often need to retain and attach metadata to new versions of files. For example, when you download a file, you can perform data changes, enrichment, or analysis on the file, and then upload the updated version back to the Amazon DataZone portal.

Data Lake

Data Lake Data Warehouse Unstructured Data Data Governance

Themes and Conferences per Pacoid, Episode 10

Domino Data Lab

JUNE 2, 2019

I recall a “Data Drinkup Group” gathering at a pub in Palo Alto, circa 2012, where I overheard Pete Skomoroch talking with other data scientists about Kahneman’s work. Rather, they were beaming about Kahneman’s work and its significance in our field.

Data Science

Data Science Data-driven Machine Learning Modeling

Integrate Okta with Amazon Redshift Query Editor V2 using AWS IAM Identity Center for seamless Single Sign-On

AWS Big Data

NOVEMBER 30, 2023

After you finish entering the required cluster metadata and create the resource, you can check the status for IdC integration in the properties. Note that when a new data warehouse is created, the IAM role specified for IdC integration is automatically attached to the provisioned cluster or Serverless Namespace.

Data Warehouse

Data Warehouse Finance Sales Management

How SumUp made digital analytics more accessible using AWS Glue

AWS Big Data

JUNE 6, 2023

Founded in 2012, SumUp is the financial partner for more than 4 million small merchants in over 35 markets worldwide, helping them start, run and grow their business. Data Catalog: We also wanted to automate a Glue Crawler to have metadata in a Data Catalog and be able to explore our files in S3 with Athena.

Analytics

Analytics Data Lake Testing Optimization

How Novo Nordisk built distributed data governance and control at scale

AWS Big Data

APRIL 28, 2023

When the IdP is created in the previous step, an event is added in an Amazon Simple Notification Service (Amazon SNS) topic with its details, such as name and SAML metadata. In the NNEDH control plane, a Lambda job is triggered by new events on this SNS topic.

Data Governance

Data Governance Management Data-driven Analytics

Data Science, Past & Future

Domino Data Lab

JULY 22, 2019

I went to a meeting at Starbucks with the founder of Alation right before they launched in 2012, drawing on the proverbial back-of-the-napkin. We had Julia Lane talking about Coleridge Initiative and the work on Project Jupyter to support metadata and data governance and lineage. You started to see point solutions.

Data Science

Data Science Machine Learning Data Governance Modeling

Themes and Conferences per Pacoid, Episode 8

Domino Data Lab

APRIL 3, 2019

That’s a lot of priorities – especially when you group together closely related items such as data lineage and metadata management which rank nearby. DG emerges for the big data side of the world, e.g., the Alation launch in 2012. Allows metadata repositories to share and exchange. That would’ve been heresy in earlier years.

Machine Learning

Machine Learning Data Governance Metadata Data Science
