2012 and Metadata - Data Leaders Brief

Expand data access through Apache Iceberg using Delta Lake UniForm on AWS

AWS Big Data

NOVEMBER 14, 2024

Under the hood, UniForm generates Iceberg metadata files (including metadata and manifest files) that are required for Iceberg clients to access the underlying data files in Delta Lake tables. Both Delta Lake and Iceberg metadata files reference the same data files. The table is registered in AWS Glue Data Catalog.

Metadata

Metadata Data Warehouse Big Data Data Lake

Use Amazon Kinesis Data Streams to deliver real-time data to Amazon OpenSearch Service domains with Amazon OpenSearch Ingestion

AWS Big Data

NOVEMBER 11, 2024

For example, you can use metadata about the Kinesis data stream name to index by data stream (${getMetadata("kinesis_stream_name")}), or you can use document fields to index data depending on the CloudWatch log group or other document data (${path/to/field/in/document}).

Metadata

Metadata Metrics Analytics Data Processing

Federate to Amazon Redshift Query Editor v2 with Microsoft Entra ID

AWS Big Data

DECEMBER 10, 2024

Save the federation metadata XML file. You use the federation metadata file to configure the IAM IdP in a later step. In the Single sign-on section, under SAML Certificates, choose Download for Federation Metadata XML. Complete the following steps to download the file: Navigate back to your SAML-based sign-in page.

Sales

Sales Metadata Enterprise Testing

Use Amazon OpenSearch Ingestion to migrate to Amazon OpenSearch Serverless

AWS Big Data

FEBRUARY 27, 2024

Migration of metadata such as security roles and dashboard objects will be covered in a subsequent post. For index, you can leave the default, which reads the metadata from the source index and writes to an index of the same name in the destination.

Metadata

Metadata Data Processing Dashboards IoT

Manage Amazon OpenSearch Service Visualizations, Alerts, and More with GitHub and Jenkins

AWS Big Data

OCTOBER 24, 2024

In the trust policy, specify that Amazon Elastic Compute Cloud (Amazon EC2) can assume this role: { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Service": "ec2.amazonaws.com" }, "Action": "sts:AssumeRole" } ] } Make a note of the role ARN.

Visualization

Visualization Management Data Processing Testing
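
The trust policy quoted in the excerpt above can also be built and serialized in code, which makes it easier to validate before pasting it into the IAM console. The following is a minimal sketch using only the Python standard library; the function name `build_ec2_trust_policy` is a hypothetical helper introduced here for illustration and does not come from the original post.

```python
import json


def build_ec2_trust_policy() -> dict:
    """Return the trust policy from the excerpt as a Python dict.

    The helper name is hypothetical; the policy content mirrors the
    JSON quoted in the excerpt.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Amazon EC2 is the service allowed to assume the role.
                "Principal": {"Service": "ec2.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


if __name__ == "__main__":
    # Serialize to JSON text suitable for a trust policy document.
    print(json.dumps(build_ec2_trust_policy(), indent=2))
```

Round-tripping the dict through `json.dumps` and `json.loads` is a cheap way to catch malformed quoting or stray fragments in a hand-edited policy.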

Introducing a new unified data connection experience with Amazon SageMaker Lakehouse unified data connectivity

AWS Big Data

DECEMBER 16, 2024

With the ability to browse metadata, you can understand the structure and schema of the data source, identify relevant tables and fields, and discover useful data assets you may not be aware of. This approach simplifies your data journey and helps you meet your security requirements. Choose the created IAM role.

Visualization

Visualization Data Processing Testing Publishing

Becoming a machine learning company means investing in foundational technologies

O'Reilly on Data

MAY 21, 2019

Consider deep learning, a specific form of machine learning that resurfaced in 2011/2012 due to record-setting models in speech and computer vision. Metadata and artifacts needed for audits. Use ML to unlock new data types—e.g., images, audio, video. Tackle completely new use cases and applications.

Machine Learning

Machine Learning Technology Deep Learning Data Science

Use Apache Iceberg in your data lake with Amazon S3, AWS Glue, and Snowflake

AWS Big Data

APRIL 3, 2024

Iceberg tables maintain metadata to abstract large collections of files, providing data management features including time travel, rollback, data compaction, and full schema evolution, reducing management overhead. Snowflake writes Iceberg tables to Amazon S3 and updates metadata automatically with every transaction.

Data Lake

Data Lake Snapshot Metadata Data Architecture

Design a data mesh pattern for Amazon EMR-based data lakes using AWS Lake Formation with Hive metastore federation

AWS Big Data

JUNE 10, 2024

Organizations have multiple Hive data warehouses across EMR clusters, where the metadata gets generated. The onboarding of producers is facilitated by sharing metadata, whereas the onboarding of consumers is based on granting permission to access this metadata. compute.internal ). Choose Submit job run.

Data Lake

Data Lake Metadata Data Warehouse Data Processing

Orchestrate an end-to-end ETL pipeline using Amazon S3, AWS Glue, and Amazon Redshift Serverless with Amazon MWAA

AWS Big Data

APRIL 25, 2024

Add this policy to the AWS Glue role and Amazon MWAA role: { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "s3:GetObject", "s3:PutObject", "s3:PutObjectAcl" ], "Resource": "arn:aws:s3:::sample-inp-bucket-etl- /*" } ] } In Account B, create the IAM policy policy_for_roleB specifying Account A as a trusted entity.

Metadata

Metadata Data Processing Management Testing

Configure ADFS Identity Federation with Amazon QuickSight

AWS Big Data

FEBRUARY 23, 2023

The metadata document from your IdP. To download it, refer to Federation Metadata Explorer. For Metadata document , upload the metadata document you downloaded as a prerequisite. For Federation metadata address , enter [link]. An AD user with permissions to manage AD FS and AD group membership. Choose Add provider.

Metadata

Metadata Dashboards Management Enterprise

How Volkswagen streamlined access to data across multiple data lakes using Amazon DataZone – Part 1

AWS Big Data

JULY 18, 2024

This populates the technical metadata in the business data catalog for each data asset. The business metadata can be added by business users to provide business context, tags, and data classification for the datasets. If new AWS Glue tables or metadata are created or updated, then it starts the data source sync job.

Data Lake

Data Lake Publishing Metadata Data-driven

Use AWS Glue Data Catalog views to analyze data

AWS Big Data

MAY 9, 2024

The objective is to create views in the Data Catalog so you can create a single common view schema and metadata object to use across engines (in this case, Athena). Solution overview For this post, we use the Women’s E-Commerce Clothing Review. Doing so lets you use the same views across your data lakes to fit your use case.

Data Lake

Data Lake Metadata Management Big Data

Accelerate HiveQL with Oozie to Spark SQL migration on Amazon EMR

AWS Big Data

APRIL 19, 2023

We split the solution into two primary components: generating Spark job metadata and running the SQL on Amazon EMR. The first component (metadata setup) consumes existing Hive job configurations and generates metadata such as number of parameters, number of actions (steps), and file formats. sql_path SQL file name.

Metadata

Metadata Data Lake Testing Consulting

Build streaming data pipelines with Amazon MSK Serverless and IAM authentication

AWS Big Data

SEPTEMBER 6, 2023

. // It serves as a simple API Gateway to Kafka Proxy, accepting requests and forwarding them to a Kafka topic. withBody("Message successfully pushed to kafka"); } catch (Exception e) { // In case of exception, log the error message and return a 500 status code log.error(e.getMessage(), e); return response.withBody(e.getMessage()).withStatusCode(500);

Testing

Testing Metadata Cost-Benefit Internet of Things

Petabyte-scale log analytics with Amazon S3, Amazon OpenSearch Service, and Amazon OpenSearch Ingestion

AWS Big Data

MARCH 7, 2024

After the table is cataloged in your AWS Glue metadata catalog, you can run queries directly on your data in your S3 data lake through OpenSearch Dashboards. You can audit connections to ensure that they are set up in a scalable, cost-efficient, and secure way. Solution overview The following diagram illustrates the solution architecture.

Data Lake

Data Lake Analytics Dashboards Metrics

Introducing hybrid access mode for AWS Glue Data Catalog to secure access using AWS Lake Formation and IAM and Amazon S3 policies

AWS Big Data

SEPTEMBER 26, 2023

With Lake Formation, you can manage access control for your data lake data in Amazon Simple Storage Service (Amazon S3) and its metadata in AWS Glue Data Catalog in one place with familiar database-style features. AWS Lake Formation helps you centrally govern, secure, and globally share data for analytics and machine learning.

Data Lake

Data Lake Metadata Management Modeling

Gain insights from historical location data using Amazon Location Service and AWS analytics services

AWS Big Data

MARCH 13, 2024

The Data Catalog provides metadata that allows analytics applications using Athena to find, read, and process the location data stored in Amazon S3. The crawlers will automatically classify the data into JSON format, group the records into tables and partitions, and commit associated metadata to the AWS Glue Data Catalog. Choose Run.

Analytics

Analytics IoT Metadata Internet of Things

Use your corporate identities for analytics with Amazon EMR and AWS IAM Identity Center

AWS Big Data

APRIL 26, 2024

int '2' 'InstanceType': 'Ref': 'ClusterInstanceType' 'Market': 'ON_DEMAND' 'Name': 'Core' 'Outputs': 'ClusterId': 'Value': 'Ref': 'EmrCluster' 'Description': 'The ID of the EMR cluster' 'Metadata': 'AWS::CloudFormation::Designer': {} 'Rules': {} Trusted identity propagation is supported from Amazon EMR 6.15

Analytics

Analytics Data Lake Management Enterprise

Integrate custom applications with AWS Lake Formation – Part 1

AWS Big Data

NOVEMBER 19, 2024

With Lake Formation, you can centralize data security and governance using the AWS Glue Data Catalog, letting you manage metadata and data permissions in one place with familiar database-style features. glue:GetUnfilteredTableMetadata: Allows a third-party analytical engine to retrieve unfiltered table metadata from the Data Catalog.

Data Lake

Data Lake Metadata Testing Data Processing

Manage users and group memberships on Amazon QuickSight using SCIM events generated in IAM Identity Center with Azure AD

AWS Big Data

MARCH 22, 2023

The IdP metadata is displayed. In the SAML Certificates section, download the Federation Metadata XML file and the Certificate (Raw) file. For IdP SAML metadata under the Identity provider metadata section, choose Choose file. Choose the previously downloaded metadata file (IIC-QuickSight.xml). Choose Save.

Management

Management Metadata Enterprise Testing

Federate Amazon QuickSight access with open-source identity provider Keycloak

AWS Big Data

JUNE 13, 2023

Download the SAML metadata file. In the navigation pane under Clients, import the SAML metadata file. Download the Keycloak IdP SAML metadata file from that URL location. For Metadata document, upload the Keycloak IdP SAML metadata XML file you downloaded and saved to your local machine earlier. Choose Browse.

Metadata

Metadata Dashboards Business Intelligence Data Lake

10 Years Later: Who’s the GOAT of Data Catalogs?

Alation

DECEMBER 15, 2022

December 2012: Alation forms and goes to work creating the first enterprise data catalog. August 2017: Alation debuts as a leader in the Gartner MQ for Metadata Management Solutions. August 2018: Gartner names Alation a 2X Leader in the MQ for Metadata Management Solutions. June 2017: Yahoo Japan Corp.

Metadata

Metadata Data Governance Data Quality Marketing

Amazon DataZone now integrates with AWS Glue Data Quality and external data quality solutions

AWS Big Data

APRIL 3, 2024

By selecting the corresponding asset, you can understand its content through the readme, glossary terms, and technical and business metadata. We use this data source to import metadata information related to our datasets. Use Amazon DataZone APIs through Boto3 to push custom data quality metadata.

Data Quality

Data Quality Visualization Metadata Metrics

Ingest and analyze your data using Amazon OpenSearch Service with Amazon OpenSearch Ingestion

AWS Big Data

JUNE 12, 2024

Amazon SQS receives an Amazon S3 event notification as a JSON file with metadata such as the S3 bucket name, object key, and timestamp. The OpenSearch Ingestion pipeline receives the message from Amazon SQS, loads the files from Amazon S3, and parses the CSV data from the message into columns.

Dashboards

Dashboards Visualization Sales IoT
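
The excerpt above describes an Amazon S3 event notification arriving through Amazon SQS as JSON that carries the bucket name, object key, and timestamp. As a rough illustration of what a consumer sees, the sketch below pulls those three fields out of a message body using the standard S3 event layout (`Records[].s3.bucket.name`, `Records[].s3.object.key`, `Records[].eventTime`). The function name, the sample bucket, and the sample key are hypothetical; this is not the OpenSearch Ingestion pipeline's own code.

```python
import json
from urllib.parse import unquote_plus


def extract_s3_objects(message_body: str) -> list[dict]:
    """Extract bucket, key, and timestamp from an S3 event notification body."""
    event = json.loads(message_body)
    results = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        results.append(
            {
                "bucket": s3["bucket"]["name"],
                # Object keys are URL-encoded in S3 event notifications.
                "key": unquote_plus(s3["object"]["key"]),
                "event_time": record["eventTime"],
            }
        )
    return results


# Hypothetical sample message, trimmed to the fields used above.
SAMPLE_BODY = json.dumps(
    {
        "Records": [
            {
                "eventTime": "2024-06-12T10:15:30.000Z",
                "s3": {
                    "bucket": {"name": "example-bucket"},
                    "object": {"key": "logs/2024/my+file.csv"},
                },
            }
        ]
    }
)

if __name__ == "__main__":
    for item in extract_s3_objects(SAMPLE_BODY):
        print(item)
```

Decoding the key with `unquote_plus` matters because spaces and special characters in object names arrive encoded, and fetching the object with the raw value would fail.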

Seamless integration of data lake and data warehouse using Amazon Redshift Spectrum and Amazon DataZone

AWS Big Data

AUGUST 15, 2024

Eliminating dependency on business units – Redshift Spectrum uses a metadata layer to directly query the data residing in S3 data lakes, eliminating the need for data copying or relying on individual business units to initiate the copy jobs. Similarly, individual business units produce their own domain-specific data.

Data Lake

Data Lake Data Warehouse Data Governance Publishing

Design a data mesh on AWS that reflects the envisioned organization

AWS Big Data

JANUARY 22, 2024

Data as a product Treating data as a product entails three key components: the data itself, the metadata, and the associated code and infrastructure. For orchestration, they use the AWS Cloud Development Kit (AWS CDK) for infrastructure as code (IaC) and AWS Glue Data Catalogs for metadata management.

Data-driven

Data-driven Advertising Metadata Data Architecture

Single sign-on with Amazon Redshift Serverless with Okta using Amazon Redshift Query Editor v2 and third-party SQL clients

AWS Big Data

MAY 4, 2023

Use the IdP metadata in block 4 and save the metadata file in .xml format (for example, metadata.xml). Choose Choose file and upload the metadata file (.xml). Collect Okta information: To gather your Okta information, complete the following steps: On the Sign On tab, choose View SAML setup instructions. Choose Add provider.

Data Warehouse

Data Warehouse Finance Sales Metadata

Themes and Conferences per Pacoid, Episode 8

Domino Data Lab

APRIL 3, 2019

That’s a lot of priorities – especially when you group together closely related items such as data lineage and metadata management which rank nearby. DG emerges for the big data side of the world, e.g., the Alation launch in 2012. Allows metadata repositories to share and exchange. That would’ve been heresy in earlier years.

Machine Learning

Machine Learning Data Governance Metadata Data Science

Process and analyze highly nested and large XML files using AWS Glue and Amazon Athena

AWS Big Data

SEPTEMBER 29, 2023

To analyze XML files stored in Amazon S3 using AWS Glue and Athena, we complete the following high-level steps: Create an AWS Glue crawler to extract XML metadata and create a table in the AWS Glue Data Catalog. We use the AWS Glue crawler to extract XML file metadata. We also use a custom XML classifier in this solution.

Metadata

Metadata Visualization Data-driven Optimization

Why We Started the Data Intelligence Project

Alation

JULY 7, 2022

In 2013 I joined American Family Insurance as a metadata analyst. I had always been fascinated by how people find, organize, and access information, so a metadata management role after school was a natural choice. The use cases for metadata are boundless, offering opportunities for innovation in every sector. The data scientist.

Metadata

Metadata Data-driven Insurance Statistics

Set up cross-account AWS Glue Data Catalog access using AWS Lake Formation and AWS IAM Identity Center with Amazon Redshift and Amazon QuickSight

AWS Big Data

AUGUST 5, 2024

Member account 2 (Glue_Member_Account) is where metadata is cataloged in the Data Catalog and Lake Formation is enabled with IAM Identity Center integration. Member account 2 (Glue_Member_Account) where metadata is cataloged in the Data Catalog. We integrate users and groups from the IdP with IAM Identity Center.

Data Lake

Data Lake Finance Sales Management

Best practices to implement near-real-time analytics using Amazon Redshift Streaming Ingestion with Amazon MSK

AWS Big Data

MARCH 11, 2024

ORDERTOPIC" WHERE CAN_JSON_PARSE(kafka_value); The metadata column kafka_value that arrives from Amazon MSK is stored in VARBYTE format in Amazon Redshift. For this post, you use the JSON_PARSE function to convert kafka_value to a SUPER data type.

Analytics

Analytics Data Warehouse Optimization Metrics

Amazon DataZone announces custom blueprints for AWS services

AWS Big Data

JUNE 26, 2024

Use case 3: Amazon S3 file uploads In addition to the download functionality, users often need to retain and attach metadata to new versions of files. For example, when you download a file, you can perform data changes, enrichment, or analysis on the file, and then upload the updated version back to the Amazon DataZone portal.

Data Lake

Data Lake Data Warehouse Unstructured Data Data Governance

How BMO improved data security with Amazon Redshift and AWS Lake Formation

AWS Big Data

MARCH 1, 2024

An AWS Glue Crawler scans the above files and catalogs metadata about them into the AWS Glue Data Catalog. Select Create database, as shown in the following screenshot. Repeat the steps to create other databases, such as lobmarket and hr.

Data Lake

Data Lake Data Warehouse Management Risk

Generate security insights from Amazon Security Lake data using Amazon OpenSearch Ingestion

AWS Big Data

AUGUST 28, 2023

An example is provided below: ocsf-cuid-${/class_uid}-${/metadata/product/name}-${/class_name}-%{yyyy.MM.dd}. Complete the following steps to install the index templates and dashboards for your data: Download the component_templates.zip and index_templates.zip files and unzip them on your local device. Set region as us-east-1.

Dashboards

Dashboards Visualization Metadata Management
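
To make the index name pattern in the excerpt above more concrete, the sketch below shows how a template of that shape could expand against a single document: `${/path}` placeholders are read from the document and `%{yyyy.MM.dd}` is filled from a date. This is an illustrative re-implementation in plain Python, not the resolver that OpenSearch Ingestion actually uses. All function names and the sample document values are hypothetical, and only the `yyyy`, `MM`, and `dd` date tokens are handled.

```python
import re
from datetime import datetime, timezone

FIELD_RE = re.compile(r"\$\{(/[^}]*)\}")   # matches ${/path/to/field}
DATE_RE = re.compile(r"%\{([^}]*)\}")      # matches %{yyyy.MM.dd}
JAVA_TO_STRFTIME = {"yyyy": "%Y", "MM": "%m", "dd": "%d"}


def lookup(document: dict, pointer: str):
    """Follow a slash-separated path such as /metadata/product/name."""
    value = document
    for part in pointer.strip("/").split("/"):
        value = value[part]
    return value


def to_strftime(java_pattern: str) -> str:
    """Translate the few supported Java-style date tokens to strftime."""
    for token, directive in JAVA_TO_STRFTIME.items():
        java_pattern = java_pattern.replace(token, directive)
    return java_pattern


def resolve_index_name(pattern: str, document: dict, when: datetime) -> str:
    """Expand field and date placeholders in an index name pattern."""
    name = FIELD_RE.sub(lambda m: str(lookup(document, m.group(1))), pattern)
    return DATE_RE.sub(lambda m: when.strftime(to_strftime(m.group(1))), name)


PATTERN = "ocsf-cuid-${/class_uid}-${/metadata/product/name}-${/class_name}-%{yyyy.MM.dd}"

# Hypothetical document; the values are made up for illustration.
SAMPLE_DOC = {
    "class_uid": 4001,
    "class_name": "network_activity",
    "metadata": {"product": {"name": "vpcflow"}},
}

if __name__ == "__main__":
    when = datetime(2023, 8, 28, tzinfo=timezone.utc)
    print(resolve_index_name(PATTERN, SAMPLE_DOC, when))
    # prints "ocsf-cuid-4001-vpcflow-network_activity-2023.08.28"
```

The practical effect of such a pattern is that documents are routed to separate daily indexes per event class and product, which keeps each index's mapping narrow.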

Themes and Conferences per Pacoid, Episode 12

Domino Data Lab

AUGUST 8, 2019

The gist is, leveraging metadata about research datasets, projects, publications, etc., Once upon a time, circa 2012-ish, data science conferences were replete with talks about an industry hellbent on loading amazing enormous Big Data into some kind of data lake, and applying all kinds of odd astrophysics-ish approaches…for eventual PROFIT!

Data Science

Data Science Machine Learning Data Governance Statistics

Integrate Okta with Amazon Redshift Query Editor V2 using AWS IAM Identity Center for seamless Single Sign-On

AWS Big Data

NOVEMBER 30, 2023

After you finish entering the required cluster metadata and create the resource, you can check the status for IdC integration in the properties. Note that when a new data warehouse is created, the IAM role specified for IdC integration is automatically attached to the provisioned cluster or Serverless Namespace.

Data Warehouse

Data Warehouse Finance Sales Management

Natural Language in Python using spaCy: An Introduction

Domino Data Lab

SEPTEMBER 9, 2019

Let’s analyze text data from the party conventions during the 2012 US Presidential elections. metadata=convention_df["speaker"]). Here’s an interactive visualization for understanding texts: scattertext, a product of the genius of Jason Kessler. get_data(). category="democrat", width_in_pixels=1000,

Deep Learning

Deep Learning Machine Learning Data Science Visualization

Themes and Conferences per Pacoid, Episode 10

Domino Data Lab

JUNE 2, 2019

I recall a “Data Drinkup Group” gathering at a pub in Palo Alto, circa 2012, where I overheard Pete Skomoroch talking with other data scientists about Kahneman’s work. Rather, they were beaming about Kahneman’s work and its significance in our field.

Data Science

Data Science Data-driven Machine Learning Modeling

Enrich your AWS Glue Data Catalog with generative AI metadata using Amazon Bedrock

AWS Big Data

NOVEMBER 15, 2024

Metadata can play a very important role in using data assets to make data-driven decisions. Generating metadata for your data assets is often a time-consuming and manual task. This post shows you how to enrich your AWS Glue Data Catalog with dynamic metadata using foundation models (FMs) on Amazon Bedrock and your data documentation.

Metadata

Metadata Modeling Data-driven Machine Learning

How SumUp made digital analytics more accessible using AWS Glue

AWS Big Data

JUNE 6, 2023

Founded in 2012, SumUp is the financial partner for more than 4 million small merchants in over 35 markets worldwide, helping them start, run and grow their business. Data Catalog: We also wanted to automate a Glue Crawler to have metadata in a Data Catalog and be able to explore our files in S3 with Athena.

Analytics

Analytics Data Lake Testing Optimization

How Novo Nordisk built distributed data governance and control at scale

AWS Big Data

APRIL 28, 2023

When the IdP is created in the previous step, an event is added in an Amazon Simple Notification Service (Amazon SNS) topic with its details, such as name and SAML metadata. In the NNEDH control plane, a Lambda job is triggered by new events on this SNS topic.

Data Governance

Data Governance Management Data-driven Analytics

Build efficient ETL pipelines with AWS Step Functions distributed map and redrive feature

AWS Big Data

DECEMBER 18, 2023

Amazon S3 hosts the metadata of all the tables as a .csv file. The pipeline uses the Step Functions distributed map to read the table metadata from Amazon S3, iterate on every single item, and call the downstream AWS Glue job in parallel to export the data. The following diagram illustrates the Step Functions workflow.

Metadata

Metadata Visualization Data-driven Data Lake
