2012, Big Data and Metadata - Data Leaders Brief

Enrich your AWS Glue Data Catalog with generative AI metadata using Amazon Bedrock

AWS Big Data

NOVEMBER 15, 2024

Metadata can play a very important role in using data assets to make data-driven decisions. Generating metadata for your data assets is often a time-consuming and manual task. First, we explore the option of in-context learning, where the LLM generates the requested metadata without documentation.

Metadata

Metadata Modeling Data-driven Machine Learning

Jumia builds a next-generation data platform with metadata-driven specification frameworks

AWS Big Data

DECEMBER 20, 2024

Jumia is a technology company born in 2012, present in 14 African countries, with its main headquarters in Lagos, Nigeria. Solution overview The basic concept of the modernization project is to create metadata-driven frameworks, which are reusable, scalable, and able to respond to the different phases of the modernization process.

Metadata

Metadata Data-driven Snapshot Data Lake

Expand data access through Apache Iceberg using Delta Lake UniForm on AWS

AWS Big Data

NOVEMBER 14, 2024

The landscape of big data management has been transformed by the rising popularity of open table formats such as Apache Iceberg, Apache Hudi, and Linux Foundation Delta Lake. These formats, designed to address the limitations of traditional data storage systems, have become essential in modern data architectures.

Metadata

Metadata Data Warehouse Big Data Data Lake

Use Apache Iceberg in your data lake with Amazon S3, AWS Glue, and Snowflake

AWS Big Data

APRIL 3, 2024

Data engineers use Apache Iceberg because it’s fast, efficient, and reliable at any scale and keeps records of how datasets change over time. Apache Iceberg offers integrations with popular data processing frameworks such as Apache Spark, Apache Flink, Apache Hive, Presto, and more.

Data Lake

Data Lake Snapshot Metadata Data Architecture

Use Amazon OpenSearch Ingestion to migrate to Amazon OpenSearch Serverless

AWS Big Data

FEBRUARY 27, 2024

OSI is a fully managed, serverless data collector that delivers real-time log, metric, and trace data to OpenSearch Service domains and OpenSearch Serverless collections. In this post, we outline the steps to migrate the data between provisioned OpenSearch Service domains and OpenSearch Serverless.

Metadata

Metadata Data Processing Dashboards IoT

Orchestrate an end-to-end ETL pipeline using Amazon S3, AWS Glue, and Amazon Redshift Serverless with Amazon MWAA

AWS Big Data

APRIL 25, 2024

Add this policy to the AWS Glue role and Amazon MWAA role: { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "s3:GetObject", "s3:PutObject", "s3:PutObjectAcl" ], "Resource": "arn:aws:s3:::sample-inp-bucket-etl- /*" } ] } In Account B, create the IAM policy policy_for_roleB specifying Account A as a trusted entity.

Metadata

Metadata Data Processing Management Testing

Use AWS Glue Data Catalog views to analyze data

AWS Big Data

MAY 9, 2024

The objective is to create views in the Data Catalog so you can create a single common view schema and metadata object to use across engines (in this case, Athena). Doing so lets you use the same views across your data lakes to fit your use case. He specializes in permissions and data catalog features in the data lake.

Data Lake

Data Lake Metadata Management Big Data

Design a data mesh pattern for Amazon EMR-based data lakes using AWS Lake Formation with Hive metastore federation

AWS Big Data

JUNE 10, 2024

In this post, we delve into the key aspects of using Amazon EMR for modern data management, covering topics such as data governance, data mesh deployment, and streamlined data discovery. Organizations have multiple Hive data warehouses across EMR clusters, where the metadata gets generated.

Data Lake

Data Lake Metadata Data Warehouse Data Processing

How Volkswagen streamlined access to data across multiple data lakes using Amazon DataZone – Part 1

AWS Big Data

JULY 18, 2024

Data domain producers publish data assets using datasource run to Amazon DataZone in the Central Governance account. This populates the technical metadata in the business data catalog for each data asset. Data ownership remains with the producer. using following command $ nvm install 18.12.0

Data Lake

Data Lake Publishing Metadata Data-driven

AWS Glue Data Catalog supports automatic optimization of Apache Iceberg tables through your Amazon VPC

AWS Big Data

NOVEMBER 21, 2024

Similarly, the orphan file deletion process scans the table metadata and the actual data files, identifies the unreferenced files, and deletes them to reclaim storage space. These storage optimizations can help you reduce metadata overhead, control storage costs, and improve query performance. Choose your S3 bucket.

Optimization

Optimization Snapshot Metadata Software

Accelerate HiveQL with Oozie to Spark SQL migration on Amazon EMR

AWS Big Data

APRIL 19, 2023

Many customers run big data workloads such as extract, transform, and load (ETL) on Apache Hive to create a data warehouse on Hadoop. We split the solution into two primary components: generating Spark job metadata and running the SQL on Amazon EMR. The script generates a metadata JSON file for each step.

Metadata

Metadata Data Lake Testing Consulting

Use Amazon Kinesis Data Streams to deliver real-time data to Amazon OpenSearch Service domains with Amazon OpenSearch Ingestion

AWS Big Data

NOVEMBER 11, 2024

The IAM role ARN must be the same for both the OpenSearch Service sink definition and the Kinesis Data Streams source definition. You can control what data gets indexed in different indexes using the index definition in the sink.

Metadata

Metadata Metrics Analytics Data Processing

Configure ADFS Identity Federation with Amazon QuickSight

AWS Big Data

FEBRUARY 23, 2023

The metadata document from your IdP. To download it, refer to Federation Metadata Explorer. For Metadata document, upload the metadata document you downloaded as a prerequisite. Select Import data about the relying party published online or on a local network. For Federation metadata address, enter [link].

Metadata

Metadata Dashboards Management Enterprise

Introducing hybrid access mode for AWS Glue Data Catalog to secure access using AWS Lake Formation and IAM and Amazon S3 policies

AWS Big Data

SEPTEMBER 26, 2023

AWS Lake Formation helps you centrally govern, secure, and globally share data for analytics and machine learning. With Lake Formation, you can manage access control for your data lake data in Amazon Simple Storage Service (Amazon S3) and its metadata in AWS Glue Data Catalog in one place with familiar database-style features.

Data Lake

Data Lake Metadata Management Modeling

Petabyte-scale log analytics with Amazon S3, Amazon OpenSearch Service, and Amazon OpenSearch Ingestion

AWS Big Data

MARCH 7, 2024

Direct queries from OpenSearch Service to Amazon S3 use Spark tables within the AWS Glue Data Catalog. After the table is cataloged in your AWS Glue metadata catalog, you can run queries directly on your data in your S3 data lake through OpenSearch Dashboards.

Data Lake

Data Lake Analytics Dashboards Metrics

Build streaming data pipelines with Amazon MSK Serverless and IAM authentication

AWS Big Data

SEPTEMBER 6, 2023

. // It serves as a simple API Gateway to Kafka Proxy, accepting requests and forwarding them to a Kafka topic. withBody("Message successfully pushed to kafka"); } catch (Exception e) { // In case of exception, log the error message and return a 500 status code log.error(e.getMessage(), e); return response.withBody(e.getMessage()).withStatusCode(500);

Testing

Testing Metadata Cost-Benefit Internet of Things

Seamless integration of data lake and data warehouse using Amazon Redshift Spectrum and Amazon DataZone

AWS Big Data

AUGUST 15, 2024

Similarly, individual business units produce their own domain-specific data. There are no duplicate data products created by business units or the Central IT team. This significantly reduces the risk of errors associated with data transfer or movement and data copies.

Data Lake

Data Lake Data Warehouse Data Governance Publishing

Use your corporate identities for analytics with Amazon EMR and AWS IAM Identity Center

AWS Big Data

APRIL 26, 2024

int '2' 'InstanceType': 'Ref': 'ClusterInstanceType' 'Market': 'ON_DEMAND' 'Name': 'Core' 'Outputs': 'ClusterId': 'Value': 'Ref': 'EmrCluster' 'Description': 'The ID of the EMR cluster' 'Metadata': 'AWS::CloudFormation::Designer': {} 'Rules': {} Trusted identity propagation is supported from Amazon EMR 6.15

Analytics

Analytics Data Lake Management Enterprise

Gain insights from historical location data using Amazon Location Service and AWS analytics services

AWS Big Data

MARCH 13, 2024

AWS Glue crawls both S3 bucket paths, populates the AWS Glue database tables based on the inferred schemas, and makes the data available to other analytics applications through the AWS Glue Data Catalog. Athena is used to run geospatial queries on the location data stored in the S3 buckets. detail.EventType TrackerName: $.detail.TrackerName

Analytics

Analytics IoT Metadata Internet of Things

Design a data mesh on AWS that reflects the envisioned organization

AWS Big Data

JANUARY 22, 2024

Data as a product Treating data as a product entails three key components: the data itself, the metadata, and the associated code and infrastructure. In this approach, teams responsible for generating data are referred to as producers. Srikant Das is an Acceleration Lab Solutions Architect at Amazon Web Services.

Data-driven

Data-driven Advertising Metadata Data Architecture

Manage users and group memberships on Amazon QuickSight using SCIM events generated in IAM Identity Center with Azure AD

AWS Big Data

MARCH 22, 2023

The IdP metadata is displayed. In the SAML Certificates section, download the Federation Metadata XML file and the Certificate (Raw) file. For IdP SAML metadata under the Identity provider metadata section, choose Choose file. Choose the previously downloaded metadata file (IIC-QuickSight.xml). Choose Save.

Management

Management Metadata Enterprise Testing

Federate Amazon QuickSight access with open-source identity provider Keycloak

AWS Big Data

JUNE 13, 2023

Download the SAML metadata file. In the navigation pane under Clients, import the SAML metadata file. Download the Keycloak IdP SAML metadata file from that URL location. For Metadata document, upload the Keycloak IdP SAML metadata XML file you downloaded and saved to your local machine earlier. Choose Browse.

Metadata

Metadata Dashboards Business Intelligence Data Lake

Convergent Evolution

Peter James Thomas

AUGUST 18, 2018

Even back then, these were used for activities such as Analytics, Dashboards, Statistical Modelling, Data Mining and Advanced Visualisation. Next, rather than just being the province of Data Scientists, there were moves to use Data Lakes to support general Data Discovery and even business Reporting and Analytics as well.

Data Lake

Data Lake Data Warehouse Data mining Statistics

Introducing a new unified data connection experience with Amazon SageMaker Lakehouse unified data connectivity

AWS Big Data

DECEMBER 16, 2024

This approach simplifies your data journey and helps you meet your security requirements. The SageMaker Lakehouse data connection testing capability boosts your confidence in established connections. About the Authors Chiho Sugimoto is a Cloud Support Engineer on the AWS Big Data Support team. Choose the created IAM role.

Visualization

Visualization Data Processing Testing Publishing

How BMO improved data security with Amazon Redshift and AWS Lake Formation

AWS Big Data

MARCH 1, 2024

An AWS Glue Crawler scans the above files and catalogs metadata about them into the AWS Glue Data Catalog. The Glue Data Catalog organizes this Amazon S3 data into tables and databases, assigning columns and data types so the data can be queried using SQL that Amazon Redshift Spectrum can understand.

Data Lake

Data Lake Data Warehouse Management Risk

Amazon DataZone now integrates with AWS Glue Data Quality and external data quality solutions

AWS Big Data

APRIL 3, 2024

If the asset has AWS Glue Data Quality enabled, you can now quickly visualize the data quality score directly in the catalog search pane. By selecting the corresponding asset, you can understand its content through the readme, glossary terms, and technical and business metadata.

Data Quality

Data Quality Visualization Metadata Metrics

Ingest and analyze your data using Amazon OpenSearch Service with Amazon OpenSearch Ingestion

AWS Big Data

JUNE 12, 2024

Amazon SQS receives an Amazon S3 event notification as a JSON file with metadata such as the S3 bucket name, object key, and timestamp. The OpenSearch Ingestion pipeline receives the message from Amazon SQS, loads the files from Amazon S3, and parses the CSV data from the message into columns.

Dashboards

Dashboards Visualization Sales IoT

Best practices to implement near-real-time analytics using Amazon Redshift Streaming Ingestion with Amazon MSK

AWS Big Data

MARCH 11, 2024

ORDERTOPIC" WHERE CAN_JSON_PARSE(kafka_value); The metadata column kafka_value that arrives from Amazon MSK is stored in VARBYTE format in Amazon Redshift. For this post, you use the JSON_PARSE function to convert kafka_value to a SUPER data type.

Analytics

Analytics Data Warehouse Optimization Metrics

Process and analyze highly nested and large XML files using AWS Glue and Amazon Athena

AWS Big Data

SEPTEMBER 29, 2023

To analyze XML files stored in Amazon S3 using AWS Glue and Athena, we complete the following high-level steps: Create an AWS Glue crawler to extract XML metadata and create a table in the AWS Glue Data Catalog. We use the AWS Glue crawler to extract XML file metadata. We also use a custom XML classifier in this solution.

Metadata

Metadata Visualization Data-driven Optimization

Federate to Amazon Redshift Query Editor v2 with Microsoft Entra ID

AWS Big Data

DECEMBER 10, 2024

Save the federation metadata XML file You use the federation metadata file to configure the IAM IdP in a later step. In the Single sign-on section, under SAML Certificates, choose Download for Federation Metadata XML. Complete the following steps to download the file: Navigate back to your SAML-based sign-in page.

Sales

Sales Metadata Enterprise Testing

Set up cross-account AWS Glue Data Catalog access using AWS Lake Formation and AWS IAM Identity Center with Amazon Redshift and Amazon QuickSight

AWS Big Data

AUGUST 5, 2024

Member account 2 (Glue_Member_Account) is where metadata is cataloged in the Data Catalog and Lake Formation is enabled with IAM Identity Center integration. Member account 2 (Glue_Member_Account) where metadata is cataloged in the Data Catalog. In the navigation pane, under Data Catalog, choose Catalog settings.

Data Lake

Data Lake Finance Sales Management

Single sign-on with Amazon Redshift Serverless with Okta using Amazon Redshift Query Editor v2 and third-party SQL clients

AWS Big Data

MAY 4, 2023

Use the IdP metadata in block 4 and save the metadata file in .xml format (for example, metadata.xml). Choose Choose file and upload the metadata file (.xml). Collect Okta information To gather your Okta information, complete the following steps: On the Sign On tab, choose View SAML setup instructions. Choose Add provider.

Data Warehouse

Data Warehouse Finance Sales Metadata

Amazon DataZone announces custom blueprints for AWS services

AWS Big Data

JUNE 26, 2024

This functionality streamlines the process of finding and accessing unstructured data and allows you to download multiple files at once, enabling you to build and enhance your analytics more efficiently. His expertise spans across data analytics, data governance, AI, ML, big data, and healthcare-related technologies.

Data Lake

Data Lake Data Warehouse Unstructured Data Data Governance

Manage Amazon OpenSearch Service Visualizations, Alerts, and More with GitHub and Jenkins

AWS Big Data

OCTOBER 24, 2024

In the trust policy, specify that Amazon Elastic Compute Cloud (Amazon EC2) can assume this role: { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Service": "ec2.amazonaws.com" }, "Action": "sts:AssumeRole" } ] } Make a note of the role ARN.

Visualization

Visualization Management Data Processing Testing

Themes and Conferences per Pacoid, Episode 10

Domino Data Lab

JUNE 2, 2019

ans from Nick Elprin, CEO and co-founder of Domino Data Lab, about the importance of model-driven business: “Being data-driven is like navigating by watching the rearview mirror. If your business is using big data and putting dashboards in front of analysts, you’re missing the point.” I consider that a healthy trend.

Data Science

Data Science Data-driven Machine Learning Modeling

Generate security insights from Amazon Security Lake data using Amazon OpenSearch Ingestion

AWS Big Data

AUGUST 28, 2023

An example is provided below ocsf-cuid-${/class_uid}-${/metadata/product/name}-${/class_name}-%{yyyy.MM.dd} Complete the following steps to install the index templates and dashboards for your data: Download the component_templates.zip and index_templates.zip files and unzip them on your local device. Set region as us-east-1.

Dashboards

Dashboards Visualization Metadata Management

Integrate Okta with Amazon Redshift Query Editor V2 using AWS IAM Identity Center for seamless Single Sign-On

AWS Big Data

NOVEMBER 30, 2023

Note that when a new data warehouse is created, the IAM role specified for IdC integration is automatically attached to the provisioned cluster or Serverless Namespace. After you finish entering the required cluster metadata and create the resource, you can check the status for IdC integration in the properties.

Data Warehouse

Data Warehouse Finance Sales Management

Data Science, Past & Future

Domino Data Lab

JULY 22, 2019

By virtue of that, if you take those log files of customer interactions, you aggregate them, then you take that aggregated data, run machine learning models on them, you can produce data products that you feed back into your web apps, and then you get this kind of effect in business. That was the origin of big data.

Data Science

Data Science Machine Learning Data Governance Modeling

How SumUp made digital analytics more accessible using AWS Glue

AWS Big Data

JUNE 6, 2023

Founded in 2012, SumUp is the financial partner for more than 4 million small merchants in over 35 markets worldwide, helping them start, run and grow their business. Data Catalog: We also wanted to automate a Glue Crawler to have metadata in a Data Catalog and be able to explore our files in S3 with Athena.

Analytics

Analytics Data Lake Testing Optimization

How Novo Nordisk built distributed data governance and control at scale

AWS Big Data

APRIL 28, 2023

When the IdP is created in the previous step, an event is added in an Amazon Simple Notification Service (Amazon SNS) topic with its details, such as name and SAML metadata. In the NNEDH control plane, a Lambda job is triggered by new events on this SNS topic.

Data Governance

Data Governance Management Data-driven Analytics

Themes and Conferences per Pacoid, Episode 8

Domino Data Lab

APRIL 3, 2019

It includes perspectives about current issues, themes, vendors, and products for data governance. My interest in data governance (DG) began with the recent industry surveys by O’Reilly Media about enterprise adoption of “ABC” (AI, Big Data, Cloud). We keep feeding the monster data. the flywheel effect.

Machine Learning

Machine Learning Data Governance Metadata Data Science

Integrate custom applications with AWS Lake Formation – Part 1

AWS Big Data

NOVEMBER 19, 2024

AWS Lake Formation makes it straightforward to centrally govern, secure, and globally share data for analytics and machine learning (ML). With Lake Formation, you can centralize data security and governance using the AWS Glue Data Catalog, letting you manage metadata and data permissions in one place with familiar database-style features.

Data Lake

Data Lake Metadata Testing Data Processing

Build efficient ETL pipelines with AWS Step Functions distributed map and redrive feature

AWS Big Data

DECEMBER 18, 2023

There are multiple tables related to customers and order data in the RDS database. Amazon S3 hosts the metadata of all the tables as a .csv file. The following diagram illustrates the Step Functions workflow. In this example, you can identify that the exception is due to Glue.ConcurrentRunsExceededException from AWS Glue.

Metadata

Metadata Visualization Data-driven Data Lake

Themes and Conferences per Pacoid, Episode 12

Domino Data Lab

AUGUST 8, 2019

I mention this here because there was a lot of overlap between current industry data governance needs and what the scientific community is working toward for scholarly infrastructure. The gist is, leveraging metadata about research datasets, projects, publications, etc., Or something. Nothing Spreads Like Fear”.

Data Science

Data Science Machine Learning Data Governance Statistics
