Amazon Athena provides interactive analytics service for analyzing the data in Amazon Simple Storage Service (Amazon S3). Amazon Redshift is used to analyze structured and semi-structured data across data warehouses, operational databases, and data lakes. Table metadata is fetched from AWS Glue.
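As a minimal sketch of this pattern, the following snippet submits an Athena query with boto3 against a table whose schema is resolved through the Glue Data Catalog; the database name, table, and results bucket are hypothetical placeholders.

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Hypothetical database, table, and results location for illustration.
response = athena.start_query_execution(
    QueryString="SELECT * FROM sales LIMIT 10",
    QueryExecutionContext={"Database": "analytics_db"},  # schema comes from the Glue Data Catalog
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/queries/"},
)
print(response["QueryExecutionId"])
```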
An extract, transform, and load (ETL) process using AWS Glue is triggered once a day to extract the required data and transform it into the required format and quality, following the data product principle of data mesh architectures. Lakshmi Nair is a Senior Specialist Solutions Architect for Data Analytics at AWS.
With the ability to browse metadata, you can understand the structure and schema of the data source, identify relevant tables and fields, and discover useful data assets you may not be aware of. About the Authors Chiho Sugimoto is a Cloud Support Engineer on the AWS Big Data Support team.
With quality data at their disposal, organizations can form data warehouses for the purposes of examining trends and establishing future-facing strategies. Industry-wide, the positive ROI on quality data is well understood. 2 – Data profiling. Data profiling is an essential process in the DQM lifecycle.
There are countless examples of big data transforming many different industries. There is no disputing the fact that the collection and analysis of massive amounts of unstructured data has been a huge breakthrough. We would like to talk about data visualization and its role in the big data movement.
In addition to using native managed AWS services that BMS didn’t need to worry about upgrading, BMS was looking to offer an ETL service to non-technical business users that could visually compose data transformation workflows and seamlessly run them on the AWS Glue Apache Spark-based serverless data integration engine.
Traditionally, such a legacy call center analytics platform would be built on a relational database that stores data from streaming sources. Data transformations through stored procedures and the use of materialized views to curate datasets and generate insights are a known pattern with relational databases.
It’s a set of HTTP endpoints to perform operations such as invoking Directed Acyclic Graphs (DAGs), checking task statuses, retrieving metadata about workflows, managing connections and variables, and even initiating dataset-related events, without directly accessing the Airflow web interface or command line tools.
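For illustration, a minimal sketch of calling the Airflow stable REST API with the requests library is shown below; the host, credentials, DAG ID, and task ID are placeholders, and the authentication mechanism depends on the deployment (Amazon MWAA, for example, uses its own token flow).

```python
import requests

AIRFLOW_URL = "https://my-airflow-host/api/v1"  # placeholder webserver URL
AUTH = ("api_user", "api_password")             # placeholder basic-auth credentials

# Trigger a DAG run
run = requests.post(
    f"{AIRFLOW_URL}/dags/example_etl/dagRuns",
    json={"conf": {"run_date": "2024-01-01"}},
    auth=AUTH,
).json()

# Check the state of one task instance in that run
task = requests.get(
    f"{AIRFLOW_URL}/dags/example_etl/dagRuns/{run['dag_run_id']}/taskInstances/extract_task",
    auth=AUTH,
).json()
print(task["state"])
```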
Amazon EMR on EKS provides a deployment option for Amazon EMR that allows organizations to run open-source big data frameworks on Amazon Elastic Kubernetes Service (Amazon EKS). About the Authors Melody Yang is a Senior Big Data Solution Architect for Amazon EMR at AWS. As of Amazon EMR 6.5, this deployment option delivers up to 61% lower costs.
It seamlessly consolidates data from various data sources within AWS, including AWS Cost Explorer (and forecasting with Cost Explorer ), AWS Trusted Advisor , and AWS Compute Optimizer. Data providers and consumers are the two fundamental users of a CDH dataset. You might notice that this differs slightly from traditional ETL.
Solution overview: The following diagram illustrates the solution architecture. The solution uses AWS Glue as an ETL engine to extract data from the source Amazon RDS database. Built-in data transformations then scrub columns containing PII using pre-defined masking functions. This saves time over manually defining schemas.
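The excerpt refers to pre-defined masking functions in AWS Glue; as a conceptual stand-in rather than the actual built-in transforms, the following PySpark sketch hashes one PII column and redacts another, with hypothetical paths and column names.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sha2, regexp_replace

spark = SparkSession.builder.appName("pii-masking-sketch").getOrCreate()

# Hypothetical input location and column names for illustration.
df = spark.read.parquet("s3://example-bucket/raw/customers/")

masked = (
    df.withColumn("email", sha2(col("email"), 256))                   # one-way hash of the email column
      .withColumn("phone", regexp_replace(col("phone"), r"\d", "*"))  # redact digits in the phone column
)

masked.write.mode("overwrite").parquet("s3://example-bucket/clean/customers/")
```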
The Orca Platform is powered by a state-of-the-art anomaly detection system that uses cutting-edge ML algorithms and big data capabilities to detect potential security threats and alert customers in real time, ensuring maximum security for their cloud environment. This ensures that the data is suitable for training purposes.
A combination of Amazon Redshift Spectrum and COPY commands is used to ingest the survey data stored as CSV files. For the files with unknown structures, AWS Glue crawlers are used to extract metadata and create table definitions in the Data Catalog. She helps customers architect data analytics solutions at scale on AWS.
To run HiveQL-based data workloads with Spark on Kubernetes mode, engineers must embed their SQL queries into programmatic code such as PySpark, which requires additional effort to manually change code. About the authors Amit Maindola is a Senior Data Architect focused on big data and analytics at Amazon Web Services.
You can see the decompressed data has metadata information such as logGroup , logStream , and subscriptionFilters , and the actual data is included within the message field under logEvents (the following example shows CloudTrail events in CloudWatch Logs).
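As a sketch of how such a record can be unpacked (assuming the standard base64-encoded, gzip-compressed CloudWatch Logs subscription format), the following helper decodes one record and walks the fields named above.

```python
import base64
import gzip
import json

def decode_cloudwatch_record(record):
    """Decode one record carrying CloudWatch Logs subscription data (assumed gzip + base64)."""
    payload = gzip.decompress(base64.b64decode(record["data"]))
    doc = json.loads(payload)

    # Metadata fields, plus the actual events under logEvents
    print(doc["logGroup"], doc["logStream"], doc["subscriptionFilters"])
    for event in doc["logEvents"]:
        print(event["message"])
    return doc
```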
You can also use the data transformation feature of Data Firehose to invoke a Lambda function to perform data transformation in batches. Athena is used to run geospatial queries on the location data stored in the S3 buckets. Choose Run.
Incremental query refers to a query strategy that focuses on processing and analyzing only the new or updated data within a data lake since the last query. The key idea behind incremental queries is to use metadata or change tracking mechanisms to identify the new or modified data since the last query.
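The excerpt does not name a specific table format; as one concrete example, the following PySpark sketch runs an Apache Hudi incremental query, where the table path and the checkpointed commit time are placeholders that a real pipeline would persist between runs.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("incremental-query-sketch").getOrCreate()

# Placeholder: the commit time recorded by the previous run of this query.
last_commit_time = "20240101000000000"

incremental_df = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", last_commit_time)
    .load("s3://example-bucket/lake/orders/")  # hypothetical Hudi table location
)
incremental_df.show()
```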
The entire generative AI pipeline hinges on the data pipelines that empower it, making it imperative to take the correct precautions. 4 key components to ensure reliable data ingestion Data quality and governance: Data quality means ensuring the security of data sources, maintaining holistic data and providing clear metadata.
Due to this low complexity, the solution uses AWS serverless services to ingest the data, transform it, and make it available for analytics. The Data Catalog now contains references to the machine-readable data. Use the Data Catalog and transform the hospital price transparency data.
Data Vault 2.0 allows for the following: agile data warehouse development, parallel data ingestion, a scalable approach to handle multiple data sources even on the same entity, a high level of automation, historization, and full lineage support. However, Data Vault 2.0
It includes processes that trace and document the origin of data, models and associated metadata and pipelines for audits. A data store lets a business connect existing data with new data and discover new insights with real-time analytics and business intelligence. Track models and drive transparent processes.
We took this a step further by creating a blueprint to create smart recommendations by linking similar data products using graph technology and ML. In this post, we showed how an organization can augment a data catalog with additional metadata by using ML and Neptune with an automated process.
This involves unifying and sharing a single copy of data and metadata across IBM® watsonx.data ™, IBM® Db2 ®, IBM® Db2® Warehouse and IBM® Netezza ®, using native integrations and supporting open formats, all without the need for migration or recataloging.
Extract, Load, Transform (ELT) tools. Data ingestion/integration services. Data orchestration tools. These tools are used to manage big data, which is defined as data that is too large or complex to be processed by traditional means. How Did the Modern Data Stack Get Started? Reverse ETL tools.
Alternatively, you can use AWS Glue for Apache Spark, which provides built-in support for bucketing configurations during the data transformation process. AWS Glue allows you to define bucketing parameters, such as the number of buckets and the columns to bucket on, providing an optimized data layout for efficient querying with Athena.
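A minimal sketch of such a bucketed write in Spark is shown below; the paths, table name, bucket count, and bucketing column are hypothetical, and the resulting table definition only lands in the Glue Data Catalog if the job is configured to use it as the Hive metastore.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hypothetical source data and table layout for illustration.
df = spark.read.parquet("s3://example-bucket/raw/events/")

(
    df.write
    .format("parquet")
    .bucketBy(16, "customer_id")  # number of buckets and the bucketing column
    .sortBy("customer_id")
    .option("path", "s3://example-bucket/curated/events/")
    .mode("overwrite")
    .saveAsTable("analytics_db.events_bucketed")
)
```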
You can then apply transformations and store data in Delta format for managing inserts, updates, and deletes. Amazon EMR Serverless is a serverless option in Amazon EMR that makes it easy for data analysts and engineers to run open-source big data analytics frameworks without configuring, managing, and scaling clusters or servers.
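To illustrate the inserts, updates, and deletes point, here is a hedged Delta Lake merge sketch in PySpark; the table path, join key, and shape of the incoming updates DataFrame are assumptions.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("delta-merge-sketch").getOrCreate()

# Hypothetical incoming changes and target Delta table location.
updates = spark.read.parquet("s3://example-bucket/staging/orders_changes/")
target = DeltaTable.forPath(spark, "s3://example-bucket/delta/orders/")

(
    target.alias("t")
    .merge(updates.alias("u"), "t.order_id = u.order_id")  # assumed join key
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```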
The task_id used can be the name of your choice; here we use add_steps:

```python
# EMR steps to be executed by the EMR cluster
SPARK_TEST_STEPS = [{
    'Name': 'Run Spark',
    'ActionOnFailure': 'CANCEL_AND_WAIT',
    'HadoopJarStep': {
        'Jar': 'command-runner.jar',
        'Args': [
            'spark-submit',
            '/home/hadoop/aggregations.py',
            's3://{}/data/transformed/green'.format(S3_BUCKET_NAME),
        ],
    },
}]
```
The following AWS services are used for data ingestion, processing, and load: Amazon AppFlow is a fully managed integration service that enables you to securely transfer data between SaaS applications like Salesforce, SAP, Marketo, Slack, and ServiceNow, and AWS services like Amazon S3 and Amazon Redshift , in just a few clicks.
This adds an additional ETL step, making the data even more stale. Data lakehouse was created to solve these problems. The data warehouse storage layer is removed from lakehouse architectures. Instead, continuous data transformation is performed within the BLOB storage. Data fabric promotes data discoverability.
foundation models to help users discover, augment, and enrich data with natural language. Watsonx.data is built on 3 core integrated components: multiple query engines, a catalog that keeps track of metadata, and storage and relational data sources which the query engines directly access.
With all the data in and around the enterprise, users would say that they have a lot of information but need more insights to assist them in producing better and more informative content. This is where we dispel an old “big data” notion (heard a decade ago) that was expressed like this: “we need our data to run at the speed of business.”
Publish data assets – As the data producer from the retail team, you must ingest individual data assets into Amazon DataZone. For this use case, create a data source and import the technical metadata of six data assets (customers, order_items, orders, products, reviews, and shipments) from the AWS Glue Data Catalog.
Institutional Data & AI Platform architecture The Institutional Division has implemented a self-service data platform to enable the domain teams to build and manage data products autonomously. The following diagram illustrates the building blocks of the Institutional Data & AI Platform.
With a unified data catalog, you can quickly search datasets and figure out data schema, data format, and location. The AWS Glue Data Catalog provides a uniform repository where disparate systems can store and find metadata to keep track of data in data silos. Refer to Catalogs for more information.
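As a small sketch of browsing that metadata programmatically with boto3 (the database name and search term are hypothetical):

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# List tables in one catalog database and print schema and location details.
for table in glue.get_tables(DatabaseName="analytics_db")["TableList"]:
    columns = [c["Name"] for c in table["StorageDescriptor"]["Columns"]]
    print(table["Name"], table["StorageDescriptor"]["Location"], columns)

# Free-text search across the catalog for datasets matching a keyword.
results = glue.search_tables(SearchText="orders")
print([t["Name"] for t in results["TableList"]])
```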
dbt is an open source, SQL-first templating engine that allows you to write repeatable and extensible data transforms in Python and SQL. dbt is predominantly used by customers of data warehouses (such as Amazon Redshift) who are looking to keep their data transform logic separate from storage and engine.
For GlueDatabaseName, enter a unique name for the Data Catalog database to hold the Jira data table metadata (the default is jiralake). This mode will scan all data and disable the change data capture (CDC) features of the stack. The DataBrew job performs data transformation and filtering tasks.
When designing the data processing pipeline for the attribute API, the Infomedia team wanted to use a flexible and open-source solution for processing data workloads with minimal operational overhead. The API retrieves data at runtime from an Amazon Aurora PostgreSQL-Compatible Edition database for end-user consumption.
In addition, more data is becoming available for processing and enrichment of existing and new use cases; for example, we have recently experienced rapid growth in data collection at the edge and an increase in the availability of frameworks for processing that data. As a result, alternative data integration technologies (e.g.,
This was, without question, a significant departure from traditional analytic environments, which often meant vendor lock-in and the inability to work with data at scale. Another unexpected challenge was the introduction of Spark as a processing framework for big data. Comprehensive data security and data governance (i.e.
To ingest the data, smava uses a set of popular third-party customer data platforms complemented by custom scripts. After the data lands in Amazon S3, smava uses the AWS Glue Data Catalog and crawlers to automatically catalog the available data, capture the metadata, and provide an interface that allows querying all data assets.
Amazon EMR has long been the leading solution for processing big data in the cloud. Amazon EMR is the industry-leading big data solution for petabyte-scale data processing, interactive analytics, and machine learning using over 20 open source frameworks such as Apache Hadoop, Hive, and Apache Spark.
We use the built-in features of Data Firehose, including AWS Lambda for necessary data transformation and Amazon Simple Notification Service (Amazon SNS) for near real-time alerts. AWS Glue – The AWS Glue Data Catalog is your persistent technical metadata store in the AWS Cloud. Data Architect at AWS.
To learn more about how to process Firehose records using Lambda, see Transform source data in Amazon Data Firehose. After executing your Lambda function, Firehose looks for routing information and operations in the metadata fields (in the following format) provided by your Lambda function. b64decode(record['data']).decode('utf-8')
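Putting that together, a minimal Firehose transformation Lambda might look like the sketch below; the transformation itself is placeholder logic, and the optional metadata keys shown (partition keys) are an assumption, since the exact keys Firehose expects depend on the destination and feature in use.

```python
import base64
import json

def lambda_handler(event, context):
    """Firehose data-transformation Lambda: decode, transform, and re-encode each record."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]).decode("utf-8"))

        # Placeholder transformation: tag the record as processed.
        payload["processed"] = True

        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(json.dumps(payload).encode("utf-8")).decode("utf-8"),
            # Optional metadata; partition keys shown here as an assumed example.
            "metadata": {"partitionKeys": {"year": "2024"}},
        })
    return {"records": output}
```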