This week on the keynote stages at AWS re:Invent 2024, you heard Matt Garman, CEO of AWS, and Swami Sivasubramanian, VP of AI and Data at AWS, speak about the next generation of Amazon SageMaker, the center for all of your data, analytics, and AI. The relationship between analytics and AI is rapidly evolving.
Whether you're a data analyst seeking a specific metric or a data steward validating metadata compliance, this update delivers a more precise, governed, and intuitive search experience. Start using this enhanced search capability today and experience the difference it brings to your data discovery journey.
In addition to real-time analytics and visualization, the data needs to be shared for long-term data analytics and machine learning applications. From here, the metadata is published to Amazon DataZone by using the AWS Glue Data Catalog. This process is shown in the following figure.
The Eightfold Talent Intelligence Platform integrates with Amazon Redshift metadata security to control the visibility of data catalog listings: the names of databases, schemas, tables, views, stored procedures, and functions in Amazon Redshift. This post discusses restricting the listing of data catalog metadata according to granted permissions.
Amazon Q generative SQL for Amazon Redshift uses generative AI to analyze user intent, query patterns, and schema metadata to identify common SQL query patterns directly within Amazon Redshift. This accelerates query authoring and reduces the time required to derive actionable data insights.
We’re excited to announce a new feature in Amazon DataZone that offers enhanced metadata governance for your subscription approval process. With this update, domain owners can define and enforce metadata requirements for data consumers when they request access to data assets.
Institutional Data & AI Platform architecture The Institutional Division has implemented a self-service data platform to enable the domain teams to build and manage data products autonomously. The following diagram illustrates the building blocks of the Institutional Data & AI Platform.
The IAM role ARN must be the same for both the OpenSearch Service sink definition and the Kinesis Data Streams source definition. You can control which data gets indexed into different indexes using the index definition in the sink.
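To make the shape of such a pipeline concrete, here is a minimal sketch using boto3's OpenSearch Ingestion (OSIS) client. The pipeline body is an assumption-heavy outline: the Kinesis source plugin name and its options may differ by Data Prepper version, and every ARN, stream, and domain name is a placeholder. Note that the same sts_role_arn appears in both the source and the sink, as the excerpt requires.

```python
import boto3

# Assumed pipeline body; the source plugin name and option keys are assumptions.
pipeline_body = """
version: "2"
kinesis-to-opensearch:
  source:
    kinesis_data_streams:          # plugin name is an assumption
      streams:
        - stream_name: "call-center-feed"
      aws:
        sts_role_arn: "arn:aws:iam::123456789012:role/pipeline-role"
        region: "us-east-1"
  sink:
    - opensearch:
        hosts: ["https://search-mydomain.us-east-1.es.amazonaws.com"]
        index: "calls-%{yyyy.MM.dd}"   # the index definition controls routing
        aws:
          # Must be the same role ARN as in the source definition.
          sts_role_arn: "arn:aws:iam::123456789012:role/pipeline-role"
          region: "us-east-1"
"""

osis = boto3.client("osis")
osis.create_pipeline(
    PipelineName="kinesis-to-opensearch",
    MinUnits=1,
    MaxUnits=4,
    PipelineConfigurationBody=pipeline_body,
)
```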
The following are the key components and steps in the integration process: Zero-ETL extracts and loads the data into Amazon S3, a highly scalable object storage service. The data is also registered in the Glue Data Catalog, a metadata repository.
Data scientists are analytical data experts who use data science to discover insights from massive amounts of structured and unstructured data (semi-structured data falls between the two) to help shape or meet specific business needs and goals.
Pricing and availability: Amazon MWAA pricing dimensions remain unchanged, and you only pay for what you use: the environment class and the metadata database storage consumed. Metadata database storage pricing remains the same.
BladeBridge offers a comprehensive suite of tools that automate much of the complex conversion work, allowing organizations to quickly and reliably transition their data analytics capabilities to the scalable Amazon Redshift data warehouse. She is passionate about data analytics and data science.
We have enhanced data sharing performance with improved metadata handling, resulting in data sharing first-query execution that is up to four times faster when the data sharing producer's data is being updated. Industry-leading price-performance: Amazon Redshift launches RA3.large
Data analytics is the linchpin of digital business strategies in the 21st century. Sensible companies need to know how to properly utilize data analytics to take full advantage of all of their digital resources. The Intersection Between Data Analytics and Digital Asset Management.
But most important of all, the assumed dormant value in the unstructured data is a question mark, which can only be answered after these sophisticated techniques have been applied. Therefore, there is a need to be able to analyze and extract value from the data economically and flexibly. The solution integrates data in three tiers.
KCL 3.0 reduces the Amazon DynamoDB cost associated with KCL by optimizing read operations on the DynamoDB table storing metadata. KCL uses DynamoDB to store metadata such as shard-worker mappings and checkpoints. Priyanka Chaudhary is a Senior Solutions Architect and data analytics specialist. Other benefits in KCL 3.0
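As a rough illustration (the table name is a placeholder; KCL names its lease table after the consumer application), you can peek at that metadata directly with boto3:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# KCL creates one lease table per application; "my-kcl-app" is a placeholder name.
resp = dynamodb.scan(TableName="my-kcl-app", Limit=5)
for item in resp["Items"]:
    # Each lease item is a DynamoDB attribute map tracking a shard (leaseKey),
    # its current worker (leaseOwner), and the sequence-number checkpoint.
    print(item.get("leaseKey"), item.get("leaseOwner"), item.get("checkpoint"))
```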
There are countless examples of big data transforming many different industries. There is no disputing the fact that the collection and analysis of massive amounts of unstructured data has been a huge breakthrough. We would like to talk about data visualization and its role in the big data movement.
Each file arrives paired with a tail metadata file in CSV format containing the file's size and name. This metadata file is later used to read source file names during processing into the staging layer. These files follow the same naming pattern, with a daily system-generated timestamp appended to each file name.
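A minimal sketch of the read step, assuming hypothetical bucket and key names and a two-column (name, size) CSV layout:

```python
import csv
import io

import boto3

s3 = boto3.client("s3")

# The tail metadata file arrives alongside the data file; names are placeholders.
obj = s3.get_object(Bucket="landing-bucket", Key="incoming/feed_20240101_metadata.csv")
body = obj["Body"].read().decode("utf-8")

# Each row names a source file to pick up during processing into staging.
for file_name, file_size in csv.reader(io.StringIO(body)):
    print(f"staging {file_name} ({file_size} bytes)")
```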
In this post, we walk you through the top analytics announcements from re:Invent 2024 and explore how these innovations can help you unlock the full potential of your data. S3 Metadata is designed to automatically capture metadata from objects as they are uploaded into a bucket, and to make that metadata queryable in a read-only table.
They chose AWS Glue as their preferred data integration tool due to its serverless nature, low maintenance, ability to control compute resources in advance, and scale when needed. To share the datasets, they needed a way to share access to the data and access to catalog metadata in the form of tables and views.
In this blog post, we dive into different data aspects and how Cloudinary addresses the two concerns of vendor lock-in and cost-efficient data analytics by using Apache Iceberg, Amazon Simple Storage Service (Amazon S3), Amazon Athena, Amazon EMR, and AWS Glue. This concept makes Iceberg extremely versatile. For example, expiring old snapshots is a one-line maintenance action:

```java
SparkActions.get().expireSnapshots(iceTable).expireOlderThan(TimeUnit.DAYS.toMillis(7)).execute()
```
Working with massive structured and unstructured data sets can turn out to be complicated. It's obvious that you'll want to use big data, but it's not so obvious how you're going to work with it. So, let's have a close look at some of the best strategies to work with large data sets. It's a good idea to record metadata.
Benchmark setup: In our testing, we used the 3 TB dataset stored in Amazon S3 in compressed Parquet format, with metadata for databases and tables stored in the AWS Glue Data Catalog. This benchmark uses the unmodified TPC-DS data schema and table relationships. He has been focusing on the big data analytics space since 2014.
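For reference, a TPC-DS query against such a setup can be submitted through Athena with boto3; the database name and output location below are assumptions:

```python
import boto3

athena = boto3.client("athena")

# store_sales, ss_store_sk, and ss_net_paid are standard TPC-DS names;
# the database and result bucket are placeholders.
resp = athena.start_query_execution(
    QueryString="SELECT ss_store_sk, SUM(ss_net_paid) FROM store_sales GROUP BY ss_store_sk",
    QueryExecutionContext={"Database": "tpcds_3tb"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(resp["QueryExecutionId"])
```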
Applying artificial intelligence (AI) to data analytics for deeper, better insights and automation is a growing enterprise IT priority. But the data repository options that have been around for a while tend to fall short in their ability to serve as the foundation for big data analytics powered by AI.
You can use its built-in transformations and recipes, as well as integrations with the AWS Glue Data Catalog and Amazon Simple Storage Service (Amazon S3), to preprocess the data in your landing zone, clean it up, and send it downstream for analytical processing. For Matching conditions, choose Match all conditions.
Apache Iceberg is an open table format for very large analytic datasets, which captures metadata on the state of datasets as they evolve and change over time. Iceberg has become very popular for its support for ACID transactions in data lakes and features like schema and partition evolution, time travel, and rollback.
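A short PySpark sketch of the time travel feature, assuming a Spark session already configured with an Iceberg catalog named glue_catalog and an existing table (names are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Inspect the snapshot history Iceberg captures in its table metadata.
spark.sql("SELECT snapshot_id, committed_at FROM glue_catalog.db.events.snapshots").show()

# Time travel: query the table as it existed at an earlier point in time.
spark.sql(
    "SELECT COUNT(*) FROM glue_catalog.db.events TIMESTAMP AS OF '2024-01-01 00:00:00'"
).show()
```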
In this post, we discuss ways to modernize your legacy, on-premises, real-time analytics architecture to build serverless data analytics solutions on AWS using Amazon Managed Service for Apache Flink. It shows a call center streaming data source that sends the latest call center feed every 15 seconds.
The post will include details on how to perform read/write data operations against Amazon S3 Tables, with AWS Lake Formation managing metadata and underlying data access using temporary credential vending.
Despite these capabilities, data lakes are not databases, and object storage does not provide support for ACID processing semantics, which you may require to effectively optimize and manage your data at scale across hundreds or thousands of users using a multitude of different technologies.
We recently talked about the benefits of using big data in marketing. We even discussed some tools that leverage big data to get more value out of marketing strategies. These are all great reasons to use big data in marketing. But for accurate modeling, you need lots of reliable data.
The FinAuto team built AWS Cloud Development Kit (AWS CDK), AWS CloudFormation, and API tools to maintain a metadata store that ingests from domain owner catalogs into the global catalog. This global catalog captures new or updated partitions from the data producer AWS Glue Data Catalogs.
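A simplified sketch of the harvesting side of such a metadata store, using the Glue API (the database name is a placeholder, and the publish step is stubbed out):

```python
import boto3

glue = boto3.client("glue")

def harvest(database: str) -> None:
    """List tables in a domain owner's Glue Data Catalog database."""
    paginator = glue.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName=database):
        for table in page["TableList"]:
            # A real pipeline would also diff partitions via get_partitions
            # and push new or updated entries into the global catalog.
            print(table["Name"], table.get("UpdateTime"))

harvest("finance_domain_db")
```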
According to Gartner, 60% of all big data projects fail, and according to Capgemini, 70% of big data projects are not profitable. There can only be one conclusion: big data projects are hard! There is not one specific.
AWS Glue Data Catalog stores information as metadata tables, where each table specifies a single data store. The AWS Glue crawler writes metadata to the Data Catalog by classifying the data to determine the format, schema, and associated properties of the data.
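As a minimal boto3 sketch (the role ARN, database, and S3 path are placeholders), creating and starting such a crawler looks like this:

```python
import boto3

glue = boto3.client("glue")

# Create a crawler that classifies data under an S3 prefix and writes the
# resulting schema to the Data Catalog as metadata tables.
glue.create_crawler(
    Name="sales-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="sales_db",
    Targets={"S3Targets": [{"Path": "s3://my-data-lake/sales/"}]},
)
glue.start_crawler(Name="sales-crawler")
```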
Because a CDC file can contain data for multiple tables, the job loops over the tables in a file and loads the table metadata from the source table (RDS column names). If the CDC operation is INSERT or UPDATE, the job merges the data into the Iceberg table.
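A sketch of the merge step in PySpark, assuming the CDC rows for one table are staged in a temporary view and the Iceberg table keys on id (all names are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Load one table's CDC rows from staging and expose them to Spark SQL.
cdc_df = spark.read.parquet("s3://staging/cdc/orders/")
cdc_df.createOrReplaceTempView("orders_cdc")

# Upsert into the Iceberg table: update matched rows, insert new ones.
spark.sql("""
    MERGE INTO glue_catalog.db.orders AS t
    USING orders_cdc AS s
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```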
Apache Flink is a scalable, reliable, and efficient data processing framework that handles real-time streaming and batch workloads (but is most commonly used for real-time streaming). It's the preferred choice for running big data workloads because it helps improve throughput and optimize Amazon EC2 spend.
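A minimal PyFlink sketch of the streaming style described here; the inline collection stands in for a real source such as a Kinesis or Kafka connector:

```python
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Inline events stand in for a real streaming source.
stream = env.from_collection(["call_start", "call_end", "call_start"])

# Count events per type: map to (event, 1), key by event, and reduce.
stream.map(lambda event: (event, 1)) \
      .key_by(lambda pair: pair[0]) \
      .reduce(lambda a, b: (a[0], a[1] + b[1])) \
      .print()

env.execute("event-counts")
```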
Advancement in big data technology has made the world of business even more competitive. The proper use of business intelligence and analytical data is what drives big brands in a competitive market. This high-end data visualization makes data exploration more accessible to end-users.
Many customers run big data workloads such as extract, transform, and load (ETL) on Apache Hive to create a data warehouse on Hadoop. We split the solution into two primary components: generating Spark job metadata and running the SQL on Amazon EMR. The script generates a metadata JSON file for each step.
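As an illustration of the first component, a per-step metadata JSON file can be generated like this; the field names are assumptions for illustration, not the post's actual schema:

```python
import json

# Hypothetical per-step metadata describing one SQL step to run on Amazon EMR.
step = {
    "step_name": "load_orders",
    "sql_file": "s3://scripts/hive/load_orders.sql",
    "depends_on": ["stage_orders"],
    "target_table": "dw.orders",
}

# One metadata JSON file is written per step.
with open("load_orders.metadata.json", "w") as f:
    json.dump(step, f, indent=2)
```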
Big data is cool again. As the company that taught the world the value of big data, we always knew it would be. But this is not your grandfather's big data. It has evolved into something new: hybrid data. The future is hybrid data; embrace it.
Additionally, authorization policies can be configured for a domain unit, governing actions such as who can create projects, metadata forms, and glossaries within their domain units. Similarly, authorization policies can help organizations govern the management of organizational domains, collaboration, and metadata.
Running Apache Airflow at scale puts proportionally greater load on the Airflow metadata database, sometimes leading to CPU and memory issues on the underlying Amazon Relational Database Service (Amazon RDS) cluster. A resource-starved metadata database may lead to dropped connections from your workers, failing tasks prematurely.
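One way to keep an eye on this, sketched with boto3 against a self-managed deployment (the instance identifier is a placeholder), is to track CPU on the RDS instance backing the metadata database:

```python
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")

# Pull the last hour of average CPU for the metadata database instance.
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "airflow-metadata-db"}],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 1))
```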
The program must introduce and support standardization of enterprise data. Programs must support proactive and reactive change management activities for reference data values and the structure/use of master data and metadata.
In this post, we delve into the key aspects of using Amazon EMR for modern data management, covering topics such as data governance, data mesh deployment, and streamlined data discovery. Organizations have multiple Hive data warehouses across EMR clusters, where the metadata gets generated.
Today's enterprise data analytics teams are constantly looking to get the best out of their platforms. Storage plays one of the most important roles in a data platform strategy; it provides the basis for all compute engines and applications to be built on top of it. Metadata in the cluster is disjoint across components.
The AWS Glue Studio visual editor is a low-code environment that allows you to compose data transformation workflows, seamlessly run them on the AWS Glue Apache Spark-based serverless data integration engine, and inspect the schema and data results in each step of the job.