With the growing emphasis on data, organizations are constantly seeking more efficient and agile ways to integrate their data, especially from a wide variety of applications. We take care of the ETL for you by automating the creation and management of data replication. Glue ETL offers customer-managed data ingestion.
This week on the keynote stages at AWS re:Invent 2024, Matt Garman, CEO of AWS, and Swami Sivasubramanian, VP of AI and Data at AWS, spoke about the next generation of Amazon SageMaker, the center for all of your data, analytics, and AI. The relationship between analytics and AI is rapidly evolving.
To handle such scenarios you need a translytical graph database: a database engine that can deal with both frequent updates (an OLTP workload) and graph analytics (OLAP). Not Every Graph is a Knowledge Graph: Schemas and Semantic Metadata Matter. Metadata About Relationships Comes in Handy. Schemas are powerful.
In addition to real-time analytics and visualization, the data needs to be shared for long-term data analytics and machine learning applications. From here, the metadata is published to Amazon DataZone by using the AWS Glue Data Catalog. This process is shown in the following figure.
We have enhanced data sharing performance with improved metadata handling, resulting in first-query execution on shared data that is up to four times faster while the producer's data is being updated. Industry-leading price-performance: Amazon Redshift launches RA3.large.
The results of our new research show that organizations are still trying to master data governance, including adjusting their strategies to address changing priorities and overcoming challenges related to data discovery, preparation, quality and traceability. And close to 50 percent have deployed data catalogs and business glossaries.
The program must introduce and support standardization of enterprise data. Programs must support proactive and reactive change management activities for reference data values and the structure/use of master data and metadata.
For this, Cargotec built an Amazon Simple Storage Service (Amazon S3) data lake and cataloged the data assets in the AWS Glue Data Catalog. They chose AWS Glue as their preferred data integration tool due to its serverless nature, low maintenance, ability to control compute resources in advance, and ability to scale when needed.
In this post, we walk you through the top analytics announcements from re:Invent 2024 and explore how these innovations can help you unlock the full potential of your data. S3 Metadata is designed to automatically capture metadata from objects as they are uploaded into a bucket, and to make that metadata queryable in a read-only table.
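Because S3 Metadata surfaces object metadata as a read-only queryable table, a natural first use is an analytical query over it. The sketch below builds such a query; the database name, table name, and column names (`key`, `size`, `last_modified_date`) are illustrative assumptions, not the service's confirmed schema, and the execution helper assumes boto3 and AWS credentials are available.

```python
# Sketch: build and (optionally) run an Athena-style query over an
# S3 Metadata table. Identifiers below are illustrative assumptions.

def recent_large_objects_query(database: str, table: str, min_bytes: int) -> str:
    """Return a read-only query listing large, recently written objects."""
    return (
        f'SELECT key, size, last_modified_date '
        f'FROM "{database}"."{table}" '
        f'WHERE size > {min_bytes} '
        f'ORDER BY last_modified_date DESC '
        f'LIMIT 100'
    )

def run_in_athena(query: str, output_s3: str):
    """Execute the query with boto3 (requires AWS credentials)."""
    import boto3  # assumed available in the target environment
    athena = boto3.client("athena")
    return athena.start_query_execution(
        QueryString=query,
        ResultConfiguration={"OutputLocation": output_s3},
    )
```

Keeping the query builder separate from the execution call makes the query string easy to inspect and test without touching AWS.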
For sectors such as industrial manufacturing and energy distribution, metering, and storage, embracing artificial intelligence (AI) and generative AI (GenAI) along with real-time data analytics, instrumentation, automation, and other advanced technologies is the key to meeting the demands of an evolving marketplace, but it’s not without risks.
We will partition and format the server access logs with Amazon Web Services (AWS) Glue, a serverless data integration service, to generate a catalog for access logs and create dashboards for insights. Both the user data and logs buckets must be in the same AWS Region and owned by the same account.
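The partitioning step above boils down to deriving a Hive-style partition prefix from each log record's timestamp so Glue and Athena can prune partitions. A minimal sketch, assuming the bracketed timestamp format that S3 server access logs use (e.g. `[06/Feb/2019:00:00:38 +0000]`):

```python
from datetime import datetime

# Sketch: derive a year/month/day partition prefix from an S3 server
# access log timestamp. The prefix layout is one common convention,
# not the only option.

def partition_prefix(log_timestamp: str) -> str:
    # Drop the surrounding brackets and the "+0000" UTC offset.
    stamp = log_timestamp.strip("[]").split()[0]
    ts = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S")
    return f"year={ts.year}/month={ts.month:02d}/day={ts.day:02d}/"

# partition_prefix("[06/Feb/2019:00:00:38 +0000]") → "year=2019/month=02/day=06/"
```

A Glue ETL job would apply this per record before writing the logs back out in a columnar format under the partitioned prefixes.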
In addition to using native managed AWS services that BMS didn’t need to worry about upgrading, BMS was looking to offer an ETL service to non-technical business users that could visually compose data transformation workflows and seamlessly run them on the AWS Glue Apache Spark-based serverless data integration engine.
In this blog post, we dive into different data aspects and how Cloudinary addresses the two concerns of vendor lock-in and cost-efficient data analytics by using Apache Iceberg, Amazon Simple Storage Service (Amazon S3), Amazon Athena, Amazon EMR, and AWS Glue. This concept makes Iceberg extremely versatile.

```java
// Expire snapshots older than 7 days. Note that expireOlderThan()
// takes an absolute timestamp, so the retention window must be
// subtracted from the current time.
SparkActions.get()
    .expireSnapshots(iceTable)
    .expireOlderThan(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7))
    .execute();
```
When it comes to Big Data, data visualization is crucial to driving high-level decision-making more successfully. Big Data analytics has immense potential to help companies in decision-making and position the company for a realistic future. There is little use for data analytics without the right visualization tool.
AWS Transfer Family seamlessly integrates with other AWS services, automates transfers, and makes sure data is protected with encryption and access controls. Each file arrives paired with a tail metadata file in CSV format containing the size and name of the file. About 2 GB arrives in the landing zone daily.
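The tail metadata file described above lends itself to a simple integrity check before downstream processing. A minimal sketch, assuming a one-row CSV of the form `file_name,file_size` (the exact layout is an assumption based on the description):

```python
import csv
import os

# Sketch: validate a landed data file against its "tail" metadata
# CSV. The single-row "name,size" layout is an assumed convention.

def validate_pair(data_path: str, tail_csv_path: str) -> bool:
    """Return True if the data file matches the name and byte size
    recorded in its tail metadata file."""
    with open(tail_csv_path, newline="") as fh:
        name, size = next(csv.reader(fh))
    return (os.path.basename(data_path) == name
            and os.path.getsize(data_path) == int(size))
```

A pipeline would typically quarantine any pair that fails this check rather than loading it.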
Gartner defines a data fabric as “a design concept that serves as an integrated layer of data and connecting processes.” The data fabric architectural approach can simplify data access in an organization and facilitate self-service data consumption at scale.
However, enterprise data generated from siloed sources, combined with the lack of a data integration strategy, creates challenges for provisioning the data for generative AI applications. Data discoverability: Unlike structured data, which is managed in well-defined rows and columns, unstructured data is stored as objects.
Organizations have multiple Hive data warehouses across EMR clusters, where the metadata gets generated. To address this challenge, organizations can deploy a data mesh using AWS Lake Formation that connects the multiple EMR clusters. An entity can act both as a producer of data assets and as a consumer of data assets.
Automate code generation : Alleviate the need for developers to hand code connections from data sources to target schema. It also makes it easier for business analysts, data architects, ETL developers, testers and project managers to collaborate for faster decision-making.
The age of Big Data inevitably brought computationally intensive problems to the enterprise. Central to today’s efficient business operations are the activities of data capture and storage, search, sharing, and data analytics. With semantic metadata, enterprise data assets get linked to one another and to external sources.
Business intelligence (BI) analysts transform data into insights that drive business value. If you score a 70% or higher on all three exams, you’ll be certified at the Mastery level, which demonstrates your ability to lead a team and mentor others, according to TDWI.
Data ingestion You have to build ingestion pipelines based on factors like types of data sources (on-premises data stores, files, SaaS applications, third-party data), and flow of data (unbounded streams or batch data). Then, you transform this data into a concise format.
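The transform-to-a-concise-format step above can be illustrated with a small batch transform. This is a minimal sketch; the source record shape and field names (`event`, `payload`, `amount`) are invented for illustration, and malformed records are simply skipped here where a real pipeline would route them to a dead-letter store:

```python
import json

# Sketch: reduce verbose newline-delimited JSON records to a concise,
# analytics-friendly shape. Field names are illustrative assumptions.

def to_concise(record: dict) -> dict:
    return {
        "id": record["event"]["id"],
        "ts": record["event"]["timestamp"],
        "user": record.get("user", {}).get("id"),
        "value": float(record["payload"]["amount"]),
    }

def transform_batch(lines):
    """Transform a batch of JSON lines, skipping malformed records."""
    out = []
    for line in lines:
        try:
            out.append(to_concise(json.loads(line)))
        except (json.JSONDecodeError, KeyError, ValueError):
            continue  # a real pipeline would dead-letter these
    return out
```

The same per-record function works unchanged whether the batch comes from files or from micro-batches drained off an unbounded stream.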
Loading complex multi-point datasets into a dimensional model, identifying issues, and validating data integrity of the aggregated and merged data points are the biggest challenges that clinical quality management systems face. It is a data modeling methodology designed for large-scale data warehouse platforms.
Octopai’s real-time capabilities provide a transparent, up-to-the-moment view of data integrations across platforms like Airflow, Azure Data Factory, Snowflake, Redshift, and Azure Synapse. Instead, it’s an intuitive journey where every step of data is transparent and trustworthy.
The data in the machine-readable files can provide valuable insights to understand the true cost of healthcare services and compare prices and quality across hospitals. The availability of machine-readable files opens up new possibilities for dataanalytics, allowing organizations to analyze large amounts of pricing data.
Amazon Redshift has been constantly innovating over the last decade to give you a modern, massively parallel processing cloud data warehouse that delivers the best price-performance, ease of use, scalability, and reliability. Discover how you can use Amazon Redshift to build a data mesh architecture to analyze your data.
As customers accelerate their migrations to the cloud and transform their businesses, some find themselves in situations where they have to manage data analytics in a multi-cloud environment, such as after acquiring a company that runs on a different cloud provider. For instructions, refer to Setting up databases and tables in AWS Glue.
AWS Glue, with its ability to process data using Apache Spark and connect to various data sources, is a suitable solution for addressing the challenges of accessing data across multiple cloud environments. Athena can then use this metadata to query and analyze the Delta table seamlessly.
The catalog stores the asset’s metadata in RDF. This keeps a well-defined representation of each asset’s metadata and enables using a SPARQL endpoint to query it. Toward that end, the authors introduce a system for integrity checks for building automation applications, using more reliable data for data analytics processes.
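Querying such a catalog over its SPARQL endpoint is a plain HTTP POST. A minimal sketch using only the standard library; the endpoint URL is a placeholder, and the query assumes the catalog models assets with standard DCAT/Dublin Core terms, which may differ from the actual graph layout:

```python
from urllib import parse, request

# Sketch: build a SPARQL request for asset metadata. DCAT/DCT terms
# are standard vocabularies; the graph layout is an assumption.

QUERY = """
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dct:  <http://purl.org/dc/terms/>
SELECT ?asset ?title ?modified WHERE {
  ?asset a dcat:Dataset ;
         dct:title ?title ;
         dct:modified ?modified .
}
ORDER BY DESC(?modified)
"""

def sparql_request(endpoint: str, query: str) -> request.Request:
    """Return a POST request for the endpoint; data implies POST."""
    body = parse.urlencode({"query": query}).encode()
    return request.Request(
        endpoint,
        data=body,
        headers={"Accept": "application/sparql-results+json"},
    )

# request.urlopen(sparql_request("http://localhost:7200/repositories/catalog", QUERY))
# would execute it against a live endpoint.
```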
Among the tasks necessary for internal and external compliance is the ability to report on the metadata of an AI model. Metadata includes details specific to an AI model such as: The AI model’s creation (when it was created, who created it, etc.)
Cloudera provides a unified platform with multiple data apps and tools, big data management, hybrid cloud deployment flexibility, admin tools for platform provisioning and control, and a shared data experience for centralized security, governance, and metadata management.
Due to the convergence of events in the data analytics and AI landscape, many organizations are at an inflection point. IBM Cloud Pak for Data Express solutions offer clients a simple on-ramp to start realizing the business value of a modern architecture. Data governance. Data integration.
An enterprise data catalog does all that a library inventory system does – namely streamlining data discovery and access across data sources – and a lot more. For example, data catalogs have evolved to deliver governance capabilities like managing data quality and data privacy and compliance.
Analyzing XML files can help organizations gain insights into their data, allowing them to make better decisions and improve their operations. Analyzing XML files can also help in data integration, because many applications and systems use XML as a standard data format. This approach optimizes the use of your XML files.
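A practical first step in analyzing an unfamiliar XML file is profiling its element structure before mapping it to a target schema. A minimal sketch with the standard library (the sample document is invented for illustration):

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Sketch: count element tags in an XML document to understand its
# structure before designing an integration mapping.

def tag_histogram(xml_text: str) -> Counter:
    root = ET.fromstring(xml_text)
    return Counter(el.tag for el in root.iter())

sample = "<orders><order id='1'><item/><item/></order></orders>"
# tag_histogram(sample)["item"] → 2
```

The same histogram approach scales to large files by switching to `ET.iterparse`, which streams elements instead of loading the whole tree.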
The engines must facilitate the advanced data integration and metadata management scenarios where an EKG is used for data fabrics or otherwise serves as a data hub between diverse data and content management systems.
In addition, using Apache Iceberg’s metadata tables proved to be very helpful in identifying issues related to the physical layout of Iceberg’s tables, which can directly impact query performance. These robust capabilities ensure that data within the data lake remains accurate, consistent, and reliable.
Streaming data has become an indispensable resource for organizations worldwide because it offers real-time insights that are crucial for data analytics. The escalating velocity and magnitude of collected data has created a demand for real-time analytics. This table acts as a metadata layer for the data.
Organizations across the world are increasingly relying on streaming data, and there is a growing need for real-time data analytics, considering the growing velocity and volume of data being collected. For more information about checkpointing, see the appendix at the end of this post.
Both approaches were typically monolithic and centralized architectures organized around mechanical functions of data ingestion, processing, cleansing, aggregation, and serving. Monitor and identify data quality issues closer to the source to mitigate the potential impact on downstream processes or workloads.
With the new REST API, you can now trigger DAG runs, manage datasets, or get the status of Airflow’s metadata database, triggerer, and scheduler, all without relying on the Airflow web UI or CLI.
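Triggering a DAG run through Airflow's stable REST API is a POST to `/api/v1/dags/{dag_id}/dagRuns`. A minimal sketch with the standard library; the host, DAG id, and `conf` payload are placeholders, and a real deployment (including MWAA, which fronts this API with its own auth) requires authentication headers that are omitted here:

```python
import json
from urllib import request

# Sketch: build a request that triggers an Airflow DAG run via the
# stable REST API. Host, dag_id, and conf values are placeholders.

def trigger_dag_request(base_url: str, dag_id: str, conf: dict) -> request.Request:
    body = json.dumps({"conf": conf}).encode()
    return request.Request(
        f"{base_url}/api/v1/dags/{dag_id}/dagRuns",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# With auth headers added, request.urlopen(trigger_dag_request(
#     "http://localhost:8080", "daily_etl", {"run_date": "2024-12-01"}))
# would start the run.
```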
By leveraging data services and APIs, a data fabric can also pull together data from legacy systems, data lakes, data warehouses and SQL databases, providing a holistic view into business performance. It uses knowledge graphs, semantics and AI/ML technology to discover patterns in various types of metadata.
An AWS Glue ETL job, using the Apache Hudi connector, updates the S3 data lake hourly with incremental data. The AWS Glue job can transform the raw data in Amazon S3 to Parquet format, which is optimized for analytic queries. All the metadata of the tables is stored in the AWS Glue Data Catalog, including the Hudi tables.
Today, the most innovative and successful organizations leverage data to increase revenue, minimize expenses, and deliver products and services that meet the needs of their customers. To be truly “data-driven,” an organization must view data as more than a byproduct.
2. CDO (data officer)
3. Metadata Strategy
4. Data Integration tactics
5. Data Management Infrastructure/Data Fabric
6. Business Innovation with D&A
8. Data Literacy, training, coordination, collaboration
Figure 3: The Data and Analytics (infrastructure) Continuum.