Big Data, Data Integration and Metadata

Simplify data integration with AWS Glue and zero-ETL to Amazon SageMaker Lakehouse

AWS Big Data

DECEMBER 4, 2024

With the growing emphasis on data, organizations are constantly seeking more efficient and agile ways to integrate their data, especially from a wide variety of applications. We take care of the ETL for you by automating the creation and management of data replication. Glue ETL offers customer-managed data ingestion.

Data Integration

Data Integration Data Lake Statistics Data-driven

Monitoring Apache Iceberg metadata layer using AWS Lambda, AWS Glue, and AWS CloudWatch

AWS Big Data

JULY 29, 2024

In the era of big data, data lakes have emerged as a cornerstone for storing vast amounts of raw data in its native format. They support structured, semi-structured, and unstructured data, offering a flexible and scalable environment for data ingestion from multiple sources.

Metadata

Metadata Snapshot Data Lake Metrics

Build a high-performance quant research platform with Apache Iceberg

AWS Big Data

JANUARY 9, 2025

Iceberg offers distinct advantages through its metadata layer over Parquet, such as improved data management, performance optimization, and integration with various query engines. Unlike direct Amazon S3 access, Iceberg supports these operations on petabyte-scale data lakes without requiring complex custom code.

Metadata

Metadata Snapshot Cost-Benefit Optimization

Webinars

Data Talks, CFOs Listen: Why Analytics Are Key To Better Spend Management

Mastering Apache Airflow® 3.0: What’s New (and What’s Next) for Data Orchestration

MORE WEBINARS

7 Benefits of Metadata Management

erwin

FEBRUARY 19, 2021

Metadata management is key to wringing all the value possible from data assets. However, most organizations don’t use all the data at their disposal to reach deeper conclusions about how to drive revenue, achieve regulatory compliance or accomplish other strategic objectives. What Is Metadata? Harvest data.

Metadata

Metadata Management Data Quality Cost-Benefit

How EUROGATE established a data mesh architecture using Amazon DataZone

AWS Big Data

JANUARY 15, 2025

An extract, transform, and load (ETL) process using AWS Glue is triggered once a day to extract the required data and transform it into the required format and quality, following the data product principle of data mesh architectures. From here, the metadata is published to Amazon DataZone by using AWS Glue Data Catalog.

IoT

IoT Machine Learning Metadata Data-driven

The Missing Link in Enterprise Data Governance: Metadata

Octopai

JUNE 26, 2020

In order to figure out why the numbers in the two reports didn’t match, Steve needed to understand everything about the data that made up those reports – when the report was created, who created it, any changes made to it, which system it was created in, etc. Enterprise data governance. Metadata in data governance.

Metadata

Metadata Data Governance Enterprise Reporting

How Metadata Makes Data Meaningful

erwin

DECEMBER 12, 2019

Metadata is an important part of data governance, and as a result, most nascent data governance programs are rife with project plans for assessing and documenting metadata. But in many scenarios, it seems that the underlying driver of metadata collection projects is that it’s just something you do for data governance.

Metadata

Metadata Data Governance Digital Transformation Data Quality

What Is a Metadata Catalog? (And How it Can Dramatically Improve Your Data Accuracy)

Octopai

JANUARY 31, 2022

If you’re a mystery lover, I’m sure you’ve read that classic tale: Sherlock Holmes and the Case of the Deceptive Data, and you know how a metadata catalog was a key plot element. In The Case of the Deceptive Data, Holmes is approached by B.I. He goes on to explain: Reasons for inaccurate data. Big data is BIG.

Metadata

Metadata IT Unstructured Data IoT

Are You Content with Your Organization’s Content Strategy?

Rocket-Powered Data Science

JULY 6, 2021

This is accomplished through tags, annotations, and metadata (TAM). granules) of the data collection for fast search, access, and retrieval is also important for efficient orchestration and delivery of the data that fuels AI, automation, and machine learning operations. Collect, curate, and catalog (i.e.,

Strategy

Strategy Machine Learning Metadata Knowledge Discovery

Introducing a new unified data connection experience with Amazon SageMaker Lakehouse unified data connectivity

AWS Big Data

DECEMBER 16, 2024

Third, some services require you to set up and manage compute resources used for federated connectivity, and capabilities like connection testing and data preview arent available in all services. To solve for these challenges, we launched Amazon SageMaker Lakehouse unified data connectivity.

Visualization

Visualization Data Processing Testing Publishing

Introducing MongoDB Atlas metadata collection with AWS Glue crawlers

AWS Big Data

FEBRUARY 6, 2023

AWS Glue is a serverless data integration service that makes it simple to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development. MongoDB Atlas is a developer data service from AWS technology partner MongoDB, Inc.

Metadata

Metadata Data Lake Machine Learning Big Data

The next generation of Amazon SageMaker: The center for all your data, analytics, and AI

AWS Big Data

DECEMBER 4, 2024

SageMaker brings together widely adopted AWS ML and analytics capabilities—virtually all of the components you need for data exploration, preparation, and integration; petabyte-scale big data processing; fast SQL analytics; model development and training; governance; and generative AI development.

Data Analytics

Data Analytics Analytics Data Lake Data Quality

Use Apache Iceberg in your data lake with Amazon S3, AWS Glue, and Snowflake

AWS Big Data

APRIL 3, 2024

Data engineers use Apache Iceberg because it’s fast, efficient, and reliable at any scale and keeps records of how datasets change over time. Apache Iceberg offers integrations with popular data processing frameworks such as Apache Spark, Apache Flink, Apache Hive, Presto, and more.

Data Lake

Data Lake Snapshot Metadata Data Architecture

Recap of Amazon Redshift key product announcements in 2024

AWS Big Data

DECEMBER 17, 2024

We have enhanced data sharing performance with improved metadata handling, resulting in data sharing first query execution that is up to four times faster when the data sharing producers data is being updated.

Data Lake

Data Lake Data Warehouse Data-driven Optimization

Biggest Trends in Data Visualization Taking Shape in 2022

Smart Data Collective

OCTOBER 13, 2021

There are countless examples of big data transforming many different industries. There is no disputing the fact that the collection and analysis of massive amounts of unstructured data has been a huge breakthrough. We would like to talk about data visualization and its role in the big data movement.

Visualization

Visualization Cost-Benefit Big Data Prescriptive Analytics

IBM named a leader in the 2022 Gartner® Magic Quadrant™ for Data Integration Tools

IBM Big Data Hub

AUGUST 24, 2022

The only question is, how do you ensure effective ways of breaking down data silos and bringing data together for self-service access? It starts by modernizing your data integration capabilities – ensuring disparate data sources and cloud environments can come together to deliver data in real time and fuel AI initiatives.

Data Integration

Data Integration Metadata Data-driven Data Architecture

Demystify data sharing and collaboration patterns on AWS: Choosing the right tool for the job

AWS Big Data

OCTOBER 21, 2024

Let’s briefly describe the capabilities of the AWS services we referred above: AWS Glue is a fully managed, serverless, and scalable extract, transform, and load (ETL) service that simplifies the process of discovering, preparing, and loading data for analytics. Amazon Athena is used to query, and explore the data.

Sales

Sales Data-driven Data Processing Key Performance Indicator

Data integrity vs. data quality: Is there a difference?

IBM Big Data Hub

JULY 13, 2023

When we talk about data integrity, we’re referring to the overarching completeness, accuracy, consistency, accessibility, and security of an organization’s data. Together, these factors determine the reliability of the organization’s data. In short, yes.

Data Quality

Data Quality Data Integration Metadata Cost-Benefit

Dive deep into security management: The Data on EKS Platform

AWS Big Data

APRIL 29, 2024

The construction of big data applications based on open source software has become increasingly uncomplicated since the advent of projects like Data on EKS , an open source project from AWS to provide blueprints for building data and machine learning (ML) applications on Amazon Elastic Kubernetes Service (Amazon EKS).

Management

Management Big Data Data Warehouse Metadata

Becoming a machine learning company means investing in foundational technologies

O'Reilly on Data

MAY 21, 2019

Thus, many developers will need to curate data, train models, and analyze the results of models. With that said, we are still in a highly empirical era for ML: we need big data, big models, and big compute. A typical data pipeline for machine learning. Metadata and artifacts needed for audits.

Machine Learning

Machine Learning Technology Deep Learning Data Science

How Metadata Makes Data Meaningful

erwin

DECEMBER 12, 2019

Metadata is an important part of data governance, and as a result, most nascent data governance programs are rife with project plans for assessing and documenting metadata. But in many scenarios, it seems that the underlying driver of metadata collection projects is that it’s just something you do for data governance.

Metadata

Metadata Data Governance Digital Transformation Data Quality

What Is Data Modeling? Data Modeling Best Practices for Data-Driven Organizations

erwin

JANUARY 17, 2020

Data modeling is a process that enables organizations to discover, design, visualize, standardize and deploy high-quality data assets through an intuitive, graphical interface. Data models provide visualization, create additional metadata and standardize data design across the enterprise. SQL or NoSQL?

Data-driven

Data-driven Modeling Metadata Data Governance

How HPE Aruba Supply Chain optimized cost and performance by migrating to an AWS modern data architecture

AWS Big Data

SEPTEMBER 11, 2024

AWS Transfer Family seamlessly integrates with other AWS services, automates transfer, and makes sure data is protected with encryption and access controls. Each file arrives as a pair with a tail metadata file in CSV format containing the size and name of the file. 2 GB into the landing zone daily.

Data Architecture

Data Architecture Optimization Data Warehouse Metadata

Extracting key insights from Amazon S3 access logs with AWS Glue for Ray

AWS Big Data

SEPTEMBER 7, 2023

We will partition and format the server access logs with Amazon Web Services (AWS) Glue , a serverless data integration service, to generate a catalog for access logs and create dashboards for insights. Both the user data and logs buckets must be in the same AWS Region and owned by the same account. Big Data Architect.

Metadata

Metadata Dashboards Metrics Visualization

How Cloudinary transformed their petabyte scale streaming data lake with Apache Iceberg and AWS Analytics

AWS Big Data

JUNE 10, 2024

Apache Iceberg brings the reliability and simplicity of SQL tables to big data, while making it possible for processing engines such as Apache Spark, Trino, Apache Flink, Presto, Apache Hive, and Impala to safely work with the same tables at the same time. This concept makes Iceberg extremely versatile. SparkActions.get().expireSnapshots(iceTable).expireOlderThan(TimeUnit.DAYS.toMillis(7)).execute()

Data Lake

Data Lake Metadata Snapshot Analytics

How Cargotec uses metadata replication to enable cross-account data sharing

AWS Big Data

JUNE 7, 2023

For this, Cargotec built an Amazon Simple Storage Service (Amazon S3) data lake and cataloged the data assets in AWS Glue Data Catalog. They chose AWS Glue as their preferred data integration tool due to its serverless nature, low maintenance, ability to control compute resources in advance, and scale when needed.

Metadata

Metadata Data Lake Machine Learning Big Data

Amazon OpenSearch Service Under the Hood : OpenSearch Optimized Instances(OR1)

AWS Big Data

APRIL 17, 2024

In this post, we discuss how the reimagined data flow works with OR1 instances and how it can provide high indexing throughput and durability using a new physical replication protocol. We also dive deep into some of the challenges we solved to maintain correctness and data integrity.

Optimization

Optimization Snapshot Metadata Cost-Benefit

What is data governance? Best practices for managing data assets

CIO Business Intelligence

MARCH 24, 2023

The program must introduce and support standardization of enterprise data. Programs must support proactive and reactive change management activities for reference data values and the structure/use of master data and metadata.

Data Governance

Data Governance Management Metadata Data Quality

How to Do Data Modeling the Right Way

erwin

MAY 27, 2020

What, then, should users look for in a data modeling product to support their governance/intelligence requirements in the data-driven enterprise? Nine Steps to Data Modeling. Provide metadata and schema visualization regardless of where data is stored. naming and database standards, formatting options, and so on.

Modeling

Modeling Metadata Data Governance Visualization

Why Your Business Should Use a Data Catalog to Organize Its Data

Smart Data Collective

JULY 15, 2021

A data catalog serves the same purpose. By using metadata (or short descriptions), data catalogs help companies gather, organize, retrieve, and manage information. You can think of a data catalog as an enhanced Access database or library card catalog system. What Does a Data Catalog Consist Of?

Metadata

Metadata IT Data-driven Data Quality

Top analytics announcements of AWS re:Invent 2024

AWS Big Data

FEBRUARY 26, 2025

S3 Tables integration with the AWS Glue Data Catalog is in preview, allowing you to stream, query, and visualize dataincluding Amazon S3 Metadata tablesusing AWS analytics services such as Amazon Data Firehose , Amazon Athena , Amazon Redshift, Amazon EMR, and Amazon QuickSight. With AWS Glue 5.0, With AWS Glue 5.0,

Analytics

Analytics Data Lake Metadata Data Warehouse

Migrate an existing data lake to a transactional data lake using Apache Iceberg

AWS Big Data

OCTOBER 3, 2023

In-place data upgrade In an in-place data migration strategy, existing datasets are upgraded to Apache Iceberg format without first reprocessing or restating existing data. In this method, the metadata are recreated in an isolated environment and colocated with the existing data files. This can save time.

Data Lake

Data Lake Metadata Snapshot Recreation/Entertainment

Five Benefits of an Automation Framework for Data Governance

erwin

JANUARY 24, 2019

In most companies, an incredible amount of data flows from multiple sources in a variety of formats and is constantly being moved and federated across a changing system landscape. With an automation framework, data professionals can meet these needs at a fraction of the cost of the traditional manual way. Governing metadata.

Data Governance

Data Governance Metadata Data-driven Cost-Benefit

Proposals for model vulnerability and security

O'Reilly on Data

MARCH 20, 2019

Data integrity constraints: Many databases don’t allow for strange or unrealistic combinations of input variables and this could potentially thwart watermarking attacks. Applying data integrity constraints on live, incoming data streams could have the same benefits. Disparate impact analysis: see section 1.

Modeling

Modeling Machine Learning Predictive Modeling Consulting

Design a data mesh pattern for Amazon EMR-based data lakes using AWS Lake Formation with Hive metastore federation

AWS Big Data

JUNE 10, 2024

In this post, we delve into the key aspects of using Amazon EMR for modern data management, covering topics such as data governance, data mesh deployment, and streamlined data discovery. Organizations have multiple Hive data warehouses across EMR clusters, where the metadata gets generated.

Data Lake

Data Lake Metadata Data Warehouse Data Processing

Introducing Apache Hudi support with AWS Glue crawlers

AWS Big Data

NOVEMBER 22, 2023

Many AWS customers adopted Apache Hudi on their data lakes built on top of Amazon S3 using AWS Glue , a serverless data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development.

Data Lake

Data Lake Snapshot Metadata Optimization

Modernize your ETL platform with AWS Glue Studio: A case study from BMS

AWS Big Data

DECEMBER 13, 2023

In addition to using native managed AWS services that BMS didn’t need to worry about upgrading, BMS was looking to offer an ETL service to non-technical business users that could visually compose data transformation workflows and seamlessly run them on the AWS Glue Apache Spark-based serverless data integration engine.

Metadata

Metadata Data Lake Visualization Data Quality

What is a data fabric architecture?

IBM Big Data Hub

MARCH 25, 2022

A data fabric is an architectural approach that enables organizations to simplify data access and data governance across a hybrid multicloud landscape for better 360-degree views of the customer and enhanced MLOps and trustworthy AI. The post What is a data fabric architecture? appeared first on Journey to AI Blog.

Metadata

Metadata Data Quality Data Governance Data Integration

Why next-generation execs should care about data governance

IBM Big Data Hub

APRIL 9, 2018

There’s a general need for next-gen executives to not only understand corporate regulations, but be able to adhere to and follow them using metadata solutions like data governance. As the business world’s top asset becomes data, data governance will ensure that data and information being handled is consistent, reliable and trustworthy.

Data Governance

Data Governance Metadata Data Integration Analytics

Query your Iceberg tables in data lake using Amazon Redshift (Preview)

AWS Big Data

AUGUST 31, 2023

Iceberg stores the metadata pointer for all the metadata files. When a SELECT query is reading an Iceberg table, the query engine first goes to the Iceberg catalog, then retrieves the entry of the location of the latest metadata file, as shown in the following diagram.

Data Lake

Data Lake Data Warehouse Metadata Data Architecture

Data governance in the age of generative AI

AWS Big Data

FEBRUARY 29, 2024

However, enterprise data generated from siloed sources combined with the lack of a data integration strategy creates challenges for provisioning the data for generative AI applications. Data discoverability Unlike structured data, which is managed in well-defined rows and columns, unstructured data is stored as objects.

Data Governance

Data Governance Unstructured Data Metadata Data Lake

Enhance monitoring and debugging for AWS Glue jobs using new job observability metrics, Part 3: Visualization and trend analysis using Amazon QuickSight

AWS Big Data

MARCH 29, 2024

You can slice data by different dimensions like job name, see anomalies, and share reports securely across your organization. With these insights, teams have the visibility to make data integration pipelines more efficient. An AWS Glue crawler scans data on the S3 bucket and populates table metadata on the AWS Glue Data Catalog.

Metrics

Metrics Visualization Dashboards Publishing

Why Establishing Data Context is the Key to Creating Competitive Advantage

Ontotext

AUGUST 22, 2023

The age of Big Data inevitably brought computationally intensive problems to the enterprise. Central to today’s efficient business operations are the activities of data capturing and storage, search, sharing, and data analytics. As a result, organizations have spent untold money and time gathering and integrating data.

Metadata

Metadata Knowledge Discovery Big Data Enterprise

How to stay ahead of ever-evolving data privacy regulations

IBM Big Data Hub

SEPTEMBER 12, 2022

To understand how a data fabric helps maintain compliance to privacy regulations, it’s helpful to look at some essential elements of that single pane of glass. Build a foundation using a common catalog and metadata. It lets appropriate parties, such as the company’s chief data analyst, know what the data is and where it resides.

Metadata

Metadata Data Governance Enterprise Data Architecture

Simplify data integration with AWS Glue and zero-ETL to Amazon SageMaker Lakehouse

Monitoring Apache Iceberg metadata layer using AWS Lambda, AWS Glue, and AWS CloudWatch

Webinars

Trending Sources

Build a high-performance quant research platform with Apache Iceberg

Webinars

7 Benefits of Metadata Management

How EUROGATE established a data mesh architecture using Amazon DataZone

The Missing Link in Enterprise Data Governance: Metadata

How Metadata Makes Data Meaningful

What Is a Metadata Catalog? (And How it Can Dramatically Improve Your Data Accuracy)

Are You Content with Your Organization’s Content Strategy?

Introducing a new unified data connection experience with Amazon SageMaker Lakehouse unified data connectivity

Introducing MongoDB Atlas metadata collection with AWS Glue crawlers

The next generation of Amazon SageMaker: The center for all your data, analytics, and AI

Use Apache Iceberg in your data lake with Amazon S3, AWS Glue, and Snowflake

Recap of Amazon Redshift key product announcements in 2024

Biggest Trends in Data Visualization Taking Shape in 2022

IBM named a leader in the 2022 Gartner® Magic Quadrant™ for Data Integration Tools

Demystify data sharing and collaboration patterns on AWS: Choosing the right tool for the job

Data integrity vs. data quality: Is there a difference?

Dive deep into security management: The Data on EKS Platform

Becoming a machine learning company means investing in foundational technologies

How Metadata Makes Data Meaningful

What Is Data Modeling? Data Modeling Best Practices for Data-Driven Organizations

How HPE Aruba Supply Chain optimized cost and performance by migrating to an AWS modern data architecture

Extracting key insights from Amazon S3 access logs with AWS Glue for Ray

How Cloudinary transformed their petabyte scale streaming data lake with Apache Iceberg and AWS Analytics

How Cargotec uses metadata replication to enable cross-account data sharing

Amazon OpenSearch Service Under the Hood : OpenSearch Optimized Instances(OR1)

What is data governance? Best practices for managing data assets

How to Do Data Modeling the Right Way

Why Your Business Should Use a Data Catalog to Organize Its Data

Top analytics announcements of AWS re:Invent 2024

Migrate an existing data lake to a transactional data lake using Apache Iceberg

Five Benefits of an Automation Framework for Data Governance

Proposals for model vulnerability and security

Design a data mesh pattern for Amazon EMR-based data lakes using AWS Lake Formation with Hive metastore federation

Introducing Apache Hudi support with AWS Glue crawlers

Modernize your ETL platform with AWS Glue Studio: A case study from BMS

What is a data fabric architecture?

Why next-generation execs should care about data governance

Query your Iceberg tables in data lake using Amazon Redshift (Preview)

Data governance in the age of generative AI

Enhance monitoring and debugging for AWS Glue jobs using new job observability metrics, Part 3: Visualization and trend analysis using Amazon QuickSight

Why Establishing Data Context is the Key to Creating Competitive Advantage

How to stay ahead of ever-evolving data privacy regulations

Stay Connected