An extract, transform, and load (ETL) process using AWS Glue is triggered once a day to extract the required data and transform it into the required format and quality, following the data product principle of data mesh architectures. From here, the metadata is published to Amazon DataZone by using AWS Glue Data Catalog.
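For readers who want to see what such a daily trigger might look like in practice, here is a minimal boto3 sketch; the trigger and job names are hypothetical, and the post itself does not show this code:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Schedule a (hypothetical) Glue ETL job to run once a day at 02:00 UTC.
glue.create_trigger(
    Name="daily-data-product-etl-trigger",
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",  # once a day
    Actions=[{"JobName": "data-product-etl-job"}],  # assumed job name
    StartOnCreation=True,
)
```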
Let’s briefly describe the capabilities of the AWS services referred to above: AWS Glue is a fully managed, serverless, and scalable extract, transform, and load (ETL) service that simplifies the process of discovering, preparing, and loading data for analytics. Amazon Athena is used to query and explore the data.
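As a rough illustration of how Athena is typically driven programmatically, here is a hedged boto3 sketch; the database, table, and results bucket are placeholders:

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Kick off an exploratory query; Athena writes results to the S3 location.
resp = athena.start_query_execution(
    QueryString="SELECT * FROM energy_readings LIMIT 10",  # placeholder table
    QueryExecutionContext={"Database": "data_product_db"},  # placeholder DB
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(resp["QueryExecutionId"])
```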
Third, some services require you to set up and manage compute resources used for federated connectivity, and capabilities like connection testing and data preview aren't available in all services. To address these challenges, we launched Amazon SageMaker Lakehouse unified data connectivity. For Add data source, choose Add connection.
Trustworthy data is essential for the energy industry to overcome these challenges and accelerate the transition toward digital transformation and sustainability. Specifically, the DCF captures metadata related to the application and compute stack. Addressing this complex issue requires a multi-pronged approach.
In this post, we discuss how the reimagined data flow works with OR1 instances and how it can provide high indexing throughput and durability using a new physical replication protocol. We also dive deep into some of the challenges we solved to maintain correctness and data integrity.
For this, Cargotec built an Amazon Simple Storage Service (Amazon S3) data lake and cataloged the data assets in AWS Glue Data Catalog. They chose AWS Glue as their preferred data integration tool due to its serverless nature, low maintenance, ability to control compute resources in advance, and scale when needed.
In-place data upgrade: In an in-place data migration strategy, existing datasets are upgraded to Apache Iceberg format without first reprocessing or restating existing data. In this method, the metadata is recreated in an isolated environment and colocated with the existing data files. Open AWS Glue Studio.
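One common way to perform such an in-place upgrade is Iceberg's migrate Spark procedure, which recreates only the metadata and leaves the data files where they are. Below is a minimal PySpark sketch, assuming a session already configured with the Iceberg runtime and a catalog named glue_catalog (both assumptions, not from the post):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-in-place-upgrade").getOrCreate()

# Iceberg's migrate procedure converts an existing table to Iceberg format
# by recreating metadata; the existing data files are not rewritten.
spark.sql("CALL glue_catalog.system.migrate('analytics_db.events')")
```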
SAP announced today a host of new AI copilot and AI governance features for SAP Datasphere and SAP Analytics Cloud (SAC). The company is expanding its partnership with Collibra to integrate Collibra’s AI Governance platform with SAP data assets to facilitate data governance for non-SAP data assets in customer environments.
In addition to using native managed AWS services that BMS didn’t need to worry about upgrading, BMS was looking to offer an ETL service to non-technical business users that could visually compose data transformation workflows and seamlessly run them on the AWS Glue Apache Spark-based serverless data integration engine.
Organizations have multiple Hive data warehouses across EMR clusters, where metadata is generated. To address this challenge, organizations can deploy a data mesh using AWS Lake Formation that connects the multiple EMR clusters. An entity can act both as a producer of data assets and as a consumer of data assets.
In this blog, I will demonstrate the value of Cloudera DataFlow (CDF), the edge-to-cloud streaming data platform available on the Cloudera Data Platform (CDP), as a data integration and democratization fabric. Data and Metadata: Data inputs and data outputs produced based on the application logic.
Data ingestion must be done properly from the start, as mishandling it can lead to a host of new issues. The groundwork of training data in an AI model is comparable to piloting an airplane. The entire generative AI pipeline hinges on the data pipelines that empower it, making it imperative to take the correct precautions.
Initially, searches from Hub queried LINQ’s Microsoft SQL Server database hosted on Amazon Elastic Compute Cloud (Amazon EC2), with search times averaging 3 seconds, leading to reduced adoption and negative feedback. The LINQ team exposes access to the OpenSearch Service index through a search API hosted on Amazon EC2.
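To give a feel for what calls against such an index look like, here is a sketch using the opensearch-py client; the endpoint and index name are invented, and the LINQ post does not include this code:

```python
from opensearchpy import OpenSearch

# Hypothetical OpenSearch Service domain endpoint.
client = OpenSearch(
    hosts=[{"host": "search-hub-demo.us-east-1.es.amazonaws.com", "port": 443}],
    use_ssl=True,
)

# Full-text match query against a hypothetical index.
resp = client.search(
    index="hub-documents",
    body={"query": {"match": {"title": "invoice"}}, "size": 10},
)
for hit in resp["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```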
Added to this are the increasing demands being made on our data from event-driven and real-time requirements, the rise of business-led use and understanding of data, and the move toward automation of data integration, data and service-level management. Knowledge Graphs are the Warp and Weft of a Data Fabric.
It integrates data across a wide range of sources to help optimize the value of ad dollar spending. Its cloud-hosted tool manages customer communications to deliver the right messages at times when they can be absorbed. Along the way, metadata is collected, organized, and maintained to help debug and ensure data integrity.
Within each episode, there are actionable insights that data teams can apply in their everyday tasks or projects. The host is Tobias Macey, an engineer with many years of experience. Another podcast we think is worth a listen is Agile Data.
You can slice data by different dimensions like job name, see anomalies, and share reports securely across your organization. With these insights, teams have the visibility to make data integration pipelines more efficient. An AWS Glue crawler scans data on the S3 bucket and populates table metadata on the AWS Glue Data Catalog.
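A crawler like the one mentioned can be created and started with boto3 along these lines; the crawler name, IAM role, database, and bucket are all placeholders:

```python
import boto3

glue = boto3.client("glue")

# Register tables for data under a (placeholder) S3 prefix.
glue.create_crawler(
    Name="pipeline-metrics-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # assumed role
    DatabaseName="pipeline_metrics_db",
    Targets={"S3Targets": [{"Path": "s3://my-metrics-bucket/data/"}]},
)
glue.start_crawler(Name="pipeline-metrics-crawler")
```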
Rise in polyglot data movement because of the explosion in data availability and the increased need for complex data transformations (due to, e.g., different data formats used by different processing frameworks or proprietary applications). As a result, alternative data integration technologies (e.g.,
If you want to know why a report from Power BI delivered a particular number, data lineage traces that data point back through your data warehouse or lakehouse, back through your data integration tool, back to where the data basis for that report metric first entered your system.
So, KGF 2023 proved to be a breath of fresh air for anyone interested in topics like data mesh and data fabric, knowledge graphs, text analysis, large language model (LLM) integrations, retrieval augmented generation (RAG), chatbots, semantic data integration, and ontology building.
The typical Cloudera Enterprise Data Hub Cluster starts with a few dozen nodes in the customer’s datacenter hosting a variety of distributed services. Over time, workloads start processing more data, tenants start onboarding more workloads, and administrators (admins) start onboarding more tenants.
We offer a seamless integration of the PoolParty Semantic Suite and GraphDB , called the PowerPack bundles. This enables our customers to work with a rich, user-friendly toolset to manage a graph composed of billions of edges hosted in data centers around the world. PowerPack Bundles – What is it and what is included?
All are ideally qualified to help their customers achieve and maintain the highest standards for data integrity, including absolute control over data access, transparency and visibility into the provider’s operation, the knowledge that their information is managed appropriately, and access to VMware’s growing ecosystem of sovereign cloud solutions.
To develop your disaster recovery plan, you should complete the following tasks: Define your recovery objectives for downtime and data loss (RTO and RPO) for data and metadata. On the Route 53 console, choose Hosted zones in the navigation pane. Choose your hosted zone. Choose Save.
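Failing traffic over in a hosted zone can also be done programmatically; here is a hedged boto3 sketch with an invented zone ID and record names:

```python
import boto3

route53 = boto3.client("route53")

# Repoint a record at the DR endpoint (UPSERT creates or updates it).
route53.change_resource_record_sets(
    HostedZoneId="Z0123456789EXAMPLE",  # placeholder zone ID
    ChangeBatch={
        "Comment": "Fail over to the DR region",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "search.example.com",
                "Type": "CNAME",
                "TTL": 60,
                "ResourceRecords": [{"Value": "dr.us-west-2.example.com"}],
            },
        }],
    },
)
```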
Data governance shows up as the fourth-most-popular kind of solution that enterprise teams were adopting or evaluating during 2019. That’s a lot of priorities – especially when you group together closely related items such as data lineage and metadata management which rank nearby. Those days are long gone if they ever existed.
Data ingestion: You have to build ingestion pipelines based on factors like types of data sources (on-premises data stores, files, SaaS applications, third-party data) and flow of data (unbounded streams or batch data). Then, you transform this data into a concise format.
With the new REST API, you can now invoke DAG runs, manage datasets, or get the status of Airflow’s metadata database, trigger, and scheduler—all without relying on the Airflow web UI or CLI.
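Below is a minimal sketch of triggering a DAG run through that API via boto3's InvokeRestApi operation; the environment and DAG names are placeholders, and a recent boto3 release is assumed:

```python
import boto3

mwaa = boto3.client("mwaa", region_name="us-east-1")

# POST to the Airflow REST API's dagRuns endpoint for a placeholder DAG.
resp = mwaa.invoke_rest_api(
    Name="my-mwaa-environment",  # assumed environment name
    Method="POST",
    Path="/dags/example_dag/dagRuns",
    Body={"conf": {}},
)
print(resp["RestApiResponse"])
```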
Ingest and Delivery. Discovery and Preparation. Transformation and Analytics. Metadata and Governance. Scheduling and Workflow. The company also recently hosted a webinar on Democratizing the Data Lake with Constellation Research and published two whitepapers from Mark Madsen.
The system ingests data from various sources such as cloud resources, cloud activity logs, and API access logs, and processes billions of messages, resulting in terabytes of data daily. This data is sent to Apache Kafka, which is hosted on Amazon Managed Streaming for Apache Kafka (Amazon MSK).
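As a rough sketch of the producer side of such a pipeline, here is a kafka-python example; the broker address and topic are placeholders standing in for an MSK cluster's TLS endpoints:

```python
import json

from kafka import KafkaProducer

# Placeholder bootstrap broker for an Amazon MSK cluster (TLS port 9094).
producer = KafkaProducer(
    bootstrap_servers=["b-1.msk-cluster.example.amazonaws.com:9094"],
    security_protocol="SSL",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Ship one hypothetical cloud-activity event.
producer.send("cloud-activity-logs", {"source": "api-gateway", "action": "Describe*"})
producer.flush()
```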
Since its launch in 2006, Amazon Simple Storage Service (Amazon S3) has experienced major growth, supporting multiple use cases such as hosting websites, creating data lakes, serving as object storage for consumer applications, storing logs, and archiving data. For Report path prefix, enter cur-data/account-cur-daily.
What if, experts asked, you could load raw data into a warehouse, and then empower people to transform it for their own unique needs? Today, data integration platforms like Rivery do just that. By pushing the T to the last step in the process, such products have revolutionized how data is understood and analyzed.
On Thursday January 6th I hosted Gartner’s 2022 Leadership Vision for Data and Analytics webinar. Much as the analytics world shifted to augmented analytics, the same is happening in data management. Where these efforts break down is in the data that goes into the connection at one end and comes out the other.
Last week, the Alation team had the privilege of joining IT professionals, business leaders, and data analysts and scientists for the Modern Data Stack Conference in San Francisco.
Data mapping is essential for integration, migration, and transformation of different data sets; it allows you to improve your data quality by preventing duplications and redundancies in your data fields. Data mapping helps standardize, visualize, and understand data across different systems and applications.
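A toy example of field-level mapping makes the idea concrete; all field names here are invented:

```python
# Map source fields to a target schema, dropping anything unmapped.
SOURCE_TO_TARGET = {
    "cust_nm": "customer_name",
    "cust_eml": "email",
    "sgnup_dt": "signup_date",
}

def map_record(record: dict) -> dict:
    """Rename source fields to target names; unmapped fields are dropped."""
    return {
        target: record[source]
        for source, target in SOURCE_TO_TARGET.items()
        if source in record
    }

print(map_record({"cust_nm": "Ada", "cust_eml": "ada@example.com", "legacy": 1}))
```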
Metadata: Self-service analysis is made easy with user-friendly naming conventions for tables and columns. That means it should be connected to your data sources, integrated with your security, and be embedded into your app. Furthermore, it uses techniques that are known for scaling the implementation.
HBase can run on Hadoop Distributed File System (HDFS) or Amazon Simple Storage Service (Amazon S3), and can host very large tables with billions of rows and millions of columns. Test and verify: After incremental data synchronization is complete, you can start testing and verifying the results.
Business Data Cloud (BDC) consists of multiple existing and new services built by SAP and its partners: an object store (an OEM from Databricks), Databricks data engineering and AI/ML tools, SAP Datasphere, and SAP BW 7.5. Instead, the Databricks object store provides an industry-standard and more cost-efficient solution for storing data.
To avoid this situation, Oktank aims to decouple compute from storage, allowing them to scale down compute nodes and repurpose them for other workloads without compromising data integrity and accessibility. The Data Catalog is a centralized metadata repository for all your data assets across various data sources.