Amazon EMR provides a big data environment for data processing, interactive analysis, and machine learning using open source frameworks such as Apache Spark, Apache Hive, and Presto. These data processing and analytics frameworks support Structured Query Language (SQL) for interacting with the data.
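As a minimal, hedged sketch of that SQL interface, the snippet below registers a Parquet dataset from S3 as a temporary view and queries it with Spark SQL on an EMR cluster; the bucket path, view name, and columns are illustrative assumptions, not details from the article.

from pyspark.sql import SparkSession

# Minimal PySpark sketch: query S3 data with Spark SQL on EMR.
# The bucket path and the "sales" view name are placeholders.
spark = SparkSession.builder.appName("emr-sql-example").getOrCreate()

df = spark.read.parquet("s3://your-bucket/sales/")
df.createOrReplaceTempView("sales")

# Any of the EMR engines (Spark, Hive, Presto) lets you express this in SQL.
top_regions = spark.sql("""
    SELECT region, SUM(amount) AS total_amount
    FROM sales
    GROUP BY region
    ORDER BY total_amount DESC
    LIMIT 10
""")
top_regions.show()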
We live in a data-rich, insights-rich, and content-rich world. Data collections are the ones and zeroes that encode the actionable insights (patterns, trends, relationships) that we seek to extract from our data through machine learning and data science. Source: [link] I will finish with three quotes.
The following requirements were essential in the decision to adopt a modern data mesh architecture. Domain-oriented ownership and data-as-a-product: EUROGATE aims to enable scalable and straightforward data sharing across organizational boundaries and to eliminate centralized bottlenecks and complex data pipelines.
We also examine how centralized, hybrid, and decentralized data architectures support scalable, trustworthy ecosystems. As data-centric AI, automated metadata management, and privacy-aware data sharing mature, the opportunity to embed data quality into the enterprise's core has never been more significant.
Institutional Data & AI Platform architecture: The Institutional Division has implemented a self-service data platform to enable the domain teams to build and manage data products autonomously. The following diagram illustrates the building blocks of the Institutional Data & AI Platform.
This is where metadata, or the data about data, comes into play. Having a data catalog is the cornerstone of your data governance strategy, but what supports your data catalog? Your metadata management framework provides the underlying structure that makes your data accessible and manageable.
The goal is to examine five major methods of verifying and validating data transformations in data pipelines, with an eye toward high-quality data deployment. First, we look at how unit and integration tests uncover transformation errors at an early stage.
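As a hedged illustration of that first method, the snippet below unit-tests a small, hypothetical transformation with pytest-style assertions; the clean_prices() function and its expected behavior are invented for illustration and are not from the article.

import pandas as pd

# Hypothetical transformation under test: drop rows with missing prices
# and normalize the price to integer cents.
def clean_prices(df: pd.DataFrame) -> pd.DataFrame:
    out = df.dropna(subset=["price"]).copy()
    out["price_cents"] = (out["price"] * 100).round().astype(int)
    return out

# A unit test that catches transformation errors early, before the step
# ever runs inside the full pipeline.
def test_clean_prices_drops_nulls_and_converts_to_cents():
    raw = pd.DataFrame({"price": [1.99, None, 0.5]})
    result = clean_prices(raw)
    assert len(result) == 2
    assert result["price_cents"].tolist() == [199, 50]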
How dbt Core helps data teams test, validate, and monitor complex data transformations and conversions. dbt Core, an open-source framework for developing, testing, and documenting SQL-based data transformations, has become a must-have tool for modern data teams as the complexity of data pipelines grows.
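As a small, hedged sketch of automating those checks, recent dbt Core releases (1.5 and later) expose a programmatic runner; the snippet below invokes the project's tests from Python, assuming a dbt project and profile are already configured in the working directory.

from dbt.cli.main import dbtRunner, dbtRunnerResult

# Invoke `dbt test` programmatically and print each test's status.
dbt = dbtRunner()
res: dbtRunnerResult = dbt.invoke(["test"])

for r in res.result:
    print(f"{r.node.name}: {r.status}")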
In addition to using native managed AWS services that BMS didn’t need to worry about upgrading, BMS was looking to offer an ETL service to non-technical business users that could visually compose data transformation workflows and seamlessly run them on the AWS Glue Apache Spark-based serverless data integration engine.
It seamlessly consolidates data from various data sources within AWS, including AWS Cost Explorer (and forecasting with Cost Explorer), AWS Trusted Advisor, and AWS Compute Optimizer. They can use their own toolsets or rely on provided blueprints to ingest the data from source systems.
Einstein Copilot for Tableau remains in beta, but Tableau announced two new features for the AI assistant as well: AI-assisted data transformation. This feature can automate a data transformation pipeline with step-by-step suggestions for preparing data for analysis.
Third, some services require you to set up and manage compute resources used for federated connectivity, and capabilities like connection testing and data preview aren't available in all services. To solve these challenges, we launched Amazon SageMaker Lakehouse unified data connectivity.
We’re excited to announce the general availability of the open source adapters for dbt for all the engines in CDP — Apache Hive, Apache Impala, and Apache Spark, with added support for Apache Livy and Cloudera Data Engineering. The Open Data Lakehouse: Cloudera builds dbt adapters for all engines in the open data lakehouse.
Today’s general availability announcement covers Iceberg running within key data services in the Cloudera Data Platform (CDP) — including Cloudera Data Warehousing (CDW), Cloudera Data Engineering (CDE), and Cloudera Machine Learning (CML). Why integrate Apache Iceberg with Cloudera Data Platform?
The platform converges data cataloging, data ingestion, data profiling, data tagging, data discovery, and data exploration into a unified platform, driven by metadata. Modak Nabu automates repetitive tasks in the data preparation process and thus accelerates the data preparation by 4x.
However, when a data producer shares data products on a data mesh self-serve web portal, it’s neither intuitive nor easy for a data consumer to know which data products they can join to create new insights. This is especially true in a large enterprise with thousands of data products.
It includes processes that trace and document the origin of data, models and associated metadata, and pipelines for audits. An AI governance framework ensures the ethical, responsible and transparent use of AI and machine learning (ML). Capture and document model metadata for report generation.
Also, you can run other types of business applications, such as web applications and machine learning (ML) TensorFlow workloads, on the same EKS cluster. We have been continually improving Spark performance in each Amazon EMR release to further shorten job runtime and optimize users’ spending on their Amazon EMR big data workloads.
This is done by visualizing the Azure Data Factory pipelines with full column-level, source-to-target traceability through the different data transformations at the most detailed level. Octopai can fully map the BI landscape and trace metadata movement in a mixed environment, including complex multi-vendor landscapes.
One key component that plays a central role in modern data architectures is the data lake, which allows organizations to store and analyze large amounts of data in a cost-effective manner and run advanced analytics and machine learning (ML) at scale. This ensures that the data is suitable for training purposes.
The data in the machine-readable files can provide valuable insights to understand the true cost of healthcare services and compare prices and quality across hospitals. The availability of machine-readable files opens up new possibilities for data analytics, allowing organizations to analyze large amounts of pricing data.
is our enterprise-ready next-generation studio for AI builders, bringing together traditional machine learning (ML) and new generative AI capabilities powered by foundation models. Automated development: Automates data preparation, model development, feature engineering and hyperparameter optimization using AutoAI.
We chatted about industry trends, why decentralization has become a hot topic in the data world, and how metadata drives many data-centric use cases. But, through it all, Mohan says it’s critical to view everything through the same lens: gaining business value from data. Data fabric is a technology architecture.
Although Jira Cloud provides reporting capability, loading this data into a data lake will facilitate enrichment with other business data, as well as support the use of business intelligence (BI) tools and artificial intelligence (AI) and machine learning (ML) applications. For InitialRunFlag, choose Setup.
In addition, data pipelines include more and more stages, thus making it difficult for data engineers to compile, manage, and troubleshoot those analytical workloads. As a result, alternative data integration technologies (e.g.,
The API retrieves data at runtime from an Amazon Aurora PostgreSQL-Compatible Edition database for end-user consumption. To populate the database, the Infomedia team developed a data pipeline using Amazon Simple Storage Service (Amazon S3) for data storage, AWS Glue for data transformations, and Apache Hudi for CDC and record-level updates.
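A rough sketch of the record-level upsert step such a pipeline performs is shown below, assuming a Spark environment (for example, an AWS Glue job) with the Hudi libraries available; the table name, key fields, and S3 paths are placeholders, not details from the article.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-upsert-sketch").getOrCreate()

# Incoming change-data-capture records (path is a placeholder).
changes = spark.read.json("s3://your-bucket/cdc/incoming/")

# Core Hudi write options for a record-level upsert; field names are assumed.
hudi_options = {
    "hoodie.table.name": "vehicle_data",
    "hoodie.datasource.write.recordkey.field": "record_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
}

# Upsert the change records into the Hudi table stored on S3.
(changes.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://your-bucket/hudi/vehicle_data/"))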
Amazon Redshift is a popular cloud data warehouse, offering a fully managed cloud-based service that seamlessly integrates with an organization’s Amazon Simple Storage Service (Amazon S3) data lake, real-time streams, machine learning (ML) workflows, transactional workflows, and much more—all while providing up to 7.9x
As data inconsistencies grew, so did skepticism about the accuracy of the data. Decision-makers hesitated to rely on data-driven insights, fearing the consequences of potential errors. Automated data lineage is essential in addressing these challenges, and Octopai’s solution makes it both achievable and manageable.
This involves unifying and sharing a single copy of data and metadata across IBM® watsonx.data™, IBM® Db2®, IBM® Db2® Warehouse and IBM® Netezza®, using native integrations and supporting open formats, all without the need for migration or recataloging.
They defined it as: “A data lakehouse is a new, open data management architecture that combines the flexibility, cost-efficiency, and scale of data lakes with the data management and ACID transactions of data warehouses, enabling business intelligence (BI) and machine learning (ML) on all data.”
Before you implement a data governance framework, you need to know the data you already have. This means you need to: Inventory data: Know all information resources and relevant metadata. Classify data: Organize structured and unstructured data into relevant categories. Reuse metadata productively.
The most common use case for Airflow is ETL (extract, transform, and load). Operationalizing machine learning (ML) is another growing use case, where data has to be transformed and normalized before it can be loaded into an ML model.
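Below is a minimal, hedged sketch of that ETL pattern as an Airflow 2.x DAG; the bucket name, schedule, and the placeholder aggregation step are illustrative assumptions, with the S3 paths modeled on the aggregated "green" dataset paths mentioned in the excerpt.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

S3_BUCKET_NAME = "your-data-bucket"  # placeholder bucket name

def aggregate_green_trips():
    # Extract raw records, transform them, and load the aggregated result.
    source = "s3://{}/data/raw/green".format(S3_BUCKET_NAME)
    target = "s3://{}/data/aggregated/green".format(S3_BUCKET_NAME)
    print(f"Aggregating {source} into {target}")  # stand-in for real logic

with DAG(
    dag_id="green_trips_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="aggregate_green", python_callable=aggregate_green_trips)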
FINRA centralizes all its data in Amazon Simple Storage Service (Amazon S3) with a remote Hive metastore on Amazon Relational Database Service (Amazon RDS) to manage its metadata.
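For context, pointing EMR's Hive at a remote metastore on Amazon RDS is typically done with the hive-site configuration classification; the sketch below builds that configuration as a Python structure that could be passed as the Configurations parameter when launching a cluster (for example, with boto3's run_job_flow). The endpoint, driver, and credentials are placeholders.

# Configuration classification that points EMR's Hive at an external
# metastore database on Amazon RDS. All connection values are placeholders.
hive_metastore_config = [
    {
        "Classification": "hive-site",
        "Properties": {
            "javax.jdo.option.ConnectionURL":
                "jdbc:mysql://your-rds-endpoint:3306/hive?createDatabaseIfNotExist=true",
            "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
            "javax.jdo.option.ConnectionUserName": "hive",
            "javax.jdo.option.ConnectionPassword": "replace-me",
        },
    }
]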
The following AWS services are used for data ingestion, processing, and load: Amazon AppFlow is a fully managed integration service that enables you to securely transfer data between SaaS applications like Salesforce, SAP, Marketo, Slack, and ServiceNow, and AWS services like Amazon S3 and Amazon Redshift, in just a few clicks.
This field guide to data mapping will explore how data mapping connects volumes of data for enhanced decision-making. Why data mapping is important: data mapping is a critical element of any data management initiative, such as data integration, data migration, data transformation, data warehousing, or automation.
Requirements include: Multi-Source Data Blending, where data from multiple sources is compiled and the output is a single view, metric, or visualization; Data Transformation and Enrichment, where data can be enriched for analysis; and Metadata, where self-service analysis is made easy with user-friendly naming conventions for tables and columns.
Iceberg brings the reliability and simplicity of SQL tables to Amazon Simple Storage Service (Amazon S3) data lakes. To learn more about how to process Firehose records using Lambda, see Transform source data in Amazon Data Firehose.
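As a hedged illustration of processing Firehose records with Lambda, the handler below follows the documented recordId, result, and data response contract for Firehose data transformation; the uppercase transform is only a stand-in for real logic.

import base64

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        # Firehose delivers each record's payload base64-encoded.
        payload = base64.b64decode(record["data"]).decode("utf-8")
        transformed = payload.upper()  # placeholder transformation
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(transformed.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}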
Amazon EMR has long been the leading solution for processing big data in the cloud. Amazon EMR is the industry-leading big data solution for petabyte-scale data processing, interactive analytics, and machine learning using over 20 open source frameworks such as Apache Hadoop, Hive, and Apache Spark.