An extract, transform, and load (ETL) process using AWS Glue is triggered once a day to extract the required data and transform it into the expected format and quality, following the data product principle of data mesh architectures. From there, the metadata is published to Amazon DataZone through the AWS Glue Data Catalog.
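As a minimal sketch of such a daily trigger, assuming a hypothetical Glue job name and cron schedule (neither comes from the article), the scheduling step could look like this with boto3:

    # Hypothetical sketch: schedule an existing AWS Glue job to run once a day.
    import boto3

    glue = boto3.client("glue")

    glue.create_trigger(
        Name="daily-etl-trigger",             # placeholder trigger name
        Type="SCHEDULED",
        Schedule="cron(0 2 * * ? *)",         # 02:00 UTC daily (assumed schedule)
        Actions=[{"JobName": "my-etl-job"}],  # placeholder Glue job name
        StartOnCreation=True,
    )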
These nodes can implement analytical platforms like data lakehouses, data warehouses, or data marts, all united by producing data products. The Institutional Data & AI platform adopts a federated approach to data while centralizing the metadata to facilitate simpler discovery and sharing of data products.
Amazon Redshift is a fast, petabyte-scale cloud data warehouse that tens of thousands of customers rely on to power their analytics workloads. With its massively parallel processing (MPP) architecture and columnar data storage, Amazon Redshift delivers high price-performance for complex analytical queries against large datasets.
It also makes it easier for engineers, data scientists, product managers, analysts, and business users throughout an organization to discover and use data, and to collaborate on deriving data-driven insights. The producer also needs to manage and publish the data asset so it’s discoverable throughout the organization.
Organizations often have multiple Hive data warehouses across EMR clusters, each of which generates its own metadata. To address this challenge, organizations can deploy a data mesh using AWS Lake Formation that connects the multiple EMR clusters. An entity can act both as a producer of data assets and as a consumer of data assets.
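As a hedged sketch of the sharing step in such a mesh, a Lake Formation grant from a producer to a consumer account might look like the following; the account ID, database, and table names are purely illustrative:

    # Illustrative sketch: grant a consumer account SELECT on a cataloged table.
    import boto3

    lf = boto3.client("lakeformation")

    lf.grant_permissions(
        Principal={"DataLakePrincipalIdentifier": "111122223333"},  # assumed consumer account
        Resource={
            "Table": {
                "DatabaseName": "sales_db",  # placeholder database
                "Name": "orders",            # placeholder table
            }
        },
        Permissions=["SELECT"],
    )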
“The data catalog is critical because it’s where business manages its metadata,” said Venkat Rajaji, Senior Vice President of Product Management at Cloudera. “There’s been a ton of innovation lately around the Iceberg REST catalog because the data turf war is over. But the metadata turf war is just getting started.”
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more. This enables you to use your data to acquire new insights for your business and customers. Document the entire disaster recovery process.
Amazon Redshift is a popular cloud data warehouse, offering a fully managed cloud-based service that seamlessly integrates with an organization’s Amazon Simple Storage Service (Amazon S3) data lake, real-time streams, machine learning (ML) workflows, transactional workflows, and much more, all while providing up to 7.9x better price-performance.
While cloud-native, point-solution data warehouse services may serve your immediate business needs, there are dangers to the corporation as a whole when you do your own IT this way. Cloudera Data Warehouse (CDW) is here to save the day! CDW is an integrated data warehouse service within Cloudera Data Platform (CDP).
All this data arrives by the terabyte, and a data management platform can help marketers make sense of it all. Marketing-focused or not, DMPs excel at negotiating with a wide array of databases, data lakes, and data warehouses, ingesting their streams of data and then cleaning, sorting, and unifying the information therein.
Your sunk costs are minimal, and if a workload or project you are supporting becomes irrelevant, you can quickly spin down your cloud data warehouses and not be “stuck” with unused infrastructure. Cloud deployments for suitable workloads give you the agility to keep pace with rapidly changing business and data needs.
To speed up self-service analytics and foster innovation based on data, a solution was needed that lets any team create data products on its own in a decentralized manner. To create and manage the data products, smava uses Amazon Redshift, a cloud data warehouse.
As the first of its reasons to migrate to Redshift, Amazon says, “Amazon Redshift is fully managed and simple to use, enabling you to deploy a new data warehouse in minutes and load virtually any type of data from a range of cloud or on-premises data sources.” Setting up the data warehouse can take minutes.
They enable transactions on top of data lakes and can simplify data storage, management, ingestion, and processing. These transactional data lakes combine features from both the data lake and the data warehouse. Data can be organized into three different zones.
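As an illustration of the write path into such zones, here is a hedged PySpark sketch that moves data from an assumed raw zone into an assumed curated zone as a Delta table; the bucket paths, column name, and zone layout are assumptions, and the delta-spark package must be available:

    # Sketch under assumptions: land raw JSON, lightly clean it, and write a
    # transactional Delta table into a curated zone.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

    raw = spark.read.json("s3://my-bucket/raw/orders/")        # assumed raw-zone path
    cleaned = raw.dropDuplicates(["order_id"]).na.drop("all")  # "order_id" is hypothetical

    (cleaned.write.format("delta")
        .mode("overwrite")
        .save("s3://my-bucket/curated/orders/"))               # assumed curated-zone path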
As the queries finish running, an UNLOAD operation is invoked from the Redshift data warehouse to the S3 bucket in Account A. Cross-account access has been set up between the S3 buckets in Account A and resources in Account B to be able to load and unload data.
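A hedged sketch of what the UNLOAD invocation could look like through the Redshift Data API; the cluster, database, user, bucket, and IAM role below are placeholders rather than details from the article:

    # Illustrative sketch: export query results from Redshift to cross-account S3.
    import boto3

    rsd = boto3.client("redshift-data")

    unload_sql = """
    UNLOAD ('SELECT * FROM sales.daily_summary')
    TO 's3://account-a-bucket/exports/daily_summary_'
    IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftUnloadRole'
    FORMAT AS PARQUET;
    """

    rsd.execute_statement(
        ClusterIdentifier="my-cluster",  # placeholder cluster
        Database="dev",                  # placeholder database
        DbUser="awsuser",                # placeholder user
        Sql=unload_sql,
    )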
Knowing the business purpose translates into actively governing personal data against potential privacy and security violations. Do you know where your sensitive data is? Data is a valuable asset used to operate, manage, and grow a business.
With quality data at their disposal, organizations can build data warehouses for the purposes of examining trends and establishing future-facing strategies. Industry-wide, the positive ROI on quality data is well understood. Data profiling is an essential process in the data quality management (DQM) lifecycle.
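As a minimal sketch of what profiling produces, assuming a hypothetical CSV input, per-column types, null rates, and distinct counts can be computed with pandas:

    # Minimal profiling sketch: basic column-level quality metrics.
    import pandas as pd

    df = pd.read_csv("customers.csv")  # hypothetical input file

    profile = pd.DataFrame({
        "dtype": df.dtypes.astype(str),          # column data types
        "null_rate": df.isna().mean().round(3),  # fraction of missing values
        "distinct": df.nunique(),                # cardinality per column
    })
    print(profile)
    print(df.describe(include="all"))            # summary statistics for every column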
In other words, metadata about data science work is used to generate code. In this case, code gets generated for data preparation, where so much of the “time and labor” in data science work is concentrated. Less data gets decompressed, deserialized, loaded into memory, run through processing, and so on.
Data lineage is the ability to view the path of data as it flows from source to target within your data ecosystem, along with everything that happened to it along the way. Data lineage solutions will also show you any transformations the data underwent on its journey.
In legacy analytical systems such as enterprise data warehouses, the scalability challenges of a system were primarily associated with computational scalability, i.e., the ability of a data platform to handle larger volumes of data in an agile and cost-efficient way.
Data Firehose uses an AWS Lambda function to transform data and ingest the transformed records into an Amazon Simple Storage Service (Amazon S3) bucket. An AWS Glue crawler scans data in the S3 bucket and populates table metadata in the AWS Glue Data Catalog.
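A sketch of such a transformation function, following the documented Firehose Lambda contract (each record arrives base64-encoded and must be returned with its original recordId, a result status, and base64-encoded data); the transformation applied here is illustrative:

    # Firehose record-transformation Lambda sketch.
    import base64
    import json

    def lambda_handler(event, context):
        output = []
        for record in event["records"]:
            payload = json.loads(base64.b64decode(record["data"]))
            payload["processed"] = True  # placeholder transformation
            output.append({
                "recordId": record["recordId"],  # must echo the incoming ID
                "result": "Ok",
                "data": base64.b64encode(
                    (json.dumps(payload) + "\n").encode("utf-8")
                ).decode("utf-8"),
            })
        return {"records": output}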
Profile aggregation – When you’ve uniquely identified a customer, you can build applications in Managed Service for Apache Flink to consolidate all their metadata, from name to interaction history. Then, you transform this data into a concise format. Let’s find out what role each of these components plays in the context of C360.
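As a plain-Python illustration of that consolidation logic (a stand-in, not Flink code), assuming hypothetical events keyed by a customer ID:

    # Stand-in sketch: merge per-customer events into a single profile.
    from collections import defaultdict

    events = [  # hypothetical keyed event stream
        {"customer_id": "c1", "name": "Ana", "channel": "web"},
        {"customer_id": "c1", "channel": "email"},
        {"customer_id": "c2", "name": "Ben", "channel": "store"},
    ]

    profiles = defaultdict(lambda: {"interactions": []})
    for e in events:
        p = profiles[e["customer_id"]]
        if "name" in e:
            p["name"] = e["name"]               # keep the latest known name
        p["interactions"].append(e["channel"])  # accumulate interaction history

    print(dict(profiles))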
Apache Hive is a SQL-based data warehouse system for processing highly distributed datasets on the Apache Hadoop platform. The Hive metastore is a repository of metadata about the SQL tables, such as database names, table names, schemas, serialization and deserialization information, data locations, and partition details of each table.
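A hedged PySpark sketch of reading that metadata back through the metastore; the database and table names are placeholders, and Hive support must be enabled:

    # Sketch: inspect metastore-tracked metadata via Spark SQL.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    spark.sql("SHOW TABLES IN sales_db").show()  # database name is assumed
    # Schema, location, and SerDe details come back from the metastore:
    spark.sql("DESCRIBE FORMATTED sales_db.orders").show(truncate=False)
    spark.sql("SHOW PARTITIONS sales_db.orders").show()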
This team or domain expert will be responsible for the data produced by the team. The data itself is then treated as a product. The data product is not just the data itself but also the metadata that surrounds it; the simple stuff like schema is a given.
Supported AI models and services: The SQL AI Assistant is not bundled with a specific LLM; instead, it supports various LLMs and hosting services. The model can run locally, be hosted on CML infrastructure, or run in the infrastructure of a trusted service provider. Log in to the Cloudera Data Warehouse service as DWAdmin.
Apache Hive is a distributed, fault-tolerant data warehouse system that enables analytics at a massive scale. Spark SQL is an Apache Spark module for structured data processing. Connection details are read from AWS Secrets Manager, for example:

    export HOST=$(aws secretsmanager get-secret-value --secret-id $secret_name --query SecretString --output text | jq -r '.host')
    export PASSWORD=$(aws secretsmanager get-secret-value --secret-id $secret_name --query SecretString --output text | jq -r '.password')
At the same time, they need to optimize operational costs to unlock the value of this data for timely insights, and to do so with consistent performance. With this massive data growth, data proliferation across your data stores, data warehouse, and data lakes can become equally challenging.
CDP Public Cloud leverages the elastic nature of the cloud hosting model to align spend on the Cloudera subscription (measured in Cloudera Consumption Units, or CCUs) with actual usage of the platform.
Thousands of customers rely on Amazon Redshift to build data warehouses that accelerate time to insights with fast, simple, and secure analytics at scale, and to analyze data from terabytes to petabytes by running complex analytical queries. Data loading is one of the key aspects of maintaining a data warehouse.
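As a sketch of a typical load step, a COPY from S3 can be issued through the Redshift Data API; the workgroup, table, bucket, and IAM role names are assumptions:

    # Illustrative sketch: load Parquet files from S3 into a Redshift table.
    import boto3

    rsd = boto3.client("redshift-data")

    copy_sql = """
    COPY sales.daily_orders
    FROM 's3://my-bucket/incoming/orders/'
    IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftCopyRole'
    FORMAT AS PARQUET;
    """

    rsd.execute_statement(
        WorkgroupName="my-workgroup",  # assumed Redshift Serverless workgroup
        Database="dev",                # placeholder database
        Sql=copy_sql,
    )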
2020 saw us host our first ever fully digital Data Impact Awards ceremony, and it certainly was one of the highlights of our year. We saw a record number of entries and incredible examples of how customers were using Cloudera’s platform and services to unlock the power of data.
Data governance shows up as the fourth-most-popular kind of solution that enterprise teams were adopting or evaluating during 2019. That’s a lot of priorities – especially when you group together closely related items such as data lineage and metadata management, which rank nearby.
It is supported by querying, governance, and open data formats to access and share data across the hybrid cloud. Through workload optimization across multiple query engines and storage tiers, organizations can reduce data warehouse costs by up to 50 percent. Later this year, it will leverage watsonx.ai.
The modern data stack is a combination of various software tools used to collect, process, and store data on a well-integrated cloud-based data platform. It is known for its benefits in handling data: robustness, speed, and scalability. A typical modern data stack is built around a data warehouse.
There are now tens of thousands of instances of these Big Data platforms running in production around the world today, and the number is increasing every year. Many of them are increasingly deployed outside of traditional data centers in hosted, “cloud” environments. But the “elephant in the room” is NOT ‘Hadoop’.
The Delta tables created by the EMR Serverless application are exposed through the AWS Glue Data Catalog and can be queried through Amazon Athena. Athena supports reading native Delta tables and therefore we can read the data successfully even though the Data Catalog shows only a single array column.
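A minimal boto3 sketch of that query path through Athena; the database, table, and results location are assumptions:

    # Sketch: run an Athena query against a Delta table exposed via the Data Catalog.
    import boto3

    athena = boto3.client("athena")

    resp = athena.start_query_execution(
        QueryString="SELECT * FROM delta_db.events LIMIT 10",  # assumed names
        QueryExecutionContext={"Database": "delta_db"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    print(resp["QueryExecutionId"])  # poll get_query_execution until it completes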
The net result is much improved productivity for data engineers, data scientists, and analysts. Unified – conceptually, the cloud sounds like a single place to host diverse, data-intensive functions. The ability to discover and define metadata for the business is a critical enabler of self-service functions.
When evaluating DSPM solutions, look for one that not only extends to all major cloud service providers, but also reads from various databases, data pipelines, object storage, disk storage, managed file storage, data warehouses, lakes, and analytics pipelines, both managed and self-hosted.
Some argue that data governance and quality practices may vary between domains. Duplication of data, too, may become a problem, as siloed patterns emerge unique to the domains that host them. Principle #2: Data as a product. A data catalog is essential for several of these capabilities, according to Thoughtworks.
On January 4th I had the pleasure of hosting a webinar titled The Gartner 2021 Leadership Vision for Data & Analytics Leaders, aimed at the Chief Data Officer, or head of data and analytics. See recorded webinars: Emerging Practices for a Data-driven Strategy. Link Data to Business Outcomes.
In “The modern data stack is dead, long live the modern data stack!” the presenters elaborated on the common pain points of the cloud data warehouse today and predicted what it may look like in the future. Cloud costs are growing prohibitive.
An on-premises solution provides a high level of control and customization, as it is hosted and managed within the organization’s physical infrastructure, but it can be expensive to set up and maintain. This includes cleaning, aggregating, enriching, and restructuring data to fit the desired format.
The role of traditional BI platforms is to collect data from various business systems. These sit on top of data warehouses that are strictly governed by IT departments. Metadata: self-service analysis is made easy with user-friendly naming conventions for tables and columns.
Solution overview: For our use case, an enterprise data warehouse with business data is hosted on an on-premises TiDB platform; TiDB, an AWS Global Partner, is also available on AWS through AWS Marketplace. Typically, there are four layers in a data warehouse design.