This is part two of a three-part series on building a data lake on AWS using a modern data architecture. This post shows how to load data from a legacy database (SQL Server) into a transactional data lake (Apache Iceberg) using AWS Glue.
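A minimal sketch of that pattern is shown below: an AWS Glue (PySpark) job reads a legacy SQL Server table through a Glue JDBC connection and writes it to an Iceberg table. The connection, table, and catalog names are hypothetical, not the series' actual configuration, and the job is assumed to be configured with Iceberg support (for example, via --datalake-formats iceberg).

```python
# Hypothetical Glue job: SQL Server -> Apache Iceberg on S3.
import sys

from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())

# Read the source table using a preconfigured Glue connection.
src = glue_context.create_dynamic_frame.from_options(
    connection_type="sqlserver",
    connection_options={
        "useConnectionProperties": "true",
        "connectionName": "legacy-sqlserver",  # hypothetical connection
        "dbtable": "dbo.orders",               # hypothetical table
    },
)

# Write to an Iceberg table registered in the Glue Data Catalog.
src.toDF().writeTo("glue_catalog.lakehouse.orders").createOrReplace()
```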
A modern data architecture enables companies to ingest virtually any type of data through automated pipelines into a data lake, which provides highly durable and cost-effective object storage at petabyte or exabyte scale.
Over the years, organizations have invested in creating purpose-built, cloud-based data lakes that are siloed from one another. A major challenge is enabling cross-organization discovery of and access to data across these multiple data lakes, each built on a different technology stack.
A modern data architecture is an evolutionary architecture pattern designed to integrate a data lake, data warehouse, and purpose-built stores under a unified governance model. The company also wanted the ability to continue processing operational data in a secondary Region in the rare event of a primary Region failure.
We also examine how centralized, hybrid, and decentralized data architectures support scalable, trustworthy ecosystems. As data-centric AI, automated metadata management, and privacy-aware data sharing mature, the opportunity to embed data quality into the enterprise's core has never been more significant.
With a unified data catalog, you can quickly search datasets and determine their schema, format, and location. The AWS Glue Data Catalog provides a uniform repository where disparate systems can store and find metadata to keep track of data in data silos. Refer to Catalogs for more information.
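For example, a short boto3 sketch like the following can pull a table's schema, format, and location from the Glue Data Catalog; the database and table names are made up for illustration.

```python
# Hedged sketch: inspect a table registered in the Glue Data Catalog.
import boto3

glue = boto3.client("glue")

table = glue.get_table(DatabaseName="sales_db", Name="orders")["Table"]
sd = table["StorageDescriptor"]

print("location:", sd["Location"])
print("format:  ", sd.get("InputFormat"))
for col in sd["Columns"]:
    print(f"{col['Name']}: {col['Type']}")
```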
Data architect Armando Vázquez identifies eight common types of data architect. Enterprise data architect: oversees an organization's overall data architecture, defining data architecture strategy and designing and implementing architectures.
HBL started its data journey in 2019, when a data lake initiative was launched to consolidate complex data sources and give the bank a single version of the truth for decision-making. CDP Private Cloud's new approach to data management and analytics would allow HBL to access powerful self-service analytics.
We split the solution into two primary components: generating Spark job metadata and running the SQL on Amazon EMR. The first component (metadata setup) consumes existing Hive job configurations and generates metadata such as the number of parameters, the number of actions (steps), and file formats; for example, the sql_path field holds the SQL file name.
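An illustrative sketch of that metadata-driven pattern follows: a small driver reads the generated job metadata, renders each step's parameters into its SQL file, and runs the statements with Spark. The metadata layout below is an assumption, not the authors' actual schema.

```python
# Hypothetical driver for metadata-driven SQL execution on EMR.
import json

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("metadata-driven-sql").getOrCreate()

# Assumed layout, e.g.:
# {"steps": [{"sql_path": "step1.sql",
#             "params": {"run_date": "2024-01-01"}}]}
with open("job_metadata.json") as f:
    metadata = json.load(f)

for step in metadata["steps"]:
    with open(step["sql_path"]) as f:
        sql = f.read().format(**step.get("params", {}))
    spark.sql(sql)  # each action (step) runs as one Spark SQL statement
```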
“By 2025, it’s estimated we’ll have 463 million terabytes of data created every day,” says Lisa Thee, data for good sector lead at Launch Consulting Group in Seattle. “But what they really need to do is fundamentally rethink how data is managed and accessed,” he says. “And key to this is the metadata management.”
To bring their customers the best deals and user experience, smava follows modern data architecture principles, with a data lake as a scalable, durable data store and purpose-built data stores for analytical processing and data consumption.
The client had recently engaged a well-known consulting company that recommended a large data catalog effort to collect all enterprise metadata and help identify all data and business issues. Modern data (and analytics) governance does not necessarily require wall-to-wall discovery of your data and metadata.
Cloudera Data Warehouse is a highly scalable service that marries the SQL engine technologies of Apache Impala and Apache Hive with cloud-native features to deliver best-in-class price-performance for users running data warehousing workloads in the cloud. The benchmark run by McKnight Consulting Group used the Impala engine.
HR&A Advisors, a multi-disciplinary consultancy with extensive work in the broadband and digital equity space, is helping its state, county, and municipal clients deliver affordable internet access by analyzing locally specific digital inclusion needs and building tailored digital equity plans.
Data curation is important in today’s world of data sharing and self-service analytics, but I think it is a frequently misused term. When speaking and consulting, I often hear people refer to data in their datalakes and data warehouses as curated data, believing that it is curated because it is stored as shareable data.
Optimized for all data, analytics, and AI workloads, watsonx.data combines the flexibility of a data lake with the performance of a data warehouse, helping businesses scale data analytics and AI anywhere their data resides.
Tagging: Consider tagging your Amazon Redshift resources to quickly identify which clusters and snapshots contain PII data, the owners, the data retention policy, and so on. Tags provide metadata about resources at a glance. Refer to AWS Lake Formation-managed Redshift shares for more details on the implementation.
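A hedged boto3 sketch of such tagging appears below; the cluster ARN and tag keys are illustrative, not a prescribed convention.

```python
# Tag a Redshift cluster so PII ownership and retention are visible.
import boto3

redshift = boto3.client("redshift")

redshift.create_tags(
    ResourceName="arn:aws:redshift:us-east-1:123456789012:cluster:prod",
    Tags=[
        {"Key": "contains-pii", "Value": "true"},
        {"Key": "data-owner", "Value": "analytics-team"},
        {"Key": "retention-policy", "Value": "7-years"},
    ],
)
```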
Streaming jobs constantly ingest new data to synchronize across systems, and they can perform enrichment, transformations, joins, and aggregations across windows of time more efficiently. With a file system sink connector, Apache Flink jobs can deliver data to Amazon S3 as data objects in open formats such as JSON, Avro, and Parquet.
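As a sketch of that sink pattern, the PyFlink Table API snippet below declares a filesystem sink that writes Parquet objects to S3; the schema and bucket path are assumptions for illustration.

```python
# Declare an S3 filesystem sink in Flink SQL via PyFlink.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE TABLE s3_sink (
        event_id STRING,
        event_time TIMESTAMP(3),
        payload STRING
    ) WITH (
        'connector' = 'filesystem',
        'path' = 's3://my-bucket/events/',
        'format' = 'parquet'
    )
""")

# Enriched or aggregated results from an upstream source table would
# then land in S3 via an INSERT INTO s3_sink SELECT ... statement.
```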
Today, the brightest minds in our industry are targeting the massive proliferation of data volumes and the accompanying but hard-to-find value locked within all that data. We chatted about industry trends, why decentralization has become a hot topic in the data world, and how metadata drives many data-centric use cases.
In this episode of the AI to Impact Podcast, host Pavan Kumar speaks to Prinkan Pal about the evolution of data engineering and ML operations from a closed team into a tech consulting unit. My team comprises data engineers, ML engineers, full-stack developers, and solution architects, focusing primarily on the cloud.
Cognizant Data & Intelligence Toolkit (CDIT) – ETL Conversion Tool automates this process, bringing in more predictability and accuracy, eliminating the risk associated with manual conversion, and providing faster time to market for customers. Cognizant is an AWS Premier Tier Services Partner with several AWS Competencies.
Download the Keycloak IdP SAML metadata file from that URL location. In the navigation pane under Clients, import the SAML metadata file. For Metadata document, choose Browse and upload the Keycloak IdP SAML metadata XML file you downloaded and saved to your local machine earlier.
This involves unifying and sharing a single copy of data and metadata across IBM® watsonx.data ™, IBM® Db2 ®, IBM® Db2® Warehouse and IBM® Netezza ®, using native integrations and supporting open formats, all without the need for migration or recataloging.
Gartner defines data profiling as: A technology for discovering and investigating data quality issues, such as duplication, lack of consistency, and lack of accuracy and completeness. The tools provide data statistics, such as degree of duplication and ratios of attribute values, both in tabular and graphical formats.
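Those statistics are straightforward to compute; here is a minimal pandas sketch (the file and column names are made up) that derives the degree of duplication, value ratios, and completeness the definition mentions.

```python
# Basic data profiling statistics with pandas.
import pandas as pd

df = pd.read_csv("customers.csv")

# Degree of duplication: share of rows that repeat an earlier row.
print(f"duplicate rows: {df.duplicated().mean():.1%}")

# Ratios of attribute values for one column, plus its completeness.
print(df["country"].value_counts(normalize=True))
print(f"completeness: {df['country'].notna().mean():.1%}")
```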
In the era of data, organizations are increasingly using data lakes to store and analyze vast amounts of structured and unstructured data. Data lakes provide a centralized repository for data from various sources, enabling organizations to unlock valuable insights and drive data-driven decision-making.
But refreshing this analysis with the latest data was impossible… unless you were proficient in SQL or Python. We wanted to make it easy for anyone to pull data and self-serve without the technical know-how of the underlying database or data lake. Sathish and I met in 2004 when we were working for Oracle.
To develop your disaster recovery plan, you should complete the following tasks: Define your recovery objectives for downtime and data loss (RTO and RPO) for data and metadata. For additional information on how to configure data sharing, refer to Sharing datashares.
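For illustration, here is a hedged sketch of configuring a datashare with the Redshift Data API; the cluster, database, schema, and consumer namespace GUID are all placeholders.

```python
# Create and grant a Redshift datashare from a producer cluster.
import boto3

rsd = boto3.client("redshift-data")

for sql in [
    "CREATE DATASHARE sales_share",
    "ALTER DATASHARE sales_share ADD SCHEMA sales",
    "GRANT USAGE ON DATASHARE sales_share TO NAMESPACE 'aaaa-bbbb-cccc'",
]:
    rsd.execute_statement(
        ClusterIdentifier="producer-cluster",
        Database="dev",
        DbUser="admin",
        Sql=sql,
    )
```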
Rich metadata and semantic modeling continue to drive the matching of 50K training materials to specific curricula, powering new data-driven, audience-based marketing efforts that demonstrate how the recommender service is achieving increased engagement and performance from over 2.3 million users.
I have since run and driven transformation in Reference Data, Master Data, KYC [3], Customer Data, Data Warehousing, and more recently Data Lakes and Analytics, constantly building experience and capability in the Data Governance, Quality, and data services domains, inside banks, as a consultant, and as a vendor.
This is because you will know what data you can trust, and you will have processes to create and maintain data, as well as curated metadata to exploit data's full capabilities. Can you differentiate between governance of raw data and enhanced data (information)? Where do you govern? Here's an example.
I mention this here because there was a lot of overlap between current industry data governance needs and what the scientific community is working toward for scholarly infrastructure. The gist is to leverage metadata about research datasets, projects, publications, and so on. “Nothing Spreads Like Fear.”
In this example, the analytics tool accesses the datalake on Amazon Simple Storage Service (Amazon S3) through Athena queries. As the data mesh pattern expands across domains covering more downstream services, we need a mechanism to keep IdPs and IAM role trusts continuously updated.
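A minimal boto3 sketch of that access path follows, with a hypothetical database, query, and results bucket.

```python
# Run an Athena query against the S3 data lake and fetch the results.
import time

import boto3

athena = boto3.client("athena")

run = athena.start_query_execution(
    QueryString="SELECT product, SUM(amount) AS total "
                "FROM sales GROUP BY product",
    QueryExecutionContext={"Database": "lake_db"},
    ResultConfiguration={"OutputLocation": "s3://query-results-bucket/"},
)
query_id = run["QueryExecutionId"]

# Poll until the query finishes, then fetch the result rows.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state not in ("QUEUED", "RUNNING"):
        break
    time.sleep(1)

rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
```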
As with any good consulting response: “it depends.” Do you recommend a consulting approach strategy rather than a CDO strategy? Will a data warehouse as a software tool play a role in the future of data and analytics strategy? Data lakes don't offer this, nor should they. It really does.
Amazon Redshift is a fast, scalable, and fully managed cloud data warehouse that allows you to process and run complex SQL analytics workloads on structured and semi-structured data. It also helps you securely access your data in operational databases, data lakes, or third-party datasets with minimal movement or copying of data.
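For instance, Redshift's PartiQL syntax lets one SQL statement navigate a semi-structured SUPER column alongside ordinary columns; the table and column below are invented for illustration.

```python
# Query a semi-structured SUPER column via the Redshift Data API.
import boto3

rsd = boto3.client("redshift-data")

rsd.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="dev",
    DbUser="admin",
    Sql="""
        SELECT o.order_id, o.details.shipping.city
        FROM orders o
        WHERE o.details.total > 100
    """,
)
```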
Data governance involves establishing clear ownership and accountability for data, including defining roles, responsibilities, and decision-making authority related to data management. A mechanism to handle the custom authorization flow when a consumer triggers the subscription grant to an environment.
However, a closer look reveals that these systems are far more than simple repositories: Data catalogs are at the forefront of bringing AI into your business for at least two reasons. However, lineage information and comprehensive metadata are also crucial to document and assess AI models holistically in the domain of AI governance.