Traditional data management—wherein each business unit ingests raw data into separate data lakes or warehouses—hinders visibility and cross-functional analysis. A data mesh framework empowers business units with data ownership and facilitates seamless sharing.
Data governance is the process of ensuring the integrity, availability, usability, and security of an organization’s data. Due to the volume, velocity, and variety of data being ingested into data lakes, it can become challenging to develop and maintain policies and procedures that ensure data governance at scale for your data lake.
Poor-quality data can lead to incorrect insights, bad decisions, and lost opportunities. AWS Glue Data Quality measures and monitors the quality of your dataset. It supports both data quality at rest and data quality in AWS Glue extract, transform, and load (ETL) pipelines.
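To make that concrete, here is a minimal sketch of defining and evaluating a Glue Data Quality ruleset with boto3; the DQDL rules, table, database, and IAM role names are illustrative assumptions, not taken from the original post.

```python
# Hypothetical sketch: define and run an AWS Glue Data Quality ruleset
# against a catalog table using boto3. All names are placeholders.
import boto3

glue = boto3.client("glue")

# DQDL ruleset with example completeness and validity checks.
ruleset = """
Rules = [
    IsComplete "order_id",
    ColumnValues "status" in ["PENDING", "SHIPPED", "DELIVERED"],
    RowCount > 0
]
"""

glue.create_data_quality_ruleset(
    Name="orders_dq_ruleset",  # hypothetical ruleset name
    Ruleset=ruleset,
    TargetTable={"DatabaseName": "sales_db", "TableName": "orders"},
)

# Evaluate the ruleset against the table ("data quality at rest").
run = glue.start_data_quality_ruleset_evaluation_run(
    DataSource={"GlueTable": {"DatabaseName": "sales_db", "TableName": "orders"}},
    Role="arn:aws:iam::123456789012:role/GlueDQRole",  # placeholder role ARN
    RulesetNames=["orders_dq_ruleset"],
)
print("Evaluation run started:", run["RunId"])
```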
With data becoming the driving force behind many industries today, having a modern data architecture is pivotal for organizations to be successful. In this post, we describe Orca’s journey building a transactional data lake using Amazon Simple Storage Service (Amazon S3), Apache Iceberg, and AWS Analytics.
Domain ownership recognizes that the teams generating the data have the deepest understanding of it and are therefore best suited to manage, govern, and share it effectively. This principle ensures data accountability remains close to the source, fostering higher data quality and relevance.
In modern data architectures, Apache Iceberg has emerged as a popular table format for data lakes, offering key features including ACID transactions and concurrent write support. Both operations target the same partition based on customer_id, leading to potential conflicts because they're modifying an overlapping dataset.
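As a hedged illustration of handling such conflicts, the sketch below retries an Iceberg UPDATE in PySpark when the optimistic commit fails; the table name, catalog configuration, and exception-matching heuristic are assumptions, not the post's actual code.

```python
# Minimal sketch of retrying on Iceberg optimistic-concurrency conflicts
# in PySpark. Table and column names are illustrative.
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes an Iceberg-enabled catalog

def update_with_retry(customer_id: int, new_status: str, retries: int = 3) -> None:
    """Retry when concurrent writers touch the same partition."""
    for attempt in range(retries):
        try:
            spark.sql(f"""
                UPDATE demo.db.orders          -- hypothetical Iceberg table
                SET status = '{new_status}'
                WHERE customer_id = {customer_id}
            """)
            return
        except Exception as exc:
            # Iceberg surfaces commit conflicts as CommitFailedException;
            # back off and re-attempt against the refreshed table state.
            if "CommitFailedException" not in str(exc) or attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)
```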
These formats, exemplified by Apache Iceberg, Apache Hudi, and Delta Lake, address persistent challenges in traditional data lake structures by offering an advanced combination of flexibility, performance, and governance capabilities. On the Delta Lake front, AWS Glue 5.0 supports Delta Lake 3.2.1.
To make sure your BI and agile data analytics methodologies are successfully implemented and deliver actual business value, here are some extra tips to keep you on track and cover every important point in the process, starting with the stakeholders: active stakeholder engagement.
On the importance of company data for generative AI, McKinsey stated that “If your data isn’t ready for generative AI, your business isn’t ready for generative AI.” In this post, we present a framework to implement generative AI applications enriched and differentiated with your data.
Finding similar columns in a data lake has important applications in data cleaning and annotation, schema matching, data discovery, and analytics across multiple data sources. In this example, we searched for columns in our data lake with column names (payload type) similar to district (payload).
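A rough sketch of the idea: embed column names and rank candidates by cosine similarity. The embed() function here is a hypothetical stand-in for whatever embedding model the pipeline calls (for example, a SageMaker endpoint); it is not the post's actual API.

```python
# Illustrative sketch: rank data lake columns by name similarity using
# embeddings and cosine similarity. embed() is a hypothetical callable
# that maps a string to a vector.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def find_similar_columns(query: str, catalog_columns: list[str], embed) -> list[tuple[str, float]]:
    """Return catalog columns sorted by similarity to the query column name."""
    q_vec = embed(query)
    scored = [(col, cosine(q_vec, embed(col))) for col in catalog_columns]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# e.g., find_similar_columns("district", ["region", "zip_code", "county"], embed)
```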
One of the bank’s key challenges related to strict cybersecurity requirements is implementing field-level encryption for personally identifiable information (PII), Payment Card Industry (PCI) data, and data classified as high privacy risk (HPR). Only users with the required permissions are allowed to access data in clear text.
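As a simplified sketch of field-level encryption (not the bank's actual implementation), the snippet below encrypts only designated sensitive fields using the cryptography library; in practice the data key would typically come from a KMS rather than being generated locally, and the field names are invented.

```python
# Hedged sketch of field-level encryption for PII columns using Fernet.
# In production, fetch the data key from a KMS instead of generating it here.
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # placeholder for a KMS-managed data key
cipher = Fernet(key)

PII_FIELDS = {"ssn", "card_number"}  # illustrative PII/PCI/HPR field names

def encrypt_record(record: dict) -> dict:
    """Encrypt only the sensitive fields; leave everything else in clear text."""
    return {
        k: cipher.encrypt(v.encode()).decode() if k in PII_FIELDS else v
        for k, v in record.items()
    }

def decrypt_field(record: dict, field: str) -> str:
    """Callable only by users whose permissions allow clear-text access."""
    return cipher.decrypt(record[field].encode()).decode()
```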
Data governance is increasingly top-of-mind for customers as they recognize data as one of their most important assets. Effective data governance enables better decision-making by improving data quality, reducing data management costs, and ensuring secure access to data for stakeholders.
The company’s orthodontics business, for instance, makes such heavy use of image processing that unstructured data is growing at roughly 20% to 25% per month. Advances in imaging technology present Straumann Group with the opportunity to give its customers new capabilities to offer their own clients.
The ability to pivot quickly to address rapidly changing customer or market demands is driving the need for real-time data. But poor data quality, siloed data, entrenched processes, and cultural resistance often present roadblocks to using data to speed up decision making and innovation.
Data observability provides insight into the condition and evolution of data resources from source through the delivery of the data products. Barr Moses of Monte Carlo presents it as a combination of data flow, data quality, data governance, and data lineage.
Data has become an invaluable asset for businesses, offering critical insights to drive strategic decision-making and operational optimization. From establishing an enterprise-wide data inventory and improving data discoverability, to enabling decentralized data sharing and governance, Amazon DataZone has been a game changer for HEMA.
Migrating to Amazon Redshift offers organizations the potential for improved price-performance, enhanced data processing, faster query response times, and better integration with technologies such as machine learning (ML) and artificial intelligence (AI).
Access to digitized records as well as analytics and AI tools gives public defenders like Cox the time required to present clients’ cases more thoroughly and build a better defense, which in Johnny’s case led to a treatment program instead of prison time.
Mark: The first element in the process is the link between the source data and the entry point into the data platform. At Ramsey International (RI), we refer to that layer in the architecture as the foundation, but others call it a staging area, raw zone, or even a source data lake.
“With each game release and update, the amount of unstructured data being processed grows exponentially. This volume of data poses serious challenges in terms of storage and efficient processing,” Konoval says. To address this problem, RetroStyle Games invested in data lakes. Quality is job one.
As part of their cloud modernization initiative, they sought to migrate and modernize their legacy data platform. This process has been scheduled to run daily, ensuring a consistent batch of fresh data for analysis. AWS Glue – AWS Glue is used to load files into Amazon Redshift through the S3 data lake.
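One plausible shape for that daily load step, sketched here with the Redshift Data API; the cluster, database, secret, schema, and S3 paths are placeholders, and the original pipeline may instead use a Glue connection to Redshift.

```python
# Sketch: issue a COPY from the S3 data lake into Redshift via the
# Redshift Data API. All identifiers below are placeholders.
import boto3

rsd = boto3.client("redshift-data")

copy_sql = """
    COPY analytics.daily_sales
    FROM 's3://example-datalake/curated/daily_sales/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS PARQUET;
"""

resp = rsd.execute_statement(
    ClusterIdentifier="analytics-cluster",  # or WorkgroupName for Serverless
    Database="prod",
    SecretArn="arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-creds",
    Sql=copy_sql,
)
print("Statement id:", resp["Id"])
```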
It’s common to ingest multiple data sources into Amazon Redshift to perform analytics. Often, each data source will have its own processes for creating and maintaining data, which can lead to data quality challenges within and across sources (for example, verifying that all URIDs are present).
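A toy illustration of such a cross-source consistency check: flag identifiers present in one ingested source but missing from another. The source names and the urid column are assumptions based on the excerpt.

```python
# Cross-source check: find URIDs that exist in one source but not the other.
import pandas as pd

crm = pd.DataFrame({"urid": ["U1", "U2", "U3"]})      # illustrative source A
billing = pd.DataFrame({"urid": ["U1", "U3", "U4"]})  # illustrative source B

# Anti-joins in both directions expose the mismatches.
missing_in_billing = crm[~crm["urid"].isin(billing["urid"])]
missing_in_crm = billing[~billing["urid"].isin(crm["urid"])]

print(missing_in_billing)  # U2 exists in CRM but not billing
print(missing_in_crm)      # U4 exists in billing but not CRM
```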
Amazon Redshift is a popular cloud data warehouse, offering a fully managed cloud-based service that seamlessly integrates with an organization’s Amazon Simple Storage Service (Amazon S3) data lake, real-time streams, machine learning (ML) workflows, transactional workflows, and much more—all while providing up to 7.9x better price-performance.
We also have a blended architecture of deep process capabilities in our SAP system and decision-making capabilities in our Microsoft tools, and a great base of information in our integrated data hub, or data lake, which is all Microsoft-based. That’s what we’re running our AI and our machine learning against.
Improved Decision Making: Well-modeled data provides insights that drive informed decision-making across various business domains, resulting in enhanced strategic planning. Reduced Data Redundancy: By eliminating data duplication, it optimizes storage and enhances data quality, reducing errors and discrepancies.
Big Data technology in today’s world. Did you know that the big data and business analytics market is valued at $198.08 billion? Or that the US economy loses up to $3 trillion per year due to poor data quality? Or that the world generates quintillions of bytes of data every day, which means an average person generates over 1.5 megabytes of data every second?
Modern data catalogs also facilitate data quality checks. Historically restricted to the purview of data engineers, data quality information is essential for all user groups to see. Cataloging data science projects in this way is critical to helping them generate value for the company.
It proposes a technological, architectural, and organizational approach to solving data management problems by breaking up the monolithic data platform and decentralizing data management across different domain teams and services. Some examples of data products are data sets, tables, machine learning models, and APIs.
Unless, of course, the rest of their data also resides in the Google Cloud. In this post we showcase how we used AWS Glue to move siloed digital analytics data, with inconsistent arrival times, to Amazon S3 (our data lake) and our central data warehouse (DWH), Snowflake. It consists of full-day and intraday tables.
Equally crucial is the ability to segregate and audit problematic data, not just for maintaining data integrity, but also for regulatory compliance, error analysis, and potential data recovery. One of its key features is the ability to manage data using branches.
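Assuming the table format in question is Apache Iceberg (which exposes this capability as branches), a minimal write-audit-publish sketch might look like the following; the catalog, table, and path names are illustrative, and fast_forward is an Iceberg Spark procedure.

```python
# Write-audit-publish sketch with Iceberg branches: write incoming data to
# an audit branch, validate it in isolation, then publish to main.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes Iceberg extensions/catalog configured
incoming_df = spark.read.parquet("s3://example-bucket/incoming/")  # placeholder source

# 1. Create an audit branch and route writes to it instead of main.
spark.sql("ALTER TABLE demo.db.events CREATE BRANCH IF NOT EXISTS audit")
spark.conf.set("spark.wap.branch", "audit")
incoming_df.writeTo("demo.db.events").append()

# 2. Run data quality checks against the branch (segregate/audit problem
#    data here), then publish by fast-forwarding main to the audited branch.
spark.sql("CALL demo.system.fast_forward('db.events', 'main', 'audit')")
```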
Start where your data is. Using your own enterprise data is the major differentiator from open-access gen AI chat tools, so it makes sense to start with the provider already hosting your enterprise data. Organizations with experience building enterprise data lakes connecting to many different data sources have AI advantages.
Additionally, the scale is significant because the multi-tenant data sources provide a continuous stream of testing activity, and our users require quick data refreshes as well as historical context for up to a decade due to compliance and regulatory demands. Finally, data integrity is of paramount importance.
Data mesh solves this by promoting data autonomy, allowing users to make decisions about domains without a centralized gatekeeper. It also improves development velocity through better data governance and access, with improved data quality aligned to business needs.
Control of Data to ensure it is Fit-for-Purpose. This refers to a wide range of activities from Data Governance to Data Management to Data Quality improvement and indeed related concepts such as Master Data Management. When I first started focussing on the data arena, Data Warehouses were state of the art.
Data quality strongly impacts the quality and usefulness of content produced by an AI model, underscoring the significance of addressing data challenges. It provides the combination of data lake flexibility and data warehouse performance to help to scale AI.
Businesses face significant hurdles when preparing data for artificial intelligence (AI) applications. The existence of data silos and duplication, alongside apprehensions regarding data quality, presents a multifaceted environment for organizations to manage.
Today, the brightest minds in our industry are targeting the massive proliferation of data volumes and the accompanying but hard-to-find value locked within all that data. So we have to be very careful about giving the domains the right and authority to fix data quality. Let’s take data privacy as an example.
I have since run and driven transformation in Reference Data, Master Data, KYC [3], Customer Data, Data Warehousing and more recently Data Lakes and Analytics, constantly building experience and capability in the Data Governance, Quality and data services domains, both inside banks, as a consultant and as a vendor.
Sathish Raju, cofounder & CTO, Kloudio and senior director of engineering, Alation: This presents challenges for both business users and data teams. It’s impossible for data teams to assure the data quality of such spreadsheets and govern them all effectively.
Daily, data analysts engage in various tasks tailored to their organization’s needs, including identifying efficiency improvements, conducting sector and competitor benchmarking, and implementing tools for data validation. Showcase relevant work experiences, even if they may not directly align with the internship role.
“We are proceeding cautiously because the rise of LLMs [large language models] presents a new level of data security risk,” he says. “This ensures that none of our sensitive data and intellectual property are availed to an outside provider.” AI tools rely on the data in use in these solutions.
The key components of a data pipeline are typically: Data Sources, the origin of the data, such as a relational database, data warehouse, data lake, file, API, or other data store; and processing steps, which can include tasks such as data ingestion, cleansing, filtering, aggregation, or standardization.
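To tie the stages together, here is a minimal, hypothetical pipeline sketch in pandas covering ingestion, cleansing/standardization, aggregation, and loading; all paths and column names are invented for illustration.

```python
# Minimal pipeline sketch mirroring the stages above: ingest from a source,
# cleanse/standardize, aggregate, and land in a target store.
import pandas as pd

def ingest(path: str) -> pd.DataFrame:
    return pd.read_csv(path)                      # source: file/API/database extract

def cleanse(df: pd.DataFrame) -> pd.DataFrame:
    df = df.dropna(subset=["order_id"])           # filter out bad records
    df["country"] = df["country"].str.upper()     # standardize values
    return df

def aggregate(df: pd.DataFrame) -> pd.DataFrame:
    return df.groupby("country", as_index=False)["amount"].sum()

def run_pipeline(src: str, dest: str) -> None:
    aggregate(cleanse(ingest(src))).to_parquet(dest)  # load into the target

# run_pipeline("orders.csv", "s3://example-bucket/curated/orders_by_country.parquet")
```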
Each of the three parts starts with chapters that are theoretical and finishes with more practical ones to make sense of all the concepts and knowledge previously presented, which is something that readers really enjoy about Nathan Marz’s work. – Eric Siegel, author, and founder of Predictive Analytics World.
The mega-vendor era. By 2020, the basis of competition for what are now referred to as mega-vendors was interoperability, automation, intra-ecosystem participation, and unlocking access to data to drive business capabilities, deliver value, and manage risk.