Data Lake, Data Quality and Reference

The next generation of Amazon SageMaker: The center for all your data, analytics, and AI

AWS Big Data

DECEMBER 4, 2024

SageMaker still includes all the existing ML and AI capabilities you’ve come to know and love for data wrangling, human-in-the-loop data labeling with Amazon SageMaker Ground Truth , experiments, MLOps, Amazon SageMaker HyperPod managed distributed training, and more. Having confidence in your data is key.

Data Analytics

Data Analytics Analytics Data Lake Data Quality

Data Lakes on Cloud & it’s Usage in Healthcare

BizAcuity

MARCH 29, 2019

Data lakes are centralized repositories that can store all structured and unstructured data at any desired scale. The power of the data lake lies in the fact that it often is a cost-effective way to store data. The power of the data lake lies in the fact that it often is a cost-effective way to store data.

Data Lake

Data Lake Unstructured Data Cost-Benefit Data Quality

Seamless integration of data lake and data warehouse using Amazon Redshift Spectrum and Amazon DataZone

AWS Big Data

AUGUST 15, 2024

Unlocking the true value of data often gets impeded by siloed information. Traditional data management—wherein each business unit ingests raw data in separate data lakes or warehouses—hinders visibility and cross-functional analysis. Amazon DataZone natively supports data sharing for Amazon Redshift data assets.

Data Lake

Data Lake Data Warehouse Data Governance Publishing

Webinars

Automation, Evolved: Your New Playbook For Smarter Knowledge Work

MORE WEBINARS

Perform data parity at scale for data modernization programs using AWS Glue Data Quality

AWS Big Data

OCTOBER 9, 2024

Today, customers are embarking on data modernization programs by migrating on-premises data warehouses and data lakes to the AWS Cloud to take advantage of the scale and advanced analytical capabilities of the cloud. Some customers build custom in-house data parity frameworks to validate data during migration.

Data Quality

Data Quality Data Lake Data Warehouse Metrics

Automated data governance with AWS Glue Data Quality, sensitive data detection, and AWS Lake Formation

AWS Big Data

OCTOBER 10, 2023

Due to the volume, velocity, and variety of data being ingested in data lakes, it can get challenging to develop and maintain policies and procedures to ensure data governance at scale for your data lake. Data confidentiality and data quality are the two essential themes for data governance.

Data Quality

Data Quality Data Governance Data Lake Testing

AWS Glue Data Quality is Generally Available

AWS Big Data

JUNE 6, 2023

We are excited to announce the General Availability of AWS Glue Data Quality. Our journey started by working backward from our customers who create, manage, and operate data lakes and data warehouses for analytics and machine learning. It takes days for data engineers to identify and implement data quality rules.

Data Quality

Data Quality Statistics Data Lake Visualization

Get started with AWS Glue Data Quality dynamic rules for ETL pipelines

AWS Big Data

MAY 23, 2024

They establish data quality rules to ensure the extracted data is of high quality for accurate business decisions. These rules assess the data based on fixed criteria reflecting current business states. We are excited to talk about how to use dynamic rules , a new capability of AWS Glue Data Quality.

Data Quality

Data Quality Metrics Data Lake Sales

Use open table format libraries on AWS Glue 5.0 for Apache Spark

AWS Big Data

DECEMBER 4, 2024

These formats, exemplified by Apache Iceberg, Apache Hudi, and Delta Lake, addresses persistent challenges in traditional data lake structures by offering an advanced combination of flexibility, performance, and governance capabilities. In this post, we highlight notable updates on Iceberg, Hudi, and Delta Lake in AWS Glue 5.0.

Snapshot

Snapshot Metadata Data Lake Optimization

Visualize data quality scores and metrics generated by AWS Glue Data Quality

AWS Big Data

JUNE 6, 2023

AWS Glue Data Quality allows you to measure and monitor the quality of data in your data repositories. It’s important for business users to be able to see quality scores and metrics to make confident business decisions and debug data quality issues. An AWS Glue crawler crawls the results.

Data Quality

Data Quality Metrics Visualization Dashboards

Measure performance of AWS Glue Data Quality for ETL pipelines

AWS Big Data

MARCH 12, 2024

In recent years, data lakes have become a mainstream architecture, and data quality validation is a critical factor to improve the reusability and consistency of the data. In this post, we provide benchmark results of running increasingly complex data quality rulesets over a predefined test dataset.

Data Quality

Data Quality Measurement Testing Visualization

Simplify data integration with AWS Glue and zero-ETL to Amazon SageMaker Lakehouse

AWS Big Data

DECEMBER 4, 2024

With this new functionality, customers can create up-to-date replicas of their data from applications such as Salesforce, ServiceNow, and Zendesk in an Amazon SageMaker Lakehouse and Amazon Redshift. SageMaker Lakehouse gives you the flexibility to access and query your data in-place with all Apache Iceberg compatible tools and engines.

Data Integration

Data Integration Data Lake Statistics Data-driven

Set up advanced rules to validate quality of multiple datasets with AWS Glue Data Quality

AWS Big Data

JUNE 6, 2023

Poor-quality data can lead to incorrect insights, bad decisions, and lost opportunities. AWS Glue Data Quality measures and monitors the quality of your dataset. It supports both data quality at rest and data quality in AWS Glue extract, transform, and load (ETL) pipelines.

Data Quality

Data Quality Data Lake Visualization Data-driven

Getting started with AWS Glue Data Quality from the AWS Glue Data Catalog

AWS Big Data

JUNE 6, 2023

You can use AWS Glue to create, run, and monitor data integration and ETL (extract, transform, and load) pipelines and catalog your assets across multiple data stores. Hundreds of thousands of customers use data lakes for analytics and ML to make data-driven business decisions.

Data Quality

Data Quality Data-driven Data Lake Metrics

Manage concurrent write conflicts in Apache Iceberg on the AWS Glue Data Catalog

AWS Big Data

APRIL 8, 2025

In modern data architectures, Apache Iceberg has emerged as a popular table format for data lakes, offering key features including ACID transactions and concurrent write support. For more detailed configuration, refer to Write properties in the Iceberg documentation.

Snapshot

Snapshot Management Metadata Big Data

Data governance in the age of generative AI

AWS Big Data

FEBRUARY 29, 2024

To provide a response that includes the enterprise context, each user prompt needs to be augmented with a combination of insights from structured data from the data warehouse and unstructured data from the enterprise data lake. Implement data privacy policies. Implement data quality by data type and source.

Data Governance

Data Governance Unstructured Data Metadata Data Lake

Foundational blocks of Amazon SageMaker Unified Studio: An admin’s guide to implement unified access to all your data, analytics, and AI

AWS Big Data

FEBRUARY 13, 2025

This plane drives users to engage in data-driven conversations with knowledge and insights shared across the organization. Through the product experience plane, data product owners can use automated workflows to capture data lineage and data quality metrics and oversee access controls.

Data Analytics

Data Analytics Analytics Modeling Data-driven

AWS Lake Formation 2023 year in review

AWS Big Data

JANUARY 18, 2024

AWS Lake Formation and the AWS Glue Data Catalog form an integral part of a data governance solution for data lakes built on Amazon Simple Storage Service (Amazon S3) with multiple AWS analytics services integrating with them. To learn more about DataZone, refer to the User Guide. Bienvenue dans DataZone!

Data Lake

Data Lake Metadata Data Governance Statistics

AWS Lake Formation 2022 year in review

AWS Big Data

JANUARY 31, 2023

Data governance is increasingly top-of-mind for customers as they recognize data as one of their most important assets. Effective data governance enables better decision-making by improving data quality, reducing data management costs, and ensuring secure access to data for stakeholders.

Data Lake

Data Lake Data Governance Data Architecture Machine Learning

A Day in the Life of a DataOps Engineer

DataKitchen

OCTOBER 11, 2021

Figure 2: Example data pipeline with DataOps automation. In this project, I automated data extraction from SFTP, the public websites, and the email attachments. The automated orchestration published the data to an AWS S3 Data Lake. Historic Balance – compares current data to previous or expected values.

Testing

Testing Metadata Dashboards Statistics

Constructing A Digital Transformation Strategy: Putting the Data in Digital Transformation

erwin

JULY 17, 2019

Outsourcing these data management efforts to professional services firms only delays schedules and increases costs. With automation, data quality is systemically assured. The data pipeline is seamlessly governed and operationalized to the benefit of all stakeholders. Digital Transformation Strategy: Smarter Data.

Digital Transformation

Digital Transformation Strategy Metadata Data-driven

How ATPCO enables governed self-service data access to accelerate innovation with Amazon DataZone

AWS Big Data

JULY 25, 2024

Solution To address the challenge, ATPCO sought inspiration from a modern data mesh architecture. ATPCO has existing data lake resources set up in multiple accounts, each owned by different data-producing teams. Enter a name, such as Sales – Data lake blueprint. Choose Accept & configure association.

Data Lake

Data Lake Metadata Sales Publishing

Introducing the technology behind watsonx.ai, IBM’s AI and data platform for enterprise

IBM Big Data Hub

MAY 9, 2023

Data: the foundation of your foundation model Data quality matters. An AI model trained on biased or toxic data will naturally tend to produce biased or toxic outputs. When objectionable data is identified, we remove it, retrain the model, and repeat. Data curation is a task that’s never truly finished.

Enterprise

Enterprise Technology Modeling Cost-Benefit

How gaming companies can use Amazon Redshift Serverless to build scalable analytical applications faster and easier

AWS Big Data

MARCH 7, 2023

Flexible and easy to use – The solutions should provide less restrictive, easy-to-access, and ready-to-use data. A data hub contains data at multiple levels of granularity and is often not integrated. It differs from a data lake by offering data that is pre-validated and standardized, allowing for simpler consumption by users.

Analytics

Analytics Data Warehouse Data Lake Metadata

Thank You Snowflake for Naming Alation the Data Governance Partner of the Year

Alation

JUNE 17, 2021

Lastly, active data governance simplifies stewardship tasks of all kinds. Tehnical stewards have the tools to monitor data quality, access, and access control. A compliance steward is empowered to monitor sensitive data and usage sharing policies at scale. The Data Swamp Problem. The Governance Solution.

Data Governance

Data Governance Data Lake Insurance Enterprise

Your 5-Step Journey from Analytics to AI

CIO Business Intelligence

MARCH 22, 2022

Which type(s) of storage consolidation you use depends on the data you generate and collect. . One option is a data lake—on-premises or in the cloud—that stores unprocessed data in any type of format, structured or unstructured, and can be queried in aggregate. Just starting out with analytics?

Analytics

Analytics Key Performance Indicator Data Warehouse Data-driven

Governing data in relational databases using Amazon DataZone

AWS Big Data

MAY 7, 2024

It also makes it easier for engineers, data scientists, product managers, analysts, and business users to access data throughout an organization to discover, use, and collaborate to derive data-driven insights. Note that a managed data asset is an asset for which Amazon DataZone can manage permissions.

Metadata

Metadata Data Lake Data Processing Data-driven

Making the gen AI and data connection work

CIO Business Intelligence

AUGUST 9, 2024

If after anonymization the level of information in the data is the same, the data is still useful. But once personal or sensitive references are removed, and the data is no longer effective, a problem arises. Synthetic data avoids these difficulties, but they’re not exempt from the need of a trade-off.

Risk

Risk Measurement Data Lake Data Collection

Addressing the Elephant in the Room – Welcome to Today’s Cloudera

Cloudera

JUNE 13, 2024

After countless open-source innovations ushered in the Big Data era, including the first commercial distribution of HDFS (Apache Hadoop Distributed File System), commonly referred to as Hadoop, the two companies joined forces, giving birth to an entire ecosystem of technology and tech companies.

Big Data

Big Data Machine Learning Contextual Data Data Lake

How AWS helped Altron Group accelerate their vision for optimized customer engagement

AWS Big Data

JULY 13, 2023

Data quality for account and customer data – Altron wanted to enable data quality and data governance best practices. Goals – Lay the foundation for a data platform that can be used in the future by internal and external stakeholders. Basic formatting and readability of the data is standardized here.

Optimization

Optimization B2B Data Quality Sales

Demystifying Modern Data Platforms

Cloudera

SEPTEMBER 15, 2022

Mark: The first element in the process is the link between the source data and the entry point into the data platform. At Ramsey International (RI), we refer to that layer in the architecture as the foundation, but others call it a staging area, raw zone, or even a source data lake.

Data Lake

Data Lake Data Architecture Data-driven Data Warehouse

Accelerate analytics on Amazon OpenSearch Service with AWS Glue through its native connector

AWS Big Data

DECEMBER 21, 2023

As the volume and complexity of analytics workloads continue to grow, customers are looking for more efficient and cost-effective ways to ingest and analyse data. AWS Glue provides both visual and code-based interfaces to make data integration effortless. For setup instructions, refer to Getting started with Amazon OpenSearch Service.

Analytics

Analytics IT Data Lake Visualization

An AI Chat Bot Wrote This Blog Post …

DataKitchen

DECEMBER 9, 2022

The goal of DataOps is to help organizations make better use of their data to drive business decisions and improve outcomes. ChatGPT> DataOps is a term that refers to the set of practices and tools that organizations use to improve the quality and speed of data analytics and machine learning.

Machine Learning

Machine Learning Data-driven Optimization Data Analytics

Unlock scalability, cost-efficiency, and faster insights with large-scale data migration to Amazon Redshift

AWS Big Data

AUGUST 1, 2024

Migrating to Amazon Redshift offers organizations the potential for improved price-performance, enhanced data processing, faster query response times, and better integration with technologies such as machine learning (ML) and artificial intelligence (AI).

Data Warehouse

Data Warehouse KPI Optimization Cost-Benefit

Use fuzzy string matching to approximate duplicate records in Amazon Redshift

AWS Big Data

FEBRUARY 8, 2023

It’s common to ingest multiple data sources into Amazon Redshift to perform analytics. Often, each data source will have its own processes of creating and maintaining data, which can lead to data quality challenges within and across sources. Answering questions as simple as “How many unique customers do we have?”

Data Quality

Data Quality Testing Data Warehouse Unstructured Data

Handle UPSERT data operations using open-source Delta Lake and AWS Glue

AWS Big Data

JANUARY 30, 2023

Many customers need an ACID transaction (atomic, consistent, isolated, durable) data lake that can log change data capture (CDC) from operational data sources. There is also demand for merging real-time data into batch data. Delta Lake framework provides these two capabilities.

Insurance

Insurance Data Lake Data-driven Management

What is an open data lakehouse and why you should care?

IBM Big Data Hub

JANUARY 17, 2023

A data lakehouse is an emerging data management architecture that improves efficiency and converges data warehouse and data lake capabilities driven by a need to improve efficiency and obtain critical insights faster. Let’s start with why data lakehouses are becoming increasingly important.

Data Lake

Data Lake Metadata Data Warehouse Data Governance

Automate large-scale data validation using Amazon EMR and Apache Griffin

AWS Big Data

APRIL 4, 2024

Griffin is an open source data quality solution for big data, which supports both batch and streaming mode. In today’s data-driven landscape, where organizations deal with petabytes of data, the need for automated data validation frameworks has become increasingly critical.

Data Quality

Data Quality Data Lake Data Warehouse Data-driven

Differentiate generative AI applications with your data using AWS analytics and managed databases

AWS Big Data

SEPTEMBER 12, 2024

The application gets prompt templates from an S3 data lake and creates the engineered prompt. The user interaction is stored in a data lake for downstream usage and BI analysis. Conclusion In this post, we discussed the importance of using customer data to differentiate generative AI usage in applications.

Management

Management Analytics Data Lake Interactive

How Amazon Devices scaled and optimized real-time demand and supply forecasts using serverless analytics

AWS Big Data

FEBRUARY 1, 2023

With data volumes exhibiting a double-digit percentage growth rate year on year and the COVID pandemic disrupting global logistics in 2021, it became more critical to scale and generate near-real-time data. You can visually create, run, and monitor extract, transform, and load (ETL) pipelines to load data into your data lakes.

Optimization

Optimization Forecasting Data Lake Metadata

Create an end-to-end data strategy for Customer 360 on AWS

AWS Big Data

MARCH 26, 2024

A Gartner Marketing survey found only 14% of organizations have successfully implemented a C360 solution, due to lack of consensus on what a 360-degree view means, challenges with data quality, and lack of cross-functional governance structure for customer data.

Data Strategy

Data Strategy Strategy Data Warehouse Prescriptive Analytics

The Enduring Significance of Data Modeling in the Modern Data-Driven Enterprise

erwin

AUGUST 31, 2023

Improved Decision Making : Well-modeled data provides insights that drive informed decision-making across various business domains, resulting in enhanced strategic planning. Reduced Data Redundancy : By eliminating data duplication, it optimizes storage and enhances data quality, reducing errors and discrepancies.

Data-driven

Data-driven Modeling Enterprise Structured Data

How Knowledge Graphs Power Data Mesh and Data Fabric

Ontotext

APRIL 10, 2024

Bad data tax is rampant in most organizations. Currently, every organization is blindly chasing the GenAI race, often forgetting that data quality and semantics is one of the fundamentals to achieving AI success. Sadly, data quality is losing to data quantity, resulting in “ Infobesity ”. “Any

Metadata

Metadata Data Lake Data Warehouse Data Quality

Data democratization: How data architecture can drive business decisions and AI initiatives

IBM Big Data Hub

AUGUST 4, 2023

In turn, they both must also have the data literacy skills to be able to verify the data’s accuracy, ensure its security, and provide or follow guidance on when and how it should be used. Then, it applies these insights to automate and orchestrate the data lifecycle.

Data Architecture

Data Architecture Data Lake Machine Learning Data Governance

The New Normal for FP&A: Data Analytics

Jedox

OCTOBER 22, 2020

The term “data analytics” refers to the process of examining datasets to draw conclusions about the information they contain. Data analysis techniques enhance the ability to take raw data and uncover patterns to extract valuable insights from it.

Data Analytics

Data Analytics Analytics Unstructured Data Data mining

The next generation of Amazon SageMaker: The center for all your data, analytics, and AI

Data Lakes on Cloud & it’s Usage in Healthcare

Webinars

Trending Sources

Seamless integration of data lake and data warehouse using Amazon Redshift Spectrum and Amazon DataZone

Webinars

Perform data parity at scale for data modernization programs using AWS Glue Data Quality

Automated data governance with AWS Glue Data Quality, sensitive data detection, and AWS Lake Formation

AWS Glue Data Quality is Generally Available

Get started with AWS Glue Data Quality dynamic rules for ETL pipelines

Use open table format libraries on AWS Glue 5.0 for Apache Spark

Visualize data quality scores and metrics generated by AWS Glue Data Quality

Measure performance of AWS Glue Data Quality for ETL pipelines

Simplify data integration with AWS Glue and zero-ETL to Amazon SageMaker Lakehouse

Set up advanced rules to validate quality of multiple datasets with AWS Glue Data Quality

Getting started with AWS Glue Data Quality from the AWS Glue Data Catalog

Manage concurrent write conflicts in Apache Iceberg on the AWS Glue Data Catalog

Data governance in the age of generative AI

Foundational blocks of Amazon SageMaker Unified Studio: An admin’s guide to implement unified access to all your data, analytics, and AI

AWS Lake Formation 2023 year in review

AWS Lake Formation 2022 year in review

A Day in the Life of a DataOps Engineer

Constructing A Digital Transformation Strategy: Putting the Data in Digital Transformation

How ATPCO enables governed self-service data access to accelerate innovation with Amazon DataZone

Introducing the technology behind watsonx.ai, IBM’s AI and data platform for enterprise

How gaming companies can use Amazon Redshift Serverless to build scalable analytical applications faster and easier

Thank You Snowflake for Naming Alation the Data Governance Partner of the Year

Your 5-Step Journey from Analytics to AI

Governing data in relational databases using Amazon DataZone

Making the gen AI and data connection work

Addressing the Elephant in the Room – Welcome to Today’s Cloudera

How AWS helped Altron Group accelerate their vision for optimized customer engagement

Demystifying Modern Data Platforms

Accelerate analytics on Amazon OpenSearch Service with AWS Glue through its native connector

An AI Chat Bot Wrote This Blog Post …

Unlock scalability, cost-efficiency, and faster insights with large-scale data migration to Amazon Redshift

­­Use fuzzy string matching to approximate duplicate records in Amazon Redshift

Handle UPSERT data operations using open-source Delta Lake and AWS Glue

What is an open data lakehouse and why you should care?

Automate large-scale data validation using Amazon EMR and Apache Griffin

Differentiate generative AI applications with your data using AWS analytics and managed databases

How Amazon Devices scaled and optimized real-time demand and supply forecasts using serverless analytics

Create an end-to-end data strategy for Customer 360 on AWS

The Enduring Significance of Data Modeling in the Modern Data-Driven Enterprise

How Knowledge Graphs Power Data Mesh and Data Fabric

Data democratization: How data architecture can drive business decisions and AI initiatives

The New Normal for FP&A: Data Analytics

Stay Connected

Use fuzzy string matching to approximate duplicate records in Amazon Redshift