Data engineers delivered over 100 lines of code and 1.5 data quality tests every day to support a cast of analysts and customers. They opted for Snowflake, a cloud-native data platform ideal for SQL-based analysis. It is necessary to have more than a data lake and a database.
As technology and business leaders, your strategic initiatives, from AI-powered decision-making to predictive insights and personalized experiences, are all fueled by data. Yet, despite growing investments in advanced analytics and AI, organizations continue to grapple with a persistent and often underestimated challenge: poor data quality.
Amazon SageMaker Lakehouse, now generally available, unifies all your data across Amazon Simple Storage Service (Amazon S3) data lakes and Amazon Redshift data warehouses, helping you build powerful analytics and AI/ML applications on a single copy of data. Having confidence in your data is key.
Unlocking the true value of data often gets impeded by siloed information. Traditional data management—wherein each business unit ingests raw data in separate data lakes or warehouses—hinders visibility and cross-functional analysis. Amazon DataZone natively supports data sharing for Amazon Redshift data assets.
Data governance is the process of ensuring the integrity, availability, usability, and security of an organization’s data. Due to the volume, velocity, and variety of data being ingested in data lakes, it can get challenging to develop and maintain policies and procedures to ensure data governance at scale for your data lake.
With this new functionality, customers can create up-to-date replicas of their data from applications such as Salesforce, ServiceNow, and Zendesk in an Amazon SageMaker Lakehouse and Amazon Redshift. SageMaker Lakehouse gives you the flexibility to access and query your data in-place with all Apache Iceberg compatible tools and engines.
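As a rough illustration of what "query in place with Iceberg-compatible engines" can look like, the sketch below uses Apache Spark with an Iceberg catalog. The catalog name, warehouse path, and table names are invented for illustration, and the exact Spark/Iceberg packages and configuration depend on your environment, not on anything stated in the article.

```python
# A minimal sketch, assuming Apache Spark with the Iceberg runtime available.
# "lakehouse", the S3 path, and "sales.orders" are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("lakehouse-query")
    # Register an Iceberg catalog pointing at the shared warehouse location.
    .config("spark.sql.catalog.lakehouse", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lakehouse.type", "hadoop")
    .config("spark.sql.catalog.lakehouse.warehouse", "s3://example-bucket/warehouse/")
    .getOrCreate()
)

# Query the single copy of the data where it lives -- no extract or copy step.
df = spark.sql("""
    SELECT customer_id, SUM(amount) AS total_spend
    FROM lakehouse.sales.orders
    GROUP BY customer_id
""")
df.show()
```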
Plug-and-play integration: A seamless, plug-and-play integration between data producers and consumers should facilitate rapid use of new data sets and enable quick proofs of concept, such as in the data science teams. As part of the required data, CHE data is shared using Amazon DataZone.
Domain ownership recognizes that the teams generating the data have the deepest understanding of it and are therefore best suited to manage, govern, and share it effectively. This principle makes sure data accountability remains close to the source, fostering higher data quality and relevance.
You can use AWS Glue to create, run, and monitor data integration and ETL (extract, transform, and load) pipelines and catalog your assets across multiple data stores. Hundreds of thousands of customers use data lakes for analytics and ML to make data-driven business decisions.
Poor-quality data can lead to incorrect insights, bad decisions, and lost opportunities. AWS Glue Data Quality measures and monitors the quality of your dataset. It supports both data quality at rest and data quality in AWS Glue extract, transform, and load (ETL) pipelines.
These formats, exemplified by Apache Iceberg, Apache Hudi, and Delta Lake, address persistent challenges in traditional data lake structures by offering an advanced combination of flexibility, performance, and governance capabilities. For more information, refer to What are deletion vectors?
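To make the "data quality at rest" idea concrete, here is a hedged boto3 sketch that defines a DQDL ruleset against a catalog table and starts an evaluation run. The database, table, ruleset, column, and role names are placeholders, not values taken from the article.

```python
# A minimal sketch of AWS Glue Data Quality "at rest" checks via boto3.
# All names below (sales_db, orders, role ARN, columns) are hypothetical.
import boto3

glue = boto3.client("glue")

# DQDL (Data Quality Definition Language) ruleset with a few basic checks.
ruleset = """
Rules = [
    IsComplete "order_id",
    Uniqueness "order_id" > 0.99,
    ColumnValues "order_total" >= 0
]
"""

glue.create_data_quality_ruleset(
    Name="orders-basic-checks",
    Ruleset=ruleset,
    TargetTable={"DatabaseName": "sales_db", "TableName": "orders"},
)

# Kick off an evaluation run against the table registered in the Data Catalog.
run = glue.start_data_quality_ruleset_evaluation_run(
    DataSource={"GlueTable": {"DatabaseName": "sales_db", "TableName": "orders"}},
    Role="arn:aws:iam::123456789012:role/GlueDataQualityRole",
    RulesetNames=["orders-basic-checks"],
)
print(run["RunId"])
```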
Solution To address the challenge, ATPCO sought inspiration from a modern data mesh architecture. In Amazon DataZone, data owners can publish their data and its business catalog (metadata) to ATPCO’s DataZone domain. Data consumers can then search for relevant data assets using these human-friendly metadata terms.
For those reasons, it was extremely difficult for Fujitsu to manage and utilize data at scale with Excel. Solution overview: OneData defines three personas: Publisher – This role includes the organizational and management team of systems that serve as data sources. It is crucial in data governance and data management.
Figure 2: Example data pipeline with DataOps automation. In this project, I automated data extraction from SFTP, the public websites, and the email attachments. The automated orchestration published the data to an Amazon S3 data lake. All of the code, the Talend job, and the BI report are version controlled using Git.
AWS Lake Formation and the AWS Glue Data Catalog form an integral part of a data governance solution for data lakes built on Amazon Simple Storage Service (Amazon S3), with multiple AWS analytics services integrating with them. In 2022, we talked about the enhancements we had made to these services. Well integrated!
Data has become an invaluable asset for businesses, offering critical insights to drive strategic decision-making and operational optimization. Delta tables' technical metadata is stored in the Data Catalog, which is a native source for creating assets in the Amazon DataZone business catalog.
It also makes it easier for engineers, data scientists, product managers, analysts, and business users to access data throughout an organization to discover, use, and collaborate to derive data-driven insights. Note that a managed data asset is an asset for which Amazon DataZone can manage permissions.
Data governance is increasingly top-of-mind for customers as they recognize data as one of their most important assets. Effective data governance enables better decision-making by improving data quality, reducing data management costs, and ensuring secure access to data for stakeholders.
You will need to continually return to your business dashboard to make sure that it's working, the data is accurate, and it's still answering the right questions in the most effective way. Testing will eliminate lots of data quality challenges and bring a test-first approach through your agile cycle.
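One lightweight way to apply that test-first idea is to encode the dashboard's data assumptions as automated checks that run before each refresh. The sketch below uses pandas with pytest-style test functions; the CSV path and column names are hypothetical placeholders.

```python
# A minimal sketch of test-first data quality checks for a dashboard feed.
# "daily_sales.csv", "order_date", and "revenue" are invented for illustration.
import pandas as pd

def load_dashboard_data(path="daily_sales.csv"):
    return pd.read_csv(path, parse_dates=["order_date"])

def test_no_missing_revenue():
    df = load_dashboard_data()
    assert df["revenue"].notna().all(), "Dashboard source has null revenue values"

def test_revenue_is_non_negative():
    df = load_dashboard_data()
    assert (df["revenue"] >= 0).all(), "Negative revenue suggests a bad feed"

def test_data_is_fresh():
    # Fail if the newest record is more than a day old.
    df = load_dashboard_data()
    cutoff = pd.Timestamp.today().normalize() - pd.Timedelta(days=1)
    assert df["order_date"].max() >= cutoff, "Dashboard data is stale"
```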
Since it's uniquely metadata-driven, the abstraction layer of a data fabric makes it easier to model, integrate, and query any data sources, build data pipelines, and integrate data in real time. This improves data engineering productivity and time-to-value for data consumers. What's a data mesh?
Migrating to Amazon Redshift offers organizations the potential for improved price-performance, enhanced data processing, faster query response times, and better integration with technologies such as machine learning (ML) and artificial intelligence (AI). When one wave is complete, the people from that wave move on to the next wave.
A data hub contains data at multiple levels of granularity and is often not integrated. It differs from a data lake by offering data that is pre-validated and standardized, allowing for simpler consumption by users. Data hubs and data lakes can coexist in an organization, complementing each other.
This plane drives users to engage in data-driven conversations with knowledge and insights shared across the organization. Through the product experience plane, data product owners can use automated workflows to capture data lineage and data quality metrics and oversee access controls.
Griffin is an open source data quality solution for big data, which supports both batch and streaming mode. In today's data-driven landscape, where organizations deal with petabytes of data, the need for automated data validation frameworks has become increasingly critical.
Given the importance of data in the world today, organizations face the dual challenges of managing large-scale, continuously incoming data while vetting its quality and reliability. AWS Glue is a serverless data integration service that you can use to effectively monitor and manage data quality through AWS Glue Data Quality.
This is the promise of the modern data lakehouse architecture. As analyst Sumit Pal wrote in "Exploring Lakehouse Architecture and Use Cases," published January 11, 2022: "Data lakehouses integrate and unify the capabilities of data warehouses and data lakes, aiming to support AI, BI, ML, and data engineering on a single platform."
Domain teams should continually monitor for data errors with data validation checks and incorporate data lineage to track usage. Establish and enforce data governance by ensuring all data used is accurate, complete, and compliant with regulations. For instance, JPMorgan Chase & Co.
Improved Decision Making: Well-modeled data provides insights that drive informed decision-making across various business domains, resulting in enhanced strategic planning. Reduced Data Redundancy: By eliminating data duplication, it optimizes storage and enhances data quality, reducing errors and discrepancies.
It proposes a technological, architectural, and organizational approach to solving data management problems by breaking up the monolithic data platform and de-centralizing data management across different domain teams and services. Some examples of data products are data sets, tables, machine learning models, and APIs.
Why start with a data source and build a visualization, if you can just find a visualization that already exists, complete with metadata about it? Data scientists went beyond database tables to data lakes and cloud data stores. Data scientists want to catalog not just information sources, but models.
It has been well documented since the State of DevOps 2019 DORA metrics were published that with DevOps, companies can deploy software 208 times more often and 106 times faster, recover from incidents 2,604 times faster, and release 7 times fewer defects. Finally, data integrity is of paramount importance.
With AWS Glue, you can discover and connect to hundreds of different data sources and manage your data in a centralized data catalog. You can visually create, run, and monitor ETL pipelines to load data into your data lakes.
Data mesh solves this by promoting data autonomy, allowing users to make decisions about domains without a centralized gatekeeper. It also improves development velocity through better data governance and access, with improved data quality aligned with business needs.
As the latest iteration in this pursuit of high-quality data sharing, DataOps combines a range of disciplines. It synthesizes all we've learned about agile, data quality, and ETL/ELT. This produces end-to-end lineage so business and technology users alike can understand the state of a data lake and/or lakehouse.
Therefore, it's crucial to keep the schema definition in the Schema Registry and the Data Catalog table in sync. To avoid this, it's recommended to use a data quality check mechanism to identify such anomalies and take appropriate action in case of unexpected behavior. See the corresponding page in the GitHub repository.
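As a small, hedged example of registering a new source in that centralized catalog, the boto3 sketch below creates and runs a Glue crawler over an S3 prefix. The bucket, database, crawler, and role names are placeholders invented for illustration.

```python
# A minimal sketch of cataloging an S3 data source with an AWS Glue crawler.
# The names and ARN below are hypothetical placeholders.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="raw-orders-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="raw_zone",
    Targets={"S3Targets": [{"Path": "s3://example-data-lake/raw/orders/"}]},
)

# Running the crawler populates tables in the centralized Data Catalog,
# where ETL jobs and query engines can then discover them.
glue.start_crawler(Name="raw-orders-crawler")
```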
These normally appear at the end of an article, but it seemed to make sense to start with them in this case: Recently I published Building Momentum – How to begin becoming a Data-driven Organisation. These and other areas are covered in greater detail in an older article, Using BI to drive improvements in data quality.
Offer the right tools: Data stewardship is greatly simplified when the right tools are on hand. So ask yourself, does your steward have the software to spot issues with data quality, for example? 2) Always Remember Compliance: There are now many different data privacy and security laws worldwide.
It's impossible for data teams to assure the data quality of such spreadsheets and govern them all effectively. If unaddressed, this chaos can lead to data quality, compliance, and security issues. In an enterprise, there may be thousands of spreadsheets used for critical business decisions.
This was for the Chief Data Officer, or head of data and analytics. Gartner also published the same piece of research for other roles, such as Application and Software Engineering. Will the data warehouse as a software tool play a role in the future of data and analytics strategy? We have published some case studies.
Its distributed architecture empowers organizations to query massive datasets across databases, data lakes, and cloud platforms with speed and reliability. Optimizing connections to your data sources is equally important, as it directly impacts the speed and efficiency of data access.
The key components of a data pipeline are typically: Data Sources: The origin of the data, such as a relational database, data warehouse, data lake, file, API, or other data store. This can include tasks such as data ingestion, cleansing, filtering, aggregation, or standardization.
Because this book was published recently, there are no written reviews available yet. 4) Big Data: Principles and Best Practices of Scalable Real-Time Data Systems by Nathan Marz and James Warren. 6) Lean Analytics: Use Data to Build a Better Startup Faster, by Alistair Croll and Benjamin Yoskovitz.
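To make those stages tangible, here is a tiny, self-contained sketch of a pipeline that cleanses, standardizes, and aggregates records before they would be loaded into a target store. The records and field names are invented purely for illustration.

```python
# A minimal sketch of the pipeline stages named above: cleansing, filtering,
# standardization, and aggregation. All data and field names are hypothetical.
from collections import defaultdict

raw_records = [  # Data source: could come from a database, API, or file extract.
    {"region": "eu", "amount": "120.50"},
    {"region": "EU", "amount": "80.00"},
    {"region": "us", "amount": None},  # bad record to be filtered out
]

def cleanse(records):
    # Drop rows with missing amounts (cleansing/filtering).
    return [r for r in records if r["amount"] is not None]

def standardize(records):
    # Normalize region codes and convert amounts to floats (standardization).
    return [{"region": r["region"].upper(), "amount": float(r["amount"])} for r in records]

def aggregate(records):
    # Sum amounts per region (aggregation) before loading to the target store.
    totals = defaultdict(float)
    for r in records:
        totals[r["region"]] += r["amount"]
    return dict(totals)

print(aggregate(standardize(cleanse(raw_records))))  # {'EU': 200.5}
```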
The quick and dirty definition of data mapping is the process of connecting different types of data from various data sources. Data mapping is a crucial step in data modeling and can help organizations achieve their business goals by enabling data integration, migration, transformation, and quality.
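A very small example of that "quick and dirty" definition: a lookup table that renames source fields into the target schema during integration or migration. The field names and the mapping itself are illustrative, not drawn from the article.

```python
# A minimal sketch of data mapping: translating a source record layout into a
# target schema. FIELD_MAP and all field names are hypothetical.
FIELD_MAP = {
    "cust_id": "customer_id",
    "fname": "first_name",
    "lname": "last_name",
    "dob": "date_of_birth",
}

def map_record(source_record: dict) -> dict:
    # Rename source fields to their target-schema names, keeping only mapped fields.
    return {target: source_record[src] for src, target in FIELD_MAP.items() if src in source_record}

source = {"cust_id": "C-001", "fname": "Ada", "lname": "Lovelace", "dob": "1815-12-10"}
print(map_record(source))
# {'customer_id': 'C-001', 'first_name': 'Ada', 'last_name': 'Lovelace', 'date_of_birth': '1815-12-10'}
```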
Advanced: Does it leverage AI/ML to enrich metadata by automatically linking glossary entries with data assets and performing semantic tagging? Leading-edge: Does it provide data quality or anomaly detection features to enrich metadata with quality metrics and insights, proactively identifying potential issues?