Blog, Data Lake and Data Quality

Drug Launch Case Study: Amazing Efficiency Using DataOps

DataKitchen

DECEMBER 9, 2024

A Drug Launch Case Study in the Amazing Efficiency of a Data Team Using DataOps: How a Small Team Powered the Multi-Billion Dollar Acquisition of a Pharma Startup. When launching a groundbreaking pharmaceutical product, the stakes and the rewards couldn’t be higher. data engineers delivered over 100 lines of code and 1.5

Data Quality

Data Quality Data Lake Testing Statistics

Data Quality Power Moves: Scorecards & Data Checks for Organizational Impact

DataKitchen

SEPTEMBER 18, 2024

A DataOps Approach to Data Quality. The Growing Complexity of Data Quality: Data quality issues are widespread, affecting organizations across industries, from manufacturing to healthcare and financial services. 73% of data practitioners do not trust their data (IDC).

Scorecard

Scorecard Data Quality Measurement Testing

The next generation of Amazon SageMaker: The center for all your data, analytics, and AI

AWS Big Data

DECEMBER 4, 2024

Amazon SageMaker Lakehouse, now generally available, unifies all your data across Amazon Simple Storage Service (Amazon S3) data lakes and Amazon Redshift data warehouses, helping you build powerful analytics and AI/ML applications on a single copy of data. Having confidence in your data is key.

Data Analytics

Data Analytics Analytics Data Lake Data Quality

Perform data parity at scale for data modernization programs using AWS Glue Data Quality

AWS Big Data

OCTOBER 9, 2024

Today, customers are embarking on data modernization programs by migrating on-premises data warehouses and data lakes to the AWS Cloud to take advantage of the scale and advanced analytical capabilities of the cloud. Some customers build custom in-house data parity frameworks to validate data during migration.

Data Quality

Data Quality Data Lake Data Warehouse Metrics

Data Lakes on Cloud & it’s Usage in Healthcare

BizAcuity

MARCH 29, 2019

Data lakes are centralized repositories that can store all structured and unstructured data at any desired scale. The power of the data lake lies in the fact that it often is a cost-effective way to store data.

Data Lake

Data Lake Unstructured Data Cost-Benefit Data Quality

Automated data governance with AWS Glue Data Quality, sensitive data detection, and AWS Lake Formation

AWS Big Data

OCTOBER 10, 2023

Data governance is the process of ensuring the integrity, availability, usability, and security of an organization’s data. Due to the volume, velocity, and variety of data being ingested in data lakes, it can get challenging to develop and maintain policies and procedures to ensure data governance at scale for your data lake.

Data Quality

Data Quality Data Governance Data Lake Testing

Simplify data integration with AWS Glue and zero-ETL to Amazon SageMaker Lakehouse

AWS Big Data

DECEMBER 4, 2024

With this new functionality, customers can create up-to-date replicas of their data from applications such as Salesforce, ServiceNow, and Zendesk in an Amazon SageMaker Lakehouse and Amazon Redshift. SageMaker Lakehouse gives you the flexibility to access and query your data in-place with all Apache Iceberg compatible tools and engines.

Data Integration

Data Integration Data Lake Statistics Data-driven

AWS Glue Data Quality is Generally Available

AWS Big Data

JUNE 6, 2023

We are excited to announce the General Availability of AWS Glue Data Quality. Our journey started by working backward from our customers who create, manage, and operate data lakes and data warehouses for analytics and machine learning. It takes days for data engineers to identify and implement data quality rules.

Data Quality

Data Quality Statistics Data Lake Visualization

Set up advanced rules to validate quality of multiple datasets with AWS Glue Data Quality

AWS Big Data

JUNE 6, 2023

Poor-quality data can lead to incorrect insights, bad decisions, and lost opportunities. AWS Glue Data Quality measures and monitors the quality of your dataset. It supports both data quality at rest and data quality in AWS Glue extract, transform, and load (ETL) pipelines.

Data Quality

Data Quality Data Lake Visualization Data-driven

Orca Security’s journey to a petabyte-scale data lake with Apache Iceberg and AWS Analytics

AWS Big Data

JULY 20, 2023

With data becoming the driving force behind many industries today, having a modern data architecture is pivotal for organizations to be successful. In this post, we describe Orca’s journey building a transactional data lake using Amazon Simple Storage Service (Amazon S3), Apache Iceberg, and AWS Analytics.

Data Lake

Data Lake Analytics Snapshot Data Quality

What is a Data Mesh?

DataKitchen

AUGUST 3, 2021

First-generation – expensive, proprietary enterprise data warehouse and business intelligence platforms maintained by a specialized team drowning in technical debt. Second-generation – gigantic, complex data lake maintained by a specialized team drowning in technical debt.

Data Architecture

Data Architecture Data Lake Cost-Benefit Data Warehouse

An AI Chat Bot Wrote This Blog Post …

DataKitchen

DECEMBER 9, 2022

Observability in DataOps refers to the ability to monitor and understand the performance and behavior of data-related systems and processes, and to use that information to improve the quality and speed of data-driven decision making. Query> An AI, Chat GPT wrote this blog post, why should I read it?

Machine Learning

Machine Learning Data-driven Optimization Data Analytics

Fire Your Super-Smart Data Consultants with DataOps

DataKitchen

JANUARY 25, 2022

Ensuring that data is available, secure, correct, and fit for purpose is neither simple nor cheap. Companies end up paying outside consultants enormous fees while still having to suffer the effects of poor data quality and lengthy cycle time. For example, DataOps can be used to automate data integration.

Consulting

Consulting Testing Data Quality Data Lake

Monitoring Apache Iceberg metadata layer using AWS Lambda, AWS Glue, and AWS CloudWatch

AWS Big Data

JULY 29, 2024

In the era of big data, data lakes have emerged as a cornerstone for storing vast amounts of raw data in its native format. They support structured, semi-structured, and unstructured data, offering a flexible and scalable environment for data ingestion from multiple sources.

Metadata

Metadata Snapshot Data Lake Metrics

Data architecture strategy for data quality

IBM Big Data Hub

JANUARY 5, 2023

Poor data quality is one of the top barriers faced by organizations aspiring to be more data-driven. Ill-timed business decisions and misinformed business processes, missed revenue opportunities, failed business initiatives and complex data systems can all stem from data quality issues.

Data Architecture

Data Architecture Data Quality Strategy Data Lake

Use open table format libraries on AWS Glue 5.0 for Apache Spark

AWS Big Data

DECEMBER 4, 2024

These formats, exemplified by Apache Iceberg, Apache Hudi, and Delta Lake, address persistent challenges in traditional data lake structures by offering an advanced combination of flexibility, performance, and governance capabilities. For more information, refer to What are deletion vectors?

Snapshot

Snapshot Metadata Data Lake Optimization

Navigating the Chaos of Unruly Data: Solutions for Data Teams

DataKitchen

NOVEMBER 10, 2023

The core issue plaguing many organizations is the presence of out-of-control databases or data lakes characterized by: Unrestrained Data Changes: Numerous users and tools incessantly alter data, leading to a tumultuous environment. Monitors for freshness, schema changes, volume, and column health are standard.

Data Quality

Data Quality Testing Data Lake Data Integration

Accomplish Agile Business Intelligence & Analytics For Your Business

datapine

APRIL 15, 2020

You will need to continually return to your business dashboard to make sure that it’s working, the data is accurate and it’s still answering the right questions in the most effective way. Testing will eliminate lots of data quality challenges and bring a test-first approach through your agile cycle.

Business Intelligence

Business Intelligence Analytics Testing Dashboards

How ATPCO enables governed self-service data access to accelerate innovation with Amazon DataZone

AWS Big Data

JULY 25, 2024

This blog post is co-written with Raj Samineni from ATPCO. In today’s data-driven world, companies across industries recognize the immense value of data in making decisions, driving innovation, and building new products to serve their customers. Choose the Amazon DataZone blueprint you want to enable.

Data Lake

Data Lake Metadata Sales Publishing

How BMO improved data security with Amazon Redshift and AWS Lake Formation

AWS Big Data

MARCH 1, 2024

One of the bank’s key challenges related to strict cybersecurity requirements is to implement field level encryption for personally identifiable information (PII), Payment Card Industry (PCI), and data that is classified as high privacy risk (HPR). Only users with required permissions are allowed to access data in clear text.

Data Lake

Data Lake Data Warehouse Management Risk

Handle UPSERT data operations using open-source Delta Lake and AWS Glue

AWS Big Data

JANUARY 30, 2023

Many customers need an ACID transaction (atomic, consistent, isolated, durable) data lake that can log change data capture (CDC) from operational data sources. There is also demand for merging real-time data into batch data. Delta Lake framework provides these two capabilities. Choose Create role.

Insurance

Insurance Data Lake Data-driven Analytics

Introducing the technology behind watsonx.ai, IBM’s AI and data platform for enterprise

IBM Big Data Hub

MAY 9, 2023

Data: the foundation of your foundation model Data quality matters. An AI model trained on biased or toxic data will naturally tend to produce biased or toxic outputs. When objectionable data is identified, we remove it, retrain the model, and repeat. Data curation is a task that’s never truly finished.

Enterprise

Enterprise Technology Modeling Cost-Benefit

Automate large-scale data validation using Amazon EMR and Apache Griffin

AWS Big Data

APRIL 4, 2024

Griffin is an open source data quality solution for big data, which supports both batch and streaming mode. In today’s data-driven landscape, where organizations deal with petabytes of data, the need for automated data validation frameworks has become increasingly critical.

Data Quality

Data Quality Data Lake Data Warehouse Data-driven

Constructing A Digital Transformation Strategy: Putting the Data in Digital Transformation

erwin

JULY 17, 2019

Part Two of the Digital Transformation Journey … In our last blog on driving digital transformation , we explored how enterprise architecture (EA) and business process (BP) modeling are pivotal factors in a viable digital transformation strategy. With automation, data quality is systemically assured.

Digital Transformation

Digital Transformation Strategy Metadata Data-driven

A Day in the Life of a DataOps Engineer

DataKitchen

OCTOBER 11, 2021

Figure 2: Example data pipeline with DataOps automation. In this project, I automated data extraction from SFTP, the public websites, and the email attachments. The automated orchestration published the data to an AWS S3 Data Lake. All the code, Talend job, and the BI report are version controlled using Git.

Testing

Testing Metadata Dashboards Statistics

Thank You Snowflake for Naming Alation the Data Governance Partner of the Year

Alation

JUNE 17, 2021

Lastly, active data governance simplifies stewardship tasks of all kinds. Technical stewards have the tools to monitor data quality, access, and access control. A compliance steward is empowered to monitor sensitive data and usage sharing policies at scale. The Data Swamp Problem.

Data Governance

Data Governance Data Lake Insurance Enterprise

Breaking State and Local Data Silos with Modern Data Architectures

Cloudera

AUGUST 30, 2022

For state and local agencies, data silos create compounding problems: Inaccessible or hard-to-access data creates barriers to data-driven decision making. Legacy data sharing involves proliferating copies of data, creating data management and security challenges.

Data Architecture

Data Architecture Data Lake Data Warehouse Metadata

Data Architecture and Strategy in the AI Era

Cloudera

MARCH 28, 2024

Among the most common challenges to achieving AI adoption at scale were data quality and availability (36%), scalability and deployment (36%), integration with existing systems and processes (35%), and change management and organizational culture (34%).

Data Architecture

Data Architecture Strategy Data Lake Data-driven

Did Big Data Deliver Business Transformation & Improved CX?

Alation

AUGUST 4, 2022

And where data was available, the ability to access and interpret it proved problematic. Big data can grow too big fast. Left unchecked, data lakes became data swamps. Some data lake implementations required expensive ‘cleansing pumps’ to make them navigable again.

Big Data

Big Data Digital Transformation Data Lake Data-driven

Addressing the Elephant in the Room – Welcome to Today’s Cloudera

Cloudera

JUNE 13, 2024

After countless open-source innovations ushered in the Big Data era, including the first commercial distribution of HDFS (Apache Hadoop Distributed File System), commonly referred to as Hadoop, the two companies joined forces, giving birth to an entire ecosystem of technology and tech companies. We hope to see you there.

Big Data

Big Data Machine Learning Contextual Data Data Lake

You Can’t Hit What You Can’t See

Cloudera

DECEMBER 1, 2022

Data observability provides insight into the condition and evolution of the data resources from source through the delivery of the data products. Barr Moses of Monte Carlo presents it as a combination of data flow, data quality, data governance, and data lineage. Source: IDC.

Data Quality

Data Quality Metrics Data Lake Statistics

Demystifying Modern Data Platforms

Cloudera

SEPTEMBER 15, 2022

Mark: The first element in the process is the link between the source data and the entry point into the data platform. At Ramsey International (RI), we refer to that layer in the architecture as the foundation, but others call it a staging area, raw zone, or even a source data lake.

Data Lake

Data Lake Data Architecture Data-driven Data Warehouse

Data Mesh vs. Data Fabric: A Love Story

Alation

JANUARY 13, 2022

Thoughtworks says data mesh is key to moving beyond a monolithic data lake. Spoiler alert: data fabric and data mesh are independent design concepts that are, in fact, quite complementary. Gartner on Data Fabric.

Data Lake

Data Lake Metadata Data-driven Data Governance

What is an open data lakehouse and why you should care?

IBM Big Data Hub

JANUARY 17, 2023

A data lakehouse is an emerging data management architecture that improves efficiency and converges data warehouse and data lake capabilities driven by a need to improve efficiency and obtain critical insights faster. Let’s start with why data lakehouses are becoming increasingly important.

Data Lake

Data Lake Metadata Data Warehouse Data Governance

The Modern Data Lakehouse: An Architectural Innovation

Cloudera

SEPTEMBER 9, 2022

According to Gartner, Inc. analyst Sumit Pal, in “Exploring Lakehouse Architecture and Use Cases,” published January 11, 2022: “Data lakehouses integrate and unify the capabilities of data warehouses and data lakes, aiming to support AI, BI, ML, and data engineering on a single platform.”

Metadata

Metadata Machine Learning Unstructured Data Data Lake

Top 10 Data Governance Predictions for 2019

erwin

DECEMBER 13, 2018

As organizations become data-driven and awash in an overwhelming amount of data from multiple data sources (AI, IOT, ML, etc.), organizations will need to get a better handle on data quality and focus on data management processes and practices.

Data Governance

Data Governance IoT Internet of Things Data-driven

DataOps Observability: Taming the Chaos (Part 3)

DataKitchen

NOVEMBER 18, 2022

In addition to the tracking of relationships and quality metrics, DataOps Observability journeys allow users to establish baselines: concrete expectations for run schedules, run durations, data quality, and upstream and downstream dependencies. And she’ll know when newer data will arrive.

Testing

Testing Statistics Measurement Metrics

Augmented data management: Data fabric versus data mesh

IBM Big Data Hub

APRIL 27, 2022

Since it’s uniquely metadata-driven, the abstraction layer of a data fabric makes it easier to model, integrate and query any data sources, build data pipelines, and integrate data in real-time. This improves data engineering productivity and time-to-value for data consumers. What’s a data mesh?

Management

Management Metadata Data Architecture Data Lake

What Is a Data Catalog?

Alation

FEBRUARY 13, 2020

Figure 1 illustrates the typical metadata subjects contained in a data catalog. Figure 1 – Data Catalog Metadata Subjects. Datasets are the files and tables that data workers need to find and access. They may reside in a data lake, warehouse, master data repository, or any other shared data resource.

Metadata

Metadata Data Lake Recreation/Entertainment Big Data

O’Reilly Releases First Chapters of a New Book about Logical Data Management

Data Virtualization

JANUARY 21, 2025

However, companies are still struggling to manage data effectively, to implement GenAI applications that deliver proven business value.

Management

Management Data Integration Technology Data Warehouse

The Enduring Significance of Data Modeling in the Modern Data-Driven Enterprise

erwin

AUGUST 31, 2023

Improved Decision Making : Well-modeled data provides insights that drive informed decision-making across various business domains, resulting in enhanced strategic planning. Reduced Data Redundancy : By eliminating data duplication, it optimizes storage and enhances data quality, reducing errors and discrepancies.

Data-driven

Data-driven Modeling Enterprise Structured Data

Use fuzzy string matching to approximate duplicate records in Amazon Redshift

AWS Big Data

FEBRUARY 8, 2023

It’s common to ingest multiple data sources into Amazon Redshift to perform analytics. Often, each data source will have its own processes of creating and maintaining data, which can lead to data quality challenges within and across sources. Answering questions as simple as “How many unique customers do we have?”
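The duplicate-record problem described in this excerpt can be illustrated with a minimal sketch. It uses Python's standard-library `difflib` rather than Amazon Redshift, so it is not the method from the linked post; the sample names, the `similarity` and `likely_duplicates` helpers, and the 0.85 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def likely_duplicates(names, threshold=0.85):
    """Return every pair of names whose similarity meets the threshold.

    The threshold is an illustrative assumption; a real pipeline would
    tune it against labeled examples.
    """
    return [
        (a, b)
        for a, b in combinations(names, 2)
        if similarity(a, b) >= threshold
    ]


# Hypothetical customer names ingested from different sources.
customers = ["Jon Smith", "John Smith", "Maria Garcia", "jon smith "]
pairs = likely_duplicates(customers)

for a, b in pairs:
    print(f"{a!r} ~ {b!r}: {similarity(a, b):.2f}")
```

Comparing every pair is quadratic in the number of records, so at warehouse scale this kind of matching is usually restricted to candidate groups (for example, records sharing a postal code) before scoring.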

Data Quality

Data Quality Testing Data Warehouse Unstructured Data

Data Mesh 101: How Data Mesh Helps Organizations Be Data-Driven and Achieve Velocity

Ontotext

FEBRUARY 12, 2024

This is especially beneficial when teams need to increase data product velocity with trust and data quality, reduce communication costs, and help data solutions align with business objectives. In most enterprises, data is needed and produced by many business units but owned and trusted by no one.

Data-driven

Data-driven Data Lake Data Quality Business Objectives

Implement a Multi-Cloud Open Lakehouse with Apache Iceberg in Cloudera Data Platform

Cloudera

DECEMBER 15, 2022

With in-place table migration, you can rapidly convert to Iceberg tables since there is no need to regenerate data files. Newly generated metadata will then point to source data files as illustrated in the diagram below. Data quality using table rollback. Only metadata will be regenerated.

Metadata

Metadata Data Warehouse Snapshot Machine Learning
