Data Lake and Statistics - Data Leaders Brief

Accelerate Amazon Redshift Data Lake queries with AWS Glue Data Catalog Column Statistics

AWS Big Data

OCTOBER 1, 2024

Amazon Redshift enables you to efficiently query and retrieve structured and semi-structured data from open format files in an Amazon S3 data lake without having to load the data into Amazon Redshift tables. Amazon Redshift extends SQL capabilities to your data lake, enabling you to run analytical queries.

Data Lake

Data Lake Statistics Broadcasting Optimization

Drug Launch Case Study: Amazing Efficiency Using DataOps

DataKitchen

DECEMBER 9, 2024

They opted for Snowflake, a cloud-native data platform ideal for SQL-based analysis. The team landed the data in a Data Lake implemented with cloud storage buckets and then loaded it into Snowflake, enabling fast access and smooth integrations with analytical tools.

Data Quality

Data Quality Data Lake Testing Statistics

Choosing an open table format for your transactional data lake on AWS

AWS Big Data

JUNE 9, 2023

A modern data architecture enables companies to ingest virtually any type of data through automated pipelines into a data lake, which provides highly durable and cost-effective object storage at petabyte or exabyte scale.

Data Lake

Data Lake Metadata Statistics Optimization

Recap of Amazon Redshift key product announcements in 2024

AWS Big Data

DECEMBER 17, 2024

Today, Amazon Redshift is used by customers across all industries for a variety of use cases, including data warehouse migration and modernization, near real-time analytics, self-service analytics, data lake analytics, machine learning (ML), and data monetization.

Data Lake

Data Lake Data Warehouse Data-driven Optimization

Simplify data integration with AWS Glue and zero-ETL to Amazon SageMaker Lakehouse

AWS Big Data

DECEMBER 4, 2024

With this new functionality, customers can create up-to-date replicas of their data from applications such as Salesforce, ServiceNow, and Zendesk in an Amazon SageMaker Lakehouse and Amazon Redshift. SageMaker Lakehouse gives you the flexibility to access and query your data in-place with all Apache Iceberg compatible tools and engines.

Data Integration

Data Integration Data Lake Statistics Data-driven

Enhance query performance using AWS Glue Data Catalog column-level statistics

AWS Big Data

NOVEMBER 22, 2023

Today, we’re making available a new capability of AWS Glue Data Catalog that allows generating column-level statistics for AWS Glue tables. These statistics are now integrated with the cost-based optimizers (CBO) of Amazon Athena and Amazon Redshift Spectrum, resulting in improved query performance and potential cost savings.

Statistics

Statistics Data Lake Optimization Data-driven

Use Apache Iceberg in your data lake with Amazon S3, AWS Glue, and Snowflake

AWS Big Data

APRIL 3, 2024

licensed, 100% open-source data table format that helps simplify data processing on large datasets stored in data lakes. Data engineers use Apache Iceberg because it’s fast, efficient, and reliable at any scale and keeps records of how datasets change over time.

Data Lake

Data Lake Snapshot Metadata Data Architecture

Query your Iceberg tables in data lake using Amazon Redshift (Preview)

AWS Big Data

AUGUST 31, 2023

Amazon Redshift enables you to directly access data stored in Amazon Simple Storage Service (Amazon S3) using SQL queries and join data across your data warehouse and data lake. With Amazon Redshift, you can query the data in your S3 data lake using a central AWS Glue metastore from your Redshift data warehouse.

Data Lake

Data Lake Data Warehouse Metadata Data Architecture

Simplify operational data processing in data lakes using AWS Glue and Apache Hudi

AWS Big Data

SEPTEMBER 13, 2023

A modern data architecture is an evolutionary architecture pattern designed to integrate a data lake, data warehouse, and purpose-built stores with a unified governance model. The company wanted the ability to continue processing operational data in the secondary Region in the rare event of primary Region failure.

Data Lake

Data Lake Data Processing Metadata Snapshot

Automate replication of relational sources into a transactional data lake with Apache Iceberg and AWS Glue

AWS Big Data

FEBRUARY 14, 2023

Organizations have chosen to build data lakes on top of Amazon Simple Storage Service (Amazon S3) for many years. A data lake is the most popular choice for organizations to store all their organizational data generated by different teams, across business domains, from all different formats, and even over history.

Data Lake

Data Lake Statistics Data Architecture Finance

Top analytics announcements of AWS re:Invent 2024

AWS Big Data

FEBRUARY 26, 2025

Analytics remained one of the key focus areas this year, with significant updates and innovations aimed at helping businesses harness their data more efficiently and accelerate insights. From enhancing data lakes to empowering AI-driven analytics, AWS unveiled new tools and services that are set to shape the future of data and analytics.

Analytics

Analytics Data Lake Metadata Data Warehouse

2021 Gift Giving Guide for Data Nerds

DataKitchen

DECEMBER 7, 2021

This book is not available until January 2022, but considering all the hype around the data mesh, we expect it to be a best seller. In the book, author Zhamak Dehghani reveals that, despite the time, money, and effort poured into them, data warehouses and data lakes fail when applied at the scale and speed of today’s organizations.

Data-driven

Data-driven Data Governance Big Data Data Science

Enriching metadata for accurate text-to-SQL generation for Amazon Athena

AWS Big Data

OCTOBER 14, 2024

Enterprise data is brought into data lakes and data warehouses to carry out analytical, reporting, and data science use cases using AWS analytical services like Amazon Athena, Amazon Redshift, Amazon EMR, and so on. Outside of his work, Naidu practices yoga and goes trekking often.

Metadata

Metadata Data Lake Modeling Data Warehouse

Build a high-performance quant research platform with Apache Iceberg

AWS Big Data

JANUARY 9, 2025

These features allow efficient data corrections, gap-filling in time series, and historical data updates without disrupting ongoing analyses or compromising data integrity. Unlike direct Amazon S3 access, Iceberg supports these operations on petabyte-scale data lakes without requiring complex custom code.

Metadata

Metadata Snapshot Cost-Benefit Optimization

Monitoring Apache Iceberg metadata layer using AWS Lambda, AWS Glue, and AWS CloudWatch

AWS Big Data

JULY 29, 2024

In the era of big data, data lakes have emerged as a cornerstone for storing vast amounts of raw data in its native format. They support structured, semi-structured, and unstructured data, offering a flexible and scalable environment for data ingestion from multiple sources.

Metadata

Metadata Snapshot Data Lake Metrics

AWS Lake Formation 2023 year in review

AWS Big Data

JANUARY 18, 2024

AWS Lake Formation and the AWS Glue Data Catalog form an integral part of a data governance solution for data lakes built on Amazon Simple Storage Service (Amazon S3) with multiple AWS analytics services integrating with them. In 2023, we added support for column-level statistics for tables in the Data Catalog.

Data Lake

Data Lake Metadata Data Governance Statistics

Einstein Studio 1: What it is and what to expect

CIO Business Intelligence

JULY 31, 2024

With this platform, Salesforce seeks to help organizations apply the cleverness of LLMs to the customer data they have squirreled away in Salesforce data lakes in the hopes of selling more. Einstein 1 Studio handles the piping so the data from your Einstein 1 platform instance will flow smoothly into the AI.

Data Lake

Data Lake IT Sales Experimentation

Top 15 data management platforms

CIO Business Intelligence

JUNE 9, 2022

In these instances, data feeds come largely from various advertising channels, and the reports they generate are designed to help marketers spend wisely. All this data arrives by the terabyte, and a data management platform can help marketers make sense of it all. SAS Data Management. Of course, marketing also works.

Management

Management Advertising Data Lake Sales

Data Quality Power Moves: Scorecards & Data Checks for Organizational Impact

DataKitchen

SEPTEMBER 18, 2024

According to DataKitchen’s 2024 market research, conducted with over three dozen data quality leaders, the complexity of data quality problems stems from the diverse nature of data sources, the increasing scale of data, and the fragmented nature of data systems.

Scorecard

Scorecard Data Quality Measurement Testing

Speed up queries with the cost-based optimizer in Amazon Athena

AWS Big Data

NOVEMBER 17, 2023

Athena provides a simplified, flexible way to analyze petabytes of data where it lives. You can analyze data or build applications from an Amazon Simple Storage Service (Amazon S3) data lake and 30 data sources, including on-premises data sources or other cloud systems using SQL or Python.

Optimization

Optimization Statistics Metadata Data Lake

Top 8 predictive analytics tools compared

CIO Business Intelligence

MAY 12, 2022

The tools include sophisticated pipelines for gathering data from across the enterprise, add layers of statistical analysis and machine learning to make projections about the future, and distill these insights into useful summaries so that business users can act on them. On premises or in SAP cloud. Per user, per month. Free tier.

Predictive Analytics

Predictive Analytics Analytics Statistics Machine Learning

What is Data Pipeline? A Detailed Explanation

Smart Data Collective

OCTOBER 17, 2022

A point of data entry in a given pipeline. Examples of an origin include storage systems like data lakes, data warehouses and data sources that include IoT devices, transaction processing applications, APIs or social media. The final point to which the data has to be eventually transferred is a destination.

Data Warehouse

Data Warehouse Data Lake Visualization Big Data

Amazon Redshift announcements at AWS re:Invent 2023 to enable analytics on all your data

AWS Big Data

NOVEMBER 29, 2023

Zero-ETL integration also enables you to load and analyze data from multiple operational database clusters in a new or existing Amazon Redshift instance to derive holistic insights across many applications. Use one click to access your data lake tables using auto-mounted AWS Glue data catalogs on Amazon Redshift for a simplified experience.

Data Warehouse

Data Warehouse Analytics Data Lake Machine Learning

What is a data architect? Skills, salaries, and how to become a data framework master

CIO Business Intelligence

OCTOBER 13, 2023

Data architect Armando Vázquez identifies eight common types of data architects: Enterprise data architect: These data architects oversee an organization’s overall data architecture, defining data architecture strategy and designing and implementing architectures. Are data architects in demand?

Data Architecture

Data Architecture Data Warehouse Statistics Visualization

Migrate Amazon Redshift from DC2 to RA3 to accommodate increasing data volumes and analytics demands

AWS Big Data

AUGUST 9, 2024

These processes retrieve data from around 90 different data sources, resulting in updating roughly 2,000 tables in the data warehouse and 3,000 external tables in Parquet format, accessed through Amazon Redshift Spectrum and a data lake on Amazon Simple Storage Service (Amazon S3). We started with 115 dc2.large

Data Lake

Data Lake Analytics Data Warehouse Data-driven

A Day in the Life of a DataOps Engineer

DataKitchen

OCTOBER 11, 2021

Figure 2: Example data pipeline with DataOps automation. In this project, I automated data extraction from SFTP, the public websites, and the email attachments. The automated orchestration published the data to an AWS S3 Data Lake. Historic Balance – compares current data to previous or expected values.

Testing

Testing Metadata Dashboards Statistics

Top 15 data management platforms available today

CIO Business Intelligence

SEPTEMBER 22, 2023

What are the benefits of data management platforms? Modern, data-driven marketing teams must navigate a web of connected data sources and formats. All this data arrives by the terabyte, and a data management platform can help marketers make sense of it all. Of course, marketing also works.

Management

Management Advertising Data Lake Sales

DataOps Observability: Taming the Chaos (Part 3)

DataKitchen

NOVEMBER 18, 2022

With this information in a shared context, your analyst working on a data lake will know if the 15 datasets she is viewing are accurate, the most recent, or of the same date range. And she’ll know when newer data will arrive. By setting process expectations, journeys can identify variances in each pipeline run.

Testing

Testing Statistics Measurement Metrics

Build a pseudonymization service on AWS to protect sensitive data: Part 2

AWS Big Data

MARCH 6, 2024

For an overview of how to build an ACID compliant data lake using Iceberg, refer to Build a high-performance, ACID compliant, evolving data lake using Apache Iceberg on Amazon EMR. The following graph depicts the Invocations metric, with the statistic SUM in orange and RUNNING SUM in blue. AWS Glue, and Athena.

Metrics

Metrics Statistics Testing Data Lake

AWS Glue Data Quality is Generally Available

AWS Big Data

JUNE 6, 2023

We are excited to announce the General Availability of AWS Glue Data Quality. Our journey started by working backward from our customers who create, manage, and operate data lakes and data warehouses for analytics and machine learning. It takes days for data engineers to identify and implement data quality rules.

Data Quality

Data Quality Statistics Data Lake Visualization

Migrate Hive data from CDH to CDP public cloud

Cloudera

JUNE 25, 2021

This blog post outlines detailed step by step instructions to perform Hive Replication from an on-prem CDH cluster to a CDP Public Cloud Data Lake. CDP Data Lake cluster versions – CM 7.4.0, Pre-Check: Data Lake Cluster. Understanding Ranger Policies in Data Lake Cluster. Runtime 7.2.8.

Data Lake

Data Lake Metadata Unstructured Data Management

Backtesting index rebalancing arbitrage with Amazon EMR and Apache Iceberg

AWS Big Data

JULY 3, 2023

Terminology: Let’s first discuss some of the terminology used in this post: Research data lake on Amazon S3 – A data lake is a large, centralized repository that allows you to manage all your structured and unstructured data at any scale. This is where the tagging feature in Apache Iceberg comes in handy.

Snapshot

Snapshot Data Lake Testing Strategy

Making the gen AI and data connection work

CIO Business Intelligence

AUGUST 9, 2024

Gartner agrees that synthetic data can help solve the data availability problem for AI products, as well as privacy, compliance, and anonymization challenges. Web scraping activity can be direct, carried out by the same subject who develops the model, or indirect, carried out from third-party data lakes.

Risk

Risk Measurement Data Lake Data Collection

Enhance monitoring and debugging for AWS Glue jobs using new job observability metrics

AWS Big Data

NOVEMBER 20, 2023

Use case: A typical workload for AWS Glue for Apache Spark jobs is to load data from a relational database to a data lake with SQL-based transformations. On the Graphed metrics tab, configure your preferred statistic, period, and so on. When the example job ran, the workerUtilization metrics showed the following trend.

Metrics

Metrics Data Lake Cost-Benefit Dashboards

Real estate CIOs drive deals with data

CIO Business Intelligence

JULY 26, 2023

But Cox and Djuric do know that 82% of Keller Williams’ agents have been active on the homegrown CRM application in the past 90 days and can deduce the high value of their data from that statistic alone.

Data Lake

Data Lake Digital Transformation Machine Learning Data Architecture

Quantitative and Qualitative Data: A Vital Combination

Sisense

OCTOBER 6, 2020

Let’s consider the differences between the two, and why they’re both important to the success of data-driven organizations. Digging into quantitative data. This is quantitative data. It’s “hard,” structured data that answers questions such as “how many?” Qualitative data benefits: Unlocking understanding.

Statistics

Statistics Unstructured Data Data-driven Visualization

How Gilead used Amazon Redshift to quickly and cost-effectively load third-party medical claims data

AWS Big Data

NOVEMBER 8, 2023

Proposed Solution approach 2: Data Lake analytics. The team used this approach with Redshift Spectrum to load only the required columns to Redshift Serverless, which avoided loading data into multiple yearly tables and directly to a single table. Create a data lake external schema and table in Redshift Serverless.

Data Lake

Data Lake Data Warehouse Cost-Benefit Optimization

Optimize your workloads with Amazon Redshift Serverless AI-driven scaling and optimization

AWS Big Data

AUGUST 21, 2024

Compute scales based on data volume. Use case 3 – A data lake query scanning large datasets (TBs). Compute scales based on the expected data to be scanned from the data lake. The expected data scan is predicted by machine learning (ML) models based on prior historical run statistics.

Optimization

Optimization Data Lake Data Warehouse Cost-Benefit

Four Topics That Should Be Top of Mind for SAP Partners

Timo Elliott

JUNE 19, 2023

All of the statistics from IDC and the others show that there’s a massive market for digital services. The next area is data. There’s a huge disruption around data. Increasingly now, we can bring the technology to the data rather than the other way around. The first is the new digital opportunities.

Data Lake

Data Lake Digital Transformation Recreation/Entertainment Technology

Data Champions: Balancing IT and Business Needs

Cloudera

SEPTEMBER 10, 2020

With that in mind, the agency uses open-source technology and high-performance hybrid cloud infrastructure to transform how it processes demographic and economic data with an Enterprise Data Lake (EDL).

IT

IT Business Objectives Digital Transformation Data-driven

Periscope Data Expands to Israel, Empowering Data Teams with Powerful Tools

Sisense

DECEMBER 11, 2019

The easy set-up and access to embedded analytics enable them to measure KPIs, get game statistics, monetization and retention statistics that help them to optimize players’ experience, hone best practices and benchmarks, and maximize stickiness and profitability. Diving deeper into the datasphere: Data lakes — best practices.

Data Lake

Data Lake Big Data Sales Data-driven

Unleashing the power of Presto: The Uber case study

IBM Big Data Hub

SEPTEMBER 25, 2023

Uber understood that digital superiority required the capture of all their transactional data, not just a sampling. They stood up a file-based data lake alongside their analytical database. Because much of the work done on their data lake is exploratory in nature, many users want to execute untested queries on petabytes of data.

OLAP

OLAP Data Lake Data-driven Online Analytical Processing

Unilever leverages ChatGPT to deliver business value

CIO Business Intelligence

MARCH 10, 2023

“The local team can be activated very quickly, ingest the data very quickly, and then create a statistical model and analytics model together with the business, sitting next to each other.

Forecasting

Forecasting Machine Learning Data Lake Digital Transformation

HEMA accelerates their data governance journey with Amazon DataZone

AWS Big Data

DECEMBER 19, 2024

Delta tables technical metadata is stored in the Data Catalog, which is a native source for creating assets in the Amazon DataZone business catalog. Access control is enforced using AWS Lake Formation, which manages fine-grained access control and data sharing on data lake data.

Data Governance

Data Governance Publishing Data-driven Metadata
