Enterprise data is brought into data lakes and data warehouses to support analytics, reporting, and data science use cases using AWS analytics services such as Amazon Athena, Amazon Redshift, and Amazon EMR. Table metadata is fetched from AWS Glue, and the generated Athena SQL query is then run.
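As a rough sketch of that flow, the table definition can be pulled from the Data Catalog with the AWS SDK and used to assemble a query; the database and table names below are hypothetical.

```python
import boto3

# Fetch table metadata from the AWS Glue Data Catalog.
# "sales_db" and "orders" are hypothetical names used for illustration.
glue = boto3.client("glue")
table = glue.get_table(DatabaseName="sales_db", Name="orders")

# Use the column metadata to assemble an Athena SQL query.
columns = [c["Name"] for c in table["Table"]["StorageDescriptor"]["Columns"]]
query = f"SELECT {', '.join(columns)} FROM sales_db.orders LIMIT 10"
print(query)
```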
Amazon Redshift enables you to efficiently query and retrieve structured and semi-structured data from open-format files in an Amazon S3 data lake without having to load the data into Amazon Redshift tables. Amazon Redshift extends SQL capabilities to your data lake, enabling you to run analytical queries.
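A minimal sketch of this pattern using the Redshift Data API, assuming a serverless workgroup and a Glue database mapped in as an external schema; the workgroup, database, and IAM role are placeholders.

```python
import boto3

# Query open-format files in Amazon S3 from Redshift without loading them.
rsd = boto3.client("redshift-data")

# Map a Glue database into Redshift as an external schema; the files
# themselves stay in Amazon S3.
rsd.execute_statement(
    WorkgroupName="analytics-wg",
    Database="dev",
    Sql="""
        CREATE EXTERNAL SCHEMA IF NOT EXISTS lake
        FROM DATA CATALOG DATABASE 'sales_db'
        IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole'
    """,
)

# Run an analytical query directly against the S3-backed table.
resp = rsd.execute_statement(
    WorkgroupName="analytics-wg",
    Database="dev",
    Sql="SELECT COUNT(*) FROM lake.orders WHERE order_date >= '2024-01-01'",
)
print(resp["Id"])  # statement ID; poll describe_statement for the result
```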
Iceberg offers distinct advantages over Parquet through its metadata layer, such as improved data management, performance optimization, and integration with various query engines. Unlike direct Amazon S3 access, Iceberg supports these operations on petabyte-scale data lakes without requiring complex custom code.
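To make the contrast concrete, here is a sketch of operations that are one-line SQL on an Iceberg table but would mean rewriting Parquet files with custom code; it assumes a Spark session already configured with an Iceberg catalog named "lake", and the table names are hypothetical.

```python
from pyspark.sql import SparkSession

# Assumes an Iceberg catalog named "lake" is configured on the session.
spark = SparkSession.builder.getOrCreate()

# Row-level deletes and schema evolution are single SQL statements on an
# Iceberg table; on raw Parquet in S3 both would require rewriting files.
spark.sql("DELETE FROM lake.sales_db.orders WHERE status = 'cancelled'")
spark.sql("ALTER TABLE lake.sales_db.orders ADD COLUMN discount DOUBLE")
```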
Today, Amazon Redshift is used by customers across all industries for a variety of use cases, including data warehouse migration and modernization, near real-time analytics, self-service analytics, data lake analytics, machine learning (ML), and data monetization.
In the era of big data, data lakes have emerged as a cornerstone for storing vast amounts of raw data in its native format. They support structured, semi-structured, and unstructured data, offering a flexible and scalable environment for data ingestion from multiple sources.
A modern data architecture enables companies to ingest virtually any type of data through automated pipelines into a data lake, which provides highly durable and cost-effective object storage at petabyte or exabyte scale.
With this new functionality, customers can create up-to-date replicas of their data from applications such as Salesforce, ServiceNow, and Zendesk in Amazon SageMaker Lakehouse and Amazon Redshift. SageMaker Lakehouse gives you the flexibility to access and query your data in place with any Apache Iceberg-compatible tools and engines.
Apache Iceberg is a 100% open-source table format that helps simplify data processing on large datasets stored in data lakes. Data engineers use Apache Iceberg because it's fast, efficient, and reliable at any scale, and because it keeps records of how datasets change over time.
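That change history is directly queryable. A sketch, reusing the hypothetical Iceberg catalog and table names from the example above:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Every commit is recorded as a snapshot in the table metadata.
spark.sql(
    "SELECT snapshot_id, committed_at, operation "
    "FROM lake.sales_db.orders.snapshots"
).show()

# Time travel: read the table as it existed at an earlier point in time.
spark.sql(
    "SELECT * FROM lake.sales_db.orders TIMESTAMP AS OF '2024-01-01 00:00:00'"
).show()
```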
Amazon Redshift enables you to directly access data stored in Amazon Simple Storage Service (Amazon S3) using SQL queries and to join data across your data warehouse and data lake. With Amazon Redshift, you can query the data in your S3 data lake from your Redshift data warehouse using a central AWS Glue metastore.
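Once the external schema exists, a join across the warehouse and the lake is ordinary SQL; a sketch reusing the hypothetical names from the earlier Redshift Data API example:

```python
import boto3

# Join a local Redshift table with an S3-backed external table ("lake"
# is the hypothetical external schema created earlier).
rsd = boto3.client("redshift-data")
rsd.execute_statement(
    WorkgroupName="analytics-wg",
    Database="dev",
    Sql="""
        SELECT c.customer_name, SUM(o.amount) AS total_spend
        FROM public.customers AS c   -- warehouse table
        JOIN lake.orders AS o        -- external table over S3 files
          ON c.customer_id = o.customer_id
        GROUP BY c.customer_name
    """,
)
```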
Today, we're making available a new capability of the AWS Glue Data Catalog that generates column-level statistics for AWS Glue tables. These statistics are integrated with the cost-based optimizers (CBOs) of Amazon Athena and Amazon Redshift Spectrum, resulting in improved query performance and potential cost savings.
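Statistics generation can also be started programmatically; a sketch with the AWS SDK, where the database, table, columns, and IAM role are all hypothetical:

```python
import boto3

# Start a column-statistics generation run for a Glue table so the
# Athena / Redshift Spectrum optimizers can use the results.
glue = boto3.client("glue")
run = glue.start_column_statistics_task_run(
    DatabaseName="sales_db",
    TableName="orders",
    Role="arn:aws:iam::123456789012:role/GlueStatsRole",
    ColumnNameList=["order_id", "order_date", "amount"],
)
print(run["ColumnStatisticsTaskRunId"])
```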
A modern data architecture is an evolutionary architecture pattern designed to integrate a data lake, data warehouse, and purpose-built stores with a unified governance model. The company wanted the ability to continue processing operational data in the secondary Region in the rare event of a primary Region failure.
Analytics remained one of the key focus areas this year, with significant updates and innovations aimed at helping businesses harness their data more efficiently and accelerate insights. From enhancing data lakes to empowering AI-driven analytics, AWS unveiled new tools and services that are set to shape the future of data and analytics.
AWS Lake Formation and the AWS Glue Data Catalog form an integral part of a data governance solution for data lakes built on Amazon Simple Storage Service (Amazon S3), with multiple AWS analytics services integrating with them. In 2022, we talked about the enhancements we had made to these services. Well integrated!
Figure 2: Example data pipeline with DataOps automation. In this project, I automated data extraction from SFTP, the public websites, and the email attachments. The automated orchestration published the data to an Amazon S3 data lake. Historic Balance – compares current data to previous or expected values.
In these instances, data feeds come largely from various advertising channels, and the reports they generate are designed to help marketers spend wisely. All this data arrives by the terabyte, and a data management platform can help marketers make sense of it all.
Athena provides a simplified, flexible way to analyze petabytes of data where it lives. You can analyze data or build applications from an Amazon Simple Storage Service (Amazon S3) data lake and more than 30 data sources, including on-premises data sources and other cloud systems, using SQL or Python.
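A minimal sketch of the SQL-from-Python path with the AWS SDK; the database, table, and results bucket are hypothetical:

```python
import time
import boto3

# Run an Athena SQL query against an S3 data lake table and fetch results.
athena = boto3.client("athena")

qid = athena.start_query_execution(
    QueryString="SELECT order_id, amount FROM orders LIMIT 10",
    QueryExecutionContext={"Database": "sales_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    status = athena.get_query_execution(QueryExecutionId=qid)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```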
Data architect Armando Vázquez identifies eight common types of data architects: Enterprise data architect: These data architects oversee an organization’s overall data architecture, defining data architecture strategy and designing and implementing architectures. Are data architects in demand?
Zero-ETL integration also enables you to load and analyze data from multiple operational database clusters in a new or existing Amazon Redshift instance to derive holistic insights across many applications. With one click, you can access your data lake tables through auto-mounted AWS Glue Data Catalogs on Amazon Redshift for a simplified experience.
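With the auto-mounted catalog, data lake tables become addressable under the awsdatacatalog database using three-part names, with no external-schema DDL; a sketch with hypothetical names:

```python
import boto3

# Query a Glue Data Catalog table auto-mounted in Redshift as the
# "awsdatacatalog" database; no CREATE EXTERNAL SCHEMA statement needed.
rsd = boto3.client("redshift-data")
rsd.execute_statement(
    WorkgroupName="analytics-wg",
    Database="dev",
    Sql='SELECT * FROM "awsdatacatalog"."sales_db"."orders" LIMIT 10',
)
```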
This blog post outlines detailed step-by-step instructions to perform Hive replication from an on-premises CDH cluster to a CDP Public Cloud Data Lake (CDP Data Lake cluster versions: CM 7.4.0, Runtime 7.2.8), including a pre-check of the Data Lake cluster and an explanation of Ranger policies in the Data Lake cluster.
What are the benefits of data management platforms? Modern, data-driven marketing teams must navigate a web of connected data sources and formats, and a data management platform can help them make sense of it all.
The business end users were given a tool to discover data assets produced within the mesh and seamlessly self-serve their data sharing needs. The integration of Databricks Delta tables into Amazon DataZone is done using the AWS Glue Data Catalog. The following figure illustrates the data mesh architecture.
All of the statistics from IDC and the others show that there's a massive market for digital services. The first is the new digital opportunities. The next area is data. There's a huge disruption around data. Increasingly now, we can bring the technology to the data rather than the other way around.
Terminology: Let's first discuss some of the terminology used in this post. Research data lake on Amazon S3 – A data lake is a large, centralized repository that allows you to manage all your structured and unstructured data at any scale. This is where the tagging feature in Apache Iceberg comes in handy.
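A sketch of how tagging supports reproducible research, assuming a Spark session with the Iceberg SQL extensions enabled; the catalog, table, and tag names are hypothetical:

```python
from pyspark.sql import SparkSession

# Pin a known-good snapshot of a research table so results can be
# reproduced later against exactly the same data.
spark = SparkSession.builder.getOrCreate()

# Create a named tag pointing at the current snapshot.
spark.sql("ALTER TABLE lake.research_db.experiments CREATE TAG `paper-v1`")

# Query the table exactly as it was when the tag was created.
spark.sql(
    "SELECT * FROM lake.research_db.experiments VERSION AS OF 'paper-v1'"
).show()
```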
The snapshotIds of the source tables involved in the materialized view are also maintained in the metadata. Incremental and full rebuild of a materialized view: we will insert rows into the base table and examine how the materialized view can be updated to reflect the new data. Furthermore, it is partitioned on the d_year column.
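A sketch of that flow against HiveServer2 using the PyHive client; the endpoint, credentials, and table names are hypothetical, and Hive decides between an incremental and a full rebuild by comparing the stored snapshot IDs with the source tables' current ones:

```python
from pyhive import hive

# Connect to HiveServer2; host, user, and table names are hypothetical.
conn = hive.connect(host="hs2.example.com", port=10000, username="etl")
cur = conn.cursor()

# New rows land in the base table...
cur.execute("INSERT INTO store_sales VALUES (2024, 101, 19.99)")

# ...then the materialized view is rebuilt. Hive compares the snapshot
# IDs stored in the view's metadata with the source tables' current
# snapshots and rebuilds incrementally when it can, fully otherwise.
cur.execute("ALTER MATERIALIZED VIEW mv_sales_by_year REBUILD")
```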
Mark: The first element in the process is the link between the source data and the entry point into the data platform. At Ramsey International (RI), we refer to that layer in the architecture as the foundation, but others call it a staging area, raw zone, or even a source data lake.
A data catalog can assist directly with every step except model development. And even then, information from the data catalog can be transferred to a model connector, allowing data scientists to benefit from curated metadata within those platforms. How Data Catalogs Help Data Scientists Ask Better Questions.
By contrast, data profiling today describes an automated process, where a data user can “point and click” to return key results on a given asset, like aggregate functions, top patterns, outliers, inferred data types, and more. This can include rules for data completeness, consistency, accuracy, and validity.
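A toy version of such a profiling pass in pandas, computing completeness, inferred types, aggregates, and a simple outlier count; the input file is hypothetical:

```python
import pandas as pd

# A minimal profiling pass over a dataset. "orders.csv" is hypothetical.
df = pd.read_csv("orders.csv")

profile = {
    "inferred_types": df.dtypes.astype(str).to_dict(),
    "completeness": (1 - df.isna().mean()).to_dict(),  # non-null ratio
    "aggregates": df.describe(include="all").to_dict(),
}

# Count numeric values more than 3 standard deviations from the mean.
numeric = df.select_dtypes("number")
outliers = ((numeric - numeric.mean()).abs() > 3 * numeric.std()).sum()
print(profile["completeness"], outliers.to_dict())
```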
In 2013 I joined American Family Insurance as a metadata analyst. I had always been fascinated by how people find, organize, and access information, so a metadata management role after school was a natural choice. The use cases for metadata are boundless, offering opportunities for innovation in every sector.
Organizations have spent a lot of time and money trying to harmonize data across diverse platforms, including cleansing, uploading metadata, converting code, defining business glossaries, tracking data transformations, and so on. If you want more control over and more value from all your data, join us for a demo of erwin MM.
I mention this here because there was a lot of overlap between current industry data governance needs and what the scientific community is working toward for scholarly infrastructure. The gist is leveraging metadata about research datasets, projects, publications, and so on. Machine learning? It's a quick way to clear the room.
Will the data warehouse as a software tool play a role in the future of Data & Analytics strategy? You cannot get away from a formalized delivery capability focused on regular, scheduled, structured, and reasonably governed data. Data lakes don't offer this, nor should they. E.g., data lakes in Azure as SaaS.
What is Delta Lake? Developed at Databricks, Delta Lake is an open-source data storage layer that runs on an existing data lake and is fully compatible with Apache Spark APIs. Delta Lake uses versioned Parquet files to store data in the cloud. Advantages of using Delta Lake.
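A sketch of that versioning in practice, assuming a Spark session with the Delta Lake package configured; the S3 path is hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Writes produce versioned Parquet files plus a transaction log.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.write.format("delta").mode("overwrite").save("s3://my-bucket/delta/events")

# Time travel: read an earlier version of the table.
v0 = (
    spark.read.format("delta")
    .option("versionAsOf", 0)
    .load("s3://my-bucket/delta/events")
)
v0.show()
```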
That was the Science, here comes the Technology… A Brief Hydrology of Data Lakes. Even back then, these were used for activities such as Analytics, Dashboards, Statistical Modelling, Data Mining, and Advanced Visualisation. This required additional investments in metadata.
Amazon Redshift is a fast, scalable, and fully managed cloud data warehouse that allows you to process and run your complex SQL analytics workloads on structured and semi-structured data. It also helps you securely access your data in operational databases, data lakes, or third-party datasets with minimal movement or copying of data.
It combines the flexibility and scalability of data lake storage with the data analytics, data governance, and data management functionality of the data warehouse. Table cleanup: as tables grow, they often accumulate unused data files, manifest files, and snapshots that aren't needed anymore.
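Iceberg exposes this cleanup as maintenance procedures callable from Spark; a sketch, assuming the Iceberg Spark extensions and hypothetical catalog and table names:

```python
from pyspark.sql import SparkSession

# Iceberg table maintenance via Spark stored procedures; the catalog
# ("lake") and table names are hypothetical.
spark = SparkSession.builder.getOrCreate()

# Expire old snapshots along with data files only they reference.
spark.sql("""
    CALL lake.system.expire_snapshots(
        table => 'sales_db.orders',
        older_than => TIMESTAMP '2024-01-01 00:00:00'
    )
""")

# Delete files in the table location that no snapshot references.
spark.sql("CALL lake.system.remove_orphan_files(table => 'sales_db.orders')")
```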