Data Lake, Document and Reference

Multicloud data lake analytics with Amazon Athena

AWS Big Data

MARCH 18, 2024

Many organizations operate data lakes spanning multiple cloud data stores. In these cases, you may want an integrated query layer to seamlessly run analytical queries across these diverse cloud stores and streamline your data analytics processes. Refer to Using Amazon Athena Federated Query for further details.

Data Lake

Data Lake Analytics Cost-Benefit Management

Enrich your serverless data lake with Amazon Bedrock

AWS Big Data

SEPTEMBER 26, 2024

Organizations are collecting and storing vast amounts of structured and unstructured data like reports, whitepapers, and research documents. By consolidating this information, analysts can discover and integrate data from across the organization, creating valuable data products based on a unified dataset.

Data Lake

Data Lake Cost-Benefit Unstructured Data Modeling

Accelerate Amazon Redshift Data Lake queries with AWS Glue Data Catalog Column Statistics

AWS Big Data

OCTOBER 1, 2024

Amazon Redshift enables you to efficiently query and retrieve structured and semi-structured data from open format files in Amazon S3 data lake without having to load the data into Amazon Redshift tables. Amazon Redshift extends SQL capabilities to your data lake, enabling you to run analytical queries.

Data Lake

Data Lake Statistics Broadcasting Optimization

Webinars

Data Talks, CFOs Listen: Why Analytics Are Key To Better Spend Management

Mastering Apache Airflow® 3.0: What’s New (and What’s Next) for Data Orchestration

MORE WEBINARS

Expand data access through Apache Iceberg using Delta Lake UniForm on AWS

AWS Big Data

NOVEMBER 14, 2024

Under the hood, UniForm generates Iceberg metadata files (including metadata and manifest files) that are required for Iceberg clients to access the underlying data files in Delta Lake tables. Both Delta Lake and Iceberg metadata files reference the same data files. in Delta Lake public document.

Metadata

Metadata Data Warehouse Big Data Data Lake

Choosing an open table format for your transactional data lake on AWS

AWS Big Data

JUNE 9, 2023

A modern data architecture enables companies to ingest virtually any type of data through automated pipelines into a data lake, which provides highly durable and cost-effective object storage at petabyte or exabyte scale.

Data Lake

Data Lake Metadata Statistics Optimization

Achieve data resilience using Amazon OpenSearch Service disaster recovery with snapshot and restore

AWS Big Data

NOVEMBER 11, 2024

For log workloads, restore only recent or relevant logs to save time and use this opportunity to purge unnecessary documents or indexes. Create indexes and populate them with documents. We refer to this role as TheSnapshotRole in this post. For the request structure, see Take snapshots in the OpenSearch documentation.

Snapshot

Snapshot Strategy Dashboards Data Lake

Amazon SageMaker Lakehouse now supports attribute-based access control

AWS Big Data

APRIL 24, 2025

These tags are assigned to IAM users or roles and can be used to define or restrict access to specific resources or data. For more details, refer to Tags for AWS Identity and Access Management resources and Pass session tags in AWS STS. Set up a data lake admin. For instructions, see Create a data lake administrator.

Sales

Sales Data Lake Management Data-driven

Enriching metadata for accurate text-to-SQL generation for Amazon Athena

AWS Big Data

OCTOBER 14, 2024

Enterprise data is brought into data lakes and data warehouses to carry out analytical, reporting, and data science use cases using AWS analytical services like Amazon Athena , Amazon Redshift , Amazon EMR , and so on.

Metadata

Metadata Data Lake Modeling Data Warehouse

Building end-to-end data lineage for one-time and complex queries using Amazon Athena, Amazon Redshift, Amazon Neptune and dbt

AWS Big Data

DECEMBER 12, 2024

One-time and complex queries are two common scenarios in enterprise data analytics. Complex queries, on the other hand, refer to large-scale data processing and in-depth analysis based on petabyte-level data warehouses in massive data scenarios. file, enter the preprocessing code for the raw lineage data.

Snapshot

Snapshot Recreation/Entertainment Experimentation Data Lake

Streamline AI-driven analytics with governance: Integrating Tableau with Amazon DataZone

AWS Big Data

OCTOBER 30, 2024

With this integration, you can now seamlessly query your governed data lake assets in Amazon DataZone using popular business intelligence (BI) and analytics tools, including partner solutions like Tableau. Refer to the detailed blog post on how you can use this to connect through various other tools.

Analytics

Analytics Visualization Data Governance Data-driven

Build a real-time GDPR-aligned Apache Iceberg data lake

AWS Big Data

FEBRUARY 24, 2023

Data lakes are a popular choice for today’s organizations to store their data around their business activities. As a best practice of a data lake design, data should be immutable once stored. A data lake built on AWS uses Amazon Simple Storage Service (Amazon S3) as its primary storage environment.

Data Lake

Data Lake Metadata Testing Data Warehouse

Accelerate SQL code migration from Google BigQuery to Amazon Redshift using BladeBridge

AWS Big Data

NOVEMBER 7, 2024

Tens of thousands of customers use Amazon Redshift every day to run analytics, processing exabytes of data for business insights. times better price performance than other cloud data warehouses. For more details, refer to the BladeBridge Analyzer Demo. Amazon Redshift is built for scale and delivers up to 7.9

Data Warehouse

Data Warehouse Reporting Big Data Data Lake

Simplify data ingestion from Amazon S3 to Amazon Redshift using auto-copy

AWS Big Data

OCTOBER 30, 2024

Automate ingestion from a single data source With a auto-copy job, you can automate ingestion from a single data source by creating one job and specifying the path to the S3 objects that contain the data. The S3 object path can reference a set of folders that have the same key prefix.

Data Warehouse

Data Warehouse Sales Data Lake Recreation/Entertainment

Create an Apache Hudi-based near-real-time transactional data lake using AWS DMS, Amazon Kinesis, AWS Glue streaming ETL, and data visualization using Amazon QuickSight

AWS Big Data

AUGUST 3, 2023

Data analytics on operational data at near-real time is becoming a common need. Due to the exponential growth of data volume, it has become common practice to replace read replicas with data lakes to have better scalability and performance. Apache Hudi connector for AWS Glue For this post, we use AWS Glue 4.0,

Data Lake

Data Lake Visualization Dashboards Insurance

Use Amazon Athena with Spark SQL for your open-source transactional table formats

AWS Big Data

JANUARY 24, 2024

AWS-powered data lakes, supported by the unmatched availability of Amazon Simple Storage Service (Amazon S3), can handle the scale, agility, and flexibility required to combine different data and analytics approaches. For more information, refer to Amazon S3: Allows read and write access to objects in an S3 Bucket.

Snapshot

Snapshot Data Lake Metadata Optimization

Compose your ETL jobs for MongoDB Atlas with AWS Glue

AWS Big Data

MAY 3, 2023

In today’s data-driven business environment, organizations face the challenge of efficiently preparing and transforming large amounts of data for analytics and data science purposes. Businesses need to build data warehouses and data lakes based on operational data.

Data Lake

Data Lake Data Warehouse Data-driven Optimization

Manage concurrent write conflicts in Apache Iceberg on the AWS Glue Data Catalog

AWS Big Data

APRIL 8, 2025

In modern data architectures, Apache Iceberg has emerged as a popular table format for data lakes, offering key features including ACID transactions and concurrent write support. For more detailed configuration, refer to Write properties in the Iceberg documentation.

Snapshot

Snapshot Management Metadata Big Data

Ingest data from Google Analytics 4 and Google Sheets to Amazon Redshift using Amazon AppFlow

AWS Big Data

JANUARY 6, 2025

Amazon Redshift is a fast, scalable, and fully managed cloud data warehouse that allows you to process and run your complex SQL analytics workloads on structured and semi-structured data. It also helps you securely access your data in operational databases, data lakes, or third-party datasets with minimal movement or copying of data.

Analytics

Analytics Data Warehouse Big Data Metrics

Petabyte-scale log analytics with Amazon S3, Amazon OpenSearch Service, and Amazon OpenSearch Ingestion

AWS Big Data

MARCH 7, 2024

At the same time, they need to optimize operational costs to unlock the value of this data for timely insights and do so with a consistent performance. With this massive data growth, data proliferation across your data stores, data warehouse, and data lakes can become equally challenging.

Data Lake

Data Lake Analytics Dashboards Metrics

Build and manage your modern data stack using dbt and AWS Glue through dbt-glue, the new “trusted” dbt adapter

AWS Big Data

NOVEMBER 29, 2023

It enables data engineers, data scientists, and analytics engineers to define the business logic with SQL select statements and eliminates the need to write boilerplate data manipulation language (DML) and data definition language (DDL) expressions. Data engineers define dbt models for their data representations.

Data Lake

Data Lake Management Metrics Data Warehouse

Exploring real-time streaming for generative AI Applications

AWS Big Data

MARCH 25, 2024

To learn more about RAG, refer to Question answering using Retrieval Augmented Generation with foundation models in Amazon SageMaker JumpStart. A RAG-based generative AI application can only produce generic responses based on its training data and the relevant documents in the knowledge base.

Data Lake

Data Lake Unstructured Data Management Snapshot

Hybrid Vs. Multi-Cloud: 5 Key Comparisons in Kafka Architectures

Smart Data Collective

AUGUST 17, 2022

You can safely use an Apache Kafka cluster for seamless data movement from the on-premise hardware solution to the data lake using various cloud services like Amazon’s S3 and others. It will enable you to quickly transform and load the data results into Amazon S3 data lakes or JDBC data stores.

Data Lake

Data Lake Insurance Data-driven Data Processing

Data governance in the age of generative AI

AWS Big Data

FEBRUARY 29, 2024

Data governance is a critical building block across all these approaches, and we see two emerging areas of focus. First, many LLM use cases rely on enterprise knowledge that needs to be drawn from unstructured data such as documents, transcripts, and images, in addition to structured data from data warehouses.

Data Governance

Data Governance Unstructured Data Metadata Data Lake

Configure cross-Region table access with the AWS Glue Catalog and AWS Lake Formation

AWS Big Data

AUGUST 3, 2023

Today’s modern data lakes span multiple accounts, AWS Regions, and lines of business in organizations. It’s important that their data solution gives them the ability to share and access data securely and safely across Regions. For example, we are using a data lake administrator role called LF-Admin.

Data Lake

Data Lake Metadata Management Data Processing

Data Cataloging in the Data Lake: Alation + Kylo

Alation

FEBRUARY 20, 2020

When it was no longer a hard requirement that a physical data model be created upon the ingestion of data, there was a resulting drop in richness of the description and consistency of the data stored in Hadoop. You did not have to understand or prepare the data to get it into Hadoop, so people rarely did.

Data Lake

Data Lake Metadata Structured Data Big Data

Migrate data from Azure Blob Storage to Amazon S3 using AWS Glue

AWS Big Data

OCTOBER 20, 2023

Today, we are pleased to announce new AWS Glue connectors for Azure Blob Storage and Azure Data Lake Storage that allow you to move data bi-directionally between Azure Blob Storage, Azure Data Lake Storage, and Amazon Simple Storage Service (Amazon S3). option("header","true").load("wasbs://yourblob@youraccountname.blob.core.windows.net/loadingtest-input/100mb")

Data Lake

Data Lake Big Data Data Warehouse Consulting

Harness Zero Copy data sharing from Salesforce Data Cloud to Amazon Redshift for Unified Analytics – Part 1

AWS Big Data

AUGUST 27, 2024

Data can be shared with a Redshift Serverless or provisioned cluster in the same Region or with a Redshift Serverless cluster in a different Region. To get an overview of Salesforce Zero Copy integration with Amazon Redshift, please refer to this Salesforce Blog. For more details, refer to Querying the AWS Glue Data Catalog.

Data Lake

Data Lake Analytics Data-driven Management

Data Modeling 301 for the cloud: data lake and NoSQL data modeling and design

erwin

AUGUST 15, 2022

For NoSQL, data lakes, and data lake houses—data modeling of both structured and unstructured data is somewhat novel and thorny. This blog is an introduction to some advanced NoSQL and data lake database design techniques (while avoiding common pitfalls) is noteworthy. Data Modeling.

Data Lake

Data Lake Modeling Unstructured Data Data Warehouse

Implement alerts in Amazon OpenSearch Service with PagerDuty

AWS Big Data

JUNE 8, 2023

You can use the plugin to set up different monitors, including cluster health, an individual document, a custom query, or aggregated data. For instructions, refer to Creating and managing Amazon OpenSearch Service domains. About the Authors Manikanta Gona is a Data and ML Engineer at AWS Professional Services.

Data Lake

Data Lake Dashboards Metrics Testing

What is a data architect? Skills, salaries, and how to become a data framework master

CIO Business Intelligence

OCTOBER 13, 2023

Data architect Armando Vázquez identifies eight common types of data architects: Enterprise data architect: These data architects oversee an organization’s overall data architecture, defining data architecture strategy and designing and implementing architectures.

Data Architecture

Data Architecture Data Warehouse Statistics Visualization

How gaming companies can use Amazon Redshift Serverless to build scalable analytical applications faster and easier

AWS Big Data

MARCH 7, 2023

Flexible and easy to use – The solutions should provide less restrictive, easy-to-access, and ready-to-use data. A data hub contains data at multiple levels of granularity and is often not integrated. It differs from a data lake by offering data that is pre-validated and standardized, allowing for simpler consumption by users.

Analytics

Analytics Data Warehouse Data Lake Metadata

Introducing Amazon Q data integration in AWS Glue

AWS Big Data

APRIL 30, 2024

Amazon Q Developer can now generate complex data integration jobs with multiple sources, destinations, and data transformations. and Amazon Q provides concise explanations along with references you can use to follow up on your questions and validate the guidance.

Data Integration

Data Integration Data Lake Data Warehouse Software

An Introduction to Disaster Recovery with the Cloudera Data Platform

Cloudera

AUGUST 9, 2022

The CDP Disaster Recovery Reference Architecture. Today we announce the official release of the CDP Disaster Recovery Reference Architecture (DRRA). The CDP Disaster Recovery Reference Architecture is available in our public documentation within the CDP Reference Architectures microsite. Cloudera Data Platform.

Data Lake

Data Lake Data Warehouse Data-driven IoT

Unstructured data management and governance using AWS AI/ML and analytics services

AWS Big Data

OCTOBER 25, 2023

Text, images, audio, and videos are common examples of unstructured data. Most companies produce and consume unstructured data such as documents, emails, web pages, engagement center phone calls, and social media. The steps of the workflow are as follows: Integrated AI services extract data from the unstructured data.

Unstructured Data

Unstructured Data Metadata Management Analytics

Non-JSON ingestion using Amazon Kinesis Data Streams, Amazon MSK, and Amazon Redshift Streaming Ingestion

AWS Big Data

OCTOBER 2, 2023

For a deeper exploration on configuring and using streaming ingestion in Amazon Redshift , refer to Real-time analytics with Amazon Redshift streaming ingestion. For streams that contain the raw binary data encoded in JSON format, Amazon Redshift provides a variety of tools for parsing and managing the data.

Cost-Benefit

Cost-Benefit Metadata Structured Data Data-driven

Constructing A Digital Transformation Strategy: Putting the Data in Digital Transformation

erwin

JULY 17, 2019

It improves IT and business data literacy and knowledge, supporting enterprise data governance and business enablement. It helps solve the lack of visibility and control over “data at rest” in databases, data lakes and data warehouses and “data in motion” as it is integrated with and used by key applications.

Digital Transformation

Digital Transformation Strategy Metadata Data-driven

Build a decentralized semantic search engine on heterogeneous data stores using autonomous agents

AWS Big Data

MAY 28, 2024

LLMs could automate the extraction and summarization of key information from these documents, enabling analysts to query the LLM and receive reliable summaries. This would allow analysts to process the documents to develop investment recommendations faster and more efficiently. If yes, run query to extract information.

Unstructured Data

Unstructured Data Data Warehouse Structured Data Testing

Why enterprise CIOs need to plan for Microsoft gen AI

CIO Business Intelligence

AUGUST 14, 2024

Organizations with experience building enterprise data lakes connecting to many different data sources have AI advantages. They’re more able to connect to a diverse set of data sources, build more complex context around queries going through the model, and do retrieval augmented generation,” he adds.

Enterprise

Enterprise Cost-Benefit Experimentation Modeling

Unlock scalable analytics with a secure connectivity pattern in AWS Glue to read from or write to Snowflake

AWS Big Data

AUGUST 19, 2024

Refer to Enabling AWS PrivateLink in the Snowflake documentation to verify the steps, required access level, and service level to set the configurations. For Data sources , search for and select Snowflake. To obtain the Snowflake PrivateLink account URL, refer to parameters obtained in the prerequisites. Choose Next.

Analytics

Analytics Data-driven Data Integration Data Lake

How Zoom implemented streaming log ingestion and efficient GDPR deletes using Apache Hudi on Amazon EMR

AWS Big Data

MAY 16, 2023

This issue becomes increasingly challenging for larger organizations that operate in multiple countries and states, because each country and state may have their own rules and regulations regarding data retention and deletion. Data ingestion workflow The following diagram illustrates the data ingestion workflow.

Data Lake

Data Lake Cost-Benefit Optimization Testing

Differentiate generative AI applications with your data using AWS analytics and managed databases

AWS Big Data

SEPTEMBER 12, 2024

The document and key value data models allow you the flexibility to adjust the schema of the conversation state over time. The application gets prompt templates from an S3 data lake and creates the engineered prompt. The user interaction is stored in a data lake for downstream usage and BI analysis.

Management

Management Analytics Data Lake Interactive

Generic orchestration framework for data warehousing workloads using Amazon Redshift RSQL

AWS Big Data

APRIL 3, 2023

Tens of thousands of customers run business-critical workloads on Amazon Redshift , AWS’s fast, petabyte-scale cloud data warehouse delivering the best price-performance. With Amazon Redshift, you can query data across your data warehouse, operational data stores, and data lake using standard SQL.

Data Warehouse

Data Warehouse Testing Data Lake Data-driven

Extract data from SAP ERP using AWS Glue and the SAP SDK

AWS Big Data

FEBRUARY 8, 2023

For more information, refer to Download and Installation of NW RFC SDK. For instructions, refer to Configuration basics. Clone the PyRFC module from GitHub For instructions for creating and connecting to an Amazon Linux 2 AMI EC2 instance, refer to Tutorial: Get started with Amazon EC2 Linux instances.

Testing

Testing Data Integration Data Lake Enterprise

Patterns for enterprise data sharing at scale

AWS Big Data

FEBRUARY 27, 2023

Confidential or restricted data access might involve aspects of identity and authorization management. Examples of confidential data include Social Security numbers and vehicle identification numbers. The data lake administrator of the recipient account can view the direct cross-account grants, but can’t revoke permissions.

Enterprise

Enterprise Publishing Data Lake Management

Multicloud data lake analytics with Amazon Athena

Enrich your serverless data lake with Amazon Bedrock

Webinars

Trending Sources

Accelerate Amazon Redshift Data Lake queries with AWS Glue Data Catalog Column Statistics

Webinars

Expand data access through Apache Iceberg using Delta Lake UniForm on AWS

Choosing an open table format for your transactional data lake on AWS

Achieve data resilience using Amazon OpenSearch Service disaster recovery with snapshot and restore

Amazon SageMaker Lakehouse now supports attribute-based access control

Enriching metadata for accurate text-to-SQL generation for Amazon Athena

Building end-to-end data lineage for one-time and complex queries using Amazon Athena, Amazon Redshift, Amazon Neptune and dbt

Streamline AI-driven analytics with governance: Integrating Tableau with Amazon DataZone

Build a real-time GDPR-aligned Apache Iceberg data lake

Accelerate SQL code migration from Google BigQuery to Amazon Redshift using BladeBridge

Simplify data ingestion from Amazon S3 to Amazon Redshift using auto-copy

Create an Apache Hudi-based near-real-time transactional data lake using AWS DMS, Amazon Kinesis, AWS Glue streaming ETL, and data visualization using Amazon QuickSight

Use Amazon Athena with Spark SQL for your open-source transactional table formats

Compose your ETL jobs for MongoDB Atlas with AWS Glue

Manage concurrent write conflicts in Apache Iceberg on the AWS Glue Data Catalog

Ingest data from Google Analytics 4 and Google Sheets to Amazon Redshift using Amazon AppFlow

Petabyte-scale log analytics with Amazon S3, Amazon OpenSearch Service, and Amazon OpenSearch Ingestion

Build and manage your modern data stack using dbt and AWS Glue through dbt-glue, the new “trusted” dbt adapter

Exploring real-time streaming for generative AI Applications

Hybrid Vs. Multi-Cloud: 5 Key Comparisons in Kafka Architectures

Data governance in the age of generative AI

Configure cross-Region table access with the AWS Glue Catalog and AWS Lake Formation

Data Cataloging in the Data Lake: Alation + Kylo

Migrate data from Azure Blob Storage to Amazon S3 using AWS Glue

Harness Zero Copy data sharing from Salesforce Data Cloud to Amazon Redshift for Unified Analytics – Part 1

Data Modeling 301 for the cloud: data lake and NoSQL data modeling and design

Implement alerts in Amazon OpenSearch Service with PagerDuty

What is a data architect? Skills, salaries, and how to become a data framework master

How gaming companies can use Amazon Redshift Serverless to build scalable analytical applications faster and easier

Introducing Amazon Q data integration in AWS Glue

An Introduction to Disaster Recovery with the Cloudera Data Platform

Unstructured data management and governance using AWS AI/ML and analytics services

Non-JSON ingestion using Amazon Kinesis Data Streams, Amazon MSK, and Amazon Redshift Streaming Ingestion

Constructing A Digital Transformation Strategy: Putting the Data in Digital Transformation

Build a decentralized semantic search engine on heterogeneous data stores using autonomous agents

Why enterprise CIOs need to plan for Microsoft gen AI

Unlock scalable analytics with a secure connectivity pattern in AWS Glue to read from or write to Snowflake

How Zoom implemented streaming log ingestion and efficient GDPR deletes using Apache Hudi on Amazon EMR

Differentiate generative AI applications with your data using AWS analytics and managed databases

Generic orchestration framework for data warehousing workloads using Amazon Redshift RSQL

Extract data from SAP ERP using AWS Glue and the SAP SDK

Patterns for enterprise data sharing at scale

Stay Connected