Data Lake, Reference and Testing

Run Apache XTable in AWS Lambda for background conversion of open table formats

AWS Big Data

NOVEMBER 26, 2024

Initially, data warehouses were the go-to solution for structured data and analytical workloads but were limited by proprietary storage formats and their inability to handle unstructured data. Eventually, transactional data lakes emerged to add the transactional consistency and performance of a data warehouse to the data lake.

Metadata

Metadata Data Lake Snapshot Data Warehouse

Recap of Amazon Redshift key product announcements in 2024

AWS Big Data

DECEMBER 17, 2024

Today, Amazon Redshift is used by customers across all industries for a variety of use cases, including data warehouse migration and modernization, near real-time analytics, self-service analytics, data lake analytics, machine learning (ML), and data monetization.

Data Lake

Data Lake Data Warehouse Data-driven Optimization

Use Apache Iceberg in a data lake to support incremental data processing

AWS Big Data

MARCH 2, 2023

Iceberg has become very popular for its support for ACID transactions in data lakes and features like schema and partition evolution, time travel, and rollback. AWS Glue 3.0 and later supports the Apache Iceberg framework for data lakes. The following diagram illustrates the solution architecture.

Data Lake

Data Lake Data Processing Metadata Snapshot

Accelerate Amazon Redshift Data Lake queries with AWS Glue Data Catalog Column Statistics

AWS Big Data

OCTOBER 1, 2024

Amazon Redshift enables you to efficiently query and retrieve structured and semi-structured data from open format files in an Amazon S3 data lake without having to load the data into Amazon Redshift tables. Amazon Redshift extends SQL capabilities to your data lake, enabling you to run analytical queries.

Data Lake

Data Lake Statistics Broadcasting Optimization

Enrich your serverless data lake with Amazon Bedrock

AWS Big Data

SEPTEMBER 26, 2024

For many organizations, this centralized data store follows a data lake architecture. Although data lakes provide a centralized repository, making sense of this data and extracting valuable insights can be challenging. We recommend testing your use case and data with different models.

Data Lake

Data Lake Cost-Benefit Unstructured Data Modeling

Simplify operational data processing in data lakes using AWS Glue and Apache Hudi

AWS Big Data

SEPTEMBER 13, 2023

A modern data architecture is an evolutionary architecture pattern designed to integrate a data lake, data warehouse, and purpose-built stores with a unified governance model. The company wanted the ability to continue processing operational data in the secondary Region in the rare event of primary Region failure.

Data Lake

Data Lake Data Processing Metadata Snapshot

Choosing an open table format for your transactional data lake on AWS

AWS Big Data

JUNE 9, 2023

A modern data architecture enables companies to ingest virtually any type of data through automated pipelines into a data lake, which provides highly durable and cost-effective object storage at petabyte or exabyte scale.

Data Lake

Data Lake Metadata Statistics Optimization

MLOps and DevOps: Why Data Makes It Different

O'Reilly on Data

OCTOBER 19, 2021

We need robust versioning for data, models, code, and preferably even the internal state of applications—think Git on steroids to answer inevitable questions: What changed? The applications must be integrated to the surrounding business systems so ideas can be tested and validated in the real world in a controlled manner.

IT

IT Testing Experimentation Software

Implementing a Pharma Data Mesh using DataOps

DataKitchen

AUGUST 19, 2021

Figure 3 shows an example processing architecture with data flowing in from internal and external sources. Each data source is updated on its own schedule, for example, daily, weekly or monthly. The data scientists and analysts have what they need to build analytics for the user. The new Recipes run, and BOOM! Conclusion.

Data Warehouse

Data Warehouse Data Lake Manufacturing Testing

Design a data mesh pattern for Amazon EMR-based data lakes using AWS Lake Formation with Hive metastore federation

AWS Big Data

JUNE 10, 2024

For detailed information on managing your Apache Hive metastore using Lake Formation permissions, refer to Query your Apache Hive metastore with AWS Lake Formation permissions. In this post, we present a methodology for deploying a data mesh consisting of multiple Hive data warehouses across EMR clusters.

Data Lake

Data Lake Metadata Data Warehouse Data Processing

Build a real-time GDPR-aligned Apache Iceberg data lake

AWS Big Data

FEBRUARY 24, 2023

Data lakes are a popular choice for today’s organizations to store their data around their business activities. As a best practice of a data lake design, data should be immutable once stored. A data lake built on AWS uses Amazon Simple Storage Service (Amazon S3) as its primary storage environment.

Data Lake

Data Lake Metadata Testing Data Warehouse

Build a high-performance quant research platform with Apache Iceberg

AWS Big Data

JANUARY 9, 2025

Iceberg offers distinct advantages through its metadata layer over Parquet, such as improved data management, performance optimization, and integration with various query engines. Unlike direct Amazon S3 access, Iceberg supports these operations on petabyte-scale data lakes without requiring complex custom code.

Metadata

Metadata Snapshot Cost-Benefit Optimization

Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes

AWS Big Data

MAY 24, 2023

When you build your transactional data lake using Apache Iceberg to solve your functional use cases, you need to focus on operational use cases for your S3 data lake to optimize the production environment. For more information, refer to Retry Amazon S3 requests with EMRFS.

Data Lake

Data Lake Snapshot Metadata Optimization

Navigating Data Entities, BYOD, and Data Lakes in Microsoft Dynamics

Jet Global

SEPTEMBER 4, 2020

Its solution was to replicate data from the production database, using data entities, into a traditional relational database. Microsoft referred to this approach as “bring your own database” (BYOD). There is an established body of practice around creating, managing, and accessing OLAP data (known as “cubes”). Data Lakes.

Data Lake

Data Lake OLAP Data Warehouse Unstructured Data

Implement slowly changing dimensions in a data lake using AWS Glue and Delta

AWS Big Data

MARCH 28, 2023

As organizations across the globe are modernizing their data platforms with data lakes on Amazon Simple Storage Service (Amazon S3), handling SCDs in data lakes can be challenging.

Data Lake

Data Lake Testing Snapshot Big Data

Write queries faster with Amazon Q generative SQL for Amazon Redshift

AWS Big Data

NOVEMBER 7, 2024

Refer to Easy analytics and cost-optimization with Amazon Redshift Serverless to get started. Safety features: Amazon Q generative SQL has built-in safety features to warn if a generated SQL statement will modify data and will only run based on user permissions. To test this, let’s ask Amazon Q to “delete data from web_sales table.”

Metadata

Metadata Sales Data Warehouse Optimization

Perform upserts in a data lake using Amazon Athena and Apache Iceberg

AWS Big Data

APRIL 27, 2023

Amazon Athena supports the MERGE command on Apache Iceberg tables, which allows you to perform inserts, updates, and deletes in your data lake at scale using familiar SQL statements that are compliant with ACID (Atomic, Consistent, Isolated, Durable).

Data Lake

Data Lake Snapshot Optimization Data Transformation

Amazon Q data integration adds DataFrame support and in-prompt context-aware job creation

AWS Big Data

DECEMBER 20, 2024

The DataFrame code generation now extends beyond AWS Glue DynamicFrame to support a broader range of data processing scenarios. Upon checking the S3 data target, we can see the S3 path is now a placeholder and the output format is Parquet. To learn more, refer to Amazon Q data integration in AWS Glue.

Data Integration

Data Integration Visualization Data Processing Data Lake

Implement tag-based access control for your data lake and Amazon Redshift data sharing with AWS Lake Formation

AWS Big Data

JULY 21, 2023

Data-driven organizations treat data as an asset and use it across different lines of business (LOBs) to drive timely insights and better business decisions. This leads to having data across many instances of data warehouses and data lakes using a modern data architecture in separate AWS accounts.

Data Lake

Data Lake Data Warehouse Marketing Management

Enriching metadata for accurate text-to-SQL generation for Amazon Athena

AWS Big Data

OCTOBER 14, 2024

Enterprise data is brought into data lakes and data warehouses to carry out analytical, reporting, and data science use cases using AWS analytical services like Amazon Athena, Amazon Redshift, Amazon EMR, and so on. A question arises on what level of details we need to include in the table metadata.

Metadata

Metadata Data Lake Modeling Data Warehouse

Apache Iceberg optimization: Solving the small files problem in Amazon EMR

AWS Big Data

OCTOBER 3, 2023

In our previous post Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes, we discussed how you can implement solutions to improve operational efficiencies of your Amazon Simple Storage Service (Amazon S3) data lake that is using the Apache Iceberg open table format and running on the Amazon EMR big data platform.

Optimization

Optimization Snapshot Data Lake Metadata

Backtesting index rebalancing arbitrage with Amazon EMR and Apache Iceberg

AWS Big Data

JULY 3, 2023

Major market indexes, such as S&P 500, are subject to periodic inclusions and exclusions for reasons beyond the scope of this post (for an example, refer to CoStar Group, Invitation Homes Set to Join S&P 500; Others to Join S&P 100, S&P MidCap 400, and S&P SmallCap 600).

Snapshot

Snapshot Data Lake Testing Strategy

Use IAM runtime roles with Amazon EMR Studio Workspaces and AWS Lake Formation for cross-account fine-grained access control

AWS Big Data

NOVEMBER 6, 2023

You can attach an EMR Studio Workspace to an EMR cluster, and use the compute power of the EMR cluster and run data science jobs on the cluster. Data is often stored in data lakes managed by AWS Lake Formation, enabling you to apply fine-grained access control through a simple grant or revoke mechanism.

Data Lake

Data Lake Sales Management Testing

Introducing a new unified data connection experience with Amazon SageMaker Lakehouse unified data connectivity

AWS Big Data

DECEMBER 16, 2024

For each service, you need to learn the supported authorization and authentication methods, data access APIs, and framework to onboard and test data sources. This approach simplifies your data journey and helps you meet your security requirements. This new capability can simplify your data journey.

Visualization

Visualization Data Processing Testing Publishing

A Day in the Life of a DataOps Engineer

DataKitchen

OCTOBER 11, 2021

The data engineer then emails the BI Team, who refreshes a Tableau dashboard. Figure 1: Example data pipeline with manual processes. There are no automated tests, so errors frequently pass through the pipeline. Figure 2: Example data pipeline with DataOps automation. Adding Tests to Reduce Stress.

Testing

Testing Metadata Dashboards Statistics

Automated data governance with AWS Glue Data Quality, sensitive data detection, and AWS Lake Formation

AWS Big Data

OCTOBER 10, 2023

Due to the volume, velocity, and variety of data being ingested in data lakes, it can get challenging to develop and maintain policies and procedures to ensure data governance at scale for your data lake. Data confidentiality and data quality are the two essential themes for data governance.

Data Quality

Data Quality Data Governance Data Lake Testing

Access Amazon Redshift data from Salesforce Data Cloud with Zero Copy Data Federation

AWS Big Data

JUNE 25, 2024

This post is co-authored by Vijay Gopalakrishnan, Director of Product, Salesforce Data Cloud. In today’s data-driven business landscape, organizations collect a wealth of data across various touch points and unify it in a central data warehouse or a data lake to deliver business insights.

Data Lake

Data Lake Cost-Benefit Data-driven Data Warehouse

Simplify data lake access control for your enterprise users with trusted identity propagation in AWS IAM Identity Center, AWS Lake Formation, and Amazon S3 Access Grants

AWS Big Data

MAY 29, 2024

Refer to Configure SAML and SCIM with Okta and IAM Identity Center for instructions. You need to reference the bucket name and the certificate bundle.zip file in AWS CloudFormation. Refer to the following table for a list of important parameters. In this post, we use the us-east-1 Region. In this post, we grant access to Group1.

Data Lake

Data Lake Enterprise Management Business Intelligence

Perform data parity at scale for data modernization programs using AWS Glue Data Quality

AWS Big Data

OCTOBER 9, 2024

Today, customers are embarking on data modernization programs by migrating on-premises data warehouses and data lakes to the AWS Cloud to take advantage of the scale and advanced analytical capabilities of the cloud. Compare ongoing data that is replicated from the source on-premises database to the target S3 data lake.

Data Quality

Data Quality Data Lake Data Warehouse Metrics

Why the Data Journey Manifesto?

DataKitchen

JUNE 12, 2023

I spent much time de-categorizing DataOps: we are not discussing ETL, Data Lake, or Data Science. Today we have had over 20,000 signatures, millions of page views, and copycat clones, and it is frequently used as a reference guide. It’s Customer Journey for data analytic systems.

Testing

Testing Dashboards Data Lake Data Science

Set up cross-account AWS Glue Data Catalog access using AWS Lake Formation and AWS IAM Identity Center with Amazon Redshift and Amazon QuickSight

AWS Big Data

AUGUST 5, 2024

These business units have varying landscapes, where a data lake is managed by Amazon Simple Storage Service (Amazon S3) and analytics workloads are run on Amazon Redshift, a fast, scalable, and fully managed cloud data warehouse that allows you to process and run your complex SQL analytics workloads on structured and semi-structured data.

Data Lake

Data Lake Finance Sales Management

Use AWS Glue ETL to perform merge, partition evolution, and schema evolution on Apache Iceberg

AWS Big Data

MARCH 4, 2024

As enterprises collect increasing amounts of data from various sources, the structure and organization of that data often need to change over time to meet evolving analytical needs. Schema evolution enables adding, deleting, renaming, or modifying columns without needing to rewrite existing data.

Snapshot

Snapshot Data Lake Metadata Recreation/Entertainment

Enhance data security and governance for Amazon Redshift Spectrum with VPC endpoints

AWS Big Data

FEBRUARY 16, 2024

Many customers are extending their data warehouse capabilities to their data lake with Amazon Redshift. They are looking to further enhance their security posture where they can enforce access policies on their data lakes based on Amazon Simple Storage Service (Amazon S3). Choose Create endpoint.

Data Lake

Data Lake Data Warehouse Testing Business Objectives

Integrate custom applications with AWS Lake Formation – Part 1

AWS Big Data

NOVEMBER 19, 2024

Due to these limitations, the application should not be used for arbitrary tests. In this post, we provide instructions on how to deploy a sample API application integrated with Lake Formation that implements the solution architecture. We also show how to test the function with Lambda tests.

Data Lake

Data Lake Metadata Testing Data Processing

Accelerate data integration with Salesforce and AWS using AWS Glue

AWS Big Data

SEPTEMBER 4, 2024

This solution also allows you to update certain fields of the account object in the data lake and push it back to Salesforce. To achieve this, you create two ETL jobs using AWS Glue with the Salesforce connector, and create a transactional data lake on Amazon S3 using Apache Iceberg.

Data Integration

Data Integration Data Lake Data-driven Cost-Benefit

Automate schema evolution at scale with Apache Hudi in AWS Glue

AWS Big Data

FEBRUARY 7, 2023

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning (ML), and application development. Apache Hudi supports ACID transactions and CRUD operations on a data lake. You don’t alter queries separately in the data lake.

Data Lake

Data Lake Testing Big Data Structured Data

Migrate data from Azure Blob Storage to Amazon S3 using AWS Glue

AWS Big Data

OCTOBER 20, 2023

Today, we are pleased to announce new AWS Glue connectors for Azure Blob Storage and Azure Data Lake Storage that allow you to move data bi-directionally between Azure Blob Storage, Azure Data Lake Storage, and Amazon Simple Storage Service (Amazon S3). option("header","true").load("wasbs://yourblob@youraccountname.blob.core.windows.net/loadingtest-input/100mb")

Data Lake

Data Lake Big Data Data Warehouse Consulting

What is Data Pipeline? A Detailed Explanation

Smart Data Collective

OCTOBER 17, 2022

A point of data entry in a given pipeline. Examples of an origin include storage systems like data lakes, data warehouses and data sources that include IoT devices, transaction processing applications, APIs or social media. The final point to which the data has to be eventually transferred is a destination.

Data Warehouse

Data Warehouse Data Lake Visualization Big Data

What is a data architect? Skills, salaries, and how to become a data framework master

CIO Business Intelligence

OCTOBER 13, 2023

Data architect Armando Vázquez identifies eight common types of data architects: Enterprise data architect: These data architects oversee an organization’s overall data architecture, defining data architecture strategy and designing and implementing architectures.

Data Architecture

Data Architecture Data Warehouse Statistics Visualization

Join a streaming data source with CDC data for real-time serverless data analytics using AWS Glue, AWS DMS, and Amazon DynamoDB

AWS Big Data

MAY 30, 2023

Customers have been using data warehousing solutions to perform their traditional analytics tasks. Recently, data lakes have gained a lot of traction to become the foundation for analytical solutions, because they come with benefits such as scalability, fault tolerance, and support for structured, semi-structured, and unstructured datasets.

Data Lake

Data Lake Data Analytics Analytics Data Processing

Access Amazon Athena in your applications using the WebSocket API

AWS Big Data

MARCH 2, 2023

Many organizations are building data lakes to store and analyze large volumes of structured, semi-structured, and unstructured data. In addition, many teams are moving towards a data mesh architecture, which requires them to expose their data sets as easily consumable data products.

Data Lake

Data Lake Testing Interactive Unstructured Data

Implement alerts in Amazon OpenSearch Service with PagerDuty

AWS Big Data

JUNE 8, 2023

For instructions, refer to Creating and managing Amazon OpenSearch Service domains. Choose Send test message and test to make sure you receive an alert on the PagerDuty service. This notification can be safely acknowledged and resolved from PagerDuty because this was a test.

Data Lake

Data Lake Dashboards Metrics Testing

Simplifying data processing at Capitec with Amazon Redshift integration for Apache Spark

AWS Big Data

NOVEMBER 10, 2023

These tables are then joined with tables from the Enterprise Data Lake (EDL) at runtime. During feature development, data engineers require a seamless interface to the EDW. Previous solution process: In the previous solution, product team data engineers spent 30 minutes per run to manually expose Redshift data to Spark.

Data Processing

Data Processing Data Lake Data Warehouse Optimization

Use your corporate identities for analytics with Amazon EMR and AWS IAM Identity Center

AWS Big Data

APRIL 26, 2024

Use Lake Formation to grant permissions to users to access data. Test the solution by accessing data with a corporate identity. Audit user data access. For a complete guide on creating and providing a certificate, refer to Providing certificates for encrypting data in transit with Amazon EMR encryption.

Analytics

Analytics Data Lake Management Enterprise
