For many organizations, this centralized data store follows a data lake architecture. Although data lakes provide a centralized repository, making sense of this data and extracting valuable insights can be challenging.
For detailed information on managing your Apache Hive metastore using Lake Formation permissions, refer to Query your Apache Hive metastore with AWS Lake Formation permissions. In this post, we present a methodology for deploying a data mesh consisting of multiple Hive data warehouses across EMR clusters.
The DataFrame code generation now extends beyond AWS Glue DynamicFrame to support a broader range of data processing scenarios. Next, the merged data is filtered to include only a specific geographic region. Then the transformed output data is saved to Amazon S3 for further processing in the future.
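As a hedged illustration of the kind of generated DataFrame code described above (the column names, region value, and S3 paths are hypothetical placeholders, not the post's actual output):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("merge-filter-example").getOrCreate()

# Read the two input datasets (hypothetical paths).
orders = spark.read.parquet("s3://example-bucket/orders/")
customers = spark.read.parquet("s3://example-bucket/customers/")

# Merge the datasets, then keep only a specific geographic region.
merged = orders.join(customers, on="customer_id", how="inner")
filtered = merged.filter(merged["region"] == "us-west")

# Save the transformed output to Amazon S3 for further processing.
filtered.write.mode("overwrite").parquet("s3://example-bucket/output/")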
The workflow consists of the following initial steps: OpenSearch Service is hosted in the primary Region, and all the active traffic is routed to the OpenSearch Service domain in the primary Region. We refer to this role as TheSnapshotRole in this post. For instructions, refer to the earlier section in this post.
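A minimal sketch of registering a manual snapshot repository on the domain using TheSnapshotRole, following the documented requests/requests_aws4auth pattern; the domain endpoint, bucket name, and account ID are placeholders:

import boto3
import requests
from requests_aws4auth import AWS4Auth

region = "us-east-1"  # primary Region (placeholder)
creds = boto3.Session().get_credentials()
awsauth = AWS4Auth(creds.access_key, creds.secret_key, region, "es",
                   session_token=creds.token)

# Register an S3 snapshot repository that assumes TheSnapshotRole.
url = "https://search-example-domain.us-east-1.es.amazonaws.com/_snapshot/my-repo"
payload = {
    "type": "s3",
    "settings": {
        "bucket": "example-snapshot-bucket",
        "region": region,
        "role_arn": "arn:aws:iam::111122223333:role/TheSnapshotRole",
    },
}
print(requests.put(url, auth=awsauth, json=payload).text)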
On your project, in the navigation pane, choose Data. For Add data source, choose Add connection. For Host, enter the host name of your Aurora PostgreSQL database cluster. The excerpt's truncated JDBC write snippet is reconstructed below.
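A sketch completing that fragment; df and connection_properties are assumed to exist from earlier steps, and the table name and credential keys are hypothetical:

jdbc_url = "jdbc:postgresql://{}:{}/{}".format(
    connection_properties["HOST"],
    connection_properties["PORT"],
    connection_properties["DATABASE"],
)

# Write the Spark DataFrame to the Aurora PostgreSQL cluster over JDBC.
(df.write.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "public.example_table")
    .option("user", connection_properties["USER"])
    .option("password", connection_properties["PASSWORD"])
    .mode("append")
    .save())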
Data analytics on operational data in near-real time is becoming a common need. Due to the exponential growth of data volume, it has become common practice to replace read replicas with data lakes for better scalability and performance. Apache Hudi connector for AWS Glue: For this post, we use AWS Glue 4.0.
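A hedged sketch of an upsert into a Hudi table from an AWS Glue 4.0 (Spark) job; the table name, key fields, and S3 paths are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://example-bucket/raw/orders/")

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
}

# Upsert the records into a Hudi table on Amazon S3.
(df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://example-bucket/hudi/orders/"))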
At the same time, they need to optimize operational costs to unlock the value of this data for timely insights, and do so with consistent performance. With this massive data growth, data proliferation across your data stores, data warehouse, and data lakes can become equally challenging.
Customers have been using data warehousing solutions to perform their traditional analytics tasks. Recently, data lakes have gained a lot of traction as the foundation for analytical solutions, because they come with benefits such as scalability, fault tolerance, and support for structured, semi-structured, and unstructured datasets.
Today’s modern data lakes span multiple accounts, AWS Regions, and lines of business in organizations. It’s important that their data solution gives them the ability to share and access data securely across Regions. For example, we are using a data lake administrator role called LF-Admin.
The Hive metastore is a repository of metadata about the SQL tables, such as database names, table names, schema, serialization and deserialization information, data location, and partition details of each table. Therefore, organizations have come to host huge volumes of metadata of their structured datasets in the Hive metastore.
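For illustration, the metadata the metastore tracks is what backs catalog queries like these in Spark SQL (the database and table names are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Database names, table schemas, data locations, and partition details
# are all resolved from the Hive metastore.
spark.sql("SHOW TABLES IN example_db").show()
spark.sql("DESCRIBE FORMATTED example_db.sales").show(truncate=False)
spark.sql("SHOW PARTITIONS example_db.sales").show()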
It also makes it easier for engineers, data scientists, product managers, analysts, and business users to access data throughout an organization to discover, use, and collaborate to derive data-driven insights. Note that a managed data asset is an asset for which Amazon DataZone can manage permissions.
Many organizations are building data lakes to store and analyze large volumes of structured, semi-structured, and unstructured data. In addition, many teams are moving towards a data mesh architecture, which requires them to expose their datasets as easily consumable data products.
This involves creating VPC endpoints in both the AWS and Snowflake VPCs, making sure data transfer remains within the AWS network. Use Amazon Route 53 to create a private hosted zone that resolves the Snowflake endpoint within your VPC. For Data sources, search for and select Snowflake. Choose Create connection. Choose Next.
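A hedged boto3 sketch of the private hosted zone step; the zone name, VPC ID, and Region are placeholders rather than values from the post:

import boto3

route53 = boto3.client("route53")

# Create a private hosted zone so the Snowflake endpoint resolves
# to the VPC endpoint instead of the public internet.
response = route53.create_hosted_zone(
    Name="example.snowflakecomputing.com",
    VPC={"VPCRegion": "us-east-1", "VPCId": "vpc-0123456789abcdef0"},
    CallerReference="snowflake-private-zone-001",  # must be unique per request
    HostedZoneConfig={"PrivateZone": True},
)
print(response["HostedZone"]["Id"])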
Data lakes have come a long way, and there’s been tremendous innovation in this space. Today’s modern data lakes are cloud native, work with multiple data types, and make this data easily available to diverse stakeholders across the business. In the navigation pane, under Data catalog, choose Settings.
Modern applications store massive amounts of data on Amazon Simple Storage Service (Amazon S3) data lakes, providing cost-effective and highly durable storage, and allowing you to run analytics and machine learning (ML) from your data lake to generate insights on your data.
Iceberg has become very popular for its support for ACID transactions in data lakes and features like schema and partition evolution, time travel, and rollback. For more details, refer to Creating Apache Iceberg tables. The customer wants to make product data accessible to analyst personas for interactive analysis using Athena.
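A minimal sketch of creating such an Iceberg table through Athena with boto3; the database, columns, and S3 locations are hypothetical:

import boto3

athena = boto3.client("athena")

ddl = """
CREATE TABLE example_db.products (
    product_id bigint,
    product_name string,
    price double
)
LOCATION 's3://example-bucket/iceberg/products/'
TBLPROPERTIES ('table_type' = 'ICEBERG')
"""

# Run the DDL; analysts can then query the table interactively in Athena.
athena.start_query_execution(
    QueryString=ddl,
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)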
It supports both data quality at rest and data quality in AWS Glue extract, transform, and load (ETL) pipelines. Data quality at rest focuses on validating the data stored in data lakes, databases, or data warehouses. It ensures that the data meets specific quality standards before it is consumed.
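A hedged sketch of a DQDL ruleset evaluated inside a Glue ETL job; the rule types follow DQDL, but the columns and thresholds are made up, and the exact transform call should be verified against your Glue version:

from awsgluedq.transforms import EvaluateDataQuality  # available inside Glue jobs

ruleset = """
Rules = [
    IsComplete "order_id",
    ColumnValues "price" > 0,
    RowCount > 1000
]
"""

# 'dyf' is assumed to be an existing Glue DynamicFrame from earlier steps.
results = EvaluateDataQuality().process_rows(
    frame=dyf,
    ruleset=ruleset,
    publishing_options={"dataQualityEvaluationContext": "example_context"},
)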
Data lakes are designed for storing vast amounts of raw, unstructured, or semi-structured data at a low cost, and organizations share those datasets across multiple departments and teams. The queries on these large datasets read vast amounts of data and can perform complex join operations on multiple datasets.
For instructions, refer to Creating and managing Amazon OpenSearch Service domains. For Host, enter events.PagerDuty.com. For more information about anomaly detection and alerts in OpenSearch Service, refer to Anomaly Detection in Amazon OpenSearch and Configuring Alerts in Amazon OpenSearch. For Method, choose POST.
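The same custom-webhook destination can be created through the alerting REST API instead of the console; this sketch assumes basic-auth credentials, a hypothetical domain endpoint, and a hypothetical webhook path:

import requests

destination = {
    "name": "pagerduty-destination",
    "type": "custom_webhook",
    "custom_webhook": {
        "scheme": "HTTPS",
        "host": "events.PagerDuty.com",
        "port": 443,
        "path": "/v2/enqueue",  # hypothetical path
        "method": "POST",
    },
}

response = requests.post(
    "https://search-example-domain.us-east-1.es.amazonaws.com/_plugins/_alerting/destinations",
    json=destination,
    auth=("master_user", "master_password"),  # placeholder credentials
)
print(response.json())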
Set up EMR Studio: In this step, we demonstrate the actions needed from the data lake administrator to set up EMR Studio enabled for trusted identity propagation and with IAM Identity Center integration. If Lake Formation is not already enabled, refer to Getting started with Lake Formation.
Amazon Redshift is a fast, scalable, and fully managed cloud data warehouse that allows you to process and run your complex SQL analytics workloads on structured and semi-structured data. It also helps you securely access your data in operational databases, data lakes, or third-party datasets with minimal movement or copying of data.
For additional details, refer to Automated snapshots and Manual snapshots. Amazon Redshift integrates with AWS Backup to help you centralize and automate data protection across all your AWS services, in the cloud, and on premises. This can result in recovery times of 10–60 minutes.
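A hedged boto3 sketch of taking a manual snapshot alongside the automated ones; the identifiers and retention period are placeholders:

import boto3

redshift = boto3.client("redshift")

# Take a manual snapshot of the cluster with a 7-day retention period.
redshift.create_cluster_snapshot(
    SnapshotIdentifier="example-manual-snapshot",
    ClusterIdentifier="example-cluster",
    ManualSnapshotRetentionPeriod=7,
)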
Refer to How do I set up a NAT gateway for a private subnet in Amazon VPC? For more information, refer to Prerequisites. For more information, refer to Storing database credentials in AWS Secrets Manager. For instructions to set up AWS Cloud9, refer to Getting started: basic tutorials for AWS Cloud9. manylinux2014_x86_64.whl
Over the past decade, deep learning arose from a seismic collision of data availability and sheer compute power, enabling a host of impressive AI capabilities. Slate refers to a family of encoder-only (RoBERTa-based) models, which while not generative, are fast and effective for many enterprise NLP tasks. All watsonx.ai
Building data lakes from continuously changing transactional data of databases and keeping data lakes up to date is a complex task and can be an operational challenge. You can then apply transformations and store data in Delta format for managing inserts, updates, and deletes (this example uses Apache Spark version 3.3.0).
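A hedged PySpark sketch of applying such changes with a Delta Lake MERGE; the paths, key column, and change-type column ("op") are hypothetical:

from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

target = DeltaTable.forPath(spark, "s3://example-bucket/delta/customers/")
changes = spark.read.parquet("s3://example-bucket/cdc/customers/")

# Apply inserts, updates, and deletes from the change stream in one pass.
(target.alias("t")
    .merge(changes.alias("c"), "t.customer_id = c.customer_id")
    .whenMatchedDelete(condition="c.op = 'D'")
    .whenMatchedUpdateAll(condition="c.op = 'U'")
    .whenNotMatchedInsertAll()
    .execute())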
In scenarios where a pipeline ingests data from multiple topics, the topic with the highest number of partitions should be used as a reference to configure the maximum OCUs. The pipeline's sink configuration:

sink:
  - opensearch:
      # Provide an AWS OpenSearch Service domain endpoint
      hosts: [ "[link]" ]
      region: "<region>"
      # Provide the MSK ARN.
      msk: "<arn>"
For more details, refer to Amazon QuickSight resource type reference. In this post, we show how to automate the deployment of a QuickSight analysis connecting to an Amazon Redshift data warehouse with a CloudFormation template. For instructions, refer to Authorizing connections from Amazon QuickSight to Amazon Redshift clusters.
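For comparison with the CloudFormation route, a hedged boto3 sketch of registering the Redshift data source directly; the account ID, endpoint, and credentials are placeholders:

import boto3

quicksight = boto3.client("quicksight")

quicksight.create_data_source(
    AwsAccountId="111122223333",
    DataSourceId="example-redshift-source",
    Name="Example Redshift source",
    Type="REDSHIFT",
    DataSourceParameters={
        "RedshiftParameters": {
            "Host": "example-cluster.abc123.us-east-1.redshift.amazonaws.com",
            "Port": 5439,
            "Database": "dev",
        }
    },
    Credentials={
        "CredentialPair": {"Username": "awsuser", "Password": "example-password"}
    },
)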
Amazon Redshift, a warehousing service, offers a variety of options for ingesting data from diverse sources into its high-performance, scalable environment. This native feature of Amazon Redshift uses massively parallel processing (MPP) to load objects directly from data sources into Redshift tables. The trailing broker list ('broker-1.example.com:9092,broker-2.example.com:9092') comes from a Kafka streaming-ingestion configuration.
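As a hedged sketch of the classic MPP load path, a COPY issued through the Redshift Data API; the cluster, table, S3 path, and IAM role are placeholders, and the Kafka broker list above belongs to a streaming setup not reconstructed here:

import boto3

client = boto3.client("redshift-data")

# COPY loads objects from Amazon S3 into a Redshift table in parallel.
client.execute_statement(
    ClusterIdentifier="example-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql=(
        "COPY sales FROM 's3://example-bucket/sales/' "
        "IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftCopyRole' "
        "FORMAT AS PARQUET;"
    ),
)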
Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications. This solution uses Amazon Aurora MySQL to host the example database salesdb.
Amazon Redshift is a popular cloud data warehouse, offering a fully managed cloud-based service that seamlessly integrates with an organization’s Amazon Simple Storage Service (Amazon S3) data lake, real-time streams, machine learning (ML) workflows, transactional workflows, and much more—all while providing up to 7.9x
For instructions, refer to Installing or updating the latest version of the AWS CLI. For instructions, refer to Configuration and credential file settings. Also, note that we are using data-team-a as a namespace and emr-job-execution-sa as a service account, which we created in the previous step.
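A hedged boto3 sketch of submitting an EMR on EKS job against that namespace and service account; the virtual cluster ID, role ARN, release label, and script path are placeholders:

import boto3

emr = boto3.client("emr-containers")

# The virtual cluster is registered to the data-team-a namespace, and the
# execution role is the one mapped to the emr-job-execution-sa service account.
emr.start_job_run(
    name="example-spark-job",
    virtualClusterId="abc123virtualclusterid",
    executionRoleArn="arn:aws:iam::111122223333:role/emr-job-execution-role",
    releaseLabel="emr-6.15.0-latest",
    jobDriver={
        "sparkSubmitJobDriver": {
            "entryPoint": "s3://example-bucket/scripts/job.py",
        }
    },
)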
The typical Cloudera Enterprise Data Hub Cluster starts with a few dozen nodes in the customer’s datacenter hosting a variety of distributed services. Over time, workloads start processing more data, tenants start onboarding more workloads, and administrators (admins) start onboarding more tenants. Cloudera Manager (CM) 6.2
The account on the right hosts the pseudonymization service, which you can deploy using the instructions provided in Part 1 of this series. For an overview of how to build an ACID-compliant data lake using Iceberg, refer to Build a high-performance, ACID compliant, evolving data lake using Apache Iceberg on Amazon EMR.
It is prudent to consolidate this data into a single customer view, serving as a primary reference for downstream applications, ranging from ecommerce platforms to CRM systems. This consolidated view acts as a liaison between the data platform and customer-centric applications.
Since its launch in 2006, Amazon Simple Storage Service (Amazon S3) has experienced major growth, supporting multiple use cases such as hosting websites, creating data lakes, serving as object storage for consumer applications, storing logs, and archiving data. This could be your data lake or application S3 bucket.
I’m referring not only to our technology partners, but also to our cloud partners that host the Denodo Platform. Denodo is a very partner-friendly company, and here I’d like to share some thoughts about how Denodo works with our partners.
Depending on your enterprise’s culture and goals, your migration pattern of a legacy multi-tenant data platform to Amazon Redshift could use one of the following strategies: Leapfrog strategy – In this strategy, you move to an AWS modern data architecture and migrate one tenant at a time.
Data as a product Treating data as a product entails three key components: the data itself, the metadata, and the associated code and infrastructure. In this approach, teams responsible for generating data are referred to as producers.
The data lakehouse is gaining in popularity because it enables a single platform for all your enterprise data with the flexibility to run any analytic and machine learning (ML) use case. Cloud data lakehouses provide significant scaling, agility, and cost advantages compared to cloud data lakes and cloud data warehouses.
At Stitch Fix, we have been powered by data science since our founding and rely on many modern data lake and data processing technologies. In our infrastructure, Apache Kafka has emerged as a powerful tool for managing event streams and facilitating real-time data processing.
Tens of thousands of customers use Amazon Redshift to gain business insights from their data. With Amazon Redshift, you can use standard SQL to query data across your data warehouse, operational data stores, and data lake. For more information, refer to Configuring ongoing data replication.
Start where your data is: Using your own enterprise data is the major differentiator from open-access gen AI chat tools, so it makes sense to start with the provider already hosting your enterprise data. Walker refers to “guided play sessions,” and users were encouraged to share what worked with their peers.
Top line revenue refers to the total value of sales of an organization’s services or products. The data from the S3 datalake is used for batch processing and analytics through Amazon EMR and Amazon Redshift. Operational dashboards are hosted on Grafana integrated with Druid.
“Flashpoint” (2018) – GDPR went into effect, plus major data blunders happened seemingly everywhere. Fun fact: I co-founded an e-commerce company (realistically, a mail-order catalog hosted online) in December 1992 using one of those internetworking applications called Gopher, which was vaguely popular at the time.