This post explores how to start using Delta Lake UniForm on Amazon Web Services (AWS). Note that the extra package (delta-iceberg) is required to create a UniForm table in the AWS Glue Data Catalog. Amazon S3 and AWS Glue Data Catalog: these are used to manage the underlying files and the catalog of the Delta Lake UniForm table.
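As a rough sketch of what this looks like in practice (bucket, database, and package versions are assumptions, and the AWS Glue Data Catalog is presumed to be configured as the Spark metastore), a UniForm table can be created from PySpark with the delta-iceberg package on the classpath:

```python
from pyspark.sql import SparkSession

# delta-spark and delta-iceberg must be on the classpath, e.g.
# spark-submit --packages io.delta:delta-spark_2.12:3.2.0,io.delta:delta-iceberg_2.12:3.2.0
spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# 'uniform_db' and the S3 path are hypothetical; the table properties ask
# Delta to also write Iceberg-readable metadata (UniForm).
spark.sql("""
    CREATE TABLE uniform_db.demo_table (id INT, name STRING)
    USING DELTA
    LOCATION 's3://my-demo-bucket/uniform/demo_table/'
    TBLPROPERTIES (
      'delta.enableIcebergCompatV2' = 'true',
      'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")
```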
To implement this solution, complete the following steps: set up a zero-ETL integration from the AWS Management Console for Amazon Relational Database Service (Amazon RDS). You will need an AWS Identity and Access Management (IAM) user with sufficient permissions to interact with the AWS Management Console and related AWS services.
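As a hedged illustration (all ARNs and names below are hypothetical), the same zero-ETL integration can also be created programmatically with boto3, assuming a recent SDK that includes the RDS create_integration API:

```python
import boto3

rds = boto3.client("rds")
# Source database and Redshift target ARNs are placeholders.
rds.create_integration(
    IntegrationName="orders-zero-etl",
    SourceArn="arn:aws:rds:us-east-1:123456789012:db:orders-db",
    TargetArn="arn:aws:redshift-serverless:us-east-1:123456789012:namespace/ns-uuid",
)
```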
Enterprise data is brought into data lakes and data warehouses to carry out analytical, reporting, and data science use cases using AWS analytical services like Amazon Athena, Amazon Redshift, Amazon EMR, and so on. These examples use synthetic datasets created in AWS Glue and Amazon S3. Table metadata is fetched from AWS Glue.
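For instance, fetching table metadata from AWS Glue is a single boto3 call; the database and table names here are hypothetical stand-ins for the synthetic datasets:

```python
import boto3

glue = boto3.client("glue")
resp = glue.get_table(DatabaseName="synthetic_db", Name="orders")  # hypothetical names
sd = resp["Table"]["StorageDescriptor"]
print(sd["Location"])                      # S3 path backing the table
print([c["Name"] for c in sd["Columns"]])  # column names from the catalog
```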
Cloud certifications, specifically in AWS and Microsoft Azure, were most strongly associated with salary increases. 64% of the respondents took part in training or obtained certifications in the past year, and 31% reported spending over 100 hours in training programs, ranging from formal graduate degrees to reading blog posts.
In this post, we dive into the newly released Amazon Redshift Data API support for single sign-on (SSO), Amazon Redshift RBAC for row-level security (RLS) and column-level security (CLS), and trusted identity propagation with AWS IAM Identity Center to let corporate identities connect to AWS services securely.
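As a minimal sketch of what Redshift RLS looks like (the workgroup, table, and role names are hypothetical), the policy DDL can be issued through the Data API:

```python
import boto3

rsd = boto3.client("redshift-data")
rsd.batch_execute_statement(
    WorkgroupName="demo-wg",   # hypothetical serverless workgroup
    Database="dev",
    Sqls=[
        # Only show rows whose region column matches the current user.
        """CREATE RLS POLICY region_policy
           WITH (region VARCHAR(20))
           USING (region = current_user)""",
        "ATTACH RLS POLICY region_policy ON sales TO ROLE analyst",
        "ALTER TABLE sales ROW LEVEL SECURITY ON",
    ],
)
```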
The performance data you can use on the Amazon Redshift console falls into two categories: Amazon CloudWatch metrics – Help you monitor the physical aspects of your cluster or serverless workgroup, such as resource utilization, latency, and throughput. Ekta Ahuja is an Amazon Redshift Specialist Solutions Architect at AWS.
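For example, a hedged boto3 snippet for pulling one such CloudWatch metric (the cluster identifier is hypothetical):

```python
from datetime import datetime, timedelta
import boto3

cw = boto3.client("cloudwatch")
stats = cw.get_metric_statistics(
    Namespace="AWS/Redshift",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "ClusterIdentifier", "Value": "demo-cluster"}],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,               # 5-minute buckets
    Statistics=["Average"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 1), "%")
```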
Read the complete blog below for a more detailed description of the vendors and their capabilities. Because it is such a new category, both overly narrow and overly broad definitions of DataOps abound. AWS CodeDeploy. AWS CodePipeline. Download the 2021 DataOps Vendor Landscape here. DataOps is a hot topic in 2021.
In this post, we showcase how to use AWS Glue with AWS Glue Data Quality, sensitive data detection transforms, and AWS Lake Formation tag-based access control to automate data governance. We use AWS CloudFormation to provision the resources. Done manually, this gets tedious and delays data adoption across the enterprise.
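As a small illustrative sketch (the tag key, values, and database name are assumptions), LF-tags for tag-based access control can be created and attached with boto3:

```python
import boto3

lf = boto3.client("lakeformation")
# Define a classification tag with hypothetical levels.
lf.create_lf_tag(TagKey="classification",
                 TagValues=["public", "internal", "restricted"])

# Attach the tag to a Glue database so tag-based policies apply to it.
lf.add_lf_tags_to_resource(
    Resource={"Database": {"Name": "sales_db"}},   # hypothetical database
    LFTags=[{"TagKey": "classification", "TagValues": ["internal"]}],
)
```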
In 2022, we announced that you can enforce fine-grained access control policies using AWS Lake Formation and query data stored in any supported file format using table formats such as Apache Iceberg, Apache Hudi, and more using Amazon Athena queries. An AWS Glue crawler runs on top of the S3 buckets to automatically detect the schema.
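A minimal sketch of that crawler setup with boto3, assuming hypothetical bucket, role, and database names:

```python
import boto3

glue = boto3.client("glue")
glue.create_crawler(
    Name="raw-data-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="raw_db",
    Targets={"S3Targets": [{"Path": "s3://my-raw-bucket/data/"}]},
)
glue.start_crawler(Name="raw-data-crawler")  # populates the Data Catalog
```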
Amazon AppFlow, a fully managed data integration service, has been at the forefront of streamlining data transfer between AWS services, software as a service (SaaS) applications, and now Google BigQuery. Next, provide AWS Glue Data Catalog settings to create a table for further analysis. Choose Create bucket.
Redshift Spectrum uses the AWS Glue Data Catalog as a Hive metastore. AWS Lake Formation offers a straightforward and centralized approach to access management for S3 data sources; Lake Formation uses the AWS Glue Data Catalog to provide access control for Amazon S3. The architecture uses a Lake Formation interface endpoint and an Amazon S3 gateway endpoint.
Many AWS customers adopted Apache Hudi on their data lakes built on top of Amazon S3 using AWS Glue , a serverless data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development.
AWS Deep Learning Containers now support TensorFlow 2.0. AWS Deep Learning Containers are Docker images that are preconfigured for deep learning tasks. Build a custom classifier using Amazon Comprehend: Amazon Comprehend is a natural language processing (NLP) service. This blog post reads more like a step-by-step tutorial.
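As a hedged sketch of the Comprehend custom classifier flow (the S3 URI and IAM role are hypothetical; the training data is the label,text CSV format Comprehend expects for custom classification):

```python
import boto3

comprehend = boto3.client("comprehend")
comprehend.create_document_classifier(
    DocumentClassifierName="support-ticket-classifier",
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendDataAccess",
    InputDataConfig={"S3Uri": "s3://my-training-bucket/tickets.csv"},
    LanguageCode="en",
)
```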
It is considered a “complex to license and expensive tool” that often overlaps with other products in this category. AWS Data Pipeline: AWS Data Pipeline can be used to schedule regular processing activities such as SQL transforms, custom scripts, MapReduce applications, and distributed data copies.
Apache Iceberg integration is supported by AWS analytics services including Amazon EMR, Amazon Athena, and AWS Glue. In early 2022, AWS announced general availability of Athena ACID transactions, powered by Apache Iceberg. AWS Glue 3.0 and later also support Iceberg natively. The following is an example Iceberg catalog configuration with the AWS Glue implementation.
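A minimal PySpark sketch of such a catalog, using the standard Iceberg-on-Glue settings; the catalog name, warehouse path, and table are assumptions:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Register a Spark catalog named 'glue_catalog' backed by AWS Glue.
    .config("spark.sql.catalog.glue_catalog",
            "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.warehouse",
            "s3://my-bucket/iceberg-warehouse/")
    .config("spark.sql.catalog.glue_catalog.io-impl",
            "org.apache.iceberg.aws.s3.S3FileIO")
    .getOrCreate()
)

spark.sql(
    "CREATE TABLE glue_catalog.demo_db.events (id bigint, ts timestamp) USING iceberg"
)
```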
With OCSF support, the service can normalize and combine security data from AWS and a broad range of enterprise security data sources. We also walk you through how to use a series of prebuilt visualizations to view events across multiple AWS data sources provided by Security Lake.
You can also use the list-recommendations command in the AWS Command Line Interface (AWS CLI) to invoke the Advisor recommendations from the command line and automate the workflow through scripts. For Stack name, enter a name for the stack, for example, blog-redshift-advisor-recommendations.
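The equivalent call is also available in the SDK; a hedged boto3 sketch (the cluster identifier is hypothetical, and a boto3 version that includes the ListRecommendations API is assumed):

```python
import boto3

redshift = boto3.client("redshift")
resp = redshift.list_recommendations(ClusterIdentifier="demo-cluster")
for rec in resp["Recommendations"]:
    # Field access via .get since exact response shape may vary by SDK version.
    print(rec.get("RecommendationType"), "-", rec.get("Description"))
```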
AWS Glue is a serverless data integration service that makes it straightforward to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development. Furthermore, each node (driver or worker) in an AWS Glue job requires an IP address assigned from the subnet.
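A back-of-the-envelope sizing check makes the implication concrete (the worker counts are hypothetical; the only hard facts used are that each Glue node consumes one IP address and that AWS reserves five addresses per subnet):

```python
def usable_ips(prefix_len: int) -> int:
    # AWS reserves 5 addresses in every subnet.
    return 2 ** (32 - prefix_len) - 5

num_workers = 100       # hypothetical job size, driver included
concurrent_jobs = 3
needed = num_workers * concurrent_jobs

for prefix in (26, 24, 22):
    ok = "OK" if usable_ips(prefix) >= needed else "too small"
    print(f"/{prefix}: {usable_ips(prefix)} usable IPs -> {ok} for {needed} nodes")
```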
AWS Glue interactive sessions allow you to run interactive AWS Glue workloads on demand, which enables rapid development by issuing blocks of code on a cluster and getting prompt results. This feature existed for AWS Glue jobs and is now available for interactive sessions.
You can find more information in this release announcement blog post and in this technical deep dive blog post. Functions as a Service (FaaS) is a category of cloud computing services that all major cloud providers offer (AWS Lambda, Azure Functions, Google Cloud Functions, etc.).
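A minimal FaaS example in Python for AWS Lambda: the platform invokes the handler once per event, with no server to manage. The greeting logic is just a placeholder:

```python
# handler.py — a minimal AWS Lambda function.
import json

def handler(event, context):
    # 'name' is a hypothetical event field for illustration.
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }
```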
AWS Lake Formation is a fully managed service that simplifies building, securing, and managing data lakes. In this post, we share the solution using Amazon Redshift role-based access control (RBAC) and AWS Lake Formation tag-based access control for federated users to query your data lake using Amazon Redshift Spectrum.
This blog post provides a step-by-step guide for building a multimodal search solution using OpenSearch Service. This direct integration eliminates the need for an additional component (for example, AWS Lambda ) to facilitate the exchange between the two services. Select Map and confirm the user or role shows up under Mapped users.
The company uses AWS Cloud services to build data-driven products and scale engineering best practices. It’s worth mentioning that they also have other product and tech teams, including operational and business teams, without AWS accounts. Each team owns between two and 10 AWS accounts, depending on what it owns.
Still, to make your data lake workloads highly available in an unlikely outage situation, you can replicate your S3 data to another AWS Region as a backup. The following examples are also available in the sample notebook in the aws-samples GitHub repo for quick experimentation.
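A hedged boto3 sketch of enabling such cross-Region replication (bucket names, role ARN, and rule ID are hypothetical; both buckets must already have versioning enabled):

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_replication(
    Bucket="my-datalake",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-crr-role",
        "Rules": [
            {
                "ID": "dr-copy",
                "Status": "Enabled",
                "Priority": 1,
                "Filter": {"Prefix": ""},                      # replicate everything
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": "arn:aws:s3:::my-datalake-dr"},
            }
        ],
    },
)
```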
In our previous blog post, we talked about the four paths to Cloudera Data Platform. If you haven’t read it yet, we invite you to take a moment and run through the scenarios in that post. As we touched on there, the decision to upgrade or migrate may seem difficult to evaluate at first glance.
At Cloudera, supporting our customers through their complete data journey also means providing access to game-changing technologies with trusted partners like Amazon Web Services (AWS). Cloudera and AWS: Harnessing the Power of Data and Cloud. Customer use cases can be grouped into three categories.
AWS offers a broad selection of managed real-time data streaming services to effortlessly run these workloads at any scale. Experiencing business hyper-growth, Nexthink migrated to AWS to overcome the scaling limitations of on-premises solutions. Simone Pomata is Senior Solutions Architect at AWS.
Data is often stored in data lakes managed by AWS Lake Formation , enabling you to apply fine-grained access control through a simple grant or revoke mechanism. The jobs on the EMR cluster will use this runtime role to access AWS resources. Raw data is stored in S3 buckets and catalogued in the AWS Glue Data Catalog.
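As an illustrative sketch (the cluster ID, role ARN, and script path are hypothetical, and the cluster must be configured for runtime roles), a step can be submitted with an execution role via boto3:

```python
import boto3

emr = boto3.client("emr")
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",
    Steps=[
        {
            "Name": "spark-etl",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://my-bucket/jobs/etl.py"],
            },
        }
    ],
    # The runtime role the step assumes when calling AWS services.
    ExecutionRoleArn="arn:aws:iam::123456789012:role/EmrStepRuntimeRole",
)
```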
Using Snowpipe for data ingestion to AWS. Snowpipe data ingestion might be too slow for three categories of use case: real-time personalization, operational analytics, and security. AWS Glue to Snowflake ingestion. AWS Glue provides a fully managed environment that integrates easily with Snowflake’s data warehouse as a service.
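For context, a Snowpipe is a small piece of DDL; in this hedged sketch (account, table, and stage names are hypothetical), AUTO_INGEST lets S3 event notifications trigger loads as new files land:

```python
import snowflake.connector

conn = snowflake.connector.connect(
    account="myorg-myaccount", user="loader", password="***", warehouse="LOAD_WH"
)
conn.cursor().execute("""
    CREATE PIPE raw.public.events_pipe AUTO_INGEST = TRUE AS
      COPY INTO raw.public.events
      FROM @raw.public.events_stage
      FILE_FORMAT = (TYPE = 'JSON')
""")
```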
In this blog, I’ll talk about the data catalog and data intelligence markets, and the future for Alation. While we’re widely credited with driving the creation of the data catalog category, Alation isn’t just a data catalog company. We’re excited to continue to innovate and lead the data intelligence category for years to come!
Backups fall into three categories. The first is automatic cloud platform backups, created using tools from the CSP platforms, such as AWS S3 Cross-Region Replication (CRR). The second is manual backups, created at will by the user; these can be packaged as `zip` archives in AWS, making them portable and easy to transfer between different systems and environments. The third category is database backups.
The table contains different categories of columns. Keys: it contains two types of keys, of which customer_sk is the primary key of the table. Use an AWS Glue crawler to parse the data files and register the tables in the AWS Glue Data Catalog, then create an external schema in Amazon Redshift that points to the AWS Glue database containing these tables.
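A hedged sketch of that last step through the Redshift Data API (cluster, database, Glue database, and IAM role names are hypothetical):

```python
import boto3

rsd = boto3.client("redshift-data")
rsd.execute_statement(
    ClusterIdentifier="demo-cluster",
    Database="dev",
    DbUser="awsuser",
    # Map a Glue database into Redshift as an external schema for Spectrum.
    Sql="""CREATE EXTERNAL SCHEMA IF NOT EXISTS ext_sales
           FROM DATA CATALOG
           DATABASE 'sales_glue_db'
           IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'""",
)
```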
In this blog post, we will highlight how ZS Associates used multiple AWS services to build a highly scalable, highly performant clinical document search platform. We developed and host several applications for our customers on Amazon Web Services (AWS). The author is a Specialist Solutions Architect (Data) working with AWS India public sector customers.
In this blog post, we demonstrate how you can use DJL within Kinesis Data Analytics for Apache Flink for real-time machine learning inference. We provide sample code, architecture diagrams, and an AWS CloudFormation template so you can follow along and employ ResNet-18 as your classifier to make real-time predictions.
Sirius continues to build on its strategic partnership with Amazon Web Services (AWS) by tirelessly working to elevate our technical expertise and our ability to help businesses succeed. Because of this, we are proud to announce that we have successfully achieved AWS Storage Competency status.
To address that, on this blog I've shared something I call the ladders of awesomeness – my view of what the entire evolutionary path looks like. It is very hard to capture an entire keynote, and a lifetime of bruises that the wisdom above reflects, in a simple blog post. Why do I say irritating? I love this report.
In this blog we will take you through a persona-based data adventure, with short demos attached, to show you how the A-to-Z data worker workflow is expedited and made easier through self-service, seamless integration, and cloud-native technologies.
If you’d like to know more background about how we use Kafka at Stitch Fix, please refer to our previously published blog post, Putting the Power of Kafka into the Hands of Data Scientists. We have been using separate production and staging VPCs since we initially started using AWS.
For example, you might have a requirement where one Product Category needs different Threshold values (e.g., a KPI Configuration Table that stores the upper and lower Yellow values for each Product Category over a defined set of periods). Use the AW-DimProductCategoryKPI.sql table creation script as a sample.
In the first blog of the Universal Data Distribution blog series, we discussed the emerging need within enterprise organizations to take control of their data flows. In this second installment of the Universal Data Distribution blog series, we will discuss a few different data distribution use cases and deep dive into one of them.
If you fall into the "Analyst unwilling to do the hard work" category, I'm afraid I can't help you. If you fall into the "Analyst really wanting to do the hard work but does not have the connection to Superiors, or other teams, and looking for any way out to identify business purpose" category. Aw, come on!
That’s why when it was announced that Alation achieved Amazon Web Services (AWS) Data and Analytics Competency in the data governance and security category, we were not only honored to receive this coveted designation, but we were also proud that it confirms the synergy — and customer benefits — of our AWS partnership.
The role of AWS and cloud security in life sciences: with greater power comes great responsibility. Most life sciences companies are raising their security posture with AWS infrastructure and services. Organizations like Moderna and Bristol Myers Squibb have chosen AWS to run their regulated workloads.
For example, an AI system could be trained to classify emails into categories like “sensitive” or “restricted” based on patterns it has learned from a training dataset. It could be further trained to classify data into important categories such as PII, PHI, and PCI, increasing efficiency in data classification and, ultimately, security.
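As a toy illustration of the idea (labels and examples are invented, and a real system would train on thousands of documents), a few lines of scikit-learn express the same classify-by-learned-patterns approach:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny hypothetical training set mapping text snippets to data classes.
texts = [
    "patient diagnosis and treatment history attached",
    "credit card number for the quarterly invoice",
    "team lunch scheduled for friday",
    "employee social security numbers for payroll",
]
labels = ["PHI", "PCI", "public", "PII"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)

# Classify an unseen snippet; output is the learned category, e.g. ['PCI'].
print(clf.predict(["please update the cardholder billing details"]))
```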