Over the years, organizations have invested in creating purpose-built, cloud-based data lakes that are siloed from one another. A major challenge is enabling cross-organization discovery of and access to data across these multiple data lakes, each built on a different technology stack.
Use cases for Hive metastore federation for Amazon EMR: Hive metastore federation for Amazon EMR is applicable to the following use cases: Governance of Amazon EMR-based data lakes – Producers generate data within their AWS accounts using an Amazon EMR-based data lake supported by EMRFS on Amazon Simple Storage Service (Amazon S3) and HBase.
For each service, you need to learn the supported authorization and authentication methods, data access APIs, and the framework to onboard and test data sources. This approach simplifies your data journey and helps you meet your security requirements. Choose the created IAM role.
This solution also allows you to update certain fields of the account object in the data lake and push them back to Salesforce. To achieve this, you create two ETL jobs using AWS Glue with the Salesforce connector, and create a transactional data lake on Amazon S3 using Apache Iceberg.
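As an illustration of the Iceberg half of that pattern, here is a minimal PySpark sketch of writing account records into an Iceberg table on Amazon S3. It is not the post's actual Glue job: the catalog name, database, table, and bucket are hypothetical placeholders, and the DataFrame is an in-memory stand-in for what the Salesforce connector would produce.

```python
# Minimal, illustrative sketch (not the post's exact job): write account records
# into an Apache Iceberg table on Amazon S3.
# Catalog, database, table, and bucket names are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("salesforce-accounts-to-iceberg")
    # Register an Iceberg catalog backed by the AWS Glue Data Catalog (assumed name).
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://example-bucket/warehouse/")
    .getOrCreate()
)

# In the real job this DataFrame would come from the Salesforce connector;
# here it is just an in-memory stand-in.
accounts_df = spark.createDataFrame(
    [("001xx0000001", "Acme Corp", "Active")],
    ["account_id", "account_name", "status"],
)

# Create or replace the Iceberg table with the account records. Iceberg's ACID
# guarantees are what let a second job later update individual account rows
# and push the changes back to Salesforce.
accounts_df.writeTo("glue_catalog.sales_db.salesforce_accounts").createOrReplace()
```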
Due to these limitations, the application should not be used for arbitrary tests. In this post, we provide instructions on how to deploy a sample API application integrated with Lake Formation that implements the solution architecture. We also show how to test the function with Lambda tests.
These business units have varying landscapes, where a data lake is managed on Amazon Simple Storage Service (Amazon S3) and analytics workloads run on Amazon Redshift, a fast, scalable, and fully managed cloud data warehouse that allows you to process and run your complex SQL analytics workloads on structured and semi-structured data.
Use Lake Formation to grant permissions to users to access data. Test the solution by accessing data with a corporate identity. Audit user data access. On the Lake Formation console, choose Data lake permissions under Permissions in the navigation pane. Select Named Data Catalog resources.
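For readers who prefer the API over the console steps above, a minimal boto3 sketch of the same named-resource grant follows. The principal ARN, database, and table names are hypothetical placeholders.

```python
# A minimal boto3 sketch of the same grant performed in the console above.
# The principal ARN, database, and table names are hypothetical placeholders.
import boto3

lf = boto3.client("lakeformation")

lf.grant_permissions(
    Principal={
        # The corporate identity (IAM role) that should be able to query the data.
        "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/corporate-analyst"
    },
    Resource={
        "Table": {
            "DatabaseName": "example_db",   # named Data Catalog resource (assumed name)
            "Name": "example_table",
        }
    },
    Permissions=["SELECT"],
)
```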
You might be modernizing your data architecture using Amazon Redshift to enable access to your data lake and data in your data warehouse, and are looking for a centralized and scalable way to define and manage data access based on IdP identities. For IAM role, choose a Lake Formation user-defined role.
In recent years, data lakes have become a mainstream architecture, and data quality validation is a critical factor in improving the reusability and consistency of the data. In this post, we provide benchmark results of running increasingly complex data quality rulesets over a predefined test dataset.
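To make "data quality ruleset" concrete, here is an illustrative sketch of registering a small AWS Glue Data Quality ruleset (written in DQDL) against a Data Catalog table with boto3. The database, table, and rule columns are placeholders, not the post's benchmark ruleset.

```python
# Illustrative sketch only: register a small Data Quality Definition Language (DQDL)
# ruleset against a hypothetical Data Catalog table using boto3.
# Database, table, and rule columns are placeholders, not the benchmark ruleset.
import boto3

glue = boto3.client("glue")

glue.create_data_quality_ruleset(
    Name="orders_basic_checks",
    Ruleset="""
        Rules = [
            IsComplete "order_id",
            IsUnique "order_id",
            ColumnValues "order_total" >= 0
        ]
    """,
    TargetTable={"DatabaseName": "sales_db", "TableName": "orders"},
)
```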
Modern applications store massive amounts of data on Amazon Simple Storage Service (Amazon S3) data lakes, which provide cost-effective and highly durable storage and allow you to run analytics and machine learning (ML) from your data lake to generate insights on your data.
Many customers need an ACID (atomic, consistent, isolated, durable) transaction data lake that can log change data capture (CDC) from operational data sources. There is also demand for merging real-time data into batch data. The Delta Lake framework provides both capabilities. Choose Create policy.
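A minimal sketch of how Delta Lake's MERGE supports the CDC-style upserts described above follows. The S3 path, table layout, and column names are hypothetical placeholders, and the CDC DataFrame is an in-memory stand-in for records captured from an operational source.

```python
# A minimal sketch of Delta Lake's MERGE used for CDC-style upserts into an S3 data lake.
# The S3 path and column names are hypothetical placeholders.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cdc-merge")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Latest CDC records captured from the operational source (stand-in data).
cdc_updates = spark.createDataFrame(
    [(1, "alice@example.com", "U"), (2, "bob@example.com", "I")],
    ["customer_id", "email", "op"],
)

target = DeltaTable.forPath(spark, "s3://example-bucket/delta/customers/")

# MERGE provides the atomic upsert behavior: matched rows are updated and new
# rows are inserted within a single ACID transaction on the Delta table.
(target.alias("t")
    .merge(cdc_updates.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```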
Upload the job definition (sample_oozie_job_name/step1/step1.json, which begins with { "name": "step1.q", … }) to DynamoDB (for more information, refer to Write data to a table using the console or AWS CLI).
Copy us_current.csv: aws s3 cp ./sample_data/us_current.csv s3://$s3_bucket_name/covid-19-testing-data/base/source_us_current/
Copy states_current.csv: aws s3 cp ./sample_data/states_current.csv …
AWS Data Pipeline helps customers automate the movement and transformation of data. With Data Pipeline, customers can define data-driven workflows, so that tasks can be dependent on the successful completion of previous tasks. You can visually create, run, and monitor ETL pipelines to load data into your data lakes.
Founded in 2012, SumUp is the financial partner for more than 4 million small merchants in over 35 markets worldwide, helping them start, run and grow their business. Unless, of course, the rest of their data also resides in the Google Cloud. The Data Science teams also use this data for churn prediction and CLTV modeling.
Ramesh Raghupathy is a Senior Data Architect with WWCO ProServe at AWS. He has been in the data and analytics field for over 14 years, with a background in data warehouse and data lake architecture, development, and administration. While not at work, Ramesh enjoys traveling, spending time with family, and yoga.
For sales across multiple markets, product sales data such as orders, transactions, and shipment data is available on Amazon S3 in the data lake. The data engineering team can use Apache Spark with Amazon EMR or AWS Glue to analyze this data in Amazon S3, for example from a Hive-enabled Spark session created with enableHiveSupport().getOrCreate() (see the sketch below).
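The excerpt's trailing enableHiveSupport().getOrCreate() fragment fits into a session builder like the one below. This is a hedged completion, not the post's exact code; the database and table names in the query are hypothetical placeholders.

```python
# Completes the excerpt's trailing fragment: build a Hive-enabled Spark session
# and query sales data already cataloged over Amazon S3.
# The database and table names are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("sales-analysis")
    .enableHiveSupport()   # the fragment from the excerpt, shown in context
    .getOrCreate()
)

orders = spark.sql(
    "SELECT market, SUM(order_total) AS revenue "
    "FROM sales_db.orders GROUP BY market"
)
orders.show()
```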
Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications. We deploy the Debezium MySQL source Kafka connector on Amazon MSK Connect.
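For orientation, here is an illustrative set of Debezium MySQL source connector properties of the kind you would supply as the connector configuration in MSK Connect. The hostname, credentials, and names are placeholders, and the exact property names depend on the Debezium version in use.

```python
# Illustrative only: typical Debezium MySQL source connector properties supplied
# as the connector configuration in MSK Connect. Hostname, credentials, and names
# are placeholders; exact property names depend on the Debezium version.
debezium_mysql_config = {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "tasks.max": "1",
    "database.hostname": "mysql.example.internal",
    "database.port": "3306",
    "database.user": "cdc_user",
    "database.password": "REPLACE_ME",
    "database.server.id": "184054",          # unique numeric ID for this CDC client
    "topic.prefix": "salesdb",               # prefix for the CDC topics
    "table.include.list": "salesdb.orders",  # tables to capture
}
```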
Test the application: Let's invoke the application you have created to seamlessly sign in to QuickSight using the following URL. Vamsi Bhadriraju is a Data Architect at AWS. He works closely with enterprise customers to build data lakes and analytical applications on the AWS Cloud.
2012: Amazon Redshift, the first cloud-based data warehouse service of its kind, comes into existence. Google launches BigQuery, its own data warehousing tool, and Microsoft introduces Azure SQL Data Warehouse and Azure Data Lake Store. Data lakes or data lakehouses alone cannot solve the efficiency problem.
Somehow, the gravity of the data has a geological effect that forms data lakes. Also, data science workflows begin to create feedback loops from the big data side of the illo above over to the DW side. DG emerges for the big data side of the world, e.g., the Alation launch in 2012.
Once upon a time, circa 2012-ish, data science conferences were replete with talks about an industry hellbent on loading amazingly enormous Big Data into some kind of data lake, and applying all kinds of odd astrophysics-ish approaches…for eventual PROFIT! Or something. “Nothing Spreads Like Fear.” “No big deal.”
And so I actually transitioned out of that group and into the Big Data Appliance group at Oracle, but soon realized that if that was what I wanted to keep doing, this up-and-coming company called Cloudera might be a better place to do it, since these new technologies weren't just a hobby at Cloudera. As you mentioned, Qlik is in there.
Solution overview: One of the common functionalities involved in data pipelines is extracting data from multiple data sources and exporting it to a data lake or synchronizing the data to another database. Choose the workflow named ETL_Process. Run the workflow with default input.
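The post runs the workflow from the console; a minimal boto3 equivalent is sketched below, using the ETL_Process workflow name from the excerpt.

```python
# A minimal boto3 equivalent of "Run the workflow with default input" from the console.
import boto3

glue = boto3.client("glue")

# Start the Glue workflow named in the post; the returned run ID can be used
# to monitor the run's progress.
run = glue.start_workflow_run(Name="ETL_Process")
print(run["RunId"])
```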
Without the DATA LOCATION permission, write workloads will fail. Test the access to the table by writing new records to the table as the IAM role. Add SELECT table permissions to the Data-Analyst role in Lake Formation. Test access to the table as the Data-Analyst by running SELECT queries in Athena.
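A minimal sketch of the Athena test described above follows: running a SELECT query against the Lake Formation-governed table while assuming the Data-Analyst role. The database, table, and results bucket are hypothetical placeholders.

```python
# A minimal sketch of the Athena access test: as the Data-Analyst role, run a
# SELECT query against the Lake Formation-governed table.
# Database, table, and the results bucket are hypothetical placeholders.
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="SELECT * FROM example_table LIMIT 10",
    QueryExecutionContext={"Database": "example_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print(response["QueryExecutionId"])
```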
The mega-vendor era: By 2020, the basis of competition for what are now referred to as mega-vendors was interoperability, automation, intra-ecosystem participation, and unlocking access to data to drive business capabilities and value and manage risk, along with edge compute data distribution that connects broad, deep PLM ecosystems.
When extracting data filter rules for a table in another account, the execution role must have the necessary access to the tables in the other account. Use case overview: For this post, let's consider a large financial institution that has implemented Lake Formation as its central data lake and entitlement management system.