A data lake is a centralized repository that you can use to store all your structured and unstructured data at any scale. You can store your data as-is, without having to structure it first, and run different types of analytics for better business insights. Open AWS Glue Studio. Choose ETL Jobs.
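The console steps above have a programmatic equivalent; as a minimal sketch (region is a placeholder), the following lists existing Glue ETL jobs with boto3 instead of browsing them in the Studio UI.

```python
import boto3

# List Glue ETL jobs programmatically, as an alternative to the
# AWS Glue Studio console. The region is a placeholder.
glue = boto3.client("glue", region_name="us-east-1")

paginator = glue.get_paginator("get_jobs")
for page in paginator.paginate():
    for job in page["Jobs"]:
        # Print each job's name and its command type (e.g., glueetl).
        print(job["Name"], job.get("Command", {}).get("Name"))
```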
Azure Data Lake Storage Gen2 is based on Azure Blob storage and offers a suite of big data analytics features. If you don’t understand the concept, you might want to check out our previous article on the difference between data lakes and data warehouses. Migrate data, workloads, and applications.
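As a hedged sketch of working with ADLS Gen2 from code, the following lists the paths in a filesystem using the Azure SDK for Python; the storage account name and filesystem name are placeholders, not values from the article.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Connect to an ADLS Gen2 account; account URL and filesystem name
# are made-up placeholders.
service = DataLakeServiceClient(
    account_url="https://mydatalake.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

fs = service.get_file_system_client("raw")
for path in fs.get_paths():
    print(path.name)
```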
Amazon DataZone now supports authentication through the Amazon Athena JDBC driver, allowing data users to seamlessly query their subscribed data lake assets via popular business intelligence (BI) and analytics tools like Tableau, Power BI, Excel, SQL Workbench, DBeaver, and more. Choose Test connection.
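The excerpt doesn't show the JDBC connection details, so as a hedged stand-in, here is roughly what querying a subscribed asset looks like through the Athena API with boto3; the database, table, workgroup, and output location are all placeholders.

```python
import boto3

# Run a query against a subscribed data lake asset via Athena.
# All identifiers below are illustrative, not from the article.
athena = boto3.client("athena")

resp = athena.start_query_execution(
    QueryString="SELECT * FROM subscribed_db.sales_asset LIMIT 10",
    WorkGroup="primary",
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(resp["QueryExecutionId"])
```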
In 2022, data organizations will institute robust automated processes around their AI systems to make them more accountable to stakeholders. Model developers will test for AI bias as part of their pre-deployment testing. Quality test suites will enforce “equity,” like any other performance metric.
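As a hedged illustration of enforcing “equity” like any other performance metric, here is a small pre-deployment test that fails when a demographic parity gap exceeds a threshold; the metric choice, data, and threshold are all illustrative assumptions, not from the source.

```python
import numpy as np

def demographic_parity_gap(y_pred, group):
    """Absolute spread in positive-prediction rates across groups."""
    rates = [y_pred[group == g].mean() for g in np.unique(group)]
    return max(rates) - min(rates)

def test_model_equity():
    # Illustrative predictions and protected-group labels.
    y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1])
    group = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])
    # Fail the build if the gap exceeds a chosen threshold
    # (0.2 here is an arbitrary example value).
    assert demographic_parity_gap(y_pred, group) <= 0.2
```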
Use cases for Hive metastore federation for Amazon EMR: Hive metastore federation for Amazon EMR is applicable to the following use cases: Governance of Amazon EMR-based data lakes – Producers generate data within their AWS accounts using an Amazon EMR-based data lake supported by EMRFS on Amazon Simple Storage Service (Amazon S3) and HBase.
For many organizations, this centralized data store follows a data lake architecture. Although data lakes provide a centralized repository, making sense of this data and extracting valuable insights can be challenging. We recommend testing your use case and data with different models.
The DataFrame code generation now extends beyond AWS Glue DynamicFrame to support a broader range of data processing scenarios. Next, the merged data is filtered to include only a specific geographic region. Then the transformed output data is saved to Amazon S3 for further processing in the future.
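A minimal PySpark sketch of that filter-and-save pattern follows; the bucket paths, column name, and region value are made-up placeholders, not details from the article.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("region-filter").getOrCreate()

# Read the merged dataset (placeholder path).
orders = spark.read.parquet("s3://my-bucket/merged/orders/")

# Keep only one geographic region, then persist the result to S3
# for downstream processing.
us_east = orders.filter(orders["region"] == "us-east")
us_east.write.mode("overwrite").parquet("s3://my-bucket/filtered/orders/")
```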
For each service, you need to learn the supported authorization and authentication methods, data access APIs, and frameworks to onboard and test data sources. This approach simplifies your data journey and helps you meet your security requirements. On your project, in the navigation pane, choose Data. Choose Next.
Your Chance: Want to test an agile business intelligence solution? Business intelligence is moving away from the traditional engineering model: analysis, design, construction, testing, and implementation. You need to determine whether you are going with an on-premises or cloud-hosted strategy. Finalize testing. Train end users.
It also makes it easier for engineers, data scientists, product managers, analysts, and business users to access data throughout an organization to discover, use, and collaborate to derive data-driven insights. Note that a managed data asset is an asset for which Amazon DataZone can manage permissions.
All this data arrives by the terabyte, and a data management platform can help marketers make sense of it all. Marketing-focused or not, DMPs excel at negotiating with a wide array of databases, data lakes, or data warehouses, ingesting their streams of data and then cleaning, sorting, and unifying the information therein.
Data storage databases. Your SaaS company can store and protect any amount of data using Amazon Simple Storage Service (S3), which is ideal for data lakes, cloud-native applications, and mobile apps. Well, let’s find out. Artificial intelligence (AI). Easy to use.
The Hive metastore is a repository of metadata about the SQL tables, such as database names, table names, schema, serialization and deserialization information, data location, and partition details of each table. Therefore, organizations have come to host huge volumes of metadata of their structured datasets in the Hive metastore.
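As a hedged illustration, Spark SQL can surface exactly this kind of metastore metadata (schema, location, partition details); the database and table names below are placeholders.

```python
from pyspark.sql import SparkSession

# Enable Hive support so Spark reads table metadata from the
# Hive metastore rather than its in-memory catalog.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Schema, SerDe information, and data location for one table.
spark.sql("DESCRIBE FORMATTED sales_db.web_orders").show(truncate=False)

# Partition details tracked by the metastore.
spark.sql("SHOW PARTITIONS sales_db.web_orders").show(truncate=False)
```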
To bring their customers the best deals and user experience, smava follows the modern data architecture principles with a data lake as a scalable, durable data store and purpose-built data stores for analytical processing and data consumption.
For Host, enter events.PagerDuty.com. Choose Send test message to make sure you receive an alert on the PagerDuty service. This notification can be safely acknowledged and resolved from PagerDuty because this was a test. Vivek Shrivastava is a Principal Data Architect, Data Lake in AWS Professional Services.
Many organizations are building datalakes to store and analyze large volumes of structured, semi-structured, and unstructured data. In addition, many teams are moving towards a data mesh architecture, which requires them to expose their data sets as easily consumable data products.
For the past 5 years, BMS has used a custom framework called Enterprise Data Lake Services (EDLS) to create ETL jobs for business users. BMS’s EDLS platform hosts over 5,000 jobs and is growing at 15% year over year. About the authors: Sivaprasad Mahamkali is a Senior Streaming Data Engineer at AWS Professional Services.
They recently needed to do a monthly load of 140 TB of uncompressed healthcare claims data in under 24 hours after receiving it to provide analysts and data scientists with up-to-date information on a patient’s healthcare journey. This data volume is expected to increase monthly and is fully refreshed each month.
Amazon Redshift is a fast, scalable, and fully managed cloud data warehouse that allows you to process and run your complex SQL analytics workloads on structured and semi-structured data. It also helps you securely access your data in operational databases, data lakes, or third-party datasets with minimal movement or copying of data.
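As a hedged sketch of that kind of minimal-movement access, the Redshift Data API can run SQL that spans warehouse tables and external (data lake) tables exposed through an external schema; the workgroup, database, and schema names below are placeholders.

```python
import boto3

# Run SQL via the Redshift Data API; no JDBC connection required.
rsd = boto3.client("redshift-data")

resp = rsd.execute_statement(
    WorkgroupName="my-serverless-workgroup",  # or ClusterIdentifier=... for provisioned
    Database="dev",
    # spectrum_schema is an illustrative external schema over S3 data.
    Sql="SELECT count(*) FROM spectrum_schema.clickstream_events",
)
print(resp["Id"])  # statement ID, used later to fetch results
```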
“But for two years, we were testing limits within the public cloud.” While managing unstructured data remains a challenge for 36% of organizations, according to the 2022 Foundry Data and Analytics Research survey, many IT leaders are actively seeking ways of harnessing all types of data stored in data lakes.
Data lakes are designed for storing vast amounts of raw, unstructured, or semi-structured data at a low cost, and organizations share those datasets across multiple departments and teams. The queries on these large datasets read vast amounts of data and can perform complex join operations on multiple datasets.
Customers have been using data warehousing solutions to perform their traditional analytics tasks. Recently, data lakes have gained a lot of traction to become the foundation for analytical solutions, because they come with benefits such as scalability, fault tolerance, and support for structured, semi-structured, and unstructured datasets.
This will enable right-sizing the Redshift data warehouse to meet workload demands cost-effectively. Thorough testing and performance optimization will facilitate a smooth transition with minimal disruption to end-users, fostering exceptional user experiences and satisfaction.
Well, firstly, if the main data warehouses, repositories, or application databases that BusinessObjects accesses are on premises, it makes no sense to move BusinessObjects to the cloud until you move its data sources to the cloud. You also have the option of hosting with a third party.
Modern applications store massive amounts of data on Amazon Simple Storage Service (Amazon S3) data lakes, providing cost-effective and highly durable storage, and allowing you to run analytics and machine learning (ML) from your data lake to generate insights on your data.
Beginning in 2021, the Minneapolis-based Microsoft partner helped Dairyland migrate from several custom legacy applications to a commercial implementation of Dynamics 365 and an Azure data lake, which set the stage for the power company’s early foray into AI, according to the systems integrator.
In this post, we’ll perform a similar test to validate that the feature works as expected in Azure, too. Cloudera allows FreeIPA servers, the enterprise data lake, and data hubs to be configured as a Multi-AZ deployment; the Azure CLI command is shown below.
Test out the disaster recovery plan by simulating a failover event in a non-production environment. Our pre-launch tests found that the RTO with Amazon Redshift Multi-AZ deployments is under 60 seconds in the unlikely case of an Availability Zone failure. Choose your hosted zone.
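As a hedged sketch of such a drill, boto3 exposes a failover call for Multi-AZ Redshift clusters; the cluster identifier below is a placeholder, the API's availability should be verified against your boto3 version, and this should only ever be run against a test cluster.

```python
import boto3

redshift = boto3.client("redshift")

# Trigger a Multi-AZ failover for a *test* cluster only.
# Assumption: failover_primary_compute is available in your boto3 version.
redshift.failover_primary_compute(ClusterIdentifier="test-multi-az-cluster")

# Wait until the cluster reports available again, then measure the
# observed recovery time against your RTO target.
waiter = redshift.get_waiter("cluster_available")
waiter.wait(ClusterIdentifier="test-multi-az-cluster")
```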
The account on the right hosts the pseudonymization service, which you can deploy using the instructions provided in Part 1 of this series. For an overview of how to build an ACID-compliant data lake using Iceberg, refer to Build a high-performance, ACID compliant, evolving data lake using Apache Iceberg on Amazon EMR.
Each service is hosted in a dedicated AWS account and is built and maintained by a product owner and a development team, as illustrated in the following figure. This separation means changes can be tested thoroughly before being deployed to live operations. The following figure illustrates the data mesh architecture.
The flow he built differentiates between a test and a true API call before initiating a secure login. Completeness is estimated by comparing a test result with the “estimated total.” RK built some simple flows to pull streaming data into Google Cloud Storage and Snowflake. The brilliant part comes next.
His background is in data warehouse and data lake architecture, development, and administration. He has been in the data and analytics field for over 14 years. Ramesh Raghupathy is a Senior Data Architect with WWCO ProServe at AWS. While not at work, Ramesh enjoys traveling, spending time with family, and yoga.
Amazon Redshift is a popular cloud data warehouse, offering a fully managed cloud-based service that seamlessly integrates with an organization’s Amazon Simple Storage Service (Amazon S3) data lake, real-time streams, machine learning (ML) workflows, transactional workflows, and much more—all while providing up to 7.9x better price-performance.
Use Lake Formation to grant permissions to users to access data. Test the solution by accessing data with a corporate identity. Audit user data access. On the Lake Formation console, choose Data lake permissions under Permissions in the navigation pane. Select Named Data Catalog resources.
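The console grant above has a programmatic equivalent; here is a minimal boto3 sketch in which the role ARN, database, and table names are placeholders, not values from the article.

```python
import boto3

lf = boto3.client("lakeformation")

# Grant SELECT on one Data Catalog table to an IAM role.
# All identifiers are illustrative placeholders.
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/AnalystRole"
    },
    Resource={"Table": {"DatabaseName": "sales_db", "Name": "web_orders"}},
    Permissions=["SELECT"],
)
```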
Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications. This solution uses Amazon Aurora MySQL hosting the example database salesdb.
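As a hedged illustration of the event-streaming side (the excerpt includes no code), this is roughly what publishing a change event from a database like salesdb to a Kafka topic looks like with the kafka-python client; the broker address, topic, and payload are made up.

```python
import json

from kafka import KafkaProducer  # pip install kafka-python

# Connect to a Kafka broker and serialize messages as JSON.
# Broker, topic, and event contents are illustrative.
producer = KafkaProducer(
    bootstrap_servers=["broker1:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

producer.send("salesdb.orders", {"order_id": 1001, "status": "NEW"})
producer.flush()  # block until the message is actually sent
```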
Putting your data to work with generative AI – Innovation Talk, Thursday, November 30 | 12:30 – 1:30 PM PST | The Venetian. Join Mai-Lan Tomsen Bukovec, Vice President, Technology at AWS, to learn how you can turn your data lake into a business advantage with generative AI. Reserve your seat now!
Two private subnets are used to set up the Amazon MWAA environment, and the third private subnet is used to host the AWS Lambda authorizer function. Test the solution: now that the SAML configuration and relevant AWS services are created, it’s time to access the Amazon MWAA environment. For EntraIDLoginURL, enter the Azure IdP URI.
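The excerpt doesn't include the authorizer's code, so the following is a generic, heavily stubbed Lambda authorizer skeleton (API Gateway-style, with token validation left as a placeholder); the real implementation in the article may differ substantially.

```python
def lambda_handler(event, context):
    # Pull the caller's token from the request headers.
    token = event.get("headers", {}).get("authorization", "")
    effect = "Allow" if is_valid_token(token) else "Deny"
    return build_policy(effect, event["methodArn"])

def is_valid_token(token):
    # Stub: a real authorizer would verify the SAML/Entra ID token's
    # signature, issuer, audience, and expiry here.
    return bool(token)

def build_policy(effect, resource):
    # Standard IAM policy document returned by a Lambda authorizer.
    return {
        "principalId": "user",
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Action": "execute-api:Invoke",
                    "Effect": effect,
                    "Resource": resource,
                }
            ],
        },
    }
```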
The typical Cloudera Enterprise Data Hub Cluster starts with a few dozen nodes in the customer’s datacenter hosting a variety of distributed services. Over time, workloads start processing more data, tenants start onboarding more workloads, and administrators (admins) start onboarding more tenants. Cloudera Manager (CM) 6.2
Episode 4: Unlocking the Value of Enterprise AI with Data Engineering Capabilities. They discuss how the data engineering team is instrumental in easing collaboration between analysts, data scientists, and ML engineers to build enterprise AI solutions.
As a QuickSight administrator, you can use AWS CloudFormation templates to migrate assets between distinct environments from development, to test, to production. Create an Amazon Redshift data source in AWS CloudFormation In this step, we add the AWS::QuickSight::DataSource section of the CloudFormation template.
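As a hedged sketch of that template fragment, the following builds an AWS::QuickSight::DataSource resource as a Python dict and prints it as JSON; the data source ID, name, database, and cluster endpoint are placeholders, not values from the article.

```python
import json

# Illustrative CloudFormation fragment for a QuickSight Redshift
# data source, expressed as a Python dict.
data_source = {
    "RedshiftDataSource": {
        "Type": "AWS::QuickSight::DataSource",
        "Properties": {
            "AwsAccountId": {"Ref": "AWS::AccountId"},
            "DataSourceId": "redshift-dev",
            "Name": "redshift-dev",
            "Type": "REDSHIFT",
            "DataSourceParameters": {
                "RedshiftParameters": {
                    "Database": "dev",
                    "Host": "my-cluster.abc123.us-east-1.redshift.amazonaws.com",
                    "Port": 5439,
                }
            },
        },
    }
}

print(json.dumps({"Resources": data_source}, indent=2))
```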
Those decentralization efforts appeared under different monikers through time, e.g., data marts versus data warehousing implementations (a popular architectural debate in the era of structured data), then enterprise-wide data lakes versus smaller, typically BU-specific “data ponds.”
Additionally, quantitative data forms the basis on which you can confidently infer, estimate, and project future performance, using techniques such as regression analysis, hypothesis testing, and Monte Carlo simulations. Despite its many uses, quantitative data presents two main challenges for a data-driven organization.
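As a small worked illustration of the Monte Carlo technique mentioned above (with entirely made-up growth assumptions), the following projects a year-end figure with an uncertainty band:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Assumptions (illustrative only): $100k starting revenue, monthly
# growth drawn from a normal distribution with 2% mean, 5% volatility.
start_revenue = 100_000.0
monthly_growth = rng.normal(loc=0.02, scale=0.05, size=(10_000, 12))

# Simulate 10,000 possible years by compounding the monthly draws.
year_end = start_revenue * np.prod(1 + monthly_growth, axis=1)

# Report the median projection plus a 90% uncertainty band.
low, mid, high = np.percentile(year_end, [5, 50, 95])
print(f"Projected year-end revenue: {mid:,.0f} (90% band: {low:,.0f}-{high:,.0f})")
```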
I’ll be there with the Alation team sharing our product and discussing how we can partner with you to drive data literacy in your organization. We have a new demo of how Alation automatically catalogs the data lake using ThinkBig’s Kylo initiative. Host: Oliver Ratzesberger, Teradata EVP and Chief Product Officer.