Data Integration, Data Lake and Machine Learning

Use Apache Iceberg in your data lake with Amazon S3, AWS Glue, and Snowflake

AWS Big Data

APRIL 3, 2024

licensed, 100% open-source data table format that helps simplify data processing on large datasets stored in data lakes. Data engineers use Apache Iceberg because it’s fast, efficient, and reliable at any scale and keeps records of how datasets change over time.

Data Lake

Data Lake Snapshot Metadata Data Architecture

Simplify data integration with AWS Glue and zero-ETL to Amazon SageMaker Lakehouse

AWS Big Data

DECEMBER 4, 2024

With the growing emphasis on data, organizations are constantly seeking more efficient and agile ways to integrate their data, especially from a wide variety of applications. SageMaker Lakehouse gives you the flexibility to access and query your data in-place with all Apache Iceberg compatible tools and engines.

Data Integration

Data Integration Data Lake Statistics Data-driven

Five steps to jumpstart your data integration journey

IBM Big Data Hub

JUNE 26, 2020

Organizations need to collect, organize, and analyze their data across multi-cloud, hybrid cloud, and data lakes. In turn, enterprises are increasingly looking for machine-learning-powered integration tools to synchronize data for analytics, improve employee productivity, and prepare data for analytics.

Data Integration

Data Integration Data Lake Machine Learning Enterprise

Webinars

Apache Airflow®: The Ultimate Guide to DAG Writing

MORE WEBINARS

Migrate Delta tables from Azure Data Lake Storage to Amazon S3 using AWS Glue

AWS Big Data

SEPTEMBER 10, 2024

We often see requests from customers who have started their data journey by building data lakes on Microsoft Azure, to extend access to the data to AWS services. In such scenarios, data engineers face challenges in connecting and extracting data from storage containers on Microsoft Azure.

Data Lake

Data Lake Metadata Management Software

How Cloudinary transformed their petabyte scale streaming data lake with Apache Iceberg and AWS Analytics

AWS Big Data

JUNE 10, 2024

This amalgamation empowers vendors with authority over a diverse range of workloads by virtue of owning the data. This authority extends across realms such as business intelligence, data engineering, and machine learning thus limiting the tools and capabilities that can be used. 5 seconds $0.08 8 seconds $0.07

Data Lake

Data Lake Metadata Snapshot Analytics

Design a data mesh pattern for Amazon EMR-based data lakes using AWS Lake Formation with Hive metastore federation

AWS Big Data

JUNE 10, 2024

Use cases for Hive metastore federation for Amazon EMR Hive metastore federation for Amazon EMR is applicable to the following use cases: Governance of Amazon EMR-based data lakes – Producers generate data within their AWS accounts using an Amazon EMR-based data lake supported by EMRFS on Amazon Simple Storage Service (Amazon S3)and HBase.

Data Lake

Data Lake Metadata Data Warehouse Data Processing

Orca Security’s journey to a petabyte-scale data lake with Apache Iceberg and AWS Analytics

AWS Big Data

JULY 20, 2023

With data becoming the driving force behind many industries today, having a modern data architecture is pivotal for organizations to be successful. In this post, we describe Orca’s journey building a transactional data lake using Amazon Simple Storage Service (Amazon S3), Apache Iceberg, and AWS Analytics.

Data Lake

Data Lake Analytics Snapshot Data Quality

Talend Data Fabric Simplifies Data Life Cycle Management

David Menninger's Analyst Perspectives

NOVEMBER 16, 2021

Talend is a data integration and management software company that offers applications for cloud computing, big data integration, application integration, data quality and master data management.

Management

Management Data Warehouse Data Quality Data Integration

Monitoring Apache Iceberg metadata layer using AWS Lambda, AWS Glue, and AWS CloudWatch

AWS Big Data

JULY 29, 2024

In the era of big data, data lakes have emerged as a cornerstone for storing vast amounts of raw data in its native format. They support structured, semi-structured, and unstructured data, offering a flexible and scalable environment for data ingestion from multiple sources.

Metadata

Metadata Snapshot Data Lake Metrics

Data transformation takes flight at Atlanta’s Hartsfield-Jackson airport

CIO Business Intelligence

AUGUST 9, 2024

At Atlanta’s Hartsfield-Jackson International Airport, an IT pilot has led to a wholesale data journey destined to transform operations at the world’s busiest airport, fueled by machine learning and generative AI. Data integrity presented a major challenge for the team, as there were many instances of duplicate data.

Data Transformation

Data Transformation Machine Learning Data Lake Dashboards

Amazon Redshift announcements at AWS re:Invent 2023 to enable analytics on all your data

AWS Big Data

NOVEMBER 29, 2023

This cloud service was a significant leap from the traditional data warehousing solutions, which were expensive, not elastic, and required significant expertise to tune and operate. Use one click to access your data lake tables using auto-mounted AWS Glue data catalogs on Amazon Redshift for a simplified experience.

Data Warehouse

Data Warehouse Data Lake Analytics Machine Learning

The next generation of Amazon SageMaker: The center for all your data, analytics, and AI

AWS Big Data

DECEMBER 4, 2024

Our customers are telling us that they are seeing their analytics and AI workloads increasingly converge around a lot of the same data, and this is changing how they are using analytics tools with their data. This innovation drives an important change: you’ll no longer have to copy or move data between data lake and data warehouses.

Data Analytics

Data Analytics Analytics Data Lake Data Quality

Access Amazon Redshift data from Salesforce Data Cloud with Zero Copy Data Federation

AWS Big Data

JUNE 25, 2024

In today’s data-driven business landscape, organizations collect a wealth of data across various touch points and unify it in a central data warehouse or a data lake to deliver business insights. It provides secure, real-time access to Redshift data without copying, keeping enterprise data in place.

Data Lake

Data Lake Cost-Benefit Data-driven Data Warehouse

Unlocking the Potential of Machine Learning in a Data Lake

Data Virtualization

MARCH 27, 2019

With data becoming the brain food to the intelligence of every organization, regardless of size or sector, it has become crucial to harness this data to achieve the best results, make the most informed decisions and improve productivity. However, with.

Data Lake

Data Lake Machine Learning IT Data Integration

Preparing the foundations for Generative AI

CIO Business Intelligence

FEBRUARY 20, 2024

Data also needs to be sorted, annotated and labelled in order to meet the requirements of generative AI. No wonder CIO’s 2023 AI Priorities study found that data integration was the number one concern for IT leaders around generative AI integration, above security and privacy and the user experience.

Cost-Benefit

Cost-Benefit Data Lake Data Warehouse Data Processing

Machine Learning and AI Underpin Predictive Analytics to Achieve Clinical Breakthroughs

Cloudera

JULY 18, 2018

To arrive at quality data, organizations are spending significant levels of effort on data integration, visualization, and deployment activities. Additionally, organizations are increasingly restrained due to budgetary constraints and having limited data sciences resources.

Machine Learning

Machine Learning Predictive Analytics Analytics Prescriptive Analytics

Build Write-Audit-Publish pattern with Apache Iceberg branching and AWS Glue Data Quality

AWS Big Data

DECEMBER 9, 2024

The importance of publishing only high-quality data cant be overstatedits the foundation for accurate analytics, reliable machine learning (ML) models, and sound decision-making. AWS Glue is a serverless data integration service that you can use to effectively monitor and manage data quality through AWS Glue Data Quality.

Data Quality

Data Quality Publishing Snapshot Data Lake

Databricks’ new data lakehouse aims at media, entertainment sector

CIO Business Intelligence

APRIL 25, 2022

The data lakehouse is a relatively new data architecture concept, first championed by Cloudera, which offers both storage and analytics capabilities as part of the same solution, in contrast to the concepts for data lake and data warehouse which, respectively, store data in native format, and structured data, often in SQL format.

Recreation/Entertainment

Recreation/Entertainment Data Lake Data Warehouse Unstructured Data

Compose your ETL jobs for MongoDB Atlas with AWS Glue

AWS Big Data

MAY 3, 2023

In today’s data-driven business environment, organizations face the challenge of efficiently preparing and transforming large amounts of data for analytics and data science purposes. Businesses need to build data warehouses and data lakes based on operational data.

Data Lake

Data Lake Data Warehouse Data-driven Optimization

Simplifying data processing at Capitec with Amazon Redshift integration for Apache Spark

AWS Big Data

NOVEMBER 10, 2023

Amazon Redshift offers seamless integration with Apache Spark, allowing you to easily access your Redshift data on both Amazon Redshift provisioned clusters and Amazon Redshift Serverless. These tables are then joined with tables from the Enterprise Data Lake (EDL) at runtime. options(**read_config).option("query",

Data Processing

Data Processing Data Lake Data Warehouse Optimization

Migrate data from Azure Blob Storage to Amazon S3 using AWS Glue

AWS Big Data

OCTOBER 20, 2023

Today, we are pleased to announce new AWS Glue connectors for Azure Blob Storage and Azure Data Lake Storage that allow you to move data bi-directionally between Azure Blob Storage, Azure Data Lake Storage, and Amazon Simple Storage Service (Amazon S3). option("header","true").load("wasbs://yourblob@youraccountname.blob.core.windows.net/loadingtest-input/100mb")

Data Lake

Data Lake Big Data Consulting Data Warehouse

What is Data Pipeline? A Detailed Explanation

Smart Data Collective

OCTOBER 17, 2022

A point of data entry in a given pipeline. Examples of an origin include storage systems like data lakes, data warehouses and data sources that include IoT devices, transaction processing applications, APIs or social media. The final point to which the data has to be eventually transferred is a destination.

Data Warehouse

Data Warehouse Data Lake Visualization Big Data

An AI Chat Bot Wrote This Blog Post …

DataKitchen

DECEMBER 9, 2022

ChatGPT> DataOps is a term that refers to the set of practices and tools that organizations use to improve the quality and speed of data analytics and machine learning. It involves bringing together people, processes, and technology to enable data-driven decision making and improve the efficiency of data-related workflows.

Machine Learning

Machine Learning Data-driven Optimization Data Analytics

Automatically detect Personally Identifiable Information in Amazon Redshift using AWS Glue

AWS Big Data

DECEMBER 15, 2023

Even after identification, it’s cumbersome to implement redaction, masking, or encryption of sensitive data at scale. In this post, we provide an automated solution to detect PII data in Amazon Redshift using AWS Glue. For our solution, we use Amazon Redshift to store the data.

Data Lake

Data Lake Data Warehouse Big Data Structured Data

Recap of Amazon Redshift key product announcements in 2024

AWS Big Data

DECEMBER 17, 2024

Today, Amazon Redshift is used by customers across all industries for a variety of use cases, including data warehouse migration and modernization, near real-time analytics, self-service analytics, data lake analytics, machine learning (ML), and data monetization.

Data Lake

Data Lake Data Warehouse Data-driven Optimization

It’s not your data. It’s how you use it. Unlock the power of data & build foundations of a data driven organisation

CIO Business Intelligence

MAY 24, 2022

Data analytics, applied effectively, can provide extremely valuable guidance to identify trends and inform business decision making, but the data has to be accessible to these data analytics tools if they are to deliver actionable insights. analyse the data, using business intelligence, visualisation or data science tools.

Data-driven

Data-driven Data Lake Data Warehouse IT

Introducing Apache Hudi support with AWS Glue crawlers

AWS Big Data

NOVEMBER 22, 2023

Apache Hudi is an open table format that brings database and data warehouse capabilities to data lakes. Apache Hudi helps data engineers manage complex challenges, such as managing continuously evolving datasets with transactions while maintaining query performance. Under Administration , choose Data catalog settings.

Data Lake

Data Lake Snapshot Metadata Optimization

The Ten Standard Tools To Develop Data Pipelines In Microsoft Azure

DataKitchen

JULY 27, 2023

Let’s go through the ten Azure data pipeline tools Azure Data Factory : This cloud-based data integration service allows you to create data-driven workflows for orchestrating and automating data movement and transformation. You can use it for big data analytics and machine learning workloads.

Machine Learning

Machine Learning Cost-Benefit Data Transformation Testing

How GamesKraft uses Amazon Redshift data sharing to support growing analytics workloads

AWS Big Data

NOVEMBER 13, 2023

Amazon Redshift is a fully managed data warehousing service that offers both provisioned and serverless options, making it more efficient to run and scale analytics without having to manage your data warehouse. Additionally, data is extracted from vendor APIs that includes data related to product, marketing, and customer experience.

Data Warehouse

Data Warehouse Analytics Data Lake Data Science

Unlock scalable analytics with a secure connectivity pattern in AWS Glue to read from or write to Snowflake

AWS Big Data

AUGUST 19, 2024

As organizations increasingly rely on data stored across various platforms, such as Snowflake , Amazon Simple Storage Service (Amazon S3), and various software as a service (SaaS) applications, the challenge of bringing these disparate data sources together has never been more pressing.

Analytics

Analytics Data-driven Data Integration Data Lake

How Ruparupa gained updated insights with an Amazon S3 data lake, AWS Glue, Apache Hudi, and Amazon QuickSight

AWS Big Data

FEBRUARY 22, 2023

In this post, we show how Ruparupa implemented an incrementally updated data lake to get insights into their business using Amazon Simple Storage Service (Amazon S3), AWS Glue , Apache Hudi , and Amazon QuickSight. An AWS Glue ETL job, using the Apache Hudi connector, updates the S3 data lake hourly with incremental data.

Data Lake

Data Lake Dashboards Cost-Benefit Data Warehouse

Data replication holds the key to hybrid cloud effectiveness

CIO Business Intelligence

MARCH 18, 2024

But when it comes to getting the most value out of hybrid cloud, one of the most crucial capabilities required is data replication and synchronization—what enables businesses to efficiently capture data changes and unify various data stores while ensuring low latency, high availability, and data integrity.

Cost-Benefit

Cost-Benefit Data Lake Machine Learning Data Integration

Prepare and load Amazon S3 data into Teradata using AWS Glue through its native connector for Teradata Vantage

AWS Big Data

NOVEMBER 30, 2023

In this post, we explore how to use the AWS Glue native connector for Teradata Vantage to streamline data integrations and unlock the full potential of your data. Businesses often rely on Amazon Simple Storage Service (Amazon S3) for storing large amounts of data from various data sources in a cost-effective and secure manner.

IT

IT Visualization Machine Learning Data Integration

Modernizing the Data Warehouse: Challenges and Benefits

BI-Survey

AUGUST 21, 2020

The primary modernization approach is data warehouse/ETL automation, which helps promote broad usage of the data warehouse but can only partially improve efficiency in data management processes. However, an automation approach alone is of limited usefulness when data management processes are inefficient.

Data Warehouse

Data Warehouse Data Lake Data Governance Data Architecture

Don’t Fear Artificial Intelligence; Embrace it Through Data Governance

CIO Business Intelligence

APRIL 29, 2022

Software development, once solely the domain of human programmers, is now increasingly the by-product of data being carefully selected, ingested, and analysed by machine learning (ML) systems in a recurrent cycle. Further, data management activities don’t end once the AI model has been developed. era is upon us.

Data Governance

Data Governance IT Risk Data Lake

Modernize your ETL platform with AWS Glue Studio: A case study from BMS

AWS Big Data

DECEMBER 13, 2023

In addition to using native managed AWS services that BMS didn’t need to worry about upgrading, BMS was looking to offer an ETL service to non-technical business users that could visually compose data transformation workflows and seamlessly run them on the AWS Glue Apache Spark-based serverless data integration engine.

Metadata

Metadata Data Lake Visualization Data Transformation

What is data architecture? A framework to manage data

CIO Business Intelligence

DECEMBER 20, 2024

Beyond breaking down silos, modern data architectures need to provide interfaces that make it easy for users to consume data using tools fit for their jobs. Data must be able to freely move to and from data warehouses, data lakes, and data marts, and interfaces must make it easy for users to consume that data.

Data Architecture

Data Architecture Management Consulting Internet of Things

Straumann Group is transforming dentistry with data, AI

CIO Business Intelligence

FEBRUARY 16, 2023

My vision is that I can give the keys to my businesses to manage their data and run their data on their own, as opposed to the Data & Tech team being at the center and helping them out,” says Iyengar, director of Data & Tech at Straumann Group North America. The company’s Findability.ai

Unstructured Data

Unstructured Data Data Lake Prescriptive Analytics Data Warehouse

Automate schema evolution at scale with Apache Hudi in AWS Glue

AWS Big Data

FEBRUARY 7, 2023

This post focuses on such schema changes in file-based tables and shows how to automatically replicate the schema evolution of structured data from table formats in databases to the tables stored as files in cost-effective way. Apache Hudi supports ACID transactions and CRUD operations on a data lake. and save it.

Data Lake

Data Lake Testing Big Data Structured Data

Accelerate analytics on Amazon OpenSearch Service with AWS Glue through its native connector

AWS Big Data

DECEMBER 21, 2023

As the volume and complexity of analytics workloads continue to grow, customers are looking for more efficient and cost-effective ways to ingest and analyse data. AWS Glue provides both visual and code-based interfaces to make data integration effortless.

Analytics

Analytics IT Data Lake Visualization

P&G turns to AI to create digital manufacturing of the future

CIO Business Intelligence

OCTOBER 1, 2022

P&G is also piloting the use of IIoT, advanced algorithms, machine learning (ML), and predictive analytics to improve manufacturing efficiencies in the production of paper towels. The end-to-end process requires several steps, including data integration and algorithm development, training, and deployment.

Manufacturing

Manufacturing Digital Transformation IoT Internet of Things

Data governance in the age of generative AI

AWS Big Data

FEBRUARY 29, 2024

However, enterprise data generated from siloed sources combined with the lack of a data integration strategy creates challenges for provisioning the data for generative AI applications. As part of the transformation, the objects need to be treated to ensure data privacy (for example, PII redaction).

Data Governance

Data Governance Unstructured Data Metadata Data Lake

ThoughtSpot Enables Simpler Analytics with AI and NLP

David Menninger's Analyst Perspectives

JANUARY 21, 2022

To derive value from this data, organizations must query the data regularly and share insights with relevant teams and departments. Some organizations have started using NLP in self-service analytics to quickly identify patterns and simplify data visualization.

Analytics

Analytics Machine Learning Visualization Reporting

Improve healthcare services through patient 360: A zero-ETL approach to enable near real-time data analytics

AWS Big Data

MARCH 27, 2024

AWS has invested in a zero-ETL (extract, transform, and load) future so that builders can focus more on creating value from data, instead of having to spend time preparing data for analysis. This means you no longer have to create an external schema in Amazon Redshift to use the data lake tables cataloged in the Data Catalog.

Data Analytics

Data Analytics Analytics Data Warehouse Data Lake

Use Apache Iceberg in your data lake with Amazon S3, AWS Glue, and Snowflake

Simplify data integration with AWS Glue and zero-ETL to Amazon SageMaker Lakehouse

Webinars

Trending Sources

Five steps to jumpstart your data integration journey

Webinars

Migrate Delta tables from Azure Data Lake Storage to Amazon S3 using AWS Glue

How Cloudinary transformed their petabyte scale streaming data lake with Apache Iceberg and AWS Analytics

Design a data mesh pattern for Amazon EMR-based data lakes using AWS Lake Formation with Hive metastore federation

Orca Security’s journey to a petabyte-scale data lake with Apache Iceberg and AWS Analytics

Talend Data Fabric Simplifies Data Life Cycle Management

Monitoring Apache Iceberg metadata layer using AWS Lambda, AWS Glue, and AWS CloudWatch

Data transformation takes flight at Atlanta’s Hartsfield-Jackson airport

Amazon Redshift announcements at AWS re:Invent 2023 to enable analytics on all your data

The next generation of Amazon SageMaker: The center for all your data, analytics, and AI

Access Amazon Redshift data from Salesforce Data Cloud with Zero Copy Data Federation

Unlocking the Potential of Machine Learning in a Data Lake

Preparing the foundations for Generative AI

Machine Learning and AI Underpin Predictive Analytics to Achieve Clinical Breakthroughs

Build Write-Audit-Publish pattern with Apache Iceberg branching and AWS Glue Data Quality

Databricks’ new data lakehouse aims at media, entertainment sector

Compose your ETL jobs for MongoDB Atlas with AWS Glue

Simplifying data processing at Capitec with Amazon Redshift integration for Apache Spark

Migrate data from Azure Blob Storage to Amazon S3 using AWS Glue

What is Data Pipeline? A Detailed Explanation

An AI Chat Bot Wrote This Blog Post …

Automatically detect Personally Identifiable Information in Amazon Redshift using AWS Glue

Recap of Amazon Redshift key product announcements in 2024

It’s not your data. It’s how you use it. Unlock the power of data & build foundations of a data driven organisation

Introducing Apache Hudi support with AWS Glue crawlers

The Ten Standard Tools To Develop Data Pipelines In Microsoft Azure

How GamesKraft uses Amazon Redshift data sharing to support growing analytics workloads

Unlock scalable analytics with a secure connectivity pattern in AWS Glue to read from or write to Snowflake

How Ruparupa gained updated insights with an Amazon S3 data lake, AWS Glue, Apache Hudi, and Amazon QuickSight

Data replication holds the key to hybrid cloud effectiveness

Prepare and load Amazon S3 data into Teradata using AWS Glue through its native connector for Teradata Vantage

Modernizing the Data Warehouse: Challenges and Benefits

Don’t Fear Artificial Intelligence; Embrace it Through Data Governance

Modernize your ETL platform with AWS Glue Studio: A case study from BMS

What is data architecture? A framework to manage data

Straumann Group is transforming dentistry with data, AI

Automate schema evolution at scale with Apache Hudi in AWS Glue

Accelerate analytics on Amazon OpenSearch Service with AWS Glue through its native connector

P&G turns to AI to create digital manufacturing of the future

Data governance in the age of generative AI

ThoughtSpot Enables Simpler Analytics with AI and NLP

Improve healthcare services through patient 360: A zero-ETL approach to enable near real-time data analytics

Stay Connected