A Drug Launch Case Study in the Amazing Efficiency of a Data Team Using DataOps: How a Small Team Powered the Multi-Billion Dollar Acquisition of a Pharma Startup. When launching a groundbreaking pharmaceutical product, the stakes and the rewards couldn't be higher. It takes more than a data lake and a database.
While there is a lot of discussion about the merits of data warehouses, not enough discussion centers around data lakes. We talked about enterprise data warehouses in the past, so let's contrast them with data lakes. Both data warehouses and data lakes are used for storing big data.
Today, Amazon Redshift is used by customers across all industries for a variety of use cases, including data warehouse migration and modernization, near real-time analytics, self-service analytics, data lake analytics, machine learning (ML), and data monetization. We have launched new RA3.large instances.
Iceberg has become very popular for its support for ACID transactions in data lakes and for features like schema and partition evolution, time travel, and rollback. AWS Glue 3.0 and later supports the Apache Iceberg framework for data lakes. The following diagram illustrates the solution architecture.
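As a rough sketch of what time travel and rollback look like in practice, the following PySpark snippet queries an Iceberg table as of an earlier snapshot and then rolls the table back to it; the catalog, table, and snapshot ID are hypothetical.

from pyspark.sql import SparkSession

# Assumes a Spark session already configured with an Iceberg catalog named "glue_catalog"
spark = SparkSession.builder.appName("iceberg-time-travel").getOrCreate()

# Time travel: read the table as it existed at a given snapshot (snapshot ID is hypothetical)
spark.sql("""
    SELECT * FROM glue_catalog.db.orders VERSION AS OF 8744736658442914487
""").show()

# Rollback: restore the table to that snapshot using an Iceberg stored procedure
spark.sql("""
    CALL glue_catalog.system.rollback_to_snapshot('db.orders', 8744736658442914487)
""")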
Many organizations operate data lakes spanning multiple cloud data stores. In these cases, you may want an integrated query layer to seamlessly run analytical queries across these diverse cloud stores and streamline your data analytics processes. The AWS Glue Data Catalog holds the metadata for Amazon S3 and GCS data.
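To make this concrete, here is a minimal sketch (database, table, and bucket names are hypothetical) of a single Athena query that joins an S3-backed table with a GCS-backed table, both registered in the Glue Data Catalog:

import boto3

athena = boto3.client("athena")

# Both tables live in the Glue Data Catalog; only their underlying storage differs (S3 vs. GCS)
response = athena.start_query_execution(
    QueryString="""
        SELECT s.order_id, g.campaign_name
        FROM s3_db.orders s
        JOIN gcs_db.campaigns g ON s.campaign_id = g.campaign_id
    """,
    QueryExecutionContext={"Database": "s3_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(response["QueryExecutionId"])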
Cloudinary is a cloud-based media management platform that provides a comprehensive set of tools and services for managing, optimizing, and delivering images, videos, and other media assets on websites and mobile applications.
Observability in DataOps refers to the ability to monitor and understand the performance and behavior of data-related systems and processes, and to use that information to improve the quality and speed of data-driven decision making. Overall, DataOps observability is an essential component of modern data-driven organizations.
Amazon Redshift enables you to directly access data stored in Amazon Simple Storage Service (Amazon S3) using SQL queries and join data across your data warehouse and data lake. With Amazon Redshift, you can query the data in your S3 data lake using a central AWS Glue metastore from your Redshift data warehouse.
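A minimal sketch of that pattern, assuming a hypothetical cluster, IAM role, and table names, using the Redshift Data API: create an external schema that points at the Glue Data Catalog, then join a warehouse table with a data lake table.

import boto3

rsd = boto3.client("redshift-data")

statements = [
    # External schema backed by the Glue Data Catalog (the data lake side)
    """
    CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum_lake
    FROM DATA CATALOG DATABASE 'lake_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
    """,
    # Join local warehouse data with S3 data lake data in one query
    """
    SELECT c.customer_id, SUM(s.amount) AS lake_sales
    FROM warehouse_schema.customers c
    JOIN spectrum_lake.sales s ON c.customer_id = s.customer_id
    GROUP BY c.customer_id
    """,
]
for sql in statements:
    rsd.execute_statement(
        ClusterIdentifier="my-redshift-cluster",
        Database="dev",
        DbUser="awsuser",
        Sql=sql,
    )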
In our previous post Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes, we discussed how you can implement solutions to improve operational efficiencies of your Amazon Simple Storage Service (Amazon S3) data lake that is using the Apache Iceberg open table format and running on the Amazon EMR big data platform.
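Two of the most common operational tasks on an Iceberg table are small-file compaction and snapshot expiration; a minimal Spark SQL sketch (catalog and table names are hypothetical) looks like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-maintenance").getOrCreate()

# Compact small data files into larger ones
spark.sql("CALL glue_catalog.system.rewrite_data_files(table => 'db.orders')")

# Expire old snapshots to reclaim storage, keeping the most recent ten
spark.sql("""
    CALL glue_catalog.system.expire_snapshots(
        table => 'db.orders',
        older_than => TIMESTAMP '2024-01-01 00:00:00',
        retain_last => 10
    )
""")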
In the current industry landscape, data lakes have become a cornerstone of modern data architecture, serving as repositories for vast amounts of structured and unstructured data. Maintaining data consistency and integrity across distributed data lakes is crucial for decision-making and analytics.
When you build your transactional data lake using Apache Iceberg to solve your functional use cases, you also need to focus on operational use cases for your S3 data lake to optimize the production environment.
With data becoming the driving force behind many industries today, having a modern data architecture is pivotal for organizations to be successful. In this post, we describe Orca's journey building a transactional data lake using Amazon Simple Storage Service (Amazon S3), Apache Iceberg, and AWS Analytics.
With this integration, you can now seamlessly query your governed data lake assets in Amazon DataZone using popular business intelligence (BI) and analytics tools, including partner solutions like Tableau. Amazon DataZone addresses your data sharing challenges and optimizes data availability.
With this new functionality, customers can create up-to-date replicas of their data from applications such as Salesforce, ServiceNow, and Zendesk in an Amazon SageMaker Lakehouse and Amazon Redshift. SageMaker Lakehouse gives you the flexibility to access and query your data in-place with all Apache Iceberg compatible tools and engines.
Artificial intelligence and machine learning are the future of every industry, especially data and analytics. AI and ML are the only ways to derive value from massive data lakes, cloud-native data warehouses, and other huge stores of information. Use AI to tackle huge datasets.
This is a guest blog post co-authored with Atul Khare and Bhupender Panwar from Salesforce. In this post, we discuss how the Salesforce TIP team optimized their architecture using Amazon Web Services (AWS) managed services to achieve better scalability, cost, and operational efficiency. Headquartered in San Francisco, Salesforce, Inc.
Open table formats are emerging in the rapidly evolving domain of big data management, fundamentally altering the landscape of data storage and analysis. By providing a standardized framework for data representation, open table formats break down data silos, enhance data quality, and accelerate analytics at scale.
It expands beyond tools and data architecture and views the data organization from the perspective of its processes and workflows. The DataKitchen Platform is a “process hub” that masters and optimizes those processes. Cloud computing has made it much easier to integrate data sets, but that's only the beginning.
A modern data architecture is an evolutionary architecture pattern designed to integrate a data lake, data warehouse, and purpose-built stores with a unified governance model. Moreover, the framework should consume compute resources as optimally as possible for the size of the operational tables.
Although Jira Cloud provides reporting capability, loading this data into a data lake will facilitate enrichment with other business data, as well as support the use of business intelligence (BI) tools and artificial intelligence (AI) and machine learning (ML) applications.
Figure 3 shows an example processing architecture with data flowing in from internal and external sources. Each data source is updated on its own schedule, for example, daily, weekly, or monthly. The data scientists and analysts have what they need to build analytics for the user. The new Recipes run, and BOOM!
One modern data platform solution that provides simplicity and flexibility to grow is Snowflake's data cloud and platform. These Snowflake accelerators reduce the time to analytics for your users at all levels so you can make data-driven decisions faster, covering areas such as building a security data lake and optimizing Snowflake functionality.
In the era of big data, data lakes have emerged as a cornerstone for storing vast amounts of raw data in its native format. They support structured, semi-structured, and unstructured data, offering a flexible and scalable environment for data ingestion from multiple sources.
In today's world, customers manage vast amounts of data in their Amazon Simple Storage Service (Amazon S3) data lakes, which often requires complex data pipelines to continuously track changes in the data layout and make them available to consuming systems.
Building a data lake on Amazon Simple Storage Service (Amazon S3) provides numerous benefits for an organization. However, many use cases, like performing change data capture (CDC) from an upstream relational database to an Amazon S3-based data lake, require handling data at a record level.
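One common way to apply record-level CDC changes is a MERGE against an open table format; a minimal PySpark sketch with a hypothetical Iceberg table and change feed:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdc-merge").getOrCreate()

# Hypothetical change batch: op is 'I' (insert), 'U' (update), or 'D' (delete)
changes = spark.createDataFrame(
    [(101, "alice@example.com", "U"), (102, None, "D")],
    ["customer_id", "email", "op"],
)
changes.createOrReplaceTempView("cdc_updates")

# Apply the changes to the target table at a record level
spark.sql("""
    MERGE INTO glue_catalog.db.customers t
    USING cdc_updates s
    ON t.customer_id = s.customer_id
    WHEN MATCHED AND s.op = 'D' THEN DELETE
    WHEN MATCHED THEN UPDATE SET t.email = s.email
    WHEN NOT MATCHED AND s.op <> 'D' THEN INSERT (customer_id, email) VALUES (s.customer_id, s.email)
""")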
This blog post is co-written with Ori Nakar from Imperva. Events and many other security data types are stored in Imperva's Threat Research Multi-Region data lake. Imperva harnesses data to improve their business outcomes. Imperva's data lake has a few dozen different datasets, at the scale of petabytes.
Data lakes have been around for well over a decade now, supporting the analytic operations of some of the world's largest corporations. Such data volumes are not easy to move, migrate, or modernize. Data lakes are, at a high level, single repositories of data at scale, and that monolithic architecture brings its own challenges.
Enterprise data is brought into data lakes and data warehouses to carry out analytical, reporting, and data science use cases using AWS analytics services like Amazon Athena, Amazon Redshift, Amazon EMR, and so on.
Important considerations for preview: as you begin using automated Spark upgrades during the preview period, there are several important aspects to consider for optimal usage of the service. Service scope and limitations – the preview release focuses on PySpark code upgrades from AWS Glue versions 2.0.
However, you can use the same file name as long as it's from different auto-copy jobs:
job_customerA_sales – s3://redshift-blogs/sales/customerA/2022-10-15-sales.csv
job_customerB_sales – s3://redshift-blogs/sales/customerB/2022-10-15-sales.csv
Do not update file contents. Do not overwrite existing files.
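For reference, an auto-copy job is created with a COPY ... JOB CREATE statement; below is a rough sketch via the Redshift Data API, where the cluster, role, and target table are hypothetical and the S3 prefix comes from the example above.

import boto3

rsd = boto3.client("redshift-data")

copy_job_sql = """
    COPY sales.customer_a_sales
    FROM 's3://redshift-blogs/sales/customerA/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS CSV
    JOB CREATE job_customerA_sales
    AUTO ON
"""
rsd.execute_statement(
    ClusterIdentifier="my-redshift-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql=copy_job_sql,
)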
The amount of data being generated and stored every day has exploded. Companies of all kinds are sitting on stockpiles of data that could someday prove valuable. Until then though, they don’t necessarily want to spend the time and resources necessary to create a schema to house this data in a traditional data warehouse.
The market for data lakes has recently seen an impressive wave of new-generation engines that provide highly efficient processing of very large data volumes stored in distributed file systems like S3, ADLS, and others.
Apache Hudi is an open table format that brings database and data warehouse capabilities to data lakes. Apache Hudi helps data engineers manage complex challenges, such as managing continuously evolving datasets with transactions while maintaining query performance. For copy-on-write (CoW) tables, queries see the latest data committed.
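As a rough illustration, the PySpark sketch below upserts records into a copy-on-write Hudi table and reads them back; the table name, keys, and S3 path are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-cow").getOrCreate()

df = spark.createDataFrame(
    [(1, "2024-06-01 10:00:00", 25.0), (2, "2024-06-01 11:00:00", 40.0)],
    ["order_id", "updated_at", "amount"],
)

hudi_options = {
    "hoodie.table.name": "orders_cow",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.operation": "upsert",
}

# Upsert into the CoW table; each commit rewrites the affected files
df.write.format("hudi").options(**hudi_options).mode("append").save("s3://my-bucket/lake/orders_cow/")

# Snapshot read: for CoW tables this returns the latest committed data
spark.read.format("hudi").load("s3://my-bucket/lake/orders_cow/").show()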
Your SaaS company can store and protect any amount of data using Amazon Simple Storage Service (Amazon S3), which is ideal for data lakes, cloud-native applications, and mobile apps. This blog post has demonstrated how AWS can greatly benefit your SaaS company on multiple levels.
For many enterprises, a hybrid cloud data lake is no longer a trend but a reality. Not only can resources be quickly provisioned and optimized for different workloads and processing needs, it can be done cost-effectively. The post covers the problem with hybrid cloud environments and how to catalog AWS S3 with Alation.
The Concert offering focuses on dependency mapping across an increasingly broad set of data sources (or entities) to enable developers, site reliability engineers (SREs), and operations to better understand potential problems at the application layer, according to IDC Research Vice President Stephen Elliot.
Now generally available, the M&E data lakehouse comes with industry use-case specific features that the company calls accelerators, including real-time personalization, said Steve Sobel, the company’s global head of communications, in a blog post. Features focus on media and entertainment firms.
CDF-PC is a cloud-native universal data distribution service powered by Apache NiFi on Kubernetes, allowing developers to connect to any data source anywhere, with any structure, process it, and deliver it to any destination. This blog aims to answer two questions: What is a universal data distribution service?
With Amazon EMR 6.15, we launched AWS Lake Formation-based fine-grained access controls (FGAC) on open table formats (OTFs), including Apache Hudi, Apache Iceberg, and Delta Lake. Many large enterprise companies seek to use their transactional data lake to gain insights and improve decision-making.
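Roughly, FGAC is expressed as Lake Formation data cells filters that are then granted to a principal; a sketch with hypothetical account, database, table, and role names:

import boto3

lf = boto3.client("lakeformation")

# Row- and column-level filter on a transactional (e.g., Iceberg) table
lf.create_data_cells_filter(
    TableData={
        "TableCatalogId": "123456789012",
        "DatabaseName": "lake_db",
        "TableName": "orders_iceberg",
        "Name": "us_rows_only",
        "RowFilter": {"FilterExpression": "region = 'us'"},
        "ColumnNames": ["order_id", "amount", "region"],
    }
)

# Grant SELECT through the filter so the role sees only the permitted rows and columns
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/AnalystRole"},
    Resource={
        "DataCellsFilter": {
            "TableCatalogId": "123456789012",
            "DatabaseName": "lake_db",
            "TableName": "orders_iceberg",
            "Name": "us_rows_only",
        }
    },
    Permissions=["SELECT"],
)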
Cloudera customers run some of the biggest data lakes on earth. These lakes power mission-critical, large-scale data analytics, business intelligence (BI), and machine learning use cases, including enterprise data warehouses.
Apache Ozone is one of the major innovations introduced in CDP, which provides the next generation storage architecture for Big Data applications, where data blocks are organized in storage containers for larger scale and to handle small objects. Cloudera will publish separate blog posts with results of performance benchmarks.
Today, we're making available a new capability of AWS Glue Data Catalog that allows generating column-level statistics for AWS Glue tables. These statistics are now integrated with the cost-based optimizers (CBO) of Amazon Athena and Amazon Redshift Spectrum, resulting in improved query performance and potential cost savings.
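Column statistics can also be generated programmatically; a minimal sketch (database, table, and role are hypothetical) that kicks off a statistics task run through the AWS Glue API:

import boto3

glue = boto3.client("glue")

# Start a column statistics task run for the table; the CBO in Athena and
# Redshift Spectrum can then use the resulting statistics during query planning
run = glue.start_column_statistics_task_run(
    DatabaseName="lake_db",
    TableName="orders",
    Role="arn:aws:iam::123456789012:role/GlueColumnStatsRole",
    SampleSize=100.0,
)
print(run["ColumnStatisticsTaskRunId"])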
However, enterprises often encounter challenges with data silos, insufficient access controls, poor governance, and quality issues. Embracing data as a product is the key to addressing these challenges and fostering a data-driven culture. In this context, Amazon DataZone is the optimal choice for managing the enterprise data platform.