Data Analytics, Data Lake and Data Processing

Data Analytics

Data Lake

Data Processing

Modernize your legacy databases with AWS data lakes, Part 2: Build a data lake using AWS DMS data on Apache Iceberg

AWS Big Data

OCTOBER 30, 2024

This is part two of a three-part series where we show how to build a data lake on AWS using a modern data architecture. This post shows how to load data from a legacy database (SQL Server) into a transactional data lake ( Apache Iceberg ) using AWS Glue. To start the job, choose Run. format(dbname)).config("spark.sql.catalog.glue_catalog.catalog-impl",

Data Lake

Data Lake Data Processing Optimization Machine Learning

Important Considerations When Migrating to a Data Lake

Smart Data Collective

MARCH 30, 2022

Azure Data Lake Storage Gen2 is based on Azure Blob storage and offers a suite of big data analytics features. If you don’t understand the concept, you might want to check out our previous article on the difference between data lakes and data warehouses. Determine your preparedness.

Data Lake

Data Lake Cost-Benefit Data Warehouse Big Data

Join 42,000+

Insiders

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Webinars

Data Talks, CFOs Listen: Why Analytics Are Key To Better Spend Management

Mastering Apache Airflow® 3.0: What’s New (and What’s Next) for Data Orchestration

MORE WEBINARS

Trending Sources

Enrich your serverless data lake with Amazon Bedrock

AWS Big Data

SEPTEMBER 26, 2024

For many organizations, this centralized data store follows a data lake architecture. Although data lakes provide a centralized repository, making sense of this data and extracting valuable insights can be challenging. Clean up To avoid incurring future charges, delete the resources you created.

Data Lake

Data Lake Cost-Benefit Unstructured Data Modeling

Webinars

Data Talks, CFOs Listen: Why Analytics Are Key To Better Spend Management

Mastering Apache Airflow® 3.0: What’s New (and What’s Next) for Data Orchestration

MORE WEBINARS

Design a data mesh pattern for Amazon EMR-based data lakes using AWS Lake Formation with Hive metastore federation

AWS Big Data

JUNE 10, 2024

Use cases for Hive metastore federation for Amazon EMR Hive metastore federation for Amazon EMR is applicable to the following use cases: Governance of Amazon EMR-based data lakes – Producers generate data within their AWS accounts using an Amazon EMR-based data lake supported by EMRFS on Amazon Simple Storage Service (Amazon S3)and HBase.

Data Lake

Data Lake Metadata Data Warehouse Data Processing

Eight Top DataOps Trends for 2022

DataKitchen

NOVEMBER 29, 2021

A domain has an important job and a dedicated team – five to nine members – who develop an intimate knowledge of data sources, data consumers and functional nuances. For example, managing ordered data dependencies, inter-domain communication, shared infrastructure, and incoherent workflows. Rise of the DataOps Engineer.

Testing

Testing Data Lake Data Architecture Manufacturing

Amazon Q data integration adds DataFrame support and in-prompt context-aware job creation

AWS Big Data

DECEMBER 20, 2024

The DataFrame code generation now extends beyond AWS Glue DynamicFrame to support a broader range of data processing scenarios. Next, the merged data is filtered to include only a specific geographic region. Then the transformed output data is saved to Amazon S3 for further processing in future.

Data Integration

Data Integration Visualization Data Processing Big Data

How EUROGATE established a data mesh architecture using Amazon DataZone

AWS Big Data

JANUARY 15, 2025

In addition to real-time analytics and visualization, the data needs to be shared for long-term data analytics and machine learning applications. The applications are hosted in dedicated AWS accounts and require a BI dashboard and reporting services based on Tableau. datazone_env_twinsimsilverdata"."cycle_end";')

IoT

IoT Machine Learning Metadata Data-driven

Orca Security’s journey to a petabyte-scale data lake with Apache Iceberg and AWS Analytics

AWS Big Data

JULY 20, 2023

With data becoming the driving force behind many industries today, having a modern data architecture is pivotal for organizations to be successful. In this post, we describe Orca’s journey building a transactional data lake using Amazon Simple Storage Service (Amazon S3), Apache Iceberg, and AWS Analytics.

Data Lake

Data Lake Analytics Snapshot Data Quality

Scaling RISE with SAP data and AWS Glue

AWS Big Data

NOVEMBER 29, 2024

Customers often want to augment and enrich SAP source data with other non-SAP source data. Such analytic use cases can be enabled by building a data warehouse or data lake. Customers can now use the AWS Glue SAP OData connector to extract data from SAP.

Visualization

Visualization Data Processing Data-driven Cost-Benefit

Achieve data resilience using Amazon OpenSearch Service disaster recovery with snapshot and restore

AWS Big Data

NOVEMBER 11, 2024

The workflow consists of the following initial steps: OpenSearch Service is hosted in the primary Region, and all the active traffic is routed to the OpenSearch Service domain in the primary Region. Samir works directly with enterprise customers to design and build customized solutions catered to their data analytics and cybersecurity needs.

Snapshot

Snapshot Strategy Dashboards Data Lake

Create an Apache Hudi-based near-real-time transactional data lake using AWS DMS, Amazon Kinesis, AWS Glue streaming ETL, and data visualization using Amazon QuickSight

AWS Big Data

AUGUST 3, 2023

With the rapid growth of technology, more and more data volume is coming in many different formats—structured, semi-structured, and unstructured. Data analytics on operational data at near-real time is becoming a common need. Then we can query the data with Amazon Athena visualize it in Amazon QuickSight.

Data Lake

Data Lake Visualization Dashboards Insurance

Join a streaming data source with CDC data for real-time serverless data analytics using AWS Glue, AWS DMS, and Amazon DynamoDB

AWS Big Data

MAY 30, 2023

Customers have been using data warehousing solutions to perform their traditional analytics tasks. Traditional batch ingestion and processing pipelines that involve operations such as data cleaning and joining with reference data are straightforward to create and cost-efficient to maintain. options(**additional_options).mode("append").save(s3_output_folder)

Data Lake

Data Lake Data Analytics Analytics Data Processing

Accomplish Agile Business Intelligence & Analytics For Your Business

datapine

APRIL 15, 2020

You need to determine if you are going with an on-premise or cloud-hosted strategy. These basic steps will enable you to deliver agile data analytics and BI methodology into practice, no matter the size of your company. Top 10 Tips For Agile BI & Analytics Development. Construction Iterations.

Business Intelligence

Business Intelligence Analytics Testing Dashboards

DS Smith sets a single-cloud agenda for sustainability

CIO Business Intelligence

DECEMBER 6, 2023

Its digital transformation began with an application modernization phase, in which Dickson and her IT teams determined which applications should be hosted in the public cloud and which should remain on a private cloud. Here, Dickson sees data generated from its industrial machines being very productive.

Manufacturing

Manufacturing Data Lake Machine Learning Digital Transformation

10 Things AWS Can Do for Your SaaS Company

Smart Data Collective

FEBRUARY 20, 2022

Data storage databases. Your SaaS company can store and protect any amount of data using Amazon Simple Storage Service (S3), which is ideal for data lakes, cloud-native applications, and mobile apps. Well, let’s find out. Artificial intelligence (AI). Easy to use.

Cost-Benefit

Cost-Benefit Data Lake Software Machine Learning

Announcing the 2020 Data Impact Award Winners

Cloudera

NOVEMBER 18, 2020

It hosts over 150 big data analytics sandboxes across the region with over 200 users utilizing the sandbox for data discovery. With this functionality, business units can now leverage big data analytics to develop better and faster insights to help achieve better revenues, higher productivity, and decrease risk. .

Internet Publishing and Broadcasting

Internet Publishing and Broadcasting Data-driven Broadcasting Digital Transformation

Use AWS Glue to streamline SFTP data processing

AWS Big Data

AUGUST 13, 2024

It enables you to visually create, run, and monitor extract, transform, and load (ETL) pipelines to load data into your data lakes. Introducing the SFTP connector for AWS Glue The SFTP connector for AWS Glue simplifies the process of connecting AWS Glue jobs to extract data from SFTP storage and to load data into SFTP storage.

Data Processing

Data Processing Visualization Data Lake Data Processing

Habib Bank manages data at scale with Cloudera Data Platform

Cloudera

NOVEMBER 17, 2022

The Solution: CDP Private Cloud brings a next-generation hybrid architecture with cloud-native benefits to HBL’s data platform. HBL started their data journey in 2019 when data lake initiative was started to consolidate complex data sources and enable the bank to use single version of truth for decision making.

Management

Management Data Lake Consulting Unstructured Data

How Gilead used Amazon Redshift to quickly and cost-effectively load third-party medical claims data

AWS Big Data

NOVEMBER 8, 2023

They recently needed to do a monthly load of 140 TB of uncompressed healthcare claims data in under 24 hours after receiving it to provide analysts and data scientists with up-to-date information on a patient’s healthcare journey. This data volume is expected to increase monthly and is fully refreshed each month.

Data Lake

Data Lake Data Warehouse Cost-Benefit Optimization

Modernize your ETL platform with AWS Glue Studio: A case study from BMS

AWS Big Data

DECEMBER 13, 2023

For the past 5 years, BMS has used a custom framework called Enterprise Data Lake Services (EDLS) to create ETL jobs for business users. BMS’s EDLS platform hosts over 5,000 jobs and is growing at 15% YoY (year over year). About the authors Sivaprasad Mahamkali is a Senior Streaming Data Engineer at AWS Professional Services.

Metadata

Metadata Data Lake Visualization Data Quality

How smava makes loans transparent and affordable using Amazon Redshift Serverless

AWS Big Data

DECEMBER 21, 2023

To bring their customers the best deals and user experience, smava follows the modern data architecture principles with a data lake as a scalable, durable data store and purpose-built data stores for analytical processing and data consumption.

Data Lake

Data Lake Data Warehouse Data-driven B2B

Empowering data-driven excellence: How the Bluestone Data Platform embraced data mesh for success

AWS Big Data

FEBRUARY 27, 2024

Each data producer within the organization has its own data lake in Apache Hudi format, ensuring data sovereignty and autonomy. These datasets are pivotal for reporting and analytics use cases, powered by services like Amazon Redshift and tools like Power BI.

Data-driven

Data-driven Data Lake Data Quality Data Governance

Unlock scalable analytics with a secure connectivity pattern in AWS Glue to read from or write to Snowflake

AWS Big Data

AUGUST 19, 2024

This involves creating VPC endpoints in both the AWS and Snowflake VPCs, making sure data transfer remains within the AWS network. Use Amazon Route 53 to create a private hosted zone that resolves the Snowflake endpoint within your VPC. He has helped technology companies design and implement data analytics solutions and products.

Analytics

Analytics Data-driven Data Integration Data Lake

Access Amazon Athena in your applications using the WebSocket API

AWS Big Data

MARCH 2, 2023

Many organizations are building data lakes to store and analyze large volumes of structured, semi-structured, and unstructured data. In addition, many teams are moving towards a data mesh architecture, which requires them to expose their data sets as easily consumable data products.

Data Lake

Data Lake Testing Interactive Unstructured Data

Implement alerts in Amazon OpenSearch Service with PagerDuty

AWS Big Data

JUNE 8, 2023

For Host , enter events.PagerDuty.com. At AWS, he is focused on Data Lake implementations, and Search, Analytical workloads using Amazon OpenSearch Service. Vivek Shrivastava is a Principal Data Architect, Data Lake in AWS Professional Services. For Channel type , choose Custom webhook.

Data Lake

Data Lake Dashboards Metrics Testing

How ANZ Institutional Division built a federated data platform to enable their domain teams to build data products to support business outcomes

AWS Big Data

DECEMBER 4, 2024

These nodes can implement analytical platforms like data lake houses, data warehouses, or data marts, all united by producing data products. By treating the data as a product, the outcome is a reusable asset that outlives a project and meets the needs of the enterprise consumer.

Metadata

Metadata Data Governance Data Quality Data-driven

Migrate a petabyte-scale data warehouse from Actian Vectorwise to Amazon Redshift

AWS Big Data

MAY 30, 2024

Amazon Redshift is a fast, scalable, and fully managed cloud data warehouse that allows you to process and run your complex SQL analytics workloads on structured and semi-structured data. The system had an integration with legacy backend services that were all hosted on premises.

Data Warehouse

Data Warehouse Data Lake Cost-Benefit Structured Data

Run Spark SQL on Amazon Athena Spark

AWS Big Data

OCTOBER 23, 2023

Modern applications store massive amounts of data on Amazon Simple Storage Service (Amazon S3) data lakes, providing cost-effective and highly durable storage, and allowing you to run analytics and machine learning (ML) from your data lake to generate insights on your data.

Data Lake

Data Lake Visualization Optimization Interactive

Periscope Data Expands to Israel, Empowering Data Teams with Powerful Tools

Sisense

DECEMBER 11, 2019

This was the key learning from the Sisense event heralding the launch of Periscope Data in Tel Aviv, Israel — the beating heart of the startup nation. An exciting slate of presentations took them on a journey from why to how they should use data analytics to optimize their operations successfully and maximize their business opportunities.

Data Lake

Data Lake Big Data Sales Data-driven

The disruptive potential of open data lakehouse architectures and IBM watsonx.data

IBM Big Data Hub

JUNE 15, 2023

It is comprised of commodity cloud object storage, open data and open table formats, and high-performance open-source query engines. To help organizations scale AI workloads, we recently announced IBM watsonx.data , a data store built on an open data lakehouse architecture and part of the watsonx AI and data platform.

Data Warehouse

Data Warehouse Data Lake Optimization Data-driven

Real-time streaming data top picks you cannot miss at AWS re:Invent 2023

AWS Big Data

NOVEMBER 8, 2023

Putting your data to work with generative AI – Innovation Talk Thursday, November 30 | 12:30 – 1:30 PM PST | The Venetian Join Mai-Lan Tomsen Bukovec, Vice President, Technology at AWS to learn how you can turn your data lake into a business advantage with generative AI. Reserve your seat now! Reserve your seat now!

Data-driven

Data-driven Machine Learning Data Lake Cost-Benefit

TDC Digital leverages IBM Cloud for transparent billing and improved customer satisfaction

IBM Big Data Hub

MAY 19, 2023

Furthermore, TDC Digital had not used any cloud storage solution and experienced latency and downtime while hosting the application in its data center. TDC Digital is excited about its plans to host its IT infrastructure in IBM data centers, offering better scalability, performance and security.

Unstructured Data

Unstructured Data Data Processing Manufacturing Data Lake

Build incremental data pipelines to load transactional data changes using AWS DMS, Delta 2.0, and Amazon EMR Serverless

AWS Big Data

MARCH 3, 2023

Building data lakes from continuously changing transactional data of databases and keeping data lakes up to date is a complex task and can be an operational challenge. You can then apply transformations and store data in Delta format for managing inserts, updates, and deletes.

Data Lake

Data Lake Dashboards Metrics Metadata

How Amazon Finance Automation built a data mesh to support distributed data ownership and centralize governance

AWS Big Data

JULY 14, 2023

The workflow contains the following steps: Data is saved by the producer in their own Amazon Simple Storage Service (Amazon S3) buckets. Data source locations hosted by the producer are created within the producer’s AWS Glue Data Catalog. Data source locations are registered with Lake Formation.

Finance

Finance Metadata Big Data Recreation/Entertainment

Break data silos and stream your CDC data with Amazon Redshift streaming and Amazon MSK

AWS Big Data

DECEMBER 13, 2023

Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications. This solution uses Amazon Aurora MySQL hosting the example database salesdb. Vishal Khatri is a Sr.

Data Warehouse

Data Warehouse Snapshot Data Processing Internet of Things

Attribute Amazon EMR on EC2 costs to your end-users

AWS Big Data

AUGUST 27, 2024

About the Authors Raj Patel is AWS Lead Consultant for Data Analytics solutions based out of India. He specializes in building and modernising analytical solutions. His background is in data warehouse/data lake – architecture, development and administration.

Metrics

Metrics Dashboards Data Lake Optimization

Create an end-to-end data strategy for Customer 360 on AWS

AWS Big Data

MARCH 26, 2024

The AWS modern data architecture shows a way to build a purpose-built, secure, and scalable data platform in the cloud. Learn from this to build querying capabilities across your data lake and the data warehouse. About the Authors Ismail Makhlouf is a Senior Specialist Solutions Architect for Data Analytics at AWS.

Data Strategy

Data Strategy Strategy Data Warehouse Prescriptive Analytics

Accelerating revenue growth with real-time analytics: Poshmark’s journey

AWS Big Data

MARCH 20, 2023

In this post, we share how Poshmark improved CX and accelerated revenue growth by using a real-time analytics solution. High-level challenge: The need for real-time analytics Previous efforts at Poshmark for improving CX through analytics were based on batch processing of analytics data and using it on a daily basis to improve CX.

Analytics

Analytics Data Processing Slice and Dice Data Lake

Use your corporate identities for analytics with Amazon EMR and AWS IAM Identity Center

AWS Big Data

APRIL 26, 2024

Set up EMR Studio In this step, we demonstrate the actions needed from the data lake administrator to set up EMR Studio enabled for trusted identity propagation and with IAM Identity Center integration. On the Lake Formation console, choose Data lake permissions under Permissions in the navigation pane.

Analytics

Analytics Data Lake Management Enterprise

Announcing the 2021 Data Impact Awards

Cloudera

MAY 12, 2021

2020 saw us hosting our first ever fully digital Data Impact Awards ceremony, and it certainly was one of the highlights of our year. We saw a record number of entries and incredible examples of how customers were using Cloudera’s platform and services to unlock the power of data.

Digital Transformation

Digital Transformation Machine Learning Optimization Data Lake

Build a pseudonymization service on AWS to protect sensitive data: Part 2

AWS Big Data

MARCH 6, 2024

The account on the right hosts the pseudonymization service, which you can deploy using the instructions provided in the Part 1 of this series. For an overview of how to build an ACID compliant data lake using Iceberg, refer to Build a high-performance, ACID compliant, evolving data lake using Apache Iceberg on Amazon EMR.

Statistics

Statistics Metrics Testing Data Lake

Unlock scalability, cost-efficiency, and faster insights with large-scale data migration to Amazon Redshift

AWS Big Data

AUGUST 1, 2024

We can determine the following are needed: An open data format ingestion architecture processing the source dataset and refining the data in the S3 data lake. This requires a dedicated team of 3–7 members building a serverless data lake for all data sources. Vijay Bagur is a Sr.

Data Warehouse

Data Warehouse KPI Optimization Cost-Benefit

Introducing Amazon EMR on EKS job submission with Spark Operator and spark-submit

AWS Big Data

JUNE 6, 2023

Verify the job by running the following command: kubectl get pods -n data-team-a Enable access to the Spark UI The Spark UI is an important tool for data engineers because it allows you to track the progress of tasks, view detailed job and stage information, and analyze resource utilization to identify bottlenecks and optimize your code.

Optimization

Optimization Data Lake Cost-Benefit Management

The Gartner 2021 Leadership Vision for Data & Analytics Leaders Webinar Q&A

Andrew White

JANUARY 11, 2021

On January 4th I had the pleasure of hosting a webinar. It was titled, The Gartner 2021 Leadership Vision for Data & Analytics Leaders. This was for the Chief Data Officer, or head of data and analytics. Does Data warehouse as a software tool will play role in future of Data & Analytics strategy?

Data Analytics

Data Analytics Analytics Data-driven Finance

Modernize your legacy databases with AWS data lakes, Part 2: Build a data lake using AWS DMS data on Apache Iceberg

Important Considerations When Migrating to a Data Lake

Webinars

Trending Sources

Enrich your serverless data lake with Amazon Bedrock

Webinars

Design a data mesh pattern for Amazon EMR-based data lakes using AWS Lake Formation with Hive metastore federation

Eight Top DataOps Trends for 2022

Amazon Q data integration adds DataFrame support and in-prompt context-aware job creation

How EUROGATE established a data mesh architecture using Amazon DataZone

Orca Security’s journey to a petabyte-scale data lake with Apache Iceberg and AWS Analytics

Scaling RISE with SAP data and AWS Glue

Achieve data resilience using Amazon OpenSearch Service disaster recovery with snapshot and restore

Create an Apache Hudi-based near-real-time transactional data lake using AWS DMS, Amazon Kinesis, AWS Glue streaming ETL, and data visualization using Amazon QuickSight

Join a streaming data source with CDC data for real-time serverless data analytics using AWS Glue, AWS DMS, and Amazon DynamoDB

Accomplish Agile Business Intelligence & Analytics For Your Business

DS Smith sets a single-cloud agenda for sustainability

10 Things AWS Can Do for Your SaaS Company

Announcing the 2020 Data Impact Award Winners

Use AWS Glue to streamline SFTP data processing

Habib Bank manages data at scale with Cloudera Data Platform

How Gilead used Amazon Redshift to quickly and cost-effectively load third-party medical claims data

Modernize your ETL platform with AWS Glue Studio: A case study from BMS

How smava makes loans transparent and affordable using Amazon Redshift Serverless

Empowering data-driven excellence: How the Bluestone Data Platform embraced data mesh for success

Unlock scalable analytics with a secure connectivity pattern in AWS Glue to read from or write to Snowflake

Access Amazon Athena in your applications using the WebSocket API

Implement alerts in Amazon OpenSearch Service with PagerDuty

How ANZ Institutional Division built a federated data platform to enable their domain teams to build data products to support business outcomes

Migrate a petabyte-scale data warehouse from Actian Vectorwise to Amazon Redshift

Run Spark SQL on Amazon Athena Spark

Periscope Data Expands to Israel, Empowering Data Teams with Powerful Tools

The disruptive potential of open data lakehouse architectures and IBM watsonx.data

Real-time streaming data top picks you cannot miss at AWS re:Invent 2023

TDC Digital leverages IBM Cloud for transparent billing and improved customer satisfaction

Build incremental data pipelines to load transactional data changes using AWS DMS, Delta 2.0, and Amazon EMR Serverless

How Amazon Finance Automation built a data mesh to support distributed data ownership and centralize governance

Break data silos and stream your CDC data with Amazon Redshift streaming and Amazon MSK

Attribute Amazon EMR on EC2 costs to your end-users

Create an end-to-end data strategy for Customer 360 on AWS

Accelerating revenue growth with real-time analytics: Poshmark’s journey

Use your corporate identities for analytics with Amazon EMR and AWS IAM Identity Center

Announcing the 2021 Data Impact Awards

Build a pseudonymization service on AWS to protect sensitive data: Part 2

Unlock scalability, cost-efficiency, and faster insights with large-scale data migration to Amazon Redshift

Introducing Amazon EMR on EKS job submission with Spark Operator and spark-submit

The Gartner 2021 Leadership Vision for Data & Analytics Leaders Webinar Q&A

Stay Connected