This is part two of a three-part series where we show how to build a data lake on AWS using a modern data architecture. This post shows how to load data from a legacy database (SQL Server) into a transactional data lake (Apache Iceberg) using AWS Glue. To start the job, choose Run.
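In Glue's Spark runtime, the Iceberg table catalog is wired to the Glue Data Catalog through session configuration. A minimal sketch, assuming standard Iceberg-on-Glue settings; the catalog name glue_catalog and the S3 warehouse path are hypothetical placeholders:

    from pyspark.sql import SparkSession

    # Sketch only: "glue_catalog" and the S3 warehouse path are placeholders.
    spark = (
        SparkSession.builder.appName("sqlserver-to-iceberg")
        .config("spark.sql.catalog.glue_catalog",
                "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.glue_catalog.catalog-impl",
                "org.apache.iceberg.aws.glue.GlueCatalog")
        .config("spark.sql.catalog.glue_catalog.warehouse",
                "s3://my-bucket/iceberg-warehouse/")
        .config("spark.sql.catalog.glue_catalog.io-impl",
                "org.apache.iceberg.aws.s3.S3FileIO")
        .getOrCreate()
    )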
The DataFrame code generation now extends beyond AWS Glue DynamicFrame to support a broader range of data processing scenarios. The TICKIT dataset records sales activities on the fictional TICKIT website, where users can purchase and sell tickets online for different types of events such as sports games, shows, and concerts.
For many organizations, this centralized data store follows a data lake architecture. Although data lakes provide a centralized repository, making sense of this data and extracting valuable insights can be challenging. Amazon S3 emits an object created event, which matches an EventBridge rule.
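A rule of that shape can be created with the EventBridge API. This is a sketch, with the rule and bucket names as hypothetical placeholders; a target still has to be attached separately with put_targets:

    import json
    import boto3

    events = boto3.client("events")

    # Sketch: rule and bucket names are placeholders; the pattern follows the
    # standard S3-to-EventBridge "Object Created" notification shape.
    events.put_rule(
        Name="s3-object-created-rule",
        EventPattern=json.dumps({
            "source": ["aws.s3"],
            "detail-type": ["Object Created"],
            "detail": {"bucket": {"name": ["my-landing-bucket"]}},
        }),
        State="ENABLED",
    )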
Disaster recovery is vital for organizations, offering a proactive strategy to mitigate the impact of unforeseen events like system failures, natural disasters, or cyberattacks. In the event of data loss or system failure, these snapshots will be used to restore the domain to a specific point in time.
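Restoring from such a snapshot goes through the standard OpenSearch _snapshot API. A sketch, assuming basic-auth access; the domain endpoint, repository, snapshot, and index pattern are placeholders:

    import requests

    # Sketch: endpoint, repository, snapshot, and index pattern are placeholders.
    DOMAIN = "https://my-domain.us-east-1.es.amazonaws.com"
    resp = requests.post(
        f"{DOMAIN}/_snapshot/my-snapshot-repo/snapshot-2024-01-01/_restore",
        json={"indices": "logs-*", "include_global_state": False},
        auth=("admin", "admin-password"),  # IAM-secured domains need SigV4 instead
    )
    resp.raise_for_status()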
However, enterprises often encounter challenges with data silos, insufficient access controls, poor governance, and quality issues. Embracing data as a product is the key to address these challenges and foster a data-driven culture. To incorporate this third-party data, AWS Data Exchange is the logical choice.
cycle_end"', "sagemakedatalakeenvironment_sub_db", ctas_approach=False) A similar approach is used to connect to shared data from Amazon Redshift, which is also shared using Amazon DataZone. The applications are hosted in dedicated AWS accounts and require a BI dashboard and reporting services based on Tableau.
Security Lake automatically centralizes security data from cloud, on-premises, and custom sources into a purpose-built data lake stored in your account. With Security Lake, you can get a more complete understanding of your security data across your entire organization.
Near-real-time analytics on operational data is becoming a common need. Due to the exponential growth of data volume, it has become common practice to replace read replicas with data lakes for better scalability and performance. For more information, see Changing the default settings for your data lake.
As a global company with more than 6,000 employees, BMC faces many of the same data challenges that other large enterprises face. The organization has 500 applications for business services, 80,000 VMs, 3,000 hosts, and more than 100,000 containers. Given the sheer volume of enterprise data, it’s impossible to do this manually.
At the same time, they need to optimize operational costs to unlock the value of this data for timely insights and do so with consistent performance. With this massive data growth, data proliferation across your data stores, data warehouse, and data lakes can become equally challenging.
With AWS Glue, you can discover and connect to hundreds of diverse data sources and manage your data in a centralized data catalog. It enables you to visually create, run, and monitor extract, transform, and load (ETL) pipelines to load data into your data lakes. Choose Store a new secret.
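The console's Store a new secret step maps to a single Secrets Manager call. A sketch, with the secret name and values as hypothetical placeholders, so that a Glue connection can later retrieve the database credentials:

    import json
    import boto3

    # Sketch: secret name and values are placeholders for the database
    # credentials a Glue connection would look up at run time.
    secrets = boto3.client("secretsmanager")
    secrets.create_secret(
        Name="glue/sqlserver-credentials",
        SecretString=json.dumps({"username": "etl_user", "password": "change-me"}),
    )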
The AWS Glue Data Catalog provides a uniform repository where disparate systems can store and find metadata to keep track of data in data silos. Apache Flink is a widely used data processing engine for scalable streaming ETL, analytics, and event-driven applications.
It also makes it easier for engineers, data scientists, product managers, analysts, and business users to access data throughout an organization to discover, use, and collaborate to derive data-driven insights. Note that a managed data asset is an asset for which Amazon DataZone can manage permissions.
With data growing at a staggering rate, managing and structuring it is vital to your survival. In our Event Spotlight series, we cover the biggest industry events helping builders learn about the latest tech, trends, and people innovating in the space. In this piece, we detail the Israeli debut of Periscope Data.
These nodes can implement analytical platforms like data lakehouses, data warehouses, or data marts, all united by producing data products. By treating the data as a product, the outcome is a reusable asset that outlives a project and meets the needs of the enterprise consumer.
Automated alerting through a third-party service like PagerDuty, an incident management platform, combined with the robust alerting plugin provided by OpenSearch Service, lets businesses proactively manage and respond to critical events. For Host, enter events.pagerduty.com. Leave the defaults and choose Next.
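The same custom webhook destination can be created through the API. This is a sketch assuming the legacy alerting destinations endpoint (newer domains use the notifications plugin's channels API instead); the domain endpoint and credentials are placeholders, and the host matches the console step above:

    import requests

    # Sketch: endpoint and credentials are placeholders; /v2/enqueue is the
    # PagerDuty Events API v2 path.
    DOMAIN = "https://my-domain.us-east-1.es.amazonaws.com"
    destination = {
        "name": "pagerduty-destination",
        "type": "custom_webhook",
        "custom_webhook": {
            "host": "events.pagerduty.com",
            "port": 443,
            "path": "/v2/enqueue",
            "scheme": "HTTPS",
            "header_params": {"Content-Type": "application/json"},
        },
    }
    resp = requests.post(
        f"{DOMAIN}/_plugins/_alerting/destinations",
        json=destination,
        auth=("admin", "admin-password"),
    )
    resp.raise_for_status()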
Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications. This solution uses Amazon Aurora MySQL hosting the example database salesdb.
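A minimal consumer sketch for change events streamed out of salesdb, using the kafka-python client; the topic name, broker addresses, and JSON message format are assumptions:

    import json
    from kafka import KafkaConsumer  # kafka-python package

    # Sketch: topic and brokers are placeholders; messages are assumed to be
    # UTF-8 JSON change events.
    consumer = KafkaConsumer(
        "salesdb.sales.orders",
        bootstrap_servers=["broker1:9092", "broker2:9092"],
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
        auto_offset_reset="earliest",
    )
    for message in consumer:
        print(message.value)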
During the first-ever virtual broadcast of our annual Data Impact Awards (DIA) ceremony, we had the great pleasure of announcing this year’s finalists and winners. In a year marked by unusual events, and disruption to our “normal” lives, it was a pleasure to recognize our customers’ most impressive achievements.
Data lakes are designed for storing vast amounts of raw, unstructured, or semi-structured data at a low cost, and organizations share those datasets across multiple departments and teams. The queries on these large datasets read vast amounts of data and can perform complex join operations on multiple datasets.
To bring their customers the best deals and user experience, smava follows the modern data architecture principles with a data lake as a scalable, durable data store and purpose-built data stores for analytical processing and data consumption.
Amazon Redshift, a data warehousing service, offers a variety of options for ingesting data from diverse sources into its high-performance, scalable environment. Its native COPY feature uses massively parallel processing (MPP) to load objects directly from data sources into Redshift tables. Launch an Amazon EMR (emr-6.9.0) cluster.
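A sketch of issuing such a COPY through the Redshift Data API; the cluster, database, table, S3 path, and IAM role are hypothetical placeholders:

    import boto3

    # Sketch: all identifiers below are placeholders. COPY pulls the S3
    # objects into the target table in parallel across the cluster's slices.
    rsd = boto3.client("redshift-data")
    rsd.execute_statement(
        ClusterIdentifier="my-cluster",
        Database="dev",
        DbUser="awsuser",
        Sql="""
            COPY sales
            FROM 's3://my-bucket/tickit/sales/'
            IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
            FORMAT AS CSV;
        """,
    )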
Many organizations are building data lakes to store and analyze large volumes of structured, semi-structured, and unstructured data. In addition, many teams are moving towards a data mesh architecture, which requires them to expose their datasets as easily consumable data products.
Over the past decade, deep learning arose from a seismic collision of data availability and sheer compute power, enabling a host of impressive AI capabilities. Our work in this area includes FairIJ, which identifies biased data points in data used to tune a model, so that they can be edited out.
Putting your data to work with generative AI – Innovation Talk Thursday, November 30 | 12:30 – 1:30 PM PST | The Venetian. Join Mai-Lan Tomsen Bukovec, Vice President, Technology at AWS, to learn how you can turn your data lake into a business advantage with generative AI. Reserve your seat now!
These models allow us to predict failures early, and we forecast a 20% reduction in unplanned furnace events, improving repair times by at least two days. Also, last August we ran an AI immersion day, which CEO Jim Fitterling and I co-hosted for our top 200 leaders. So AI helps us have fewer emergencies.
Customers have been using data warehousing solutions to perform their traditional analytics tasks. Recently, data lakes have gained a lot of traction to become the foundation for analytical solutions, because they come with benefits such as scalability, fault tolerance, and support for structured, semi-structured, and unstructured datasets.
Those decentralization efforts appeared under different monikers through time, e.g., data marts versus data warehousing implementations (a popular architectural debate in the era of structured data), then enterprise-wide data lakes versus smaller, typically BU-specific, “data ponds.”
Cloudera’s Data Warehouse service allows raw data to be stored in the cloud storage of your choice (S3, ADLSg2). Data stays in your own namespace, and you are not forced to move it into someone else’s proprietary file formats or hosted storage. Proprietary file formats mean no one else is invited in! Separate compute.
This enables you to use your data to acquire new insights for your business and customers. The objective of a disaster recovery plan is to reduce disruption by enabling quick recovery in the event of a disaster that leads to system failure. In the event of a cluster failure, you must restore the cluster from a snapshot.
At Stitch Fix, we have been powered by data science since its foundation and rely on many modern data lake and data processing technologies. In our infrastructure, Apache Kafka has emerged as a powerful tool for managing event streams and facilitating real-time data processing.
Cloudera VP of Engineering and NiFi founder Joe Witt, joined by principal committers Mark Payne and Matt Gillman, addressed the global community via a virtual event dubbed “Meet the Committers.” As part of the event, on the verge of the NiFi 2.0 release, Cloudera kicked off the “Best in Flow” contest.
Set up EMR Studio

In this step, we demonstrate the actions needed from the data lake administrator to set up EMR Studio enabled for trusted identity propagation and with IAM Identity Center integration. On the Lake Formation console, choose Data lake permissions under Permissions in the navigation pane.
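Granting those data lake permissions can also be done through the Lake Formation API. A sketch, with the principal, database, and table names as hypothetical placeholders:

    import boto3

    # Sketch: principal, database, and table are placeholders; this is the API
    # equivalent of a grant made on the Lake Formation console.
    lf = boto3.client("lakeformation")
    lf.grant_permissions(
        Principal={
            "DataLakePrincipalIdentifier":
                "arn:aws:iam::123456789012:role/EMRStudioUserRole"
        },
        Resource={"Table": {"DatabaseName": "analytics_db", "Name": "sales"}},
        Permissions=["SELECT", "DESCRIBE"],
    )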
Amazon Redshift is a fast, scalable, and fully managed cloud data warehouse that allows you to process and run your complex SQL analytics workloads on structured and semi-structured data. It also helps you securely access your data in operational databases, data lakes, or third-party datasets with minimal movement or copying of data.
While there is more of a push to use cloud data for off-site backup , this method comes with its own caveats. In the event of a network shutdown or failure, it may take much longer to restore functionality (and therefore connection) to a cloud-hosted off-site backup. Big Data Storage Concerns.
Poshmark wanted to address the following business use cases via the real-time analytics platform: Sessionization – Poshmark captures both server-side application events and client-side tracking events. They wanted to use these events to identify and analyze user sessions to track behavior. The event data format is nested JSON.
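To make the sessionization step concrete, here is a minimal sketch that splits one user's time-ordered events into sessions wherever the gap between consecutive events exceeds an inactivity threshold; the event shape and the 30-minute threshold are assumptions, not Poshmark's actual pipeline:

    from datetime import timedelta

    SESSION_GAP = timedelta(minutes=30)  # assumed inactivity threshold

    def sessionize(events):
        """Split time-ordered events (dicts with a 'ts' datetime) into sessions
        wherever the gap between consecutive events exceeds SESSION_GAP."""
        sessions, current = [], []
        for event in events:
            if current and event["ts"] - current[-1]["ts"] > SESSION_GAP:
                sessions.append(current)
                current = []
            current.append(event)
        if current:
            sessions.append(current)
        return sessions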
Additionally, lines of business (LOBs) are able to gain access to a shared data lake that is secured and governed by the use of Cloudera Shared Data Experience (SDX). According to 451 Research’s Voice of the Enterprise: Cloud, Hosting & Managed Services study, 58% of enterprises are moving towards a hybrid IT environment.
The data lakehouse is gaining in popularity because it enables a single platform for all your enterprise data with the flexibility to run any analytic and machine learning (ML) use case. Cloud data lakehouses provide significant scaling, agility, and cost advantages compared to cloud data lakes and cloud data warehouses.
Amazon Redshift is a popular cloud data warehouse, offering a fully managed cloud-based service that seamlessly integrates with an organization’s Amazon Simple Storage Service (Amazon S3) data lake, real-time streams, machine learning (ML) workflows, transactional workflows, and much more, all while providing up to 7.9x better price performance.
Episode 4: Unlocking the Value of Enterprise AI with Data Engineering Capabilities. They discuss how the data engineering team is instrumental in easing collaboration between analysts, data scientists, and ML engineers to build enterprise AI solutions.
Two private subnets are used to set up the Amazon MWAA environment, and the third private subnet is used to host the AWS Lambda authorizer function. Vijay Velpula is a Data Lake Architect with AWS Professional Services. He assists customers in building modern data platforms by implementing big data and analytics solutions.
You can subscribe to data products that help enrich customer profiles, for example demographics data, advertising data, and financial markets data. Amazon Kinesis ingests streaming events in real time from point-of-sales systems, clickstream data from mobile apps and websites, and social media data.
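A sketch of putting one clickstream event on a Kinesis stream; the stream name and payload are hypothetical. PartitionKey determines shard assignment, so keying by user keeps each user's events ordered within a shard:

    import json
    import boto3

    # Sketch: stream name and payload are placeholders.
    kinesis = boto3.client("kinesis")
    kinesis.put_record(
        StreamName="clickstream-events",
        Data=json.dumps(
            {"user_id": "u-123", "action": "page_view", "page": "/deals"}
        ).encode("utf-8"),
        PartitionKey="u-123",
    )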
For this post, we use the TICKIT database created in an Amazon Redshift data warehouse, which consists of seven tables: two fact tables and five dimensions, as shown in the following figure. Choose the data source you created in the previous step. On the Datasets page, choose New dataset. Choose Use custom SQL.
Verify the job by running the following command: kubectl get pods -n data-team-a

Enable access to the Spark UI

The Spark UI is an important tool for data engineers because it allows you to track the progress of tasks, view detailed job and stage information, and analyze resource utilization to identify bottlenecks and optimize your code.
Every data domain in NNEDH has isolated permissions synthesized by NNEDH as the central governance management layer. This is a similar pattern to what is adopted for other domain-oriented data management solutions. Two event-based mechanisms are used to maintain all the layers synchronized, as detailed in this section.