Your generated jobs can use a variety of data transformations, including filters, projections, unions, joins, and aggregations, giving you the flexibility to handle complex data processing requirements. In this post, we discuss how Amazon Q data integration transforms ETL workflow development.
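As a rough illustration of the transformation types listed above (filter, projection, union, join, aggregation), here is a minimal PySpark sketch; the S3 paths, table layouts, and column names are hypothetical, not taken from the post.

# Illustrative sketch only; paths and columns are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("generated-etl-sketch").getOrCreate()

orders = spark.read.parquet("s3://my-bucket/orders/")                  # hypothetical input
orders_2023 = orders.filter(F.col("order_year") == 2023)               # filter
slim = orders_2023.select("order_id", "customer_id", "amount")         # projection
backfill = spark.read.parquet("s3://my-bucket/orders_backfill/") \
                .select("order_id", "customer_id", "amount")
combined = slim.union(backfill)                                        # union
customers = spark.read.parquet("s3://my-bucket/customers/")
joined = combined.join(customers, on="customer_id", how="left")        # join
totals = joined.groupBy("customer_id") \
               .agg(F.sum("amount").alias("total_spend"))              # aggregation
totals.write.mode("overwrite").parquet("s3://my-bucket/output/customer_totals/")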
With the ability to browse metadata, you can understand the structure and schema of the data source, identify relevant tables and fields, and discover useful data assets you may not be aware of. On your project, in the navigation pane, choose Data. For Add data source, choose Add connection. Choose the plus sign.
I, thankfully, learned this early in my career, at a time when I could still refer to myself as a software developer. Especially when you consider how Certain Big Cloud Providers treat autoML as an on-ramp to model hosting. Is autoML the bait for long-term model hosting? But that's a story for another day.
Since reporting is part of an effective DQM, we will also go through some data quality metric examples you can use to assess your efforts. But first, let's define what data quality actually is. What is the definition of data quality? Why Do You Need Data Quality Management?
Together with price-performance, Amazon Redshift offers capabilities such as serverless architecture, machine learning integration within your data warehouse and secure data sharing across the organization. dbt Cloud is a hosted service that helps data teams productionize dbt deployments. Choose Create.
Traditionally, such a legacy call center analytics platform would be built on a relational database that stores data from streaming sources. Data transformations through stored procedures and the use of materialized views to curate datasets and generate insights are a known pattern with relational databases.
This involves creating VPC endpoints in both the AWS and Snowflake VPCs, making sure data transfer remains within the AWS network. Use Amazon Route 53 to create a private hosted zone that resolves the Snowflake endpoint within your VPC. For Data sources, search for and select Snowflake. Choose Create connection. Choose Next.
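The private hosted zone step can also be scripted. Below is a hedged boto3 sketch under the assumption that the zone name, VPC ID, and Region are placeholders you would replace with your own values; it is not the exact configuration from the post.

# Placeholder values throughout; create_hosted_zone with a VPC creates a private zone.
import time
import boto3

route53 = boto3.client("route53")
response = route53.create_hosted_zone(
    Name="privatelink.snowflakecomputing.com",                     # hypothetical DNS name
    VPC={"VPCRegion": "us-east-1", "VPCId": "vpc-0123456789abcdef0"},
    CallerReference=str(time.time()),                              # must be unique per call
    HostedZoneConfig={"Comment": "Resolves the Snowflake endpoint inside the VPC",
                      "PrivateZone": True},
)
print(response["HostedZone"]["Id"])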
citibike-tripdata-destination-ACCOUNT_ID – The bucket used for storing the transformed dataset. When implementing the solution in this post, replace references to airflow-blog-bucket-ACCOUNT_ID and citibike-tripdata-destination-ACCOUNT_ID with the names of your own S3 buckets. Choose Next. Run the DAG: let's look at how to run the DAGs.
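For orientation, a minimal Airflow DAG sketch is shown below, assuming a recent Airflow 2.x; the task body is only a stand-in for the trip-data transformation, and the bucket names are the placeholders mentioned above.

# Minimal sketch; the transformation logic is elided.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def transform_tripdata(**context):
    # Placeholder: read from airflow-blog-bucket-ACCOUNT_ID and write the
    # result to citibike-tripdata-destination-ACCOUNT_ID.
    pass

with DAG(
    dag_id="citibike_tripdata_transform",
    start_date=datetime(2024, 1, 1),
    schedule=None,          # trigger manually from the Airflow UI or CLI
    catchup=False,
) as dag:
    PythonOperator(task_id="transform_tripdata", python_callable=transform_tripdata)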
The currently available choices include: The Amazon Redshift COPY command can load data from Amazon Simple Storage Service (Amazon S3), Amazon EMR , Amazon DynamoDB , or remote hosts over SSH. This native feature of Amazon Redshift uses massive parallel processing (MPP) to load objects directly from data sources into Redshift tables.
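To make the Amazon S3 path concrete, here is a hedged sketch that issues a COPY statement through the redshift_connector driver; the cluster endpoint, credentials, table, bucket, and IAM role ARN are all placeholders.

# Sketch only; replace connection details, table, and role with your own.
import redshift_connector

conn = redshift_connector.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
    database="dev",
    user="awsuser",
    password="...",
)
cur = conn.cursor()
cur.execute("""
    COPY public.sales
    FROM 's3://my-bucket/sales/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS PARQUET
""")
conn.commit()
conn.close()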
Oracle GoldenGate for Oracle Database and Big Data adapters: Oracle GoldenGate is a real-time data integration and replication tool used for disaster recovery, data migrations, and high availability. You can use temporary credentials; for more details, refer to Using temporary credentials with AWS resources.
Uncomfortable truth incoming: Most people in your organization don’t think about the quality of their data from intake to production of insights. However, as a data team member, you know how important data integrity (and a whole host of other aspects of data management) is. Means of ensuring data integrity.
You can use your preferred IDE to implement AWS resource definition using the AWS Cloud Development Kit (AWS CDK) or AWS CloudFormation , and also the business logic of AWS Glue job scripts for data integration. To learn more about how to implement your AWS Glue job scripts locally, refer to Develop and test AWS Glue version 3.0
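As a sketch of what such an AWS CDK resource definition can look like in Python, the snippet below declares a Glue job with the L1 CfnJob construct; the job name, role ARN, and script location are placeholders rather than values from the post.

# Minimal AWS CDK v2 sketch; all names and ARNs are placeholders.
from aws_cdk import Stack, aws_glue as glue
from constructs import Construct

class EtlStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        glue.CfnJob(
            self, "NightlyEtlJob",
            name="nightly-etl",
            role="arn:aws:iam::123456789012:role/GlueJobRole",        # placeholder role
            command=glue.CfnJob.JobCommandProperty(
                name="glueetl",
                python_version="3",
                script_location="s3://my-bucket/scripts/etl_job.py",  # placeholder script
            ),
            glue_version="4.0",
        )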
You can also use the data transformation feature of Data Firehose to invoke a Lambda function to perform data transformation in batches. Refer to the instructions in the README file for steps on how to provision and decommission this solution. You're now ready to query the tables using Athena.
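For reference, a Lambda function used for Data Firehose transformation follows a fixed record contract: each record arrives base64-encoded and must be returned with the same recordId, a result status, and re-encoded data. The uppercase transform below is only a stand-in for real business logic.

# Sketch of the Firehose transformation contract; the transform itself is a placeholder.
import base64

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        payload = base64.b64decode(record["data"]).decode("utf-8")
        transformed = payload.upper()          # placeholder transformation
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",                    # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(transformed.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}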
Solution overview: The following diagram illustrates the solution architecture. The solution uses AWS Glue as an ETL engine to extract data from the source Amazon RDS database. Built-in data transformations then scrub columns containing PII using pre-defined masking functions. PII detection and scrubbing.
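The post relies on pre-defined masking functions; purely as an illustration of column scrubbing in PySpark, the sketch below masks the local part of an email column with a regular expression. The source path, column name, and pattern are hypothetical.

# Illustrative masking only, not the post's built-in functions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pii-scrub-sketch").getOrCreate()
df = spark.read.parquet("s3://my-bucket/customers_raw/")        # placeholder extract

masked = df.withColumn(
    "email",
    F.regexp_replace("email", r"[^@]+(@.*)", "****$1"),         # keep domain, mask local part
)
masked.write.mode("overwrite").parquet("s3://my-bucket/customers_masked/")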
Additionally, you can configure OpenSearch Ingestion to apply data transformations before delivery. The content includes a reference architecture, a step-by-step guide on infrastructure setup, sample code for implementing the solution within a use case, and an AWS Cloud Development Kit (AWS CDK) application for deployment.
For Host, enter the Redshift Serverless endpoint's host URL. For more information on how to connect to a database, refer to tDBConnection. The output component defines that the data being processed in the job's workflow will land in Redshift Serverless.
The Delta tables created by the EMR Serverless application are exposed through the AWS Glue Data Catalog and can be queried through Amazon Athena. Data ingestion – Steps 1 and 2 use AWS DMS, which connects to the source database and moves full and incremental data (CDC) to Amazon S3 in Parquet format. EMR Serverless version 6.9.0
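Querying the exposed tables through Athena can be done programmatically as well; the boto3 sketch below assumes a hypothetical database, table, and results bucket, not the ones used in the post.

# Placeholder database, table, and output location.
import boto3

athena = boto3.client("athena")
response = athena.start_query_execution(
    QueryString="SELECT * FROM delta_db.orders LIMIT 10",
    QueryExecutionContext={"Database": "delta_db"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
print(response["QueryExecutionId"])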
On many occasions, they need to apply business logic to the data received from the source SaaS platform before pushing it to the target SaaS platform. Let's take an example. AnyCompany's marketing team hosted an event at the Anaheim Convention Center, CA. The marketing team created leads based on the event in Adobe Marketo.
However, you might face significant challenges when planning for a large-scale data warehouse migration. For an example, refer to How JPMorgan Chase built a data mesh architecture to drive significant value to enhance their enterprise data platform. Platform architects define a well-architected platform.
Customers often use many SQL scripts to select and transform the data in relational databases hosted either in an on-premises environment or on AWS and use custom workflows to manage their ETL. AWS Glue is a serverless data integration and ETL service with the ability to scale on demand. Select s3_crawler and choose Run.
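When those SQL scripts are ported to AWS Glue, the job script typically starts from the standard Glue boilerplate shown below; the job-specific transformation logic is elided and would replace the commented section.

# Standard AWS Glue job script skeleton; transformation logic elided.
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# ... Glue/Spark transformations equivalent to the original SQL scripts ...

job.commit()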
A modern data stack relies on cloud computing, whereas a legacy data stack stores data on servers instead of in the cloud. Modern data stacks provide access for more data professionals than a legacy data stack. Examples of data transformation tools include dbt and Dataform.
We use Apache Spark as our main data processing engine and have over 1,000 Spark applications running over massive amounts of data every day. These Spark applications implement our business logic, ranging from data transformation and machine learning (ML) model inference to operational tasks. Their costs were climbing.
Protect data at the source. Put data into action to optimize the patient experience and adapt to changing business models. What is Data Governance in Healthcare? Data governance in healthcare refers to how data is collected and used by hospitals, pharmaceutical companies, and other healthcare organizations and service providers.
For example, the Flink FileSystem connector has FileSystemTableFactory to read/write data in Hadoop Distributed File System (HDFS) or Amazon Simple Storage Service (Amazon S3), the Flink HBase connector has HBase2DynamicTableFactory to read/write data in HBase, and the Flink Kafka connector has KafkaDynamicTableFactory to read/write data in Kafka.
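To show how a table factory is selected in practice, here is a minimal PyFlink sketch where the 'connector' option in the table definition resolves to KafkaDynamicTableFactory; it assumes the Kafka connector JAR is on the classpath, and the topic, broker address, and schema are placeholders.

# Minimal PyFlink sketch; connector JAR assumed available, values are placeholders.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
t_env.execute_sql("""
    CREATE TABLE clicks (
        user_id STRING,
        url     STRING,
        ts      TIMESTAMP(3)
    ) WITH (
        'connector' = 'kafka',                 -- resolved to KafkaDynamicTableFactory
        'topic' = 'clicks',
        'properties.bootstrap.servers' = 'broker:9092',
        'format' = 'json',
        'scan.startup.mode' = 'earliest-offset'
    )
""")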
In addition, more data is becoming available for processing and enrichment of existing and new use cases; for example, we have recently experienced rapid growth in data collection at the edge and an increase in the availability of frameworks for processing that data. As a result, alternative data integration technologies (e.g.,
In this blog, we’ll delve into the critical role of governance and data modeling tools in supporting a seamless data mesh implementation and explore how erwin tools can be used in that role. erwin also provides data governance, metadata management and data lineage software called erwin Data Intelligence by Quest.
Amazon EC2 to host and run a Jenkins build server. Solution walkthrough: The solution architecture is shown in the preceding figure and includes continuous integration and delivery (CI/CD) for data processing. Data engineers can define the underlying data processing job within a JSON template.
Additionally, we show you how to submit batch jobs to Amazon EMR using EMR steps for automated, scheduled data processing. This method is ideal for recurring tasks or large-scale datatransformations. We accessed the data interactively using EMR Studio notebooks and processed it as a batch job using EMR steps.
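Submitting such a batch job as an EMR step can be scripted with boto3; in the sketch below the cluster ID and script location are placeholders, and spark-submit via command-runner.jar stands in for whatever the scheduled job actually runs.

# Placeholder cluster ID and script path.
import boto3

emr = boto3.client("emr")
response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",
    Steps=[{
        "Name": "nightly-transform",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     "s3://my-bucket/scripts/transform.py"],
        },
    }],
)
print(response["StepIds"])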
that gathers data from many sources. Strategic Objective: Create a complete, user-friendly view of the data by preparing it for analysis. Requirement: Multi-Source Data Blending. Data from multiple sources is compiled and the output is a single view, metric, or visualization. Ask your vendors for references.
This field guide to data mapping will explore how data mapping connects volumes of data for enhanced decision-making. Why Data Mapping is Important: Data mapping is a critical element of any data management initiative, such as data integration, data migration, data transformation, data warehousing, or automation.