Let’s start by considering the job of a non-ML software engineer: traditional software deals with well-defined, narrowly scoped inputs, which the engineer can exhaustively and cleanly model in code. Machine learning upends this: not only is the data larger, but the models, deep learning models in particular, are much larger than before.
I, thankfully, learned this early in my career, at a time when I could still refer to myself as a software developer. Given that, what would you say is the job of a data scientist (or ML engineer, or any other such title)? Building models: a common task for a data scientist is to build a predictive model, as in the sketch below.
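As a concrete illustration, here is a minimal sketch of that task using scikit-learn; the dataset is synthetic, standing in for whatever business data the model would actually be trained on.

```python
# Minimal sketch of a typical predictive-modeling workflow (synthetic data).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real business dataset.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```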
Finally, we can use Amazon SageMaker to build forecasting models that predict inventory demand and optimize stock levels. Amazon ElastiCache manages real-time application data caching, giving your customers microsecond response times while supporting hundreds of millions of operations per second.
Writing SQL queries requires not just remembering SQL syntax rules but also knowing the table metadata: data about table schemas, relationships among the tables, and possible column values. Generative AI models can translate natural language questions into valid SQL queries, a capability known as text-to-SQL generation.
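To make the idea concrete, here is a sketch of how table metadata can be folded into a text-to-SQL prompt. The schema, the question, and the llm_complete helper are all hypothetical; swap in whichever model endpoint you actually use.

```python
# Sketch: grounding a text-to-SQL prompt in table metadata (names are hypothetical).
SCHEMA = """
Table orders(order_id INT, customer_id INT, order_date DATE, total NUMERIC)
Table customers(customer_id INT, name TEXT, region TEXT)
orders.customer_id references customers.customer_id
"""

def build_prompt(question: str) -> str:
    # Give the model the schema and relationships it needs to write valid SQL.
    return (
        "Given the following schema, write a valid SQL query.\n"
        f"{SCHEMA}\n"
        f"Question: {question}\n"
        "Return only SQL."
    )

prompt = build_prompt("Total order value by region for 2023?")
# sql = llm_complete(prompt)  # hypothetical helper: call your model of choice
print(prompt)
```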
Since reporting is part of an effective data quality management (DQM) practice, we will also go through some data quality metrics examples you can use to assess your efforts. But first, let’s define what data quality actually is, why you need data quality management, and where practices such as data profiling fit in.
Together with price-performance, Amazon Redshift offers capabilities such as a serverless architecture, machine learning integration within your data warehouse, and secure data sharing across the organization. dbt Cloud is a hosted service that helps data teams productionize dbt deployments, and you can create dbt models directly in dbt Cloud.
Amazon Redshift has launched a session reuse capability for the Data API that can significantly streamline multi-step, stateful workloads such as extract, transform, and load (ETL) pipelines, reporting processes, and other flows that involve sequential queries. For details, refer to Redshift Quotas and Limits.
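A minimal sketch of what session reuse looks like through the Data API, assuming a serverless workgroup named my-workgroup; the keep-alive window and SQL statements are illustrative only.

```python
# Sketch of Data API session reuse (workgroup/database names are placeholders).
import boto3

client = boto3.client("redshift-data")

# The first statement opens a session and keeps it alive for later calls.
first = client.execute_statement(
    WorkgroupName="my-workgroup",  # or ClusterIdentifier for provisioned clusters
    Database="dev",
    Sql="CREATE TEMP TABLE stage AS SELECT * FROM sales WHERE sale_date = CURRENT_DATE;",
    SessionKeepAliveSeconds=300,
)

# Later statements reuse the same session by passing its SessionId.
client.execute_statement(
    Sql="INSERT INTO sales_daily SELECT * FROM stage;",
    SessionId=first["SessionId"],
)
```

Because both statements run in the same session, the temp table created by the first remains visible to the second, which is what makes multi-step, stateful ETL flows like this practical.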
AI is transforming how senior data engineers and data scientists validate data transformations and conversions. Artificial intelligence-based verification approaches aid in the detection of anomalies, the enforcement of data integrity, and the optimization of pipelines for improved efficiency.
You can’t talk about data analytics without talking about data modeling. The reasons for this are simple: before you can start analyzing data, huge datasets like data lakes must be modeled or transformed to be usable. Building the right data model is an important part of your data strategy.
The goal of DataOps is to help organizations make better use of their data to drive business decisions and improve outcomes. ChatGPT: DataOps is a term that refers to the set of practices and tools that organizations use to improve the quality and speed of data analytics and machine learning.
dbt is an open source, SQL-first templating engine that allows you to write repeatable and extensible data transforms in Python and SQL. dbt is predominantly used by data warehouse customers (such as those on Amazon Redshift) who want to keep their data transform logic separate from storage and engine.
Einstein Copilot for Tableau remains in beta, but Tableau announced two new features for the AI assistant as well: AI-assisted data transformation. This feature can automate a data transformation pipeline with step-by-step suggestions for preparing data for analysis.
However, this partnership model cannot keep pace with an always-changing technology landscape in which skill gaps and resource shortages are growing. The new models recognise this, prompting tech vendors to shift toward innovation-focused roles and become partners in the client’s success.
For more information on this foundation, refer to A Detailed Overview of the Cost Intelligence Dashboard. Additionally, it manages table definitions in the AWS Glue Data Catalog, containing references to data sources and targets of extract, transform, and load (ETL) jobs in AWS Glue.
The Cloud Data Hub processes and combines anonymized data from vehicle sensors and other sources across the enterprise to make it easily accessible for internal teams creating customer-facing and internal applications. To learn more about the Cloud Data Hub, refer to BMW Group Uses AWS-Based Data Lake to Unlock the Power of Data.
AI governance refers to the practice of directing, managing and monitoring an organization’s AI activities. It includes processes that trace and document the origin of data, models and associated metadata and pipelines for audits. Foundation models can use language, vision and more to affect the real world.
Taking the broadest possible interpretation of data analytics, Azure offers more than a dozen services, including Azure Data Factory and Azure Synapse Analytics, and that’s before you include Power BI, with its AI-powered analysis and new datamart option, or governance-oriented approaches such as Microsoft Purview.
Traditionally, such a legacy call center analytics platform would be built on a relational database that stores data from streaming sources. Data transformation through stored procedures and the use of materialized views to curate datasets and generate insights is a known pattern with relational databases.
Amazon Redshift ML is a feature of Amazon Redshift that enables you to build, train, and deploy machine learning (ML) models directly within the Redshift environment. Generative AI models can derive new features from your data and enhance decision-making. Create a materialized view to load the raw streaming data.
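As a rough sketch of those two steps together, assuming an external schema already points at a Kinesis stream and using the redshift_connector driver; the host, stream, table, column, and bucket names are all placeholders.

```python
# Sketch: a streaming materialized view plus an in-warehouse Redshift ML model.
import redshift_connector

conn = redshift_connector.connect(
    host="my-cluster.example.us-east-1.redshift.amazonaws.com",  # placeholder
    database="dev", user="awsuser", password="...",
)
cur = conn.cursor()

# Materialized view over a Kinesis stream (assumes kinesis_schema already exists).
cur.execute("""
    CREATE MATERIALIZED VIEW clickstream_raw AS
    SELECT approximate_arrival_timestamp,
           JSON_PARSE(kinesis_data) AS payload
    FROM kinesis_schema."clickstream";
""")

# Train a model in-warehouse; Redshift ML handles the SageMaker plumbing.
cur.execute("""
    CREATE MODEL churn_model
    FROM (SELECT age, tenure, monthly_spend, churned FROM customers)
    TARGET churned
    FUNCTION predict_churn
    IAM_ROLE default
    SETTINGS (S3_BUCKET 'my-redshift-ml-bucket');
""")
conn.commit()
```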
As with all AWS services, Amazon Redshift is a customer-obsessed service that recognizes there isn’t a one-size-fits-all approach to data models, which is why Amazon Redshift supports multiple data models such as star schemas, snowflake schemas, and Data Vault 2.0.
Efficient data integration – AWS Glue simplifies the ETL process, providing a scalable and flexible solution for data integration between Snowflake and Amazon S3. Scalability and flexibility – The architecture supports scalable data transfers and can be extended to integrate additional data sources and destinations as needed.
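A sketch of what such a Glue job might look like, reading from a Data Catalog table backed by S3 and writing through a pre-created Snowflake connection; the database, table, and connection options here are assumptions, so check them against your connector’s documentation.

```python
# Sketch of a Glue ETL job from S3 (via the Data Catalog) to Snowflake.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw data registered in the Glue Data Catalog.
source = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="orders_s3"  # assumed catalog entries
)

# Write to Snowflake through a pre-created Glue connection.
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="snowflake",
    connection_options={
        "connectionName": "snowflake-connection",  # assumed Glue connection
        "dbtable": "ORDERS",
        "sfDatabase": "ANALYTICS",
        "sfSchema": "PUBLIC",
    },
)
job.commit()
```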
We set up our AWS CDK to refer to the contents of a specific directory and define a resource (for example, an AWS Step Functions state machine or an AWS Glue job) for each file it found in that directory. We also used it as a repository for storing code that could be retrieved and used by other services.
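A minimal sketch of that directory-driven pattern in a CDK (Python) stack, creating one Step Functions state machine per definition file found; the directory layout and role ARN are placeholders.

```python
# Sketch: define one Step Functions state machine per file in a directory.
from pathlib import Path

import aws_cdk as cdk
from aws_cdk import Stack, aws_stepfunctions as sfn
from constructs import Construct

class PipelineStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # Each *.asl.json file in the directory becomes its own state machine.
        for definition in Path("state_machines").glob("*.asl.json"):
            sfn.CfnStateMachine(
                self, definition.stem,
                role_arn="arn:aws:iam::123456789012:role/sfn-role",  # placeholder
                definition_string=definition.read_text(),
            )

app = cdk.App()
PipelineStack(app, "PipelineStack")
app.synth()
```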
For full instructions, refer to Jira Cloud connector for Amazon AppFlow. You can do this by updating the CloudFormation stack with a flag that includes the CDC and data transformation steps. This will enable both the CDC steps and the data transformation steps for the Jira data. Choose Update.
Operationalizing machine learning (ML) is another growing use case, where data has to be transformed and normalized before it can be loaded into an ML model. In both use cases, the data pipeline is preparing the data for consumption by ingesting data from different sources and transforming it through a series of steps.
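For the transform-and-normalize step specifically, a scikit-learn pipeline is a common way to keep the preprocessing and the model together; the column names below are assumptions.

```python
# Sketch: transform and normalize features before they reach the model.
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocess = ColumnTransformer([
    ("numeric", StandardScaler(), ["age", "income"]),                  # assumed columns
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["region"]),
])

pipeline = Pipeline([
    ("preprocess", preprocess),   # normalization happens inside the pipeline
    ("model", LogisticRegression(max_iter=1000)),
])
# pipeline.fit(train_df[["age", "income", "region"]], train_df["label"])
```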
If storing operational data in a data warehouse is a requirement, synchronization of tables between operational data stores and Amazon Redshift tables is supported. In scenarios where data transformation is required, you can use Redshift stored procedures to modify data in Redshift tables.
Under the Transparency in Coverage (TCR) rule, hospitals and payors are required to publish their pricing data in a machine-readable format. For more information, refer to Delivering Consumer-friendly Healthcare Transparency in Coverage On AWS. The Data Catalog now contains references to the machine-readable data.
Encounter 4 appears to refer to the customer with ID 8, but the email doesn’t match, and no Customer_ID is given. A knowledge graph allows us to combine data from different sources to gain a better understanding of a specific problem domain. With the AWS SDK for Pandas, we combine this data by running a query against the Neptune graph.
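A sketch of that final step with the AWS SDK for pandas (awswrangler), which can run an openCypher query against Neptune and return the result as a DataFrame; the endpoint, labels, and properties are placeholders.

```python
# Sketch: query the Neptune knowledge graph into a pandas DataFrame.
import awswrangler as wr

client = wr.neptune.connect(
    "my-neptune.cluster-xxxx.us-east-1.neptune.amazonaws.com",  # placeholder
    8182,
    iam_enabled=False,
)

# Find encounters linked to customers through a shared email, even when the
# Customer_ID is missing on the encounter itself.
df = wr.neptune.execute_opencypher(
    client,
    """
    MATCH (e:Encounter)-[:HAS_EMAIL]->(m:Email)<-[:HAS_EMAIL]-(c:Customer)
    RETURN e.id AS encounter, c.id AS customer, m.address AS email
    """,
)
print(df.head())
```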
In legacy analytical systems such as enterprise data warehouses, the scalability challenges of a system were primarily associated with computational scalability, i.e., the ability of a data platform to handle larger volumes of data in an agile and cost-efficient way. As a result, alternative data integration technologies (e.g.,
You can modify the Lambda function to fetch additional vehicle information from a separate data store (for example, a DynamoDB table or a Customer Relationship Management system) to enrich the data, before storing the results in an S3 bucket. In this model, the Lambda function is invoked for each incoming event.
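A minimal sketch of that per-event enrichment pattern; the DynamoDB table, bucket, and event shape are assumptions.

```python
# Sketch: enrich each incoming event with vehicle details, then store in S3.
import json

import boto3

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")
table = dynamodb.Table("vehicle-info")  # assumed enrichment store

def handler(event, context):
    vin = event["vin"]  # assumed event field
    # Look up additional vehicle information in DynamoDB.
    item = table.get_item(Key={"vin": vin}).get("Item", {})
    enriched = {**event, "vehicle": item}
    # Store the enriched record in S3, keyed by VIN.
    s3.put_object(
        Bucket="enriched-events-bucket",  # placeholder bucket
        Key=f"events/{vin}.json",
        Body=json.dumps(enriched, default=str),
    )
```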
OpenSearch Service also has vector database capabilities that let you implement semantic search and Retrieval Augmented Generation (RAG) with large language models (LLMs) to build recommendation and media search engines. For setup instructions, refer to Getting started with Amazon OpenSearch Service.
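A sketch of a k-NN semantic search against such an index using the opensearch-py client; the domain endpoint, index, and field names are hypothetical, and authentication is omitted for brevity.

```python
# Sketch: k-NN vector search against an OpenSearch Service index.
from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=[{"host": "search-media.example.us-east-1.es.amazonaws.com", "port": 443}],
    use_ssl=True,
)

# The query vector must come from the same embedding model used at index time;
# a 3-dim dummy stands in here for a real high-dimensional embedding.
query_vector = [0.1, 0.2, 0.3]

results = client.search(
    index="media-catalog",  # assumed index with a knn_vector field "embedding"
    body={
        "size": 5,
        "query": {"knn": {"embedding": {"vector": query_vector, "k": 5}}},
    },
)
for hit in results["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```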
However, in this initial article, I wanted to focus on one tool that I have used as part of my Data Strategy engagements: a Data Maturity Model. I used to talk about carrying out a Situational Analysis of Data Capabilities; nowadays I am more likely to refer to a Data Capability Review.
To populate the database, the Infomedia team developed a data pipeline using Amazon Simple Storage Service (Amazon S3) for data storage, AWS Glue for data transformations, and Apache Hudi for CDC and record-level updates.
A source of unpredictable workloads is dbt Cloud, which SafetyCulture uses to manage data transformations in the form of models. Whenever models are created or modified, a dbt Cloud CI job is triggered to test them by materializing them in Amazon Redshift.
AWS Glue Studio is a graphical interface that makes it easy to create, run, and monitor extract, transform, and load (ETL) jobs in AWS Glue. It allows you to visually compose data transformation workflows using nodes that represent different data handling steps, which are later converted automatically into runnable code.
In our last blog, we delved into the seven most prevalent data challenges that can be addressed with effective data governance. Today we will share our approach to developing a data governance program to drive data transformation and fuel a data-driven culture.
In this post, I’ll walk you through how to copy data from one Amazon Relational Database Service (Amazon RDS) for PostgreSQL database to another, while scrubbing PII along the way using AWS Glue. Built-in data transformations then scrub columns containing PII using pre-defined masking functions for PII detection and scrubbing, as sketched below.
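A sketch of the masking step itself, as plain PySpark of the kind that would sit inside the Glue job; the column names and S3 paths are assumptions.

```python
# Sketch: hash emails and redact phone digits before writing the scrubbed copy.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://staging-bucket/customers/")  # placeholder path

scrubbed = (
    df.withColumn("email", F.sha2(F.col("email"), 256))            # one-way hash
      .withColumn("phone", F.regexp_replace("phone", r"\d", "X"))  # redact digits
)
scrubbed.write.mode("overwrite").parquet("s3://scrubbed-bucket/customers/")
```

Hashing preserves joinability across tables, while redaction destroys the raw value entirely; which masking function fits depends on how the scrubbed copy will be used.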
Feature engineering is useful for data scientists when assessing tradeoff decisions regarding the impact of their ML models. It provides both a framework for approaching ML and techniques for extracting features from raw data that can be used within the models.
A data warehouse stores transactional level details and serves the broader reporting and analytical needs of an organization – creating one source of truth for building semantic models or serving structured, simplified and harmonized data to tools like Power BI, Excel or even SSRS.
By preserving historical versions, data lake time travel provides benefits such as auditing and compliance, data recovery and rollback, reproducible analysis, and data exploration at different points in time. Another popular transaction data lake use case is incremental query. Make sure you are in the us-east-1 Region.
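A sketch of both ideas with Apache Hudi in PySpark: a time-travel read pinned to a past instant, and an incremental read that returns only commits after a given instant. The table path and instant times are placeholders.

```python
# Sketch: Hudi time travel and incremental queries in PySpark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
table_path = "s3://lake-bucket/hudi/orders/"  # placeholder table location

# Time travel: read the table as it existed at a past instant.
snapshot = (
    spark.read.format("hudi")
    .option("as.of.instant", "2024-01-01 00:00:00")
    .load(table_path)
)

# Incremental query: only the records committed after a given instant.
increment = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20240101000000")
    .load(table_path)
)
```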
However, you might face significant challenges when planning for a large-scale data warehouse migration. For an example, refer to How JPMorgan Chase built a data mesh architecture to drive significant value to enhance their enterprise data platform. Platform architects define a well-architected platform.
MMM stands for Marketing Mix Model, one of the oldest and most well-established techniques for statistically measuring the sales impact of marketing activity. As with any type of statistical model, data is key, and the GIGO (“Garbage In, Garbage Out”) principle definitely applies; the data requirements follow from the formulation sketched below.
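One common additive formulation, written out as a sketch rather than the definitive specification: sales in period t are a baseline plus saturating, adstocked contributions from each media channel, plus control variables and noise.

```latex
% A common additive MMM specification (one of several in use):
\[
  \text{Sales}_t = \beta_0
    + \sum_{i=1}^{n} \beta_i \, f\!\left(A_{i,t}\right)
    + \gamma^{\top} Z_t + \varepsilon_t,
  \qquad
  A_{i,t} = x_{i,t} + \lambda_i \, A_{i,t-1}
\]
```

Here x_{i,t} is spend on channel i in period t, A_{i,t} is its adstocked (carryover) version with decay rate lambda_i, f is a saturation transform such as a log or Hill curve, and Z_t holds controls like price and seasonality.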
Data transformation plays a pivotal role in providing the necessary data insights for businesses in any organization, small and large. To gain these insights, customers often perform ETL (extract, transform, and load) jobs from their source systems and output an enriched dataset.
The solution is choosing one of the standard provenance models. Graph Replace is probably the most straightforward of these: it is fast and simple to implement, and we recommend it to people with batch updates. The remaining trade-offs among the standard provenance models come down to whether you use persistent or non-persistent IDs.
The following AWS services are used for data ingestion, processing, and load: Amazon AppFlow is a fully managed integration service that enables you to securely transfer data between SaaS applications like Salesforce, SAP, Marketo, Slack, and ServiceNow, and AWS services like Amazon S3 and Amazon Redshift, in just a few clicks.