These data processing and analytical services support Structured Query Language (SQL) to interact with the data. Writing SQL queries requires not just remembering the SQL syntax rules, but also knowledge of the table metadata, which is data about table schemas, relationships among the tables, and possible column values.
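A minimal sketch of that idea, using an in-memory SQLite database: the schema metadata is inspected first, and only then is the join written. The table and column names (orders, customers, customer_id) are hypothetical.

```python
# Look up table metadata before writing the query; tables here are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE customers (customer_id INTEGER, region TEXT)")

# Inspect the schema metadata the engine keeps about each table.
for table in ("orders", "customers"):
    cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
    print(table, [(c[1], c[2]) for c in cols])  # (column name, declared type)

# With the schemas and the relationship (customer_id) known, the join is straightforward.
rows = conn.execute(
    """
    SELECT c.region, SUM(o.amount) AS total
    FROM orders o JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY c.region
    """
).fetchall()
print(rows)
```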
For container terminal operators, data-driven decision-making and efficient data sharing are vital to optimizing operations and boosting supply chain efficiency. From here, the metadata is published to Amazon DataZone by using AWS Glue Data Catalog. This process is shown in the following figure.
This is where metadata, or the data about data, comes into play. Having a data catalog is the cornerstone of your data governance strategy, but what supports your data catalog? Your metadata management framework provides the underlying structure that makes your data accessible and manageable.
Selecting the strategies and tools for validating data transformations and data conversions in your data pipelines. Introduction: Data transformations and data conversions are crucial to ensure that raw data is organized, processed, and ready for useful analysis.
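A minimal validation sketch, assuming pandas DataFrames for the source and transformed datasets; the column names and checks are illustrative, not a specific tool's API.

```python
import pandas as pd

source = pd.DataFrame({"id": [1, 2, 3], "amount": ["10", "20", "30"]})
transformed = source.assign(amount=lambda df: df["amount"].astype(int))

# Row-count reconciliation: a type-conversion transform should not drop rows.
assert len(transformed) == len(source), "row count mismatch after transformation"

# Key integrity: the set of primary keys should be preserved.
assert set(transformed["id"]) == set(source["id"]), "keys lost or duplicated"

# Value-level check: converted values must round-trip back to the originals.
assert (transformed["amount"].astype(str) == source["amount"]).all(), "conversion changed values"
print("validation checks passed")
```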
BMW Group uses 4,500 AWS Cloud accounts across the entire organization but is faced with the challenge of reducing unnecessary costs, optimizing spend, and having a central place to monitor costs. The ultimate goal is to raise awareness of cloud efficiency and optimize cloud utilization in a cost-effective and sustainable manner.
There are countless examples of big data transforming many different industries. There is no disputing the fact that the collection and analysis of massive amounts of unstructured data has been a huge breakthrough. How is Data Virtualization performance optimized? Virtualization goes beyond query federation.
This person (or group of individuals) ensures that the theory behind data quality is communicated to the development team. 2 – Data profiling. Data profiling is an essential process in the DQM lifecycle. This means there are no unintended data errors, and it corresponds to its appropriate designation (e.g.,
Data lakes provide a centralized repository for data from various sources, enabling organizations to unlock valuable insights and drive data-driven decision-making. However, as data volumes continue to grow, optimizing data layout and organization becomes crucial for efficient querying and analysis.
Amazon EMR on EKS provides a deployment option for Amazon EMR that allows organizations to run open-source big data frameworks on Amazon Elastic Kubernetes Service (Amazon EKS). This performance-optimized runtime offered by Amazon EMR makes your Spark jobs run fast and cost-effectively. As of the Amazon EMR 6.5 Amazon EMR 6.10
You can see that the decompressed data has metadata information such as logGroup, logStream, and subscriptionFilters, and the actual data is included within the message field under logEvents (the following example shows CloudTrail events in CloudWatch Logs).
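A small self-contained sketch of decoding that payload shape: a CloudWatch Logs subscription record is gzipped, base64-encoded JSON with logGroup, logStream, subscriptionFilters, and logEvents fields. The payload is constructed inline so the example runs on its own; the log group and event values are hypothetical.

```python
import base64, gzip, json

payload = {
    "messageType": "DATA_MESSAGE",
    "logGroup": "/aws/cloudtrail/example",      # hypothetical log group
    "logStream": "example-stream",
    "subscriptionFilters": ["example-filter"],
    "logEvents": [{"id": "1", "timestamp": 1700000000000,
                   "message": "{\"eventName\": \"ConsoleLogin\"}"}],
}
record_data = base64.b64encode(gzip.compress(json.dumps(payload).encode("utf-8")))

# Reverse the encoding, as a downstream Lambda or Firehose transform would.
decoded = json.loads(gzip.decompress(base64.b64decode(record_data)))
print(decoded["logGroup"], decoded["logStream"], decoded["subscriptionFilters"])
for event in decoded["logEvents"]:
    print(json.loads(event["message"]))       # the actual CloudTrail event body
```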
Specifically, the system uses Amazon SageMaker Processing jobs to process the data stored in the data lake, employing the AWS SDK for Pandas (previously known as AWS Wrangler) for various data transformation operations, including cleaning, normalization, and feature engineering. Orca addressed this in several ways.
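A rough sketch of that kind of transformation step using the AWS SDK for pandas (awswrangler); the bucket paths and column names are hypothetical, not Orca's actual schema.

```python
import awswrangler as wr

df = wr.s3.read_parquet(path="s3://example-data-lake/raw/findings/")

# Cleaning and normalization: drop incomplete rows, standardize casing, scale a numeric column.
df = df.dropna(subset=["asset_id"])
df["severity"] = df["severity"].str.lower()
df["risk_score_norm"] = (df["risk_score"] - df["risk_score"].min()) / (
    df["risk_score"].max() - df["risk_score"].min()
)

# Write the engineered features back to the curated area of the lake.
wr.s3.to_parquet(df=df, path="s3://example-data-lake/curated/findings/", dataset=True)
```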
Data Vault 2.0 allows for the following: agile data warehouse development, parallel data ingestion, a scalable approach to handling multiple data sources (even on the same entity), a high level of automation, historization, and full lineage support. However, Data Vault 2.0
This is done by visualizing the Azure Data Factory pipelines' full column-level lineage, with source-to-target traceability through different data transformations at the most detailed level. Octopai can fully map the BI landscape and trace metadata movement in a mixed environment, including complex multi-vendor landscapes.
Now, joint users will get an enhanced view into cloud and data transformations, with valuable context to guide smarter usage. Integrating helpful metadata into user workflows gives all people, from data scientists to analysts, the context they need to use data more effectively.
Additionally, there are major rewrites to deliver developer-focused improvements, including static type checking, enhanced runtime validation, strong consistency in call patterns, and optimized event chaining. The following eventNames and eventCodes are returned as part of the onChange callback when there is a change in the SDK code status.
This involves unifying and sharing a single copy of data and metadata across IBM® watsonx.data ™, IBM® Db2 ®, IBM® Db2® Warehouse and IBM® Netezza ®, using native integrations and supporting open formats, all without the need for migration or recataloging. With Netezza support for 1.2
Due to this low complexity, the solution uses AWS serverless services to ingest the data, transform it, and make it available for analytics. The serverless architecture features auto scaling, high availability, and a pay-as-you-go billing model to increase agility and optimize costs.
It includes processes that trace and document the origin of data, models and associated metadata and pipelines for audits. It also lets you choose the right engine for the right workload at the right cost, potentially reducing your data warehouse costs by optimizing workloads. Track models and drive transparent processes.
This method uses GZIP compression to optimize storage consumption and query performance. You can also use the data transformation feature of Data Firehose to invoke a Lambda function to perform data transformation in batches. Choose Run.
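A minimal sketch of such a Data Firehose transformation Lambda: the function receives a batch of base64-encoded records and must return each record's recordId, a result status, and re-encoded data. The actual transformation applied here (uppercasing a status field) is purely illustrative.

```python
import base64
import json

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        payload["status"] = str(payload.get("status", "")).upper()  # example transform
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(json.dumps(payload).encode("utf-8")).decode("utf-8"),
        })
    # Firehose batches records, so the function returns the whole transformed batch.
    return {"records": output}
```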
By reverse-engineering, parsing, and converting scripts, Octopai seamlessly connects all data points within and across organizational systems. While open-source tools such as Apache Atlas, Open Metadata, Egeria, Spline, and OpenLineage offer valuable capabilities, they come with their own sets of pros and cons.
By doing so, they aimed to drive innovation, optimize operations, and enhance patient care. They invested heavily in data infrastructure and hired a talented team of data scientists and analysts. Predictive analytics models became more accurate as they were based on trustworthy data flows. This is where Octopai excels.
Sean Im, CEO, Samsung SDS America: “In the field of generative AI and foundation models, watsonx is a platform that will enable us to meet our customers’ requirements in terms of optimization and security, while allowing them to benefit from the dynamism and innovations of the open-source community.”
For example, GPS, social media, and cell phone handoffs are modeled as graphs, while data catalogs, data lineage, and MDM tools leverage knowledge graphs for linking metadata with semantics. RDF is used extensively for data publishing and data interchange and is based on W3C and other industry standards.
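A small sketch of linking metadata with semantics in RDF, using the rdflib library; the namespace, dataset, and predicate names are made up for illustration.

```python
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/catalog/")
g = Graph()

dataset = EX["customer_orders"]
g.add((dataset, RDF.type, EX.Dataset))
g.add((dataset, EX.hasColumn, Literal("customer_id")))
g.add((dataset, EX.derivedFrom, EX["crm_export"]))  # simple lineage link between assets

# Serialize the catalog metadata as Turtle for publishing or interchange.
print(g.serialize(format="turtle"))
```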
A critical feature for every developer, however, is getting instantaneous feedback such as configuration validations or performance metrics, as well as previewing data transformations for each step of their data flow. Attributes contain key metadata like the source directory of a file or the source topic of a Kafka message.
Incremental query refers to a query strategy that focuses on processing and analyzing only the new or updated data within a data lake since the last query. The key idea behind incremental queries is to use metadata or change tracking mechanisms to identify the new or modified data since the last query.
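A minimal sketch of an incremental query using a stored watermark as the change-tracking mechanism, assuming the table carries a last_updated column; the table, column, and dates are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, payload TEXT, last_updated TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)",
                 [(1, "a", "2024-01-01"), (2, "b", "2024-02-01"), (3, "c", "2024-03-01")])

last_query_watermark = "2024-01-15"  # persisted from the previous run

# Only rows modified after the watermark are read; everything older is skipped.
new_rows = conn.execute(
    "SELECT id, payload, last_updated FROM events WHERE last_updated > ? ORDER BY last_updated",
    (last_query_watermark,),
).fetchall()

# Advance the watermark so the next run starts where this one finished.
if new_rows:
    last_query_watermark = new_rows[-1][2]
print(new_rows, last_query_watermark)
```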
A read-optimized platform that can integrate data from multiple applications emerged. In another decade, the internet and mobile devices started to generate data of unforeseen volume, variety, and velocity. This adds an additional ETL step, making the data even more stale. The data lakehouse was created to solve these problems.
Data ingestion – Steps 1 and 2 use AWS DMS, which connects to the source database and moves full and incremental data (CDC) to Amazon S3 in Parquet format. Data transformation – Steps 3 and 4 represent an EMR Serverless Spark application (Amazon EMR 6.9 Let’s refer to this S3 bucket as the raw layer.
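A rough PySpark sketch of that transformation step, assuming AWS DMS has landed full-load and CDC files as Parquet under a raw S3 prefix; the paths, key column, and timestamp column are hypothetical.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("raw-to-curated").getOrCreate()

raw = spark.read.parquet("s3://example-raw-layer/orders/")

# Keep only the latest change per primary key, so the curated table reflects current state.
latest = Window.partitionBy("order_id").orderBy(F.col("change_timestamp").desc())
curated = (raw.withColumn("rn", F.row_number().over(latest))
              .filter(F.col("rn") == 1)
              .drop("rn"))

curated.write.mode("overwrite").parquet("s3://example-curated-layer/orders/")
```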
Within the ANZ enterprise data mesh strategy, aligning data mesh nodes with the ANZ Group’s divisional structure provides optimal alignment between data mesh principles and organizational structure, as shown in the following diagram. Consumer feedback and demand drives creation and maintenance of the data product.
dbt is an open source, SQL-first templating engine that allows you to write repeatable and extensible data transforms in Python and SQL. dbt is predominantly used by data warehouse customers (such as those on Amazon Redshift) who are looking to keep their data transform logic separate from storage and engine.
AWS Glue is a serverless data discovery, load, and transformation service that will prepare data for consumption in BI and AI/ML activities. Solution overview This solution uses Amazon AppFlow to retrieve data from the Jira Cloud. Parquet is a columnar format to optimize subsequent querying. Choose Update.
It also used device data to develop Lenovo Device Intelligence, which uses AI-driven predictive analytics to help customers understand and proactively prevent and solve potential IT issues. Lenovo Device Intelligence can also help to optimize IT support costs, reduce employee downtime, and improve the user experience, the company says.
To create and manage the data products, smava uses Amazon Redshift , a cloud data warehouse. In this post, we show how smava optimized their data platform by using Amazon Redshift Serverless and Amazon Redshift data sharing to overcome right-sizing challenges for unpredictable workloads and further improve price-performance.
Infomedia was looking to build a cloud-based data platform to take advantage of highly scalable data storage with flexible and cloud-native processing tools to ingest, transform, and deliver datasets to their SaaS applications. The raw input data is stored in Amazon S3 in JSON format (called the bronze dataset layer).
The challenges of a monolithic data lake architecture Data lakes are, at a high level, single repositories of data at scale. Data may be stored in its raw original form or optimized into a different format suitable for consumption by specialized engines.
The platform converges data cataloging, data ingestion, data profiling, data tagging, data discovery, and data exploration into a unified platform, driven by metadata. Modak Nabu automates repetitive tasks in the data preparation process and thus accelerates the data preparation by 4x.
In addition, more data is becoming available for processing / enrichment of existing and new use cases e.g., recently we have experienced a rapid growth in data collection at the edge and an increase in availability of frameworks for processing that data. As a result, alternative data integration technologies (e.g.,
Before you implement a data governance framework, you need to know the data you already have. This means you need to: Inventory data: Know all information resources and relevant metadata. Classify data: Organize structured and unstructured data into relevant categories. Reuse metadata productively.
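A minimal sketch of the inventory step using boto3 against the AWS Glue Data Catalog; the paginator calls are standard Glue APIs, while the keyword-based classification rule at the end is purely illustrative.

```python
import boto3

glue = boto3.client("glue")

inventory = []
for db_page in glue.get_paginator("get_databases").paginate():
    for db in db_page["DatabaseList"]:
        for tbl_page in glue.get_paginator("get_tables").paginate(DatabaseName=db["Name"]):
            for table in tbl_page["TableList"]:
                columns = [c["Name"] for c in table["StorageDescriptor"]["Columns"]]
                # Naive classification: flag tables that look like they hold personal data.
                sensitive = any(c in ("email", "ssn", "phone") for c in columns)
                inventory.append({"database": db["Name"], "table": table["Name"],
                                  "columns": columns, "sensitive": sensitive})

print(f"{len(inventory)} tables catalogued")
```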
We use the built-in features of Data Firehose, including AWS Lambda for necessary data transformation and Amazon Simple Notification Service (Amazon SNS) for near real-time alerts. To maintain up-to-date data, an AWS Glue crawler reads and updates the AWS Glue Data Catalog from transformed Parquet files.
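A small sketch of the alerting side: publishing a near real-time notification to an SNS topic when a record looks suspicious. The topic ARN, account number, field name, and threshold are all hypothetical.

```python
import json
import boto3

sns = boto3.client("sns")
ALERT_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:example-security-alerts"

def maybe_alert(record: dict) -> None:
    # Illustrative rule: alert on any record that shows a burst of failed logins.
    if record.get("failed_logins", 0) > 10:
        sns.publish(
            TopicArn=ALERT_TOPIC_ARN,
            Subject="Possible brute-force activity",
            Message=json.dumps(record),
        )
```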
To optimize their security operations, organizations are adopting modern approaches that combine real-time monitoring with scalable data analytics. They are using data lake architectures and Apache Iceberg to efficiently process large volumes of security data while minimizing operational overhead.
While efficiency is a priority, data quality and security remain non-negotiable. Developing and maintaining data transformation pipelines are among the first tasks to be targeted for automation. However, caution is advised since accuracy, timeliness, and other aspects of data quality depend on the quality of data pipelines.
AWS Glue establishes a secure connection to HubSpot using OAuth for authorization and TLS for data encryption in transit. AWS Glue also supports the ability to apply complex datatransformations, enabling efficient data integration and preparation to meet your needs.
This helps optimize data access from the Regional S3 bucket as traffic is routed through Direct Connect. We also use the AWS Glue Data Catalog as the external Hive-compatible metastore, which serves as the central technical metadata catalog. This method is ideal for recurring tasks or large-scale data transformations.
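A brief sketch of pointing Spark at the AWS Glue Data Catalog as its Hive-compatible metastore, as typically configured on Amazon EMR; the application name and database name are hypothetical.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("glue-catalog-example")
    .config("hive.metastore.client.factory.class",
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
    .enableHiveSupport()
    .getOrCreate()
)

# Tables registered in the central Glue catalog become visible to every engine that uses it.
spark.sql("SHOW TABLES IN example_analytics_db").show()
```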
Reports: In formats that are both static and interactive, these showcase tabular views of data. Strategic Objective: Provide an optimal user experience regardless of where and how users prefer to access information. Ideally, your primary data source should belong in this group. Build your first set of reports.
This field guide to data mapping will explore how data mapping connects volumes of data for enhanced decision-making. Why Data Mapping is Important: Data mapping is a critical element of any data management initiative, such as data integration, data migration, data transformation, data warehousing, or automation.