Data Analytics, Metadata, Reference and Testing

Introducing Amazon MWAA larger environment sizes

AWS Big Data

APRIL 16, 2024

Running Apache Airflow at scale puts proportionally greater load on the Airflow metadata database, sometimes leading to CPU and memory issues on the underlying Amazon Relational Database Service (Amazon RDS) cluster. A resource-starved metadata database may drop connections from your workers, causing tasks to fail prematurely.

Metadata, Metrics, Testing, Management

Run Trino queries 2.7 times faster with Amazon EMR 6.15.0

AWS Big Data

MARCH 22, 2024

Benchmark setup: In our testing, we used the 3 TB dataset stored in Amazon S3 in compressed Parquet format, with metadata for databases and tables stored in the AWS Glue Data Catalog. This benchmark uses unmodified TPC-DS data schema and table relationships. With Amazon EMR 6.10.0 If you are using Amazon EMR 6.8.0

Metadata, Statistics, Broadcasting, Optimization

Accelerate HiveQL with Oozie to Spark SQL migration on Amazon EMR

AWS Big Data

APRIL 19, 2023

We split the solution into two primary components: generating Spark job metadata and running the SQL on Amazon EMR. The first component (metadata setup) consumes existing Hive job configurations and generates metadata such as number of parameters, number of actions (steps), and file formats. sql_path: SQL file name.

Metadata, Data Lake, Testing, Consulting

Orchestrate an end-to-end ETL pipeline using Amazon S3, AWS Glue, and Amazon Redshift Serverless with Amazon MWAA

AWS Big Data

APRIL 25, 2024

Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a managed orchestration service for Apache Airflow that you can use to set up and operate data pipelines in the cloud at scale. Apache Airflow is an open source tool used to programmatically author, schedule, and monitor sequences of processes and tasks, referred to as workflows.

Metadata, Data Processing, Management, Testing

Use Apache Iceberg in a data lake to support incremental data processing

AWS Big Data

MARCH 2, 2023

Apache Iceberg is an open table format for very large analytic datasets, which captures metadata information on the state of datasets as they evolve and change over time. Iceberg has become very popular for its support for ACID transactions in data lakes and features like schema and partition evolution, time travel, and rollback.

Data Lake, Data Processing, Metadata, Snapshot

Design a data mesh pattern for Amazon EMR-based data lakes using AWS Lake Formation with Hive metastore federation

AWS Big Data

JUNE 10, 2024

Organizations have multiple Hive data warehouses across EMR clusters, where the metadata gets generated. To address this challenge, organizations can deploy a data mesh using AWS Lake Formation that connects the multiple EMR clusters. Test access using Athena queries in the consumer account.

Data Lake, Metadata, Data Warehouse, Data Processing

What is data governance? Best practices for managing data assets

CIO Business Intelligence

MARCH 24, 2023

The Business Application Research Center (BARC) warns that data governance is a highly complex, ongoing program, not a “big bang initiative,” and it runs the risk of participants losing trust and interest over time. The program must introduce and support standardization of enterprise data.

Data Governance, Management, Metadata, Data Quality

Use Amazon Athena to query data stored in Google Cloud Platform

AWS Big Data

AUGUST 15, 2023

As customers accelerate their migrations to the cloud and transform their businesses, some find themselves in situations where they have to manage data analytics in a multi-cloud environment, such as acquiring a company that runs on a different cloud provider. For complete steps, refer to Creating a VPC for a data source connector.

Recreation/Entertainment, Unstructured Data, Business Intelligence, Data-driven

Apache Ozone – A High Performance Object Store for CDP Private Cloud

Cloudera

OCTOBER 15, 2021

With FSO, Apache Ozone guarantees atomic directory operations, and renaming or deleting a directory is a simple metadata operation even if the directory has a large set of sub-paths (directories/files) within it. In fact, this gives Apache Ozone a significant performance advantage over other object stores in the data analytics ecosystem.

Testing, Measurement, Optimization, Metadata

Amazon DataZone now integrates with AWS Glue Data Quality and external data quality solutions

AWS Big Data

APRIL 3, 2024

If the asset has AWS Glue Data Quality enabled, you can now quickly visualize the data quality score directly in the catalog search pane. By selecting the corresponding asset, you can understand its content through the readme, glossary terms, and technical and business metadata.

Data Quality, Visualization, Metadata, Metrics

Upgrade Journey: The Path from CDH to CDP Private Cloud

Cloudera

SEPTEMBER 28, 2020

Cloudera has found that customers have spent many years investing in their big data assets and want to continue to build on that investment by moving towards a more modern architecture that helps leverage the multiple form factors. The customer leverages Cloudera’s multi-function analytics stack in CDP. Test and QA.

Testing, Metadata, Risk, Data Science

Speed up queries with the cost-based optimizer in Amazon Athena

AWS Big Data

NOVEMBER 17, 2023

Starting today, the Athena SQL engine uses a cost-based optimizer (CBO), a new feature that uses table and column statistics stored in the AWS Glue Data Catalog as part of the table’s metadata. Testing on the TPC-DS benchmark showed an 11% improvement in overall query performance when using CBO compared to without it.

Optimization, Statistics, Metadata, Data Lake

Build Spark Structured Streaming applications with the open source connector for Amazon Kinesis Data Streams

AWS Big Data

MAY 24, 2024

Apache Spark is a powerful big data engine used for large-scale data analytics. You can use Apache Spark to process streaming data from a variety of streaming sources, including Amazon Kinesis Data Streams for use cases like clickstream analysis, fraud detection, and more. Starting with Amazon EMR 7.1,

Metadata, Interactive, Business Objectives, Management

How Eightfold AI implemented metadata security in a multi-tenant data analytics environment with Amazon Redshift

AWS Big Data

NOVEMBER 29, 2023

The Eightfold Talent Intelligence Platform integrates with Amazon Redshift metadata security to implement visibility of data catalog listing of names of databases, schemas, tables, views, stored procedures, and functions in Amazon Redshift. This post discusses restricting listing of data catalog metadata as per the granted permissions.

Metadata, Data Warehouse, Analytics, Data Analytics

Introducing Amazon MWAA support for the Airflow REST API and web server auto scaling

AWS Big Data

MAY 16, 2024

With the new REST API, you can now invoke DAG runs, manage datasets, or get the status of Airflow’s metadata database, triggerer, and scheduler, all without relying on the Airflow web UI or CLI. Refer to Creating an Apache Airflow web login token for more details. small instance class. qps (int): Queries per second.

Testing, Metrics, Interactive, Management

GraphDB in Action: Putting the Most Reliable RDF Database to Work for Better Human-machine Interaction

Ontotext

JANUARY 26, 2023

These 30 layers can be split into two kinds: a location-reference layer and a topic layer. The authors address the challenge of interoperability in the digitalization of mobility systems and introduce a reference architecture for the Shift2Rail Interoperability Framework (IF). The catalog stores the asset’s metadata in RDF.

Interactive, Metadata, Data Integration, Data-driven

Build a real-time analytics solution with Apache Pinot on AWS

AWS Big Data

AUGUST 6, 2024

In essence, it’s the foundation for user-centric data analysis in modern apps, because it’s the layer that translates technical assets into business-friendly terms that enable users to extract actionable insights from data. The scope of data analytics has grown, and more user personas are now seeking to extract insights themselves.

OLAP, Analytics, Visualization, Dashboards

Setting up and Getting Started with Cloudera’s New SQL AI Assistant

Cloudera

JANUARY 19, 2024

Please refer to the product documentation for more information about specific releases. Cloudera has been testing with GPT running in both Azure and OpenAI, but the following service-model combinations are also supported: Note: Cloudera recommends using the Hue AI assistant with the Azure OpenAI service. or higher on the public cloud.

Data Warehouse, Data Processing, Optimization, Modeling

Gain insights from historical location data using Amazon Location Service and AWS analytics services

AWS Big Data

MARCH 13, 2024

Data analytics – Business analysts gather operational insights from multiple data sources, including the location data collected from the vehicles. Athena is used to run geospatial queries on the location data stored in the S3 buckets. You can test this solution yourself using the AWS Samples GitHub repository.

Analytics, IoT, Metadata, Internet of Things

Orchestrate Amazon EMR Serverless Spark jobs with Amazon MWAA, and data validation using Amazon Athena

AWS Big Data

DECEMBER 12, 2023

Use EMR Serverless to transform the data using PySpark code and then store the transformed data back in your S3 bucket. Use Athena to create an external table based on the S3 dataset and run queries to analyze the transformed data. Athena uses the AWS Glue Data Catalog to store the table metadata.

Data Processing, Statistics, Management, Interactive

Modernize a legacy real-time analytics application with Amazon Managed Service for Apache Flink

AWS Big Data

OCTOBER 11, 2023

In this post, we discuss ways to modernize your legacy, on-premises, real-time analytics architecture to build serverless data analytics solutions on AWS using Amazon Managed Service for Apache Flink. It shows a call center streaming data source that sends the latest call center feed every 15 seconds.

Management, Metadata, Analytics, Dashboards

What you need to know about product management for AI

O'Reilly on Data

MARCH 31, 2020

The model outputs produced by the same code will vary with changes to things like the size of the training data (number of labeled examples), network training parameters, and training run time. This has serious implications for software testing, versioning, deployment, and other core development processes.

Management, Machine Learning, Experimentation, Metrics

Enrich your serverless data lake with Amazon Bedrock

AWS Big Data

SEPTEMBER 26, 2024

Surfacing relevant information to end-users in a concise and digestible format is crucial for maximizing the value of data assets. Automatic document summarization, natural language processing (NLP), and data analytics powered by generative AI present innovative solutions to this challenge.

Data Lake, Cost-Benefit, Unstructured Data, Modeling

AI recommendations for descriptions in Amazon DataZone for enhanced business data cataloging and discovery is now generally available

AWS Big Data

APRIL 2, 2024

Without the right metadata and documentation, data consumers overlook valuable datasets relevant to their use case or spend more time going back and forth with data producers to understand the data and its relevance for their use case—or worse, misuse the data for a purpose it was not intended for.

Metadata, Metrics, Data-driven, Contextual Data

A Day in the Life of a DataOps Engineer

DataKitchen

OCTOBER 11, 2021

The biggest challenge is broken data pipelines due to highly manual processes. Figure 1 shows a manually executed data analytics pipeline. First, a business analyst consolidates data from some public websites, an SFTP server and some downloaded email attachments, all into Excel. Adding Tests to Reduce Stress.

Testing, Metadata, Dashboards, Statistics

Build SAML identity federation for Amazon OpenSearch Service domains within a VPC

AWS Big Data

FEBRUARY 7, 2024

Refer to How can I access OpenSearch Dashboards from outside of a VPC using Amazon Cognito authentication for a detailed evaluation of the available options and the corresponding pros and cons. For more information, refer to the AWS CDK v2 Developer Guide. For instructions, refer to Creating a public hosted zone. application.

Dashboards, Data Processing, Metadata, Consulting

Preprocess and fine-tune LLMs quickly and cost-effectively using Amazon EMR Serverless and Amazon SageMaker

AWS Big Data

FEBRUARY 1, 2024

The Common Crawl corpus contains petabytes of data, regularly collected since 2008, and contains raw webpage data, metadata extracts, and text extracts. In addition to determining which dataset should be used, cleansing and processing the data to the fine-tuning’s specific need is required.

Metadata, Modeling, Data Processing, Unstructured Data

How HPE Aruba Supply Chain optimized cost and performance by migrating to an AWS modern data architecture

AWS Big Data

SEPTEMBER 11, 2024

Each file arrives as a pair with a tail metadata file in CSV format containing the size and name of the file. This metadata file is later used to read source file names during processing into the staging layer. These files follow the same naming pattern, with a daily system-generated timestamp appended to each file name.

Data Architecture, Optimization, Data Warehouse, Metadata

Build efficient ETL pipelines with AWS Step Functions distributed map and redrive feature

AWS Big Data

DECEMBER 18, 2023

For more information, refer to IAM Policies for invoking AWS Glue job from Step Functions. There are multiple tables related to customers and order data in the RDS database. Amazon S3 hosts the metadata of all the tables as a .csv file. To learn more about how distributed map redrive works, refer to Redriving Map Runs.

Metadata, Visualization, Data Lake, Data-driven

AWS Glue streaming application to process Amazon MSK data using AWS Glue Schema Registry

AWS Big Data

JUNE 12, 2023

Organizations across the world are increasingly relying on streaming data, and there is a growing need for real-time data analytics, considering the growing velocity and volume of data being collected. Refer to the appendix section for more information on this feature. Refer to the first stack’s output.

Management, Metadata, Testing, Internet of Things

Implement Apache Flink near-online data enrichment patterns

AWS Big Data

NOVEMBER 15, 2023

Stream data processing allows you to act on data in real time. Real-time data analytics can help you have on-time and optimized responses while improving the overall customer experience. Data streaming workloads often require data in the stream to be enriched via external sources (such as databases or other data streams).

Testing, Optimization, Management, Metadata

Extracting key insights from Amazon S3 access logs with AWS Glue for Ray

AWS Big Data

SEPTEMBER 7, 2023

They store attributes such as object size, total time, turn-around time, and HTTP referer for log records. AWS Glue Data Catalog stores information as metadata tables, where each table specifies a single data store. Running the crawler on a schedule updates AWS Glue Data Catalog with new partitions and metadata.

Metadata, Dashboards, Metrics, Visualization

Use your corporate identities for analytics with Amazon EMR and AWS IAM Identity Center

AWS Big Data

APRIL 26, 2024

Use Lake Formation to grant permissions to users to access data. Test the solution by accessing data with a corporate identity. Audit user data access. For a complete guide on creating and providing a certificate, refer to Providing certificates for encrypting data in transit with Amazon EMR encryption.

Analytics, Data Lake, Management, Enterprise

Implement Apache Flink real-time data enrichment patterns

AWS Big Data

NOVEMBER 15, 2023

Stream data processing allows you to act on data in real time. Real-time data analytics can help you have on-time and optimized responses while improving the overall customer experience. Data streaming workloads often require data in the stream to be enriched via external sources (such as databases or other data streams).

Testing, Optimization, Management, Metadata

Simplify data loading into Type 2 slowly changing dimensions in Amazon Redshift

AWS Big Data

MARCH 9, 2023

A dimension is a structure that captures reference data along with associated hierarchies, while a fact table captures different values and metrics that can be aggregated by dimensions. Relative to the metrics data that keeps changing on a daily or even hourly basis, the dimension attributes change less frequently.

Slice and Dice, Data Warehouse, Metrics, Metadata

How Swisscom automated Amazon Redshift as part of their One Data Platform solution using AWS CDK – Part 1

AWS Big Data

JUNE 12, 2024

Swisscom’s Data, Analytics, and AI division is building a One Data Platform (ODP) solution that will enable every Swisscom employee, process, and product to benefit from the massive value of Swisscom’s data. Swisscom is a leading telecommunications provider in Switzerland.

Data Architecture, Cost-Benefit, Experimentation, Data-driven

Implement a serverless CDC process with Apache Iceberg using Amazon DynamoDB and Amazon Athena

AWS Big Data

AUGUST 16, 2023

Most businesses store their critical data in a data lake, where you can bring data from various sources to a centralized storage. Change Data Capture (CDC) in the context of a data lake refers to the process of capturing and propagating changes made to source data.

Data Lake, Metadata, Testing, Snapshot

Choosing an open table format for your transactional data lake on AWS

AWS Big Data

JUNE 9, 2023

It is crucial that you perform testing to ensure that a table format meets your specific use case requirements. Amazon Redshift only supports Delta Symlink tables (see Creating external tables for data managed in Delta Lake for more information). This post is not intended to provide detailed technical guidance (e.g.

Data Lake, Metadata, Optimization, Statistics

Amazon DataZone introduces OpenLineage-compatible data lineage visualization in preview

AWS Big Data

JULY 8, 2024

Introduction to OpenLineage-compatible data lineage: The need to capture data lineage consistently across various analytical services and combine them into a unified object model is key in uncovering insights from the lineage artifact. To learn more, refer to Creating inventory and published data in Amazon DataZone.

Visualization, Metadata, Publishing, Sales

6 Case Studies on The Benefits of Business Intelligence And Analytics

datapine

JANUARY 31, 2022

Using business intelligence and analytics effectively is the crucial difference between companies that succeed and companies that fail in the modern environment. Everything is being tested, and then the campaigns that succeed get more money put into them, while the others aren’t repeated. Let’s look at our first use case.

Business Intelligence, Analytics, Cost-Benefit, ROI

“You Complete Me,” said Data Lineage to DataOps Observability.

DataKitchen

JANUARY 23, 2023

DataOps Observability includes monitoring and testing the data pipeline, data quality, data testing, and alerting. Data testing is an essential aspect of DataOps Observability; it helps to ensure that data is accurate, complete, and consistent with its specifications, documentation, and end-user requirements.

Testing, Data Governance, Data Quality, Data-driven

The Gartner 2021 Leadership Vision for Data & Analytics Leaders Webinar Q&A

Andrew White

JANUARY 11, 2021

It was titled, The Gartner 2021 Leadership Vision for Data & Analytics Leaders. This was for the Chief Data Officer, or head of data and analytics. It is meant to be a desk-reference for that role for 2021. Storytelling is a nice one to use early on to test the approach. Governance. Architecture.

Data Analytics, Analytics, Data-driven, Finance

What Is Embedded Analytics?

Jet Global

MAY 1, 2023

that gathers data from many sources. Third-party data might include industry benchmarks, data feeds (such as weather and social media), and/or anonymized customer data. Four Approaches to Data Analytics: The world of data analytics is constantly and quickly changing. It’s all about context.

Analytics, Cost-Benefit, Visualization, Dashboards
