The landscape of big data management has been transformed by the rising popularity of open table formats such as Apache Iceberg, Apache Hudi, and Linux Foundation Delta Lake. These formats, designed to address the limitations of traditional data storage systems, have become essential in modern data architectures.
This post was co-written with Dipankar Mazumdar, Staff Data Engineering Advocate with AWS Partner OneHouse. Data architecture has evolved significantly to handle growing data volumes and diverse workloads. For more examples and references to other posts, refer to the following GitHub repository.
This post describes how HPE Aruba automated their Supply Chain management pipeline, and re-architected and deployed their data solution by adopting a modern data architecture on AWS. Each file arrives as a pair with a tail metadata file in CSV format containing the size and name of the file.
BladeBridge offers a comprehensive suite of tools that automate much of the complex conversion work, allowing organizations to quickly and reliably transition their data analytics capabilities to the scalable Amazon Redshift data warehouse. Amazon Redshift is a fully managed data warehouse service offered by Amazon Web Services (AWS).
Data architecture is a complex and varied field, and different organizations and industries have unique needs when it comes to their data architects. Solutions data architect: These individuals design and implement data solutions for specific business needs, including data warehouses, data marts, and data lakes.
This solution only replicates metadata in the Data Catalog, not the actual underlying data. To have a redundant data lake using Lake Formation and AWS Glue in an additional Region, we recommend replicating the Amazon S3-based storage using S3 replication, S3 sync, aws-s3-copy-sync-using-batch, or the S3 Batch Replication process.
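For a rough sense of what the simplest of these options (an S3 sync-style copy) looks like in code, here is a minimal boto3 sketch; the bucket names and prefix are placeholders, and for large data lakes the managed S3 Replication or S3 Batch Replication features mentioned above are usually the better fit.

```python
import boto3

# Placeholder bucket names; replace with your primary- and secondary-Region buckets.
SOURCE_BUCKET = "primary-region-data-lake"
DEST_BUCKET = "secondary-region-data-lake"

s3 = boto3.client("s3")

def sync_prefix(prefix: str = "") -> None:
    """Copy every object under `prefix` from the source bucket to the destination bucket."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=SOURCE_BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            # Server-side copy; note that objects larger than 5 GB need a multipart copy instead.
            s3.copy_object(
                Bucket=DEST_BUCKET,
                Key=obj["Key"],
                CopySource={"Bucket": SOURCE_BUCKET, "Key": obj["Key"]},
            )

if __name__ == "__main__":
    sync_prefix("raw/")
```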
They understand that a one-size-fits-all approach no longer works, and recognize the value in adopting scalable, flexible tools and open data formats to support interoperability in a modern data architecture to accelerate the delivery of new solutions.
SageMaker still includes all the existing ML and AI capabilities you’ve come to know and love for data wrangling, human-in-the-loop data labeling with Amazon SageMaker Ground Truth, experiments, MLOps, Amazon SageMaker HyperPod managed distributed training, and more. Having confidence in your data is key.
The typical notion is that enterprise architects and data (and metadata) architects are in opposite corners. At Avydium, we believe there’s an important middle ground where different architecture disciplines coexist, including enterprise, solution, application, data, metadata and technical architectures.
The Business Application Research Center (BARC) warns that data governance is a highly complex, ongoing program, not a “big bang initiative,” and it runs the risk of participants losing trust and interest over time. The program must introduce and support standardization of enterprise data.
While traditional extract, transform, and load (ETL) processes have long been a staple of data integration due to their flexibility, for common use cases such as replication and ingestion, they often prove time-consuming, complex, and less adaptable to the fast-changing demands of modern data architectures.
But most important of all, the assumed dormant value in the unstructured data is a question mark, which can only be answered after these sophisticated techniques have been applied. Therefore, there is a need to be able to analyze and extract value from the data economically and flexibly. The solution integrates data in three tiers.
Iceberg stores the metadata pointer for all the metadata files. When a SELECT query is reading an Iceberg table, the query engine first goes to the Iceberg catalog, then retrieves the entry of the location of the latest metadata file, as shown in the following diagram.
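As a loose illustration of that lookup flow, the following PyIceberg sketch resolves a table through a catalog and prints the metadata file the catalog currently points to; the catalog name `default` and the table `db.tbl` are hypothetical, and the catalog configuration (URI, warehouse) is assumed to already exist in your PyIceberg settings.

```python
from pyiceberg.catalog import load_catalog

# Hypothetical catalog and table names; adjust to your environment.
catalog = load_catalog("default")      # the Iceberg catalog holds the metadata pointer
table = catalog.load_table("db.tbl")   # resolves the pointer to the latest metadata file

# The location of the current metadata file the catalog points to.
print(table.metadata_location)

# A scan then uses that metadata (snapshots, manifests) to plan which data files to read.
print(table.scan().to_arrow())
```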
The complex challenge here is to have the lineage be intelligently updated as the data landscape and its processing shift and change daily across an enterprise. Active metadata will play a critical role in automating such updates as they arise. And you now have a frame of reference you can use to track its evolution.
First, you must understand the existing challenges of the data team, including the data architecture and end-to-end toolchain. Historic Balance – compares current data to previous or expected values. Monitoring Job Metadata. A DataOps implementation project consists of three steps.
They chose AWS Glue as their preferred data integration tool due to its serverless nature, low maintenance, ability to control compute resources in advance, and scale when needed. To share the datasets, they needed a way to share access to the data and access to catalog metadata in the form of tables and views.
The Analytics specialty practice of AWS Professional Services (AWS ProServe) helps customers across the globe with modern data architecture implementations on the AWS Cloud. The company wanted the ability to continue processing operational data in the secondary Region in the rare event of primary Region failure.
This service seamlessly integrates into your data architecture, allowing you to tap into the full potential of your data for informed decision-making. Data streaming technologies like Kinesis Data Streams are designed to efficiently process and manage continuous streams of data in real time at large scale.
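As a minimal sketch of producing into such a stream with boto3 (the stream name and event shape below are hypothetical, not from the referenced post):

```python
import json
import boto3

# Placeholder stream name; replace with your Kinesis data stream.
STREAM_NAME = "clickstream-events"

kinesis = boto3.client("kinesis")

def publish_event(event: dict, partition_key: str) -> None:
    """Write a single JSON event to the stream; records sharing a partition
    key land on the same shard, which preserves their relative order."""
    kinesis.put_record(
        StreamName=STREAM_NAME,
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=partition_key,
    )

publish_event({"user_id": "u-123", "action": "page_view"}, partition_key="u-123")
```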
At the same time, they need to optimize operational costs to unlock the value of this data for timely insights, and do so with consistent performance. With this massive data growth, data proliferation across your data stores, data warehouse, and data lakes can become equally challenging. Example Corp.
Organizations have multiple Hive data warehouses across EMR clusters, where the metadata gets generated. To address this challenge, organizations can deploy a data mesh using AWS Lake Formation that connects the multiple EMR clusters. An entity can act both as a producer of data assets and as a consumer of data assets.
In fact, we recently announced the integration with our cloud ecosystem, bringing the benefits of Iceberg to enterprises as they make their journey to the public cloud, and as they adopt more converged architectures like the Lakehouse. 1: Multi-function analytics. 3: Open Performance.
Transition from Navigator by migrating the business metadata (tags, entity names, custom properties, descriptions) and technical metadata (Hive, Spark, HDFS, Impala) to Atlas. This allowed them to enable a modern data architecture, enhance their streaming capabilities and prepare for the next phase of the CDP Journey.
Figure 1: Improvement in total runtime of TPC-DS 3T workload. The following graph presents the top queries from the TPC-DS benchmark with the greatest performance improvement over the last year, with and without AWS Glue Data Catalog column statistics. This can have a significant impact on overall query performance.
In a two-part series, we talk about Swisscom’s journey of automating Amazon Redshift provisioning as part of the Swisscom ODP solution using the AWS Cloud Development Kit (AWS CDK), and we provide code snippets and other useful references.
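The post's actual CDK code is not reproduced here, but a minimal AWS CDK (Python) sketch of provisioning a Redshift cluster might look like the following; all names, sizing, and the plaintext password placeholder are illustrative only.

```python
from aws_cdk import App, Stack
from aws_cdk import aws_redshift as redshift
from constructs import Construct

class RedshiftStack(Stack):
    """Provisions a small Amazon Redshift cluster; all values are illustrative."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        redshift.CfnCluster(
            self,
            "AnalyticsCluster",
            cluster_type="multi-node",
            number_of_nodes=2,
            node_type="ra3.xlplus",
            db_name="analytics",
            master_username="admin",
            master_user_password="ChangeMe123!",  # placeholder; source from Secrets Manager in practice
        )

app = App()
RedshiftStack(app, "RedshiftDemoStack")
app.synth()
```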
In this post, we are excited to summarize the features that the AWS Glue Data Catalog, AWS Glue crawler, and Lake Formation teams delivered in 2022. Whether you are a data platform builder, data engineer, data scientist, or any technology leader interested in data lake solutions, this post is for you.
For more information on this foundation, refer to A Detailed Overview of the Cost Intelligence Dashboard. It seamlessly consolidates data from various data sources within AWS, including AWS Cost Explorer (and forecasting with Cost Explorer), AWS Trusted Advisor, and AWS Compute Optimizer.
Refer to How can I access OpenSearch Dashboards from outside of a VPC using Amazon Cognito authentication for a detailed evaluation of the available options and the corresponding pros and cons. For more information, refer to the AWS CDK v2 Developer Guide. For instructions, refer to Creating a public hosted zone.
They store attributes such as object size, total time, turn-around time, and HTTP referer for log records. AWS Glue Data Catalog stores information as metadata tables, where each table specifies a single data store. Running the crawler on a schedule updates AWS Glue Data Catalog with new partitions and metadata.
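A minimal boto3 sketch of creating such a scheduled crawler might look like this; the crawler name, IAM role ARN, database name, and S3 path are all placeholders.

```python
import boto3

glue = boto3.client("glue")

# All names, ARNs, and paths below are placeholders.
glue.create_crawler(
    Name="access-logs-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="access_logs_db",
    Targets={"S3Targets": [{"Path": "s3://my-log-bucket/access-logs/"}]},
    # Run hourly so new partitions and their metadata are picked up automatically.
    Schedule="cron(0 * * * ? *)",
)

glue.start_crawler(Name="access-logs-crawler")
```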
If the asset has AWS Glue Data Quality enabled, you can now quickly visualize the data quality score directly in the catalog search pane. By selecting the corresponding asset, you can understand its content through the readme, glossary terms, and technical and business metadata. option("header", "true").option("inferSchema", "true")
The staging layer represents data intake and light data transformations and enhancements that occur before the data comes to its more permanent resting place, the raw Data Vault (RDV). The RDV holds the historized copy of all of the data from multiple source systems.
Independent data products often only have value if you can connect them, join them, and correlate them to create a higher order data product that creates additional insights. A modern data architecture is critical in order to become a data-driven organization.
First off, this involves defining workflows for every business process within the enterprise: the what, how, why, who, when, and where aspects of data. These regulations, ultimately, ensure key business values: data consistency, quality, and trustworthiness.
A modern data architecture enables companies to ingest virtually any type of data through automated pipelines into a data lake, which provides highly durable and cost-effective object storage at petabyte or exabyte scale.
Kinesis Data Streams has native integrations with other AWS services such as AWS Glue and Amazon EventBridge to build real-time streaming applications on AWS. Refer to Amazon Kinesis Data Streams integrations for additional details. To access your data from Timestream, you need to install the Timestream plugin for Grafana.
It is a replicated, highly available service that is responsible for managing the metadata for all objects stored in Ozone. As Ozone scales to exabytes of data, it is important to ensure that Ozone Manager can perform at scale. Cisco has multiple reference architectures for running Ozone.
Cost and resource efficiency – This is an area where Acast observed a reduction in data duplication, and therefore a reduction in cost (in some accounts, eliminating 100% of the duplicated data), by reading data across accounts while still being able to scale. In this approach, teams responsible for generating data are referred to as producers.
For information about how to configure the producer for connectivity, refer to IAM access control. He works with enterprise FSI customers and is primarily specialized in machine learning and data architectures. The solution described serves as a simple API Gateway-to-Kafka proxy, accepting requests and forwarding them to a Kafka topic.
Having an accurate and up-to-date inventory of all technical assets helps an organization ensure it can keep track of all its resources with metadata information such as their assigned owners, last updated date, who uses them, how frequently, and more. This is a guest blog post co-written with Corey Johnson from Huron.
Limiting growth by (data integration) complexity: Most operational IT systems in an enterprise have been developed to serve a single business function, and they use the simplest possible model for this. In both cases, semantic metadata is the glue that turns knowledge graphs into hubs of data, metadata, and content.
To learn more about RAG, refer to Question answering using Retrieval Augmented Generation with foundation models in Amazon SageMaker JumpStart. Without RAG, a generative AI application can only produce generic responses based on its training data; RAG grounds those responses in the relevant documents in the knowledge base.
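As a rough illustration of that retrieve-then-augment flow, here is a minimal, self-contained Python sketch; the in-memory document list, the keyword-overlap retriever, and all names are stand-ins for a real vector store and foundation model endpoint, not anything from the referenced post.

```python
from typing import List

# Toy in-memory "knowledge base"; in practice this would be a vector store.
DOCUMENTS = [
    "Amazon SageMaker JumpStart provides prebuilt foundation models.",
    "Retrieval Augmented Generation grounds model answers in retrieved documents.",
    "Our internal travel policy allows economy-class flights under 6 hours.",
]

def retrieve(question: str, k: int = 2) -> List[str]:
    """Rank documents by naive keyword overlap with the question
    (a placeholder for embedding-based similarity search)."""
    q_terms = set(question.lower().split())
    scored = sorted(
        DOCUMENTS,
        key=lambda d: len(q_terms & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(question: str) -> str:
    """Augment the prompt with retrieved context; the result would then be
    sent to whichever foundation model endpoint you use."""
    context = "\n".join(retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}\nAnswer:"

print(build_prompt("What does the travel policy say about flights?"))
```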
For more information about performance improvement capabilities, refer to the list of announcements below. At AWS re:Invent 2023, we introduced more performance enhancements in query planning and execution such as enhanced bloom filters, query rewrites, and support for write operations in auto scaling.
In most enterprises, data teams lack a data map and data asset inventory, and are often unaware of data that exists across the organization, its associated profile, quality, and associated metadata. Teams can’t access data to build their business use cases. For example, a product data tag is basic metadata.
For detailed instructions on how to set up AWS DMS to replicate data, refer to the appendix at the end of this post. In the following sections, we showcase how to configure an AWS Glue Data Quality job for comparison. Refer to AWS Glue connection properties for additional details. Choose Create connection. AWS Glue 4.0
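As a hedged sketch of defining such checks with boto3 and DQDL (the database, table, and rule names below are placeholders, not the post's actual configuration):

```python
import boto3

glue = boto3.client("glue")

# Placeholder DQDL rules for the table replicated by AWS DMS.
ruleset = """
Rules = [
    RowCount > 0,
    IsComplete "order_id",
    IsUnique "order_id"
]
"""

glue.create_data_quality_ruleset(
    Name="dms-target-checks",
    Description="Basic checks on the table replicated by AWS DMS",
    Ruleset=ruleset,
    TargetTable={"DatabaseName": "replica_db", "TableName": "orders"},
)
```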
Mark: The first element in the process is the link between the source data and the entry point into the data platform. At Ramsey International (RI), we refer to that layer in the architecture as the foundation, but others call it a staging area, raw zone, or even a source data lake. What is a data fabric?