Data Analytics, Data Processing and Metadata

Modernize your legacy databases with AWS data lakes, Part 2: Build a data lake using AWS DMS data on Apache Iceberg

AWS Big Data

OCTOBER 30, 2024

… format(dbname, table_name)) except Exception as ex: print(ex) failed_table = {"table_name": table_name, "Reason": ex} unprocessed_tables.append(failed_table) … def get_table_key(host, port, username, password, dbname): jdbc_url = "jdbc:sqlserver://{0}:{1};databaseName={2}".format(host, … To start the job, choose Run. … format(dbname)).config("spark.sql.catalog.glue_catalog.catalog-impl", …

Data Lake

Data Lake Data Processing Optimization Machine Learning

How EUROGATE established a data mesh architecture using Amazon DataZone

AWS Big Data

JANUARY 15, 2025

In addition to real-time analytics and visualization, the data needs to be shared for long-term data analytics and machine learning applications. From here, the metadata is published to Amazon DataZone by using AWS Glue Data Catalog. This process is shown in the following figure.

IoT

IoT Machine Learning Metadata Data-driven

Gartner Data & Analytics Summit – May 12-13 in London

Octopai

MAY 4, 2025

Octopai's CEO will be hosting a session on leveraging metadata to build a knowledge hub for successful enterprise AI implementation. If you are attending the event, visit us and learn more about how Cloudera and Octopai are leading the data management revolution.

Data Analytics

Data Analytics Metadata Data Processing Analytics

Use Amazon Kinesis Data Streams to deliver real-time data to Amazon OpenSearch Service domains with Amazon OpenSearch Ingestion

AWS Big Data

NOVEMBER 11, 2024

The IAM role ARN must be the same for both the OpenSearch Service sink definition and the Kinesis Data Streams source definition. You can control what data gets indexed in different indexes using the index definition in the sink.

Metadata

Metadata Metrics Analytics Data Processing

How ANZ Institutional Division built a federated data platform to enable their domain teams to build data products to support business outcomes

AWS Big Data

DECEMBER 4, 2024

Institutional Data & AI Platform architecture: The Institutional Division has implemented a self-service data platform to enable the domain teams to build and manage data products autonomously. The following diagram illustrates the building blocks of the Institutional Data & AI Platform.

Metadata

Metadata Data Governance Data Quality Data-driven

Reduce your compute costs for stream processing applications with Kinesis Client Library 3.0

AWS Big Data

NOVEMBER 6, 2024

Load balancing challenges with operating custom stream processing applications: Customers processing real-time data streams typically use multiple compute hosts such as Amazon Elastic Compute Cloud (Amazon EC2) to handle the high throughput in parallel. KCL uses DynamoDB to store metadata such as shard-worker mapping and checkpoints.

Cost-Benefit

Cost-Benefit Metadata Optimization Publishing
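The shard-worker mapping and checkpoints mentioned in the Kinesis Client Library excerpt above can be pictured with a small in-memory sketch. This is only an illustration under stated assumptions: the `Lease` class, `assign_shards`, and `record_checkpoint` are hypothetical names, a Python dict stands in for the DynamoDB table, and the round-robin assignment is a simple placeholder rather than KCL's actual balancing algorithm.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Lease:
    """One row of a hypothetical lease table: which worker owns a shard."""
    worker: str
    checkpoint: Optional[str] = None  # last processed sequence number, if any


def assign_shards(shards: List[str], workers: List[str]) -> Dict[str, Lease]:
    """Spread shards over workers round-robin (placeholder strategy)."""
    if not workers:
        raise ValueError("at least one worker is required")
    return {
        shard: Lease(worker=workers[i % len(workers)])
        for i, shard in enumerate(sorted(shards))
    }


def record_checkpoint(table: Dict[str, Lease], shard: str,
                      worker: str, sequence_number: str) -> None:
    """Store progress for a shard; only the owning worker may checkpoint."""
    lease = table[shard]
    if lease.worker != worker:
        raise PermissionError(f"{worker} does not own {shard}")
    lease.checkpoint = sequence_number


if __name__ == "__main__":
    table = assign_shards(["shard-0", "shard-1", "shard-2", "shard-3"],
                          ["worker-a", "worker-b"])
    record_checkpoint(table, "shard-0", "worker-a", "1001")
    for shard, lease in sorted(table.items()):
        print(shard, lease.worker, lease.checkpoint)
```

Keeping ownership and progress in one shared table is what lets a replacement worker resume a shard from its last checkpoint instead of reprocessing it from the start.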

CIOs are (still) closer than ever to their dream data lakehouse

CIO Business Intelligence

OCTOBER 15, 2024

“The data catalog is critical because it’s where business manages its metadata,” said Venkat Rajaji, Senior Vice President of Product Management at Cloudera. “There’s been a ton of innovation lately around the Iceberg REST catalog because the data turf war is over. But the metadata turf war is just getting started.”

Metadata

Metadata Data Processing Uncertainty Data Warehouse

Data confidence begins at the edge

CIO Business Intelligence

SEPTEMBER 23, 2024

For sectors such as industrial manufacturing and energy distribution, metering, and storage, embracing artificial intelligence (AI) and generative AI (GenAI) along with real-time data analytics, instrumentation, automation, and other advanced technologies is the key to meeting the demands of an evolving marketplace, but it’s not without risks.

Manufacturing

Manufacturing Internet of Things Metadata Risk

Modernize a legacy real-time analytics application with Amazon Managed Service for Apache Flink

AWS Big Data

OCTOBER 11, 2023

In this post, we discuss ways to modernize your legacy, on-premises, real-time analytics architecture to build serverless data analytics solutions on AWS using Amazon Managed Service for Apache Flink. It shows a call center streaming data source that sends the latest call center feed every 15 seconds.

Management

Management Metadata Analytics Dashboards

How Cargotec uses metadata replication to enable cross-account data sharing

AWS Big Data

JUNE 7, 2023

They chose AWS Glue as their preferred data integration tool due to its serverless nature, low maintenance, ability to control compute resources in advance, and scale when needed. To share the datasets, they needed a way to share access to the data and access to catalog metadata in the form of tables and views.

Metadata

Metadata Data Lake Machine Learning Big Data

Design a data mesh pattern for Amazon EMR-based data lakes using AWS Lake Formation with Hive metastore federation

AWS Big Data

JUNE 10, 2024

Organizations have multiple Hive data warehouses across EMR clusters, where the metadata gets generated. To address this challenge, organizations can deploy a data mesh using AWS Lake Formation that connects the multiple EMR clusters. An entity can act both as a producer of data assets and as a consumer of data assets.

Data Lake

Data Lake Metadata Data Warehouse Data Processing

What you need to know about product management for AI

O'Reilly on Data

MARCH 31, 2020

But there’s a host of new challenges when it comes to managing AI projects: more unknowns, non-deterministic outcomes, new infrastructures, new processes and new tools. You might have millions of short videos, with user ratings and limited metadata about the creators or content.

Management

Management Machine Learning Experimentation Metrics

Modernize your ETL platform with AWS Glue Studio: A case study from BMS

AWS Big Data

DECEMBER 13, 2023

For the past 5 years, BMS has used a custom framework called Enterprise Data Lake Services (EDLS) to create ETL jobs for business users. BMS’s EDLS platform hosts over 5,000 jobs and is growing at 15% YoY (year over year). It retrieves the specified files and available metadata to show on the UI.

Metadata

Metadata Data Lake Visualization Data Transformation

How Amazon Finance Automation built a data mesh to support distributed data ownership and centralize governance

AWS Big Data

JULY 14, 2023

The FinAuto team built AWS Cloud Development Kit (AWS CDK), AWS CloudFormation, and API tools to maintain a metadata store that ingests from domain owner catalogs into the global catalog. This global catalog captures new or updated partitions from the data producer AWS Glue Data Catalogs.

Finance

Finance Metadata Big Data Recreation/Entertainment

Build SAML identity federation for Amazon OpenSearch Service domains within a VPC

AWS Big Data

FEBRUARY 7, 2024

Create an Amazon Route 53 public hosted zone such as mydomain.com to be used for routing internet traffic to your domain. For instructions, refer to Creating a public hosted zone. Request an AWS Certificate Manager (ACM) public certificate for the hosted zone. hosted_zone_id – The Route 53 public hosted zone ID.

Dashboards

Dashboards Data Processing Metadata Consulting

Gartner Data & Analytics Summit 2022 in London: 3 Key Takeaways

Alation

MAY 19, 2022

Establish what data you have. Active metadata gives you crucial context around what data you have and how to use it wisely. Active metadata provides the who, what, where, and when of a given asset, showing you where it flows through your pipeline, how that data is used, and who uses it most often.

Data Analytics

Data Analytics Metadata Analytics Data Governance

Orchestrate an end-to-end ETL pipeline using Amazon S3, AWS Glue, and Amazon Redshift Serverless with Amazon MWAA

AWS Big Data

APRIL 25, 2024

Cross-account access has been set up between S3 buckets in Account A with resources in Account B to be able to load and unload data. In the second account, Amazon MWAA is hosted in one VPC and Redshift Serverless in a different VPC, which are connected through VPC peering. … secretsmanager). … redshift-serverless.amazonaws.com:5439? …

Metadata

Metadata Data Processing Management Testing

Business Intelligence for Fairs, Congresses and Exhibitions

Smart Data Collective

APRIL 14, 2021

This is one of the most developed business intelligence tools in the market that comes packed with high-end data analytics and visualization features. This high-end data visualization makes data exploration more accessible to end-users. It also comes with data caching capabilities that enable fast querying.

Business Intelligence

Business Intelligence Dashboards Visualization Big Data

Foote Partners: bonus disparities reveal tech skills most in demand in Q3

CIO Business Intelligence

DECEMBER 16, 2022

The top-earning skills were big data analytics and Ethereum, with a pay premium of 20% of base salary, both up 5.3% … Other non-certified skills attracting a pay premium of 19% included data engineering, the Zachman Framework, Azure Key Vault, and site reliability engineering (SRE). … in the previous six months. … since March.

Testing

Testing Metadata Data Processing Machine Learning

How ZS built a clinical knowledge repository for semantic search using Amazon OpenSearch Service and Amazon Neptune

AWS Big Data

SEPTEMBER 12, 2024

We use leading-edge analytics, data, and science to help clients make intelligent decisions. We developed and host several applications for our customers on Amazon Web Services (AWS). These embeddings, along with metadata such as the document ID and page number, are stored in OpenSearch Service.

Unstructured Data

Unstructured Data Metadata Machine Learning Consulting

Preprocess and fine-tune LLMs quickly and cost-effectively using Amazon EMR Serverless and Amazon SageMaker

AWS Big Data

FEBRUARY 1, 2024

The Common Crawl corpus contains petabytes of data, regularly collected since 2008, and contains raw webpage data, metadata extracts, and text extracts. In addition to determining which dataset should be used, cleansing and processing the data to the fine-tuning’s specific need is required.

Metadata

Metadata Modeling Data Processing Unstructured Data

Habib Bank manages data at scale with Cloudera Data Platform

Cloudera

NOVEMBER 17, 2022

While Cloudera CDH was already a success story at HBL, in 2022, HBL identified the need to move its customer data centre environment from Cloudera’s CDH to Cloudera Data Platform (CDP) Private Cloud to accommodate growing volumes of data. … and primarily served regulatory reporting and internal analytics requirements.

Management

Management Data Lake Consulting Unstructured Data

Run Apache Hive workloads using Spark SQL with Amazon EMR on EKS

AWS Big Data

OCTOBER 18, 2023

FINRA centralizes all its data in Amazon Simple Storage Service (Amazon S3) with a remote Hive metastore on Amazon Relational Database Service (Amazon RDS) to manage their metadata information. … host') export PASSWORD=$(aws secretsmanager get-secret-value --secret-id $secret_name --query SecretString --output text | jq -r '.password')

Big Data

Big Data Data Processing Interactive Testing

Create an end-to-end data strategy for Customer 360 on AWS

AWS Big Data

MARCH 26, 2024

Profile aggregation – When you’ve uniquely identified a customer, you can build applications in Managed Service for Apache Flink to consolidate all their metadata, from name to interaction history. Then, you transform this data into a concise format.

Data Strategy

Data Strategy Strategy Data Warehouse Prescriptive Analytics

Enrich your serverless data lake with Amazon Bedrock

AWS Big Data

SEPTEMBER 26, 2024

Surfacing relevant information to end-users in a concise and digestible format is crucial for maximizing the value of data assets. Automatic document summarization, natural language processing (NLP), and data analytics powered by generative AI present innovative solutions to this challenge. Run sam delete from CloudShell.

Data Lake

Data Lake Cost-Benefit Unstructured Data Modeling

Ontotext Invents the Universe So You Don’t Need To

Ontotext

NOVEMBER 22, 2020

Ontotext is also on the list of vendors supporting knowledge graph capabilities in their “2021 Planning Guide for Data Analytics and Artificial Intelligence” report. Content Enrichment and Metadata Management. The value of metadata for content providers is well-established. Developer-Friendly Semantic Technology.

Metadata

Metadata Cost-Benefit Unstructured Data Technology

Design a data mesh on AWS that reflects the envisioned organization

AWS Big Data

JANUARY 22, 2024

Data as a product: Treating data as a product entails three key components: the data itself, the metadata, and the associated code and infrastructure. In this approach, teams responsible for generating data are referred to as producers. Srikant Das is an Acceleration Lab Solutions Architect at Amazon Web Services.

Data-driven

Data-driven Advertising Metadata Data Architecture

How smava makes loans transparent and affordable using Amazon Redshift Serverless

AWS Big Data

DECEMBER 21, 2023

To ingest the data, smava uses a set of popular third-party customer data platforms complemented by custom scripts. After the data lands in Amazon S3, smava uses the AWS Glue Data Catalog and crawlers to automatically catalog the available data, capture the metadata, and provide an interface that allows querying all data assets.

Data Lake

Data Lake Data Warehouse Data-driven B2B

How Swisscom automated Amazon Redshift as part of their One Data Platform solution using AWS CDK – Part 1

AWS Big Data

JUNE 12, 2024

Swisscom’s Data, Analytics, and AI division is building a One Data Platform (ODP) solution that will enable every Swisscom employee, process, and product to benefit from the massive value of Swisscom’s data. Swisscom is a leading telecommunications provider in Switzerland.

Data Architecture

Data Architecture Cost-Benefit Data-driven Experimentation

Alation Joins the HPE Pathfinder Club

Alation

JUNE 21, 2022

Hosting an entire data environment in the cloud is costly and unsustainable. Alation makes the most valuable data the most visible, adding crucial context so people can understand it quickly. That rich metadata fuels a range of use cases, from governance to data democratization initiatives.

Metadata

Metadata Digital Transformation Cost-Benefit Data Governance

Introducing Amazon MWAA support for the Airflow REST API and web server auto scaling

AWS Big Data

MAY 16, 2024

With the new REST API, you can now invoke DAG runs, manage datasets, or get the status of Airflow’s metadata database, triggerer, and scheduler, all without relying on the Airflow web UI or CLI. Args: region (str): AWS region where the MWAA environment is hosted.

Testing

Testing Metrics Interactive Management

Gain insights from historical location data using Amazon Location Service and AWS analytics services

AWS Big Data

MARCH 13, 2024

Data analytics – Business analysts gather operational insights from multiple data sources, including the location data collected from the vehicles. Athena is used to run geospatial queries on the location data stored in the S3 buckets. The ingestion approach is not in scope of this post. Choose Run.

Analytics

Analytics IoT Metadata Internet of Things

Announcing the 2021 Data Impact Awards

Cloudera

MAY 12, 2021

2020 saw us hosting our first ever fully digital Data Impact Awards ceremony, and it certainly was one of the highlights of our year. We saw a record number of entries and incredible examples of how customers were using Cloudera’s platform and services to unlock the power of data. SECURITY AND GOVERNANCE LEADERSHIP.

Digital Transformation

Digital Transformation Machine Learning Optimization Data Lake

Setting up and Getting Started with Cloudera’s New SQL AI Assistant

Cloudera

JANUARY 19, 2024

Supported AI models and services: The SQL AI Assistant is not bundled with a specific LLM; instead, it supports various LLMs and hosting services. The model can run locally, be hosted on CML infra or in the infrastructure of a trusted service provider. You must have an AWS account with Bedrock access before following these steps.

Data Warehouse

Data Warehouse Data Processing Optimization Modeling

Orca Security’s journey to a petabyte-scale data lake with Apache Iceberg and AWS Analytics

AWS Big Data

JULY 20, 2023

The system ingests data from various sources such as cloud resources, cloud activity logs, and API access logs, and processes billions of messages, resulting in terabytes of data daily. This data is sent to Apache Kafka, which is hosted on Amazon Managed Streaming for Apache Kafka (Amazon MSK).

Data Lake

Data Lake Analytics Snapshot Data Quality

The Gartner 2021 Leadership Vision for Data & Analytics Leaders Webinar Q&A

Andrew White

JANUARY 11, 2021

On January 4th I had the pleasure of hosting a webinar. It was titled, The Gartner 2021 Leadership Vision for Data & Analytics Leaders. This was for the Chief Data Officer, or head of data and analytics. We cannot of course forget metadata management tools, of which there are many different.

Data Analytics

Data Analytics Analytics Data-driven Finance

The new challenges of scale: What it takes to go from PB to EB data scale

CIO Business Intelligence

JUNE 14, 2023

Additionally, it is vital to be able to execute computing operations on the 1000+ PB within a multi-parallel processing distributed system, considering that the data remains dynamic, constantly undergoing updates, deletions, movements, and growth.

Unstructured Data

Unstructured Data IT Manufacturing Visualization

Simplify data loading into Type 2 slowly changing dimensions in Amazon Redshift

AWS Big Data

MARCH 9, 2023

During each data load, incoming change records are matched against existing active records, comparing each attribute value to determine whether existing records have changed or were deleted or are new records coming in. SCD2 metadata – rec_eff_dt and rec_exp_dt indicate the state of the record.

Slice and Dice

Slice and Dice Data Warehouse Metrics Metadata
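The Type 2 matching step described in the Redshift excerpt above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the article's Redshift SQL: `apply_scd2` and `OPEN_END` are hypothetical names, each load is assumed to be a full snapshot (so a key missing from the load counts as deleted), and only the `rec_eff_dt` and `rec_exp_dt` column names come from the excerpt.

```python
from datetime import date

# Assumed sentinel meaning "this version is still active".
OPEN_END = date(9999, 12, 31)
SCD_COLUMNS = ("rec_eff_dt", "rec_exp_dt")


def apply_scd2(dimension, incoming, key, load_date):
    """Return a new dimension (list of dicts) after one SCD Type 2 load."""
    incoming_by_key = {row[key]: row for row in incoming}
    result, active_keys = [], set()

    for rec in dimension:
        if rec["rec_exp_dt"] != OPEN_END:
            result.append(rec)  # already-expired history stays untouched
            continue
        active_keys.add(rec[key])
        new = incoming_by_key.get(rec[key])
        attrs = {c: v for c, v in rec.items()
                 if c != key and c not in SCD_COLUMNS}
        if new is None:
            # Deleted at the source: close the active version.
            result.append({**rec, "rec_exp_dt": load_date})
        elif all(new.get(c) == v for c, v in attrs.items()):
            result.append(rec)  # unchanged: keep the active version
        else:
            # Changed: expire the old version and open a new one.
            result.append({**rec, "rec_exp_dt": load_date})
            result.append({**new, "rec_eff_dt": load_date,
                           "rec_exp_dt": OPEN_END})

    for k, new in incoming_by_key.items():
        if k not in active_keys:
            # Brand-new key: open its first version.
            result.append({**new, "rec_eff_dt": load_date,
                           "rec_exp_dt": OPEN_END})
    return result


if __name__ == "__main__":
    dim = [{"id": 1, "city": "Oslo",
            "rec_eff_dt": date(2023, 1, 1), "rec_exp_dt": OPEN_END}]
    load = [{"id": 1, "city": "Bergen"}, {"id": 2, "city": "Turku"}]
    for row in apply_scd2(dim, load, "id", date(2023, 3, 9)):
        print(row)
```

Expiring a row rather than overwriting it is what preserves history: every past state of a key remains queryable by its effective and expiry dates.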

Themes and Conferences per Pacoid, Episode 10

Domino Data Lab

JUNE 2, 2019

She had much to say to leaders of data science teams, coming from perspectives of data engineering at scale. And by “scale” I’m referring to what is arguably the largest, most successful data analytics operation in the cloud of any public firm that isn’t a cloud provider. Rev 2 wrap up. See you at Rev 3 in 2020!

Data Science

Data Science Data-driven Machine Learning Modeling

Dancing with Elephants in 5 Easy Steps

Cloudera

AUGUST 21, 2020

There are now tens of thousands of instances of these Big Data platforms running in production around the world today, and the number is increasing every year. Many of them are increasingly deployed outside of traditional data centers in hosted, “cloud” environments. Streaming data analytics.

Big Data

Big Data Cost-Benefit ROI Risk

Build incremental data pipelines to load transactional data changes using AWS DMS, Delta 2.0, and Amazon EMR Serverless

AWS Big Data

MARCH 3, 2023

You can then apply transformations and store data in Delta format for managing inserts, updates, and deletes. Amazon EMR Serverless is a serverless option in Amazon EMR that makes it easy for data analysts and engineers to run open-source big data analytics frameworks without configuring, managing, and scaling clusters or servers.

Data Lake

Data Lake Dashboards Metrics Metadata

Use your corporate identities for analytics with Amazon EMR and AWS IAM Identity Center

AWS Big Data

APRIL 26, 2024

… int '2' 'InstanceType': 'Ref': 'ClusterInstanceType' 'Market': 'ON_DEMAND' 'Name': 'Core' 'Outputs': 'ClusterId': 'Value': 'Ref': 'EmrCluster' 'Description': 'The ID of the EMR cluster' 'Metadata': 'AWS::CloudFormation::Designer': {} 'Rules': {} … Trusted identity propagation is supported from Amazon EMR 6.15

Analytics

Analytics Data Lake Management Enterprise

The Gartner 2022 Leadership Vision for Data and Analytics Leaders Questions and Answers

Andrew White

JANUARY 9, 2022

On Thursday, January 6th, I hosted Gartner’s 2022 Leadership Vision for Data and Analytics webinar. To drive a successful Data Analytics strategy, do you think it is a multidisciplinary activity and, if so, what additional roles would you expect to see involved? We write about data and analytics.

Analytics

Analytics Measurement Data-driven Modeling

How to build a safe path to AI in Healthcare

CIO Business Intelligence

AUGUST 5, 2024

About the author: Jay Nambiar is chief technology officer, healthcare, and Prashant Renu is vice president, AI solutions, healthcare at EXL, a leading data analytics and digital operations and solutions company. To learn more, visit us here.

Experimentation

Experimentation Risk Metadata Data-driven
