For container terminal operators, data-driven decision-making and efficient data sharing are vital to optimizing operations and boosting supply chain efficiency. Two use cases illustrate how this can be applied for business intelligence (BI) and data science applications, using AWS services such as Amazon Redshift and Amazon SageMaker.
Learn more about the impacts of global data sharing in this blog, The Ethics of Data Exchange. Before we jump into the data ingestion step, here is a quick overview of how Ozone manages its metadata namespace through volumes, buckets, and keys. Data ingestion then goes through Ozone's S3-compatible interface with boto3, as sketched below.
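A minimal sketch of that ingestion path, assuming an Ozone S3 Gateway is reachable; the endpoint URL, credentials, and bucket name below are placeholders, not values from the original post:

```python
import boto3

# Ingest data into Ozone through its S3-compatible gateway.
# Endpoint, credentials, and bucket name are illustrative placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="http://ozone-s3g.example.com:9878",  # hypothetical S3 Gateway address
    aws_access_key_id="ozone-access-key",              # placeholder credentials
    aws_secret_access_key="ozone-secret-key",
)

# In Ozone's namespace, buckets live under volumes; buckets reached through
# the S3 interface map onto a designated volume by default.
s3.create_bucket(Bucket="ingest-bucket")
s3.put_object(Bucket="ingest-bucket", Key="raw/events.json", Body=b'{"event": "demo"}')

for obj in s3.list_objects_v2(Bucket="ingest-bucket").get("Contents", []):
    print(obj["Key"], obj["Size"])
```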
As you experience the benefits of consolidating your data governance strategy on top of Amazon DataZone, you may want to extend its coverage to new, diverse data repositories (either self-managed or as managed services) including relational databases, third-party data warehouses, analytic platforms and more.
In other words, using metadata about data science work to generate code. In this case, code gets generated for data preparation, where so much of the "time and labor" in data science work is concentrated. BTW, videos for Rev2 are up: [link]. On deck this time 'round the Moon: program synthesis.
The Amazon Sustainability Data Initiative (ASDI) uses the capabilities of Amazon S3 to provide a no-cost solution for you to store and share climate science workloads across the globe. Amazon's Open Data Sponsorship Program allows organizations to host their datasets free of charge on AWS, and those public datasets can be read without an AWS account, as sketched below.
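A minimal sketch of reading one such public dataset anonymously with boto3; the NOAA GHCN bucket named here is one example from the Open Data registry, and any public dataset can be listed the same way:

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Unsigned requests let you read public Open Data buckets without credentials.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

resp = s3.list_objects_v2(Bucket="noaa-ghcn-pds", MaxKeys=5)
for obj in resp.get("Contents", []):
    print(obj["Key"])
```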
They chose AWS Glue as their preferred data integration tool due to its serverless nature, low maintenance, ability to provision compute resources in advance, and ability to scale when needed. To share the datasets, they needed a way to share both access to the data and access to the catalog metadata in the form of tables and views; one common mechanism is sketched below.
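A hedged sketch of one such mechanism, granting a consumer account access to a Glue table through AWS Lake Formation; the account ID, database, and table names are placeholders, and the article does not confirm this exact approach:

```python
import boto3

lf = boto3.client("lakeformation")

# Grant a consumer account read access to a cataloged table. The consumer
# then sees the table metadata and can query the underlying data.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "111122223333"},  # placeholder consumer account
    Resource={"Table": {"DatabaseName": "shared_db", "Name": "orders"}},
    Permissions=["SELECT", "DESCRIBE"],
)
```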
The FinAuto team built tools with the AWS Cloud Development Kit (AWS CDK), AWS CloudFormation, and APIs to maintain a metadata store that ingests from domain owner catalogs into the global catalog. This global catalog captures new or updated partitions from the data producers' AWS Glue Data Catalogs, along the lines of the sketch below.
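A minimal sketch of detecting recently created partitions in a producer's Glue Data Catalog; the database and table names are placeholders, and FinAuto's actual implementation is not public:

```python
import boto3
from datetime import datetime, timedelta, timezone

glue = boto3.client("glue")
cutoff = datetime.now(timezone.utc) - timedelta(days=1)

# Page through all partitions and pick out ones created since the cutoff.
paginator = glue.get_paginator("get_partitions")
for page in paginator.paginate(DatabaseName="producer_db", TableName="transactions"):
    for part in page["Partitions"]:
        if part["CreationTime"] >= cutoff:
            print("new partition:", part["Values"])  # ingest into the global catalog here
```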
Others aim simply to manage the collection and integration of data, leaving the analysis and presentation work to other tools that specialize in datascience and statistics. Lately a cousin of DMP has evolved, called the customer data platform (CDP). Some DMPs specialize in producing reports with elaborate infographics.
This data supports all kinds of use cases within organizations, from helping production analysts understand how production is progressing, to allowing research scientists to look at the results of a set of treatments across different trials and cross-sections of the population.
Co-chair Paco Nathan provides highlights of Rev 2, a data science leaders summit. We held Rev 2 May 23-24 in NYC, as the place where "data science leaders and their teams come to learn from each other." If you lead a data science team/org, DM me and I'll send you an invite to data-head.slack.com.
Each service is hosted in a dedicated AWS account and is built and maintained by a product owner and a development team. The business end-users were given a tool to discover data assets produced within the mesh and seamlessly self-serve on their data sharing needs.
The top three items are essentially "the devil you know" for firms which want to invest in data science: data platform, integration, data prep. Data governance shows up as the fourth-most-popular kind of solution that enterprise teams were adopting or evaluating during 2019. Rinse, lather, repeat.
The Common Crawl corpus contains petabytes of data, regularly collected since 2008, including raw webpage data, metadata extracts, and text extracts. In addition to determining which dataset should be used, the data must be cleansed and processed for the specific needs of fine-tuning, as sketched below.
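A minimal sketch of one cleansing step, assuming the warcio library: stream a Common Crawl WARC file and keep only successful HTML responses. The WARC path below is a placeholder; real paths are listed in each crawl's warc.paths.gz index.

```python
import requests
from warcio.archiveiterator import ArchiveIterator

# Placeholder path; substitute an entry from the crawl's warc.paths.gz listing.
warc_path = "crawl-data/CC-MAIN-2024-10/segments/EXAMPLE/warc/EXAMPLE.warc.gz"
url = "https://data.commoncrawl.org/" + warc_path

with requests.get(url, stream=True) as resp:
    for record in ArchiveIterator(resp.raw):
        if record.rec_type != "response":
            continue
        if record.http_headers and record.http_headers.get_statuscode() == "200":
            html = record.content_stream().read()
            # ...strip boilerplate, detect language, deduplicate, etc.
            print(record.rec_headers.get_header("WARC-Target-URI"), len(html))
```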
Profile aggregation – When you’ve uniquely identified a customer, you can build applications in Managed Service for Apache Flink to consolidate all their metadata, from name to interaction history. Then, you transform this data into a concise format.
Data science teams in industry must work with lots of text, one of the top four categories of data used in machine learning. As an example, we can compare the open source licenses hosted on the Open Source Initiative site by loading each license text into a dictionary, as reconstructed below.
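A hedged reconstruction of that idea (the original notebook code was garbled in extraction): build a dict of license texts, then measure pairwise similarity. Here the texts come from the SPDX license-list-data mirror on GitHub rather than scraping opensource.org directly, which is an assumption of this sketch.

```python
import difflib
import requests

BASE = "https://raw.githubusercontent.com/spdx/license-list-data/main/text/"

# Build a dict mapping license ID to its full text.
lic = {}
for name in ["MIT", "BSD-3-Clause", "Apache-2.0"]:
    lic[name] = requests.get(BASE + name + ".txt", timeout=30).text

# Pairwise similarity ratios between license texts.
for a in lic:
    for b in lic:
        if a < b:
            ratio = difflib.SequenceMatcher(None, lic[a], lic[b]).ratio()
            print(f"{a} vs {b}: {ratio:.2f}")
```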
2020 saw us hosting our first ever fully digital Data Impact Awards ceremony, and it certainly was one of the highlights of our year. We saw a record number of entries and incredible examples of how customers were using Cloudera's platform and services to unlock the power of data.
However, as data processing solutions grow to operate at scale, organizations need to build more and more features on top of their data lakes. Additionally, the task of maintaining and managing files in the data lake can be tedious and sometimes complex. Data can be organized into three different zones.
With CDW, as an integrated service of CDP, your line of business gets the immediate resources needed for faster application launches and expedited data access, all while protecting the company's multi-year investment in centralized data management, security, and governance.
Episode 4: Unlocking the Value of Enterprise AI with Data Engineering Capabilities. They discuss how the data engineering team is instrumental in easing collaboration between analysts, data scientists, and ML engineers to build enterprise AI solutions.
Additionally, it is vital to be able to run computations over the 1,000+ PB in a massively parallel distributed system, given that the data remains dynamic, constantly undergoing updates, deletions, moves, and growth.
By supporting open-source frameworks and tools for code-based, automated, and visual data science capabilities, all in a secure, trusted studio environment, we're already seeing excitement from companies ready to use both foundation models and machine learning to accomplish key tasks.
A European multinational insurance company deployed CDP both on-premises and on Azure public cloud with Cloudera Data Science Workbench (CDSW) and Cloudera Machine Learning (CML). Portability only works if you are fully portable: not just workloads and data, but everything that goes along with them. But here's the thing.
There are now tens of thousands of instances of these Big Data platforms running in production around the world today, and the number is increasing every year. Many of them are increasingly deployed outside of traditional data centers in hosted "cloud" environments.
Since its launch in 2006, Amazon Simple Storage Service (Amazon S3) has experienced major growth, supporting multiple use cases such as hosting websites, creating data lakes, serving as object storage for consumer applications, storing logs, and archiving data.
IAM Identity Center now supports trusted identity propagation , a streamlined experience for users who require access to data with AWS analytics services.
This past week, I had the pleasure of hosting Data Governance for Dummies author Jonathan Reichental for a fireside chat, along with Denise Swanson, Data Governance lead at Alation. So, establishing a framework to store data by its source is a great place to start. Establishing a solid vision and mission is key.
Powered by cloud computing, more data professionals have access to the data, too. Data analysts have access to the data warehouse using BI tools like Tableau; data scientists have access to data science tools, such as Dataiku. Better Data Culture. Good data warehouses should be reliable.
What role does data play in your customer-first culture? Graves: As I mentioned, one of the key things for us is that we sell web products for our customers to build their own web presence – domains, hosting, shopping carts, and SSL certs. So we came up with a concept called a Unified Data Set (UDS).
On Thursday, January 6th, I hosted Gartner's 2022 Leadership Vision for Data and Analytics webinar. We did some early work a few years ago that looked at the career path of a CDO; see, from 2016, Build Your Career Path to the Chief Data Officer Role. We write about data and analytics.
In the data engineering program at Insight Data Science, some Fellows choose to use Pegasus as a tool to quickly stand up instances on Amazon Web Services and install the necessary distributed computing technologies. Next, Pegasus will identify where the data node will store its metadata.
On January 4th I had the pleasure of hosting a webinar titled The Gartner 2021 Leadership Vision for Data & Analytics Leaders. It was aimed at the Chief Data Officer, or head of data and analytics. As such, a head of analytics, BI, and data science may emerge; CAO may well be a name for that role.
I had the pleasure of chatting with John Furrier of theCUBE about how our recent round of funding will fuel innovation within the Alation Data Catalog. I'm John Furrier, co-host of theCUBE. You see scale, and obviously data science work in the cloud. This is certified data. Check out our conversation below.
There are multiple tables related to customers and order data in the RDS database. Amazon S3 hosts the metadata of all the tables as a .csv file, which a Step Functions workflow then consumes; a sketch of reading that file follows.
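A minimal sketch of reading the table-metadata .csv from Amazon S3 before kicking off the workflow; the bucket, key, and column layout are placeholders, since the article does not list them:

```python
import csv
import io
import boto3

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="example-metadata-bucket", Key="tables/metadata.csv")
text = obj["Body"].read().decode("utf-8")

# One row per RDS table to extract, e.g. table name, primary key, target path.
for row in csv.DictReader(io.StringIO(text)):
    print(row)
```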
Data Science Meets Climate Science. Environmental Data Management (EDM) is an annual meeting for data management teams at NOAA, this year chaired by Kim Valentine and Eugene Burger. Data veracity, data stewardship, and heroes of data science. Metadata Challenges.
The centerpiece of MHS Genesis is Cerner’s Millennium services management platform, which provides hosted software-as-a-service functionality in the cloud. A key reason for selecting Cerner, the DoD said , was the company’s data center allows direct access to proprietary data that it couldn’t obtain from a government-hosted environment.
We explored these questions and more at our Bake-Offs and Show Floor Showdowns at our Data and Analytics Summit in Orlando with 4,000 of our closest D&A friends and family. The first featured analytics and BI platform Gartner Magic Quadrant leaders, while the other showcased high-interest data science and machine learning platforms.
StarTree's automatic data ingestion framework is ideal for enterprise workloads because it improves scalability and reduces the data maintenance complexity often found in open source Pinot deployments. The data is then modeled to organize and structure what is fetched from the selected data source into Pinot tables, as sketched below.
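A hedged sketch of that modeling step in plain open source Pinot: registering a schema with the Pinot controller's REST API so ingested records land in a structured table. The controller URL and field names are placeholders; StarTree's managed framework automates this kind of setup.

```python
import requests

# Illustrative Pinot schema: dimensions, one metric, one time column.
schema = {
    "schemaName": "orders",
    "dimensionFieldSpecs": [
        {"name": "orderId", "dataType": "STRING"},
        {"name": "customerId", "dataType": "STRING"},
    ],
    "metricFieldSpecs": [{"name": "amount", "dataType": "DOUBLE"}],
    "dateTimeFieldSpecs": [
        {
            "name": "orderTime",
            "dataType": "LONG",
            "format": "1:MILLISECONDS:EPOCH",
            "granularity": "1:MILLISECONDS",
        }
    ],
}

# Register the schema with the controller (placeholder address).
resp = requests.post("http://pinot-controller:9000/schemas", json=schema, timeout=30)
resp.raise_for_status()
```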
You can also refer to Simplify data access for your enterprise using Amazon SageMaker Lakehouse for the Lake Formation admin setup in your AWS account. An S3 bucket hosts the sample Iceberg table data and metadata. Insert data from the CSV table into the Iceberg table. CREATE EXTERNAL TABLE `iceberg_db`.`customer_csv`(
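The DDL above is truncated in the excerpt. A hedged sketch of driving the CSV-to-Iceberg load from Python via Athena; the target table name `customer` and the query results location are assumptions, with only `iceberg_db` and `customer_csv` taken from the excerpt:

```python
import boto3

athena = boto3.client("athena")

# Copy rows from the external CSV table into the Iceberg table.
resp = athena.start_query_execution(
    QueryString="INSERT INTO iceberg_db.customer SELECT * FROM iceberg_db.customer_csv",
    QueryExecutionContext={"Database": "iceberg_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},  # placeholder bucket
)
print("query id:", resp["QueryExecutionId"])
```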
By using Amazon MWAA, we add job scheduling and orchestration capabilities, enabling you to build a comprehensive end-to-end Spark-based data processing pipeline. Overview of solution: Consider HealthTech Analytics, a healthcare analytics company managing two distinct data processing workloads; an orchestration sketch follows.
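A hedged sketch of the orchestration layer: an Airflow DAG, runnable on Amazon MWAA, that chains two Spark jobs to mirror the two workloads described above. The DAG ID and Glue job names are placeholders, and using Glue as the Spark runtime is an assumption of this sketch.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

with DAG(
    dag_id="spark_data_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    ingest = GlueJobOperator(
        task_id="ingest_claims",
        job_name="healthtech-ingest-claims",  # assumed pre-existing Glue Spark job
    )
    aggregate = GlueJobOperator(
        task_id="aggregate_metrics",
        job_name="healthtech-aggregate-metrics",  # assumed pre-existing Glue Spark job
    )
    ingest >> aggregate  # run aggregation only after ingestion succeeds
```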
A workshop that helps diagnostically map specific data to specific business outcomes. I hosted 25 one-on-ones in between the meetings and presentations. Data mesh versus data fabric: I am not the expert here, but in lay terms I believe both fabric and mesh include a semantic inference engine that consumes active metadata.