Data management is the foundation of quantitative research. In this post, we focus on data management implementation options such as accessing data directly in Amazon Simple Storage Service (Amazon S3), using popular data formats like Parquet, or using open table formats like Apache Iceberg.
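For instance, here is a minimal sketch of the first option: reading a Parquet dataset directly from Amazon S3 with pyarrow. The bucket, prefix, and region are hypothetical placeholders.

```python
# Minimal sketch: read a Parquet dataset directly from Amazon S3 with pyarrow.
# Bucket, prefix, and region below are hypothetical placeholders.
import pyarrow.parquet as pq
from pyarrow import fs

s3 = fs.S3FileSystem(region="us-east-1")  # assumes credentials from the environment

# Read a single Parquet file or a partitioned dataset prefix into an Arrow table
table = pq.read_table("my-research-bucket/prices/year=2024/", filesystem=s3)
print(table.num_rows, table.schema)
```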
However, commits can still fail if the latest metadata is updated after the base metadata version is established. Before diving into specific implementation patterns, it's essential to understand Iceberg's concurrency model and conflict types: how Iceberg manages concurrent writes through its table architecture and transaction model.
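In outline, an Iceberg commit is optimistic: the writer reads the base metadata version, stages a new snapshot, and atomically swaps the table's metadata pointer, retrying from fresh metadata on conflict. A schematic sketch of that loop follows; load_current_metadata, write_new_snapshot, and atomic_swap_pointer are hypothetical stand-ins for what an Iceberg client library does internally.

```python
# Schematic sketch of Iceberg-style optimistic commits with retries.
# load_current_metadata, write_new_snapshot, and atomic_swap_pointer are
# hypothetical stand-ins for internal client-library behavior.
import time

class CommitConflict(Exception):
    pass

def commit_with_retries(table, changes, max_attempts=5):
    for attempt in range(max_attempts):
        base = load_current_metadata(table)            # read the base metadata version
        candidate = write_new_snapshot(base, changes)  # stage new snapshot files
        try:
            # Succeeds only if the table pointer still references `base`;
            # raises if another writer committed after we read it.
            atomic_swap_pointer(table, expected=base, new=candidate)
            return candidate
        except CommitConflict:
            time.sleep(0.1 * 2 ** attempt)             # back off, re-read, retry
    raise RuntimeError("commit failed after retries")
```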
Data scientists and analysts, data engineers, and the people who manage them comprise 40% of the audience; developers and their managers, about 22%. These include the basics, such as metadata creation and management, data provenance, data lineage, and other essentials. Respondents who work in upper management—i.e.,
It is appealing to migrate from self-managed OpenSearch and Elasticsearch clusters in legacy versions to Amazon OpenSearch Service to enjoy the ease of use, native integration with AWS services, and rich features from the open-source environment (OpenSearch is now part of the Linux Foundation).
Whether you're a data analyst seeking a specific metric or a data steward validating metadata compliance, this update delivers a more precise, governed, and intuitive search experience. Refer to the product documentation to learn more about how to set up metadata rules for subscription and publishing workflows.
Writing SQL queries requires not just remembering SQL syntax rules, but also knowledge of the tables' metadata: data about table schemas, relationships among the tables, and possible column values. Although LLMs can generate syntactically correct SQL queries, they still need the table metadata to write accurate SQL queries.
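A minimal sketch of how table metadata might be packaged into a text-to-SQL prompt; the schema and question below are illustrative placeholders.

```python
# Sketch: supply table metadata (schemas, relationships, column values)
# to an LLM prompt for text-to-SQL. Schema and question are illustrative.
SCHEMA = """
Table orders(order_id INT PRIMARY KEY, customer_id INT, status VARCHAR, total DECIMAL)
Table customers(customer_id INT PRIMARY KEY, name VARCHAR, region VARCHAR)
-- orders.customer_id references customers.customer_id
-- status is one of: 'OPEN', 'SHIPPED', 'CANCELLED'
"""

def build_prompt(question: str) -> str:
    return (
        "Given the following table metadata, write a syntactically correct SQL query.\n"
        f"{SCHEMA}\n"
        f"Question: {question}\nSQL:"
    )

print(build_prompt("Total order value per region for shipped orders"))
```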
The landscape of big data management has been transformed by the rising popularity of open table formats such as Apache Iceberg, Apache Hudi, and Linux Foundation Delta Lake. Both Delta Lake and Iceberg metadata files reference the same data files.
Additionally, multiple copies of the same data locked in proprietary systems contribute to version control issues, redundancies, staleness, and management headaches. Founded in 2016, Octopai offers automated solutions for data lineage, data discovery, data catalog, mapping, and impact analysis across complex data environments.
According to Richard Kulkarni, Country Manager for Quest, a lack of clarity concerning governance and policy around AI means that employees and teams are finding workarounds to access the technology. Some senior technology leaders fear a Pandora's Box type situation with AI becoming impossible to control once unleashed.
The CDH is used to create, discover, and consume data products through a central metadata catalog, while enforcing permission policies and tightly integrating data engineering, analytics, and machine learning services to streamline the user journey from data to insight. This led to inefficiencies in data governance and access control.
Their terminal operations rely heavily on seamless data flows and the management of vast volumes of data. Thus, managing data at scale and establishing data-driven decision support across different companies and departments within the EUROGATE Group remains a challenge.
Datasphere manages and integrates structured, semi-structured, and unstructured data types. Datasphere provides full-spectrum data governance: metadata management, data catalogs, data privacy, data quality, and data lineage (provenance) tracking. Datasphere is not just for data managers.
Amazon Redshift is a fully managed, AI-powered cloud data warehouse that delivers the best price-performance for your analytics workloads at any scale. It enables you to get insights faster without extensive knowledge of your organization’s complex database schema and metadata. Within this feature, user data is secure and private.
Organizations of all sizes and types are using generative AI to create products and solutions. They are looking for a reliable and scalable solution to implement robust access controls to make sure these documents are only accessible to individuals who have a legitimate business need and the appropriate level of authorization.
Kinesis Data Streams is a fully managed, serverless data streaming service that stores and ingests streaming data in real time at any scale. In this solution, we consider a common use case: centralized log aggregation for an organization. To create a Kinesis data stream, see Create a data stream.
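A minimal sketch of creating such a stream with boto3; the stream name and region are placeholders.

```python
# Minimal sketch: create a Kinesis data stream with boto3.
# Stream name and region are placeholders.
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

kinesis.create_stream(
    StreamName="central-log-aggregation",
    StreamModeDetails={"StreamMode": "ON_DEMAND"},  # serverless capacity mode
)

# Wait until the stream is ACTIVE before writing records to it
waiter = kinesis.get_waiter("stream_exists")
waiter.wait(StreamName="central-log-aggregation")
```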
Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a managed Apache Airflow service used to extract business insights across an organization by combining, enriching, and transforming data through a series of tasks called a workflow. This approach offers greater flexibility and control over workflow management.
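A minimal Airflow DAG sketch of such a workflow, with placeholder task logic:

```python
# Minimal Airflow DAG sketch: a two-task extract -> transform workflow.
# Task bodies are placeholders.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull data from the source")

def transform():
    print("enrich and transform the data")

with DAG(
    dag_id="example_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # Airflow 2.4+; older versions use schedule_interval
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    extract_task >> transform_task
```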
In order to have a longstanding AI and ML practice, companies need to have data infrastructure in place to collect, transform, store, and manage data. Fifty-eight percent of respondents indicated that they were either building or evaluating data science platform solutions. Data scientists and data engineers are in demand.
Some challenges include data infrastructure that allows scaling and optimizing for AI; data management to inform AI workflows where data lives and how it can be used; and associated data services that help data scientists protect AI workflows and keep their models clean. I’m excited to give you a preview of what’s around the corner for ONTAP.
As AI adoption accelerates, it demands increasingly vast amounts of data, leading to more users accessing, transferring, and managing it across diverse environments. The platform also offers a deeply integrated set of security and governance technologies, ensuring comprehensive data management and reducing risk.
Recognizing this paradigm shift, ANZ Institutional Division has embarked on a transformative journey to redefine its approach to data management, utilization, and extracting significant business value from data insights. This principle makes sure data accountability remains close to the source, fostering higher data quality and relevance.
Let's briefly describe the capabilities of the AWS services we referred to above: AWS Glue is a fully managed, serverless, and scalable extract, transform, and load (ETL) service that simplifies the process of discovering, preparing, and loading data for analytics.
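A skeleton AWS Glue PySpark job along these lines, reading a cataloged table and writing Parquet to S3; the database, table, and output path are placeholders.

```python
# Skeleton AWS Glue PySpark job: read a Data Catalog table, write Parquet to S3.
# Database, table, and S3 path are placeholders.
import sys
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

frame = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders"
)
glue_context.write_dynamic_frame.from_options(
    frame=frame,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/curated/orders/"},
    format="parquet",
)
job.commit()
```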
What is zero-ETL? Zero-ETL is a set of fully managed integrations by AWS that minimizes the need to build ETL data pipelines. We take care of the ETL for you by automating the creation and management of data replication. Zero-ETL provides service-managed replication, whereas Glue ETL offers customer-managed data ingestion.
When building custom stream processing applications, developers typically face challenges with managing the distributed computing at scale required to process high-throughput data in real time. The latest Kinesis Client Library (KCL) release reduces the Amazon DynamoDB cost associated with KCL by optimizing read operations on the DynamoDB table storing metadata.
Amazon DataZone, a data management service, helps you catalog, discover, share, and govern data stored across AWS, on-premises systems, and third-party sources. This solution enhances governance and simplifies access to unstructured data assets across the organization.
Amazon Redshift scales linearly with the number of users and volume of data, making it an ideal solution for both growing businesses and enterprises. These improvements collectively reinforce Amazon Redshift's position as a leading cloud data warehouse solution, offering unparalleled performance and value to customers.
About 10 months ago, Databricks announced MLflow , a new open source project for managing machine learning development (full disclosure: Ben Lorica is an advisor to Databricks). MLflow is being used to manage multi-step machine learning pipelines. Traditional software developers have long had tools for managing their projects.
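A minimal MLflow tracking sketch for one step of such a pipeline; the experiment name, parameters, and metric values are illustrative.

```python
# Minimal MLflow tracking sketch: log parameters, a metric, and an artifact
# for one run of a pipeline step. Names and values are illustrative.
import mlflow

mlflow.set_experiment("demand-forecast")

with mlflow.start_run(run_name="train"):
    mlflow.log_param("learning_rate", 0.05)
    mlflow.log_param("n_estimators", 200)
    mlflow.log_metric("rmse", 4.2)
    mlflow.log_artifact("model.pkl")  # assumes this file exists locally
```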
Open table formats are emerging in the rapidly evolving domain of big data management, fundamentally altering the landscape of data storage and analysis. The adoption of open table formats is a crucial consideration for organizations looking to optimize their data management practices and extract maximum value from their data.
While neither of these is a complete solution, I can imagine a future version of these proposals that standardizes metadata so data routing protocols can determine which flows are appropriate and which aren't. Whatever solutions we end up with, we must not fall in love with the tools.
With this launch of JDBC connectivity, Amazon DataZone expands its support for data users, including analysts and scientists, allowing them to work in their preferred environments—whether it’s SQL Workbench, Domino, or Amazon-native solutions—while ensuring secure, governed access within Amazon DataZone.
As organizations increasingly adopt cloud-based solutions and centralized identity management, the need for seamless and secure access to data warehouses like Amazon Redshift becomes crucial. A common pattern lets federated users access the AWS Management Console; from there, the user can access the Redshift Query Editor V2.
Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a fully managed service that builds upon Apache Airflow, offering its benefits while eliminating the need for you to set up, operate, and maintain the underlying infrastructure, reducing operational overhead while increasing security and resilience.
In a previous post , we talked about applications of machine learning (ML) to software development, which included a tour through sample tools in data science and for managing data infrastructure. We have great tools for working with code: creating it, managing it, testing it, and deploying it. The tools for Software 2.0
In our own conferences, we see strong interest in training sessions and tutorials on deep learning for time series and natural language processing—two areas where organizations likely already have existing solutions, and for which deep learning is beginning to show some promise.
The visual designer is recommended for helping you manage workflow projects. Flows are pipelines of processor resources. A solution requires the following: an ingest flow to generate text embeddings (vectors) from text in an existing index. Let's compare our semantic and keyword solutions from the search comparison tool.
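A sketch of registering such an ingest flow as a pipeline, assuming the OpenSearch neural search plugin's text_embedding processor; the endpoint, credentials, and model_id are placeholders.

```python
# Sketch: register an OpenSearch ingest pipeline whose text_embedding processor
# (neural search plugin) vectorizes a text field at index time.
# Endpoint, credentials, and model_id are placeholders.
import requests

pipeline = {
    "description": "Generate embeddings for the 'content' field",
    "processors": [
        {
            "text_embedding": {
                "model_id": "<deployed-model-id>",
                "field_map": {"content": "content_vector"},
            }
        }
    ],
}

resp = requests.put(
    "https://my-domain.example.com/_ingest/pipeline/text-embedding-pipeline",
    json=pipeline,
    auth=("admin", "<password>"),
)
resp.raise_for_status()
```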
Monitoring and tracking issues in the data management lifecycle are essential for achieving operational excellence in data lakes. This is where Apache Iceberg comes into play, offering a new approach to data lake management. You will learn about an open-source solution that can collect important metrics from the Iceberg metadata layer.
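As one possible approach, a sketch of reading snapshot-level metrics from the Iceberg metadata layer with pyiceberg; the catalog name and table identifier are placeholders, and the catalog is assumed to be configured externally.

```python
# Sketch: collect basic metrics from the Iceberg metadata layer with pyiceberg.
# Catalog name and table identifier are placeholders; assumes the catalog is
# configured (e.g., via ~/.pyiceberg.yaml or environment variables).
from pyiceberg.catalog import load_catalog

catalog = load_catalog("glue")
table = catalog.load_table("analytics.events")

for snapshot in table.snapshots():
    # The snapshot summary carries counters such as added data files
    # and total records, useful for monitoring table health over time.
    print(snapshot.snapshot_id, snapshot.timestamp_ms, snapshot.summary)
```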
If you’re already a software product manager (PM), you have a head start on becoming a PM for artificial intelligence (AI) or machine learning (ML). But there’s a host of new challenges when it comes to managing AI projects: more unknowns, non-deterministic outcomes, new infrastructures, new processes and new tools.
SageMaker helps you work faster and smarter with your data and build powerful analytics and AI solutions that are deeply rooted in your unique data assets, giving you an edge over the competition. We’ve simplified data architectures, saving you time and costs on unnecessary data movement, data duplication, and custom solutions.
This post explores how you can use BladeBridge , a leading data environment modernization solution, to simplify and accelerate the migration of SQL code from BigQuery to Amazon Redshift. BladeBridge provides a configurable framework to seamlessly convert legacy metadata and code into more modern services such as Amazon Redshift.
These required specialized roles and teams to collect domain-specific data, prepare features, label data, retrain and manage the entire lifecycle of a model. Take, for example, an app for recording and managing travel expenses. The system then offers them more precise solutions or forwards them to the appropriate support staff.
As data-centric AI, automated metadata management, and privacy-aware data sharing mature, the opportunity to embed data quality into the enterprise's core has never been more significant. Instead, organizations resort to manual workarounds often managed by overburdened analysts or domain experts.
Why it's challenging to process and manage unstructured data: unstructured data makes up a large proportion of the data in the enterprise that can't be stored in a traditional relational database management system (RDBMS). You can integrate different technologies or tools to build a solution.
The AWS Glue Data Catalog now enhances managed table optimization of Apache Iceberg tables by automatically removing data files that are no longer needed. Along with the Glue Data Catalog’s automated compaction feature, these storage optimizations can help you reduce metadata overhead, control storage costs, and improve query performance.
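For comparison, a sketch of the equivalent manual housekeeping using Iceberg's Spark procedures, which this Glue Data Catalog feature automates; the catalog and table names are placeholders, and a Spark session configured with an Iceberg catalog is assumed.

```python
# Sketch: manual Iceberg table maintenance via Spark procedures (the Glue Data
# Catalog optimization automates this). Catalog and table names are placeholders;
# assumes a Spark session configured with an Iceberg catalog named glue_catalog.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-maintenance").getOrCreate()

# Expire old snapshots so their data files become eligible for cleanup
spark.sql("""
    CALL glue_catalog.system.expire_snapshots(
        table => 'analytics.events',
        older_than => TIMESTAMP '2024-01-01 00:00:00'
    )
""")

# Remove files no longer referenced by any table metadata
spark.sql("CALL glue_catalog.system.remove_orphan_files(table => 'analytics.events')")
```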
As artificial intelligence (AI) and machine learning (ML) continue to reshape industries, robust data management has become essential for organizations of all sizes. This means organizations must cover their bases in all areas surrounding data management including security, regulations, efficiency, and architecture.
Amazon SageMaker Lakehouse now supports attribute-based access control (ABAC) with AWS Lake Formation , using AWS Identity and Access Management (IAM) principals and session tags to simplify data access, grant creation, and maintenance. You can then query, analyze, and join the data using Redshift, Amazon Athena , Amazon EMR , and AWS Glue.
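A sketch of the session-tag side of ABAC: assuming an IAM role with a session tag that ABAC grant conditions can match; the role ARN, tag key, and tag value are placeholders.

```python
# Sketch: assume an IAM role with a session tag for ABAC. Policies and grants
# can then match on the tag rather than on individual principals.
# Role ARN and tag values are placeholders.
import boto3

sts = boto3.client("sts")

creds = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/AnalystRole",
    RoleSessionName="analyst-session",
    Tags=[{"Key": "team", "Value": "marketing"}],  # matched by ABAC conditions
)["Credentials"]

# Use the tagged session's credentials for data access, e.g. with Athena
athena = boto3.client(
    "athena",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
```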