2022, Data Lake and Metadata - Data Leaders Brief

Use Apache Iceberg in a data lake to support incremental data processing

AWS Big Data

MARCH 2, 2023

Apache Iceberg is an open table format for very large analytic datasets, which captures metadata information on the state of datasets as they evolve and change over time. Iceberg has become very popular for its support for ACID transactions in data lakes and features like schema and partition evolution, time travel, and rollback.

Data Lake

Data Lake Data Processing Metadata Snapshot

Use Apache Iceberg in your data lake with Amazon S3, AWS Glue, and Snowflake

AWS Big Data

APRIL 3, 2024

licensed, 100% open-source data table format that helps simplify data processing on large datasets stored in data lakes. Data engineers use Apache Iceberg because it’s fast, efficient, and reliable at any scale and keeps records of how datasets change over time.

Data Lake

Data Lake Snapshot Metadata Data Architecture

Choosing an open table format for your transactional data lake on AWS

AWS Big Data

JUNE 9, 2023

A modern data architecture enables companies to ingest virtually any type of data through automated pipelines into a data lake, which provides highly durable and cost-effective object storage at petabyte or exabyte scale.

Data Lake

Data Lake Metadata Statistics Optimization

AWS Lake Formation 2022 year in review

AWS Big Data

JANUARY 31, 2023

In this post, we are excited to summarize the features that the AWS Glue Data Catalog, AWS Glue crawler, and Lake Formation teams delivered in 2022. Whether you are a data platform builder, data engineer, data scientist, or any technology leader interested in data lake solutions, this post is for you.

Data Lake

Data Lake Data Governance Data Architecture Machine Learning

Build a real-time GDPR-aligned Apache Iceberg data lake

AWS Big Data

FEBRUARY 24, 2023

Data lakes are a popular choice for today’s organizations to store their data around their business activities. As a best practice of a data lake design, data should be immutable once stored. A data lake built on AWS uses Amazon Simple Storage Service (Amazon S3) as its primary storage environment.

Data Lake

Data Lake Metadata Testing Data Warehouse

Introducing simplified interaction with the Airflow REST API in Amazon MWAA

AWS Big Data

OCTOBER 23, 2024

It’s a set of HTTP endpoints to perform operations such as invoking Directed Acyclic Graphs (DAGs), checking task statuses, retrieving metadata about workflows, managing connections and variables, and even initiating dataset-related events, without directly accessing the Airflow web interface or command line tools.

Interactive

Interactive Testing Data-driven Data Lake
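To make the excerpt above concrete, here is a minimal sketch of what "invoking a DAG" over HTTP can look like. It assumes the stable REST API path used by Airflow 2.x (`/api/v1/dags/{dag_id}/dagRuns`); the host name, DAG ID, and `conf` payload are hypothetical, authentication is left out, and the request is only built, not sent. It illustrates the plain Airflow endpoint rather than the Amazon MWAA-specific mechanism the post introduces.

```python
import json
from urllib.request import Request


def build_dag_run_request(base_url: str, dag_id: str, conf: dict) -> Request:
    """Build (but do not send) a request that would trigger one DAG run.

    Assumes the Airflow 2.x stable REST API layout; adjust the path if
    your deployment exposes a different API version.
    """
    url = f"{base_url.rstrip('/')}/api/v1/dags/{dag_id}/dagRuns"
    # The run-level configuration travels in the JSON body under "conf".
    body = json.dumps({"conf": conf}).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host, DAG ID, and configuration, for illustration only.
req = build_dag_run_request(
    "https://airflow.example.com/",
    "example_dag",
    {"run_date": "2024-10-23"},
)
print(req.get_method(), req.full_url)
```

Sending the request (for example with `urllib.request.urlopen`) would additionally require whatever authentication the target environment enforces.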

Gartner Data & Analytics Sydney 2022

Timo Elliott

NOVEMBER 21, 2022

For the last 30 years, whenever you want to do analytics, the first step is to rip it out of the operational applications and try and move it to a different environment—so data warehousing, data lakes, data lakehouses and now data clouds.

Data Analytics

Data Analytics Analytics Recreation/Entertainment Data Lake

AWS Lake Formation 2023 year in review

AWS Big Data

JANUARY 18, 2024

AWS Lake Formation and the AWS Glue Data Catalog form an integral part of a data governance solution for data lakes built on Amazon Simple Storage Service (Amazon S3) with multiple AWS analytics services integrating with them. In 2022, we talked about the enhancements we had made to these services.

Data Lake

Data Lake Metadata Data Governance Statistics

Implement tag-based access control for your data lake and Amazon Redshift data sharing with AWS Lake Formation

AWS Big Data

JULY 21, 2023

Data-driven organizations treat data as an asset and use it across different lines of business (LOBs) to drive timely insights and better business decisions. This leads to having data across many instances of data warehouses and data lakes using a modern data architecture in separate AWS accounts.

Data Lake

Data Lake Data Warehouse Marketing Management

Cloudera Named a Leader in the 2022 Gartner® Magic Quadrant™ for Cloud Database Management Systems (DBMS)

Cloudera

DECEMBER 16, 2022

We are pleased to announce that Cloudera has been named a Leader in the 2022 Gartner® Magic Quadrant for Cloud Database Management Systems. Cloudera has long had the capabilities of a data lakehouse, if not the label. 4-Ready for modern data fabric architectures.

Management

Management Metadata Machine Learning Data Lake

Backtesting index rebalancing arbitrage with Amazon EMR and Apache Iceberg

AWS Big Data

JULY 3, 2023

Terminology: Let’s first discuss some of the terminology used in this post: Research data lake on Amazon S3 – A data lake is a large, centralized repository that allows you to manage all your structured and unstructured data at any scale. This is where the tagging feature in Apache Iceberg comes in handy.

Snapshot

Snapshot Data Lake Testing Strategy

Regeneron turns to IT to accelerate drug discovery

CIO Business Intelligence

NOVEMBER 4, 2022

MetaBio, which received a 2022 CIO 100 Award, provides a single source for datasets in a unified format, enabling researchers to quickly extract information about various therapeutic functions without having to worry about how to prepare or find the data. Much of Regeneron’s data, of course, is confidential.

Data Lake

Data Lake IT Experimentation Data-driven

Accelerate HiveQL with Oozie to Spark SQL migration on Amazon EMR

AWS Big Data

APRIL 19, 2023

We split the solution into two primary components: generating Spark job metadata and running the SQL on Amazon EMR. The first component (metadata setup) consumes existing Hive job configurations and generates metadata such as number of parameters, number of actions (steps), and file formats. sql_path SQL file name.

Metadata

Metadata Data Lake Testing Consulting

The Future of the Data Lakehouse – Open

CIO Business Intelligence

JUNE 23, 2022

Cloudera customers run some of the biggest data lakes on earth. These lakes power mission critical large scale data analytics, business intelligence (BI), and machine learning use cases, including enterprise data warehouses. On data warehouses and data lakes.

Data Lake

Data Lake Data Warehouse Machine Learning Data-driven

What is a data architect? Skills, salaries, and how to become a data framework master

CIO Business Intelligence

OCTOBER 13, 2023

Data architect Armando Vázquez identifies eight common types of data architects: Enterprise data architect: These data architects oversee an organization’s overall data architecture, defining data architecture strategy and designing and implementing architectures. Are data architects in demand?

Data Architecture

Data Architecture Data Warehouse Statistics Visualization

The Future of the Data Lakehouse – Open

Cloudera

JUNE 18, 2022

Cloudera customers run some of the biggest data lakes on earth. These lakes power mission critical large scale data analytics, business intelligence (BI), and machine learning use cases, including enterprise data warehouses. On data warehouses and data lakes.

Data Lake

Data Lake Data Warehouse Machine Learning Data-driven

Achieve your AI goals with an open data lakehouse approach

IBM Big Data Hub

OCTOBER 4, 2023

Why does AI need an open data lakehouse architecture? from 2022 to 2026. Another IDC study showed that while 2/3 of respondents reported using AI-driven data analytics, most reported that less than half of the data under management is available for this type of analytics. All of this supports the use of AI.

Data Lake

Data Lake Metadata Data Warehouse Cost-Benefit

Salesforce readies Einstein Copilot to unleash generative AI across its offerings

CIO Business Intelligence

SEPTEMBER 12, 2023

The hype around generative AI since ChatGPT’s launch in November 2022 has driven some software vendors to rush to incorporate the technology into their applications. To that end, Salesforce is leveraging Data Cloud as a central data hub for enterprise implementations of Einstein Copilot.

IT

IT Metadata Data Lake Cost-Benefit

Supercharge Your Data Lakehouse with Apache Iceberg in Cloudera Data Platform

Cloudera

JUNE 30, 2022

With Cloudera’s vision of hybrid data, enterprises adopting an open data lakehouse can easily get application interoperability and portability to and from on-premises environments and any public cloud without worrying about data scaling. Why integrate Apache Iceberg with Cloudera Data Platform?

Data Lake

Data Lake Data Warehouse Data Architecture Metadata

Databricks’ Data+AI Summit 2022: A Show of Partner “Unity”

Alation

JULY 18, 2022

Are your data users overwhelmed by silos and frustrated by untrusted data? Tell them to grab a catalog … and go jump in a lake. That was the message — delivered a little more elegantly than that — at Databricks’ Data+AI Summit 2022. Learn more about our partner network.

ROI

ROI Metadata Data Lake Digital Transformation

Habib Bank manages data at scale with Cloudera Data Platform

Cloudera

NOVEMBER 17, 2022

The Solution: CDP Private Cloud brings a next-generation hybrid architecture with cloud-native benefits to HBL’s data platform. HBL started its data journey in 2019, when a data lake initiative was launched to consolidate complex data sources and enable the bank to use a single version of truth for decision making.

Management

Management Data Lake Consulting Unstructured Data

Build and manage your modern data stack using dbt and AWS Glue through dbt-glue, the new “trusted” dbt adapter

AWS Big Data

NOVEMBER 29, 2023

We have seen a strong customer demand to expand its scope to cloud-based data lakes because data lakes are increasingly the enterprise solution for large-scale data initiatives due to their power and capabilities. Let’s say that this company is located in Europe and the data product must comply with the GDPR.

Data Lake

Data Lake Management Metrics Data Warehouse

How smava makes loans transparent and affordable using Amazon Redshift Serverless

AWS Big Data

DECEMBER 21, 2023

To bring their customers the best deals and user experience, smava follows the modern data architecture principles with a data lake as a scalable, durable data store and purpose-built data stores for analytical processing and data consumption.

Data Lake

Data Lake Data Warehouse Data-driven B2B

Augmented data management: Data fabric versus data mesh

IBM Big Data Hub

APRIL 27, 2022

Gartner defines a data fabric as “a design concept that serves as an integrated layer of data and connecting processes.” The data fabric architectural approach can simplify data access in an organization and facilitate self-service data consumption at scale. 2 “Exposing The Data Mesh Blind Side,” Forrester.

Management

Management Metadata Data Architecture Data Lake

The Modern Data Lakehouse: An Architectural Innovation

Cloudera

SEPTEMBER 9, 2022

This is the promise of the modern data lakehouse architecture. analyst Sumit Pal, in “Exploring Lakehouse Architecture and Use Cases,” published January 11, 2022: “Data lakehouses integrate and unify the capabilities of data warehouses and data lakes, aiming to support AI, BI, ML, and data engineering on a single platform.”

Metadata

Metadata Machine Learning Unstructured Data Data Lake

How Fujitsu implemented a global data mesh architecture and democratized data

AWS Big Data

MAY 1, 2024

To achieve data-driven management, we built OneData, a data utilization platform used in the four global AWS Regions, which started operation in April 2022. The platform consists of approximately 370 dashboards, 360 tables registered in the data catalog, and 40 linked systems.

Dashboards

Dashboards Publishing Data-driven Cost-Benefit

Data Mesh vs. Data Fabric: A Love Story

Alation

JANUARY 13, 2022

Thoughtworks says data mesh is key to moving beyond a monolithic data lake. Spoiler alert: data fabric and data mesh are independent design concepts that are, in fact, quite complementary. Gartner on Data Fabric.

Data Lake

Data Lake Metadata Data-driven Data Governance

Turning Streams Into Data Products

Cloudera

JUNE 16, 2022

CSP was recently recognized as a leader in the 2022 GigaOm Radar for Streaming Data Platforms report. The DevOps/app dev team wants to know how data flows between such entities and understand the key performance metrics (KPMs) of these entities. Without context, streaming data is useless.”

Data Lake

Data Lake Manufacturing Metadata Dashboards

Create an end-to-end data strategy for Customer 360 on AWS

AWS Big Data

MARCH 26, 2024

This view is used to identify patterns and trends in customer behavior, which can inform data-driven decisions to improve business outcomes. In 2022, AWS commissioned a study conducted by the American Productivity and Quality Center (APQC) to quantify the Business Value of Customer 360.

Data Strategy

Data Strategy Strategy Data Warehouse Prescriptive Analytics

Open Data Lakehouse powered by Iceberg for all your Data Warehouse needs

Cloudera

APRIL 3, 2023

The new feature this blog post discusses is the Iceberg V2 format (version 2). As the Iceberg table specification explains, the V1 format aimed to support large analytic data tables, while V2 aimed to add row-level deletes and updates. CDW also simplifies administration by making multi-tenancy secure and manageable.

Data Warehouse

Data Warehouse Snapshot Metadata Cost-Benefit

Demystifying Modern Data Platforms

Cloudera

SEPTEMBER 15, 2022

July brings summer vacations, holiday gatherings, and for the first time in two years, the return of the Massachusetts Institute of Technology (MIT) Chief Data Officer symposium as an in-person event. A key area of focus for the symposium this year was the design and deployment of modern data platforms.

Data Lake

Data Lake Data Architecture Data-driven Data Warehouse

The Very Group adopts a data catalog to better organize and leverage its online retail capabilities

CIO Business Intelligence

SEPTEMBER 6, 2022

Very has come full circle as a business built on catalog data, but it took some introspection in order to figure out the best way to get there. “Understanding what data you’ve got locked in all these different stores is a big part of the jigsaw puzzle.”

IT

IT Forecasting Data Lake Data Warehouse

Build incremental data pipelines to load transactional data changes using AWS DMS, Delta 2.0, and Amazon EMR Serverless

AWS Big Data

MARCH 3, 2023

Building data lakes from continuously changing transactional data of databases and keeping data lakes up to date is a complex task and can be an operational challenge. You can then apply transformations and store data in Delta format for managing inserts, updates, and deletes.

Data Lake

Data Lake Dashboards Metrics Metadata

Pillars of Knowledge, Best Practices for Data Governance

Cloudera

AUGUST 4, 2021

A top-notch system will include an easy-to-navigate data catalog that provides a single-pane view to administer and discover all data assets. The data is profiled and enhanced with rich metadata—including operational, social, and business context—creating trusted and reusable data assets and making them discoverable.

Data Governance

Data Governance Metadata Data-driven Enterprise

CIOs rise to the ESG reporting challenge

CIO Business Intelligence

JANUARY 30, 2024

As the internal technology provider for parent company Allianz SE with 15,000 employees, the entity employs more than 100 ESG experts who spend several weeks each year heads down collecting and reporting ESG data manually. Data quality is key, but if we’re doing it manually there’s the potential for mistakes.

Reporting

Reporting Data Quality Strategy Data-driven

How data stores and governance impact your AI initiatives

IBM Big Data Hub

OCTOBER 12, 2023

Among the tasks necessary for internal and external compliance is the ability to report on the metadata of an AI model. Metadata includes details specific to an AI model such as: The AI model’s creation (when it was created, who created it, etc.) Learn more about IBM watsonx 1.

Cost-Benefit

Cost-Benefit Metadata Data Governance Modeling

Fabrics, Meshes & Stacks, oh my! Q&A with Sanjeev Mohan

Alation

AUGUST 11, 2022

Today, the brightest minds in our industry are targeting the massive proliferation of data volumes and the accompanying but hard-to-find value locked within all that data. I recently had the opportunity to connect with Mohan at Snowflake Summit 2022 in Las Vegas. Data fabric is a technology architecture.

Metadata

Metadata Data Warehouse Data Quality Data Lake

Alation Earns 8 Top Rankings in BARC’s The Data Management Survey 23

Alation

OCTOBER 19, 2022

Today, they have issued The Data Management Survey 23, a report based on a survey of more than 1,200 data management end-users of 23 products (or groups of products). The survey was conducted from January to April 2022 and examined user feedback on product experience across 18 criteria.

Management

Management KPI Data Governance Reporting

Why We Started the Data Intelligence Project

Alation

JULY 7, 2022

In 2013 I joined American Family Insurance as a metadata analyst. I had always been fascinated by how people find, organize, and access information, so a metadata management role after school was a natural choice. The use cases for metadata are boundless, offering opportunities for innovation in every sector.

Metadata

Metadata Data-driven Insurance Statistics

Optimize data layout by bucketing with Amazon Athena and AWS Glue to accelerate downstream queries

AWS Big Data

APRIL 25, 2024

In the era of data, organizations are increasingly using data lakes to store and analyze vast amounts of structured and unstructured data. Data lakes provide a centralized repository for data from various sources, enabling organizations to unlock valuable insights and drive data-driven decision-making.

Optimization

Optimization Data Lake Cost-Benefit Reporting

What Is Alation Connected Sheets? Q&A with the Creators

Alation

NOVEMBER 28, 2022

You founded Kloudio to address the spreadsheet problem, and Alation acquired Kloudio in February of 2022. But refreshing this analysis with the latest data was impossible… unless you were proficient in SQL or Python. Each analyst has their own version of analysis in their spreadsheets due to differences in formulas and data used.

Metadata

Metadata Enterprise Cost-Benefit Finance

Do the Benefits of Cloud Outweigh the Costs?

Jet Global

SEPTEMBER 19, 2023

In fact, a recent Gartner report on cloud expenditure found that cross-industry cloud spend has risen from 8% as a percentage of total IT spend in 2018 to 16% in 2022. What are the best practices for analyzing cloud ERP data? Data Management: How do we create a data warehouse or data lake in the cloud using our cloud ERP?

Cost-Benefit

Cost-Benefit Data Warehouse Reporting Enterprise

Redefining enterprise transformation in the age of intelligent ecosystems

CIO Business Intelligence

JANUARY 16, 2025

The mega-vendor era: By 2020, the basis of competition for what are now referred to as mega-vendors was interoperability, automation and intra-ecosystem participation and unlocking access to data to drive business capabilities, value and manage risk.

Enterprise

Enterprise Digital Transformation Scorecard Interactive

Ingest telemetry messages in near real time with Amazon API Gateway, Amazon Data Firehose, and Amazon Location Service

AWS Big Data

NOVEMBER 14, 2024

Amazon Data Firehose – Data Firehose is an extract, transform, and load (ETL) service that reliably captures, transforms, and delivers streaming data to data lakes, data stores, and analytics services. AWS Glue – The AWS Glue Data Catalog is your persistent technical metadata store in the AWS Cloud.

Data Lake

Data Lake Metadata Testing Data-driven
