Data Lake, Document and Metadata

Monitoring Apache Iceberg metadata layer using AWS Lambda, AWS Glue, and AWS CloudWatch

AWS Big Data

JULY 29, 2024

In the era of big data, data lakes have emerged as a cornerstone for storing vast amounts of raw data in its native format. They support structured, semi-structured, and unstructured data, offering a flexible and scalable environment for data ingestion from multiple sources.

Metadata

Metadata Snapshot Data Lake Metrics

Modernize your legacy databases with AWS data lakes, Part 2: Build a data lake using AWS DMS data on Apache Iceberg

AWS Big Data

OCTOBER 30, 2024

This is part two of a three-part series where we show how to build a data lake on AWS using a modern data architecture. This post shows how to load data from a legacy database (SQL Server) into a transactional data lake ( Apache Iceberg ) using AWS Glue. format(add_column)).select("DATA_TYPE").toPandas().iterrows())[0]

Data Lake

Data Lake Data Processing Optimization Machine Learning

Multicloud data lake analytics with Amazon Athena

AWS Big Data

MARCH 18, 2024

Many organizations operate data lakes spanning multiple cloud data stores. In these cases, you may want an integrated query layer to seamlessly run analytical queries across these diverse cloud stores and streamline your data analytics processes. The AWS Glue Data Catalog holds the metadata for Amazon S3 and GCS data.

Data Lake

Data Lake Analytics Cost-Benefit Management

Choosing an open table format for your transactional data lake on AWS

AWS Big Data

JUNE 9, 2023

A modern data architecture enables companies to ingest virtually any type of data through automated pipelines into a data lake, which provides highly durable and cost-effective object storage at petabyte or exabyte scale.

Data Lake

Data Lake Metadata Statistics Optimization

Accelerate Amazon Redshift Data Lake queries with AWS Glue Data Catalog Column Statistics

AWS Big Data

OCTOBER 1, 2024

Amazon Redshift enables you to efficiently query and retrieve structured and semi-structured data from open format files in Amazon S3 data lake without having to load the data into Amazon Redshift tables. Amazon Redshift extends SQL capabilities to your data lake, enabling you to run analytical queries.

Data Lake

Data Lake Statistics Broadcasting Optimization

Enrich your serverless data lake with Amazon Bedrock

AWS Big Data

SEPTEMBER 26, 2024

Organizations are collecting and storing vast amounts of structured and unstructured data like reports, whitepapers, and research documents. By consolidating this information, analysts can discover and integrate data from across the organization, creating valuable data products based on a unified dataset.

Data Lake

Data Lake Cost-Benefit Unstructured Data Modeling

Enriching metadata for accurate text-to-SQL generation for Amazon Athena

AWS Big Data

OCTOBER 14, 2024

Enterprise data is brought into data lakes and data warehouses to carry out analytical, reporting, and data science use cases using AWS analytical services like Amazon Athena , Amazon Redshift , Amazon EMR , and so on. Table metadata is fetched from AWS Glue. The generated Athena SQL query is run.

Metadata

Metadata Data Lake Modeling Data Warehouse

Build a real-time GDPR-aligned Apache Iceberg data lake

AWS Big Data

FEBRUARY 24, 2023

Data lakes are a popular choice for today’s organizations to store their data around their business activities. As a best practice of a data lake design, data should be immutable once stored. A data lake built on AWS uses Amazon Simple Storage Service (Amazon S3) as its primary storage environment.

Data Lake

Data Lake Metadata Testing Data Warehouse

Doing Cloud Migration and Data Governance Right the First Time

erwin

OCTOBER 8, 2020

But even with the “need for speed” to market, new applications must be modeled and documented for compliance, transparency and stakeholder literacy. With all these diverse metadata sources, it is difficult to understand the complicated web they form much less get a simple visual flow of data lineage and impact analysis.

Data Governance

Data Governance Metadata Testing Data Lake

Integrating Data Governance and Enterprise Architecture

erwin

SEPTEMBER 3, 2020

Data governance provides time-sensitive, current-state architecture information with a high level of quality. It documents your data assets from end to end for business understanding and clear data lineage with traceability. Automating Data Governance and Enterprise Architecture.

Data Governance

Data Governance Enterprise Risk Data Lake

Key takeaways for CIOs from AWS re:Invent 2024

CIO Business Intelligence

DECEMBER 9, 2024

This unification is perhaps best exemplified by a new offering inside Amazon SageMaker, Unified Studio , which combinesSQLanalytics, data processing, AI development, data streaming, business intelligence, and search analytics. On the storage front, AWS unveiled S3 Table Buckets and the S3 Metadata features.

Metadata

Metadata Unstructured Data Data Lake Data-driven

Unstructured data management and governance using AWS AI/ML and analytics services

AWS Big Data

OCTOBER 25, 2023

Text, images, audio, and videos are common examples of unstructured data. Most companies produce and consume unstructured data such as documents, emails, web pages, engagement center phone calls, and social media. Therefore, there is a need to being able to analyze and extract value from the data economically and flexibly.

Unstructured Data

Unstructured Data Metadata Management Analytics

Data Cataloging in the Data Lake: Alation + Kylo

Alation

FEBRUARY 20, 2020

When it was no longer a hard requirement that a physical data model be created upon the ingestion of data, there was a resulting drop in richness of the description and consistency of the data stored in Hadoop. You did not have to understand or prepare the data to get it into Hadoop, so people rarely did.

Data Lake

Data Lake Metadata Structured Data Big Data

Use Amazon Athena with Spark SQL for your open-source transactional table formats

AWS Big Data

JANUARY 24, 2024

AWS-powered data lakes, supported by the unmatched availability of Amazon Simple Storage Service (Amazon S3), can handle the scale, agility, and flexibility required to combine different data and analytics approaches. It will never remove files that are still required by a non-expired snapshot.

Snapshot

Snapshot Data Lake Metadata Optimization

Data governance in the age of generative AI

AWS Big Data

FEBRUARY 29, 2024

Data governance is a critical building block across all these approaches, and we see two emerging areas of focus. First, many LLM use cases rely on enterprise knowledge that needs to be drawn from unstructured data such as documents, transcripts, and images, in addition to structured data from data warehouses.

Data Governance

Data Governance Unstructured Data Metadata Data Lake

Read and write S3 Iceberg table using AWS Glue Iceberg Rest Catalog from Open Source Apache Spark

AWS Big Data

DECEMBER 4, 2024

In today’s data-driven world , organizations are constantly seeking efficient ways to process and analyze vast amounts of information across data lakes and warehouses. This post will showcase how this data can also be queried by other data teams using Amazon Athena. Verify that you have Python version 3.7

Data Lake

Data Lake Metadata Insurance Data-driven

Petabyte-scale log analytics with Amazon S3, Amazon OpenSearch Service, and Amazon OpenSearch Ingestion

AWS Big Data

MARCH 7, 2024

At the same time, they need to optimize operational costs to unlock the value of this data for timely insights and do so with a consistent performance. With this massive data growth, data proliferation across your data stores, data warehouse, and data lakes can become equally challenging.

Data Lake

Data Lake Analytics Dashboards Metrics

Expand data access through Apache Iceberg using Delta Lake UniForm on AWS

AWS Big Data

NOVEMBER 14, 2024

Under the hood, UniForm generates Iceberg metadata files (including metadata and manifest files) that are required for Iceberg clients to access the underlying data files in Delta Lake tables. Both Delta Lake and Iceberg metadata files reference the same data files. in Delta Lake public document.

Metadata

Metadata Data Warehouse Big Data Data Lake

The Security Challenges of Data Warehousing in the Cloud

Cloudera

NOVEMBER 5, 2020

When you register an Environment in CDP, a Data Lake is automatically deployed for that environment. Data Lake security and governance is managed by a shared set of services running within a Data Lake cluster. Apache Atlas — metadata management and governance: lineage, analytics, attributes.

Data Lake

Data Lake Data Warehouse Metadata Optimization

Apache Ozone and Dense Data Nodes

Cloudera

APRIL 22, 2021

Apache Ozone is one of the major innovations introduced in CDP, which provides the next generation storage architecture for Big Data applications, where data blocks are organized in storage containers for larger scale and to handle small objects. Collects and aggregates metadata from components and present cluster state.

Data Lake

Data Lake Cost-Benefit Metadata Testing

Data Governance Makes Data Security Less Scary

erwin

OCTOBER 31, 2019

What data do we have and where is it? Data is a critical asset used to operate, manage and grow a business. While sometimes at rest in databases, data lakes and data warehouses; a large percentage is federated and integrated across the enterprise, introducing governance, manageability and risk issues that must be managed.

Data Governance

Data Governance Metadata Risk Data Lake

Cloud Data Science News – Beta 6

Data Science 101

DECEMBER 16, 2019

It now also supports PDF documents. Azure Data Factory Preserves Metadata during File Copy When performing a File copy between Amazon S3, Azure Blob, and Azure Data Lake Gen 2, the metadata will be copied as well. Not a huge update but still a nice feature. Azure Database for MySQL now supports MySQL 8.0

Data Science

Data Science Machine Learning Metadata Data Lake

How BMO improved data security with Amazon Redshift and AWS Lake Formation

AWS Big Data

MARCH 1, 2024

One of the bank’s key challenges related to strict cybersecurity requirements is to implement field level encryption for personally identifiable information (PII), Payment Card Industry (PCI), and data that is classified as high privacy risk (HPR). Only users with required permissions are allowed to access data in clear text.

Data Lake

Data Lake Data Warehouse Management Risk

Cloudera Data Platform extends Hybrid Cloud vision support by supporting Google Cloud

Cloudera

MARCH 31, 2021

Customers who have chosen Google Cloud as their cloud platform can now use CDP Public Cloud to create secure governed data lakes in their own cloud accounts and deliver security, compliance and metadata management across multiple compute clusters. Google Cloud Storage buckets – in the same subregion as your subnets .

Data Lake

Data Lake Metadata Enterprise Analytics

Migrate Hive data from CDH to CDP public cloud

Cloudera

JUNE 25, 2021

The Replication Manager support matrix is documented in our public docs. This blog post outlines detailed step by step instructions to perform Hive Replication from an on-prem CDH cluster to a CDP Public Cloud Data Lake. CDP Data Lake cluster versions – CM 7.4.0, CDP Data Lake cluster versions – CM 7.4.0,

Data Lake

Data Lake Metadata Unstructured Data Management

Breaking State and Local Data Silos with Modern Data Architectures

Cloudera

AUGUST 30, 2022

It outlines a scenario in which “recently married people might want to change their names on their driver’s licenses or other documentation. That should be easy, but when agencies don’t share data or applications, they don’t have a unified view of people. Forrester ).

Data Architecture

Data Architecture Data Lake Data Warehouse Metadata

What is a data architect? Skills, salaries, and how to become a data framework master

CIO Business Intelligence

OCTOBER 13, 2023

Data architect Armando Vázquez identifies eight common types of data architects: Enterprise data architect: These data architects oversee an organization’s overall data architecture, defining data architecture strategy and designing and implementing architectures.

Data Architecture

Data Architecture Data Warehouse Statistics Visualization

Get Your Analytics Insights Instantly – Without Abandoning Central IT

Cloudera

JANUARY 21, 2021

With CDW, as an integrated service of CDP, your line of business gets immediate resources needed for faster application launches and expedited data access, all while protecting the company’s multi-year investment in centralized data management, security, and governance. Proprietary file formats mean no one else is invited in!

Data Warehouse

Data Warehouse Data Lake IT Analytics

Exploring real-time streaming for generative AI Applications

AWS Big Data

MARCH 25, 2024

A RAG-based generative AI application can only produce generic responses based on its training data and the relevant documents in the knowledge base. Streaming jobs constantly ingest new data to synchronize across systems and can perform enrichment, transformations, joins, and aggregations across windows of time more efficiently.

Data Lake

Data Lake Unstructured Data Management Snapshot

How Data Governance Protects Sensitive Data

erwin

APRIL 2, 2021

And knowing the business purpose translates into actively governing personal data against potential privacy and security violations. Do You Know Where Your Sensitive Data Is? Data is a valuable asset used to operate, manage and grow a business.

Data Governance

Data Governance Cost-Benefit Metadata Risk

What Is a Data Catalog?

Alation

FEBRUARY 13, 2020

Why do we need a data catalog? What does a data catalog do? These are all good questions and a logical place to start your data cataloging journey. Data catalogs have become the standard for metadata management in the age of big data and self-service analytics. Figure 1 – Data Catalog Metadata Subjects.

Metadata

Metadata Data Lake Recreation/Entertainment Big Data

Introducing watsonx: The future of AI for business

IBM Big Data Hub

MAY 9, 2023

is not just for data scientists and developers — business users can also access it via an easy-to-use interface that responds to natural language prompts for different tasks. Through workload optimization an organization can reduce data warehouse costs by up to 50 percent by augmenting with this solution. [1] Watsonx.ai

Data Warehouse

Data Warehouse Machine Learning Cost-Benefit Metadata

Clean up your Excel and CSV files without writing code using AWS Glue DataBrew

AWS Big Data

NOVEMBER 15, 2023

You can use its built-in transformations, recipes, as well as integrations with the AWS Glue Data Catalog and Amazon Simple Storage Service (Amazon S3) to preprocess the data in your landing zone, clean it up, and send it downstream for analytical processing. For Matching conditions , choose Match all conditions.

Metadata

Metadata Sales Data Lake Big Data

Introducing hybrid access mode for AWS Glue Data Catalog to secure access using AWS Lake Formation and IAM and Amazon S3 policies

AWS Big Data

SEPTEMBER 26, 2023

AWS Lake Formation helps you centrally govern, secure, and globally share data for analytics and machine learning. With Lake Formation, you can manage access control for your data lake data in Amazon Simple Storage Service (Amazon S3 ) and its metadata in AWS Glue Data Catalog in one place with familiar database-style features.

Data Lake

Data Lake Metadata Management Modeling

Cloudera’s Open Data Lakehouse Supercharged with dbt Core(tm)

Cloudera

OCTOBER 7, 2022

Using these adapters, Cloudera customers can use dbt to collaborate, test, deploy, and document their data transformation and analytic pipelines on CDP Public Cloud, CDP One, and CDP Private Cloud. To date, dbt was only available on proprietary cloud data warehouses, with very little interoperability between different engines.

Data Warehouse

Data Warehouse Data Transformation Machine Learning Testing

How gaming companies can use Amazon Redshift Serverless to build scalable analytical applications faster and easier

AWS Big Data

MARCH 7, 2023

A data hub contains data at multiple levels of granularity and is often not integrated. It differs from a data lake by offering data that is pre-validated and standardized, allowing for simpler consumption by users. Data hubs and data lakes can coexist in an organization, complementing each other.

Analytics

Analytics Data Warehouse Data Lake Metadata

How to use foundation models and trusted governance to manage AI workflow risk

IBM Big Data Hub

OCTOBER 16, 2023

It includes processes that trace and document the origin of data, models and associated metadata and pipelines for audits. How to scale AL and ML with built-in governance A fit-for-purpose data store built on an open lakehouse architecture allows you to scale AI and ML while providing built-in governance tools.

Risk

Risk Modeling Management Metadata

Pillars of Knowledge, Best Practices for Data Governance

Cloudera

AUGUST 4, 2021

A top-notch system will include an easy-to-navigate data catalog that provides a single-pane view to administer and discover all data assets. The data is profiled and enhanced with rich metadata—including operational, social, and business context—creating trusted and reusable data assets and making them discoverable.

Data Governance

Data Governance Metadata Data-driven Enterprise

Secure Data Sharing and Interoperability Powered by Iceberg REST Catalog

Cloudera

DECEMBER 3, 2024

Cloudera’s open data lakehouse, powered by Apache Iceberg, solves the real-world big data challenges mentioned above by providing a unified, curated, shareable, and interoperable data lake that is accessible by a wide array of Iceberg-compatible compute engines and tools. Follow the steps below to setup Cloudera: 1.

Metadata

Metadata Data Warehouse ROI Snapshot

The Data Scientist’s Guide to the Data Catalog

Alation

JULY 19, 2022

A data catalog can assist directly with every step, but model development. And even then, information from the data catalog can be transferred to a model connector , allowing data scientists to benefit from curated metadata within those platforms. How Data Catalogs Help Data Scientists Ask Better Questions.

Metadata

Metadata Data Quality Statistics Data Science

How data stores and governance impact your AI initiatives

IBM Big Data Hub

OCTOBER 12, 2023

Among the tasks necessary for internal and external compliance is the ability to report on the metadata of an AI model. Metadata includes details specific to an AI model such as: The AI model’s creation (when it was created, who created it, etc.) This allows for a high degree of transparency and auditability.

Cost-Benefit

Cost-Benefit Metadata Data Governance Modeling

Get started with the new Amazon DataZone enhancements for Amazon Redshift

AWS Big Data

JULY 29, 2024

Amazon DataZone is a powerful data management service that empowers data engineers, data scientists, product managers, analysts, and business users to seamlessly catalog, discover, analyze, and govern data across organizational boundaries, AWS accounts, data lakes, and data warehouses. Choose Next.

Data Warehouse

Data Warehouse Sales Metadata Publishing

Non-JSON ingestion using Amazon Kinesis Data Streams, Amazon MSK, and Amazon Redshift Streaming Ingestion

AWS Big Data

OCTOBER 2, 2023

JSON data in Amazon Redshift Amazon Redshift enables storage, processing, and analytics on JSON data through the SUPER data type, PartiQL language, materialized views, and data lake queries. This approach requires an additional step of schema retrieval and decoding based on context.

Cost-Benefit

Cost-Benefit Metadata Structured Data Data-driven

Data Mesh 101: How Data Mesh Can Be Used in an Organization

Ontotext

FEBRUARY 12, 2024

This calls for additional planning, documentation, and testing. A data mesh will likely require more engineers to get started, so a critical mass is needed for successful adoption. Each team should be accountable for providing their prepared data sets to downstream systems. For instance, JPMorgan Chase & Co.

Data Quality

Data Quality Data-driven Data Lake Data Governance

Monitoring Apache Iceberg metadata layer using AWS Lambda, AWS Glue, and AWS CloudWatch

Modernize your legacy databases with AWS data lakes, Part 2: Build a data lake using AWS DMS data on Apache Iceberg

Trending Sources

Multicloud data lake analytics with Amazon Athena

Choosing an open table format for your transactional data lake on AWS

Accelerate Amazon Redshift Data Lake queries with AWS Glue Data Catalog Column Statistics

Enrich your serverless data lake with Amazon Bedrock

Enriching metadata for accurate text-to-SQL generation for Amazon Athena

Build a real-time GDPR-aligned Apache Iceberg data lake

Doing Cloud Migration and Data Governance Right the First Time

Integrating Data Governance and Enterprise Architecture

Key takeaways for CIOs from AWS re:Invent 2024

Unstructured data management and governance using AWS AI/ML and analytics services

Data Cataloging in the Data Lake: Alation + Kylo

Use Amazon Athena with Spark SQL for your open-source transactional table formats

Data governance in the age of generative AI

Read and write S3 Iceberg table using AWS Glue Iceberg Rest Catalog from Open Source Apache Spark

Petabyte-scale log analytics with Amazon S3, Amazon OpenSearch Service, and Amazon OpenSearch Ingestion

Expand data access through Apache Iceberg using Delta Lake UniForm on AWS

The Security Challenges of Data Warehousing in the Cloud

Apache Ozone and Dense Data Nodes

Data Governance Makes Data Security Less Scary

Cloud Data Science News – Beta 6

How BMO improved data security with Amazon Redshift and AWS Lake Formation

Cloudera Data Platform extends Hybrid Cloud vision support by supporting Google Cloud

Migrate Hive data from CDH to CDP public cloud

Breaking State and Local Data Silos with Modern Data Architectures

What is a data architect? Skills, salaries, and how to become a data framework master

Get Your Analytics Insights Instantly – Without Abandoning Central IT

Exploring real-time streaming for generative AI Applications

How Data Governance Protects Sensitive Data

What Is a Data Catalog?

Introducing watsonx: The future of AI for business

Clean up your Excel and CSV files without writing code using AWS Glue DataBrew

Introducing hybrid access mode for AWS Glue Data Catalog to secure access using AWS Lake Formation and IAM and Amazon S3 policies

Cloudera’s Open Data Lakehouse Supercharged with dbt Core(tm)

How gaming companies can use Amazon Redshift Serverless to build scalable analytical applications faster and easier

How to use foundation models and trusted governance to manage AI workflow risk

Pillars of Knowledge, Best Practices for Data Governance

Secure Data Sharing and Interoperability Powered by Iceberg REST Catalog

The Data Scientist’s Guide to the Data Catalog

How data stores and governance impact your AI initiatives

Get started with the new Amazon DataZone enhancements for Amazon Redshift

Non-JSON ingestion using Amazon Kinesis Data Streams, Amazon MSK, and Amazon Redshift Streaming Ingestion

Data Mesh 101: How Data Mesh Can Be Used in an Organization

Stay Connected