Amazon Athena provides an interactive analytics service for analyzing data in Amazon Simple Storage Service (Amazon S3). Amazon Redshift is used to analyze structured and semi-structured data across data warehouses, operational databases, and data lakes. Table metadata is fetched from AWS Glue.
Amazon EMR on EKS provides a deployment option for Amazon EMR that allows organizations to run open-source big data frameworks on Amazon Elastic Kubernetes Service (Amazon EKS). To learn more and get started with EMR on EKS, try out the EMR on EKS Workshop and visit the EMR on EKS Best Practices Guide page. Amazon EMR 6.10
Distributed systems and models: For better or worse, we live in the age of big data. Many organizations now use distributed data processing and machine learning systems. Proceedings of the 11th ACM Workshop on Artificial Intelligence and Security. ACM (2018). URL: [link].
aoss.amazonaws.com/_saml/acs (replace with the corresponding Region) to generate the IdP metadata. After an app is created, choose the sign-on tab, scroll down to the metadata details, and copy the value for Metadata URL. Open a new tab and enter the copied metadata URL into your browser. Select I’m a software vendor.
The Orca Platform is powered by a state-of-the-art anomaly detection system that uses cutting-edge ML algorithms and big data capabilities to detect potential security threats and alert customers in real time, ensuring maximum security for their cloud environment. Why did Orca choose Apache Iceberg?
Under IAM Identity Center metadata , choose Download under IAM Identity Center SAML metadata file. We use this metadata file to create a SAML provider under OpenSearch Serverless. Under Application metadata , select Manually type your metadata values. Enter the metadata from your IdP that you downloaded earlier.
Remote runtime data-integration-as-a-service capabilities for on-premises and multi-cloud execution. Multi-directional data movement topology with high-volume and low-latency integration. Support for data governance. Metadata exchange with third-party metadata management and governance tools.
It ingests data from both streaming and batch sources and organizes it into logical tables distributed across multiple nodes in a Pinot cluster, ensuring scalability. Pinot provides functionality similar to other modern big data frameworks, supporting SQL queries, upserts, complex joins, and various indexing options.
For additional details on this feature, refer to AWS Lake Formation-managed Redshift datashares (preview) and How Redshift data share can be managed by Lake Formation. Amazon EMR is a managed cluster platform to run big data applications using Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto at scale.
To run the scripts and get hands-on experience, start with the Amazon MWAA analytics workshop, then use the scripts in the GitHub repo to gain more observability of your DAG runs. DAG definitions: In this section, we look at snippets of the additions needed to the DAG file.
The options available with Kafka are passing the tenant ID either as event metadata (header) or part of the payload itself as an explicit field. Carefully weighing your streaming outcomes and customer needs will help determine the correct trade-offs you can make while making sure your customer data is secure and auditable.
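The two options above can be sketched as a small helper that shapes a Kafka record either way. This is a minimal illustration, not taken from the article: the function name, record shape, and strategy labels are assumptions; in a real producer the headers list and value bytes would be passed to the client's send call.

```python
import json

def build_tenant_record(tenant_id, payload, strategy="header"):
    """Attach a tenant ID to a Kafka record either as event metadata
    (a record header) or as an explicit field in the payload itself."""
    if strategy == "header":
        # Kafka headers carry byte values; the payload stays tenant-agnostic.
        return {
            "headers": [("tenant_id", tenant_id.encode("utf-8"))],
            "value": json.dumps(payload).encode("utf-8"),
        }
    if strategy == "payload":
        # Embed the tenant ID directly in the message body.
        return {
            "headers": [],
            "value": json.dumps({**payload, "tenant_id": tenant_id}).encode("utf-8"),
        }
    raise ValueError(f"unknown strategy: {strategy}")
```

The header approach keeps tenant routing out of the message body, so consumers that don't care about tenancy can ignore it; the payload approach makes the tenant ID survive any re-serialization of the value.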
Priority 2 (P2) logs, such as operating system security logs, firewall logs, identity provider (IdP) logs, email metadata, and AWS CloudTrail, are ingested into Amazon OpenSearch Service to enable the following capabilities. Previously, P2 logs were ingested into the SIEM.
// It serves as a simple API Gateway to Kafka proxy, accepting requests and forwarding them to a Kafka topic.
try {
    // ... publish the incoming request payload to the Kafka topic ...
    return response.withStatusCode(200).withBody("Message successfully pushed to kafka");
} catch (Exception e) {
    // In case of exception, log the error message and return a 500 status code
    log.error(e.getMessage(), e);
    return response.withBody(e.getMessage()).withStatusCode(500);
}
At a high level, the core of Langley’s architecture is based on a set of Amazon Simple Queue Service (Amazon SQS) queues and AWS Lambda functions, and a dedicated RDS database to store ETL job data and metadata. Amazon MWAA offers one-click updates of the infrastructure for minor versions, like moving from Airflow version x.4.z
If not, refer to the Setting up Prometheus and Grafana for monitoring the cluster section of the Running batch workloads on Amazon EKS workshop to get them up and running on your cluster. To clean up your EMR on EKS cluster after trying out the vertical autoscaling feature, refer to the clean-up section of the EMR on EKS workshop.
With a unified data catalog, you can quickly search datasets and figure out data schema, data format, and location. The AWS Glue Data Catalog provides a uniform repository where disparate systems can store and find metadata to keep track of data in data silos. Refer to Catalogs for more information.
OpenSearch Serverless caches the most recent log data, typically the first 24 hours, on ephemeral disk. For data older than 24 hours, OpenSearch Serverless only caches metadata and fetches the necessary data blocks from Amazon S3 based on query access. This model also helps pack more data while controlling the costs.
Step 3: For six to eight weeks leading up to the presentation date, offer applied training to the teams on developing these artifacts through workshops on their specific use cases. Bolster development teams by inviting diverse, multidisciplinary teams to join them in these workshops as they assess ethics and model risk.
From Nick Elprin, CEO and co-founder of Domino Data Lab, on the importance of model-driven business: "Being data-driven is like navigating by watching the rearview mirror. If your business is using big data and putting dashboards in front of analysts, you're missing the point." I consider that a healthy trend.
SS4O is inspired by both OpenTelemetry and the Elastic Common Schema (ECS) and uses Amazon Elastic Container Service (Amazon ECS) event logs and OpenTelemetry (OTel) metadata. You can get started by having hands-on experience with the publicly available workshops for semantic search, microservice observability, and OpenSearch Serverless.
Data ingestion/integration services. Data orchestration tools. These tools are used to manage big data, which is defined as data that is too large or complex to be processed by traditional means. How Did the Modern Data Stack Get Started? What Are the Benefits of a Modern Data Stack?
Since much of the work is siloed, there are entire markets focused on, for example, data privacy tools, data security tools, data quality tools, and more. We cannot, of course, forget metadata management tools, of which there are many. But for them, big data evolved into all data and all formats.
Paco Nathan’s latest article covers data practices from the National Oceanic and Atmospheric Administration (NOAA) Environment Data Management (EDM) workshop as well as updates from the AI Conference. Data Science meets Climate Science. At the EDM workshop, I gave a keynote about AI adoption in industry.
By virtue of that, if you take those log files of customer interactions, aggregate them, and then run machine learning models on the aggregated data, you can produce data products that you feed back into your web apps, and you get a compounding effect in the business. That was the origin of big data.
Aligning the solution with the data strategy At an early stage of the project, the Volkswagen Autoeuropa and AWS team identified that a data mesh architecture for the data solution aligns with the Volkswagen Autoeuropa’s vision of becoming a data-driven factory.
To learn more about how to process Firehose records using Lambda, see Transform source data in Amazon Data Firehose. After your Lambda function executes, Firehose looks for routing information and operations in the metadata fields provided by your Lambda function.
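A Firehose record transformation can be sketched as follows. This is an illustrative handler, not the article's code: the database and table names are placeholders, and the routing-metadata field names shown are assumptions about the destination format rather than confirmed by the article. The decode/encode pattern (base64 in, base64 out, with recordId and result preserved) is the general Firehose transformation contract.

```python
import base64
import json

def lambda_handler(event, context):
    """Decode each Firehose record, optionally enrich it, and re-encode it,
    attaching routing metadata for Firehose to act on."""
    output = []
    for record in event["records"]:
        # Firehose delivers record data base64-encoded.
        payload = json.loads(base64.b64decode(record["data"]).decode("utf-8"))
        output.append({
            "recordId": record["recordId"],   # must be echoed back unchanged
            "result": "Ok",                   # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(
                json.dumps(payload).encode("utf-8")
            ).decode("utf-8"),
            # Hypothetical routing metadata; actual field names depend on
            # the configured destination.
            "metadata": {
                "otfMetadata": {
                    "destinationDatabaseName": "example_db",
                    "destinationTableName": "example_table",
                    "operation": "insert",
                }
            },
        })
    return {"records": output}
```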
When Firehose delivers data to the S3 table, it uses the AWS Glue Data Catalog to store and manage table metadata. This metadata includes schema information, partition details, and file locations, enabling seamless data discovery and querying across AWS analytics services.