Build a high-performance quant research platform with Apache Iceberg

AWS Big Data

In this post, we use the term vanilla Parquet to refer to Parquet files stored directly in Amazon S3 and accessed through standard query engines like Apache Spark, without the additional features provided by table formats such as Iceberg. When a user requests a time travel query, the typical workflow involves querying a specific snapshot.

Metadata 107

Chart Snapshot: Bagplots

The Data Visualisation Catalogue

A Bagplot is a visualisation method used in robust statistics primarily designed for analysing two- or three-dimensional datasets. The key purpose of a Bagplot is to provide a comprehensive understanding of various statistical properties of the dataset, including its location, spread, skewness, and identification of outliers.

Simplify data integration with AWS Glue and zero-ETL to Amazon SageMaker Lakehouse

AWS Big Data

The company is looking for an efficient, scalable, and cost-effective solution for collecting and ingesting data from ServiceNow, one that ensures continuous near real-time replication, automated availability of new data attributes, robust monitoring capabilities to track data load statistics, and a reliable data lake foundation that supports data versioning.

Proposals for model vulnerability and security

O'Reilly on Data

Data poisoning refers to someone systematically changing your training data to manipulate your model’s predictions. Watermarking is a term borrowed from the deep learning security literature that often refers to putting special pixels into an image to trigger a desired outcome from your model.

Modeling 278

Data Observability and Monitoring with DataOps

DataKitchen

We liken this methodology to the statistical process controls advocated by management guru Dr. W. Edwards Deming. In addition to statistical process controls, we recommend location and historical balance tests. These are called Time Balance tests or, more commonly, statistical process control (SPC).
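
As a rough illustration of the statistical process control idea mentioned in this excerpt, the sketch below flags a new measurement that falls outside control limits derived from historical values. The function names, the three-sigma threshold, and the sample row counts are all assumptions made for this example; they are not taken from the DataKitchen article.

```python
from statistics import mean, stdev


def control_limits(history, sigmas=3.0):
    """Return (lower, upper) control limits around the historical mean."""
    center = mean(history)
    spread = stdev(history)  # sample standard deviation of past values
    return center - sigmas * spread, center + sigmas * spread


def out_of_control(history, new_value, sigmas=3.0):
    """True when new_value falls outside the limits built from history."""
    lower, upper = control_limits(history, sigmas)
    return not (lower <= new_value <= upper)


# Hypothetical daily row counts for a table loaded by a data pipeline.
daily_row_counts = [1000, 1020, 980, 1010, 990, 1005, 995]

print(out_of_control(daily_row_counts, 1003))  # a typical load
print(out_of_control(daily_row_counts, 400))   # a suspiciously small load
```

A check like this would typically run after each pipeline step, with the history window and threshold tuned to how noisy the metric is.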

Testing 214

Backtesting index rebalancing arbitrage with Amazon EMR and Apache Iceberg

AWS Big Data

Major market indexes, such as the S&P 500, are subject to periodic inclusions and exclusions for reasons beyond the scope of this post (for an example, refer to CoStar Group, Invitation Homes Set to Join S&P 500; Others to Join S&P 100, S&P MidCap 400, and S&P SmallCap 600). Load the dataset into Amazon S3.

Use Apache Iceberg in your data lake with Amazon S3, AWS Glue, and Snowflake

AWS Big Data

Snowflake integrates with AWS Glue Data Catalog to retrieve the snapshot location. In the event of a query, Snowflake uses the snapshot location from AWS Glue Data Catalog to read Iceberg table data in Amazon S3. Snowflake can query across Iceberg and Snowflake table formats.

Data Lake 106