Performance Evaluation

Each phase that runs a model writes a per-target model-scores file: a long-format table with one row per scored prediction. These files are intended for performance evaluation — computing ROC curves, Precision-Recall curves, confusion matrices at a chosen threshold, AUC, and so on — including in external tools such as R.

Schema

Column	Type	Description
`method`	string	The model that produced the row, as a lowercase short name (for example `xgb`, `dt`, `rf`, `logit`, `lda`, `svm`, `knn`, `gnb`, `mlp`).
`k`	integer	Fold index. For validation this ranges over the cross-validation folds (`0 .. K-1`); for the build/test and classify phases there is a single pass, so `k` is `0`.
`label`	integer	Ground-truth label (`0` or `1`).
`score`	float	Model probability for the positive class, in `[0, 1]`.

The target variable (temp, psal, pres) is encoded in the file name, not in a column — each file holds a single target.
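A quick sanity check of rows against the schema above could look like the following sketch. It works on plain dictionaries (with Polars, these could come from `pl.read_parquet(path).to_dicts()`); the function name `schema_problems` and its messages are illustrative assumptions, not part of the library.

```python
from numbers import Integral, Real

# Column names and value rules are taken from the schema table;
# the helper itself is hypothetical.
EXPECTED_COLUMNS = {"method", "k", "label", "score"}


def schema_problems(row: dict) -> list:
    """Return a list of human-readable problems; empty if the row looks valid."""
    problems = []
    missing = EXPECTED_COLUMNS - set(row)
    if missing:
        return [f"missing columns: {sorted(missing)}"]
    method = row["method"]
    if not isinstance(method, str) or method != method.lower():
        problems.append("method must be a lowercase string")
    if not isinstance(row["k"], Integral) or row["k"] < 0:
        problems.append("k must be a non-negative integer")
    if row["label"] not in (0, 1):
        problems.append("label must be 0 or 1")
    score = row["score"]
    if not isinstance(score, Real) or not 0.0 <= score <= 1.0:
        problems.append("score must be a number in [0, 1]")
    return problems


print(schema_problems({"method": "xgb", "k": 0, "label": 1, "score": 0.93}))
print(schema_problems({"method": "xgb", "k": 0, "label": 2, "score": 1.5}))
```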

Note

These files intentionally do not contain a predicted_label column. A predicted label is simply score >= threshold, so storing it would bake in a single threshold and make the file less useful for sweeping the threshold — which is exactly what ROC and Precision-Recall analysis does. Derive labels yourself at whatever threshold you need:

Python / Polars: predicted = (df["score"] >= t).cast(pl.Int64)
R: predicted <- as.integer(df$score >= t)

See Prediction Threshold for the threshold used when the library itself generates predicted labels.
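Sweeping several thresholds over one file might look like the sketch below. It uses plain Python lists with made-up values; with Polars, the two lists could come from `df["label"].to_list()` and `df["score"].to_list()`. The function name `confusion_counts` is illustrative and not part of the library.

```python
def confusion_counts(labels, scores, threshold):
    """Return (tp, fp, tn, fn), deriving predicted = score >= threshold."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        predicted = int(s >= threshold)
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


# Made-up values standing in for the label and score columns.
labels = [0, 0, 1, 1, 1, 0]
scores = [0.10, 0.40, 0.35, 0.80, 0.95, 0.70]

for t in (0.3, 0.5, 0.75):
    tp, fp, tn, fn = confusion_counts(labels, scores, t)
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    print(f"t={t}: tp={tp} fp={fp} tn={tn} fn={fn} "
          f"precision={precision:.2f} recall={recall:.2f}")
```

The same scores are reused for every threshold, which is the reason the file stores scores rather than labels.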

File names by phase

Phase	Single model	Suite (multiple methods)
Validation (step 2)	`model_scores_{target}.parquet`	`model_scores_{method}_{target}.parquet`
Build / test (step 4)	`test_model_scores_{target}.parquet`	`test_model_scores_{target}.parquet`
Classify (step 6)	`classify_model_scores_{target}.parquet`	`classify_model_scores_{target}.parquet`

In suite mode, the build/test and classify phases write one file per target, and the method column distinguishes the methods within it. Suite validation instead writes one file per method, so the method column is constant within each file.
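The naming rules in the table can be summarised as a small helper. This is only a sketch: the function `model_scores_filename` and the phase keys `"validation"`, `"build"`, and `"classify"` are hypothetical names chosen for illustration, while the file name patterns come from the table above.

```python
from typing import Optional

# Hypothetical phase keys mapped to the file name prefixes from the table.
_PREFIX = {"validation": "", "build": "test_", "classify": "classify_"}


def model_scores_filename(phase: str, target: str,
                          method: Optional[str] = None) -> str:
    """Build the model-scores file name for a phase and target.

    Pass `method` only for suite runs. It changes the name for
    validation alone, because build/test and classify keep all
    methods in one per-target file.
    """
    if phase not in _PREFIX:
        raise ValueError(f"unknown phase: {phase!r}")
    if phase == "validation" and method is not None:
        return f"model_scores_{method}_{target}.parquet"
    return f"{_PREFIX[phase]}model_scores_{target}.parquet"


print(model_scores_filename("validation", "temp"))
print(model_scores_filename("validation", "psal", method="xgb"))
print(model_scores_filename("build", "pres", method="rf"))
```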

Example: ROC curve in R

library(arrow)
library(pROC)

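df <- read_parquet("test_model_scores_temp.parquet")
# In a suite run this file holds several methods; keep one before pooling,
# e.g. df <- df[df$method == "xgb", ]

# Set levels and direction explicitly so pROC does not auto-detect them.
roc_obj <- roc(df$label, df$score, levels = c(0, 1), direction = "<")   # uses scores directly, all thresholds
plot(roc_obj)
auc(roc_obj)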

# Confusion matrix at a chosen operating point:
t <- 0.7
predicted <- as.integer(df$score >= t)
table(actual = df$label, predicted = predicted)

Because the files store raw score values rather than thresholded labels, the same file supports any threshold you choose without rerunning the pipeline.
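When a file holds several methods (suite build/test or classify) or several folds (validation), pooling all rows into one curve mixes them together. A per-group AUC can be sketched in plain Python as below. The helper names `auc_from_scores` and `auc_by_group` are illustrative, the rows are made up, and the pairwise AUC is quadratic in group size, so a library routine is likely a better fit for large files.

```python
from collections import defaultdict


def auc_from_scores(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Returns NaN if a class is missing.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return float("nan")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_by_group(rows):
    """Compute one AUC per (method, k) pair from schema-shaped row dicts."""
    groups = defaultdict(lambda: ([], []))
    for row in rows:
        labels, scores = groups[(row["method"], row["k"])]
        labels.append(row["label"])
        scores.append(row["score"])
    return {key: auc_from_scores(*cols) for key, cols in groups.items()}


# Made-up rows; with Polars they could come from
# pl.read_parquet(path).to_dicts().
rows = [
    {"method": "xgb", "k": 0, "label": 1, "score": 0.9},
    {"method": "xgb", "k": 0, "label": 0, "score": 0.2},
    {"method": "xgb", "k": 0, "label": 1, "score": 0.6},
    {"method": "xgb", "k": 0, "label": 0, "score": 0.7},
    {"method": "rf", "k": 0, "label": 1, "score": 0.8},
    {"method": "rf", "k": 0, "label": 0, "score": 0.3},
]
result = auc_by_group(rows)
print(result)
```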