Performance Evaluation
Each phase that runs a model writes a per-target model-scores file: a long-format table with one row per scored prediction. These files are intended for performance evaluation — computing ROC curves, Precision-Recall curves, confusion matrices at a chosen threshold, AUC, and so on — including in external tools such as R.
Schema
| Column | Type | Description |
|---|---|---|
| method | string | The model that produced the row, as a lowercase short name (for example …). |
| | integer | Fold index. For validation this ranges over the cross-validation folds (…). |
| label | integer | Ground-truth label (…). |
| score | float | Model probability for the positive class, in …. |
The target variable (temp, psal, pres) is encoded in the file
name, not in a column — each file holds a single target.
Note
These files intentionally do not contain a predicted_label column.
A predicted label is simply score >= threshold, so storing it would bake
in a single threshold and make the file less useful for sweeping the
threshold — which is exactly what ROC and Precision-Recall analysis does.
Derive labels yourself at whatever threshold you need:
Python / Polars:
predicted = (df["score"] >= t).cast(pl.Int64)
R:
predicted <- as.integer(df$score >= t)
See Prediction Threshold for the threshold used when the library itself generates predicted labels.
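As a rough illustration of why storing raw scores is convenient, the sketch below sweeps a few thresholds over the same rows and tallies a confusion matrix at each one. It uses only the Python standard library and made-up data. It assumes that label is coded 0/1 with 1 as the positive class, and the helper name confusion_counts is hypothetical, not part of the library.

```python
def confusion_counts(labels, scores, threshold):
    """Tally TP/FP/TN/FN, predicting positive when score >= threshold.

    Assumes labels are 0/1 with 1 as the positive class.
    """
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for label, score in zip(labels, scores):
        predicted = int(score >= threshold)
        if predicted == 1 and label == 1:
            counts["tp"] += 1
        elif predicted == 1 and label == 0:
            counts["fp"] += 1
        elif predicted == 0 and label == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


# Made-up rows for illustration only; in practice these would be the
# label and score columns of one model-scores file.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.92, 0.35, 0.68, 0.41, 0.74, 0.08]

# The same scores serve every threshold; nothing has to be re-scored.
for t in (0.3, 0.5, 0.7):
    print(t, confusion_counts(labels, scores, t))
```

With a Polars DataFrame, the two columns could be passed in as plain lists, for example via df["label"].to_list().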
File names by phase
| Phase | Single model | Suite (multiple methods) |
|---|---|---|
| Validation (step 2) | | |
| Build / test (step 4) | | |
| Classify (step 6) | | |
In the suite build/classify files, the method column distinguishes the
methods within a single per-target file. In the suite validation files, each
method has its own file, so the method column is constant within a file.
Example: ROC curve in R
library(arrow)
library(pROC)
df <- read_parquet("test_model_scores_temp.parquet")
roc_obj <- roc(df$label, df$score) # uses scores directly, all thresholds
plot(roc_obj)
auc(roc_obj)
# Confusion matrix at a chosen operating point:
t <- 0.7
predicted <- as.integer(df$score >= t)
table(actual = df$label, predicted = predicted)
Because the files store raw score values rather than thresholded labels,
the same file supports any threshold you choose without rerunning the
pipeline.
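For readers working in Python rather than R, the idea behind the ROC example can be sketched by hand: use every distinct score as a threshold, record the false-positive and true-positive rates, and integrate with the trapezoid rule. This is an illustrative standard-library sketch on made-up data, not the library's own evaluation code. It assumes 0/1 labels with 1 as the positive class and at least one row of each class, and the function names are hypothetical.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) pairs, one per distinct score used as a threshold.

    Assumes 0/1 labels and at least one positive and one negative row.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing is flagged
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as ordered (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up rows for illustration only.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.92, 0.35, 0.68, 0.41, 0.74, 0.08]

curve = roc_points(labels, scores)
auc_value = auc_trapezoid(curve)
print(curve)
print(auc_value)
```

This loop rescans all rows for each threshold, which is fine for a sketch but slow on large files; a dedicated library routine would be the better choice for real evaluations.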