Performance Evaluation
======================

Each phase that runs a model writes a per-target **model-scores** file: a long-format table with one row per scored prediction. These files are intended for performance evaluation (computing ROC curves, Precision-Recall curves, confusion matrices at a chosen threshold, AUC, and so on), including in external tools such as R.

Schema
------

.. list-table::
   :header-rows: 1
   :widths: 15 12 73

   * - Column
     - Type
     - Description
   * - ``method``
     - string
     - The model that produced the row, as a lowercase short name (for example ``xgb``, ``dt``, ``rf``, ``logit``, ``lda``, ``svm``, ``knn``, ``gnb``, ``mlp``).
   * - ``k``
     - integer
     - Fold index. For validation this ranges over the cross-validation folds (``0 .. K-1``); for the build/test and classify phases there is a single pass, so ``k`` is ``0``.
   * - ``label``
     - integer
     - Ground-truth label (``0`` or ``1``).
   * - ``score``
     - float
     - Model probability for the positive class, in ``[0, 1]``.

The target variable (``temp``, ``psal``, ``pres``) is encoded in the *file name*, not in a column; each file holds a single target.

.. note::

   These files intentionally do **not** contain a ``predicted_label`` column. A predicted label is simply ``score >= threshold``, so storing it would bake in a single threshold and make the file less useful for sweeping the threshold, which is exactly what ROC and Precision-Recall analysis does. Derive labels yourself at whatever threshold you need:

   * Python / Polars: ``predicted = (df["score"] >= t).cast(pl.Int64)``
   * R: ``predicted <- as.integer(df$score >= t)``

   See :doc:`prediction_threshold` for the threshold used when the library itself generates predicted labels.

File names by phase
-------------------

.. list-table::
   :header-rows: 1
   :widths: 22 39 39

   * - Phase
     - Single model
     - Suite (multiple methods)
   * - Validation (step 2)
     - ``model_scores_{target}.parquet``
     - ``model_scores_{method}_{target}.parquet``
   * - Build / test (step 4)
     - ``test_model_scores_{target}.parquet``
     - ``test_model_scores_{target}.parquet``
   * - Classify (step 6)
     - ``classify_model_scores_{target}.parquet``
     - ``classify_model_scores_{target}.parquet``

In the suite build/classify files, the ``method`` column distinguishes the methods within a single per-target file. In the suite validation files, each method has its own file, so the ``method`` column is constant within a file.

Example: ROC curve in R
-----------------------

.. code-block:: r

   library(arrow)
   library(pROC)

   df <- read_parquet("test_model_scores_temp.parquet")

   roc_obj <- roc(df$label, df$score)  # uses scores directly, all thresholds
   plot(roc_obj)
   auc(roc_obj)

   # Confusion matrix at a chosen operating point:
   t <- 0.7
   predicted <- as.integer(df$score >= t)
   table(actual = df$label, predicted = predicted)

Because the files store raw ``score`` values rather than thresholded labels, the same file supports any threshold you choose without rerunning the pipeline.
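
Example: per-method AUC in Python
---------------------------------

A suite build/test or classify file holds several methods in one table, so an evaluation usually groups rows by ``method`` first. The following is a minimal, dependency-free sketch of that idea, not part of the library: the helper names ``roc_auc`` and ``confusion`` are hypothetical, and the rows are invented illustration data in the model-scores schema rather than real output. In practice the rows would come from a file, for example via ``pl.read_parquet("test_model_scores_temp.parquet")`` in Polars.

.. code-block:: python

   from collections import defaultdict

   # Invented rows in the model-scores schema: (method, k, label, score).
   rows = [
       ("xgb", 0, 1, 0.92), ("xgb", 0, 0, 0.35),
       ("xgb", 0, 1, 0.71), ("xgb", 0, 0, 0.10),
       ("rf", 0, 1, 0.80), ("rf", 0, 0, 0.75),
       ("rf", 0, 1, 0.40), ("rf", 0, 0, 0.20),
   ]


   def roc_auc(labels, scores):
       """AUC as the fraction of (positive, negative) pairs ranked correctly.

       Ties count as half a correct pair. Assumes both classes are present.
       """
       pos = [s for y, s in zip(labels, scores) if y == 1]
       neg = [s for y, s in zip(labels, scores) if y == 0]
       wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
       return wins / (len(pos) * len(neg))


   def confusion(labels, scores, t):
       """Return (tn, fp, fn, tp) using predicted = score >= t."""
       tn = fp = fn = tp = 0
       for y, s in zip(labels, scores):
           predicted = int(s >= t)
           if y == 1 and predicted == 1:
               tp += 1
           elif y == 1:
               fn += 1
           elif predicted == 1:
               fp += 1
           else:
               tn += 1
       return tn, fp, fn, tp


   # Group labels and scores by the ``method`` column.
   by_method = defaultdict(lambda: ([], []))
   for method, _k, label, score in rows:
       by_method[method][0].append(label)
       by_method[method][1].append(score)

   # Threshold-free AUC plus a confusion matrix at one chosen threshold.
   results = {
       m: {"auc": roc_auc(y, s), "confusion": confusion(y, s, 0.7)}
       for m, (y, s) in by_method.items()
   }

   for m, r in results.items():
       print(m, r["auc"], r["confusion"])

The pairwise AUC is quadratic in the number of rows, which is fine for a sketch; for large files a rank-based implementation from an established library is the more practical choice.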