Configuration of Training & Evaluation

The train workflow (stage="train") orchestrates the model building process. It takes the prepared dataset (the output of the prepare workflow) and handles cross-validation, model training, and final evaluation on a held-out test set.

While the prepare workflow focuses on complex data transformation and feature engineering, the train configuration is generally simpler. Its primary role is to leverage the “building blocks” concept to specify:

  • The machine learning model to be used.

  • The chosen validation strategy (e.g., k-fold cross-validation).

  • The locations of the prepared input data and where to save the final trained models.

Detailed Configuration Sections

path_info_sets

This section links the training workflow to the datasets produced by the prepare stage and defines where to save the resulting trained models.

  • common.base_path: The root directory where the prepared dataset (output from the prepare workflow) is located. This typically corresponds to the common.base_path defined in your prepare_config.yaml.

  • input.step_folder_name: The name of the subdirectory within the prepared dataset’s folder where the final training/validation/test splits are located (e.g., training).

path_info_sets:
  - name: data_set_1
    common:
      base_path: /path/to/data
    input:
      step_folder_name: training
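As a rough illustration of how these two keys might combine with a dataset folder name, the sketch below joins them into one input directory. The layout base_path/dataset_folder_name/step_folder_name is an assumption inferred from the descriptions on this page, and resolve_input_dir is a hypothetical helper, not an aiqclib function.

```python
from pathlib import PurePosixPath


def resolve_input_dir(path_info: dict, dataset_folder_name: str) -> PurePosixPath:
    """Join base_path, the dataset folder, and the step folder (assumed layout)."""
    base = PurePosixPath(path_info["common"]["base_path"])
    step = path_info["input"]["step_folder_name"]
    return base / dataset_folder_name / step


# Mirrors the path_info_sets entry shown above.
path_info = {
    "name": "data_set_1",
    "common": {"base_path": "/path/to/data"},
    "input": {"step_folder_name": "training"},
}

input_dir = resolve_input_dir(path_info, "dataset_0001")
print(input_dir)
```

If the library lays out folders differently, only the join order in resolve_input_dir would need to change.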

target_sets

Similar to the prepare workflow, this section specifies the target variables for your machine learning model. It ensures that the training process correctly identifies which column represents the prediction target and understands its associated quality control (QC) flags, which are often used to filter or weight data during training.

target_sets:
  - name: target_set_1
    variables:
      - name: temp
        flag: temp_qc
        pos_flag_values: [ 4, 6, 7 ]
        neg_flag_values: [ 1 ]
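To make the role of pos_flag_values and neg_flag_values concrete, here is a minimal sketch of how QC flags could be turned into binary labels. It assumes that flags in pos_flag_values map to the positive class (1), flags in neg_flag_values map to the negative class (0), and any other flag is left unlabeled; the library's actual handling may differ. The function flag_to_label and the sample records are hypothetical.

```python
from typing import List, Optional


def flag_to_label(flag: int, pos_flag_values: List[int],
                  neg_flag_values: List[int]) -> Optional[int]:
    """Return 1 for a positive flag, 0 for a negative flag, None otherwise."""
    if flag in pos_flag_values:
        return 1
    if flag in neg_flag_values:
        return 0
    return None  # assumption: flags in neither list carry no label


# Mirrors the target_sets entry shown above.
target = {"name": "temp", "flag": "temp_qc",
          "pos_flag_values": [4, 6, 7], "neg_flag_values": [1]}

# Made-up records for illustration only.
records = [
    {"temp": 12.1, "temp_qc": 1},
    {"temp": 35.0, "temp_qc": 4},
    {"temp": 11.8, "temp_qc": 2},
    {"temp": -3.0, "temp_qc": 7},
]

labels = [
    flag_to_label(r[target["flag"]], target["pos_flag_values"], target["neg_flag_values"])
    for r in records
]
print(labels)
```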

step_class_sets

This section defines the core components of your training pipeline by naming the Python class to use for each major step. This is where you choose your machine learning model, the cross-validation method, and the other pipeline components.

  • steps.input: The class responsible for ingesting the prepared training, validation, and test datasets.

  • steps.validate: The class defining the cross-validation strategy (e.g., KFoldValidation, TimeSeriesValidation).

  • steps.model: The class for the machine learning algorithm to be trained (e.g., XGBoost, RandomForest).

  • steps.build: The class that handles the final model training on the full training set and saving the model artifacts.

step_class_sets:
  - name: training_step_set_1
    steps:
      input: InputTrainingSetA
      validate: KFoldValidation
      model: XGBoost
      build: BuildModel
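The idea of selecting classes by name can be sketched with a small registry. Everything below is hypothetical: the four classes are empty stand-ins that only share their names with the config values, and STEP_CLASSES and resolve_step_classes are illustrative names, not part of aiqclib.

```python
# Empty stand-ins for illustration only; these are NOT the aiqclib classes.
class InputTrainingSetA:
    pass


class KFoldValidation:
    pass


class XGBoost:
    pass


class BuildModel:
    pass


# Hypothetical registry keyed by the class names used in the config.
STEP_CLASSES = {
    cls.__name__: cls
    for cls in (InputTrainingSetA, KFoldValidation, XGBoost, BuildModel)
}


def resolve_step_classes(steps: dict) -> dict:
    """Map each step key (input, validate, ...) to the class named in the config."""
    unknown = sorted(name for name in steps.values() if name not in STEP_CLASSES)
    if unknown:
        raise ValueError(f"Unknown step class name(s): {unknown}")
    return {step: STEP_CLASSES[name] for step, name in steps.items()}


# Mirrors the steps mapping shown above.
steps = {
    "input": "InputTrainingSetA",
    "validate": "KFoldValidation",
    "model": "XGBoost",
    "build": "BuildModel",
}
resolved = resolve_step_classes(steps)
print(resolved["model"].__name__)
```

With a lookup like this, switching algorithms only requires changing the string under the model key.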

Note

aiqclib integrates multiple ML algorithms, and it is easy to switch between them by setting the model key. For more details, see the dedicated Algorithm Selection page.

step_param_sets

This section provides detailed parameters for the classes defined in your chosen step_class_sets. This allows you to fine-tune the behavior of each step, such as specifying the number of folds for cross-validation or providing hyperparameters for your machine learning model.

  • steps.input: Parameters for the input data loading step (often empty or simple flags).

  • steps.validate.k_fold: For KFoldValidation, specifies the number of folds for cross-validation.

  • steps.model.calculate_shap: Whether to calculate SHAP values (True or False).

  • steps.model.model_params.scale_pos_weight: Addresses class imbalance by weighting the positive class. For example, a value of 200 is appropriate when negative records outnumber positive records by about 200:1.

  • steps.model.model_params.n_jobs: The number of threads used by XGBoost. When set to -1, XGBoost tries to use all available CPU cores.

  • steps.build: Parameters for the final model building step (often empty or simple flags for saving).
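To show what the k_fold parameter controls, the following sketch splits record indices into k contiguous folds, each used once for validation. It is a generic k-fold illustration without shuffling or stratification; how KFoldValidation actually partitions the data is not stated on this page, and k_fold_indices is a hypothetical helper.

```python
from typing import List, Tuple


def k_fold_indices(n_records: int, k_fold: int) -> List[Tuple[List[int], List[int]]]:
    """Return k_fold (train_indices, validation_indices) pairs over range(n_records)."""
    if not 2 <= k_fold <= n_records:
        raise ValueError("k_fold must be between 2 and n_records")
    base, extra = divmod(n_records, k_fold)
    folds = []
    start = 0
    for i in range(k_fold):
        size = base + (1 if i < extra else 0)  # the first `extra` folds get one more record
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_records))
        folds.append((train, validation))
        start += size
    return folds


folds = k_fold_indices(n_records=10, k_fold=5)
for train, validation in folds:
    print(f"train={train} validation={validation}")
```

Each record appears in exactly one validation fold, so every record contributes to the validation score once.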

step_param_sets:
  - name: training_param_set_1
    steps:
      input: { }
      validate: { k_fold: 5 }
      model: { calculate_shap: False,
               model_params: { scale_pos_weight: 200,
                               n_jobs: -1 } }
      build: { }
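A value such as scale_pos_weight: 200 is usually derived from the class balance of the training labels. The sketch below computes the ratio of negative to positive records, which is the common starting point suggested for XGBoost. The helper suggest_scale_pos_weight and the label counts are illustrative assumptions, not part of aiqclib.

```python
from typing import Iterable


def suggest_scale_pos_weight(labels: Iterable[int]) -> float:
    """Return (number of negative records) / (number of positive records)."""
    labels = list(labels)
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    if n_pos == 0:
        raise ValueError("No positive records; the ratio is undefined")
    return n_neg / n_pos


# Made-up label counts: 2000 negative records and 10 positive records.
labels = [0] * 2000 + [1] * 10
weight = suggest_scale_pos_weight(labels)
print(weight)
```

Treat the ratio as a starting value and tune it against your validation results.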

Note

SHAP values can be calculated automatically during the test phase. For more details, see the dedicated SHAP Values page.

Note

Model parameters specified by model_params differ across ML algorithms. Please consult the dedicated Algorithm Selection page.

training_sets

This is the main “assembly” section that defines a complete training and evaluation job. Each entry in this list orchestrates a unique training run by linking together the prepared dataset with the specific path, target variable, and step configurations (classes and parameters).

  • name: A unique identifier for this particular training job.

  • dataset_folder_name: The name of the specific folder (created by the prepare workflow) containing the prepared data for this job (e.g., dataset_0001).

  • path_info: The name of the path configuration to use from path_info_sets.

  • target_set: The name of the target variable configuration to use from target_sets.

  • step_class_set & step_param_set: The names of the step class configuration (from step_class_sets) and the step parameter configuration (from step_param_sets) to use.

training_sets:
  - name: training_0001
    dataset_folder_name: dataset_0001
    path_info: data_set_1
    target_set: target_set_1
    step_class_set: training_step_set_1
    step_param_set: training_param_set_1
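The name-based linking described above can be sketched as a lookup across sections. The example works on a plain dictionary that mirrors the YAML structure; find_by_name and resolve_training_set are hypothetical helpers written for this illustration and do not reflect how aiqclib resolves references internally.

```python
def find_by_name(entries: list, name: str, section: str) -> dict:
    """Return the entry in `entries` whose name matches, or raise KeyError."""
    for entry in entries:
        if entry["name"] == name:
            return entry
    raise KeyError(f"{section} has no entry named {name!r}")


def resolve_training_set(config: dict, training_name: str) -> dict:
    """Replace each name reference in a training set with the entry it points to."""
    job = find_by_name(config["training_sets"], training_name, "training_sets")
    return {
        "name": job["name"],
        "dataset_folder_name": job["dataset_folder_name"],
        "path_info": find_by_name(config["path_info_sets"], job["path_info"], "path_info_sets"),
        "target_set": find_by_name(config["target_sets"], job["target_set"], "target_sets"),
        "step_classes": find_by_name(config["step_class_sets"], job["step_class_set"], "step_class_sets"),
        "step_params": find_by_name(config["step_param_sets"], job["step_param_set"], "step_param_sets"),
    }


# A dictionary mirroring the YAML sections described on this page.
config = {
    "path_info_sets": [{"name": "data_set_1",
                        "common": {"base_path": "/path/to/data"},
                        "input": {"step_folder_name": "training"}}],
    "target_sets": [{"name": "target_set_1",
                     "variables": [{"name": "temp", "flag": "temp_qc",
                                    "pos_flag_values": [4, 6, 7],
                                    "neg_flag_values": [1]}]}],
    "step_class_sets": [{"name": "training_step_set_1",
                         "steps": {"input": "InputTrainingSetA",
                                   "validate": "KFoldValidation",
                                   "model": "XGBoost",
                                   "build": "BuildModel"}}],
    "step_param_sets": [{"name": "training_param_set_1",
                         "steps": {"input": {}, "validate": {"k_fold": 5},
                                   "model": {"calculate_shap": False,
                                             "model_params": {"scale_pos_weight": 200,
                                                              "n_jobs": -1}},
                                   "build": {}}}],
    "training_sets": [{"name": "training_0001",
                       "dataset_folder_name": "dataset_0001",
                       "path_info": "data_set_1",
                       "target_set": "target_set_1",
                       "step_class_set": "training_step_set_1",
                       "step_param_set": "training_param_set_1"}],
}

resolved_job = resolve_training_set(config, "training_0001")
print(resolved_job["step_classes"]["steps"]["model"])
```

A misspelled reference surfaces as a KeyError naming the section that lacks the entry.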

Note

While you can define multiple training sets in the training_sets section, a specific one must be selected for subsequent processes. Please consult the dedicated Selecting Specific Configurations page for instructions on how to do this.
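Because entries are referenced by name, defining several training sets makes duplicate names an easy mistake. The sketch below reports names that occur more than once within a section. It assumes the configuration has already been loaded into a dictionary of lists; duplicate_names is a hypothetical helper, not an aiqclib function.

```python
from collections import Counter


def duplicate_names(config: dict) -> dict:
    """Return {section: [names that occur more than once]} for each affected section."""
    duplicates = {}
    for section, entries in config.items():
        counts = Counter(entry["name"] for entry in entries)
        repeated = sorted(name for name, count in counts.items() if count > 1)
        if repeated:
            duplicates[section] = repeated
    return duplicates


# Made-up configuration with one accidental duplicate.
sample_config = {
    "path_info_sets": [{"name": "data_set_1"}],
    "training_sets": [
        {"name": "training_0001"},
        {"name": "training_0002"},
        {"name": "training_0001"},
    ],
}

problems = duplicate_names(sample_config)
print(problems)
```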

Full Example with XGBoost

Below is a complete example of a training_config.yaml file. The lines you will most commonly need to edit are annotated with inline comments.

Full training_config.yaml example
---
path_info_sets:
  - name: data_set_1
    common:
      base_path: /path/to/data # Root directory containing prepared data
    input:
      step_folder_name: training

target_sets:
  - name: target_set_1
    variables:
      - name: temp
        flag: temp_qc
        pos_flag_values: [ 4, 6, 7 ]
        neg_flag_values: [ 1 ]
      - name: psal
        flag: psal_qc
        pos_flag_values: [ 4, 6, 7 ]
        neg_flag_values: [ 1 ]
      - name: pres
        flag: pres_qc
        pos_flag_values: [ 4, 6, 7 ]
        neg_flag_values: [ 1 ]

step_class_sets:
  - name: training_step_set_1
    steps:
      input: InputTrainingSetA
      validate: KFoldValidation
      model: XGBoost
      build: BuildModel

step_param_sets:
  - name: training_param_set_1
    steps:
      input: { }
      validate: { k_fold: 5 }
      model: { calculate_shap: False,
               model_params: { scale_pos_weight: 200,
                               n_jobs: -1 } }
      build: { }

training_sets:
  - name: training_0001  # A unique name for this training job
    dataset_folder_name: dataset_0001  # The folder name containing the prepared data for this job
    path_info: data_set_1
    target_set: target_set_1
    step_class_set: training_step_set_1
    step_param_set: training_param_set_1