Configuration of Training & Evaluation
The train workflow (stage="train") is responsible for orchestrating the machine learning model building process. It takes the prepared dataset (the output from the prepare workflow) and handles critical steps such as cross-validation, actual model training, and final evaluation on a held-out test set.
While the prepare workflow focuses on complex data transformation and feature engineering, the train configuration is generally simpler. Its primary role is to leverage the “building blocks” concept to specify:
The machine learning model to be used.
The chosen validation strategy (e.g., k-fold cross-validation).
The locations of the prepared input data and where to save the final trained models.
Detailed Configuration Sections
path_info_sets
This section is crucial for linking the training workflow to the prepared datasets from the previous prepare stage and defining where to save the resulting trained models.
common.base_path: The root directory where the prepared dataset (output from the prepare workflow) is located. This typically corresponds to the common.base_path defined in your prepare_config.yaml.
input.step_folder_name: The name of the subdirectory within the prepared dataset’s folder where the final training/validation/test splits are located (e.g., training).
path_info_sets:
  - name: data_set_1
    common:
      base_path: /path/to/data
    input:
      step_folder_name: training
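As a rough illustration of how these keys could combine with a training set's dataset_folder_name (described under training_sets), the sketch below joins them with pathlib. The layout base_path/dataset_folder_name/step_folder_name and the helper name input_folder are assumptions made for illustration, not aiqclib's documented behaviour.

```python
from pathlib import PurePosixPath


def input_folder(path_info: dict, dataset_folder_name: str) -> PurePosixPath:
    """Join the configured path pieces (assumed layout, not aiqclib's API)."""
    base = PurePosixPath(path_info["common"]["base_path"])
    step = path_info["input"]["step_folder_name"]
    return base / dataset_folder_name / step


# Same values as the path_info_sets entry shown above.
path_info = {
    "name": "data_set_1",
    "common": {"base_path": "/path/to/data"},
    "input": {"step_folder_name": "training"},
}

print(input_folder(path_info, "dataset_0001"))
```

If the actual directory layout differs, only the join order in input_folder would need to change.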
target_sets
Similar to the prepare workflow, this section specifies the target variables for your machine learning model. It ensures that the training process correctly identifies which column represents the prediction target and understands its associated quality control (QC) flags, which are often used to filter or weight data during training.
target_sets:
  - name: target_set_1
    variables:
      - name: temp
        flag: temp_qc
        pos_flag_values: [ 4, 6, 7 ]
        neg_flag_values: [ 1 ]
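One way to picture how flag values could become training labels is sketched below. It assumes that rows whose flag is listed in pos_flag_values form the positive class, rows listed in neg_flag_values form the negative class, and rows with any other flag are skipped. This interpretation and the helper label_from_flag are illustrative assumptions, not a description of aiqclib internals.

```python
from typing import Optional


def label_from_flag(flag: int, pos_values: list, neg_values: list) -> Optional[int]:
    """Return 1 for a positive flag, 0 for a negative flag, None otherwise."""
    if flag in pos_values:
        return 1
    if flag in neg_values:
        return 0
    return None  # flag is in neither list, so the row is ignored


# Same values as the target_sets entry shown above.
variable = {
    "name": "temp",
    "flag": "temp_qc",
    "pos_flag_values": [4, 6, 7],
    "neg_flag_values": [1],
}

flags = [1, 4, 2, 7, 1]  # made-up temp_qc values for five rows
labels = [
    label_from_flag(f, variable["pos_flag_values"], variable["neg_flag_values"])
    for f in flags
]
print(labels)
```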
step_class_sets
This powerful section allows you to define the core components of your training pipeline by specifying the Python classes to use for each major step. This is where you choose your machine learning model, the cross-validation method, and other pipeline components.
steps.input: The class responsible for ingesting the prepared training, validation, and test datasets.
steps.validate: The class defining the cross-validation strategy (e.g., KFoldValidation, TimeSeriesValidation).
steps.model: The class for the machine learning algorithm to be trained (e.g., XGBoost, RandomForest).
steps.build: The class that handles the final model training on the full training set and saving the model artifacts.
step_class_sets:
  - name: training_step_set_1
    steps:
      input: InputTrainingSetA
      validate: KFoldValidation
      model: XGBoost
      build: BuildModel
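To make the idea behind a k-fold validation step concrete, here is a minimal, library-free sketch of how row indices can be split into k folds, with each fold serving once as the validation set. It illustrates the general technique only; the function k_fold_indices is hypothetical and is not how KFoldValidation is necessarily implemented (for instance, it does not shuffle rows).

```python
def k_fold_indices(n_rows: int, k: int):
    """Yield (train_idx, valid_idx) pairs for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        valid = list(range(start, stop))
        train = [i for i in range(n_rows) if i < start or i >= stop]
        yield train, valid
        start = stop


folds = list(k_fold_indices(n_rows=10, k=5))
for train_idx, valid_idx in folds:
    print("validate on", valid_idx, "train on", len(train_idx), "rows")
```

Every row is used for validation exactly once and for training in the remaining k - 1 rounds.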
Note
aiqclib integrates multiple ML algorithms, and it is easy to switch between them by setting the model key. For more details, see the dedicated Algorithm Selection page.
step_param_sets
This section provides detailed parameters for the classes defined in your chosen step_class_sets. This allows you to fine-tune the behavior of each step, such as specifying the number of folds for cross-validation or providing hyperparameters for your machine learning model.
steps.input: Parameters for the input data loading step (often empty or simple flags).
steps.validate.k_fold: For KFoldValidation, specifies the number of folds for cross-validation.
steps.model.calculate_shap: This is used to control SHAP value calculation.
steps.model.model_params.scale_pos_weight: This is used to address imbalanced datasets by weighting the positive class. For example, 200 indicates a ratio of negative to positive records of 200:1.
steps.model.model_params.n_jobs: The number of threads used by XGBoost. It tries to use all available CPU cores if it is set to -1.
steps.build: Parameters for the final model building step (often empty or simple flags for saving).
step_param_sets:
  - name: training_param_set_1
    steps:
      input: { }
      validate: { k_fold: 5 }
      model: { calculate_shap: False,
               model_params: { scale_pos_weight: 200,
                               n_jobs: -1 } }
      build: { }
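A common starting point for scale_pos_weight is the ratio of negative to positive records in the training data. The sketch below computes that ratio from a list of 0/1 labels; the helper name and the label data are made up for illustration, and the resulting value would still need to be written into the configuration by hand.

```python
def scale_pos_weight_from_labels(labels) -> float:
    """Ratio of negative (0) to positive (1) records."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    if n_pos == 0:
        raise ValueError("no positive records; the ratio is undefined")
    return n_neg / n_pos


labels = [1] * 5 + [0] * 1000  # made-up, heavily imbalanced labels
print(scale_pos_weight_from_labels(labels))
```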
Note
SHAP values can be automatically calculated during the test phase. For more details, see the dedicated SHAP Values page.
Note
Model parameters specified by model_params differ across ML algorithms. Please consult the dedicated Algorithm Selection page.
training_sets
This is the main “assembly” section that defines a complete training and evaluation job. Each entry in this list orchestrates a unique training run by linking together the prepared dataset with the specific path, target variable, and step configurations (classes and parameters).
name: A unique identifier for this particular training job.
dataset_folder_name: The name of the specific folder (created by the prepare workflow) containing the prepared data for this job (e.g., dataset_0001).
path_info: The name of the path configuration to use from path_info_sets.
target_set: The name of the target variable configuration to use from target_sets.
step_class_set & step_param_set: The name of the step class and parameter configurations to use, respectively.
training_sets:
  - name: training_0001
    dataset_folder_name: dataset_0001
    path_info: data_set_1
    target_set: target_set_1
    step_class_set: training_step_set_1
    step_param_set: training_param_set_1
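The name-based linking can be sketched in a few lines of Python. The example below takes a configuration that has already been parsed into plain dictionaries and replaces each reference in a training set with the entry it names. The helpers find_by_name and resolve_training_set are hypothetical and only illustrate the lookup; aiqclib's own loader may work differently.

```python
def find_by_name(entries: list, name: str) -> dict:
    """Return the first entry whose 'name' matches, or raise a clear error."""
    for entry in entries:
        if entry["name"] == name:
            return entry
    raise KeyError(f"no entry named {name!r}")


def resolve_training_set(config: dict, training_name: str) -> dict:
    """Replace each name reference in a training set with the entry it names."""
    job = find_by_name(config["training_sets"], training_name)
    return {
        "name": job["name"],
        "dataset_folder_name": job["dataset_folder_name"],
        "path_info": find_by_name(config["path_info_sets"], job["path_info"]),
        "target_set": find_by_name(config["target_sets"], job["target_set"]),
        "step_class_set": find_by_name(config["step_class_sets"], job["step_class_set"]),
        "step_param_set": find_by_name(config["step_param_sets"], job["step_param_set"]),
    }


# Trimmed-down version of the configuration described in this page.
config = {
    "path_info_sets": [{"name": "data_set_1",
                        "common": {"base_path": "/path/to/data"},
                        "input": {"step_folder_name": "training"}}],
    "target_sets": [{"name": "target_set_1", "variables": []}],
    "step_class_sets": [{"name": "training_step_set_1",
                         "steps": {"model": "XGBoost"}}],
    "step_param_sets": [{"name": "training_param_set_1",
                         "steps": {"validate": {"k_fold": 5}}}],
    "training_sets": [{"name": "training_0001",
                       "dataset_folder_name": "dataset_0001",
                       "path_info": "data_set_1",
                       "target_set": "target_set_1",
                       "step_class_set": "training_step_set_1",
                       "step_param_set": "training_param_set_1"}],
}

resolved = resolve_training_set(config, "training_0001")
print(resolved["step_class_set"]["steps"]["model"])
```

A misspelled reference raises a KeyError that names the missing entry, which is usually easier to diagnose than a failure later in the run.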
Note
While you can define multiple training sets in the training_sets section, a specific one must be selected for subsequent processes. Please consult the dedicated Selecting Specific Configurations page for instructions on how to do this.
Full Example with XGBoost
Below is a complete example of a training_config.yaml file. The lines you will most commonly need to edit or customize are annotated with inline comments.
---
path_info_sets:
  - name: data_set_1
    common:
      base_path: /path/to/data # Root directory containing prepared data
    input:
      step_folder_name: training

target_sets:
  - name: target_set_1
    variables:
      - name: temp
        flag: temp_qc
        pos_flag_values: [ 4, 6, 7 ]
        neg_flag_values: [ 1 ]
      - name: psal
        flag: psal_qc
        pos_flag_values: [ 4, 6, 7 ]
        neg_flag_values: [ 1 ]
      - name: pres
        flag: pres_qc
        pos_flag_values: [ 4, 6, 7 ]
        neg_flag_values: [ 1 ]

step_class_sets:
  - name: training_step_set_1
    steps:
      input: InputTrainingSetA
      validate: KFoldValidation
      model: XGBoost
      build: BuildModel

step_param_sets:
  - name: training_param_set_1
    steps:
      input: { }
      validate: { k_fold: 5 }
      model: { calculate_shap: False,
               model_params: { scale_pos_weight: 200,
                               n_jobs: -1 } }
      build: { }

training_sets:
  - name: training_0001 # A unique name for this training job
    dataset_folder_name: dataset_0001 # The folder name containing the prepared data for this job
    path_info: data_set_1
    target_set: target_set_1
    step_class_set: training_step_set_1
    step_param_set: training_param_set_1