Step 3: Training & Evaluation

With a properly prepared dataset (from Step 2: Dataset Preparation), you are now ready to train and evaluate a machine learning model. This workflow leverages the training, validation, and test sets created in the previous step to build a model, rigorously assess its performance using cross-validation, and generate final evaluation metrics on a held-out test set.

Like all workflows in aiqclib, this process is controlled by a dedicated YAML configuration file. As with the preparation config, it is assembled from reusable “building blocks”, which keeps the configuration modular.

Prerequisites

This tutorial assumes you have successfully completed Step 2: Dataset Preparation. The training process directly uses the output files (the split datasets) generated in that step. Ensure your ~/aiqc_project/data/dataset_0001/training/ directory exists and contains the prepared data.
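A quick way to confirm this prerequisite is a small check with Python's standard library. The sketch below is not part of aiqclib; the helper name has_prepared_data is hypothetical, and it only verifies that the directory exists and is non-empty, not that its contents are valid.

```python
import os
from pathlib import Path


def has_prepared_data(training_dir: str) -> bool:
    """Return True if training_dir exists and contains at least one entry."""
    path = Path(os.path.expanduser(training_dir))
    return path.is_dir() and any(path.iterdir())


# Path taken from this tutorial; adjust it if your project lives elsewhere.
training_dir = "~/aiqc_project/data/dataset_0001/training"
if has_prepared_data(training_dir):
    print(f"Found prepared data in {training_dir}")
else:
    print(f"No prepared data in {training_dir}; rerun Step 2 first.")
```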

The Training Workflow

The training workflow follows a similar pattern to the preparation step: you will generate a new configuration template, customize it to define your model and validation strategy, point to your input data, and specify where the trained models should be saved.

Step 3.1: Generate the Configuration Template

First, use aiqclib to generate a boilerplate configuration template specifically for the training workflow.

import aiqclib as aq
import os

config_path = os.path.expanduser("~/aiqc_project/config/training_config.yaml")
aq.write_config_template(
    file_name=config_path,
    stage="train"
)

Step 3.2: Customize the Configuration File

Now, open the newly created ~/aiqc_project/config/training_config.yaml file in your text editor. Your primary goals are to define:

  1. Input & Output Paths: Where to find the prepared dataset and where to save the trained model.

  2. Model & Validation Strategy: Which machine learning model to train and what cross-validation method to use.

You will need to edit the path_info_sets, step_class_sets, step_param_sets, and training_sets sections.

Before you modify the config, let’s create a directory where your trained models will be saved:

mkdir -p ~/aiqc_project/models

Update your training_config.yaml file so that it matches the following structure. Remember to replace the placeholder paths with the paths of your actual project setup.

Note

aiqclib integrates multiple ML algorithms, and it is easy to switch between them. For more details, see the dedicated Algorithm Selection page.

path_info_sets:
  - name: data_set_1
    common:
      base_path: ~/aiqc_project/data # Root directory of the prepared dataset (from preparation step)
    input:
      step_folder_name: training # Subdirectory containing the split training/validation/test data
    model:
      base_path: ~/aiqc_project/models # Directory where the final trained models will be saved

# Define your model and validation strategy here.
# For this tutorial, we'll use KFoldValidation with an XGBoost model.
step_class_sets:
  - name: training_step_set_1
    steps:
      input: InputTrainingSetA
      validate: KFoldValidation # Specify your cross-validation class
      model: XGBoost # Specify your ML model class (e.g., XGBoost, RandomForest)
      build: BuildModel

# Define parameters for your chosen model and validation.
# For example, number of folds for CV, or model hyperparameters.
step_param_sets:
  - name: training_param_set_1
    steps:
      input: { }
      validate: { k_fold: 5 } # 5-fold cross-validation
      model: { calculate_shap: False,                   # Control SHAP value calculation
               model_params: { scale_pos_weight: 200,   # Positive-class weight; typically the neg:pos ratio
                               n_jobs: -1 } }           # Number of threads used by XGBoost
      build: { }
training_sets:
  - name: training_0001  # A unique name for this training job
    dataset_folder_name: dataset_0001  # This MUST match the dataset_folder_name from your preparation config
    path_info: data_set_1
    target_set: target_set_1 # This needs to match a 'target_set' defined in your prepare_config.yaml
    step_class_set: training_step_set_1
    step_param_set: training_param_set_1
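Because each entry in training_sets refers to other sections by name, a typo in one of those names is an easy mistake to make. The following sketch checks those references on an already-parsed configuration. It is an illustration only: the function find_broken_references is hypothetical, the configuration is represented as a plain Python dict rather than the object returned by aiqclib, and aiqclib may well perform its own validation.

```python
def find_broken_references(config: dict) -> list:
    """Return messages for training_sets entries that name undefined sets."""
    # Map each reference key in a training set to the section it points into.
    reference_sections = {
        "path_info": "path_info_sets",
        "step_class_set": "step_class_sets",
        "step_param_set": "step_param_sets",
    }
    problems = []
    for training_set in config.get("training_sets", []):
        for key, section in reference_sections.items():
            defined = {entry["name"] for entry in config.get(section, [])}
            if training_set.get(key) not in defined:
                problems.append(
                    f"{training_set.get('name')}: {key}={training_set.get(key)!r} "
                    f"is not defined in {section}"
                )
    return problems


# Plain-dict stand-in for the parsed YAML shown above (abbreviated).
example_config = {
    "path_info_sets": [{"name": "data_set_1"}],
    "step_class_sets": [{"name": "training_step_set_1"}],
    "step_param_sets": [{"name": "training_param_set_1"}],
    "training_sets": [
        {
            "name": "training_0001",
            "path_info": "data_set_1",
            "step_class_set": "training_step_set_1",
            "step_param_set": "training_param_set_2",  # deliberate typo
        }
    ],
}

for problem in find_broken_references(example_config):
    print(problem)
```

The deliberate typo in step_param_set is the only reference reported. target_set is not checked here because it is defined in the preparation config rather than in this file.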

Note

The training configuration file includes many other options for advanced model selection, hyperparameter tuning, and cross-validation strategies. For a complete reference of all available parameters, please consult the dedicated Configuration of Training & Evaluation page.
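For the scale_pos_weight value used in the example configuration, the XGBoost documentation suggests the ratio of negative to positive samples as a typical starting point. The sketch below computes that ratio from a list of binary labels. The function name and the label encoding (1 for positive, 0 for negative) are assumptions made for illustration; they are not part of aiqclib.

```python
def suggest_scale_pos_weight(labels):
    """Return count(negative) / count(positive) for binary 0/1 labels."""
    positives = sum(1 for label in labels if label == 1)
    negatives = sum(1 for label in labels if label == 0)
    if positives == 0:
        raise ValueError("No positive samples; the ratio is undefined.")
    return negatives / positives


# Toy labels: 2 positives among 400 negatives.
toy_labels = [1] * 2 + [0] * 400
print(suggest_scale_pos_weight(toy_labels))  # 200.0
```

Treat the result as a starting value to tune, not as a fixed rule.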

Step 3.3: Run the Training Process

Once you have customized your training_config.yaml with the correct paths and model/validation configurations, you can execute the training and evaluation workflow.

Load the configuration file and then call the train_and_evaluate function:

import aiqclib as aq
import os

config_path = os.path.expanduser("~/aiqc_project/config/training_config.yaml")
config = aq.read_config(config_path)
aq.train_and_evaluate(config)

Understanding the Output

After train_and_evaluate finishes, aiqclib will have created new folders inside your dataset directory (e.g., ~/aiqc_project/data/dataset_0001/) and inside your model base path (~/aiqc_project/models/). The primary outputs are:

  • validate: Contains detailed results from the cross-validation process, allowing you to inspect model performance across different data folds. This includes metrics, predictions, and potentially visualizations.

  • build: Holds a comprehensive report of the final model’s evaluation performance on the held-out test dataset, along with aggregated metrics.

  • models: Holds the final, trained model object(s) ready for classification. These are the artifacts you will use in the next step.
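To see what was actually written, you can list the contents of both locations. The sketch below uses only the standard library; the helper list_outputs is hypothetical, and the exact file names inside each folder depend on your aiqclib version and configuration, so none are assumed here.

```python
import os
from pathlib import Path


def list_outputs(root: str, max_depth: int = 2) -> list:
    """Return relative paths under root, down to max_depth levels."""
    base = Path(os.path.expanduser(root))
    if not base.is_dir():
        return []
    found = []
    for path in sorted(base.rglob("*")):
        relative = path.relative_to(base)
        if len(relative.parts) <= max_depth:
            # Mark directories with a trailing slash for readability.
            found.append(str(relative) + ("/" if path.is_dir() else ""))
    return found


# Locations taken from this tutorial; adjust them to your project layout.
for root in ("~/aiqc_project/data/dataset_0001", "~/aiqc_project/models"):
    print(root)
    for entry in list_outputs(root):
        print("  " + entry)
```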

Next Steps

You have now successfully trained and evaluated a machine learning model using aiqclib! The final step in the workflow is to use this trained model to classify new, unseen data.

Proceed to the next tutorial: Step 4: Classification.