Step 3: Training & Evaluation
=============================

With a properly prepared dataset (from :doc:`./preparation`), you are now ready to train and evaluate a machine learning model. This workflow uses the training, validation, and test sets created in the previous step to build a model, assess its performance with cross-validation, and generate final evaluation metrics on a held-out test set.

Like all workflows in ``aiqclib``, this process is controlled by a dedicated YAML configuration file. Like the preparation config, it uses the "building blocks" concept for modularity and reusability.

.. admonition:: Prerequisites

   This tutorial assumes you have successfully completed :doc:`./preparation`. The training process directly uses the output files (the split datasets) generated in that step. Ensure your ``~/aiqc_project/data/dataset_0001/training/`` directory exists and contains the prepared data.

The Training Workflow
---------------------

The training workflow follows a similar pattern to the preparation step: you generate a new configuration template, customize it to define your model and validation strategy, point it to your input data, and specify where the trained models should be saved.

Step 3.1: Generate the Configuration Template
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

First, use ``aiqclib`` to generate a boilerplate configuration template for the training workflow.

.. code-block:: python

   import os

   import aiqclib as aq

   # Expand "~" so the template is written to an absolute path.
   config_path = os.path.expanduser("~/aiqc_project/config/training_config.yaml")

   # Write a template for the training stage.
   aq.write_config_template(
       file_name=config_path,
       stage="train"
   )

Step 3.2: Customize the Configuration File
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Now, open the newly created ``~/aiqc_project/config/training_config.yaml`` file in your text editor. Your primary goals are to define:

1. **Input & Output Paths:** Where to find the prepared dataset and where to save the trained model.
2. **Model & Validation Strategy:** Which machine learning model to train and what cross-validation method to use.

You will need to edit the ``path_info_sets``, ``step_class_sets``, ``step_param_sets``, and ``training_sets`` sections.

Before you modify the config, create a directory where your trained models will be saved:

.. code-block:: bash

   mkdir -p ~/aiqc_project/models

**Update your training_config.yaml file:**

Modify the file to align with the following structure. Remember to replace placeholder paths with your actual project setup.

.. note::

   ``aiqclib`` integrates multiple ML algorithms, and it is easy to switch between them. For more details, see the dedicated :doc:`../../how-to/algorithm_selection` page.

.. code-block:: yaml

   path_info_sets:
     - name: data_set_1
       common:
         base_path: ~/aiqc_project/data  # Root directory of the prepared dataset (from preparation step)
       input:
         step_folder_name: training  # Subdirectory containing the split training/validation/test data
       model:
         base_path: ~/aiqc_project/models  # Directory where the final trained models will be saved

.. code-block:: yaml

   # Define your model and validation strategy here.
   # For this tutorial, we'll use KFoldValidation and an XGBoost model.
   step_class_sets:
     - name: training_step_set_1
       steps:
         input: InputTrainingSetA
         validate: KFoldValidation  # Specify your cross-validation class
         model: XGBoost  # Specify your ML model class (e.g., XGBoost, RandomForest)
         build: BuildModel

   # Define parameters for your chosen model and validation.
   # For example, number of folds for CV, or model hyperparameters.
   step_param_sets:
     - name: training_param_set_1
       steps:
         input: { }
         validate: { k_fold: 5 }  # 5-fold cross-validation
         model: { calculate_shap: False,  # Control SHAP value calculation
                  model_params: { scale_pos_weight: 200,  # Positive-class weight (XGBoost suggests the neg:pos count ratio)
                                  n_jobs: -1 } }  # Number of threads used by XGBoost
         build: { }

.. code-block:: yaml

   training_sets:
     - name: training_0001  # A unique name for this training job
       dataset_folder_name: dataset_0001  # This MUST match the dataset_folder_name from your preparation config
       path_info: data_set_1
       target_set: target_set_1  # This needs to match a 'target_set' defined in your prepare_config.yaml
       step_class_set: training_step_set_1
       step_param_set: training_param_set_1

.. note::

   The training configuration file includes many other options for advanced model selection, hyperparameter tuning, and cross-validation strategies. For a complete reference of all available parameters, please consult the dedicated :doc:`../../configuration/training` page.

Step 3.3: Run the Training Process
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Once you have customized your ``training_config.yaml`` with the correct paths and model/validation configurations, you can execute the training and evaluation workflow.

Load the configuration file and then call the ``train_and_evaluate`` function:

.. code-block:: python

   import os

   import aiqclib as aq

   config_path = os.path.expanduser("~/aiqc_project/config/training_config.yaml")

   # Load the customized configuration, then run training and evaluation.
   config = aq.read_config(config_path)
   aq.train_and_evaluate(config)

Understanding the Output
------------------------

After the command finishes, ``aiqclib`` will have created new folders within your dataset's output directory (e.g., ``~/aiqc_project/data/dataset_0001/``) and within your model's base path (``~/aiqc_project/models/``). The primary outputs include:

* **validate**: Contains detailed results from the cross-validation process, allowing you to inspect model performance across different data folds. This includes metrics, predictions, and potentially visualizations.
* **build**: Holds a comprehensive report of the final model's evaluation performance on the held-out test dataset, along with aggregated metrics.
* **models**: Holds the final, trained model object(s) ready for classification. These are the artifacts you will use in the next step.

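
To confirm that the run produced files in these folders, a small script can walk them and print what it finds. The following is a minimal sketch that uses only the Python standard library; ``list_output_files`` is a hypothetical helper, not part of ``aiqclib``. It assumes the ``validate`` and ``build`` folders sit directly under the dataset directory, as described above, and makes no assumption about the names or formats of the files inside them.

.. code-block:: python

   from pathlib import Path


   def list_output_files(root, subfolders=("validate", "build")):
       """Map each existing subfolder of ``root`` to its sorted relative file paths."""
       root = Path(root).expanduser()
       found = {}
       for name in subfolders:
           folder = root / name
           if not folder.is_dir():
               # Folder names are assumptions taken from the list above; skip absent ones.
               continue
           found[name] = sorted(
               p.relative_to(folder).as_posix()
               for p in folder.rglob("*")
               if p.is_file()
           )
       return found


   if __name__ == "__main__":
       # Dataset directory used throughout this tutorial.
       dataset_dir = "~/aiqc_project/data/dataset_0001"
       for name, files in list_output_files(dataset_dir).items():
           print(f"{name}: {len(files)} file(s)")
           for rel_path in files:
               print(f"  {rel_path}")

The same helper can be pointed at the model directory, for example ``list_output_files("~/aiqc_project", subfolders=("models",))``. A missing folder is left out of the result, which is a quick hint that a step did not run or that the paths in the config differ from the ones assumed here.
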

Next Steps
----------

You have now successfully trained and evaluated a machine learning model using ``aiqclib``! The final step in the workflow is to use this trained model to classify new, unseen data.

Proceed to the next tutorial: :doc:`./classification`.