Step 4: Classification

You have successfully prepared a dataset and trained a machine learning model. The final step in the aiqclib workflow is to put that model to work by classifying a new, unseen dataset. This workflow applies your pre-trained model to every observation in an input file, adding model predictions and probability scores as new columns.

This is the culmination of the aiqclib pipeline, transforming your machine learning model into a practical tool for real-time data analysis and quality control.

Prerequisites

This tutorial assumes you have successfully completed :doc:./training. To proceed with classification, you will need:

  • The trained model file(s) saved in the directory you specified in the training configuration (e.g., ~/aiqc_project/models/).

  • The original raw data file (nrt_cora_bo_4.parquet) that you wish to classify. This file should be in your ~/aiqc_project/input/ directory.

The Classification Workflow

The classification process follows the familiar aiqclib pattern: you will generate a configuration template, customize it to point to your input data, the trained model, and the desired output location, and then execute the classification script.

Step 4.1: Generate the Configuration Template

First, use aiqclib to generate the boilerplate configuration template specifically for the classify workflow.

import aiqclib as aq
import os

config_path = os.path.expanduser("~/aiqc_project/config/classification_config.yaml")
aq.write_config_template(
    file_name=config_path,
    stage="classify"
)

Step 4.2: Customize the Configuration File

Open the newly created ~/aiqc_project/config/classification_config.yaml file in your text editor. You need to configure the paths for the raw input data, the location of your trained model, and where the final classified output should be saved.

Crucially, the following sections in this classification configuration MUST EXACTLY MATCH those used in your ``prepare_config.yaml`` for the model’s training. This ensures that the new input data is preprocessed and features are engineered in precisely the same way the model expects. You can often copy these sections directly from your prepare_config.yaml.

  1. target_sets

  2. summary_stats_sets

  3. feature_sets

  4. feature_param_sets

  5. feature_stats_sets

Update your classification_config.yaml file to match the following. Remember to replace placeholder paths and details with your actual project setup.

path_info_sets:
  - name: data_set_1
    common:
      base_path: ~/aiqc_project/data # Root directory for all processed outputs for this job
    input:
      base_path: ~/aiqc_project/input # Directory containing the raw input files to classify
      step_folder_name: "" # Set to "" if input files are directly in base_path
    model:
      base_path: ~/aiqc_project/models # Directory containing your trained model file(s) from the training step
    concat:
      step_folder_name: classify # Subdirectory within common.base_path for the final classified output
step_param_sets:
  - name: data_set_param_set_1
    steps:
      input: { sub_steps: { rename_columns: false,
                            filter_rows: true },
               rename_dict: { },
               filter_method_dict: { remove_years: [],
                                     keep_years: [2023] } }
      summary: { }
      select: { }
      locate: { }
      extract: { }
      model: { calculate_shap: False,         # Control SHAP value calculation
               model_params: { n_jobs: -1 } } # The number of threads used by XGBoost  (-1: all available cores)
      classify: { }
      concat: { }
classification_sets:
  - name: classification_0001  # A unique name for this classification task
    dataset_folder_name: dataset_0001  # This MUST match the dataset_folder_name used during preparation and training
    input_file_name: nrt_cora_bo_4.parquet   # The specific raw input filename to classify
    path_info: data_set_1
    target_set: target_set_1
    summary_stats_set: summary_stats_set_1
    feature_set: feature_set_1
    feature_param_set: feature_set_1_param_set_1
    step_class_set: data_set_step_set_1
    step_param_set: data_set_param_set_1

Note

The classification configuration file is comprehensive and has many options similar to both preparation and training configurations. For a complete reference of all available parameters, please consult the dedicated Configuration of Classification page.

Step 4.3: Run the Classification Process

Once you have customized your classification_config.yaml with the correct paths, input file, and inherited configuration references, you can execute the classification workflow.

Load the configuration file and then call the classify_dataset function:

import aiqclib as aq
import os

config_path = os.path.expanduser("~/aiqc_project/config/classification_config.yaml")
config = aq.read_config(config_path)
aq.classify_dataset(config)

Understanding the Output

After the command finishes, your output root directory (e.g., ~/aiqc_project/data) will contain a new folder named dataset_0001 (from classification_sets.dataset_folder_name). Inside dataset_0001, you will find several subdirectories, reflecting the processing steps:

  • summary: Contains intermediate files with summary statistics if re-calculated or referenced.

  • select: Stores the input profiles after any initial filtering. In classification, this typically includes all profiles you want to classify.

  • locate: Contains all observation records that proceeded through the pipeline, often after proximity-based selection for feature generation.

  • extract: Holds the features extracted from the observation records, transformed consistently with how the model was trained.

  • classify: This is the final output directory. It contains all classification results, including the final files that concatenate the original input with the corresponding prediction outputs.

Conclusion

Congratulations! You have successfully completed the entire aiqclib workflow, from raw data preparation to training a machine learning model and then using it to generate predictions on new data.

You now have a powerful, repeatable, and configurable pipeline for your machine learning tasks. You can easily adapt the configuration files to process new datasets, experiment with different models and features, or integrate this into larger automated workflows.