Algorithm Selection

The library supports multiple machine learning algorithms across several categories, from tree-based ensembles to distance-based and neural methods.

Available Algorithms

Except for XGBoost, all methods use the implementation provided by scikit-learn.

Category	Algorithm	Class Name	Short Name	Method
Tree-Based & Ensemble	XGBoost	XGBoost	XGB	Ensemble (Boosting)
	Random Forest	RandomForest	RF	Ensemble (Bagging)
	Decision Tree	DecisionTree	DT	Tree
Linear & Geometric	Logistic Regression	LogisticRegression	Logit	Linear
	Linear Discriminant Analysis	LinearDiscriminantAnalysis	LDA	Linear / Statistical
	Support Vector Machine	SupportVectorMachine	SVM	Geometric
Instance-Based	K-Nearest Neighbors	KNearestNeighbors	KNN	Distance-based
Probabilistic	Gaussian Naive Bayes	GaussianNaiveBayes	GNB	Probabilistic
Neural Network	Multilayer Perceptron	MultilayerPerceptron	MLP	Neural Network

Configuration

To select an algorithm, set the model key in step_class_sets to the algorithm’s class name (e.g., XGBoost).

To customize the hyperparameters for your selected algorithm, add them to the model step within step_param_sets.

Training Configuration Example

 step_class_sets:
   - name: training_step_set_1
     steps:
       input: InputTrainingSetA
       validate: KFoldValidation
       model: XGBoost
       build: BuildModel

 step_param_sets:
   - name: training_param_set_1
     steps:
       input: { }
       validate: { }
       model: { learning_rate: 0.01 }
       build: { }

Classification Configuration Example

 step_class_sets:
   - name: data_set_step_set_1
     steps:
       input: InputDataSetAll
       summary: SummaryDataSetAll
       select: SelectDataSetAll
       locate: LocateDataSetAll
       extract: ExtractDataSetAll
       model: XGBoost
       classify: ClassifyAll
       concat: ConcatDataSetAll

Imputation

Because non-tree-based machine learning methods do not accept NaN values, missing values are imputed automatically during the training phase using scikit-learn's SimpleImputer(strategy="median").
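To make the effect of median imputation concrete, the following minimal sketch applies scikit-learn's SimpleImputer(strategy="median") to a small made-up matrix. The data and variable names are illustrative assumptions; how aiqclib wires the imputer into its training pipeline is not shown here.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy training matrix with one missing value in each column (illustrative data).
X_train = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [np.nan, 30.0],
    [4.0, 40.0],
])

# Median imputation: each NaN is replaced by the median of the observed
# values in its column, learned when the imputer is fitted.
imputer = SimpleImputer(strategy="median")
X_imputed = imputer.fit_transform(X_train)

print(imputer.statistics_)  # per-column medians learned during fit
print(X_imputed)            # same shape as X_train, with NaNs filled in
```

The medians are learned from the training data only, so the same fitted imputer could be reused on new data with transform() if needed.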

During the classification phase, non-tree-based models do not score instances whose features contain NaN values; such instances are assigned a class value of 0 and a score of 0.
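The classification-phase behaviour described above can be sketched in plain Python. This is only an illustration of the rule, not aiqclib's implementation: classify_with_nan_guard and toy_model are hypothetical names, and toy_model stands in for any fitted non-tree-based model.

```python
import math


def classify_with_nan_guard(rows, predict_row):
    """Return one (class, score) pair per row.

    Rows containing any NaN are not passed to the model and receive
    (0, 0.0) instead. `predict_row` is a hypothetical stand-in for a
    fitted non-tree-based model.
    """
    results = []
    for row in rows:
        if any(math.isnan(value) for value in row):
            results.append((0, 0.0))
        else:
            results.append(predict_row(row))
    return results


def toy_model(row):
    # Toy rule for illustration only: the score is the mean feature value.
    score = sum(row) / len(row)
    return (1 if score > 0.5 else 0, score)


rows = [[1.0, 0.5], [float("nan"), 0.8], [0.25, 0.25]]
results = classify_with_nan_guard(rows, toy_model)
print(results)
```

One consequence worth keeping in mind: a class of 0 with a score of exactly 0 for a non-tree-based model may indicate a missing feature rather than a confident prediction.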

Model Suite Class

aiqclib provides a model suite class that performs training and classification with multiple algorithms simultaneously. To select a set of algorithms, set the model key in step_class_sets to ModelSuite. Then list the algorithms and their parameters under the methods and model_params keys, respectively, of the model step in step_param_sets.

Note

Both the methods and model_params keys accept either the class name or the short name shown in the table above (e.g., XGBoost or XGB).
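One way to picture this name handling is a lookup that maps either form to the class name. The sketch below copies the names from the table above; resolve_algorithm is a hypothetical helper for illustration and is not part of aiqclib's API.

```python
# Class name -> short name, copied from the "Available Algorithms" table.
SHORT_NAMES = {
    "XGBoost": "XGB",
    "RandomForest": "RF",
    "DecisionTree": "DT",
    "LogisticRegression": "Logit",
    "LinearDiscriminantAnalysis": "LDA",
    "SupportVectorMachine": "SVM",
    "KNearestNeighbors": "KNN",
    "GaussianNaiveBayes": "GNB",
    "MultilayerPerceptron": "MLP",
}

# Accept both spellings: short name -> class name, and class name -> itself.
ALIASES = {short: cls for cls, short in SHORT_NAMES.items()}
ALIASES.update({cls: cls for cls in SHORT_NAMES})


def resolve_algorithm(name):
    """Return the class name for a class name or short name (hypothetical helper)."""
    try:
        return ALIASES[name]
    except KeyError:
        raise ValueError(f"Unknown algorithm name: {name!r}") from None


print(resolve_algorithm("XGB"))      # short name
print(resolve_algorithm("XGBoost"))  # class name
```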

In addition, the ModelSuite class requires specific counterpart classes for training and classification to correctly handle multiple outputs. These are:

KFoldValidationSuite for k-fold validation
BuildModelSuite for model building
ClassifyAllSuite for classification
ConcatDataSetSuite for the final result concatenation
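The pairing between single-algorithm step classes and their suite counterparts can be written down as a small lookup. The single-algorithm names on the left (KFoldValidation, BuildModel, ClassifyAll, ConcatDataSetAll) are taken from the configuration examples in this section, and the pairing is inferred from them; to_suite_steps is a hypothetical helper, not something aiqclib provides.

```python
# Single-algorithm step class -> suite counterpart (pairing inferred from
# the configuration examples in this section).
SUITE_COUNTERPARTS = {
    "KFoldValidation": "KFoldValidationSuite",
    "BuildModel": "BuildModelSuite",
    "ClassifyAll": "ClassifyAllSuite",
    "ConcatDataSetAll": "ConcatDataSetSuite",
}


def to_suite_steps(steps):
    """Return a copy of `steps` rewritten for use with ModelSuite.

    Hypothetical helper: swaps each step class for its suite counterpart
    (leaving other steps unchanged) and sets the model step to ModelSuite.
    """
    converted = {key: SUITE_COUNTERPARTS.get(cls, cls) for key, cls in steps.items()}
    converted["model"] = "ModelSuite"
    return converted


single = {
    "input": "InputTrainingSetA",
    "validate": "KFoldValidation",
    "model": "XGBoost",
    "build": "BuildModel",
}
suite = to_suite_steps(single)
print(suite)
```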

Training Configuration Example

 step_class_sets:
   - name: training_step_set_1
     steps:
       input: InputTrainingSetA
       validate: KFoldValidationSuite
       model: ModelSuite
       build: BuildModelSuite

 step_param_sets:
   - name: training_param_set_1
     steps:
       input: { }
       validate: { }
       model: {
                calculate_shap: True,
                methods: [ DT, XGB, RF ],
                model_params: {
                  DT:  { },             # Default (you still need to set an empty dictionary)
                  XGB: { scale_pos_weight: 200 , n_jobs: 30 },
                  RF:  { n_jobs: 30 }   # Number of parallel jobs
                }
              }
       build: { }
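As the comment in the example notes, every algorithm listed in methods still needs an entry in model_params, even if it is an empty dictionary. A quick pre-flight check for a training configuration could look like the sketch below. missing_model_params is a hypothetical helper, the dictionary mirrors the YAML above, and the check assumes the same spelling (short name or class name) is used in both keys.

```python
def missing_model_params(model_step):
    """Return the methods that have no entry in model_params.

    Hypothetical helper; `model_step` is the parsed `model` mapping from
    step_param_sets in a training configuration.
    """
    methods = model_step.get("methods", [])
    model_params = model_step.get("model_params", {})
    return [name for name in methods if name not in model_params]


# Mirrors the training example above.
model_step = {
    "calculate_shap": True,
    "methods": ["DT", "XGB", "RF"],
    "model_params": {
        "DT": {},  # defaults, but the empty dictionary is still present
        "XGB": {"scale_pos_weight": 200, "n_jobs": 30},
        "RF": {"n_jobs": 30},
    },
}

print(missing_model_params(model_step))  # empty list when every method is covered
```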

Classification Configuration Example

 step_class_sets:
   - name: data_set_step_set_1
     steps:
       input: InputDataSetAll
       summary: SummaryDataSetAll
       select: SelectDataSetAll
       locate: LocateDataSetAll
       extract: ExtractDataSetAll
       model: ModelSuite
       classify: ClassifyAllSuite
       concat: ConcatDataSetSuite

 step_param_sets:
   - name: data_set_param_set_1
     steps:
       input: { }
       summary: { }
       select: { }
       locate: { }
       extract: { }
       model: { methods: [ DT, XGB, RF ] }
       classify: { }
       concat: { }

Default Parameters

If no parameters are provided in step_param_sets, each algorithm is initialized with the default parameters listed below, which are passed to its scikit-learn or XGBoost implementation.

Decision Tree (DT)

Parameter	Default	Description
`criterion`	`"gini"`	The function to measure the quality of a split.
`splitter`	`"best"`	The strategy used to choose the split at each node.
`max_depth`	`10`	The maximum depth of the tree.
`min_samples_split`	`10`	The minimum number of samples required to split an internal node.
`min_samples_leaf`	`5`	The minimum number of samples required to be at a leaf node.
`max_features`	`None`	The number of features to consider when looking for the best split.
`random_state`	`None`	Controls the randomness of the estimator for reproducibility.
`class_weight`	`"balanced"`	Weights associated with classes (e.g., `"balanced"`).
`ccp_alpha`	`0.001`	Complexity parameter used for Minimal Cost-Complexity Pruning.

Random Forest (RF)

Parameter	Default	Description
`n_estimators`	`100`	The number of trees in the forest.
`criterion`	`"gini"`	The function to measure the quality of a split.
`max_depth`	`10`	The maximum depth of the trees.
`min_samples_split`	`10`	The minimum number of samples required to split an internal node.
`min_samples_leaf`	`5`	The minimum number of samples required to be at a leaf node.
`max_features`	`"sqrt"`	The number of features to consider when looking for the best split.
`bootstrap`	`True`	Whether bootstrap samples are used when building trees.
`n_jobs`	`-1`	The number of jobs to run in parallel (`-1` means using all processors).
`random_state`	`None`	Controls both the randomness of the bootstrapping and feature sampling.
`class_weight`	`"balanced_subsample"`	Weights associated with classes (e.g., `"balanced"`).

XGBoost (XGB)

Parameter	Default	Description
`n_estimators`	`100`	Number of boosting rounds (trees to build).
`max_depth`	`10`	Maximum tree depth for base learners.
`learning_rate`	`0.1`	Boosting learning rate (step size shrinkage).
`eval_metric`	`"logloss"`	Evaluation metric for validation data.
`scale_pos_weight`	`1`	Multiplier for the gradient of positive samples (e.g., set to sum(negative cases) / sum(positive cases)).
`n_jobs`	`-1`	Number of parallel threads used to run XGBoost.

Logistic Regression (Logit)

Parameter	Default	Description
`l1_ratio`	`0`	Elastic-Net mixing parameter; only used when `penalty` is `"elasticnet"`.
`C`	`1.0`	Inverse of regularization strength; smaller values specify stronger regularization.
`solver`	`"lbfgs"`	Algorithm to use in the optimization problem.
`class_weight`	`"balanced"`	Weights associated with classes (e.g., `"balanced"`).
`max_iter`	`200`	Maximum number of iterations taken for the solvers to converge.

Linear Discriminant Analysis (LDA)

Parameter	Default	Description
`solver`	`"svd"`	Solver to use (Singular Value Decomposition).
`shrinkage`	`None`	Shrinkage parameter, used to improve estimation of covariance matrices.
`priors`	`None`	The class prior probabilities.
`n_components`	`None`	Number of components for dimensionality reduction.
`store_covariance`	`False`	If True, explicitly computes the empirical class covariance matrix.
`tol`	`1.0e-4`	Absolute threshold for a singular value of X to be considered significant.

Support Vector Machine (SVM)

Parameter	Default	Description
`C`	`1.0`	Regularization parameter. The strength of the regularization is inversely proportional to C.
`kernel`	`"linear"`	Specifies the kernel type to be used in the algorithm.
`probability`	`True`	Whether to enable probability estimates (required for ROC/PR curves).
`tol`	`1e-3`	Tolerance for stopping criterion.
`max_iter`	`200`	Hard limit on iterations within solver (`-1` for no limit).
`random_state`	`None`	Controls the pseudo random number generation for probability estimates.
`class_weight`	`"balanced"`	Weights associated with classes (e.g., `"balanced"`).

Gaussian Naive Bayes (GNB)

Parameter	Default	Description
`priors`	`None`	Prior probabilities of the classes. If specified, priors are not adjusted according to the data.
`var_smoothing`	`1e-9`	Portion of the largest variance of all features added to variances for calculation stability.

K-Nearest Neighbors (KNN)

Parameter	Default	Description
`n_neighbors`	`5`	Number of neighbors to use by default for queries.
`weights`	`"uniform"`	Weight function used in prediction (all points in neighborhood are weighted equally).
`algorithm`	`"auto"`	Algorithm used to compute the nearest neighbors.
`leaf_size`	`30`	Leaf size passed to BallTree or KDTree (affects memory and speed).
`p`	`2`	Power parameter for the Minkowski metric (`2` corresponds to Euclidean distance).
`metric`	`"minkowski"`	The distance metric to use for the tree.
`n_jobs`	`-1`	The number of parallel jobs to run for neighbors search.

Multilayer Perceptron (MLP)

Parameter	Default	Description
`hidden_layer_sizes`	`(50,)`	The ith element represents the number of neurons in the ith hidden layer.
`activation`	`"relu"`	Activation function for the hidden layer.
`solver`	`"adam"`	The solver for weight optimization.
`alpha`	`0.0001`	L2 penalty (regularization term) parameter.
`batch_size`	`"auto"`	Size of minibatches for stochastic optimizers.
`learning_rate`	`"constant"`	Learning rate schedule for weight updates.
`learning_rate_init`	`0.001`	The initial learning rate used.
`max_iter`	`100`	Maximum number of iterations/epochs.
`shuffle`	`True`	Whether to shuffle samples in each iteration.
`random_state`	`None`	Determines random number generation for weights and bias initialization.
`tol`	`1e-3`	Tolerance for the optimization.
`early_stopping`	`True`	Whether to use early stopping to terminate training when validation score is not improving.
`n_iter_no_change`	`5`	Number of consecutive iterations without improvement in the validation score before training stops.
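The relationship between these defaults and the values set in step_param_sets can be sketched as a dictionary merge. This assumes that user-supplied values override defaults key by key while unspecified keys keep their defaults; the source does not state the merge rule explicitly, so treat this as an assumption. effective_params is a hypothetical helper, and the values are the XGBoost defaults from the table above combined with the learning_rate override from the first training example.

```python
# XGBoost defaults, copied from the "XGBoost (XGB)" table above.
XGB_DEFAULTS = {
    "n_estimators": 100,
    "max_depth": 10,
    "learning_rate": 0.1,
    "eval_metric": "logloss",
    "scale_pos_weight": 1,
    "n_jobs": -1,
}


def effective_params(defaults, overrides):
    """Merge user overrides onto defaults (assumed key-by-key override)."""
    return {**defaults, **overrides}


# `model: { learning_rate: 0.01 }` from the first training example.
params = effective_params(XGB_DEFAULTS, {"learning_rate": 0.01})
print(params)
```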