Algorithm Selection

The library supports nine machine learning algorithms in five categories: tree-based and ensemble, linear and geometric, instance-based, probabilistic, and neural network methods.

Available Algorithms

Except for XGBoost, all methods use the implementation provided by scikit-learn.

| Category | Algorithm | Class Name | Short Name | Method |
|---|---|---|---|---|
| Tree-Based & Ensemble | XGBoost | XGBoost | XGB | Ensemble (Boosting) |
| Tree-Based & Ensemble | Random Forest | RandomForest | RF | Ensemble (Bagging) |
| Tree-Based & Ensemble | Decision Tree | DecisionTree | DT | Tree |
| Linear & Geometric | Logistic Regression | LogisticRegression | Logit | Linear |
| Linear & Geometric | Linear Discriminant Analysis | LinearDiscriminantAnalysis | LDA | Linear / Statistical |
| Linear & Geometric | Support Vector Machine | SupportVectorMachine | SVM | Geometric |
| Instance-Based | K-Nearest Neighbors | KNearestNeighbors | KNN | Distance-based |
| Probabilistic | Gaussian Naive Bayes | GaussianNaiveBayes | GNB | Probabilistic |
| Neural Network | Multilayer Perceptron | MultilayerPerceptron | MLP | Neural Network |

Configuration

To select an algorithm, set the model key in step_class_sets to the algorithm’s class name (e.g., XGBoost).

To customize the hyperparameters for your selected algorithm, add them to the model step within step_param_sets.

Training Configuration Example

 step_class_sets:
   - name: training_step_set_1
     steps:
       input: InputTrainingSetA
       validate: KFoldValidation
       model: XGBoost
       build: BuildModel

 step_param_sets:
   - name: training_param_set_1
     steps:
       input: { }
       validate: { }
       model: { learning_rate: 0.01 }
       build: { }

Classification Configuration Example

 step_class_sets:
   - name: data_set_step_set_1
     steps:
       input: InputDataSetAll
       summary: SummaryDataSetAll
       select: SelectDataSetAll
       locate: LocateDataSetAll
       extract: ExtractDataSetAll
       model: XGBoost
       classify: ClassifyAll
       concat: ConcatDataSetAll

Imputation

Non-tree-based methods do not accept NaN values, so during the training phase missing values are imputed automatically with scikit-learn's SimpleImputer(strategy="median").

During the classification phase, non-tree-based models assign a class value of 0 and a score of 0 to any instance whose features contain a NaN value.
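The two phases can be illustrated with a minimal sketch. It does not call scikit-learn or aiqclib; it mimics per-column median imputation in plain Python to show the effect. The helper names (fit_column_medians, impute, classify_non_tree) and the predict callable are hypothetical and exist only for this illustration.

```python
import math
from statistics import median


def fit_column_medians(rows):
    """Compute the median of each feature column, ignoring NaN values."""
    return [median(v for v in col if not math.isnan(v)) for col in zip(*rows)]


def impute(rows, medians):
    """Replace each NaN with the median of its column (training phase)."""
    return [[m if math.isnan(v) else v for v, m in zip(row, medians)]
            for row in rows]


def classify_non_tree(row, predict):
    """Classification phase: a row with any NaN gets class 0 and score 0.

    `predict` is a hypothetical callable returning (class, score).
    """
    if any(math.isnan(v) for v in row):
        return 0, 0.0
    return predict(row)


nan = float("nan")
train = [[1.0, 10.0], [nan, 30.0], [3.0, nan], [5.0, 50.0]]
medians = fit_column_medians(train)
print(medians)                 # column medians used to fill the gaps
print(impute(train, medians))  # training rows with NaN values replaced
```

Note that this simplified version raises an error for a column that contains only NaN values.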

Model Suite Class

aiqclib provides a model suite class, ModelSuite, that performs training and classification with multiple algorithms at once. To use it, set the model key in step_class_sets to ModelSuite. Then list the algorithms under the methods key and their parameters under the model_params key within step_param_sets.

Note

Both the methods and model_params keys accept either the "Class Name" or the "Short Name" shown in the table above (e.g., XGBoost or XGB).
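As an illustration of how the two naming schemes relate, the following sketch maps either form to the class name. The mapping is copied from the table above; the helper resolve_algorithm_name is hypothetical and is not part of aiqclib, whose internal lookup may work differently.

```python
# Short name -> class name, copied from the "Available Algorithms" table.
SHORT_TO_CLASS = {
    "XGB": "XGBoost",
    "RF": "RandomForest",
    "DT": "DecisionTree",
    "Logit": "LogisticRegression",
    "LDA": "LinearDiscriminantAnalysis",
    "SVM": "SupportVectorMachine",
    "KNN": "KNearestNeighbors",
    "GNB": "GaussianNaiveBayes",
    "MLP": "MultilayerPerceptron",
}


def resolve_algorithm_name(name):
    """Return the class name for a class name or a short name.

    Hypothetical helper for illustration only.
    """
    if name in SHORT_TO_CLASS:
        return SHORT_TO_CLASS[name]
    if name in SHORT_TO_CLASS.values():
        return name
    raise ValueError(f"Unknown algorithm name: {name!r}")


print([resolve_algorithm_name(n) for n in ["DT", "XGBoost", "RF"]])
```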

In addition, the ModelSuite class requires specific counterpart classes for training and classification to correctly handle multiple outputs. These are:

  • KFoldValidationSuite for k-fold validation

  • BuildModelSuite for model building

  • ClassifyAllSuite for classification

  • ConcatDataSetSuite for the final result concatenation
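A quick way to catch a mismatched configuration is to compare the step classes against the list above. The sketch below assumes that each suite class replaces the single-algorithm class used in the earlier examples (KFoldValidation, BuildModel, ClassifyAll, ConcatDataSetAll); that pairing is inferred from the examples on this page, and find_missing_suite_steps is a hypothetical helper, not an aiqclib function.

```python
# Single-algorithm step class -> assumed ModelSuite counterpart.
SUITE_COUNTERPARTS = {
    "KFoldValidation": "KFoldValidationSuite",
    "BuildModel": "BuildModelSuite",
    "ClassifyAll": "ClassifyAllSuite",
    "ConcatDataSetAll": "ConcatDataSetSuite",
}


def find_missing_suite_steps(steps):
    """Return {step_key: suggested_class} for steps that still use a
    single-algorithm class although the model step is ModelSuite."""
    if steps.get("model") != "ModelSuite":
        return {}
    return {
        key: SUITE_COUNTERPARTS[cls]
        for key, cls in steps.items()
        if cls in SUITE_COUNTERPARTS
    }


steps = {
    "input": "InputTrainingSetA",
    "validate": "KFoldValidation",  # should be KFoldValidationSuite
    "model": "ModelSuite",
    "build": "BuildModelSuite",
}
print(find_missing_suite_steps(steps))
```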

Training Configuration Example

 step_class_sets:
   - name: training_step_set_1
     steps:
       input: InputTrainingSetA
       validate: KFoldValidationSuite
       model: ModelSuite
       build: BuildModelSuite

 step_param_sets:
   - name: training_param_set_1
     steps:
       input: { }
       validate: { }
       model: {
                calculate_shap: True,
                methods: [ DT, XGB, RF ],
                model_params: {
                  DT:  { },             # Default (you still need to set an empty dictionary)
                  XGB: { scale_pos_weight: 200 , n_jobs: 30 },
                  RF:  { n_jobs: 30 }   # Number of parallel jobs
                }
              }
       build: { }
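The comment on the DT entry above notes that an algorithm using defaults still needs an empty dictionary in model_params. The sketch below checks a parsed model step for listed methods that lack an entry. It assumes the same rule applies to every listed method in a training configuration, and it compares names literally, so a method listed as XGB with parameters given under XGBoost would be reported. The function missing_model_params is a hypothetical helper, not part of aiqclib.

```python
def missing_model_params(model_step):
    """Return the methods that have no entry in model_params."""
    methods = model_step.get("methods", [])
    params = model_step.get("model_params", {})
    return [m for m in methods if m not in params]


# The model step from the training example, as a Python dictionary.
model_step = {
    "calculate_shap": True,
    "methods": ["DT", "XGB", "RF"],
    "model_params": {
        "DT": {},
        "XGB": {"scale_pos_weight": 200, "n_jobs": 30},
        "RF": {"n_jobs": 30},
    },
}
print(missing_model_params(model_step))  # empty list: every method is covered
```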

Classification Configuration Example

 step_class_sets:
   - name: data_set_step_set_1
     steps:
       input: InputDataSetAll
       summary: SummaryDataSetAll
       select: SelectDataSetAll
       locate: LocateDataSetAll
       extract: ExtractDataSetAll
       model: ModelSuite
       classify: ClassifyAllSuite
       concat: ConcatDataSetSuite

 step_param_sets:
   - name: training_param_set_1
     steps:
       input: { }
       summary: { }
       select: { }
       locate: { }
       extract: { }
       model: { methods: [ DT, XGB, RF ] }
       classify: { }
       concat: { }

Default Parameters

If no parameters are provided in step_param_sets, each algorithm is initialized with the following default parameters for its scikit-learn or XGBoost implementation.
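The sketch below shows one plausible way defaults and user-supplied parameters combine, using the XGBoost defaults listed in this section and the learning_rate override from the training example. It assumes that supplied keys replace the matching defaults while the remaining defaults are kept; this page does not state the merge rule explicitly, and effective_params is a hypothetical helper.

```python
# XGBoost defaults as listed in this section.
XGB_DEFAULTS = {
    "n_estimators": 100,
    "max_depth": 10,
    "learning_rate": 0.1,
    "eval_metric": "logloss",
    "scale_pos_weight": 1,
    "n_jobs": -1,
}


def effective_params(defaults, overrides):
    """Merge user overrides on top of defaults (assumed behavior)."""
    return {**defaults, **overrides}


params = effective_params(XGB_DEFAULTS, {"learning_rate": 0.01})
print(params["learning_rate"], params["max_depth"])
```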

Decision Tree (DT)

| Parameter | Default | Description |
|---|---|---|
| criterion | "gini" | The function to measure the quality of a split. |
| splitter | "best" | The strategy used to choose the split at each node. |
| max_depth | 10 | The maximum depth of the tree. |
| min_samples_split | 10 | The minimum number of samples required to split an internal node. |
| min_samples_leaf | 5 | The minimum number of samples required to be at a leaf node. |
| max_features | None | The number of features to consider when looking for the best split. |
| random_state | None | Controls the randomness of the estimator for reproducibility. |
| class_weight | "balanced" | Weights associated with classes (e.g., "balanced"). |
| ccp_alpha | 0.001 | Complexity parameter used for Minimal Cost-Complexity Pruning. |

Random Forest (RF)

| Parameter | Default | Description |
|---|---|---|
| n_estimators | 100 | The number of trees in the forest. |
| criterion | "gini" | The function to measure the quality of a split. |
| max_depth | 10 | The maximum depth of the trees. |
| min_samples_split | 10 | The minimum number of samples required to split an internal node. |
| min_samples_leaf | 5 | The minimum number of samples required to be at a leaf node. |
| max_features | "sqrt" | The number of features to consider when looking for the best split. |
| bootstrap | True | Whether bootstrap samples are used when building trees. |
| n_jobs | -1 | The number of jobs to run in parallel (-1 means using all processors). |
| random_state | None | Controls both the randomness of the bootstrapping and feature sampling. |
| class_weight | "balanced_subsample" | Weights associated with classes (e.g., "balanced"). |

XGBoost (XGB)

| Parameter | Default | Description |
|---|---|---|
| n_estimators | 100 | Number of boosting rounds (trees to build). |
| max_depth | 10 | Maximum tree depth for base learners. |
| learning_rate | 0.1 | Boosting learning rate (step size shrinkage). |
| eval_metric | "logloss" | Evaluation metric for validation data. |
| scale_pos_weight | 1 | Multiplier for the gradient of positive samples (e.g., set to sum(negative cases) / sum(positive cases)). |
| n_jobs | -1 | Number of parallel threads used to run XGBoost. |
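The suggested value for scale_pos_weight, sum(negative cases) / sum(positive cases), can be computed from the training labels. The sketch below assumes binary labels encoded as 0 (negative) and 1 (positive); the function name suggested_scale_pos_weight and the sample labels are illustrative only.

```python
def suggested_scale_pos_weight(labels):
    """Return the ratio of negative to positive cases in a binary label list."""
    positives = sum(1 for y in labels if y == 1)
    negatives = sum(1 for y in labels if y == 0)
    if positives == 0:
        raise ValueError("No positive cases: the ratio is undefined.")
    return negatives / positives


# Illustrative, heavily imbalanced label set: 995 negatives and 5 positives.
labels = [0] * 995 + [1] * 5
print(suggested_scale_pos_weight(labels))  # 199.0
```

The result can then be passed in the model step, for example as XGB: { scale_pos_weight: 199 }.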

Logistic Regression (Logit)

| Parameter | Default | Description |
|---|---|---|
| l1_ratio | 0 | Elastic-Net mixing parameter; only used when penalty is "elasticnet". |
| C | 1.0 | Inverse of regularization strength; smaller values specify stronger regularization. |
| solver | "lbfgs" | Algorithm to use in the optimization problem. |
| class_weight | "balanced" | Weights associated with classes (e.g., "balanced"). |
| max_iter | 200 | Maximum number of iterations taken for the solvers to converge. |

Linear Discriminant Analysis (LDA)

| Parameter | Default | Description |
|---|---|---|
| solver | "svd" | Solver to use (Singular Value Decomposition). |
| shrinkage | None | Shrinkage parameter, used to improve estimation of covariance matrices. |
| priors | None | The class prior probabilities. |
| n_components | None | Number of components for dimensionality reduction. |
| store_covariance | False | If True, explicitly computes the empirical class covariance matrix. |
| tol | 1.0e-4 | Absolute threshold for a singular value of X to be considered significant. |

Support Vector Machine (SVM)

| Parameter | Default | Description |
|---|---|---|
| C | 1.0 | Regularization parameter. The strength of the regularization is inversely proportional to C. |
| kernel | "linear" | Specifies the kernel type to be used in the algorithm. |
| probability | True | Whether to enable probability estimates (required for ROC/PR curves). |
| tol | 1e-3 | Tolerance for stopping criterion. |
| max_iter | 200 | Hard limit on iterations within solver (-1 for no limit). |
| random_state | None | Controls the pseudo random number generation for probability estimates. |
| class_weight | "balanced" | Weights associated with classes (e.g., "balanced"). |

Gaussian Naive Bayes (GNB)

| Parameter | Default | Description |
|---|---|---|
| priors | None | Prior probabilities of the classes. If specified, priors are not adjusted according to the data. |
| var_smoothing | 1e-9 | Portion of the largest variance of all features added to variances for calculation stability. |

K-Nearest Neighbors (KNN)

| Parameter | Default | Description |
|---|---|---|
| n_neighbors | 5 | Number of neighbors to use by default for queries. |
| weights | "uniform" | Weight function used in prediction (all points in neighborhood are weighted equally). |
| algorithm | "auto" | Algorithm used to compute the nearest neighbors. |
| leaf_size | 30 | Leaf size passed to BallTree or KDTree (affects memory and speed). |
| p | 2 | Power parameter for the Minkowski metric (2 corresponds to Euclidean distance). |
| metric | "minkowski" | The distance metric to use for the tree. |
| n_jobs | -1 | The number of parallel jobs to run for neighbors search. |

Multilayer Perceptron (MLP)

| Parameter | Default | Description |
|---|---|---|
| hidden_layer_sizes | (50,) | The ith element represents the number of neurons in the ith hidden layer. |
| activation | "relu" | Activation function for the hidden layer. |
| solver | "adam" | The solver for weight optimization. |
| alpha | 0.0001 | L2 penalty (regularization term) parameter. |
| batch_size | "auto" | Size of minibatches for stochastic optimizers. |
| learning_rate | "constant" | Learning rate schedule for weight updates. |
| learning_rate_init | 0.001 | The initial learning rate used. |
| max_iter | 100 | Maximum number of iterations/epochs. |
| shuffle | True | Whether to shuffle samples in each iteration. |
| random_state | None | Determines random number generation for weights and bias initialization. |
| tol | 1e-3 | Tolerance for the optimization. |
| early_stopping | True | Whether to use early stopping to terminate training when validation score is not improving. |
| n_iter_no_change | 5 | Number of consecutive iterations without sufficient improvement in the validation score before training stops. |
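The interaction of early_stopping, tol, and n_iter_no_change can be pictured as a patience rule: training stops once the validation score has failed to improve by more than tol for n_iter_no_change consecutive epochs. The sketch below is a generic version of that rule using the defaults listed above. It is not scikit-learn's code, whose exact bookkeeping may differ in detail, and stopping_epoch and the sample scores are illustrative only.

```python
def stopping_epoch(validation_scores, tol=1e-3, n_iter_no_change=5):
    """Return the 1-based epoch at which training would stop, or None
    if the stopping condition is never met."""
    best = float("-inf")
    stalled = 0  # consecutive epochs without sufficient improvement
    for epoch, score in enumerate(validation_scores, start=1):
        if score > best + tol:
            best = score
            stalled = 0
        else:
            stalled += 1
            if stalled >= n_iter_no_change:
                return epoch
    return None


# Scores improve for three epochs, then plateau within the tolerance.
scores = [0.50, 0.60, 0.70, 0.7005, 0.7003, 0.7001, 0.7002, 0.7004, 0.71]
print(stopping_epoch(scores))  # 8
```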