Algorithm Selection
The library supports multiple machine learning algorithms across several categories, from tree-based ensembles to distance-based and neural methods.
Available Algorithms
Except for XGBoost, all methods use the implementation provided by scikit-learn.
| Category | Algorithm | Class Name | Short Name | Method |
|---|---|---|---|---|
| Tree-Based & Ensemble | XGBoost | XGBoost | XGB | Ensemble (Boosting) |
| Tree-Based & Ensemble | Random Forest | RandomForest | RF | Ensemble (Bagging) |
| Tree-Based & Ensemble | Decision Tree | DecisionTree | DT | Tree |
| Linear & Geometric | Logistic Regression | LogisticRegression | Logit | Linear |
| Linear & Geometric | Linear Discriminant Analysis | LinearDiscriminantAnalysis | LDA | Linear / Statistical |
| Linear & Geometric | Support Vector Machine | SupportVectorMachine | SVM | Geometric |
| Instance-Based | K-Nearest Neighbors | KNearestNeighbors | KNN | Distance-based |
| Probabilistic | Gaussian Naive Bayes | GaussianNaiveBayes | GNB | Probabilistic |
| Neural Network | Multilayer Perceptron | MultilayerPerceptron | MLP | Neural Network |
Configuration
To select an algorithm, set the model key in step_class_sets to the algorithm's class name (e.g., XGBoost).
To customize the hyperparameters for your selected algorithm, add them to the model step within step_param_sets.
Training Configuration Example

    step_class_sets:
      - name: training_step_set_1
        steps:
          input: InputTrainingSetA
          validate: KFoldValidation
          model: XGBoost
          build: BuildModel

    step_param_sets:
      - name: training_param_set_1
        steps:
          input: { }
          validate: { }
          model: { learning_rate: 0.01 }
          build: { }

Classification Configuration Example

    step_class_sets:
      - name: data_set_step_set_1
        steps:
          input: InputDataSetAll
          summary: SummaryDataSetAll
          select: SelectDataSetAll
          locate: LocateDataSetAll
          extract: ExtractDataSetAll
          model: XGB
          classify: ClassifyAll
          concat: ConcatDataSetAll

Imputation
As non-tree-based machine learning methods do not accept NaN values, missing values are automatically imputed using SimpleImputer(strategy="median") provided by scikit-learn during the training phase.
During the classification phase, non-tree-based models assign a class value of 0 and a score of 0 to any instance whose features contain NaN values.
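The two behaviors described above can be illustrated with a minimal, standard-library-only sketch. It is not aiqclib's implementation (which relies on scikit-learn's SimpleImputer); the function names `fit_medians`, `impute`, and `classify_rows`, and the `predict` callable, are hypothetical and exist only for this illustration.

```python
import math
from statistics import median


def fit_medians(rows):
    """Compute the per-column median of the non-NaN training values."""
    columns = zip(*rows)
    return [median(v for v in col if not math.isnan(v)) for col in columns]


def impute(rows, medians):
    """Replace each NaN with the training median of its column."""
    return [
        [m if math.isnan(v) else v for v, m in zip(row, medians)]
        for row in rows
    ]


def classify_rows(rows, predict):
    """Return (class, score) per row; rows containing NaN get (0, 0.0).

    `predict` is a hypothetical stand-in for a trained non-tree-based model.
    """
    results = []
    for row in rows:
        if any(math.isnan(v) for v in row):
            results.append((0, 0.0))
        else:
            results.append(predict(row))
    return results


nan = float("nan")
train = [[1.0, 10.0], [nan, 20.0], [3.0, nan], [5.0, 40.0]]

# Training phase: missing values are filled with column medians.
medians = fit_medians(train)
train_filled = impute(train, medians)

# Classification phase: rows with NaN are not passed to the model.
predictions = classify_rows([[1.0, nan], [2.0, 3.0]], lambda row: (1, 0.9))
```

In this sketch the medians are learned from the training data only, which mirrors how a fitted imputer is normally applied; whether aiqclib reuses the fitted imputer elsewhere is not stated in this section.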
Model Suite Class
aiqclib provides a model suite class that performs training and classification with multiple algorithms simultaneously. To select a set of algorithms, set the model key in step_class_sets to ModelSuite. Then, specify the list of algorithms and their parameters using the methods and model_params keys within step_param_sets, respectively.
Note
Both the methods and model_params keys accept either the "Class Name" or the "Short Name" shown in the table above (e.g., XGBoost and XGB).
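As an illustration of this equivalence, the sketch below resolves either form to the class name using the pairs listed in the table above. The mapping values come from that table; the helper `to_class_name` is hypothetical and is not part of aiqclib's API.

```python
# Short name -> class name, copied from the "Available Algorithms" table.
SHORT_TO_CLASS = {
    "XGB": "XGBoost",
    "RF": "RandomForest",
    "DT": "DecisionTree",
    "Logit": "LogisticRegression",
    "LDA": "LinearDiscriminantAnalysis",
    "SVM": "SupportVectorMachine",
    "KNN": "KNearestNeighbors",
    "GNB": "GaussianNaiveBayes",
    "MLP": "MultilayerPerceptron",
}


def to_class_name(name):
    """Resolve a short name or class name to the class name (hypothetical helper)."""
    if name in SHORT_TO_CLASS:
        return SHORT_TO_CLASS[name]
    if name in SHORT_TO_CLASS.values():
        return name
    raise ValueError(f"Unknown algorithm name: {name!r}")


# A methods list may mix both forms; all entries resolve to class names.
resolved = [to_class_name(n) for n in ["DT", "XGBoost", "RF"]]
```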
In addition, the ModelSuite class requires specific counterpart classes for training and classification to correctly handle multiple outputs. These are:
- KFoldValidationSuite for k-fold validation
- BuildModelSuite for model building
- ClassifyAllSuite for classification
- ConcatDataSetSuite for the final result concatenation
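A quick way to catch a mismatched configuration before running it is to compare the step classes against the counterparts listed above. The following is a sketch only: the step keys (validate, build, classify, concat) are taken from the configuration examples on this page, while `find_suite_mismatches` is a hypothetical helper, not something aiqclib provides.

```python
# Step key -> suite counterpart class, as listed in this section.
SUITE_COUNTERPARTS = {
    "validate": "KFoldValidationSuite",
    "build": "BuildModelSuite",
    "classify": "ClassifyAllSuite",
    "concat": "ConcatDataSetSuite",
}


def find_suite_mismatches(steps):
    """Report steps that should use a suite counterpart class but do not.

    `steps` is the mapping found under `steps:` in one step_class_sets entry.
    Returns an empty list when the model is not ModelSuite.
    """
    if steps.get("model") != "ModelSuite":
        return []
    return [
        f"{key}: expected {expected}, got {steps[key]}"
        for key, expected in SUITE_COUNTERPARTS.items()
        if key in steps and steps[key] != expected
    ]


# Example: ModelSuite combined with the single-model validation class.
training_steps = {
    "input": "InputTrainingSetA",
    "validate": "KFoldValidation",
    "model": "ModelSuite",
    "build": "BuildModelSuite",
}
problems = find_suite_mismatches(training_steps)
```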
Training Configuration Example

    step_class_sets:
      - name: training_step_set_1
        steps:
          input: InputTrainingSetA
          validate: KFoldValidationSuite
          model: ModelSuite
          build: BuildModelSuite

    step_param_sets:
      - name: training_param_set_1
        steps:
          input: { }
          validate: { }
          model: {
            calculate_shap: True,
            methods: [ DT, XGB, RF ],
            model_params: {
              DT: { },  # Default (you still need to set an empty dictionary)
              XGB: { scale_pos_weight: 200, n_jobs: 30 },
              RF: { n_jobs: 30 }  # Number of parallel jobs
            }
          }
          build: { }

Classification Configuration Example

    step_class_sets:
      - name: data_set_step_set_1
        steps:
          input: InputDataSetAll
          summary: SummaryDataSetAll
          select: SelectDataSetAll
          locate: LocateDataSetAll
          extract: ExtractDataSetAll
          model: ModelSuite
          classify: ClassifyAllSuite
          concat: ConcatDataSetSuite

    step_param_sets:
      - name: training_param_set_1
        steps:
          input: { }
          summary: { }
          select: { }
          locate: { }
          extract: { }
          model: { methods: [ DT, XGB, RF ] }
          classify: { }
          concat: { }

Default Parameters
If no parameters are provided for an algorithm in step_param_sets, it is initialized with the following default parameters, based on its scikit-learn or XGBoost implementation.
Decision Tree (DT)
| Parameter | Default | Description |
|---|---|---|
| | | The function to measure the quality of a split. |
| | | The strategy used to choose the split at each node. |
| | | The maximum depth of the tree. |
| | | The minimum number of samples required to split an internal node. |
| | | The minimum number of samples required to be at a leaf node. |
| | | The number of features to consider when looking for the best split. |
| | | Controls the randomness of the estimator for reproducibility. |
| | | Weights associated with classes (e.g., |
| | | Complexity parameter used for Minimal Cost-Complexity Pruning. |
Random Forest (RF)
| Parameter | Default | Description |
|---|---|---|
| | | The number of trees in the forest. |
| | | The function to measure the quality of a split. |
| | | The maximum depth of the trees. |
| | | The minimum number of samples required to split an internal node. |
| | | The minimum number of samples required to be at a leaf node. |
| | | The number of features to consider when looking for the best split. |
| | | Whether bootstrap samples are used when building trees. |
| | | The number of jobs to run in parallel ( |
| | | Controls both the randomness of the bootstrapping and feature sampling. |
| | | Weights associated with classes (e.g., |
XGBoost (XGB)
| Parameter | Default | Description |
|---|---|---|
| | | Number of boosting rounds (trees to build). |
| | | Maximum tree depth for base learners. |
| | | Boosting learning rate (step size shrinkage). |
| | | Evaluation metric for validation data. |
| | | Multiplier for the gradient of positive samples (e.g., set to sum(negative cases) / sum(positive cases)). |
| | | Number of parallel threads used to run XGBoost. |
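The ratio mentioned for the positive-sample multiplier, sum(negative cases) / sum(positive cases), can be computed from the training labels before writing it into model_params (the suite example on this page sets scale_pos_weight to 200). The sketch below assumes binary labels with 1 as the positive class and 0 as the negative class; that encoding and the helper name `suggest_scale_pos_weight` are assumptions made for illustration.

```python
from collections import Counter


def suggest_scale_pos_weight(labels):
    """Return count(negative) / count(positive) for binary 0/1 labels."""
    counts = Counter(labels)
    negatives, positives = counts[0], counts[1]
    if positives == 0:
        raise ValueError("No positive cases; the ratio is undefined.")
    return negatives / positives


# Example: 995 negative cases and 5 positive cases.
labels = [0] * 995 + [1] * 5
weight = suggest_scale_pos_weight(labels)
```

The resulting value is a starting point rather than a rule; heavily imbalanced data may still need tuning against validation results.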
Logistic Regression (Logit)
| Parameter | Default | Description |
|---|---|---|
| | | Elastic-Net mixing parameter; only used when |
| | | Inverse of regularization strength; smaller values specify stronger regularization. |
| | | Algorithm to use in the optimization problem. |
| | | Weights associated with classes (e.g., |
| | | Maximum number of iterations taken for the solvers to converge. |
Linear Discriminant Analysis (LDA)
| Parameter | Default | Description |
|---|---|---|
| | | Solver to use (Singular Value Decomposition). |
| | | Shrinkage parameter, used to improve estimation of covariance matrices. |
| | | The class prior probabilities. |
| | | Number of components for dimensionality reduction. |
| | | If True, explicitly computes the empirical class covariance matrix. |
| | | Absolute threshold for a singular value of X to be considered significant. |
Support Vector Machine (SVM)
| Parameter | Default | Description |
|---|---|---|
| | | Regularization parameter. The strength of the regularization is inversely proportional to C. |
| | | Specifies the kernel type to be used in the algorithm. |
| | | Whether to enable probability estimates (required for ROC/PR curves). |
| | | Tolerance for stopping criterion. |
| | | Hard limit on iterations within solver ( |
| | | Controls the pseudo random number generation for probability estimates. |
| | | Weights associated with classes (e.g., |
Gaussian Naive Bayes (GNB)
| Parameter | Default | Description |
|---|---|---|
| | | Prior probabilities of the classes. If specified, priors are not adjusted according to the data. |
| | | Portion of the largest variance of all features added to variances for calculation stability. |
K-Nearest Neighbors (KNN)
| Parameter | Default | Description |
|---|---|---|
| | | Number of neighbors to use by default for queries. |
| | | Weight function used in prediction (all points in neighborhood are weighted equally). |
| | | Algorithm used to compute the nearest neighbors. |
| | | Leaf size passed to BallTree or KDTree (affects memory and speed). |
| | | Power parameter for the Minkowski metric ( |
| | | The distance metric to use for the tree. |
| | | The number of parallel jobs to run for neighbors search. |
Multilayer Perceptron (MLP)
| Parameter | Default | Description |
|---|---|---|
| | | The ith element represents the number of neurons in the ith hidden layer. |
| | | Activation function for the hidden layer. |
| | | The solver for weight optimization. |
| | | L2 penalty (regularization term) parameter. |
| | | Size of minibatches for stochastic optimizers. |
| | | Learning rate schedule for weight updates. |
| | | The initial learning rate used. |
| | | Maximum number of iterations/epochs. |
| | | Whether to shuffle samples in each iteration. |
| | | Determines random number generation for weights and bias initialization. |
| | | Tolerance for the optimization. |
| | | Whether to use early stopping to terminate training when validation score is not improving. |
| | | Number of iterations to stop training when validation score stops improving. |