=================== Algorithm Selection =================== The library supports multiple machine learning algorithms spanning different logical categories, from tree-based ensembles to distance-based and neural methods. Available Algorithms -------------------- Except for `XGBoost `_, all methods use the implementation provided by `scikit-learn `_. ================================ ================================= ============================ ============ ====================== Category Algorithm Class Name Short Name Method ================================ ================================= ============================ ============ ====================== Tree-Based & Ensemble **XGBoost** XGBoost XGB Ensemble (Boosting) \ **Random Forest** RandomForest RF Ensemble (Bagging) \ **Decision Tree** DecisionTree DT Tree Linear & Geometric **Logistic Regression** LogisticRegression Logit Linear \ **Linear Discriminant Analysis** LinearDiscriminantAnalysis LDA Linear / Statistical \ **Support Vector Machine** SupportVectorMachine SVM Geometric Instance-Based **K-Nearest Neighbors** KNearestNeighbors KNN Distance-based Probabilistic **Gaussian Naive Bayes** GaussianNaiveBayes GNB Probabilistic Neural Network **Multilayer Perceptron** MultilayerPerceptron MLP Neural Network ================================ ================================= ============================ ============ ====================== Configuration ------------- To select an algorithm, set the ``model`` key in ``step_class_sets`` to the algorithm's class name (e.g., ``XGBoost``). To customize the hyperparameters for your selected algorithm, add them to the ``model`` step within ``step_param_sets``. Training Configuration Example ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. code-block:: yaml :emphasize-lines: 6, 14 step_class_sets: - name: training_step_set_1 steps: input: InputTrainingSetA validate: KFoldValidation model: XGBoost build: BuildModel step_param_sets: - name: training_param_set_1 steps: input: { } validate: { } model: { learning_rate: 0.01 } build: { } Classification Configuration Example ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. code-block:: yaml :emphasize-lines: 9 step_class_sets: - name: data_set_step_set_1 steps: input: InputDataSetAll summary: SummaryDataSetAll select: SelectDataSetAll locate: LocateDataSetAll extract: ExtractDataSetAll model: XGB classify: ClassifyAll concat: ConcatDataSetAll Imputation ---------- As non-tree-based machine learning methods do not accept NaN values, missing values are automatically imputed using ``SimpleImputer(strategy="median")`` provided by `scikit-learn `_ during the training phase. During the classification phase, instances containing NaN values in their features are handled such that non-tree-based models output a class value of 0 and a score of 0. Model Suite Class ----------------- ``aiqclib`` provides a model suite class that performs training and classification with multiple algorithms simultaneously. To select a set of algorithms, set the ``model`` key in ``step_class_sets`` to ``ModelSuite``.Then, specify a list of actual algorithms and their parameters using the ``methods`` and ``model_params`` keys within ``step_param_sets``, respectively. .. note:: Both ``methods`` and ``model_params`` keys accept "Class name" and "Short name" shown in the table above (e.g., ``XGBoost`` and ``XGB``). In addition, the ``ModelSuite`` class requires specific counterpart classes for training and classification to correctly handle multiple outputs. These are: * ``KFoldValidationSuite`` for k-fold validation * ``BuildModelSuite`` for model building * ``ClassifyAllSuite`` for classification * ``ConcatDataSetSuite`` for the final result concatenation Training Configuration Example ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. code-block:: yaml :emphasize-lines: 5, 6, 7, 14, 15, 16, 17, 18, 19, 20, 21, 22 step_class_sets: - name: training_step_set_1 steps: input: InputTrainingSetA validate: KFoldValidationSuite model: ModelSuite build: BuildModelSuite step_param_sets: - name: training_param_set_1 steps: input: { } validate: { } model: { calculate_shap: True, methods: [ DT, XGB, RF ], model_params: { DT: { }, # Default (you still need to set an empty dictionary) XGB: { scale_pos_weight: 200 , n_jobs: 30 }, RF: { n_jobs: 30 } # Number of parallel jobs } } build: { } Classification Configuration Example ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. code-block:: yaml :emphasize-lines: 9, 10, 11, 21 step_class_sets: - name: data_set_step_set_1 steps: input: InputDataSetAll summary: SummaryDataSetAll select: SelectDataSetAll locate: LocateDataSetAll extract: ExtractDataSetAll model: ModelSuite classify: ClassifyAllSuite concat: ConcatDataSetSuite step_param_sets: - name: training_param_set_1 steps: input: { } summary: { } select: { } locate: { } extract: { } model: { methods: [ DT, XGB, RF ] } classify: { } concat: { } Default Parameters ------------------ If no specific parameters are provided in ``step_param_sets``, the algorithms will initialize with the following default parameters based on their Scikit-Learn or XGBoost implementations. Decision Tree (DT) ^^^^^^^^^^^^^^^^^^ .. list-table:: :widths: 25 20 55 :header-rows: 1 * - Parameter - Default - Description * - ``criterion`` - ``"gini"`` - The function to measure the quality of a split. * - ``splitter`` - ``"best"`` - The strategy used to choose the split at each node. * - ``max_depth`` - ``10`` - The maximum depth of the tree. * - ``min_samples_split`` - ``10`` - The minimum number of samples required to split an internal node. * - ``min_samples_leaf`` - ``5`` - The minimum number of samples required to be at a leaf node. * - ``max_features`` - ``None`` - The number of features to consider when looking for the best split. * - ``random_state`` - ``None`` - Controls the randomness of the estimator for reproducibility. * - ``class_weight`` - ``"balanced"`` - Weights associated with classes (e.g., ``"balanced"``). * - ``ccp_alpha`` - ``0.001`` - Complexity parameter used for Minimal Cost-Complexity Pruning. Random Forest (RF) ^^^^^^^^^^^^^^^^^^ .. list-table:: :widths: 25 20 55 :header-rows: 1 * - Parameter - Default - Description * - ``n_estimators`` - ``100`` - The number of trees in the forest. * - ``criterion`` - ``"gini"`` - The function to measure the quality of a split. * - ``max_depth`` - ``10`` - The maximum depth of the trees. * - ``min_samples_split`` - ``10`` - The minimum number of samples required to split an internal node. * - ``min_samples_leaf`` - ``5`` - The minimum number of samples required to be at a leaf node. * - ``max_features`` - ``"sqrt"`` - The number of features to consider when looking for the best split. * - ``bootstrap`` - ``True`` - Whether bootstrap samples are used when building trees. * - ``n_jobs`` - ``-1`` - The number of jobs to run in parallel (``-1`` means using all processors). * - ``random_state`` - ``None`` - Controls both the randomness of the bootstrapping and feature sampling. * - ``class_weight`` - ``"balanced_subsample"`` - Weights associated with classes (e.g., ``"balanced"``). XGBoost (XGB) ^^^^^^^^^^^^^ .. list-table:: :widths: 25 20 55 :header-rows: 1 * - Parameter - Default - Description * - ``n_estimators`` - ``100`` - Number of boosting rounds (trees to build). * - ``max_depth`` - ``10`` - Maximum tree depth for base learners. * - ``learning_rate`` - ``0.1`` - Boosting learning rate (step size shrinkage). * - ``eval_metric`` - ``"logloss"`` - Evaluation metric for validation data. * - ``scale_pos_weight`` - ``1`` - Multiplier for the gradient of positive samples (e.g., set to sum(negative cases) / sum(positive cases)). * - ``n_jobs`` - ``-1`` - Number of parallel threads used to run XGBoost. Logistic Regression (Logit) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. list-table:: :widths: 25 20 55 :header-rows: 1 * - Parameter - Default - Description * - ``l1_ratio`` - ``0`` - Elastic-Net mixing parameter; only used when ``penalty`` is ``"elasticnet"``. * - ``C`` - ``1.0`` - Inverse of regularization strength; smaller values specify stronger regularization. * - ``solver`` - ``"lbfgs"`` - Algorithm to use in the optimization problem. * - ``class_weight`` - ``"balanced"`` - Weights associated with classes (e.g., ``"balanced"``). * - ``max_iter`` - ``200`` - Maximum number of iterations taken for the solvers to converge. Linear Discriminant Analysis (LDA) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. list-table:: :widths: 25 20 55 :header-rows: 1 * - Parameter - Default - Description * - ``solver`` - ``"svd"`` - Solver to use (Singular Value Decomposition). * - ``shrinkage`` - ``None`` - Shrinkage parameter, used to improve estimation of covariance matrices. * - ``priors`` - ``None`` - The class prior probabilities. * - ``n_components`` - ``None`` - Number of components for dimensionality reduction. * - ``store_covariance`` - ``False`` - If True, explicitly computes the empirical class covariance matrix. * - ``tol`` - ``1.0e-4`` - Absolute threshold for a singular value of X to be considered significant. Support Vector Machine (SVM) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. list-table:: :widths: 25 20 55 :header-rows: 1 * - Parameter - Default - Description * - ``C`` - ``1.0`` - Regularization parameter. The strength of the regularization is inversely proportional to C. * - ``kernel`` - ``"linear"`` - Specifies the kernel type to be used in the algorithm. * - ``probability`` - ``True`` - Whether to enable probability estimates (required for ROC/PR curves). * - ``tol`` - ``1e-3`` - Tolerance for stopping criterion. * - ``max_iter`` - ``200`` - Hard limit on iterations within solver (``-1`` for no limit). * - ``random_state`` - ``None`` - Controls the pseudo random number generation for probability estimates. * - ``class_weight`` - ``"balanced"`` - Weights associated with classes (e.g., ``"balanced"``). Gaussian Naive Bayes (GNB) ^^^^^^^^^^^^^^^^^^^^^^^^^^ .. list-table:: :widths: 25 20 55 :header-rows: 1 * - Parameter - Default - Description * - ``priors`` - ``None`` - Prior probabilities of the classes. If specified, priors are not adjusted according to the data. * - ``var_smoothing`` - ``1e-9`` - Portion of the largest variance of all features added to variances for calculation stability. K-Nearest Neighbors (KNN) ^^^^^^^^^^^^^^^^^^^^^^^^^^ .. list-table:: :widths: 25 20 55 :header-rows: 1 * - Parameter - Default - Description * - ``n_neighbors`` - ``5`` - Number of neighbors to use by default for queries. * - ``weights`` - ``"uniform"`` - Weight function used in prediction (all points in neighborhood are weighted equally). * - ``algorithm`` - ``"auto"`` - Algorithm used to compute the nearest neighbors. * - ``leaf_size`` - ``30`` - Leaf size passed to BallTree or KDTree (affects memory and speed). * - ``p`` - ``2`` - Power parameter for the Minkowski metric (``2`` corresponds to Euclidean distance). * - ``metric`` - ``"minkowski"`` - The distance metric to use for the tree. * - ``n_jobs`` - ``-1`` - The number of parallel jobs to run for neighbors search. Multilayer Perceptron (MLP) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. list-table:: :widths: 25 20 55 :header-rows: 1 * - Parameter - Default - Description * - ``hidden_layer_sizes`` - ``(50,)`` - The ith element represents the number of neurons in the ith hidden layer. * - ``activation`` - ``"relu"`` - Activation function for the hidden layer. * - ``solver`` - ``"adam"`` - The solver for weight optimization. * - ``alpha`` - ``0.0001`` - L2 penalty (regularization term) parameter. * - ``batch_size`` - ``"auto"`` - Size of minibatches for stochastic optimizers. * - ``learning_rate`` - ``"constant"`` - Learning rate schedule for weight updates. * - ``learning_rate_init`` - ``0.001`` - The initial learning rate used. * - ``max_iter`` - ``100`` - Maximum number of iterations/epochs. * - ``shuffle`` - ``True`` - Whether to shuffle samples in each iteration. * - ``random_state`` - ``None`` - Determines random number generation for weights and bias initialization. * - ``tol`` - ``1e-3`` - Tolerance for the optimization. * - ``early_stopping`` - ``True`` - Whether to use early stopping to terminate training when validation score is not improving. * - ``n_iter_no_change`` - ``5`` - Number of iterations to stop training when validation score stops improving.