aiqclib.common.base package

Submodules

aiqclib.common.base.config_base module

Module for handling YAML-based configuration management.

This module provides the ConfigBase abstract base class, which facilitates loading, validating, and retrieving structured data from YAML configuration files. It uses JSON schemas for validation and supports template-based configuration loading.

class aiqclib.common.base.config_base.ConfigBase(section_name, config_file, auto_select=False)[source]

Bases: ABC

Abstract base class for loading and accessing YAML configurations.

This class provides a common interface for handling configuration files. It supports loading from a file path or from a built-in template, validating the configuration against a predefined JSON schema, and providing convenient methods to access specific parts of the config.

Subclasses must override the expected_class_name attribute to match the base_class value specified in the YAML configuration.

Note

This is an abstract base class and should not be instantiated directly.

Variables:

expected_class_name (str, optional) – Must be overridden by subclasses to match the YAML’s base_class entry.
section_name (str) – The top-level section of the config this instance manages.
yaml_schema (dict) – The JSON schema used for validating the configuration.
full_config (dict) – The entire configuration loaded from the YAML file.
valid_yaml (bool) – Flag indicating whether the loaded configuration is valid.
data (dict, optional) – The specific configuration dictionary for the selected entry.
dataset_name (str, optional) – The name of the selected dataset or task.

Parameters:

section_name (str)
config_file (str)
auto_select (bool)

auto_select()[source]

Automatically validate and select a single configuration entry.

Raises: ValueError – If the YAML is invalid or multiple entries exist.
Returns: None
Return type: None

expected_class_name = None

get_base_class(step_name)[source]

Retrieve the associated class name for a specified step.

Parameters: step_name (str) – The name of the step.
Returns: The class name defined for the step.
Return type: str

get_base_path(step_name)[source]

Retrieve the base path for a given processing step.

Parameters: step_name (str) – The name of the step (e.g., “preprocess”).
Returns: The configured base path.
Return type: str
Raises: ValueError – If no base path is found.

get_dataset_folder_name(step_name)[source]

Get the dataset-specific folder name for a given step.

Parameters: step_name (str) – The name of the step.
Returns: The folder name for the dataset, or an empty string.
Return type: str

get_file_name(step_name, default_name=None)[source]

Retrieve the file name for a given step.

Parameters:

step_name (str) – The name of the step.
default_name (Optional[str]) – Fallback file name if not defined in config.

Returns:

The file name for the step.

Return type:

str

Raises:

ValueError – If no file name is found and no default is provided.

get_full_file_name(step_name, default_file_name=None, use_dataset_folder=True, folder_name_auto=True)[source]

Construct a full, normalized file path for a step.

Parameters:

step_name (str) – The name of the step.
default_file_name (Optional[str]) – Default file name if not in config.
use_dataset_folder (bool) – If True, include dataset folder. Defaults to True.
folder_name_auto (bool) – If True, auto-generate step folder name. Defaults to True.

Returns:

The complete, normalized file path.

Return type:

str
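
The following is a minimal sketch of how such a path could be assembled, not the library's implementation. The helper name build_full_file_name and the join order (base path, dataset folder, step folder, file name) are assumptions made for illustration; only the use of normalization reflects the description above.

```python
import os


def build_full_file_name(base_path, dataset_folder, step_folder, file_name,
                         use_dataset_folder=True):
    """Hypothetical helper: join path components and normalize the result."""
    parts = [base_path]
    if use_dataset_folder and dataset_folder:
        parts.append(dataset_folder)
    if step_folder:
        parts.append(step_folder)
    parts.append(file_name)
    # os.path.normpath collapses redundant separators and "." / ".." segments.
    return os.path.normpath(os.path.join(*parts))


# Illustrative values only; none of these names come from a real configuration.
with_dataset = build_full_file_name(
    "data//qc", "dataset_a", "normalize", "normalization_stats.yaml"
)
without_dataset = build_full_file_name(
    "data//qc", "dataset_a", "normalize", "normalization_stats.yaml",
    use_dataset_folder=False,
)
```

With use_dataset_folder=False the dataset folder is simply left out of the joined path, which is one plausible reading of that flag.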

get_model_params(model_long_name, model_short_name)[source]

Retrieve the parameters dictionary for a model.

Parameters:

model_long_name (str) – The long-form name of the model.
model_short_name (str) – The short-form name of the model.

Returns:

The parameters for the specified model, or the entire model parameter dictionary.

Return type:

Dict

get_normalization_file_name(default_file_name='normalization_stats.yaml')[source]

Resolve the full path of the normalization statistics file.

This file holds the data-derived normalization values (for auto_min_max and standard features). It is written during dataset preparation and read back during classification so that the identical fitted normalization is applied without re-entering values.

The path is resolved through the standard step-path machinery using the logical step name "normalize". The folder defaults to normalize and the file name can be overridden via step_param_sets.steps.normalize.file_name in the configuration.

Parameters: default_file_name (str) – File name used when none is set in the config.
Returns: The complete, normalized path to the normalization file.
Return type: str

get_skip_evaluation(target_name)[source]

Resolve whether performance evaluation and label creation should be skipped for a given classification target.

Resolution order:

1. If skip_evaluation is explicitly set in the model step params, that value wins for every target in the step.

2. Otherwise it is derived per target: True when the target’s QC flag is missing/empty (see is_flag_missing()).

Parameters: target_name (str) – The name of the target variable.
Returns: True to skip label creation and performance evaluation.
Return type: bool
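
The resolution order can be sketched as a small standalone function. This is an illustration under stated assumptions, not the library code: the function name resolve_skip_evaluation is hypothetical, and "explicitly set" is assumed to mean that the step params hold a value other than None.

```python
from typing import Optional


def resolve_skip_evaluation(explicit_value: Optional[bool],
                            flag_missing: bool) -> bool:
    """Hypothetical sketch of the two-step resolution order.

    explicit_value: the skip_evaluation entry from the model step params,
        or None when it is not set (assumed meaning of "not explicit").
    flag_missing: whether the target has no usable QC flag.
    """
    if explicit_value is not None:
        # Step-level setting wins for every target in the step.
        return bool(explicit_value)
    # Otherwise derive the answer per target from its QC flag.
    return flag_missing


explicit_wins = resolve_skip_evaluation(False, flag_missing=True)
derived_skip = resolve_skip_evaluation(None, flag_missing=True)
derived_keep = resolve_skip_evaluation(None, flag_missing=False)
```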

get_step_folder_name(step_name, folder_name_auto=True)[source]

Get the folder name for a specific processing step.

Parameters:

step_name (str) – The name of the step.
folder_name_auto (bool) – If True, uses step_name as fallback. Defaults to True.

Returns:

The folder name for the step.

Return type:

str

get_step_params(step_name)[source]

Retrieve the parameters dictionary for a specific step.

Parameters: step_name (str) – The name of the step.
Returns: Parameters for the specified step.
Return type: Dict
Raises: KeyError – If the step or param set is missing.

get_summary_stats(stats_name, stats_type='min_max')[source]

Retrieve specific summary statistics parameters from the configuration.

Parameters:

stats_name (str) – Name of the summary statistics set to retrieve.
stats_type (str) – Type of statistics (e.g., “min_max”). Defaults to “min_max”.

Raises:

ValueError – If the specified stats name is not found.

Returns:

A dictionary containing the requested statistics.

Return type:

Dict

get_target_dict()[source]

Get target variable definitions as a name-keyed dictionary.

Returns: Mapping of target names to their definitions.
Return type: Dict[str, Dict]
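
As a rough illustration of the relationship between the list form and the name-keyed form, the sketch below assumes each target definition carries a "name" key alongside its "flag" key. The key "name" and the sample values are assumptions, not taken from a real configuration.

```python
# Hypothetical target definitions, in the list form a configuration might hold.
target_variables = [
    {"name": "temp", "flag": "temp_qc"},
    {"name": "psal", "flag": "psal_qc"},
]

# Name-keyed view: each definition is indexed by its (assumed) "name" entry.
target_dict = {target["name"]: target for target in target_variables}

# The list of names follows directly from the keys, in definition order.
target_names = list(target_dict)
```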

get_target_file_names(step_name, default_file_name=None, use_dataset_folder=True, folder_name_auto=True)[source]

Construct a dictionary of full file paths for each target variable.

Parameters:

step_name (str) – The name of the step.
default_file_name (Optional[str]) – Default file name template.
use_dataset_folder (bool) – If True, include dataset folder. Defaults to True.
folder_name_auto (bool) – If True, auto-generate step folder name. Defaults to True.

Returns:

Dictionary mapping target names to formatted file paths.

Return type:

Dict[str, str]

get_target_names()[source]

Get the names of all target variables.

Returns: List of target variable names.
Return type: List[str]

get_target_variables()[source]

Get the list of target variable definitions from the configuration.

Returns: List of target variable definition dictionaries.
Return type: List[Dict]

static is_flag_missing(target_value)[source]

Return True when a target variable has no usable QC flag defined.

A flag is considered missing when the flag key is absent, None, or an empty/whitespace-only string. This is the trigger for the label-free (skip_evaluation) classification path.

Parameters: target_value (Dict) – A single target variable definition.
Returns: True if no usable flag column is specified, else False.
Return type: bool
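
The three "missing" cases described above can be sketched as follows. This is a standalone illustration written from the description, not the library source; it assumes the flag is stored under the key "flag" of the target definition.

```python
from typing import Any, Dict


def is_flag_missing(target_value: Dict[str, Any]) -> bool:
    """Sketch: True when the flag key is absent, None, or blank."""
    flag = target_value.get("flag")  # absent key and explicit None both give None
    if flag is None:
        return True
    # An empty or whitespace-only string also counts as missing.
    return isinstance(flag, str) and flag.strip() == ""


cases = {
    "absent": is_flag_missing({"name": "temp"}),
    "none": is_flag_missing({"name": "temp", "flag": None}),
    "blank": is_flag_missing({"name": "temp", "flag": "   "}),
    "present": is_flag_missing({"name": "temp", "flag": "temp_qc"}),
}
```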

select(dataset_name)[source]

Select and load a specific configuration entry from the YAML.

Parameters: dataset_name (str) – The name of the configuration to select.
Raises: ValueError – If validation fails or the dataset name is not found.
Returns: None
Return type: None

set_base_class(step_name, value)[source]

Set the associated class name for a specified step.

Parameters:

step_name (str) – The name of the step.
value (str) – The class name value to set.

Returns:

None

Return type:

None

update_feature_param_with_stats(types=None)[source]

Update feature parameters with corresponding summary statistics in-place.

For each feature whose stats_set.type is a scaling type (i.e. not raw), the resolved statistics are looked up in data’s feature_stats_set (by name and type) and stored under the feature’s stats key, ready for use by the feature classes.

Parameters: types (Optional[List[str]]) – If provided, only resolve features whose stats_set.type is in this list. This allows the manually-supplied min_max statistics to be resolved at configuration-load time while deferring the data-derived auto_min_max and standard statistics until after the summary statistics have been computed. If None, every non-raw feature is resolved (the historical behaviour).
Returns: None
Return type: None
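
The two-phase resolution that the types argument enables might look like the sketch below. The helper resolve_feature_stats, the (type, name) lookup table, and the sample feature entries are all hypothetical; the real configuration layout may differ. Only the filtering rules (skip raw, honour types, store under stats) follow the description above.

```python
from typing import Dict, List, Optional, Tuple

RAW_TYPE = "raw"


def resolve_feature_stats(
    features: List[dict],
    stats_lookup: Dict[Tuple[str, str], dict],
    types: Optional[List[str]] = None,
) -> None:
    """Attach statistics to each scaled feature in-place (hypothetical helper)."""
    for feature in features:
        stats_set = feature.get("stats_set", {})
        stats_type = stats_set.get("type", RAW_TYPE)
        if stats_type == RAW_TYPE:
            continue  # raw features need no statistics
        if types is not None and stats_type not in types:
            continue  # deferred until a later call
        feature["stats"] = stats_lookup[(stats_type, stats_set["name"])]


features = [
    {"feature": "depth", "stats_set": {"type": "min_max", "name": "depth_range"}},
    {"feature": "temp", "stats_set": {"type": "standard", "name": "temp_dist"}},
    {"feature": "flag_count", "stats_set": {"type": "raw"}},
]
stats_lookup = {("min_max", "depth_range"): {"min": 0.0, "max": 2000.0}}

# Load time: only the manually supplied min_max statistics are available.
resolve_feature_stats(features, stats_lookup, types=["min_max"])
resolved_at_load = [f["feature"] for f in features if "stats" in f]

# Later: the data-derived statistics have been computed and can be resolved.
stats_lookup[("standard", "temp_dist")] = {"mean": 12.5, "std": 4.0}
resolve_feature_stats(features, stats_lookup, types=["auto_min_max", "standard"])
resolved_after_stats = [f["feature"] for f in features if "stats" in f]
```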

validate()[source]

Validate the loaded configuration against the corresponding schema.

Returns: A message indicating whether validation succeeded or failed.
Return type: str

aiqclib.common.base.dataset_base module

This module defines the abstract base class DataSetBase, which serves as a foundation for implementing various dataset classes.

It provides a common structure for dataset initialization, including validation of the expected_class_name attribute against the provided configuration. Subclasses are expected to override the expected_class_name attribute to match their specific class identifier in the system’s configuration.

class aiqclib.common.base.dataset_base.DataSetBase(step_name, config)[source]

Bases: ABC

Base class for dataset classes.

Subclasses must define an expected_class_name attribute, which is validated against the step_class_sets entry in the YAML configuration.

Variables:

expected_class_name (str or None) – The expected class name for validation against configuration. This must be overridden by child classes.
step_name (str) – The name of the step identified in the configuration.
config (ConfigBase) – A configuration object that provides the necessary information.

Parameters:

step_name (str)
config (ConfigBase)

Note

This class extends abc.ABC to indicate that it is an abstract base class.

config: ConfigBase

expected_class_name: str | None = None

step_name: str
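
A minimal sketch of the expected_class_name check, using stand-in classes rather than the library's own. ConfigStub, DataSetSketch, the subclass names, and the exception types raised here are assumptions for illustration; the only behaviour taken from the description is that the subclass attribute is compared with the class name configured for the step.

```python
from abc import ABC


class ConfigStub:
    """Stand-in for a configuration object exposing get_base_class()."""

    def __init__(self, step_classes):
        self._step_classes = step_classes

    def get_base_class(self, step_name):
        return self._step_classes[step_name]


class DataSetSketch(ABC):
    """Stand-in base class; not aiqclib's DataSetBase."""

    expected_class_name = None  # subclasses override this

    def __init__(self, step_name, config):
        if not self.expected_class_name:
            # Assumed behaviour: a subclass that forgot to override is rejected.
            raise NotImplementedError("expected_class_name must be overridden")
        configured = config.get_base_class(step_name)
        if configured != self.expected_class_name:
            # Assumed exception type for a class-name mismatch.
            raise ValueError(
                f"step '{step_name}' expects '{configured}', "
                f"got '{self.expected_class_name}'"
            )
        self.step_name = step_name
        self.config = config


class InputSetA(DataSetSketch):
    expected_class_name = "InputSetA"


class InputSetB(DataSetSketch):
    expected_class_name = "InputSetB"


config = ConfigStub({"input": "InputSetA"})
dataset = InputSetA("input", config)  # names match, construction succeeds

try:
    InputSetB("input", config)
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)
```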

aiqclib.common.base.feature_base module

Standardized Feature Extraction and Scaling Module.

This module defines the FeatureBase abstract base class (ABC), which provides a standardized framework for feature engineering tasks using the Polars library. It ensures that subclasses implement a consistent pipeline for feature extraction and multi-stage scaling.

class aiqclib.common.base.feature_base.FeatureBase(target_name=None, feature_info=None, selected_profiles=None, filtered_input=None, selected_rows=None, summary_stats=None)[source]

Bases: ABC

Abstract base class for extracting and scaling features.

Child classes must implement all abstract methods to define specific logic for feature generation and normalization. This class serves as a container for the data and metadata required during the transformation lifecycle.

Variables:

target_name (Optional[str]) – Name of the target variable.
feature_info (Optional[Dict]) – Metadata or configuration for features.
selected_profiles (Optional[DataFrame]) – Polars DataFrame of pre-selected profiles.
filtered_input (Optional[DataFrame]) – Polars DataFrame of pre-filtered input data.
selected_rows (Optional[Dict[str, DataFrame]]) – Mapping of identifiers to specific Polars DataFrames.
summary_stats (Optional[DataFrame]) – Polars DataFrame containing summary statistics.
features (Optional[DataFrame]) – Polars DataFrame containing the processed features.

Parameters:

target_name (str | None)
feature_info (Dict | None)
selected_profiles (DataFrame | None)
filtered_input (DataFrame | None)
selected_rows (Dict[str, DataFrame] | None)
summary_stats (DataFrame | None)

abstractmethod extract_features()[source]

Extract features from the provided data sources.

This method must be implemented by subclasses to generate raw features from inputs like filtered_input or selected_rows. The resulting DataFrame should be assigned to self.features.

Returns: None
Return type: None

abstractmethod scale_first()[source]

Apply the first pass of scaling or normalization to the extracted features.

Typically used for initial transformations such as standard scaling or handling outliers. This method should update the self.features attribute.

Returns: None
Return type: None

abstractmethod scale_second()[source]

Apply a secondary scaling or refinement step to the features.

Used for additional adjustments or domain-specific normalizations required after the first scaling pass. This method should update the self.features attribute.

Returns: None
Return type: None
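
The three-stage contract (extract, first scaling pass, second scaling pass) can be illustrated with a dependency-free stand-in. FeatureSketch and MinMaxFeature are hypothetical classes, and plain Python lists replace the Polars DataFrames used by the real base class; the choice of min-max scaling followed by clipping is likewise only an example of what the two passes might do.

```python
from abc import ABC, abstractmethod


class FeatureSketch(ABC):
    """Stand-in mirroring the three abstract methods; not the library class."""

    def __init__(self, filtered_input=None, summary_stats=None):
        self.filtered_input = filtered_input
        self.summary_stats = summary_stats
        self.features = None

    @abstractmethod
    def extract_features(self):
        """Populate self.features from the inputs."""

    @abstractmethod
    def scale_first(self):
        """First scaling pass; updates self.features."""

    @abstractmethod
    def scale_second(self):
        """Second scaling pass; updates self.features."""


class MinMaxFeature(FeatureSketch):
    def extract_features(self):
        # Here the "feature" is simply a copy of the input values.
        self.features = list(self.filtered_input)

    def scale_first(self):
        low, high = self.summary_stats["min"], self.summary_stats["max"]
        self.features = [(value - low) / (high - low) for value in self.features]

    def scale_second(self):
        # Clip values that fall outside the fitted range into [0, 1].
        self.features = [min(1.0, max(0.0, value)) for value in self.features]


feature = MinMaxFeature(
    filtered_input=[0.0, 5.0, 12.0],
    summary_stats={"min": 0.0, "max": 10.0},
)
feature.extract_features()
feature.scale_first()
feature.scale_second()
```

Because all three methods are abstract, instantiating FeatureSketch directly raises TypeError, which is the same mechanism that keeps an abc.ABC base class with abstract methods from being used on its own.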

aiqclib.common.base.model_base module

This module provides the ModelBase abstract base class, which serves as the foundational interface for all machine learning model implementations within the library. It enforces a consistent structure for building, testing, and persisting models while managing configuration and result storage.

class aiqclib.common.base.model_base.ModelBase(config)[source]

Bases: ABC

Abstract base class for modeling tasks.

Subclasses must define:

expected_class_name to match the configuration.
The build() method for model building.
The test() method for model testing.

Note

Since this class inherits from abc.ABC, it cannot be directly instantiated and must be subclassed.

Parameters: config (ConfigBase)

abstractmethod build()[source]

Build the model architecture or pipeline.

Subclasses must implement logic to create, configure, and compile the model.

Return type: None

expected_class_name: str | None = None

load_model(file_name)[source]

Load or deserialize a model from the given file path.

Parameters:

file_name (str) – The path to the file from which the model will be loaded.

Raises:

FileNotFoundError – If the specified file does not exist.
ValueError – If the loaded model type does not match the expected class defined by the configuration.

Return type:

None

multi = False

save_model(file_name)[source]

Save or serialize the current model to the provided file path.

Parameters: file_name (str) – The path indicating where the model will be saved.
Return type: None

short_name: str | None = None

abstractmethod test()[source]

Evaluate the model performance on a provided test set or validation data.

Subclasses must implement how the model is used to make predictions and how accuracy or performance measures are computed.

Return type: None

update_model_score()[source]

Updates the internal model-scores table with the current test set predictions.

Each row records the model that produced the prediction (method), the fold index (k), the ground truth (label), and the predicted probability (score). The data is stored in the model_score attribute as a Polars DataFrame.

The method column is the lowercased short_name of the model (e.g. "xgb", "dt") and is always present, for both single-model and suite pipelines. This makes the model-scores file self-describing about which model produced each row.

Note that predicted_label is intentionally NOT stored: it is derivable from score and a threshold (score >= threshold), so keeping it would bake in a single threshold and make the file less useful for external threshold-sweeping (ROC/PR analysis). Consumers apply their own threshold to score as needed.

If model_score is already populated (e.g., during cross-validation), the new results are appended (vstacked) to the existing DataFrame.

Raises: ValueError – If test_set or predictions are None.
Return type: None
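
Because only score is stored, a consumer can derive predicted labels at any threshold. The sketch below shows one way to sweep thresholds over rows shaped like the table described above (method, k, label, score). The rows are made-up sample values held in plain dictionaries rather than a Polars DataFrame, and treating label 1 as the positive class is an assumption.

```python
# Made-up rows in the documented column layout.
rows = [
    {"method": "xgb", "k": 1, "label": 1, "score": 0.91},
    {"method": "xgb", "k": 1, "label": 0, "score": 0.40},
    {"method": "xgb", "k": 2, "label": 1, "score": 0.55},
    {"method": "xgb", "k": 2, "label": 0, "score": 0.08},
]


def confusion_at(rows, threshold):
    """Count outcomes when predicted_label is derived as score >= threshold."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for row in rows:
        predicted = int(row["score"] >= threshold)
        if predicted == 1 and row["label"] == 1:
            counts["tp"] += 1
        elif predicted == 1 and row["label"] == 0:
            counts["fp"] += 1
        elif predicted == 0 and row["label"] == 1:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


# Each threshold yields its own confusion counts from the same stored scores.
sweep = {threshold: confusion_at(rows, threshold) for threshold in (0.3, 0.5, 0.7)}
```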

abstractmethod update_nthreads(model)[source]

Update the number of threads set in the model.

Subclasses must implement logic to update the number of threads.

Parameters: model (Self) – The model instance that needs to be updated.
Returns: The model instance with updated thread settings.
Return type: Self

aiqclib.common.base.scikit_learn_model_base module

This module defines SklearnModelBase, an abstract base class for models that adhere to the Scikit-Learn API (including XGBoost and native sklearn models).

It implements common workflows for data conversion, model building, prediction, reporting, and SHAP value calculation for Explainable AI (XAI).

class aiqclib.common.base.scikit_learn_model_base.SklearnModelBase(config)[source]

Bases: ModelBase

Abstract base class for Scikit-Learn compatible models.

This class implements the standard lifecycle methods (build(), test(), predict(), create_report()) assuming the underlying model object supports the standard fit, predict, and predict_proba methods.

It also integrates SHAP (SHapley Additive exPlanations) to provide feature importance values. SHAP calculation is controlled by the calculate_shap configuration flag, and can be overridden via self.enable_shap to disable it during computationally heavy steps like k-fold validation.

Subclasses must implement:

_get_model_class(): To return the specific class type.

Parameters: config (ConfigBase)

build()[source]

Train the classifier using the assigned training set.

Steps:

1. Convert the Polars DataFrame (training_set) to Pandas.
2. Separate features (X) and labels (y).
3. Initialize the model class provided by _get_model_class() with model_params.
4. Fit the model.

Raises: ValueError – If training_set is None or empty.
Return type: None

calculate_shap()[source]

Calculates SHAP values for the test set based on the specific model type.

It automatically selects the optimal Explainer (TreeExplainer, LinearExplainer, or KernelExplainer). SHAP results are formatted into a Polars DataFrame and stored in shap_values.

Raises: ValueError – If test_set or predictions are None.
Return type: None

create_report()[source]

Computes and compiles a comprehensive classification report based on test results.

Calculates precision, recall, f1-score, and support using sklearn.metrics.classification_report(). Stores the result in report.

Raises: ValueError – If test_set or predictions are None.
Return type: None

predict()[source]

Generates predictions for the test set using the trained model.

Converts the Polars test set to a Pandas DataFrame, makes predictions, and stores the results in predictions.

Raises: ValueError – If test_set is None.
Return type: None

safe_predict()[source]

Return type: None

test()[source]

Evaluate the trained classifier on the assigned test set.

Steps:

1. Call predict() to generate predictions on the test set.
2. Call create_report() to compute metrics.
3. Call update_model_score() to store scores.
4. Call calculate_shap() to compute feature importances (if enabled).

Raises: ValueError – If test_set is None.
Return type: None
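
The order of the steps listed for test() can be sketched with a stand-in class that records each call. EvaluationFlowSketch is hypothetical and does no real modelling; the single enable_shap flag stands in for the combined effect of the calculate_shap configuration flag and the enable_shap override, whose exact interaction is not specified here.

```python
class EvaluationFlowSketch:
    """Stand-in that records the call order; not the library class."""

    def __init__(self, test_set, enable_shap=True):
        self.test_set = test_set
        self.enable_shap = enable_shap
        self.calls = []

    def predict(self):
        self.calls.append("predict")

    def create_report(self):
        self.calls.append("create_report")

    def update_model_score(self):
        self.calls.append("update_model_score")

    def calculate_shap(self):
        self.calls.append("calculate_shap")

    def test(self):
        if self.test_set is None:
            raise ValueError("test_set is not set")
        self.predict()
        self.create_report()
        self.update_model_score()
        if self.enable_shap:
            # SHAP is the optional last step, skipped when disabled
            # (for example during k-fold validation).
            self.calculate_shap()


full_run = EvaluationFlowSketch(test_set=[[0.1, 0.2]])
full_run.test()

fold_run = EvaluationFlowSketch(test_set=[[0.1, 0.2]], enable_shap=False)
fold_run.test()
```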

update_nthreads(model)[source]

Update the number of threads set in the model.

Parameters: model (Self) – The model instance whose thread count needs to be updated.
Returns: The updated model instance.
Return type: Self