aiqclib package

aiqclib Interface Module

This module provides a high-level interface to the aiqclib library, exposing core functionalities for configuration management, dataset preparation, model training and evaluation, and dataset classification.
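The three stages exposed by this module can be driven from one small dispatcher. The sketch below is illustrative only: `run_stage` and `STAGE_RUNNERS` are hypothetical names, and the pairing of each stage name with a function is an assumption inferred from the function descriptions on this page, not something aiqclib itself defines.

```python
import importlib
from typing import Optional

# Assumed pairing of stage names with the documented entry points.
STAGE_RUNNERS = {
    "prepare": "create_training_dataset",
    "train": "train_and_evaluate",
    "classify": "classify_dataset",
}


def run_stage(stage: str, config_file: str, set_name: Optional[str] = None) -> None:
    """Read a YAML configuration and run the entry point for one stage."""
    if stage not in STAGE_RUNNERS:
        raise ValueError(
            f"Unknown stage {stage!r}; expected one of {sorted(STAGE_RUNNERS)}"
        )
    # Import lazily so that argument validation works without aiqclib installed.
    aiqclib = importlib.import_module("aiqclib")
    config = aiqclib.read_config(config_file, set_name=set_name)
    getattr(aiqclib, STAGE_RUNNERS[stage])(config)
```

A call such as `run_stage("train", "train_config.yaml")` would then read the configuration and start training; the file name here is a placeholder.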

aiqclib.__version__

The version of the aiqclib library.

Type:

str

aiqclib.classify_dataset(config)

Execute a series of steps to classify all observations in the given dataset, as defined by the provided configuration object.

This function performs the following steps in sequence:
  1. Load and read the initial input data.

  2. Calculate and write summary statistics.

  3. Label and write selected profiles.

  4. Locate and write target rows.

  5. Extract and write target features.

  6. Use the model to predict labels in the input data.

  7. Merge the results with the original input data.

Parameters:

config (ConfigBase) – A configuration object specifying the classes and parameters for each step in the dataset preparation and classification process.

Returns:

None. The function performs I/O operations and modifies datasets based on the configuration but does not return a value.

Return type:

None

aiqclib.create_training_dataset(config)

Execute a series of steps to produce a training dataset.

This function orchestrates the sequential loading and processing of data through multiple preparation steps, as defined by the provided configuration object. It relies on a series of helper functions (e.g., load_stepX_dataset) and class methods to perform distinct operations, ultimately generating and writing the final training and validation datasets.

The processing involves the following stages:
  1. Input Data Loading: Reads and prepares the initial raw data.

  2. Summary Statistics Calculation: Computes and stores aggregate statistics.

  3. Profile Selection: Identifies and labels specific profiles or data subsets.

  4. Target Row Location: Pinpoints specific rows of interest within profiles.

  5. Feature Extraction: Derives modeling features from the located rows.

  6. Dataset Splitting: Divides features into training and validation sets.

Parameters:

config (ConfigBase) – A configuration object specifying the classes and parameters for each step in the dataset preparation process.

Returns:

None. This function performs I/O operations and does not return a value.

Return type:

None

Example:

from aiqclib import create_training_dataset
from aiqclib.common.base.config_base import ConfigBase

cfg = ConfigBase(...)
create_training_dataset(cfg)

aiqclib.format_summary_stats(df, variables=[], summary_stats=['mean', 'median', 'sd', 'pct25', 'pct75'])

Format a summary statistics DataFrame into a pretty-printed string.

This function takes a DataFrame of statistics (as produced by get_summary_stats()) and converts it into a nested dictionary, which is then formatted into a string for display. The output can be filtered by variable and statistic type.

Parameters:
  • df (DataFrame) – The input DataFrame containing summary statistics. It is expected to have a “stats” column for profile-level summaries, or only variable-level statistics for global summaries.

  • variables (List[str]) – An optional list of variable names to include. If empty, all variables are included.

  • summary_stats (List[str]) – An optional list of statistic names (e.g., “mean”, “sd”) to include for profile-level summaries. This parameter is ignored for global (non-“profiles”) summaries.

Returns:

A string containing the pretty-printed, formatted statistics.

Return type:

str
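The filter-then-format behavior described above can be illustrated on a plain nested dictionary. This is a minimal sketch, not aiqclib's implementation: `format_nested_stats` is a hypothetical helper, the `{variable: {statistic: value}}` layout is an assumption, and the numbers are made up for illustration.

```python
import json
from typing import Dict, List, Optional


def format_nested_stats(
    stats: Dict[str, Dict[str, float]],
    variables: Optional[List[str]] = None,
    summary_stats: Optional[List[str]] = None,
) -> str:
    """Filter a nested {variable: {statistic: value}} dict and pretty-print it."""
    selected = {
        variable: {
            name: value
            for name, value in per_variable.items()
            # An empty or missing filter keeps every statistic.
            if not summary_stats or name in summary_stats
        }
        for variable, per_variable in stats.items()
        # An empty or missing filter keeps every variable.
        if not variables or variable in variables
    }
    return json.dumps(selected, indent=2, sort_keys=True)


# Made-up values, for illustration only.
example = {
    "temp": {"mean": 10.5, "sd": 2.0},
    "psal": {"mean": 35.1, "sd": 0.3},
}
print(format_nested_stats(example, variables=["temp"], summary_stats=["mean"]))
```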

aiqclib.get_summary_stats(input_file, summary_type)

Calculate and retrieve summary statistics from a dataset file.

This function loads a dataset, computes global and per-profile summary statistics, and returns the requested type of summary as a Polars DataFrame. It uses a built-in configuration template and dynamically sets the input path based on the provided file.

Parameters:
  • input_file (str) – The path to the input dataset file (e.g., a TSV or Parquet file).

  • summary_type (str) – The type of summary to return. Supported values are “profiles” (for per-profile stats) and “all” (for global stats).

Raises:
  • FileNotFoundError – If the input_file does not exist.

  • ValueError – If the summary_type is not a supported value.

Returns:

A Polars DataFrame containing the requested summary statistics.

Return type:

DataFrame
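One way to combine this function with format_summary_stats() is a small wrapper that checks the two documented error conditions up front. The wrapper name `summarize` and the constant `SUPPORTED_SUMMARY_TYPES` are hypothetical; only the two aiqclib calls and their parameters come from this page.

```python
import os
from typing import List, Optional

SUPPORTED_SUMMARY_TYPES = ("profiles", "all")


def summarize(
    input_file: str,
    summary_type: str = "all",
    variables: Optional[List[str]] = None,
) -> str:
    """Return pretty-printed summary statistics for a dataset file."""
    # Fail fast on the documented error conditions before loading anything.
    if summary_type not in SUPPORTED_SUMMARY_TYPES:
        raise ValueError(
            f"summary_type must be one of {SUPPORTED_SUMMARY_TYPES}, got {summary_type!r}"
        )
    if not os.path.exists(input_file):
        raise FileNotFoundError(input_file)

    import aiqclib  # deferred so the checks above run without the library

    stats = aiqclib.get_summary_stats(input_file, summary_type)
    return aiqclib.format_summary_stats(stats, variables=variables or [])
```

The early checks duplicate what get_summary_stats() already raises; they are there only so that a bad argument is reported before any data is read.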

aiqclib.read_config(file_name, set_name=None, auto_select=True)

Read a YAML configuration file as a ConfigBase object, automatically selecting the appropriate subclass based on the content.

This function:
  1. Resolves the file path by calling aiqclib.common.utils.config.get_config_file().

  2. Reads the specified YAML file and identifies the main key (e.g., “data_sets”, “training_sets”, or “classification_sets”) to map to the corresponding configuration class.

  3. Instantiates the matched configuration class with the resolved path.

  4. If set_name is provided, calls the select method on the instantiated configuration object before returning it.

Parameters:
  • file_name (str) – The path (including filename) to the YAML file.

  • set_name (Optional[str]) – The name (key) of the desired configuration set within the YAML’s dictionary. Defaults to None.

  • auto_select (bool) – If True, the first available data set name will be selected automatically if no specific set_name is provided. Defaults to True.

Returns:

An instantiated configuration object (either DataSetConfig, TrainingConfig, or ClassificationConfig).

Return type:

ConfigBase

Raises:

ValueError – If no valid top-level configuration key is found in the YAML file.
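The key-based selection in step 2 can be pictured as a dictionary lookup. The sketch below is an illustration, not aiqclib's code: `pick_config_class` is a hypothetical helper, the class names are plain strings standing in for the real classes, and the pairing of each key with a class is an assumption based on the order in which they are listed above.

```python
from typing import Any, Dict

# Assumed pairing of top-level YAML keys with configuration class names.
CONFIG_CLASS_BY_KEY: Dict[str, str] = {
    "data_sets": "DataSetConfig",
    "training_sets": "TrainingConfig",
    "classification_sets": "ClassificationConfig",
}


def pick_config_class(parsed_yaml: Dict[str, Any]) -> str:
    """Return the class name matching the first known top-level key."""
    for key, class_name in CONFIG_CLASS_BY_KEY.items():
        if key in parsed_yaml:
            return class_name
    raise ValueError(
        "No valid top-level configuration key found; expected one of: "
        + ", ".join(CONFIG_CLASS_BY_KEY)
    )


print(pick_config_class({"training_sets": [{"name": "example"}]}))
```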

aiqclib.read_shap_scores(file_name, file_type=None, options=None, strip_suffix=True)

Import a SHAP score file produced by aiqclib.

aiqclib writes per-instance SHAP values with three metadata columns (label, predicted_label, score) followed by one <feature>_shap column per feature. This function reads such a file into a Polars DataFrame and, by default, strips the _shap suffix so each feature column is named by its feature — convenient for downstream SHAP plots.

Parameters:
  • file_name (str) – Path to the SHAP score file.

  • file_type (Optional[str]) – Explicit file format ("parquet", "tsv", "tsv.gz", "csv", "csv.gz"). Inferred from the file extension when None.

  • options (Optional[Dict[str, Any]]) – Extra keyword arguments forwarded to the underlying Polars reader.

  • strip_suffix (bool) – Whether to strip the _shap suffix from the SHAP columns. Defaults to True.

Raises:
  • FileNotFoundError – If file_name does not exist.

  • ValueError – If the file type is unsupported, or if stripping the suffix would produce duplicate column names.

Returns:

A Polars DataFrame of SHAP scores.

Return type:

DataFrame
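The file-type inference and suffix stripping described above can be sketched on plain strings and column-name lists. Both helpers below are hypothetical and operate without Polars; they illustrate the documented behavior, including the duplicate-name error, and are not the library's implementation.

```python
from typing import List, Optional

SUPPORTED_TYPES = ("parquet", "tsv", "tsv.gz", "csv", "csv.gz")
METADATA_COLUMNS = ("label", "predicted_label", "score")


def infer_file_type(file_name: str, file_type: Optional[str] = None) -> str:
    """Return an explicit file type, inferring it from the extension if needed."""
    if file_type is not None:
        if file_type not in SUPPORTED_TYPES:
            raise ValueError(f"Unsupported file type: {file_type!r}")
        return file_type
    lowered = file_name.lower()
    for candidate in SUPPORTED_TYPES:
        if lowered.endswith("." + candidate):
            return candidate
    raise ValueError(f"Cannot infer a supported file type from {file_name!r}")


def strip_shap_suffix(columns: List[str], suffix: str = "_shap") -> List[str]:
    """Drop the suffix from feature columns, refusing to create duplicates."""
    renamed = [
        name[: -len(suffix)]
        if name.endswith(suffix) and name not in METADATA_COLUMNS
        else name
        for name in columns
    ]
    if len(set(renamed)) != len(renamed):
        raise ValueError("Stripping the suffix would produce duplicate column names")
    return renamed


print(strip_shap_suffix(["label", "predicted_label", "score", "temp_shap"]))
```

A feature that happens to be called `score` shows why the duplicate check matters: its `score_shap` column would collide with the `score` metadata column once the suffix is removed.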

aiqclib.train_and_evaluate(config)

Perform a training and evaluation process based on the specified configuration.

This function orchestrates the end-to-end workflow, including data loading, model validation, and final model building and testing.

Steps:
  1. Load and process input training data.

  2. Validate the model using the specified validation technique (e.g., k-fold).

  3. Build and test the final model, saving results and trained model artifacts.

Parameters:

config (ConfigBase) – A training configuration object specifying classes and parameters.

Returns:

None. The function performs I/O operations and does not return a value.

Return type:

None
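For readers unfamiliar with k-fold validation, mentioned in step 2 as one possible technique, the sketch below shows how row indices can be partitioned into folds. It is a generic illustration with a hypothetical helper, `k_fold_indices`; how aiqclib actually assigns rows to folds is not described on this page.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_rows: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (training, validation) index lists for each of k folds."""
    if k < 2 or k > n_rows:
        raise ValueError("k must be between 2 and the number of rows")
    for fold in range(k):
        # Row i belongs to the validation set of fold (i mod k).
        validation = [i for i in range(n_rows) if i % k == fold]
        training = [i for i in range(n_rows) if i % k != fold]
        yield training, validation


folds = list(k_fold_indices(10, 3))
for training, validation in folds:
    print(len(training), len(validation))
```

Each row appears in exactly one validation set, so every observation is used for validation once and for training in the remaining folds.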

aiqclib.write_config_template(file_name, stage, extension='')

Write a YAML configuration template for the specified stage (“prepare”, “train”, or “classify”) to a file.

This function:
  1. Chooses a template generator based on the combination of stage and extension.

  2. Validates that the directory for file_name exists.

  3. Writes the generated YAML template text to the specified file.

Parameters:
  • file_name (str) – The path (including filename) where the YAML file will be written.

  • stage (str) – Determines which template to write; must be one of “prepare”, “train”, or “classify”.

  • extension (str) – Determines template extensions; must be one of “”, “full”, or “reduced”.

Raises:
  • ValueError – If the combined stage and extension is not found in the registry.

  • IOError – If the directory of the specified file path does not exist.

Return type:

None
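The three steps above amount to a registry lookup followed by a directory check and a write. The sketch below mirrors that flow under stated assumptions: `write_template_sketch` and `TEMPLATE_REGISTRY` are hypothetical names, the template text is a placeholder, and registering every stage and extension combination is a simplification (the real registry may hold fewer entries).

```python
import os
import tempfile
from typing import Callable, Dict, Tuple

STAGES = ("prepare", "train", "classify")
EXTENSIONS = ("", "full", "reduced")


def _placeholder(stage: str, extension: str) -> Callable[[], str]:
    # Placeholder generator; a real template would contain YAML settings.
    return lambda: f"# {stage} template ({extension or 'default'})\n"


# Assumption: every combination is registered in this sketch.
TEMPLATE_REGISTRY: Dict[Tuple[str, str], Callable[[], str]] = {
    (stage, extension): _placeholder(stage, extension)
    for stage in STAGES
    for extension in EXTENSIONS
}


def write_template_sketch(file_name: str, stage: str, extension: str = "") -> None:
    """Look up a template generator, check the directory, and write the file."""
    try:
        generator = TEMPLATE_REGISTRY[(stage, extension)]
    except KeyError:
        raise ValueError(
            f"No template registered for stage={stage!r}, extension={extension!r}"
        ) from None
    directory = os.path.dirname(os.path.abspath(file_name))
    if not os.path.isdir(directory):
        raise IOError(f"Directory does not exist: {directory}")
    with open(file_name, "w", encoding="utf-8") as handle:
        handle.write(generator())


with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, "train_config.yaml")
    write_template_sketch(target, "train")
    print(os.path.exists(target))
```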

Subpackages