aiqclib.interface package

Submodules

aiqclib.interface.classify module

Main orchestration module for the data classification pipeline.

This module provides the main entry point for executing a comprehensive data classification pipeline. It orchestrates a series of sequential steps, from initial data loading and preparation to feature extraction, model prediction, and final result merging. Each step is configured and executed based on the parameters defined in a provided configuration object.

aiqclib.interface.classify.classify_dataset(config)[source]

Execute a series of steps to classify all observations in the given data set, as defined by the provided configuration object.

This function performs the following steps in sequence:

Load and read the initial input data.
Calculate and write summary statistics.
Label and write selected profiles.
Locate and write target rows.
Extract and write target features.
Use the model to predict labels in the input data.
Merge the results with the original input data.

Parameters:

config (ConfigBase) – A configuration object specifying the classes and parameters for each step in the dataset preparation and classification process.

Returns:

None. The function performs I/O operations and modifies datasets based on the configuration but does not return a value.

Return type:

None
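A minimal usage sketch, assuming a classification YAML file already exists. The wrapper name classify_from_yaml is hypothetical; read_config and classify_dataset are the functions documented in this package, imported lazily inside the wrapper.

```python
def classify_from_yaml(config_path, set_name=None):
    """Read a classification configuration and run the pipeline (sketch).

    `classify_from_yaml` is a hypothetical helper, not part of aiqclib.
    """
    # Lazy imports: aiqclib is only required when the helper is called.
    from aiqclib.interface.config import read_config
    from aiqclib.interface.classify import classify_dataset

    # read_config picks the configuration subclass from the YAML content.
    config = read_config(config_path, set_name=set_name)

    # classify_dataset returns None; its results are written by the pipeline.
    classify_dataset(config)
    return config
```

Returning the configuration object is a convenience of this sketch only, so the caller can inspect which set was selected.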

aiqclib.interface.config module

Module providing utilities for writing YAML configuration templates and reading them as instantiated configuration objects. Supports “prepare”, “train”, “classify”, and “nrt_qc” stages using corresponding registry lookups.

aiqclib.interface.config.read_config(file_name, set_name=None, auto_select=True)[source]

Read a YAML configuration file as a ConfigBase object, automatically selecting the appropriate subclass based on the content.

This function:

Resolves the file path by calling aiqclib.common.utils.config.get_config_file().
Reads the specified YAML file and identifies the main key (e.g., “data_sets”, “training_sets”, “classification_sets”, or “nrt_qc_sets”) to map to the corresponding configuration class.
Instantiates and returns the matched configuration class with the resolved path.
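If set_name is provided, calls the select method on the instantiated configuration object to select that set; if set_name is omitted and auto_select is True, the first available set is selected automatically.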

Parameters:

file_name (str) – The path (including filename) to the YAML file.
set_name (Optional[str]) – The name (key) of the desired configuration set within the YAML’s dictionary. Defaults to None.
auto_select (bool) – If True, the first available data set name will be selected automatically if no specific set_name is provided. Defaults to True.

Returns:

An instantiated configuration object (DataSetConfig, TrainingConfig, ClassificationConfig, or NRTQCConfig).

Return type:

ConfigBase

Raises:

ValueError – If no valid top-level configuration key is found in the YAML file.
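The key-based dispatch described above can be sketched in plain Python. This is an illustration only, not the library's implementation: the key and class names are taken from the text, their pairing is an assumption based on the names, and pick_config_class and pick_set_name are hypothetical helpers.

```python
# Assumed pairing of top-level YAML keys with configuration class names.
KEY_TO_CONFIG_CLASS = {
    "data_sets": "DataSetConfig",
    "training_sets": "TrainingConfig",
    "classification_sets": "ClassificationConfig",
    "nrt_qc_sets": "NRTQCConfig",
}


def pick_config_class(parsed_yaml):
    """Return the class name for the first known key present in the mapping."""
    for key, class_name in KEY_TO_CONFIG_CLASS.items():
        if key in parsed_yaml:
            return class_name
    raise ValueError(
        "No valid top-level configuration key found; expected one of: "
        + ", ".join(KEY_TO_CONFIG_CLASS)
    )


def pick_set_name(available, set_name=None, auto_select=True):
    """Choose a set name following the set_name / auto_select description."""
    if set_name is not None:
        return set_name          # an explicit name always wins
    if auto_select and available:
        return available[0]      # otherwise fall back to the first set
    return None                  # nothing selected


print(pick_config_class({"training_sets": []}))
print(pick_set_name(["set_a", "set_b"]))
```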

aiqclib.interface.config.write_config_template(file_name, stage, extension='')[source]

Write a YAML configuration template for the specified stage (“prepare”, “train”, “classify”, or “nrt_qc”) to a file.

This function:

Chooses a template generator based on the combination of stage and extension.
Validates that the directory for file_name exists.
Writes the generated YAML template text to the specified file.

Parameters:

file_name (str) – The path (including filename) where the YAML file will be written.
stage (str) – Determines which template to write; must be one of “prepare”, “train”, “classify”, or “nrt_qc”.
extension (str) – Determines template extensions; must be one of “”, “full”, or “reduced”.

Raises:

ValueError – If the combined stage and extension is not found in the registry.
IOError – If the directory of the specified file path does not exist.

Return type:

None
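The three steps and the two error conditions above can be sketched as follows. The registry contents are placeholders, not the templates aiqclib ships, only a few (stage, extension) combinations are listed, and write_template_sketch is a hypothetical name.

```python
import os

# Placeholder registry keyed by (stage, extension); bodies are stand-ins.
TEMPLATE_REGISTRY = {
    ("prepare", ""): "# placeholder template: prepare\n",
    ("train", ""): "# placeholder template: train\n",
    ("classify", ""): "# placeholder template: classify\n",
    ("nrt_qc", ""): "# placeholder template: nrt_qc\n",
}


def write_template_sketch(file_name, stage, extension=""):
    """Look up a template, check the target directory, then write the file."""
    key = (stage, extension)
    if key not in TEMPLATE_REGISTRY:
        raise ValueError(
            f"No template registered for stage={stage!r}, extension={extension!r}"
        )

    # The target directory must already exist; it is not created here.
    directory = os.path.dirname(os.path.abspath(file_name))
    if not os.path.isdir(directory):
        raise IOError(f"Directory does not exist: {directory}")

    with open(file_name, "w", encoding="utf-8") as handle:
        handle.write(TEMPLATE_REGISTRY[key])
```

Checking the registry before touching the file system means an unknown stage fails without creating an empty file.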

aiqclib.interface.nrtqc module

Main orchestration module for the Near-Real Time Quality Control (NRT QC) pipeline.

This module provides the main entry point for executing the NRT QC pipeline: reading the input data, applying the configured QC items, aggregating the per-item flags into final NRT flags, and — when the input already carries NRT QC flags — comparing the existing and newly computed flags.

aiqclib.interface.nrtqc.run_nrt_qc(config)[source]

Execute the NRT QC pipeline for the given configuration.

This function performs the following steps in sequence:

Load and read the initial input data.
Apply the configured QC items and write the per-item flag columns.
Aggregate the item flags into a final NRT flag per variable (applying flag propagation items such as temp_to_psal last) and write the output parquet: the original input columns plus all QC item columns and the final NRT flags.
When at least one variable has an existing flag column configured (its flag entry in the qc_variable_set), build and write the per-variable flag comparison reports; otherwise the step is skipped.

Parameters:

config (ConfigBase) – A configuration object specifying the classes and parameters for each step of the NRT QC pipeline.

Returns:

None. The function performs I/O operations based on the configuration but does not return a value.

Return type:

None

aiqclib.interface.prepare module

Data Preparation Pipeline Orchestrator

This module orchestrates the creation of a training dataset by sequentially loading and processing data through multiple preparation steps. It defines the create_training_dataset function, which acts as the main entry point for initiating the multi-stage data pipeline, from raw input to final training and validation datasets.

aiqclib.interface.prepare.create_training_dataset(config)[source]

Execute a series of steps to produce a training dataset.

This function orchestrates the sequential loading and processing of data through multiple preparation steps, as defined by the provided configuration object. It relies on a series of helper functions (e.g., load_stepX_dataset) and class methods to perform distinct operations, ultimately generating and writing the final training and validation datasets.

The processing involves the following stages:

1. Input Data Loading: Reads and prepares the initial raw data.
2. Summary Statistics Calculation: Computes and stores aggregate statistics.
3. Profile Selection: Identifies and labels specific profiles or data subsets.
4. Target Row Location: Pinpoints specific rows of interest within profiles.
5. Feature Extraction: Derives modeling features from the located rows.
6. Dataset Splitting: Divides features into training and validation sets.

Parameters:

config (ConfigBase) – A configuration object specifying the classes and parameters for each step in the dataset preparation process.

Returns:

None. This function performs I/O operations and does not return a value.

Return type:

None

Example:

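from aiqclib.common.base.config_base import ConfigBase
from aiqclib.interface.prepare import create_training_dataset

# Build the configuration object (constructor arguments are elided here).
cfg = ConfigBase(...)
create_training_dataset(cfg)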

aiqclib.interface.shap_io module

High-level interface for importing SHAP score files.

This module exposes read_shap_scores(), which loads the SHAP score files produced by aiqclib during the testing and classification phases so they can be used for SHAP visualization and evaluation (mean-importance bar charts, summary plots, dependence plots, and so on).

aiqclib.interface.shap_io.read_shap_scores(file_name, file_type=None, options=None, strip_suffix=True)[source]

Import a SHAP score file produced by aiqclib.

aiqclib writes per-instance SHAP values with three metadata columns (label, predicted_label, score) followed by one <feature>_shap column per feature. This function reads such a file into a Polars DataFrame and, by default, strips the _shap suffix so each feature column is named by its feature — convenient for downstream SHAP plots.

Parameters:

file_name (str) – Path to the SHAP score file.
file_type (Optional[str]) – Explicit file format ("parquet", "tsv", "tsv.gz", "csv", "csv.gz"). Inferred from the file extension when None.
options (Optional[Dict[str, Any]]) – Extra keyword arguments forwarded to the underlying Polars reader.
strip_suffix (bool) – Whether to strip the _shap suffix from the SHAP columns. Defaults to True.

Raises:

FileNotFoundError – If file_name does not exist.
ValueError – If the file type is unsupported, or if stripping the suffix would produce duplicate column names.

Returns:

A Polars DataFrame of SHAP scores.

Return type:

DataFrame
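The extension inference and suffix stripping described above can be illustrated on plain file and column names, without Polars. This is a sketch under assumptions: infer_file_type and strip_shap_suffix are hypothetical helpers, and the library's own logic may differ.

```python
# File types listed in the file_type parameter description.
SUPPORTED_TYPES = ("parquet", "tsv", "tsv.gz", "csv", "csv.gz")


def infer_file_type(file_name):
    """Guess the file type from the file name's extension."""
    lowered = file_name.lower()
    for file_type in SUPPORTED_TYPES:
        if lowered.endswith("." + file_type):
            return file_type
    raise ValueError(f"Unsupported file type: {file_name}")


def strip_shap_suffix(columns, suffix="_shap"):
    """Rename '<feature>_shap' columns to '<feature>', rejecting duplicates."""
    renamed = [
        name[: -len(suffix)] if name.endswith(suffix) else name
        for name in columns
    ]
    duplicates = sorted({name for name in renamed if renamed.count(name) > 1})
    if duplicates:
        raise ValueError(
            "Stripping the suffix would duplicate columns: " + ", ".join(duplicates)
        )
    return renamed


columns = ["label", "predicted_label", "score", "temp_shap", "psal_shap"]
print(infer_file_type("shap_scores.tsv.gz"))
print(strip_shap_suffix(columns))
```

A feature that shares its name with a metadata column (for example a feature called score) is the case where stripping would collide, which is why the duplicate check comes before the renamed list is returned.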

aiqclib.interface.stats module

Utilities for generating and formatting summary statistics.

This module provides high-level functions to calculate and display summary statistics for a given dataset file. It uses a predefined configuration template to process the data, compute statistics at both global and per-profile levels, and format the results for human-readable output.

aiqclib.interface.stats.format_summary_stats(df, variables=[], summary_stats=['mean', 'median', 'sd', 'pct25', 'pct75'])[source]

Format a summary statistics DataFrame into a pretty-printed string.

This function takes a DataFrame of statistics (as produced by get_summary_stats()) and converts it into a nested dictionary, which is then formatted into a string for display. The output can be filtered by variable and statistic type.

Parameters:

df (DataFrame) – The input DataFrame containing summary statistics. It is expected to have a “stats” column for profile-level summaries, or only variable-level statistics for global summaries.
variables (List[str]) – An optional list of variable names to include. If empty, all variables are included.
summary_stats (List[str]) – An optional list of statistic names (e.g., “mean”, “sd”) to include for profile-level summaries. This parameter is ignored for global (non-“profiles”) summaries.

Returns:

A string containing the pretty-printed, formatted statistics.

Return type:

str

aiqclib.interface.stats.get_summary_stats(input_file, summary_type)[source]

Calculate and retrieve summary statistics from a dataset file.

This function loads a dataset, computes global and per-profile summary statistics, and returns the requested type of summary as a Polars DataFrame. It uses a built-in configuration template and dynamically sets the input path based on the provided file.

Parameters:

input_file (str) – The path to the input dataset file (e.g., a TSV or Parquet file).
summary_type (str) – The type of summary to return. Supported values are “profiles” (for per-profile stats) and “all” (for global stats).

Raises:

FileNotFoundError – If the input_file does not exist.
ValueError – If the summary_type is not a supported value.

Returns:

A Polars DataFrame containing the requested summary statistics.

Return type:

DataFrame

aiqclib.interface.train module

Orchestration module for the model training and evaluation pipeline.

This module defines the primary workflow for loading datasets, performing model validation, and executing the final model construction and testing phases based on a centralized configuration.

aiqclib.interface.train.train_and_evaluate(config)[source]

Perform a training and evaluation process based on the specified configuration.

This function orchestrates the end-to-end workflow, including data loading, model validation, and final model building and testing.

Steps:

Load and process input training data.
Validate the model using the specified validation technique (e.g., k-fold).
Build and test the final model, saving results and trained model artifacts.

Parameters:

config (ConfigBase) – A training configuration object specifying classes and parameters.

Returns:

None. The function performs I/O operations and does not return a value.

Return type:

None