aiqclib.interface package
Submodules
aiqclib.interface.classify module
Main orchestration module for the data classification pipeline.
This module provides the main entry point for executing a comprehensive data classification pipeline. It orchestrates a series of sequential steps, from initial data loading and preparation to feature extraction, model prediction, and final result merging. Each step is configured and executed based on the parameters defined in a provided configuration object.
- aiqclib.interface.classify.classify_dataset(config)[source]
Execute a series of steps to classify all observations in the given data set, as defined by the provided configuration object.
- This function performs the following steps in sequence:
Load and read the initial input data.
Calculate and write summary statistics.
Label and write selected profiles.
Locate and write target rows.
Extract and write target features.
Use the model to predict labels in the input data.
Merge the results with the original input data.
- Parameters:
config (ConfigBase) – A configuration object specifying the classes and parameters for each step in the dataset preparation and classification process.
- Returns:
None. The function performs I/O operations and modifies datasets based on the configuration but does not return a value.
- Return type:
None
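The step sequence described for classify_dataset can be pictured as a list of callables that act on shared state, one after another. The following is a minimal sketch of that pattern only, not aiqclib's implementation: run_pipeline, the three step functions, the dict-based state, and the fixed 0.5 threshold are all hypothetical, and only three of the seven documented steps are stubbed.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical illustration only: none of these names come from aiqclib.
State = Dict[str, object]
Step = Callable[[State], None]


def load_input(state: State) -> None:
    # Stand-in for "Load and read the initial input data."
    state["rows"] = [{"id": 1, "value": 0.2}, {"id": 2, "value": 0.9}]


def predict_labels(state: State) -> None:
    # Stand-in for "Use the model to predict labels": a fixed threshold.
    state["labels"] = {row["id"]: int(row["value"] > 0.5) for row in state["rows"]}


def merge_results(state: State) -> None:
    # Stand-in for "Merge the results with the original input data."
    state["merged"] = [
        dict(row, label=state["labels"][row["id"]]) for row in state["rows"]
    ]


def run_pipeline(steps: List[Step], state: Optional[State] = None) -> State:
    # Run every step in order; each step reads and extends the shared state.
    state = {} if state is None else state
    for step in steps:
        step(state)
    return state


result = run_pipeline([load_input, predict_labels, merge_results])
```

Unlike classify_dataset, which is documented to return None and write its results through I/O, this sketch returns the state so the effect of each step can be inspected.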
aiqclib.interface.config module
Module providing utilities for writing YAML configuration templates and reading them as instantiated configuration objects. Supports “prepare”, “train”, and “classify” stages using corresponding registry lookups.
- aiqclib.interface.config.read_config(file_name, set_name=None, auto_select=True)[source]
Read a YAML configuration file as a ConfigBase object, automatically selecting the appropriate subclass based on the content.
- This function:
Resolves the file path by calling aiqclib.common.utils.config.get_config_file().
Reads the specified YAML file and identifies the main key (e.g., “data_sets”, “training_sets”, or “classification_sets”) to map to the corresponding configuration class.
Instantiates and returns the matched configuration class with the resolved path.
If set_name is provided, it calls the select method on the instantiated configuration object.
- Parameters:
file_name (str) – The path (including filename) to the YAML file.
set_name (Optional[str]) – The name (key) of the desired configuration set within the YAML’s dictionary. Defaults to None.
auto_select (bool) – If True, the first available data set name will be selected automatically if no specific set_name is provided. Defaults to True.
- Returns:
An instantiated configuration object (either DataSetConfig, TrainingConfig, or ClassificationConfig).
- Return type:
- Raises:
ValueError – If no valid top-level configuration key is found in the YAML file.
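The key-based dispatch described for read_config can be sketched as a lookup from top-level key to configuration class. This is a minimal sketch under stated assumptions, not the library's code: the classes below are hypothetical stand-ins for DataSetConfig, TrainingConfig, and ClassificationConfig, the input is an already-parsed dictionary rather than a YAML file, and the layout of each set (a list of entries carrying a "name" field) is assumed for illustration.

```python
from typing import Any, Dict, List, Optional


class SketchConfig:
    """Hypothetical stand-in for a ConfigBase subclass."""

    top_level_key = ""

    def __init__(self, content: Dict[str, Any]) -> None:
        # Assumed layout: the top-level key maps to a list of named sets.
        self.sets: List[Dict[str, Any]] = content[self.top_level_key]
        self.selected: Optional[Dict[str, Any]] = None

    def select(self, set_name: str) -> None:
        for entry in self.sets:
            if entry["name"] == set_name:
                self.selected = entry
                return
        raise ValueError(f"No set named {set_name!r}")


class DataSetSketch(SketchConfig):
    top_level_key = "data_sets"


class TrainingSketch(SketchConfig):
    top_level_key = "training_sets"


class ClassificationSketch(SketchConfig):
    top_level_key = "classification_sets"


# Map each recognised top-level key to its configuration class.
REGISTRY = {
    cls.top_level_key: cls
    for cls in (DataSetSketch, TrainingSketch, ClassificationSketch)
}


def build_config(content, set_name=None, auto_select=True):
    for key, cls in REGISTRY.items():
        if key in content:
            config = cls(content)
            if set_name is not None:
                config.select(set_name)
            elif auto_select and config.sets:
                # Fall back to the first available set.
                config.select(config.sets[0]["name"])
            return config
    raise ValueError("No valid top-level configuration key found")


parsed = {"training_sets": [{"name": "a"}, {"name": "b"}]}
default_cfg = build_config(parsed)
named_cfg = build_config(parsed, set_name="b")
```

Keeping the mapping in one dictionary means a new stage needs only one more class with its own top_level_key.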
- aiqclib.interface.config.write_config_template(file_name, stage, extension='')[source]
Write a YAML configuration template for the specified stage (“prepare”, “train”, or “classify”) to a file.
- This function:
Chooses a template generator based on the combination of stage and extension.
Validates that the directory for file_name exists.
Writes the generated YAML template text to the specified file.
- Parameters:
file_name (str) – The path (including filename) where the YAML file will be written.
stage (str) – Determines which template to write; must be one of “prepare”, “train”, or “classify”.
extension (str) – Determines template extensions; must be one of “”, “full”, or “reduced”.
- Raises:
ValueError – If the combined stage and extension is not found in the registry.
IOError – If the directory of the specified file path does not exist.
- Return type:
None
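The three documented steps of write_config_template (registry lookup, directory check, write) can be sketched as follows. This is a minimal sketch, not the library's code: TEMPLATES, write_template_sketch, and the placeholder template text are hypothetical, and only the empty extension is registered here.

```python
import os
import tempfile

# Hypothetical registry keyed by (stage, extension). The template text is
# placeholder content, not aiqclib's real templates.
TEMPLATES = {
    ("prepare", ""): lambda: "data_sets: []\n",
    ("train", ""): lambda: "training_sets: []\n",
    ("classify", ""): lambda: "classification_sets: []\n",
}


def write_template_sketch(file_name, stage, extension=""):
    # Step 1: choose a generator from the stage/extension combination.
    generator = TEMPLATES.get((stage, extension))
    if generator is None:
        raise ValueError(
            f"No template for stage={stage!r}, extension={extension!r}"
        )
    # Step 2: the target directory must already exist.
    directory = os.path.dirname(os.path.abspath(file_name))
    if not os.path.isdir(directory):
        raise IOError(f"Directory does not exist: {directory}")
    # Step 3: write the generated template text.
    with open(file_name, "w", encoding="utf-8") as handle:
        handle.write(generator())


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "train.yaml")
    write_template_sketch(target, "train")
    with open(target, encoding="utf-8") as handle:
        written = handle.read()
```

Checking the registry before touching the file system means an unknown stage fails fast, without creating or truncating any file.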
aiqclib.interface.prepare module
Data Preparation Pipeline Orchestrator
This module orchestrates the creation of a training dataset by sequentially loading and processing data through multiple preparation steps. It defines the create_training_dataset function, which acts as the main entry point for initiating the multi-stage data pipeline, from raw input to final training and validation datasets.
- aiqclib.interface.prepare.create_training_dataset(config)[source]
Execute a series of steps to produce a training dataset.
This function orchestrates the sequential loading and processing of data through multiple preparation steps, as defined by the provided configuration object. It relies on a series of helper functions (e.g., load_stepX_dataset) and class methods to perform distinct operations, ultimately generating and writing the final training and validation datasets.
The processing involves the following stages:
1. Input Data Loading: Reads and prepares the initial raw data.
2. Summary Statistics Calculation: Computes and stores aggregate statistics.
3. Profile Selection: Identifies and labels specific profiles or data subsets.
4. Target Row Location: Pinpoints specific rows of interest within profiles.
5. Feature Extraction: Derives modeling features from the located rows.
6. Dataset Splitting: Divides features into training and validation sets.
- Parameters:
config (ConfigBase) – A configuration object specifying the classes and parameters for each step in the dataset preparation process.
- Returns:
None. This function performs I/O operations and does not return a value.
- Return type:
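None
- Example:
from aiqclib.common.base.config_base import ConfigBase
from aiqclib.interface.prepare import create_training_dataset

# Build a configuration object; the constructor arguments are elided here.
cfg = ConfigBase(...)
# Run all preparation stages and write the training and validation datasets.
create_training_dataset(cfg)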
aiqclib.interface.stats module
Utilities for generating and formatting summary statistics.
This module provides high-level functions to calculate and display summary statistics for a given dataset file. It uses a predefined configuration template to process the data, compute statistics at both global and per-profile levels, and format the results for human-readable output.
- aiqclib.interface.stats.format_summary_stats(df, variables=[], summary_stats=['mean', 'median', 'sd', 'pct25', 'pct75'])[source]
Format a summary statistics DataFrame into a pretty-printed string.
This function takes a DataFrame of statistics (as produced by get_summary_stats()) and converts it into a nested dictionary, which is then formatted into a string for display. The output can be filtered by variable and statistic type.
- Parameters:
df (DataFrame) – The input DataFrame containing summary statistics. It is expected to have a “stats” column for profile-level summaries, or only variable-level statistics for global summaries.
variables (List[str]) – An optional list of variable names to include. If empty, all variables are included.
summary_stats (List[str]) – An optional list of statistic names (e.g., “mean”, “sd”) to include for profile-level summaries. This parameter is ignored for global (non-“profiles”) summaries.
- Returns:
A string containing the pretty-printed, formatted statistics.
- Return type:
str
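The conversion from a flat statistics table to a filtered, nested, pretty-printed structure can be sketched with plain dictionaries. This is a minimal sketch under stated assumptions rather than the library's formatter: rows are dictionaries instead of a DataFrame, the "profile", "variable", and "value" field names and the "temp" and "psal" variable names are hypothetical, and only the "stats" column name comes from the description above.

```python
import json


def format_stats_sketch(rows, variables=None, summary_stats=None):
    # Build {profile: {variable: {statistic: value}}} from flat rows.
    nested = {}
    for row in rows:
        # An empty or missing filter means "keep everything".
        if variables and row["variable"] not in variables:
            continue
        if summary_stats and row["stats"] not in summary_stats:
            continue
        profile = nested.setdefault(str(row["profile"]), {})
        profile.setdefault(row["variable"], {})[row["stats"]] = row["value"]
    # json.dumps with indentation stands in for the pretty-printer.
    return json.dumps(nested, indent=2, sort_keys=True)


rows = [
    {"profile": 1, "variable": "temp", "stats": "mean", "value": 12.5},
    {"profile": 1, "variable": "temp", "stats": "sd", "value": 0.8},
    {"profile": 1, "variable": "psal", "stats": "mean", "value": 35.1},
]
text = format_stats_sketch(rows, variables=["temp"], summary_stats=["mean"])
```

The sketch uses None rather than a list as the default for both filters, which avoids sharing one mutable default object across calls.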
- aiqclib.interface.stats.get_summary_stats(input_file, summary_type)[source]
Calculate and retrieve summary statistics from a dataset file.
This function loads a dataset, computes global and per-profile summary statistics, and returns the requested type of summary as a Polars DataFrame. It uses a built-in configuration template and dynamically sets the input path based on the provided file.
- Parameters:
input_file (str) – The path to the input dataset file (e.g., a TSV or Parquet file).
summary_type (str) – The type of summary to return. Supported values are “profiles” (for per-profile stats) and “all” (for global stats).
- Raises:
FileNotFoundError – If the input_file does not exist.
ValueError – If the summary_type is not a supported value.
- Returns:
A Polars DataFrame containing the requested summary statistics.
- Return type:
DataFrame
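The difference between the “all” and “profiles” summary types can be illustrated with the standard library alone. This is a minimal sketch, not the library's computation: it works on in-memory (profile, value) pairs instead of a file and a Polars DataFrame, summarize and summary_sketch are hypothetical names, and the quartile method (the default of statistics.quantiles) is an assumption. The statistic names follow the defaults listed for format_summary_stats.

```python
import statistics


def summarize(values):
    # Quartile cut points; the interpolation method is an assumption.
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
        "pct25": q1,
        "pct75": q3,
    }


def summary_sketch(records, summary_type):
    if summary_type == "all":
        # Global summary: pool every value regardless of profile.
        return summarize([value for _, value in records])
    if summary_type == "profiles":
        # Per-profile summary: group values by profile first.
        grouped = {}
        for profile, value in records:
            grouped.setdefault(profile, []).append(value)
        return {profile: summarize(values) for profile, values in grouped.items()}
    raise ValueError(f"Unsupported summary_type: {summary_type!r}")


records = [
    ("p1", 1.0), ("p1", 2.0), ("p1", 3.0),
    ("p2", 10.0), ("p2", 20.0), ("p2", 30.0),
]
per_profile = summary_sketch(records, "profiles")
overall = summary_sketch(records, "all")
```

Both statistics.stdev and statistics.quantiles need at least two data points, so a real implementation would have to decide how to handle single-observation profiles.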
aiqclib.interface.train module
Orchestration module for the model training and evaluation pipeline.
This module defines the primary workflow for loading datasets, performing model validation, and executing the final model construction and testing phases based on a centralized configuration.
- aiqclib.interface.train.train_and_evaluate(config)[source]
Perform a training and evaluation process based on the specified configuration.
This function orchestrates the end-to-end workflow, including data loading, model validation, and final model building and testing.
- Steps:
Load and process input training data.
Validate the model using the specified validation technique (e.g., k-fold).
Build and test the final model, saving results and trained model artifacts.
- Parameters:
config (ConfigBase) – A training configuration object specifying classes and parameters.
- Returns:
None. The function performs I/O operations and does not return a value.
- Return type:
None
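The validation step mentions k-fold as one possible technique. The following is a minimal sketch of how k-fold index splitting works in general, not of how aiqclib performs validation: kfold_indices is a hypothetical helper, and it uses contiguous, unshuffled folds for simplicity.

```python
def kfold_indices(n_samples, k):
    # Yield (training_indices, validation_indices) for each of the k folds.
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [
        n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)
    ]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        training = [i for i in range(n_samples) if i < start or i >= stop]
        yield training, validation
        start = stop


folds = list(kfold_indices(10, 3))
```

Each observation lands in exactly one validation fold, so every row is used for validation once and for training k - 1 times.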