aiqclib.common.utils package

Submodules

aiqclib.common.utils.config module

A set of utilities for handling YAML configuration files.

This module provides utility functions for locating, reading, and parsing configuration files, typically in YAML format. It facilitates easy retrieval of specific items within the parsed configuration data.

aiqclib.common.utils.config.get_config_file(config_file)[source]

Determine the absolute path for a configuration file.

If the provided path does not exist, a FileNotFoundError is raised. If config_file is None, a ValueError is raised.

Parameters:

config_file (Optional[str]) – The path to the configuration file, or None.

Raises:
  • ValueError – If config_file is None.

  • FileNotFoundError – If the path specified by config_file does not exist.

Returns:

The resolved absolute path to the configuration file.

Return type:

str
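
The contract described above can be sketched with the standard library alone. This is a minimal illustration and not the library's implementation; the helper name resolve_config_path is hypothetical.

```python
import os
from typing import Optional


def resolve_config_path(config_file: Optional[str]) -> str:
    """Hypothetical stand-in that mirrors the documented contract."""
    if config_file is None:
        raise ValueError("A configuration file path must be provided.")
    if not os.path.exists(config_file):
        raise FileNotFoundError(f"Configuration file not found: {config_file}")
    # Return an absolute path, matching the documented return value.
    return os.path.abspath(config_file)
```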

aiqclib.common.utils.config.get_config_item(config, section, name)[source]

Retrieve a specific item from a section of a configuration dictionary.

This function iterates through a list of items within a specified section of the configuration, looking for an item where the "name" key matches the given name.

Parameters:
  • config (Dict[str, Any]) – The configuration dictionary, e.g., from read_config().

  • section (str) – The top-level key in config that contains a list of items.

  • name (str) – The value of the “name” key to match within the item.

Raises:
  • KeyError – If the section does not exist in the config dictionary.

  • TypeError – If the value at config[section] is not iterable.

  • ValueError – If no item with the specified name is found in the section.

Returns:

The dictionary of the matching configuration item.

Return type:

Dict[str, Any]
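
The lookup and its three error cases can be sketched in plain Python. The function name find_item and the sample entries are hypothetical; only the section name feature_stats_sets is taken from this reference.

```python
from typing import Any, Dict


def find_item(config: Dict[str, Any], section: str, name: str) -> Dict[str, Any]:
    """Hypothetical stand-in for the documented lookup."""
    # config[section] raises KeyError when the section is missing;
    # iterating over a non-iterable value raises TypeError.
    for item in config[section]:
        if item.get("name") == name:
            return item
    raise ValueError(f"No item named '{name}' in section '{section}'.")


# Made-up configuration content, for illustration only.
config = {
    "feature_stats_sets": [
        {"name": "stats_set_a"},
        {"name": "stats_set_b"},
    ]
}
match = find_item(config, "feature_stats_sets", "stats_set_b")
```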

aiqclib.common.utils.config.read_config(config_file)[source]

Read and parse a YAML configuration file.

This function uses the provided config_file path to locate, read, and parse a YAML file into a Python dictionary.

Parameters:

config_file (Optional[str]) – Full path to the config file, or None to indicate no specific file was provided.

Raises:
  • ValueError – If config_file is None (propagated from get_config_file()).

  • FileNotFoundError – If no file is found at the resolved path (propagated from get_config_file()).

  • yaml.YAMLError – If the configuration file is not valid YAML.

Returns:

A dictionary representing the parsed YAML configuration.

Return type:

Dict[str, Any]

aiqclib.common.utils.file module

This module provides utility functions for reading various file formats into Polars DataFrames.

It supports common data formats like Parquet, TSV (tab-separated values), and CSV (comma-separated values), including their gzipped versions, and allows for automatic file type inference based on file extensions.

aiqclib.common.utils.file.read_input_file(input_file, file_type=None, options=None)[source]

Read an input file into a Polars DataFrame, supporting formats such as Parquet, TSV (optionally gzipped), and CSV (optionally gzipped).

Parameters:
  • input_file (str) – The full path to the file to be read.

  • file_type (Optional[str]) –

    The file format. Must be one of “parquet”, “tsv”, “tsv.gz”, “csv”, or “csv.gz”.

    If set to None or an empty string, the file type is inferred from the file extension. Defaults to None.

  • options (Optional[Dict[str, Any]]) – A dictionary of additional keyword arguments to pass to the Polars reading function (e.g., “has_header”, “infer_schema_length”). Defaults to None.

Raises:
  • FileNotFoundError – If the specified input_file does not exist.

  • ValueError – If the file type cannot be inferred or is not supported.

Returns:

A Polars DataFrame containing the contents of the file.

Return type:

DataFrame
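
How a file type might be resolved from an explicit value or from the file extension can be sketched as follows. The helper infer_file_type is hypothetical, and case-sensitive suffix matching is an assumption; the library's actual inference rules may differ.

```python
from typing import Optional

# The five formats listed for the file_type parameter.
SUPPORTED_TYPES = ("parquet", "tsv", "tsv.gz", "csv", "csv.gz")


def infer_file_type(input_file: str, file_type: Optional[str] = None) -> str:
    """Hypothetical helper: explicit value first, then the file extension."""
    if file_type:  # None and "" both fall through to inference
        if file_type not in SUPPORTED_TYPES:
            raise ValueError(f"Unsupported file type: {file_type}")
        return file_type
    for candidate in SUPPORTED_TYPES:
        # "data.tsv.gz" ends with ".tsv.gz" but not ".tsv", so order is irrelevant.
        if input_file.endswith("." + candidate):
            return candidate
    raise ValueError(f"Cannot infer file type from: {input_file}")
```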

Example Usage:
>>> from aiqclib.common.utils.file import read_input_file
>>> # Assumes 'data.parquet' and 'data.tsv.gz' exist in the working directory.
>>> df = read_input_file("data.parquet")
>>> # An explicit file_type skips inference; options are forwarded to Polars.
>>> df2 = read_input_file("data.tsv.gz", file_type="tsv.gz", options={"has_header": True})

aiqclib.common.utils.input_preprocess module

Automatic creation of the profile_no and observation_no identifier columns.

Some raw inputs do not carry the sequential identifiers aiqclib needs. When enabled in the configuration, this module derives them from other columns, following the documented preprocessing recipe:

  1. sort the rows so observations of one profile are grouped and ordered (by pressure);

  2. build a temporary profile_key from the columns that together identify a profile (by default platform_code, profile_timestamp, longitude and latitude);

  3. profile_no is the dense rank of that key within each platform_code;

  4. observation_no is the 1-indexed running count within each key;

  5. the temporary key is dropped.

The set of columns to create, the key columns and the sort columns are all configurable, so the inference can be tuned to a dataset (or disabled).
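
The recipe above can be illustrated in plain Python on a list of row dictionaries. This is a sketch of the numbering logic only, not the module's Polars implementation; the function add_identifiers and the sample rows are hypothetical, and the dense rank is assumed to start at 1.

```python
from itertools import groupby
from typing import Any, Dict, List

KEY_COLUMNS = ["platform_code", "profile_timestamp", "longitude", "latitude"]
SORT_COLUMNS = KEY_COLUMNS + ["pres"]


def add_identifiers(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Pure-Python illustration of the five-step recipe."""
    # Step 1: sort so each profile's rows are contiguous and ordered by pressure.
    ordered = sorted(rows, key=lambda r: tuple(r[c] for c in SORT_COLUMNS))
    last_profile_no: Dict[Any, int] = {}  # per-platform counter
    result = []
    # Step 2: the temporary profile key is the tuple of key-column values.
    for key, group in groupby(ordered, key=lambda r: tuple(r[c] for c in KEY_COLUMNS)):
        platform = key[0]
        # Step 3: keys arrive in sorted order, so a counter gives the dense rank.
        profile_no = last_profile_no.get(platform, 0) + 1
        last_profile_no[platform] = profile_no
        # Step 4: 1-indexed running count within the key.
        for observation_no, row in enumerate(group, start=1):
            # Step 5: the key is never stored on the row, so nothing is dropped.
            result.append({**row, "profile_no": profile_no, "observation_no": observation_no})
    return result


# Made-up sample rows.
rows = [
    {"platform_code": "A", "profile_timestamp": "2020-01-02", "longitude": 10.0, "latitude": 50.0, "pres": 20.0},
    {"platform_code": "A", "profile_timestamp": "2020-01-01", "longitude": 10.0, "latitude": 50.0, "pres": 10.0},
    {"platform_code": "A", "profile_timestamp": "2020-01-01", "longitude": 10.0, "latitude": 50.0, "pres": 5.0},
    {"platform_code": "B", "profile_timestamp": "2020-01-01", "longitude": 11.0, "latitude": 51.0, "pres": 5.0},
]
numbered = add_identifiers(rows)
```

Because the key includes the raw coordinates, two rows of the same profile whose longitude differs by even a tiny amount would receive different profile_no values in this sketch.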

Warning

The profile key must genuinely identify a profile. Slightly jittered coordinates would split one profile into several; identical timestamps at the same coordinates would merge distinct profiles. Choose the key columns accordingly.

aiqclib.common.utils.input_preprocess.DEFAULT_CREATED_COLUMNS: List[str] = ['profile_no', 'observation_no']

Identifier columns created by default.

aiqclib.common.utils.input_preprocess.DEFAULT_KEY_COLUMNS: List[str] = ['platform_code', 'profile_timestamp', 'longitude', 'latitude']

Columns that, combined, identify a single profile.

aiqclib.common.utils.input_preprocess.DEFAULT_SORT_COLUMNS: List[str] = ['platform_code', 'profile_timestamp', 'longitude', 'latitude', 'pres']

Columns to sort by before numbering (the trailing pres orders observations within a profile).

aiqclib.common.utils.input_preprocess.PLATFORM_COLUMN: str = 'platform_code'

Column over which profile_no is ranked.

aiqclib.common.utils.input_preprocess.create_identifier_columns(df, key_columns=None, sort_columns=None, columns=None, platform_column='platform_code')[source]

Create profile_no and/or observation_no from other columns.

Parameters:
  • df (DataFrame) – The input data (typically right after column renaming).

  • key_columns (Optional[List[str]]) – Columns whose combination identifies a profile. Defaults to DEFAULT_KEY_COLUMNS.

  • sort_columns (Optional[List[str]]) – Columns to sort by before numbering. Defaults to DEFAULT_SORT_COLUMNS.

  • columns (Optional[List[str]]) – Which identifier columns to create; any subset of ["profile_no", "observation_no"]. Defaults to both. Listed columns are (re)generated, overwriting any existing column of the same name.

  • platform_column (str) – Column over which profile_no is ranked.

Raises:

ValueError – If a required source column is missing.

Returns:

The DataFrame with the requested identifier columns added.

Return type:

DataFrame

aiqclib.common.utils.input_validation module

Validation and automatic type correction for mandatory input columns.

aiqclib requires every input dataset to provide a small set of identity and coordinate columns with specific data types. This module centralises:

  • REQUIRED_INPUT_COLUMNS, the editable table of mandatory columns and their expected logical types; and

  • validate_and_convert_input_columns(), which checks that those columns are present and, where a column has the wrong type, attempts to convert it.

The validation is intended to run immediately after column renaming, so it sees the final column names. Automatic conversion is especially useful for TSV/CSV inputs, where numeric and datetime columns are frequently read as strings. As noted below, datetime conversion can only be done automatically for genuine date/datetime values (or string representations of them); numeric epoch encodings are ambiguous and must be converted up front (see the data-preprocessing guide).

aiqclib.common.utils.input_validation.required_column_names()[source]

Return the list of mandatory input column names.

Returns:

Names from REQUIRED_INPUT_COLUMNS, in definition order.

Return type:

List[str]

aiqclib.common.utils.input_validation.validate_and_convert_input_columns(df, required_columns=None)[source]

Validate mandatory input columns and convert mismatched types in place.

For each entry in required_columns this checks that the column exists and that its dtype matches the expected category. Columns with the wrong type are converted where possible (e.g. numeric strings from CSV/TSV become floats/integers, and date/datetime strings become datetimes).

Parameters:
  • df (DataFrame) – The input data, typically immediately after column renaming.

  • required_columns (Optional[Dict[str, str]]) – The mandatory-column table to validate against. Defaults to REQUIRED_INPUT_COLUMNS.

Raises:

ValueError – If any required column is missing, or if a column’s type cannot be converted to the expected type.

Returns:

The validated DataFrame, with any necessary conversions applied.

Return type:

DataFrame
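
The check-then-convert behaviour can be sketched on plain Python columns. Everything named here is hypothetical: the logical type names, the CONVERTERS table, and the sample required-column mapping are illustrative assumptions, since the actual contents of REQUIRED_INPUT_COLUMNS are not shown in this reference.

```python
from datetime import datetime
from typing import Any, Callable, Dict, List

# Hypothetical logical types and their converters.
CONVERTERS: Dict[str, Callable[[Any], Any]] = {
    "string": str,
    "int": int,
    "float": float,
    # ISO date/datetime strings convert; numeric epochs raise TypeError.
    "datetime": lambda v: v if isinstance(v, datetime) else datetime.fromisoformat(v),
}


def validate_and_convert(
    columns: Dict[str, List[Any]], required: Dict[str, str]
) -> Dict[str, List[Any]]:
    """Illustrative check: presence first, then per-column conversion."""
    missing = [name for name in required if name not in columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    converted = dict(columns)
    for name, logical_type in required.items():
        convert = CONVERTERS[logical_type]
        try:
            converted[name] = [convert(v) for v in columns[name]]
        except (TypeError, ValueError) as exc:
            raise ValueError(f"Cannot convert column '{name}' to {logical_type}") from exc
    return converted


# Made-up required-column table and data, as read from a CSV file.
required = {"platform_code": "string", "longitude": "float", "profile_timestamp": "datetime"}
raw = {
    "platform_code": [1, 2],
    "longitude": ["10.5", "11"],
    "profile_timestamp": ["2020-01-01", "2020-01-02T03:04:05"],
}
clean = validate_and_convert(raw, required)
```

In this sketch an integer epoch in profile_timestamp fails with ValueError, which mirrors the requirement that numeric epoch encodings be converted before validation.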

aiqclib.common.utils.metric_plots module

This module provides functions for generating and saving performance metric plots, specifically Receiver Operating Characteristic (ROC) curves and Precision-Recall (PR) curves. It supports plotting for individual models across multiple cross-validation folds (with mean and standard deviation) or comparing multiple models/methods on a single plot. Plots are saved as SVG files.

aiqclib.common.utils.metric_plots.create_metric_plots(model)[source]

Create and save ROC and Precision-Recall plots as an SVG file for a single model.

Generates a figure with two subplots (ROC on left, PR on right) based on the data in model.model_scores. If the model-scores table contains multiple unique ‘k’ values (folds), it plots individual fold curves and then the mean curve with a shaded confidence band (standard deviation).

The output file path is determined by model.output_file_names['metric_plot'].

Parameters:

model (object) –

An object containing evaluation results and output configuration. It is expected to have the following attributes:

  • model_scores (dict[str, polars.DataFrame]): A dictionary where keys are target names and values are Polars DataFrames. Each DataFrame must contain at least ‘k’ (fold identifier), ‘label’ (true binary labels), and ‘score’ (prediction probabilities/scores) columns.

  • output_file_names (dict[str, dict[str, str]]): A dictionary containing output file paths. Specifically, output_file_names['metric_plot'][target_name] should provide the full path where the plot for a given target will be saved.

Raises:

ValueError – If model.model_scores is empty.

Returns:

None

Return type:

None

aiqclib.common.utils.metric_plots.create_multi_method_metric_plots(model)[source]

Create and save ROC and Precision-Recall plots for multiple methods overlaid on the same figure. Assumes the model-scores tables have a ‘method’ column.

The output file path is determined by model.output_file_names['metric_plot'].

Parameters:

model (object) –

An object containing evaluation results and output configuration. It is expected to have the following attributes:

  • model_scores (dict[str, polars.DataFrame]): A dictionary where keys are target names and values are Polars DataFrames. Each DataFrame must contain at least ‘method’ (method identifier), ‘label’ (true binary labels), and ‘score’ (prediction probabilities/scores) columns. It aggregates results across all folds/runs for each method.

  • output_file_names (dict[str, dict[str, str]]): A dictionary containing output file paths. Specifically, output_file_names['metric_plot'][target_name] should provide the full path where the plot for a given target will be saved.

Raises:

ValueError – If model.model_scores is empty.

Returns:

None

Return type:

None

Code Issue

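The area under the Precision-Recall curve is computed as pr_auc = auc(rec[::-1], prec[::-1]). sklearn.metrics.precision_recall_curve returns recall values in decreasing order (the last element is 0, paired with a precision of 1), so the reversal puts recall in increasing order before integration. The reversal is harmless but not required: sklearn.metrics.auc accepts x values that are monotonic in either direction, so auc(rec, prec) returns the same value. Note that this trapezoidal area is not the same quantity as Average Precision (AP) as computed by sklearn.metrics.average_precision_score, which uses a step-wise sum without interpolation. If the plots label the value as AP, average_precision_score is the matching function.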
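
As a point of reference for PR-curve areas, a trapezoidal integral has the same magnitude whichever direction the points are listed in; only the sign changes, and sklearn.metrics.auc corrects the sign for monotonically decreasing x. The sketch below shows this with a hand-written trapezoid rule on made-up points; trapezoid_area is a hypothetical helper, not part of this package or of scikit-learn.

```python
from typing import Sequence


def trapezoid_area(x: Sequence[float], y: Sequence[float]) -> float:
    """Signed trapezoidal area under the points (x[i], y[i])."""
    return sum((x[i + 1] - x[i]) * (y[i] + y[i + 1]) / 2.0 for i in range(len(x) - 1))


# Made-up PR points with recall listed in decreasing order.
recall = [1.0, 0.5, 0.5, 0.0]
precision = [0.5, 0.5, 1.0, 1.0]

decreasing = trapezoid_area(recall, precision)              # signed, negative
increasing = trapezoid_area(recall[::-1], precision[::-1])  # signed, positive
```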

aiqclib.common.utils.normalization module

Normalization utilities.

This module centralises the logic shared by every feature class and the feature extraction step when applying normalization. It supports four normalization “types”, each selected per-feature via stats_set.type in the feature_param_sets section of a configuration file:

  • raw: no normalization (the default).

  • min_max: min-max scaling using values supplied by hand in the feature_stats_sets section of the config. This is the historical behaviour and is kept unchanged.

  • auto_min_max: min-max scaling using min/max values derived automatically from the dataset’s summary statistics.

  • standard: standard scaling (x - mean) / sd using mean/sd values derived automatically from the dataset’s summary statistics.

For auto_min_max and standard the derived values are written to a YAML normalization file during dataset preparation and re-loaded during classification, so the same fitted normalization is applied at classification time without re-entering any values (and without access to the original training data).

The helpers here are deliberately small and pure so they can be unit-tested with synthetic Polars frames, independently of the wider pipeline.

aiqclib.common.utils.normalization.AUTO_SCALING_TYPES = ('auto_min_max', 'standard')

Normalization types whose values are derived from data (and therefore must be persisted to a normalization file for reuse at classification time), as opposed to min_max whose values are supplied directly in the config.

aiqclib.common.utils.normalization.SCALING_TYPES = ('min_max', 'auto_min_max', 'standard')

Normalization types that actually transform feature values. raw is intentionally excluded because it is a no-op.

aiqclib.common.utils.normalization.aggregate_profile_stats(summary_stats, variables=None, exclude=['longitude', 'latitude'])[source]

Aggregate per-profile summary statistics across profiles.

This reshapes the long per-profile rows of a summary_stats table (i.e. the rows whose platform_code is not "all") into one row per (variable, stats) pair, computing the distribution of each per-profile statistic across profiles: its min, mean, pct97.5, max and sd.

The across-profile sd is the only addition relative to the historical SummaryStatsBase.create_summary_stats_profile output; it is required to standard-scale profile_summary_stats features (whose columns are themselves per-profile statistics).

Parameters:
  • summary_stats (DataFrame) – The combined summary statistics table produced by SummaryStatsBase.calculate_stats().

  • variables (Optional[List[str]]) – Optional list of variables to keep. None keeps all.

  • exclude (List[str]) – Variables to drop before aggregating (location variables have no meaningful per-profile spread).

Returns:

A long-form frame with columns variable, stats, min, mean, pct97.5, max and sd.

Return type:

DataFrame

aiqclib.common.utils.normalization.build_scaling_expr(col_name, params, stats_type)[source]

Build a Polars expression that normalizes a single column.

The formula depends on stats_type:

  • min_max / auto_min_max: (x - min) / (max - min)

  • standard: (x - mean) / sd

A zero denominator (a constant column, e.g. a per-profile location whose standard deviation is zero) is handled gracefully by only subtracting the centre, which yields 0 for the constant value rather than inf/nan.

Parameters:
  • col_name (str) – Name of the column to scale (the output keeps the name).

  • params (Dict) – The statistics for this column. For min-max types this is {"min": ..., "max": ...}; for standard it is {"mean": ..., "sd": ...}.

  • stats_type (str) – One of SCALING_TYPES.

Returns:

A Polars expression aliased back to col_name.

Return type:

Expr
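
The two formulas and the zero-denominator rule can be illustrated on a single number. This scalar sketch is not the Polars expression the function returns; scale_value is a hypothetical helper, and raising ValueError for an unknown type is an assumption.

```python
from typing import Dict


def scale_value(x: float, params: Dict[str, float], stats_type: str) -> float:
    """Scalar illustration of the documented scaling formulas."""
    if stats_type in ("min_max", "auto_min_max"):
        centre, spread = params["min"], params["max"] - params["min"]
    elif stats_type == "standard":
        centre, spread = params["mean"], params["sd"]
    else:
        raise ValueError(f"Unknown scaling type: {stats_type}")
    # Zero denominator: only subtract the centre, so a constant column maps to 0.
    return x - centre if spread == 0 else (x - centre) / spread


half = scale_value(10.0, {"min": 0.0, "max": 20.0}, "auto_min_max")
constant = scale_value(3.0, {"mean": 3.0, "sd": 0.0}, "standard")
```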

aiqclib.common.utils.normalization.derive_observation_stats(summary_stats, variables, stats_type)[source]

Derive flat per-variable normalization stats from the global (“all”) rows.

For each requested variable this reads its global summary row (platform_code == "all") and extracts either {min, max} (for auto_min_max) or {mean, sd} (for standard).

Parameters:
  • summary_stats (DataFrame) – The combined summary statistics table.

  • variables (List[str]) – The variables (column names) to derive stats for.

  • stats_type (str) – "auto_min_max" or "standard".

Returns:

A mapping of the form {variable: {"min": ..., "max": ...}} for auto_min_max, or {variable: {"mean": ..., "sd": ...}} for standard.

Return type:

Dict[str, Dict]

aiqclib.common.utils.normalization.derive_profile_stats(profile_stats_long, variables, summary_stats_names, stats_type)[source]

Derive nested per-(variable, stat) normalization stats across profiles.

Used for profile_summary_stats features. profile_stats_long is the output of aggregate_profile_stats(); for each requested variable and each requested per-profile statistic, this extracts {min, max} (for auto_min_max) or {mean, sd} (for standard).

Parameters:
  • profile_stats_long (DataFrame) – The across-profile aggregation.

  • variables (List[str]) – The variables (e.g. ["temp", "psal", "pres"]).

  • summary_stats_names (List[str]) – The per-profile statistics that become feature columns (e.g. ["mean", "median", "sd"]).

  • stats_type (str) – "auto_min_max" or "standard".

Returns:

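A nested mapping of the form {variable: {stat: {"min": ..., "max": ...}}} for auto_min_max, or {variable: {stat: {"mean": ..., "sd": ...}}} for standard.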

Return type:

Dict[str, Dict]

aiqclib.common.utils.normalization.is_scaling_type(stats_type)[source]

Return whether a given stats_set type performs a value transformation.

Parameters:

stats_type (Optional[str]) – The normalization type (e.g. "min_max", "raw").

Returns:

True for min_max, auto_min_max and standard; False for raw, None and any unknown value.

Return type:

bool
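
Given the two constants documented in this module, the distinction between "transforms values" and "needs a persisted normalization file" reduces to tuple membership. The sketch below assumes that reading; both function names are hypothetical.

```python
from typing import Optional

# Values copied from the constants documented in this module.
SCALING_TYPES = ("min_max", "auto_min_max", "standard")
AUTO_SCALING_TYPES = ("auto_min_max", "standard")


def transforms_values(stats_type: Optional[str]) -> bool:
    """True only for types that change feature values."""
    return stats_type in SCALING_TYPES


def needs_normalization_file(stats_type: Optional[str]) -> bool:
    """True only for types whose values are derived from the data."""
    return stats_type in AUTO_SCALING_TYPES
```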

aiqclib.common.utils.normalization.read_normalization_file(input_file)[source]

Read a normalization YAML file written by write_normalization_file().

Parameters:

input_file (str) – Path to the YAML normalization file.

Raises:

FileNotFoundError – If the file does not exist.

Returns:

A dictionary shaped like a feature_stats_set entry (i.e. with name plus auto_min_max / standard lists).

Return type:

Dict

aiqclib.common.utils.normalization.scale_flat_columns(df, stats, stats_type)[source]

Apply scaling to a frame whose stats are keyed directly by column name.

Used by features that operate on raw observed variables (e.g. basic_values, flank_up, flank_down, location), where stats looks like {"temp": {"min": ..., "max": ...}, ...}.

Columns present in stats but absent from df are skipped, so a single shared stats set can be reused across features that expose different subsets of columns.

Parameters:
  • df (DataFrame) – The frame to transform.

  • stats (Dict[str, Dict]) – Mapping of column name to its statistics.

  • stats_type (str) – One of SCALING_TYPES.

Returns:

A new frame with the relevant columns scaled.

Return type:

DataFrame
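
The skip-if-absent behaviour can be sketched with columns held as plain Python lists. The helper scale_flat is hypothetical and stands in for the Polars-based function; the data values are made up.

```python
from typing import Dict, List


def scale_flat(
    columns: Dict[str, List[float]], stats: Dict[str, Dict[str, float]], stats_type: str
) -> Dict[str, List[float]]:
    """Illustration: scale only the columns that have stats and exist."""
    scaled = dict(columns)  # shallow copy, so the input mapping is left untouched
    for name, params in stats.items():
        if name not in columns:
            continue  # stats for a column this feature does not expose
        if stats_type == "standard":
            centre, spread = params["mean"], params["sd"]
        else:  # "min_max" or "auto_min_max"
            centre, spread = params["min"], params["max"] - params["min"]
        scaled[name] = [(v - centre) / spread if spread else v - centre for v in columns[name]]
    return scaled


columns = {"temp": [0.0, 10.0, 20.0], "pres": [5.0, 6.0, 7.0]}
shared_stats = {"temp": {"min": 0.0, "max": 20.0}, "psal": {"min": 30.0, "max": 40.0}}
result = scale_flat(columns, shared_stats, "min_max")
```

Here psal has stats but no column, so it is ignored, and pres has a column but no stats, so it passes through unchanged.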

aiqclib.common.utils.normalization.scale_nested_columns(df, stats, stats_type)[source]

Apply scaling to a frame whose stats are keyed by variable then stat.

Used by profile_summary_stats, whose feature columns are named {variable}_{stat} (e.g. temp_mean) and whose stats looks like {"temp": {"mean": {"min": ..., "max": ...}, ...}, ...}.

Columns derived from the nested keys but absent from df are skipped.

Parameters:
  • df (DataFrame) – The frame to transform.

  • stats (Dict[str, Dict]) – Nested mapping {variable: {stat: stats_dict}}.

  • stats_type (str) – One of SCALING_TYPES.

Returns:

A new frame with the relevant columns scaled.

Return type:

DataFrame
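
The mapping from nested stats keys to {variable}_{stat} column names can be sketched as a flattening step. The helper flatten_nested_stats is hypothetical, the values are made up, and only min-max scaling is shown for brevity.

```python
from typing import Dict


def flatten_nested_stats(
    stats: Dict[str, Dict[str, Dict[str, float]]]
) -> Dict[str, Dict[str, float]]:
    """Map {variable: {stat: params}} to {"<variable>_<stat>": params}."""
    return {
        f"{variable}_{stat}": params
        for variable, per_stat in stats.items()
        for stat, params in per_stat.items()
    }


nested = {"temp": {"mean": {"min": 0.0, "max": 10.0}, "sd": {"min": 0.0, "max": 2.0}}}
columns = {"temp_mean": [5.0], "psal_mean": [35.0]}  # no "temp_sd" column here

flat = flatten_nested_stats(nested)
# Scale columns that have stats; leave the rest as they are.
scaled = {
    name: [(v - flat[name]["min"]) / (flat[name]["max"] - flat[name]["min"]) for v in values]
    if name in flat
    else values
    for name, values in columns.items()
}
```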

aiqclib.common.utils.normalization.write_normalization_file(output_file, stats_set_name, resolved)[source]

Write derived normalization values to a YAML file.

The file mirrors the structure of a single feature_stats_sets entry so it can be loaded straight back into a configuration’s feature_stats_set and consumed by the existing stats-injection machinery. For example:

name: feature_set_1_stats_set_1
auto_min_max:
  - name: basic_values3
    stats: {temp: {min: 0.0, max: 20.0}, ...}
standard:
  - name: location
    stats: {longitude: {mean: 18.8, sd: 2.0}, ...}

Parameters:
  • output_file (str) – Destination path. Parent directories are created.

  • stats_set_name (str) – The name recorded at the top of the file.

  • resolved (Dict[str, Dict[str, Dict]]) – {stats_type: {entry_name: stats_dict}} to serialise.

Returns:

None

Return type:

None
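
How the resolved mapping relates to the file layout shown in the example can be sketched as a pure data transformation. The helper build_stats_set_entry is hypothetical and stops short of writing YAML; it only builds the mapping that would be serialised.

```python
from typing import Any, Dict


def build_stats_set_entry(
    stats_set_name: str, resolved: Dict[str, Dict[str, Dict]]
) -> Dict[str, Any]:
    """Turn {stats_type: {entry_name: stats}} into the documented file layout."""
    entry: Dict[str, Any] = {"name": stats_set_name}
    for stats_type, entries in resolved.items():
        # Each stats type becomes a list of {"name": ..., "stats": ...} items.
        entry[stats_type] = [
            {"name": entry_name, "stats": stats} for entry_name, stats in entries.items()
        ]
    return entry


resolved = {
    "auto_min_max": {"basic_values3": {"temp": {"min": 0.0, "max": 20.0}}},
    "standard": {"location": {"longitude": {"mean": 18.8, "sd": 2.0}}},
}
entry = build_stats_set_entry("feature_set_1_stats_set_1", resolved)
```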