aiqclib.common.utils package
Submodules
aiqclib.common.utils.config module
A set of utilities for handling YAML configuration files.
This module provides utility functions for locating, reading, and parsing configuration files, typically in YAML format. It facilitates easy retrieval of specific items within the parsed configuration data.
- aiqclib.common.utils.config.get_config_file(config_file)
Determine the absolute path for a configuration file.
If the provided path does not exist, a FileNotFoundError is raised. If config_file is None, a ValueError is raised.
- Parameters:
config_file (Optional[str]) – The path to the configuration file, or None.
- Raises:
ValueError – If config_file is None.
FileNotFoundError – If the path specified by config_file does not exist.
- Returns:
The resolved absolute path to the configuration file.
- Return type:
str
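The contract described for get_config_file() can be illustrated with a minimal sketch. The function name resolve_config_path is hypothetical and the body is an assumption based only on the documented behaviour, not the library's actual implementation.

```python
import os
import tempfile
from typing import Optional


def resolve_config_path(config_file: Optional[str]) -> str:
    """Hypothetical stand-in for get_config_file(); not the library's code."""
    if config_file is None:
        raise ValueError("config_file must not be None")
    if not os.path.exists(config_file):
        raise FileNotFoundError(f"Config file not found: {config_file}")
    # Normalise to an absolute path so later steps do not depend on the CWD.
    return os.path.abspath(config_file)


# Usage: create a throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile(suffix=".yaml", delete=False) as handle:
    temp_path = handle.name

resolved_path = resolve_config_path(temp_path)
os.remove(temp_path)
```

Checking for None before touching the filesystem keeps the two failure modes distinct, which matches the two exception types listed above.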
- aiqclib.common.utils.config.get_config_item(config, section, name)
Retrieve a specific item from a section of a configuration dictionary.
This function iterates through a list of items within a specified section of the configuration, looking for an item where the "name" key matches the given name.
- Parameters:
config (Dict[str,Any]) – The configuration dictionary, e.g., from read_config().
section (str) – The top-level key in config that contains a list of items.
name (str) – The value of the "name" key to match within the item.
- Raises:
KeyError β If the section does not exist in the config dictionary.
TypeError β If the value at config[section] is not iterable.
ValueError β If no item with the specified name is found in the section.
- Returns:
The dictionary of the matching configuration item.
- Return type:
Dict[str,Any]
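The lookup described for get_config_item() can be sketched in plain Python. The function find_config_item, the section name path_info_sets and the item contents are all hypothetical; only the matching rule and the three exception types come from the description above.

```python
from typing import Any, Dict


def find_config_item(config: Dict[str, Any], section: str, name: str) -> Dict[str, Any]:
    """Hypothetical re-implementation of the documented lookup."""
    items = config[section]  # KeyError if the section is missing
    for item in items:  # TypeError if the section value is not iterable
        if item.get("name") == name:
            return item
    raise ValueError(f"No item named '{name}' in section '{section}'")


# Hypothetical configuration; section and item names are made up.
example_config = {
    "path_info_sets": [
        {"name": "set_a", "base_path": "/data/a"},
        {"name": "set_b", "base_path": "/data/b"},
    ]
}

matched = find_config_item(example_config, "path_info_sets", "set_b")
```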
- aiqclib.common.utils.config.read_config(config_file)
Read and parse a YAML configuration file.
This function uses the provided config_file path to locate, read, and parse a YAML file into a Python dictionary.
- Parameters:
config_file (Optional[str]) – Full path to the config file, or None to indicate no specific file was provided.
- Raises:
ValueError – If config_file is None (propagated from get_config_file()).
FileNotFoundError – If no file is found at the resolved path (propagated from get_config_file()).
yaml.YAMLError – If the configuration file is not valid YAML.
- Returns:
A dictionary representing the parsed YAML configuration.
- Return type:
Dict[str,Any]
aiqclib.common.utils.file module
This module provides utility functions for reading various file formats into Polars DataFrames.
It supports common data formats like Parquet, TSV (tab-separated values), and CSV (comma-separated values), including their gzipped versions, and allows for automatic file type inference based on file extensions.
- aiqclib.common.utils.file.read_input_file(input_file, file_type=None, options=None)
Read an input file into a Polars DataFrame, supporting formats such as Parquet, TSV (optionally gzipped), and CSV (optionally gzipped).
- Parameters:
input_file (str) – The full path to the file to be read.
file_type (Optional[str]) – The file format. Must be one of:
- "parquet"
- "tsv"
- "tsv.gz"
- "csv"
- "csv.gz"
If set to None or an empty string, the file type is inferred from the file extension. Defaults to None.
options (Optional[Dict[str,Any]]) – A dictionary of additional keyword arguments to pass to the Polars reading function (e.g., "has_header", "infer_schema_length"). Defaults to None.
- Raises:
FileNotFoundError – If the specified input_file does not exist.
ValueError – If the file type cannot be inferred or is not supported.
- Returns:
A Polars DataFrame containing the contents of the file.
- Return type:
DataFrame
- Example Usage:
>>> import polars as pl
>>> # Assuming 'data.parquet' and 'data.tsv.gz' exist for demonstration
>>> # df = read_input_file("data.parquet")
>>> # df2 = read_input_file("data.tsv.gz", file_type="tsv.gz", options={"has_header": True})
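How the file type might be resolved when file_type is None or empty can be sketched as follows. The helper infer_file_type and its case-insensitive suffix matching are assumptions made for illustration; the library's own inference rules may differ in detail.

```python
from typing import Optional

SUPPORTED_TYPES = ("parquet", "tsv", "tsv.gz", "csv", "csv.gz")


def infer_file_type(input_file: str, file_type: Optional[str] = None) -> str:
    """Hypothetical helper: resolve the file type as the docs describe."""
    if file_type:  # an explicit, non-empty value wins
        if file_type not in SUPPORTED_TYPES:
            raise ValueError(f"Unsupported file type: {file_type}")
        return file_type
    lowered = input_file.lower()
    # Try longer suffixes first so compound extensions such as ".tsv.gz"
    # are matched as a whole.
    for candidate in sorted(SUPPORTED_TYPES, key=len, reverse=True):
        if lowered.endswith("." + candidate):
            return candidate
    raise ValueError(f"Cannot infer file type from: {input_file}")
```

With this sketch, infer_file_type("data.tsv.gz") resolves to "tsv.gz", while an unrecognised extension raises ValueError, mirroring the documented error.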
aiqclib.common.utils.input_preprocess module
Automatic creation of the profile_no and observation_no identifier
columns.
Some raw inputs do not carry the sequential identifiers aiqclib needs.
When enabled in the configuration, this module derives them from other columns,
following the documented preprocessing recipe:
1. sort the rows so observations of one profile are grouped and ordered (by pressure);
2. build a temporary profile_key from the columns that together identify a profile (by default platform_code, profile_timestamp, longitude and latitude);
3. profile_no is the dense rank of that key within each platform_code;
4. observation_no is the 1-indexed running count within each key;
5. the temporary key is dropped.
The set of columns to create, the key columns and the sort columns are all configurable, so the inference can be tuned to a dataset (or disabled).
Warning
The profile key must genuinely identify a profile. Slightly jittered coordinates would split one profile into several; identical timestamps at the same coordinates would merge distinct profiles. Choose the key columns accordingly.
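The numbering recipe above can be sketched in plain Python, without Polars, to make the dense rank and running count concrete. The sample rows are invented, and the approach (sorting, then grouping consecutive rows) is an illustrative assumption rather than the module's actual implementation.

```python
from itertools import groupby

KEY_COLUMNS = ["platform_code", "profile_timestamp", "longitude", "latitude"]
SORT_COLUMNS = KEY_COLUMNS + ["pres"]

# Hypothetical observations: two profiles for platform "A", one for "B".
rows = [
    {"platform_code": "A", "profile_timestamp": "2020-01-02", "longitude": 10.0, "latitude": 55.0, "pres": 5.0},
    {"platform_code": "B", "profile_timestamp": "2020-01-01", "longitude": 12.0, "latitude": 56.0, "pres": 1.0},
    {"platform_code": "A", "profile_timestamp": "2020-01-01", "longitude": 10.0, "latitude": 55.0, "pres": 20.0},
    {"platform_code": "A", "profile_timestamp": "2020-01-01", "longitude": 10.0, "latitude": 55.0, "pres": 10.0},
]


def profile_key(row):
    # Temporary key built from the columns that identify a profile.
    return tuple(row[column] for column in KEY_COLUMNS)


# Sort so each profile is contiguous and ordered by pressure.
ordered = sorted(rows, key=lambda row: tuple(row[column] for column in SORT_COLUMNS))

numbered = []
for platform, platform_rows in groupby(ordered, key=lambda row: row["platform_code"]):
    # Consecutive distinct keys receive dense ranks 1, 2, ... per platform.
    for profile_no, (_, profile_rows) in enumerate(groupby(platform_rows, key=profile_key), start=1):
        # 1-indexed running count within the profile.
        for observation_no, row in enumerate(profile_rows, start=1):
            numbered.append({**row, "profile_no": profile_no, "observation_no": observation_no})
```

Because the key is an exact tuple match, changing one longitude from 10.0 to 10.0001 in the sample would start a new profile, which is the splitting hazard the warning describes.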
- aiqclib.common.utils.input_preprocess.DEFAULT_CREATED_COLUMNS: List[str] = ['profile_no', 'observation_no']
Identifier columns created by default.
- aiqclib.common.utils.input_preprocess.DEFAULT_KEY_COLUMNS: List[str] = ['platform_code', 'profile_timestamp', 'longitude', 'latitude']
Columns that, combined, identify a single profile.
- aiqclib.common.utils.input_preprocess.DEFAULT_SORT_COLUMNS: List[str] = ['platform_code', 'profile_timestamp', 'longitude', 'latitude', 'pres']
Columns to sort by before numbering (the trailing pres orders observations within a profile).
- aiqclib.common.utils.input_preprocess.PLATFORM_COLUMN: str = 'platform_code'
Column over which profile_no is ranked.
- aiqclib.common.utils.input_preprocess.create_identifier_columns(df, key_columns=None, sort_columns=None, columns=None, platform_column='platform_code')
Create profile_no and/or observation_no from other columns.
- Parameters:
df (DataFrame) – The input data (typically right after column renaming).
key_columns (Optional[List[str]]) – Columns whose combination identifies a profile. Defaults to DEFAULT_KEY_COLUMNS.
sort_columns (Optional[List[str]]) – Columns to sort by before numbering. Defaults to DEFAULT_SORT_COLUMNS.
columns (Optional[List[str]]) – Which identifier columns to create; any subset of ["profile_no", "observation_no"]. Defaults to both. Listed columns are (re)generated, overwriting any existing column of the same name.
platform_column (str) – Column over which profile_no is ranked.
- Raises:
ValueError β If a required source column is missing.
- Returns:
The DataFrame with the requested identifier columns added.
- Return type:
DataFrame
aiqclib.common.utils.input_validation module
Validation and automatic type correction for mandatory input columns.
aiqclib requires every input dataset to provide a small set of identity and
coordinate columns with specific data types. This module centralises:
- REQUIRED_INPUT_COLUMNS, the editable table of mandatory columns and their expected logical types; and
- validate_and_convert_input_columns(), which checks that those columns are present and, where a column has the wrong type, attempts to convert it.
The validation is intended to run immediately after column renaming, so it sees the final column names. Automatic conversion is especially useful for TSV/CSV inputs, where numeric and datetime columns are frequently read as strings. As noted below, datetime conversion can only be done automatically for genuine date/datetime values (or string representations of them); numeric epoch encodings are ambiguous and must be converted up front (see the data-preprocessing guide).
- aiqclib.common.utils.input_validation.required_column_names()
Return the list of mandatory input column names.
- Returns:
Names from REQUIRED_INPUT_COLUMNS, in definition order.
- Return type:
List[str]
- aiqclib.common.utils.input_validation.validate_and_convert_input_columns(df, required_columns=None)
Validate mandatory input columns and convert mismatched types in place.
For each entry in required_columns this checks that the column exists and that its dtype matches the expected category. Columns with the wrong type are converted where possible (e.g. numeric strings from CSV/TSV become floats/integers, and date/datetime strings become datetimes).
- Parameters:
df (DataFrame) – The input data, typically immediately after column renaming.
required_columns (Optional[Dict[str,str]]) – The mandatory-column table to validate against. Defaults to REQUIRED_INPUT_COLUMNS.
- Raises:
ValueError – If any required column is missing, or if a column's type cannot be converted to the expected type.
- Returns:
The validated DataFrame, with any necessary conversions applied.
- Return type:
DataFrame
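The check-then-convert behaviour can be sketched with plain Python lists standing in for DataFrame columns. The REQUIRED table, its type names and the converters below are hypothetical; the real REQUIRED_INPUT_COLUMNS may list different columns and categories.

```python
from datetime import datetime
from typing import Any, Callable, Dict, List

# Hypothetical table; the real REQUIRED_INPUT_COLUMNS may differ.
REQUIRED = {"platform_code": "string", "pres": "float", "profile_timestamp": "datetime"}

CONVERTERS: Dict[str, Callable[[Any], Any]] = {
    "string": str,
    "float": float,
    # Only genuine datetimes or ISO-formatted strings are accepted here.
    "datetime": lambda v: v if isinstance(v, datetime) else datetime.fromisoformat(v),
}


def validate_and_convert(columns: Dict[str, List[Any]]) -> Dict[str, List[Any]]:
    missing = [name for name in REQUIRED if name not in columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    converted = dict(columns)
    for name, kind in REQUIRED.items():
        try:
            converted[name] = [CONVERTERS[kind](value) for value in columns[name]]
        except (TypeError, ValueError) as error:
            raise ValueError(f"Cannot convert column '{name}' to {kind}") from error
    return converted


# Strings as they might arrive from a TSV/CSV reader.
raw = {"platform_code": ["A"], "pres": ["10.5"], "profile_timestamp": ["2020-01-01T12:00:00"]}
clean = validate_and_convert(raw)
```

In this sketch a numeric epoch value in profile_timestamp is rejected rather than guessed at, in line with the note that epoch encodings must be converted up front.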
aiqclib.common.utils.metric_plots module
This module provides functions for generating and saving performance metric plots, specifically Receiver Operating Characteristic (ROC) curves and Precision-Recall (PR) curves. It supports plotting for individual models across multiple cross-validation folds (with mean and standard deviation) or comparing multiple models/methods on a single plot. Plots are saved as SVG files.
- aiqclib.common.utils.metric_plots.create_metric_plots(model)
Create and save ROC and Precision-Recall plots as an SVG file for a single model.
Generates a figure with two subplots (ROC on left, PR on right) based on the data in model.model_scores. If the model-scores table contains multiple unique "k" values (folds), it plots individual fold curves and then the mean curve with a shaded confidence band (standard deviation).
The output file path is determined by model.output_file_names['metric_plot'].
- Parameters:
model (object) – An object containing evaluation results and output configuration. It is expected to have the following attributes:
- model_scores (dict[str, polars.DataFrame]): A dictionary where keys are target names and values are Polars DataFrames. Each DataFrame must contain at least "k" (fold identifier), "label" (true binary labels), and "score" (prediction probabilities/scores) columns.
- output_file_names (dict[str, dict[str, str]]): A dictionary containing output file paths. Specifically, output_file_names['metric_plot'][target_name] should provide the full path where the plot for a given target will be saved.
- Raises:
ValueError – If model.model_scores is empty.
- Returns:
None
- Return type:
None
- aiqclib.common.utils.metric_plots.create_multi_method_metric_plots(model)
Create and save ROC and Precision-Recall plots for multiple methods overlaid on the same figure. Assumes the model-scores tables have a "method" column.
The output file path is determined by model.output_file_names['metric_plot'].
- Parameters:
model (object) – An object containing evaluation results and output configuration. It is expected to have the following attributes:
- model_scores (dict[str, polars.DataFrame]): A dictionary where keys are target names and values are Polars DataFrames. Each DataFrame must contain at least "method" (method identifier), "label" (true binary labels), and "score" (prediction probabilities/scores) columns. It aggregates results across all folds/runs for each method.
- output_file_names (dict[str, dict[str, str]]): A dictionary containing output file paths. Specifically, output_file_names['metric_plot'][target_name] should provide the full path where the plot for a given target will be saved.
- Raises:
ValueError – If model.model_scores is empty.
- Returns:
None
- Return type:
None
Code Issue
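The area under the Precision-Recall curve is computed as pr_auc = auc(rec[::-1], prec[::-1]). The sklearn.metrics.precision_recall_curve function returns recall values in decreasing order (the last element is 0), so the reversal puts recall in increasing order before integration. The reversal is harmless but not required: sklearn.metrics.auc accepts x values that are monotonically increasing or decreasing, so auc(rec, prec) returns the same value.
The actual caveat is one of naming. This trapezoidal area is not the same quantity as Average Precision (AP) as computed by sklearn.metrics.average_precision_score, which sums precision over recall steps without interpolation, so reporting pr_auc as AP would be imprecise.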
aiqclib.common.utils.normalization module
Normalization utilities.
This module centralises the logic shared by every feature class and the feature
extraction step when applying normalization. It supports four normalization
"types", each selected per-feature via stats_set.type in the
feature_param_sets section of a configuration file:
- raw: no normalization (the default).
- min_max: min-max scaling using values supplied by hand in the feature_stats_sets section of the config. This is the historical behaviour and is kept unchanged.
- auto_min_max: min-max scaling using min/max values derived automatically from the dataset's summary statistics.
- standard: standard scaling (x - mean) / sd using mean/sd values derived automatically from the dataset's summary statistics.
For auto_min_max and standard the derived values are written to a YAML
normalization file during dataset preparation and re-loaded during
classification, so the same fitted normalization is applied at classification
time without re-entering any values (and without access to the original
training data).
The helpers here are deliberately small and pure so they can be unit-tested with synthetic Polars frames, independently of the wider pipeline.
- aiqclib.common.utils.normalization.AUTO_SCALING_TYPES = ('auto_min_max', 'standard')
Normalization types whose values are derived from data (and therefore must be persisted to a normalization file for reuse at classification time), as opposed to min_max whose values are supplied directly in the config.
- aiqclib.common.utils.normalization.SCALING_TYPES = ('min_max', 'auto_min_max', 'standard')
Normalization types that actually transform feature values. raw is intentionally excluded because it is a no-op.
- aiqclib.common.utils.normalization.aggregate_profile_stats(summary_stats, variables=None, exclude=['longitude', 'latitude'])
Aggregate per-profile summary statistics across profiles.
This reshapes the long per-profile rows of a summary_stats table (i.e. the rows whose platform_code is not "all") into one row per (variable, stats) pair, computing the distribution of each per-profile statistic across profiles: its min, mean, pct97.5, max and sd.
The across-profile sd is the only addition relative to the historical SummaryStatsBase.create_summary_stats_profile output; it is required to standard-scale profile_summary_stats features (whose columns are themselves per-profile statistics).
- Parameters:
summary_stats (DataFrame) – The combined summary statistics table produced by SummaryStatsBase.calculate_stats().
variables (Optional[List[str]]) – Optional list of variables to keep. None keeps all.
exclude (List[str]) – Variables to drop before aggregating (location variables have no meaningful per-profile spread).
- Returns:
A long-form frame with columns variable, stats, min, mean, pct97.5, max and sd.
- Return type:
DataFrame
- aiqclib.common.utils.normalization.build_scaling_expr(col_name, params, stats_type)
Build a Polars expression that normalizes a single column.
The formula depends on stats_type:
- min_max / auto_min_max: (x - min) / (max - min)
- standard: (x - mean) / sd
A zero denominator (a constant column, e.g. a per-profile location whose standard deviation is zero) is handled gracefully by only subtracting the centre, which yields 0 for the constant value rather than inf/nan.
- Parameters:
col_name (str) – Name of the column to scale (the output keeps the name).
params (Dict) – The statistics for this column. For min-max types this is {"min": ..., "max": ...}; for standard it is {"mean": ..., "sd": ...}.
stats_type (str) – One of SCALING_TYPES.
- Returns:
A Polars expression aliased back to col_name.
- Return type:
Expr
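The two formulas and the zero-denominator rule can be shown on a single value. The function scale_value is a hypothetical scalar analogue written for illustration; the library builds a Polars expression instead.

```python
from typing import Dict


def scale_value(x: float, params: Dict[str, float], stats_type: str) -> float:
    """Hypothetical scalar version of the documented formulas."""
    if stats_type in ("min_max", "auto_min_max"):
        centre, spread = params["min"], params["max"] - params["min"]
    elif stats_type == "standard":
        centre, spread = params["mean"], params["sd"]
    else:
        raise ValueError(f"Not a scaling type: {stats_type}")
    if spread == 0:
        # Constant column: subtract the centre only, avoiding inf/nan.
        return x - centre
    return (x - centre) / spread
```

For example, scale_value(5.0, {"min": 0.0, "max": 20.0}, "auto_min_max") gives 0.25, and a constant column with sd equal to 0 maps its single value to 0.0.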
- aiqclib.common.utils.normalization.derive_observation_stats(summary_stats, variables, stats_type)
Derive flat per-variable normalization stats from the global ("all") rows.
For each requested variable this reads its global summary row (platform_code == "all") and extracts either {min, max} (for auto_min_max) or {mean, sd} (for standard).
- Parameters:
summary_stats (DataFrame) – The combined summary statistics table.
variables (List[str]) – The variables (column names) to derive stats for.
stats_type (str) – "auto_min_max" or "standard".
- Returns:
{variable: {"min"/"max"} or {"mean"/"sd"}}.
- Return type:
Dict[str,Dict]
- aiqclib.common.utils.normalization.derive_profile_stats(profile_stats_long, variables, summary_stats_names, stats_type)
Derive nested per-(variable, stat) normalization stats across profiles.
Used for profile_summary_stats features. profile_stats_long is the output of aggregate_profile_stats(); for each requested variable and each requested per-profile statistic, this extracts {min, max} (for auto_min_max) or {mean, sd} (for standard).
- Parameters:
profile_stats_long (DataFrame) – The across-profile aggregation.
variables (List[str]) – The variables (e.g. ["temp", "psal", "pres"]).
summary_stats_names (List[str]) – The per-profile statistics that become feature columns (e.g. ["mean", "median", "sd"]).
stats_type (str) – "auto_min_max" or "standard".
- Returns:
{variable: {stat: {"min"/"max"} or {"mean"/"sd"}}}.
- Return type:
Dict[str,Dict]
- aiqclib.common.utils.normalization.is_scaling_type(stats_type)
Return whether a given stats_set type performs a value transformation.
- Parameters:
stats_type (Optional[str]) – The normalization type (e.g. "min_max", "raw").
- Returns:
True for min_max, auto_min_max and standard; False for raw, None and any unknown value.
- Return type:
bool
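The documented truth table is consistent with a simple membership test against SCALING_TYPES. The sketch below is an assumption about how such a check could look, with is_scaling as a hypothetical name; it is not taken from the library source.

```python
from typing import Optional

# Values as documented for SCALING_TYPES in this module.
SCALING_TYPES = ("min_max", "auto_min_max", "standard")


def is_scaling(stats_type: Optional[str]) -> bool:
    # "raw", None and unknown strings are not members, so they yield False.
    return stats_type in SCALING_TYPES
```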
- aiqclib.common.utils.normalization.read_normalization_file(input_file)
Read a normalization YAML file written by write_normalization_file().
- Parameters:
input_file (str) – Path to the YAML normalization file.
- Raises:
FileNotFoundError – If the file does not exist.
- Returns:
A dictionary shaped like a feature_stats_set entry (i.e. with name plus auto_min_max/standard lists).
- Return type:
Dict
- aiqclib.common.utils.normalization.scale_flat_columns(df, stats, stats_type)
Apply scaling to a frame whose stats are keyed directly by column name.
Used by features that operate on raw observed variables (e.g. basic_values, flank_up, flank_down, location), where stats looks like {"temp": {"min": ..., "max": ...}, ...}.
Columns present in stats but absent from df are skipped, so a single shared stats set can be reused across features that expose different subsets of columns.
- Parameters:
df (DataFrame) – The frame to transform.
stats (Dict[str,Dict]) – Mapping of column name to its statistics.
stats_type (str) – One of SCALING_TYPES.
- Returns:
A new frame with the relevant columns scaled.
- Return type:
DataFrame
- aiqclib.common.utils.normalization.scale_nested_columns(df, stats, stats_type)
Apply scaling to a frame whose stats are keyed by variable then stat.
Used by profile_summary_stats, whose feature columns are named {variable}_{stat} (e.g. temp_mean) and whose stats looks like {"temp": {"mean": {"min": ..., "max": ...}, ...}, ...}.
Columns derived from the nested keys but absent from df are skipped.
- Parameters:
df (DataFrame) – The frame to transform.
stats (Dict[str,Dict]) – Nested mapping {variable: {stat: stats_dict}}.
stats_type (str) – One of SCALING_TYPES.
- Returns:
A new frame with the relevant columns scaled.
- Return type:
DataFrame
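The mapping from nested stats keys to {variable}_{stat} column names can be sketched with a dictionary of lists standing in for a DataFrame. The helper scale_nested is hypothetical and handles only the min-max case; the values are invented.

```python
from typing import Dict, List


def min_max(values: List[float], low: float, high: float) -> List[float]:
    spread = high - low
    # Zero spread: only subtract the minimum, as described for constant columns.
    return [(v - low) / spread if spread else v - low for v in values]


def scale_nested(
    frame: Dict[str, List[float]],
    stats: Dict[str, Dict[str, Dict[str, float]]],
) -> Dict[str, List[float]]:
    """Hypothetical min-max-only analogue of scale_nested_columns()."""
    scaled = dict(frame)  # shallow copy so the input mapping is left untouched
    for variable, per_stat in stats.items():
        for stat, params in per_stat.items():
            column = f"{variable}_{stat}"
            if column not in frame:
                continue  # stats for columns the frame lacks are skipped
            scaled[column] = min_max(frame[column], params["min"], params["max"])
    return scaled


frame = {"temp_mean": [5.0, 10.0], "pres_mean": [100.0]}
stats = {"temp": {"mean": {"min": 0.0, "max": 20.0}, "sd": {"min": 0.0, "max": 4.0}}}
result = scale_nested(frame, stats)
```

Here temp_mean is scaled, temp_sd is skipped because the frame has no such column, and pres_mean is left unchanged because no stats were supplied for it.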
- aiqclib.common.utils.normalization.write_normalization_file(output_file, stats_set_name, resolved)
Write derived normalization values to a YAML file.
The file mirrors the structure of a single feature_stats_sets entry so it can be loaded straight back into a configuration's feature_stats_set and consumed by the existing stats-injection machinery. For example:
name: feature_set_1_stats_set_1
auto_min_max:
  - name: basic_values3
    stats: {temp: {min: 0.0, max: 20.0}, ...}
standard:
  - name: location
    stats: {longitude: {mean: 18.8, sd: 2.0}, ...}
- Parameters:
output_file (str) – Destination path. Parent directories are created.
stats_set_name (str) – The name recorded at the top of the file.
resolved (Dict[str,Dict[str,Dict]]) – {stats_type: {entry_name: stats_dict}} to serialise.
- Returns:
None
- Return type:
None
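How the resolved mapping relates to the file layout shown in the example can be sketched as a pure reshaping step, leaving out the YAML serialisation itself. The helper to_stats_set_entry is hypothetical and the layout is inferred from the example in this entry, so treat it as an assumption.

```python
from typing import Any, Dict, List


def to_stats_set_entry(stats_set_name: str, resolved: Dict[str, Dict[str, Dict]]) -> Dict[str, Any]:
    """Hypothetical helper: reshape resolved stats into the documented layout."""
    entry: Dict[str, Any] = {"name": stats_set_name}
    for stats_type, entries in resolved.items():
        # Each normalization type becomes a list of {name, stats} records.
        records: List[Dict[str, Any]] = [
            {"name": entry_name, "stats": stats} for entry_name, stats in entries.items()
        ]
        entry[stats_type] = records
    return entry


resolved = {
    "auto_min_max": {"basic_values3": {"temp": {"min": 0.0, "max": 20.0}}},
    "standard": {"location": {"longitude": {"mean": 18.8, "sd": 2.0}}},
}
document = to_stats_set_entry("feature_set_1_stats_set_1", resolved)
```

Dumping such a dictionary with a YAML library would produce a file of the same shape as the example, which is what allows it to be read back as a stats-set entry.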