aiqclib.prepare.step5_extract_features package

Submodules

aiqclib.prepare.step5_extract_features.dataset_a module

This module defines the ExtractDataSetA class, a specialized feature extraction class for Copernicus CTD data. It extends ExtractFeatureBase to implement specific data processing and feature generation steps for this dataset, integrating with the aiqclib framework’s configuration and data flow.

class aiqclib.prepare.step5_extract_features.dataset_a.ExtractDataSetA(config, input_data=None, selected_profiles=None, selected_rows=None, summary_stats=None)[source]

Bases: ExtractFeatureBase

A subclass of ExtractFeatureBase designed to extract features specifically from Copernicus CTD data.

This class sets its expected_class_name to "ExtractDataSetA", ensuring it is recognized in the YAML configuration as a valid extract class within the aiqclib framework. It inherits the full feature extraction pipeline and lifecycle management from its base class, ExtractFeatureBase.

Variables:

expected_class_name (str) – The name expected in configuration files to identify this class.

Parameters:
  • config (ConfigBase)

  • input_data (DataFrame | None)

  • selected_profiles (DataFrame | None)

  • selected_rows (Dict[str, DataFrame] | None)

  • summary_stats (DataFrame | None)

expected_class_name: str = 'ExtractDataSetA'
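
The role of expected_class_name can be illustrated with a minimal sketch. The base class below is a hypothetical stand-in, not aiqclib's DataSetBase, and the config is a plain dict with an assumed key "extract_class". The actual validation performed by the framework may differ.

```python
class SketchBase:
    """Hypothetical stand-in for a config-validated base class."""

    expected_class_name: str = ""

    def __init__(self, config: dict) -> None:
        # Assumed config key; the real framework reads a YAML-backed ConfigBase.
        configured = config.get("extract_class")
        if configured != self.expected_class_name:
            raise ValueError(
                f"config names {configured!r}, "
                f"but this class expects {self.expected_class_name!r}"
            )
        self.config = config


class SketchDataSetA(SketchBase):
    # Only the identifying name changes; all behaviour is inherited.
    expected_class_name = "ExtractDataSetA"


extractor = SketchDataSetA({"extract_class": "ExtractDataSetA"})
```

Under these assumptions, a config that names a different class fails fast at construction time, before any feature extraction runs.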

aiqclib.prepare.step5_extract_features.extract_base module

This module provides the ExtractFeatureBase abstract base class, designed for orchestrating feature extraction workflows using Polars. It facilitates the processing of dataset targets, dynamic loading of feature extraction logic, and the persistence of generated features to disk.

class aiqclib.prepare.step5_extract_features.extract_base.ExtractFeatureBase(config, input_data=None, selected_profiles=None, selected_rows=None, summary_stats=None)[source]

Bases: DataSetBase

Abstract base class for extracting features from dataset rows.

This class provides the core framework for managing data, applying feature extraction logic, and saving the results. It inherits from DataSetBase to ensure configuration consistency and utilizes a feature loader to dynamically compose feature extraction steps. The extracted features, once generated, can be written to Parquet files.

Parameters:
  • config (ConfigBase)

  • input_data (DataFrame | None)

  • selected_profiles (DataFrame | None)

  • selected_rows (Dict[str, DataFrame] | None)

  • summary_stats (DataFrame | None)

apply_normalization()[source]

Resolve and inject data-derived normalization statistics.

When at least one feature uses auto_min_max or standard:

  • in "fit" mode (dataset preparation) the statistics are derived from summary_stats and written to the normalization file;

  • in "apply" mode (classification) the statistics are read back from the normalization file produced during preparation.

In both cases the resolved statistics are injected into the feature parameters so the feature classes can scale their columns. Features that only use raw or manual min_max are unaffected (and, if no feature uses a data-derived type, this method does nothing).

Returns:

None

Return type:

None
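
The fit/apply behaviour described above might be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: the function name, the feature dictionary keys ("column", "normalization", "stats"), the dict-based summary statistics, and the JSON file format are all hypothetical.

```python
import json
import tempfile
from pathlib import Path

DATA_DERIVED = {"auto_min_max", "standard"}


def resolve_normalization(role, features, summary_stats, stats_file):
    """Inject data-derived statistics into feature parameter dicts (sketch)."""
    needs_stats = [f for f in features if f.get("normalization") in DATA_DERIVED]
    if not needs_stats:
        return  # only raw / manual min_max features: nothing to do

    if role == "fit":
        # Derive statistics for the columns that need them and persist them.
        stats = {f["column"]: summary_stats[f["column"]] for f in needs_stats}
        Path(stats_file).write_text(json.dumps(stats))
    elif role == "apply":
        # Reuse the statistics fitted during preparation.
        stats = json.loads(Path(stats_file).read_text())
    else:
        raise ValueError(f"unknown normalization role: {role!r}")

    # In both modes the resolved statistics end up in the feature parameters.
    for feature in needs_stats:
        feature["stats"] = stats[feature["column"]]


summary = {"temp": {"min": -2.0, "max": 30.0, "mean": 12.5, "std": 6.0}}
stats_path = Path(tempfile.mkdtemp()) / "normalization.json"

fit_features = [
    {"column": "temp", "normalization": "auto_min_max"},
    {"column": "depth", "normalization": "raw"},
]
resolve_normalization("fit", fit_features, summary, stats_path)

apply_features = [{"column": "temp", "normalization": "standard"}]
resolve_normalization("apply", apply_features, None, stats_path)
```

Persisting the fitted values means classification scales its columns with the same statistics that were used during dataset preparation, rather than recomputing them from new data.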

default_file_name: str

The default pattern to use when writing feature files for each target.

drop_col_names: list[str]

Column names used for intermediate processing (e.g., to maintain matching references between positive and negative rows). These columns will be dropped from the final feature set.

extract_features(target_name, feature_info)[source]

Use a feature loader to retrieve and run a feature extraction process.

This method dynamically loads a feature extraction class based on the provided feature_info, passes the relevant data, and then executes the scaling and extraction steps defined within that class.

Parameters:
  • target_name (str) – The target for which features will be extracted.

  • feature_info (Dict) – A dictionary of feature extraction parameters.

Returns:

A DataFrame containing newly extracted or transformed features.

Return type:

DataFrame
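
Dynamic loading of a feature extractor from feature_info can be pictured with a small registry. The sketch below is hypothetical throughout: the registry, the decorator, the "feature" key, and the row representation (plain dicts rather than a Polars DataFrame) are assumptions chosen to keep the example self-contained.

```python
from typing import Callable, Dict, List

Rows = List[Dict[str, float]]

# Hypothetical registry: feature name -> callable that builds one column.
FEATURE_REGISTRY: Dict[str, Callable[[Rows, dict], List[float]]] = {}


def register(name: str):
    """Register an extractor under the name used in feature_info."""
    def decorator(func):
        FEATURE_REGISTRY[name] = func
        return func
    return decorator


@register("min_max_scaled")
def min_max_scaled(rows: Rows, params: dict) -> List[float]:
    # Scale one column into [0, 1] using the supplied bounds.
    lo, hi = params["min"], params["max"]
    return [(row[params["column"]] - lo) / (hi - lo) for row in rows]


def extract_feature(rows: Rows, feature_info: dict) -> List[float]:
    """Look up the extractor named in feature_info and run it on rows."""
    name = feature_info["feature"]
    if name not in FEATURE_REGISTRY:
        raise ValueError(f"unknown feature: {name!r}")
    return FEATURE_REGISTRY[name](rows, feature_info)


rows = [{"temp": 0.0}, {"temp": 10.0}, {"temp": 20.0}]
info = {"feature": "min_max_scaled", "column": "temp", "min": 0.0, "max": 20.0}
scaled = extract_feature(rows, info)
```

With this layout, adding a feature means registering one more callable; the dispatching code does not change.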

extract_target_features(target_name)[source]

Build the features for a specified target.

This method retrieves the relevant rows for the given target, extracts features using the configured feature information, and then joins them with essential metadata columns. Finally, it drops any specified temporary columns.

Parameters:

target_name (str) – The key identifying which target to process.

Return type:

None
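
The join-then-drop step can be sketched with plain Python rows. The column names ("row_id", "profile_id", "pair_id", "temp_scaled") and the helper itself are hypothetical; the documented method operates on Polars DataFrames and its join keys are not specified here.

```python
def build_target_features(selected, features, key, meta_cols, drop_cols):
    """Join extracted features to metadata columns, then drop temporary columns."""
    features_by_key = {row[key]: row for row in features}
    result = []
    for row in selected:
        # Keep only the metadata columns from the selected row.
        merged = {col: row[col] for col in meta_cols}
        # Attach the extracted features that share the same key.
        merged.update(features_by_key[row[key]])
        # Remove columns that were only needed for intermediate processing.
        result.append({k: v for k, v in merged.items() if k not in drop_cols})
    return result


selected = [
    {"row_id": 1, "profile_id": "p1", "pair_id": 7, "temp": 4.0},
    {"row_id": 2, "profile_id": "p1", "pair_id": 7, "temp": 5.0},
]
features = [
    {"row_id": 1, "temp_scaled": 0.2},
    {"row_id": 2, "temp_scaled": 0.25},
]
result = build_target_features(
    selected,
    features,
    key="row_id",
    meta_cols=["row_id", "profile_id", "pair_id"],
    drop_cols={"pair_id"},
)
```

Here "pair_id" plays the part of an intermediate column: it is carried through the join and removed before the final feature set is returned.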

feature_info: Dict

A dictionary of feature extraction parameters read from the config.

filtered_input: DataFrame | None
input_data: DataFrame | None
normalization_role: str = 'fit'

Determines how data-derived normalization (auto_min_max / standard) is handled. "fit" (the default, used during preparation) derives the normalization values from the dataset’s summary statistics and writes them to the normalization file. "apply" (used at classification time) loads the previously fitted values from that file instead. Subclasses can override this attribute to select the mode they need.

output_file_names: Dict[str, str]

A dictionary mapping target names to corresponding output Parquet file paths.

process_targets()[source]

Generate features for all targets found in the configuration.

Data-derived normalization is resolved first (see apply_normalization()), then features are generated for each target name returned by get_target_names().

Return type:

None

selected_profiles: DataFrame | None
selected_rows: Dict[str, DataFrame] | None

A dictionary of Polars DataFrames, one per target, holding the rows to be used for that target.

summary_stats: DataFrame | None

A Polars DataFrame of summary statistics that can optionally be used when scaling features.

target_features: Dict[str, DataFrame]

A dictionary mapping target names to DataFrames of extracted features.

write_target_features()[source]

Write the extracted features to their respective files.

Iterates through the target_features dictionary and writes each Polars DataFrame to a Parquet file, creating necessary directories.

Raises:

ValueError – If target_features is empty.

Return type:

None
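
The empty-check and directory creation described for write_target_features might look like the sketch below. The function names and the pluggable writer are hypothetical, and the example writes JSON with the standard library purely as a stand-in; the documented method writes Parquet files.

```python
import json
import tempfile
from pathlib import Path


def write_targets(target_features, output_file_names, write_one):
    """Write each target's features, creating parent directories first."""
    if not target_features:
        raise ValueError("target_features is empty; nothing to write")
    for target_name, features in target_features.items():
        path = Path(output_file_names[target_name])
        path.parent.mkdir(parents=True, exist_ok=True)
        write_one(features, path)


def write_json(features, path):
    # Stand-in writer for illustration only.
    path.write_text(json.dumps(features))


out_dir = Path(tempfile.mkdtemp())
names = {"temp": str(out_dir / "features" / "temp.json")}
write_targets({"temp": [{"x": 1.0}]}, names, write_json)
```

Raising on an empty dictionary surfaces the case where writing is attempted before any features have been generated, instead of silently producing no files.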