aiqclib.prepare.step3_select_profiles package

Submodules

aiqclib.prepare.step3_select_profiles.dataset_a module

This module defines the SelectDataSetA class for oceanographic profile selection.

The module provides functionality to categorize profiles as “positive” (containing errors) or “negative” (clean) based on Quality Control (QC) flags and pairs them based on temporal proximity for machine learning dataset construction.

class aiqclib.prepare.step3_select_profiles.dataset_a.SelectDataSetA(config, input_data=None)[source]

Bases: ProfileSelectionBase

Selects positive/negative profiles from Copernicus CTD data.

This class implements a strategy for labeling oceanographic profiles as “positive” (bad) or “negative” (good) based on their quality control (QC) flags. The main steps are:

  1. Select Positive Profiles: Identify profiles with at least one “bad” QC flag (e.g., a value of 4) in key sensor measurements.

  2. Select Negative Profiles: Identify profiles in which, for every key sensor, no measurement has a “bad” QC flag and at least one measurement has a “good” QC flag (e.g., a value of 1).

  3. Find Profile Pairs: For each positive profile, find the temporally closest negative profile to create a balanced and relevant dataset.

  4. Combine Data: Merge the labeled positive and negative profiles into a single DataFrame.

Variables:
  • expected_class_name (str) – The expected name of the class, used for configuration validation.

  • pos_profile_df (Optional[polars.DataFrame]) – DataFrame containing positively-labeled profiles.

  • neg_profile_df (Optional[polars.DataFrame]) – DataFrame containing negatively-labeled profiles.

  • key_col_names (List[str]) – Column names used as unique identifiers for profiles.

Parameters:
  • config (ConfigBase)

  • input_data (DataFrame | None)

expected_class_name: str = 'SelectDataSetA'
find_profile_pairs()[source]

Pair positive profiles with their temporally closest negative profile.

This method reduces the set of negative profiles to only those that are the nearest in time to a positive profile. This helps create a more balanced and comparable dataset for training or analysis.

This method updates pos_profile_df by adding label and neg_profile_id columns. It also updates neg_profile_df by filtering it to the matched profiles and adding corresponding labels.

Returns:

None

Return type:

None
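
The pairing idea can be illustrated with a minimal, dependency-free sketch. It uses plain Python lists of dicts rather than Polars DataFrames, and the record keys (profile_id, time), the helper name pair_with_nearest_negative, and the label values 1 and 0 are assumptions made for illustration, not the library's actual schema. Whether the real method lets several positive profiles share one negative profile is not stated here; this sketch allows it.

```python
from datetime import datetime


def pair_with_nearest_negative(positives, negatives):
    """Attach the temporally closest negative profile to each positive one.

    Both arguments are lists of dicts with the hypothetical keys
    'profile_id' and 'time' (a datetime).
    """
    paired_pos = []
    matched_neg_ids = set()
    for pos in positives:
        # Nearest negative profile by absolute time difference.
        nearest = min(negatives, key=lambda neg: abs(neg["time"] - pos["time"]))
        paired_pos.append(
            {**pos, "label": 1, "neg_profile_id": nearest["profile_id"]}
        )
        matched_neg_ids.add(nearest["profile_id"])
    # Keep only the negatives that were matched to at least one positive.
    paired_neg = [
        {**neg, "label": 0}
        for neg in negatives
        if neg["profile_id"] in matched_neg_ids
    ]
    return paired_pos, paired_neg


positives = [
    {"profile_id": 10, "time": datetime(2020, 1, 5)},
    {"profile_id": 11, "time": datetime(2020, 5, 20)},
]
negatives = [
    {"profile_id": 20, "time": datetime(2020, 1, 1)},
    {"profile_id": 21, "time": datetime(2020, 1, 7)},
    {"profile_id": 22, "time": datetime(2020, 6, 1)},
]
paired_pos, paired_neg = pair_with_nearest_negative(positives, negatives)
```

In this toy input, negative profile 20 is never the nearest neighbour of a positive profile, so it is dropped from the negative set.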

label_profiles()[source]

Execute the full profile selection and labeling workflow.

This method orchestrates the process by calling, in order:

  1. select_positive_profiles()

  2. select_negative_profiles()

  3. find_profile_pairs()

The final combined DataFrame of labeled profiles is stored in the selected_profiles attribute of the base class.

Returns:

None

Return type:

None

select_negative_profiles()[source]

Select profiles with consistently “good” QC flags.

A profile is considered “negative” (i.e., contains only good data) if, for every monitored parameter (e.g., temperature, salinity), none of its measurements have a “bad” flag and at least one has a “good” flag. The resulting unique profiles are stored in the neg_profile_df attribute.

Returns:

None

Return type:

None
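
The negative-profile rule can be written as a small predicate. This is a sketch in plain Python, not the library's Polars implementation; the flag values (1 for good, 4 for bad) follow the examples in this documentation, while the QC column names temp_qc and psal_qc and the function name is_negative_profile are hypothetical.

```python
GOOD_FLAG = 1  # "good" QC value, as in the examples above
BAD_FLAG = 4   # "bad" QC value, as in the examples above
QC_COLUMNS = ["temp_qc", "psal_qc"]  # hypothetical QC column names


def is_negative_profile(rows):
    """Return True if, for every monitored parameter, the profile has
    no bad flag and at least one good flag.

    'rows' is the list of observations (dicts) belonging to one profile.
    """
    for col in QC_COLUMNS:
        flags = [row[col] for row in rows]
        if BAD_FLAG in flags or GOOD_FLAG not in flags:
            return False
    return True


clean = [{"temp_qc": 1, "psal_qc": 1}, {"temp_qc": 1, "psal_qc": 2}]
flagged = [{"temp_qc": 1, "psal_qc": 4}, {"temp_qc": 1, "psal_qc": 1}]
no_good_salinity = [{"temp_qc": 1, "psal_qc": 2}]
```

Note that the third profile is rejected even though it has no bad flag: its salinity column contains no good flag, so the "at least one good" condition fails.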

select_positive_profiles()[source]

Select profiles with “bad” QC flags.

A profile is considered “positive” (i.e., contains errors) if any of its measurements have a QC flag defined as a positive flag in the configuration (e.g., a flag of 4). The resulting unique profiles are stored in the pos_profile_df attribute.

Returns:

None

Return type:

None
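
A minimal sketch of the positive-profile rule, again in plain Python rather than Polars. The set of positive flags is read from the configuration in the real class; here it is hard-coded as {4}, and the column names and the function name select_positive_profile_ids are assumptions for illustration.

```python
POSITIVE_FLAGS = {4}  # assumed; the real values come from the configuration
QC_COLUMNS = ["temp_qc", "psal_qc"]  # hypothetical QC column names


def select_positive_profile_ids(rows):
    """Return the sorted unique ids of profiles with at least one
    observation carrying a positive (bad) QC flag."""
    ids = {
        row["profile_id"]
        for row in rows
        if any(row[col] in POSITIVE_FLAGS for col in QC_COLUMNS)
    }
    return sorted(ids)


rows = [
    {"profile_id": 1, "temp_qc": 1, "psal_qc": 1},
    {"profile_id": 1, "temp_qc": 4, "psal_qc": 1},
    {"profile_id": 2, "temp_qc": 1, "psal_qc": 1},
    {"profile_id": 3, "temp_qc": 1, "psal_qc": 4},
]
positive_ids = select_positive_profile_ids(rows)
```

A single bad observation is enough to mark the whole profile as positive, which is why profile 1 is selected despite its first row being clean.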

aiqclib.prepare.step3_select_profiles.dataset_all module

Module for selecting and labeling oceanographic profiles based on QC flags.

This module defines the SelectDataSetAll class, which identifies “bad” (positive) and “good” (negative) profiles based on Quality Control (QC) criteria and prepares a labeled dataset for machine learning applications.

class aiqclib.prepare.step3_select_profiles.dataset_all.SelectDataSetAll(config, input_data=None)[source]

Bases: ProfileSelectionBase

Selects positive/negative profiles from Copernicus CTD data.

This class implements a strategy for labeling oceanographic profiles as “positive” (bad) or “negative” (good) based on their quality control (QC) flags.

Variables:
  • expected_class_name (str) – The expected name of the class for config validation.

  • pos_profile_df (Optional[polars.DataFrame]) – DataFrame containing positively-labeled profiles.

  • neg_profile_df (Optional[polars.DataFrame]) – DataFrame containing negatively-labeled profiles.

  • key_col_names (List[str]) – Column names used as unique identifiers for profiles.

Parameters:
  • config (ConfigBase)

  • input_data (DataFrame | None)

expected_class_name: str = 'SelectDataSetAll'
label_profiles()[source]

Execute the full profile selection and labeling workflow.

Orchestrates the identification of positive and negative profiles, then vertically stacks (concatenates row-wise) the two resulting DataFrames and stores the result in the selected_profiles attribute.

Return type:

None
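
The combine step amounts to labeling each group and appending one table to the other. The sketch below shows this with plain Python lists as a stand-in for a row-wise DataFrame concatenation; the function name combine_labeled, the label column name, and the label values 1 and 0 are assumptions.

```python
def combine_labeled(pos_profiles, neg_profiles):
    """Label positives with 1 and negatives with 0, then stack them."""
    labeled_pos = [{**p, "label": 1} for p in pos_profiles]
    labeled_neg = [{**p, "label": 0} for p in neg_profiles]
    combined = labeled_pos + labeled_neg
    # A row-wise stack only makes sense if every row has the same columns.
    columns = {tuple(sorted(row)) for row in combined}
    if len(columns) > 1:
        raise ValueError("positive and negative profiles have different columns")
    return combined


selected = combine_labeled(
    [{"profile_id": 1}],
    [{"profile_id": 2}, {"profile_id": 3}],
)
```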

select_negative_profiles()[source]

Select profiles with consistently “good” QC flags.

A profile is considered “negative” if, for every monitored parameter, none of its measurements have a “bad” flag and at least one measurement has a “good” flag. Results are stored in neg_profile_df.

Return type:

None

select_positive_profiles()[source]

Select profiles with “bad” QC flags.

A profile is considered “positive” if any of its measurements have a QC flag defined as a positive flag in the configuration. Results are stored in pos_profile_df.

Return type:

None

aiqclib.prepare.step3_select_profiles.select_base module

Module for profile selection and group labeling base class.

This module defines the abstract base class ProfileSelectionBase, which serves as a template for selecting and labeling profiles within the aiqclib framework. It integrates with configuration and dataset base classes to standardize the profile selection step and output.

class aiqclib.prepare.step3_select_profiles.select_base.ProfileSelectionBase(config, input_data=None)[source]

Bases: DataSetBase

Abstract base class for profile selection and group labeling.

Inherits from aiqclib.common.base.dataset_base.DataSetBase to leverage configuration handling and validation. Subclasses must implement the concrete logic for profile labeling.

Subclasses must define:

  • expected_class_name (a class attribute), if the subclass is intended to be instantiated; otherwise the base class raises an error.

  • A custom label_profiles() method that implements the specific profile selection and labeling logic.
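
The subclass contract can be sketched with Python's abc module. The classes below are hypothetical stand-ins and do not reproduce ProfileSelectionBase: the exception type raised for a missing expected_class_name, the use of a plain dict as config, and the list-of-dicts data are all assumptions made to keep the sketch self-contained.

```python
from abc import ABC, abstractmethod


class SelectionBaseSketch(ABC):
    """Hypothetical stand-in for ProfileSelectionBase."""

    expected_class_name = None  # concrete subclasses must override this

    def __init__(self, config, input_data=None):
        if self.expected_class_name is None:
            # The real base class raises an error here; its type is assumed.
            raise NotImplementedError("expected_class_name must be defined")
        self.config = config
        self.input_data = input_data
        self.selected_profiles = None

    @abstractmethod
    def label_profiles(self):
        """Populate self.selected_profiles with labeled profiles."""


class KeepAllSketch(SelectionBaseSketch):
    """Trivial strategy: label every input row as negative (0)."""

    expected_class_name = "KeepAllSketch"

    def label_profiles(self):
        rows = self.input_data or []
        self.selected_profiles = [{**row, "group_label": 0} for row in rows]


selector = KeepAllSketch(config={}, input_data=[{"profile_id": 1}])
selector.label_profiles()
```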

Variables:
  • default_file_name (str) – The default file name for selected profiles.

  • output_file_name (str) – The full path and name of the output Parquet file where selected profiles will be written.

  • input_data (Optional[polars.DataFrame]) – An optional Polars DataFrame used as initial data for profile selection.

  • selected_profiles (Optional[polars.DataFrame]) – A Polars DataFrame containing the profiles after selection and labeling, typically including a “group_label” column.

Parameters:
  • config (ConfigBase)

  • input_data (DataFrame | None)

default_file_name: str
input_data: DataFrame | None
abstractmethod label_profiles()[source]

Abstract method to be implemented by subclasses for labeling profiles to identify positive and negative groups.

Implementations of this method should perform the core logic for profile selection and labeling, assigning the resulting DataFrame (which should typically include a ‘group_label’ column) to the selected_profiles instance variable.

Return type:

None

output_file_name: str
selected_profiles: DataFrame | None
write_selected_profiles()[source]

Write the selected profiles to a Parquet file.

The output file path is determined by output_file_name. This method also ensures that the target directory exists before writing the file.

Raises:

ValueError – If the selected_profiles instance variable is None, indicating that no profiles have been selected or labeled before attempting to write.

Return type:

None
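
The guard-then-write behaviour described for write_selected_profiles() can be sketched as follows. This is not the library's code: it writes JSON as a dependency-free stand-in for the Parquet output, and the function name write_selected and the error message are assumptions.

```python
import json
import tempfile
from pathlib import Path


def write_selected(selected_profiles, output_file_name):
    """Write profiles to disk, creating the target directory if needed."""
    if selected_profiles is None:
        raise ValueError("selected_profiles is None; label profiles first")
    path = Path(output_file_name)
    # Ensure the target directory exists before writing.
    path.parent.mkdir(parents=True, exist_ok=True)
    # Stand-in for a Parquet write; JSON keeps the sketch dependency-free.
    path.write_text(json.dumps(selected_profiles))
    return path


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "nested" / "selected_profiles.json"
    written = write_selected([{"profile_id": 1, "label": 1}], target)
    file_was_created = written.exists()
```

Checking for None before touching the file system means a missing labeling step fails early and leaves no empty output directory behind.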