aiqclib.prepare.step3_select_profiles package
Submodules
aiqclib.prepare.step3_select_profiles.dataset_a module
This module defines the SelectDataSetA class for oceanographic profile selection.
The module provides functionality to categorize profiles as “positive” (containing errors) or “negative” (clean) based on Quality Control (QC) flags and pairs them based on temporal proximity for machine learning dataset construction.
- class aiqclib.prepare.step3_select_profiles.dataset_a.SelectDataSetA(config, input_data=None)[source]
Bases: ProfileSelectionBase
Selects positive/negative profiles from Copernicus CTD data.
This class implements a strategy for labeling oceanographic profiles as “positive” (bad) or “negative” (good) based on their quality control (QC) flags. The main steps are:
Select Positive Profiles: Identify profiles with at least one “bad” QC flag (e.g., a value of 4) in key sensor measurements.
Select Negative Profiles: Identify profiles where all measurements for all key sensors are “good” (e.g., a QC flag of 1).
Find Profile Pairs: For each positive profile, find the temporally closest negative profile to create a balanced and relevant dataset.
Combine Data: Merge the labeled positive and negative profiles into a single DataFrame.
- Variables:
expected_class_name (str) – The expected name of the class, used for configuration validation.
pos_profile_df (Optional[polars.DataFrame]) – DataFrame containing positively-labeled profiles.
neg_profile_df (Optional[polars.DataFrame]) – DataFrame containing negatively-labeled profiles.
key_col_names (List[str]) – Column names used as unique identifiers for profiles.
- Parameters:
config (ConfigBase)
input_data (DataFrame | None)
- expected_class_name: str = 'SelectDataSetA'
- find_profile_pairs()[source]
Pair positive profiles with their temporally closest negative profile.
This method reduces the set of negative profiles to only those that are the nearest in time to a positive profile. This helps create a more balanced and comparable dataset for training or analysis.
This method updates pos_profile_df by adding label and neg_profile_id columns. It also updates neg_profile_df by filtering it to the matched profiles and adding corresponding labels.
- Returns:
None
- Return type:
None
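The pairing idea can be sketched in plain Python. This is a minimal illustration, not the library's implementation: lists of dicts stand in for the Polars DataFrames, and the field names profile_id and profile_time as well as the label values 1 and 0 are assumptions made for the example.

```python
from datetime import datetime


def find_pairs(positives, negatives):
    """Pair each positive profile with the negative profile closest in time.

    Both arguments are lists of dicts with 'profile_id' and 'profile_time'
    keys (hypothetical names). 'negatives' must not be empty.
    """
    paired = []
    for pos in positives:
        # Nearest negative by absolute time difference; ties go to the first one.
        nearest = min(
            negatives,
            key=lambda neg: abs(neg["profile_time"] - pos["profile_time"]),
        )
        paired.append({**pos, "label": 1, "neg_profile_id": nearest["profile_id"]})

    # Keep only the negatives that were matched to at least one positive.
    matched_ids = {row["neg_profile_id"] for row in paired}
    kept = [{**neg, "label": 0} for neg in negatives if neg["profile_id"] in matched_ids]
    return paired, kept


positives = [{"profile_id": 1, "profile_time": datetime(2020, 1, 10)}]
negatives = [
    {"profile_id": 10, "profile_time": datetime(2020, 1, 1)},
    {"profile_id": 11, "profile_time": datetime(2020, 1, 9)},
    {"profile_id": 12, "profile_time": datetime(2020, 3, 1)},
]
paired, kept = find_pairs(positives, negatives)
```

In this sketch the single positive profile is matched to negative profile 11, and the other two negatives are dropped.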
- label_profiles()[source]
Execute the full profile selection and labeling workflow.
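This method orchestrates the process by calling, in order, select_positive_profiles(), select_negative_profiles(), and find_profile_pairs(), and then combining the labeled profiles.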
The final combined DataFrame of labeled profiles is stored in the selected_profiles attribute of the base class.
- Returns:
None
- Return type:
None
- select_negative_profiles()[source]
Select profiles with consistently “good” QC flags.
A profile is considered “negative” (i.e., contains only good data) if, for every monitored parameter (e.g., temperature, salinity), none of its measurements have a “bad” flag and at least one has a “good” flag. The resulting unique profiles are stored in the neg_profile_df attribute.
- Returns:
None
- Return type:
None
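The negative-profile rule (no bad flag and at least one good flag, for every monitored parameter) can be sketched as follows. This is an illustration under assumptions: plain Python rows stand in for the Polars DataFrame, the column names platform_code, profile_no, temp_qc, and psal_qc are hypothetical, and the flag values 1 (good) and 4 (bad) follow the examples given in this documentation.

```python
from collections import defaultdict


def select_negative(rows, flag_columns, good_flag=1, bad_flag=4):
    """Return sorted keys of profiles whose flags are consistently good."""
    # Collect the set of flag values seen per profile and per QC column.
    flags_by_profile = defaultdict(lambda: {col: set() for col in flag_columns})
    for row in rows:
        key = (row["platform_code"], row["profile_no"])
        for col in flag_columns:
            flags_by_profile[key][col].add(row[col])

    negatives = []
    for key, flags in flags_by_profile.items():
        # Every parameter needs at least one good flag and no bad flag.
        if all(bad_flag not in seen and good_flag in seen for seen in flags.values()):
            negatives.append(key)
    return sorted(negatives)


rows = [
    {"platform_code": "A", "profile_no": 1, "temp_qc": 1, "psal_qc": 1},
    {"platform_code": "A", "profile_no": 1, "temp_qc": 4, "psal_qc": 1},
    {"platform_code": "A", "profile_no": 2, "temp_qc": 1, "psal_qc": 1},
    {"platform_code": "A", "profile_no": 3, "temp_qc": 1, "psal_qc": 0},
]
negative_keys = select_negative(rows, ["temp_qc", "psal_qc"])
```

Profile 1 is rejected because of its bad temperature flag, and profile 3 is rejected because its salinity column has no good flag, so only profile 2 remains.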
- select_positive_profiles()[source]
Select profiles with “bad” QC flags.
A profile is considered “positive” (i.e., contains errors) if any of its measurements have a QC flag defined as a positive flag in the configuration (e.g., a flag of 4). The resulting unique profiles are stored in the pos_profile_df attribute.
- Returns:
None
- Return type:
None
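A minimal sketch of the positive-profile rule, again in plain Python rather than Polars. The column names and the key columns (platform_code, profile_no) are assumptions for the example; in the library the positive flags come from the configuration.

```python
def select_positive(rows, flag_columns, positive_flags):
    """Return sorted unique keys of profiles with at least one positive flag."""
    positives = set()
    for row in rows:
        # One flagged measurement in any monitored column is enough.
        if any(row[col] in positive_flags for col in flag_columns):
            positives.add((row["platform_code"], row["profile_no"]))
    return sorted(positives)


rows = [
    {"platform_code": "A", "profile_no": 1, "temp_qc": 1, "psal_qc": 1},
    {"platform_code": "A", "profile_no": 1, "temp_qc": 4, "psal_qc": 1},
    {"platform_code": "A", "profile_no": 2, "temp_qc": 1, "psal_qc": 1},
]
positive_keys = select_positive(rows, ["temp_qc", "psal_qc"], {4})
```

Using a set for the keys mirrors the "unique profiles" wording: a profile with several bad measurements is still reported once.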
aiqclib.prepare.step3_select_profiles.dataset_all module
Module for selecting and labeling oceanographic profiles based on QC flags.
This module defines the SelectDataSetAll class, which identifies
“bad” (positive) and “good” (negative) profiles based on Quality Control (QC)
criteria and prepares a labeled dataset for machine learning applications.
- class aiqclib.prepare.step3_select_profiles.dataset_all.SelectDataSetAll(config, input_data=None)[source]
Bases: ProfileSelectionBase
Selects positive/negative profiles from Copernicus CTD data.
This class implements a strategy for labeling oceanographic profiles as “positive” (bad) or “negative” (good) based on their quality control (QC) flags.
- Variables:
expected_class_name (str) – The expected name of the class for config validation.
pos_profile_df (Optional[polars.DataFrame]) – DataFrame containing positively-labeled profiles.
neg_profile_df (Optional[polars.DataFrame]) – DataFrame containing negatively-labeled profiles.
key_col_names (List[str]) – Column names used as unique identifiers for profiles.
- Parameters:
config (ConfigBase)
input_data (DataFrame | None)
- expected_class_name: str = 'SelectDataSetAll'
- label_profiles()[source]
Execute the full profile selection and labeling workflow.
Orchestrates the identification of positive and negative profiles and vstacks them into the selected_profiles attribute.
- Return type:
None
aiqclib.prepare.step3_select_profiles.select_base module
Module for profile selection and group labeling base class.
This module defines the abstract base class ProfileSelectionBase, which
serves as a template for selecting and labeling profiles within the aiqclib
framework. It integrates with configuration and dataset base classes to
standardize the profile selection step and output.
- class aiqclib.prepare.step3_select_profiles.select_base.ProfileSelectionBase(config, input_data=None)[source]
Bases: DataSetBase
Abstract base class for profile selection and group labeling.
Inherits from aiqclib.common.base.dataset_base.DataSetBase to leverage configuration handling and validation. Subclasses must implement the concrete logic for profile labeling.
Subclasses must define:
expected_class_name (a class attribute) if they are intended to be instantiated (otherwise an error is raised by the base class).
A custom label_profiles() method that implements the specific profile selection and labeling logic.
- Variables:
default_file_name (str) – The default file name for selected profiles.
output_file_name (str) – The full path and name of the output Parquet file where selected profiles will be written.
input_data (Optional[polars.DataFrame]) – An optional Polars DataFrame used as initial data for profile selection.
selected_profiles (Optional[polars.DataFrame]) – A Polars DataFrame containing the profiles after selection and labeling, typically including a “group_label” column.
- Parameters:
config (ConfigBase)
input_data (DataFrame | None)
- default_file_name: str
- input_data: DataFrame | None
- abstractmethod label_profiles()[source]
Abstract method to be implemented by subclasses for labeling profiles to identify positive and negative groups.
Implementations of this method should perform the core logic for profile selection and labeling, assigning the resulting DataFrame (which should typically include a ‘group_label’ column) to the selected_profiles instance variable.
- Return type:
None
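The subclassing contract can be illustrated with a self-contained stand-in. The classes below are hypothetical and do not import aiqclib: ProfileSelectionSketch only mimics the shape described here (an expected_class_name attribute, an abstract label_profiles(), and a selected_profiles attribute), and a list of dicts stands in for the Polars DataFrame.

```python
from abc import ABC, abstractmethod


class ProfileSelectionSketch(ABC):
    """Stand-in for ProfileSelectionBase; not the real aiqclib class."""

    expected_class_name: str = ""

    def __init__(self, input_data=None):
        self.input_data = input_data
        self.selected_profiles = None

    @abstractmethod
    def label_profiles(self) -> None:
        """Subclasses assign their labeled result to selected_profiles."""


class SelectEverything(ProfileSelectionSketch):
    """Hypothetical subclass that puts every input row into one group."""

    expected_class_name = "SelectEverything"

    def label_profiles(self) -> None:
        rows = self.input_data or []
        # Add the group_label column mentioned in the documentation.
        self.selected_profiles = [{**row, "group_label": "all"} for row in rows]


selector = SelectEverything(input_data=[{"profile_id": 1}, {"profile_id": 2}])
selector.label_profiles()
```

Because label_profiles() is abstract, instantiating the stand-in base class directly raises TypeError, which is the standard behavior of Python's abc module.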
- output_file_name: str
- selected_profiles: DataFrame | None
- write_selected_profiles()[source]
Write the selected profiles to a Parquet file.
The output file path is determined by output_file_name. This method also ensures that the target directory exists before writing the file.
- Raises:
ValueError – If the selected_profiles instance variable is None, indicating that no profiles have been selected or labeled before attempting to write.
- Return type:
None
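The guard-then-write behavior described above might look like the following sketch. It is an assumption-laden illustration, not the library's code: the function name write_selected is hypothetical, and JSON written with the standard library stands in for the Parquet output so the example runs without Polars.

```python
import json
import tempfile
from pathlib import Path


def write_selected(selected_profiles, output_file_name):
    """Write profiles to output_file_name, creating parent folders first."""
    if selected_profiles is None:
        # Nothing has been selected or labeled yet.
        raise ValueError("selected_profiles is None; run label_profiles() first.")

    path = Path(output_file_name)
    # Ensure the target directory exists before writing.
    path.parent.mkdir(parents=True, exist_ok=True)
    # Stand-in for the Parquet write performed by the real method.
    path.write_text(json.dumps(selected_profiles))
    return path


with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "nested" / "selected_profiles.json"
    written = write_selected([{"profile_id": 1, "group_label": "all"}], target)
    file_was_written = written.exists()
```

Checking for None before touching the file system means a failed call leaves no empty directories or partial files behind.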