aiqclib.prepare.step4_select_rows package
Submodules
aiqclib.prepare.step4_select_rows.dataset_a module
Module for locating specific data rows within oceanographic datasets.
This module defines the LocateDataSetA class, which identifies positive
(bad quality) and negative (good quality) observations from oceanographic
profiles. It facilitates the creation of paired datasets for machine
learning by aligning observations based on profile and pressure proximity.
- class aiqclib.prepare.step4_select_rows.dataset_a.LocateDataSetA(config, input_data=None, selected_profiles=None)[source]
Bases: LocatePositionBase
A subclass of aiqclib.prepare.step4_select_rows.locate_base.LocatePositionBase that locates both positive and negative rows from BO NRT+Cora test data for training or evaluation purposes.
The workflow involves:
Selecting rows that have “bad” QC flags (positive examples).
Selecting rows that have “good” QC flags (negative examples).
Aligning these two sets to form paired data examples, often based on proximity in profile and pressure.
Concatenating and labeling them for subsequent steps in a machine learning pipeline.
- Parameters:
config (ConfigBase)
input_data (DataFrame | None)
selected_profiles (DataFrame | None)
- expected_class_name: str = 'LocateDataSetA'
- locate_target_rows(target_name, target_value)[source]
Locate training data rows by consolidating positive and negative subsets. This method first calls select_positive_rows() and select_negative_rows() to gather the respective dataframes, then stacks them, adds a unique row index, and creates a pair_id for linking paired observations.
- Parameters:
target_name (str) – Name of the target variable (e.g., ‘TEMP_QC’).
target_value (Dict) – A dictionary of target metadata, including the QC flag variable name used for both positive and negative selection.
- Return type:
None
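The consolidation step can be sketched without any third-party library by treating rows as plain dictionaries instead of Polars DataFrames. The column names `row_id`, `pair_id`, and `label`, and the rule that the i-th negative row is the partner of the i-th positive row, are illustrative assumptions and not the library's actual schema.

```python
from typing import Dict, List

Row = Dict[str, object]


def stack_with_pair_ids(positive: List[Row], negative: List[Row]) -> List[Row]:
    """Stack positive and negative rows, adding row_id and pair_id columns.

    Assumption: negative[i] was selected as the partner of positive[i].
    """
    stacked: List[Row] = []
    for i, row in enumerate(positive):
        stacked.append({**row, "label": 1, "pair_id": i})
    for i, row in enumerate(negative):
        stacked.append({**row, "label": 0, "pair_id": i})
    # Assign a unique, 1-based index over the stacked rows.
    return [{"row_id": n, **row} for n, row in enumerate(stacked, start=1)]


positive = [{"profile_id": 10, "pres": 5.0}, {"profile_id": 10, "pres": 50.0}]
negative = [{"profile_id": 11, "pres": 5.5}, {"profile_id": 11, "pres": 49.0}]
rows = stack_with_pair_ids(positive, negative)
```

Rows that share a `pair_id` can later be joined back together, which is the purpose the documentation gives for that column.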
- negative_rows: Dict[str, DataFrame]
Dictionary for holding subsets of negative rows keyed by target name.
- positive_rows: Dict[str, DataFrame]
Dictionary for holding subsets of positive rows keyed by target name.
- select_negative_rows(target_name, target_value)[source]
Identify and collect negative rows that align with positive rows, forming pairs where possible. Negative rows are typically “good” observations from nearby profiles, matched by pressure.
- Parameters:
target_name (str) – The target name used to locate the corresponding positive rows.
target_value (Dict) – A dictionary of target metadata, including the QC flag variable name used for selecting negative observations (e.g., flag=1 or any “good” flag).
- Return type:
None
- select_negative_rows_closest_day(target_name, target_value)[source]
Identify and collect negative rows that align with positive rows, forming pairs where possible. Negative rows are typically “good” observations from nearby profiles, matched by pressure.
The alignment process involves:
Selecting positive rows.
Joining with negative profiles.
Joining with the full input data to get observation details.
Calculating pressure differences with corresponding positive observations.
Selecting the negative observation that best matches in pressure for each positive observation to form a pair.
- Parameters:
target_name (str) – The target name used to locate the corresponding positive rows.
target_value (Dict) – A dictionary of target metadata, including the QC flag variable name used for selecting negative observations (e.g., flag=1 or any “good” flag).
- Return type:
None
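The last two alignment steps (computing pressure differences and keeping the best match) can be illustrated with a small, library-free sketch. It assumes rows are plain dictionaries with a hypothetical `pres` column. The real method works on Polars DataFrames and may apply further criteria that are not shown here.

```python
from typing import Dict, List

Row = Dict[str, float]


def closest_by_pressure(positive: List[Row], candidates: List[Row]) -> List[Row]:
    """For each positive row, pick the candidate with the smallest absolute pressure difference."""
    matched = []
    for pos in positive:
        best = min(candidates, key=lambda c: abs(c["pres"] - pos["pres"]))
        matched.append({**best, "pres_diff": abs(best["pres"] - pos["pres"])})
    return matched


positive = [{"pres": 10.0}, {"pres": 100.0}]
candidates = [{"pres": 8.0}, {"pres": 11.5}, {"pres": 97.0}]
pairs = closest_by_pressure(positive, candidates)
```

Here the positive observation at 10.0 is paired with the candidate at 11.5, and the one at 100.0 with the candidate at 97.0.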
- select_negative_rows_neighbor_n(target_name, target_value)[source]
Identify and collect negative rows that align with positive rows, forming pairs where possible. Negative rows are typically “good” observations from nearby profiles, matched by pressure.
The alignment process involves:
Selecting positive profiles.
Generating neighbouring observation numbers.
Joining with the full input data to get observation details.
Selecting the negative observations.
- Parameters:
target_name (str) – The target name used to locate the corresponding positive rows.
target_value (Dict) – A dictionary of target metadata, including the QC flag variable name used for selecting negative observations (e.g., flag=1 or any “good” flag).
- Return type:
None
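One plausible reading of "generating neighbouring observation numbers" is sketched below. It assumes observation numbers are 1-based and contiguous within a profile, and that a window of `n` observations on each side is considered. The helper name, the window rule, and the flag values are hypothetical and are not taken from the library.

```python
from typing import List


def neighbour_observation_numbers(obs_no: int, n: int, max_obs_no: int) -> List[int]:
    """Return observation numbers within n steps of obs_no, excluding obs_no itself.

    Assumption: observation numbers are 1-based and contiguous up to max_obs_no.
    """
    low = max(1, obs_no - n)
    high = min(max_obs_no, obs_no + n)
    return [k for k in range(low, high + 1) if k != obs_no]


# Hypothetical QC flags for one profile, keyed by observation number.
flags = {1: 1, 2: 1, 3: 4, 4: 4, 5: 1}
neighbours = neighbour_observation_numbers(obs_no=3, n=2, max_obs_no=5)
# Keep only neighbours carrying a "good" flag (assumed here to be 1).
negatives = [k for k in neighbours if flags[k] == 1]
```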
- select_positive_rows(target_name, target_value)[source]
Identify and collect positive rows for a given target. Positive rows are defined as observations within profiles that have a specific “bad” QC flag.
- Parameters:
target_name (str) – The name (key) of the target in the config’s target dictionary.
target_value (Dict) – A dictionary of target metadata, including the QC flag variable name that indicates a “bad” observation (e.g., flag=4).
- Return type:
None
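Flag-based selection of positive rows can be sketched as a simple filter. The example assumes rows are plain dictionaries, that the target metadata stores the QC column name under a hypothetical key `flag`, and that 4 marks a "bad" observation, as in the example above.

```python
from typing import Dict, Iterable, List


def select_rows_by_flag(rows: List[Dict], flag_column: str, flags: Iterable[int]) -> List[Dict]:
    """Return the rows whose QC flag in flag_column is one of flags."""
    wanted = set(flags)
    return [row for row in rows if row[flag_column] in wanted]


observations = [
    {"profile_id": 1, "pres": 5.0, "temp_qc": 1},
    {"profile_id": 1, "pres": 10.0, "temp_qc": 4},
    {"profile_id": 2, "pres": 5.0, "temp_qc": 1},
]
# Hypothetical target metadata: the key "flag" names the QC column.
target_value = {"flag": "temp_qc"}
positive = select_rows_by_flag(observations, target_value["flag"], [4])
```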
aiqclib.prepare.step4_select_rows.dataset_all module
This module provides the LocateDataSetAll class, which is used for identifying and extracting positive and negative data rows from oceanographic profiles. It filters observations based on quality control (QC) flags to prepare datasets for machine learning training or evaluation tasks.
- class aiqclib.prepare.step4_select_rows.dataset_all.LocateDataSetAll(config, input_data=None, selected_profiles=None)[source]
Bases: LocatePositionBase
A subclass of aiqclib.prepare.step4_select_rows.locate_base.LocatePositionBase that locates both positive and negative rows from data for training or evaluation purposes.
The workflow involves:
Selecting rows that have “bad” QC flags (positive examples).
Selecting rows that have “good” QC flags (negative examples).
Concatenating and labeling them for subsequent steps in a machine learning pipeline.
- Parameters:
config (ConfigBase)
input_data (DataFrame | None)
selected_profiles (DataFrame | None)
- expected_class_name: str = 'LocateDataSetAll'
- locate_target_rows(target_name, target_value)[source]
Locate target rows for training or evaluation by calling select_all_rows.
- Parameters:
target_name (str) – Name of the target variable.
target_value (Dict) – A dictionary of target metadata used for labeling.
- Return type:
None
- select_all_rows(target_name, target_value)[source]
Collect all rows for a specified target by applying flag-based labeling to each record.
- Parameters:
target_name (str) – The name (key) of the target in the configuration’s target dictionary.
target_value (Dict) – A dictionary of target metadata, including QC flag names and values.
- Raises:
ValueError – If the internal input_data attribute is None.
- Return type:
None
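A minimal sketch of flag-based labeling over every record, including the documented ValueError when no input data is available. The function name, the `label` column, and the choice of 4 as the "bad" flag are assumptions made for illustration. The real method operates on a Polars DataFrame.

```python
from typing import Dict, List, Optional, Set


def label_all_rows(
    input_data: Optional[List[Dict]], flag_column: str, bad_flags: Set[int]
) -> List[Dict]:
    """Keep every row and add a label: 1 for a bad flag, 0 otherwise."""
    if input_data is None:
        raise ValueError("input_data must not be None")
    return [{**row, "label": int(row[flag_column] in bad_flags)} for row in input_data]


records = [{"pres": 5.0, "temp_qc": 1}, {"pres": 10.0, "temp_qc": 4}]
labelled = label_all_rows(records, "temp_qc", {4})
```

Unlike the paired approach of LocateDataSetA, no rows are dropped here: each record is kept and only receives a label.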
aiqclib.prepare.step4_select_rows.locate_base module
This module defines the abstract base class LocatePositionBase for identifying
and extracting specific rows from a dataset based on defined target criteria.
It provides a structured approach for processing different targets, typically for purposes like creating training datasets or selecting specific data subsets, leveraging configuration settings and handling data I/O.
- class aiqclib.prepare.step4_select_rows.locate_base.LocatePositionBase(config, input_data=None, selected_profiles=None)[source]
Bases: DataSetBase
Abstract base class for locating and extracting target rows from a dataset.
This class extends aiqclib.common.base.dataset_base.DataSetBase to validate that the YAML configuration matches the expected structure and to provide a framework for operations related to identifying rows of interest (e.g., training data). Subclasses must implement:
The locate_target_rows() method for per-target row identification logic.
Subclasses may also need to define expected_class_name if they are intended to be instantiated directly and matched against the YAML’s base_class configuration.
- Parameters:
config (ConfigBase)
input_data (DataFrame | None)
selected_profiles (DataFrame | None)
- default_file_name: str
Default file name template for writing target rows (one file per target). The {target_name} placeholder will be replaced.
- Type:
str
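Placeholder substitution of this kind is commonly done with `str.format`. The template string and target names below are hypothetical, and the library may build its paths differently.

```python
# Hypothetical template; the library's actual default is not shown in this reference.
default_file_name = "selected_rows_{target_name}.parquet"
targets = ["temp", "psal"]

# One output file name per target, with the placeholder filled in.
output_file_names = {t: default_file_name.format(target_name=t) for t in targets}
```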
- input_data: DataFrame | None
An optional Polars DataFrame from which target rows will be extracted. This is the primary input dataset.
- Type:
Optional[polars.DataFrame]
- abstractmethod locate_target_rows(target_name, target_value)[source]
Abstract method to locate rows in input_data or selected_profiles relevant to a specific target.
Subclasses must implement this method to define the specific logic for identifying and extracting the subset of rows matching the criteria defined by the target. The identified rows for the given target should be stored in the selected_rows dictionary under the target_name key.
- Parameters:
target_name (str) – The name of the target variable (e.g., ‘training_data’).
target_value (Dict) – A dictionary containing metadata or specific criteria for the target, as defined in the configuration.
- Return type:
None
- output_file_names: Dict[str, str]
Dictionary mapping each target name to the corresponding output Parquet file path derived from the configuration.
- Type:
Dict[str, str]
- process_targets()[source]
Iterate over all defined targets and call locate_target_rows() on each.
This method retrieves the target definitions (names and other metadata) from the configuration object (config) and then sequentially processes each target. The concrete logic for identifying rows per target is implemented in subclasses via the abstract locate_target_rows() method.
- Return type:
None
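The interplay between process_targets() and the abstract locate_target_rows() follows the template-method pattern, which can be sketched with the standard `abc` module. `LocateSketchBase` and `LocateEverything` are hypothetical stand-ins, and the targets are passed in as a plain dictionary rather than read from a ConfigBase object.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class LocateSketchBase(ABC):
    """Hypothetical stand-in for LocatePositionBase; not the library class."""

    def __init__(self, targets: Dict[str, Dict]) -> None:
        self.targets = targets
        self.selected_rows: Dict[str, List[Dict]] = {}

    @abstractmethod
    def locate_target_rows(self, target_name: str, target_value: Dict) -> None:
        """Store the rows for one target in self.selected_rows[target_name]."""

    def process_targets(self) -> None:
        # The base class drives the loop; subclasses supply the per-target logic.
        for target_name, target_value in self.targets.items():
            self.locate_target_rows(target_name, target_value)


class LocateEverything(LocateSketchBase):
    def locate_target_rows(self, target_name: str, target_value: Dict) -> None:
        self.selected_rows[target_name] = [{"target": target_name, **target_value}]


locator = LocateEverything({"temp": {"flag": "temp_qc"}, "psal": {"flag": "psal_qc"}})
locator.process_targets()
```

Because the method is marked abstract, instantiating `LocateSketchBase` directly raises a TypeError, so each concrete locator has to provide its own row-selection logic.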
- selected_profiles: DataFrame | None
An optional Polars DataFrame of pre-selected profiles or rows that might be combined with the input data during the target-location process, or used as a filter.
- Type:
Optional[polars.DataFrame]
- selected_rows: Dict[str, DataFrame]
A dictionary to store the resulting target rows for each target as a Polars DataFrame, keyed by target name.
- Type:
Dict[str, polars.DataFrame]
- write_selected_rows()[source]
Write the identified target rows to separate Parquet files.
This method iterates through the selected_rows dictionary. For each target, it constructs the output file path using the template defined in output_file_names and writes the corresponding Polars DataFrame to a Parquet file. Directories are created if they do not exist.
- Raises:
ValueError – If the selected_rows dictionary is empty, indicating that no target rows have been identified or processed.
- Return type:
None