aiqclib.prepare.step4_select_rows package

Submodules

aiqclib.prepare.step4_select_rows.dataset_a module

Module for locating specific data rows within oceanographic datasets.

This module defines the LocateDataSetA class, which identifies positive (bad quality) and negative (good quality) observations from oceanographic profiles. It facilitates the creation of paired datasets for machine learning by aligning observations based on profile and pressure proximity.

class aiqclib.prepare.step4_select_rows.dataset_a.LocateDataSetA(config, input_data=None, selected_profiles=None)[source]

Bases: LocatePositionBase

A subclass of aiqclib.prepare.step4_select_rows.locate_base.LocatePositionBase that locates both positive and negative rows from BO NRT+Cora test data for training or evaluation purposes.

The workflow involves:

  • Selecting rows that have “bad” QC flags (positive examples).

  • Selecting rows that have “good” QC flags (negative examples).

  • Aligning these two sets to form paired data examples, often based on proximity in profile and pressure.

  • Concatenating and labeling them for subsequent steps in a machine learning pipeline.

Parameters:
  • config (ConfigBase)

  • input_data (DataFrame | None)

  • selected_profiles (DataFrame | None)

expected_class_name: str = 'LocateDataSetA'

locate_target_rows(target_name, target_value)[source]

Locate training data rows by consolidating positive and negative subsets. This method first calls select_positive_rows() and select_negative_rows() to gather the respective dataframes, then stacks them, adds a unique row index, and creates a pair_id for linking paired observations.
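The stacking and pairing step can be illustrated with a minimal, dependency-free sketch. It uses plain Python dictionaries instead of Polars DataFrames, and every name in it (profile_id, observation_no, matched_to, row_id, stack_with_pair_ids) is a hypothetical stand-in rather than the library's actual schema or API.

```python
# Hypothetical rows; the column names are illustrative, not the library's schema.
positive_rows = [
    {"profile_id": 10, "observation_no": 3, "label": 1},
    {"profile_id": 12, "observation_no": 7, "label": 1},
]
negative_rows = [
    {"profile_id": 11, "observation_no": 3, "label": 0, "matched_to": (10, 3)},
    {"profile_id": 13, "observation_no": 6, "label": 0, "matched_to": (12, 7)},
]


def stack_with_pair_ids(positives, negatives):
    """Stack both subsets, then add a shared pair_id and a unique row_id."""
    # Each positive row gets its own pair_id, keyed by (profile, observation).
    pair_ids = {
        (row["profile_id"], row["observation_no"]): i
        for i, row in enumerate(positives, start=1)
    }
    stacked = []
    for row in positives:
        key = (row["profile_id"], row["observation_no"])
        stacked.append({**row, "pair_id": pair_ids[key]})
    for row in negatives:
        # A negative row inherits the pair_id of the positive it was matched to.
        kept = {k: v for k, v in row.items() if k != "matched_to"}
        stacked.append({**kept, "pair_id": pair_ids[row["matched_to"]]})
    # The row_id runs over the stacked result, so it is unique across subsets.
    return [{"row_id": i, **row} for i, row in enumerate(stacked, start=1)]


training_rows = stack_with_pair_ids(positive_rows, negative_rows)
```

In this sketch a positive row and its matched negative row share a pair_id, while row_id stays unique across the stacked result.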

Parameters:
  • target_name (str) – Name of the target variable (e.g., ‘TEMP_QC’).

  • target_value (Dict) – A dictionary of target metadata, including the QC flag variable name used for both positive and negative selection.

Return type:

None

negative_rows: Dict[str, DataFrame]

Dictionary for holding subsets of negative rows keyed by target name.

positive_rows: Dict[str, DataFrame]

Dictionary for holding subsets of positive rows keyed by target name.

select_negative_rows(target_name, target_value)[source]

Identify and collect negative rows that align with positive rows, forming pairs where possible. Negative rows are typically “good” observations from nearby profiles, matched by pressure.

Parameters:
  • target_name (str) – The target name used to locate the corresponding positive rows.

  • target_value (Dict) – A dictionary of target metadata, including the QC flag variable name used for selecting negative observations (e.g., flag=1 or any “good” flag).

Return type:

None

select_negative_rows_closest_day(target_name, target_value)[source]

Identify and collect negative rows that align with positive rows, forming pairs where possible. Negative rows are typically “good” observations from nearby profiles, matched by pressure.

The alignment process involves:

  1. Selecting positive rows.

  2. Joining with negative profiles.

  3. Joining with the full input data to get observation details.

  4. Calculating pressure differences with corresponding positive observations.

  5. Selecting the negative observation that best matches in pressure for each positive observation to form a pair.
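Steps 4 and 5 can be sketched as follows, assuming steps 1 to 3 have already produced a pool of candidate “good” observations for each positive row. The sketch uses plain Python rather than Polars, and the names pair_id, pres, and closest_in_pressure are hypothetical.

```python
# Hypothetical positive observations, one per pair.
positives = [
    {"pair_id": 1, "pres": 10.0},
    {"pair_id": 2, "pres": 250.0},
]
# Hypothetical candidate "good" observations from the negative profile of each pair.
candidates = [
    {"pair_id": 1, "observation_no": 1, "pres": 5.0},
    {"pair_id": 1, "observation_no": 2, "pres": 11.5},
    {"pair_id": 2, "observation_no": 1, "pres": 200.0},
    {"pair_id": 2, "observation_no": 2, "pres": 248.0},
    {"pair_id": 2, "observation_no": 3, "pres": 300.0},
]


def closest_in_pressure(positives, candidates):
    """For each positive row, keep the candidate with the smallest pressure gap."""
    matches = {}
    for pos in positives:
        pool = [c for c in candidates if c["pair_id"] == pos["pair_id"]]
        if not pool:
            continue  # no candidate available, so no pair is formed
        matches[pos["pair_id"]] = min(
            pool, key=lambda c: abs(c["pres"] - pos["pres"])
        )
    return matches


matches = closest_in_pressure(positives, candidates)
```

A positive row without any candidate is left unpaired here, which is one possible reading of “forming pairs where possible”.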

Parameters:
  • target_name (str) – The target name used to locate the corresponding positive rows.

  • target_value (Dict) – A dictionary of target metadata, including the QC flag variable name used for selecting negative observations (e.g., flag=1 or any “good” flag).

Return type:

None

select_negative_rows_neighbor_n(target_name, target_value)[source]

Identify and collect negative rows that align with positive rows, forming pairs where possible. Negative rows are typically “good” observations at observation numbers neighbouring a positive observation.

The alignment process involves:

  1. Selecting positive profiles.

  2. Generating neighbouring observation numbers.

  3. Joining with the full input data to get observation details.

  4. Selecting the negative observations.
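The steps above can be sketched in plain Python. The window width, the assumption that observation numbers start at 1, and all names used here (neighbor_numbers, select_neighbor_negatives, the flag lookup table) are illustrative assumptions, not the library's implementation.

```python
def neighbor_numbers(observation_no, width=1):
    """Observation numbers within `width` of the given one, excluding itself.

    Assumes observation numbers start at 1 (an assumption of this sketch).
    """
    return [
        observation_no + offset
        for offset in range(-width, width + 1)
        if offset != 0 and observation_no + offset >= 1
    ]


# Hypothetical input data: (profile_id, observation_no) -> QC flag.
qc_flags = {
    (10, 1): 1,
    (10, 2): 1,
    (10, 3): 4,  # the positive ("bad") observation
    (10, 4): 4,
    (10, 5): 1,
}


def select_neighbor_negatives(profile_id, observation_no, good_flag=1, width=1):
    """Keep only those neighbours that exist and carry the "good" flag."""
    return [
        (profile_id, n)
        for n in neighbor_numbers(observation_no, width)
        if qc_flags.get((profile_id, n)) == good_flag
    ]


negatives = select_neighbor_negatives(10, 3)
```

With a width of 1, the neighbours of observation 3 are observations 2 and 4; only observation 2 is kept because observation 4 is itself flagged as bad.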

Parameters:
  • target_name (str) – The target name used to locate the corresponding positive rows.

  • target_value (Dict) – A dictionary of target metadata, including the QC flag variable name used for selecting negative observations (e.g., flag=1 or any “good” flag).

Return type:

None

select_positive_rows(target_name, target_value)[source]

Identify and collect positive rows for a given target. Positive rows are defined as observations within profiles that have a specific “bad” QC flag.
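As a rough illustration, positive-row selection amounts to a filter on the QC flag column. The sketch below uses plain Python; the "flag" key in target_value, the temp_qc column, and the select_bad_rows helper are assumptions made for this example only.

```python
# Hypothetical observations; temp_qc holds the QC flag for temperature.
observations = [
    {"profile_id": 10, "observation_no": 1, "temp_qc": 1},
    {"profile_id": 10, "observation_no": 2, "temp_qc": 4},
    {"profile_id": 11, "observation_no": 1, "temp_qc": 1},
    {"profile_id": 12, "observation_no": 5, "temp_qc": 4},
]
# Hypothetical target metadata naming the QC flag variable.
target_value = {"flag": "temp_qc"}


def select_bad_rows(rows, target_value, bad_flag=4):
    """Keep rows whose QC flag equals the "bad" value and label them 1."""
    flag_column = target_value["flag"]
    return [dict(row, label=1) for row in rows if row[flag_column] == bad_flag]


positive_rows = select_bad_rows(observations, target_value)
```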

Parameters:
  • target_name (str) – The name (key) of the target in the config’s target dictionary.

  • target_value (Dict) – A dictionary of target metadata, including the QC flag variable name that indicates a “bad” observation (e.g., flag=4).

Return type:

None

aiqclib.prepare.step4_select_rows.dataset_all module

This module provides the LocateDataSetAll class, which is used for identifying and extracting positive and negative data rows from oceanographic profiles. It filters observations based on quality control (QC) flags to prepare datasets for machine learning training or evaluation tasks.

class aiqclib.prepare.step4_select_rows.dataset_all.LocateDataSetAll(config, input_data=None, selected_profiles=None)[source]

Bases: LocatePositionBase

A subclass of aiqclib.prepare.step4_select_rows.locate_base.LocatePositionBase that locates both positive and negative rows from data for training or evaluation purposes.

The workflow involves:

  • Selecting rows that have “bad” QC flags (positive examples).

  • Selecting rows that have “good” QC flags (negative examples).

  • Concatenating and labeling them for subsequent steps in a machine learning pipeline.

Parameters:
  • config (ConfigBase)

  • input_data (DataFrame | None)

  • selected_profiles (DataFrame | None)

expected_class_name: str = 'LocateDataSetAll'

locate_target_rows(target_name, target_value)[source]

Locate target rows for training or evaluation by calling select_all_rows().

Parameters:
  • target_name (str) – Name of the target variable.

  • target_value (Dict) – A dictionary of target metadata used for labeling.

Return type:

None

select_all_rows(target_name, target_value)[source]

Collect all rows for a specified target by applying flag-based labeling to each record.
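A minimal sketch of flag-based labeling, assuming that a “bad” flag value of 4 maps to label 1 and every other value to label 0. The helper name label_all_rows and the temp_qc column are hypothetical, and plain Python lists stand in for Polars DataFrames.

```python
def label_all_rows(input_data, flag_column, bad_flags=(4,)):
    """Return every record with a 0/1 label derived from its QC flag."""
    if input_data is None:
        # Mirrors the documented ValueError for missing input data.
        raise ValueError("input_data is None; nothing to label")
    return [
        dict(row, label=int(row[flag_column] in bad_flags))
        for row in input_data
    ]


# Hypothetical records for a single profile.
records = [
    {"observation_no": 1, "temp_qc": 1},
    {"observation_no": 2, "temp_qc": 4},
    {"observation_no": 3, "temp_qc": 1},
]
labeled = label_all_rows(records, "temp_qc")
```

Unlike the paired approach of LocateDataSetA, no row is dropped in this sketch: each record is kept and only gains a label.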

Parameters:
  • target_name (str) – The name (key) of the target in the configuration’s target dictionary.

  • target_value (Dict) – A dictionary of target metadata, including QC flag names and values.

Raises:

ValueError – If the internal input_data attribute is None.

Return type:

None

aiqclib.prepare.step4_select_rows.locate_base module

This module defines the abstract base class LocatePositionBase for identifying and extracting specific rows from a dataset based on defined target criteria.

It provides a structured approach for processing different targets, typically for purposes like creating training datasets or selecting specific data subsets, leveraging configuration settings and handling data I/O.

class aiqclib.prepare.step4_select_rows.locate_base.LocatePositionBase(config, input_data=None, selected_profiles=None)[source]

Bases: DataSetBase

Abstract base class for locating and extracting target rows from a dataset.

This class extends aiqclib.common.base.dataset_base.DataSetBase to validate that the YAML configuration matches the expected structure and to provide a framework for identifying rows of interest (e.g., training data). Subclasses are expected to:

  • Implement the locate_target_rows() method, which holds the per-target row identification logic.

  • Define expected_class_name if the subclass is intended to be instantiated directly and matched against the YAML’s base_class configuration.
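The subclassing contract can be sketched with a self-contained stand-in. LocateBaseSketch below is not the library's LocatePositionBase: it omits configuration validation and file handling, takes a plain targets dictionary instead of a ConfigBase object, and uses lists of dictionaries instead of Polars DataFrames. All names are hypothetical.

```python
from abc import ABC, abstractmethod


class LocateBaseSketch(ABC):
    """Simplified stand-in for LocatePositionBase (an assumption of this sketch)."""

    expected_class_name = None

    def __init__(self, targets, input_data=None):
        self.targets = targets          # target name -> target metadata
        self.input_data = input_data    # rows to search
        self.selected_rows = {}         # target name -> located rows

    @abstractmethod
    def locate_target_rows(self, target_name, target_value):
        """Store the rows for one target in self.selected_rows."""

    def process_targets(self):
        # Call the subclass logic once per configured target.
        for target_name, target_value in self.targets.items():
            self.locate_target_rows(target_name, target_value)


class LocateBadRows(LocateBaseSketch):
    """Hypothetical subclass that keeps rows whose QC flag equals 4."""

    expected_class_name = "LocateBadRows"

    def locate_target_rows(self, target_name, target_value):
        flag = target_value["flag"]
        self.selected_rows[target_name] = [
            row for row in self.input_data if row[flag] == 4
        ]


locator = LocateBadRows(
    targets={"temp": {"flag": "temp_qc"}},
    input_data=[{"temp_qc": 1}, {"temp_qc": 4}],
)
locator.process_targets()
```

Because locate_target_rows() is abstract, instantiating the base class directly raises TypeError, so only subclasses that implement it can be used.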

Parameters:
  • config (ConfigBase)

  • input_data (DataFrame | None)

  • selected_profiles (DataFrame | None)

default_file_name: str

Default file name template for writing target rows (one file per target). The {target_name} placeholder will be replaced.

Type:

str

input_data: DataFrame | None

An optional Polars DataFrame from which target rows will be extracted. This is the primary input dataset.

Type:

Optional[polars.DataFrame]

abstractmethod locate_target_rows(target_name, target_value)[source]

Abstract method to locate rows in input_data or selected_profiles relevant to a specific target.

Subclasses must implement this method to define the specific logic for identifying and extracting the subset of rows matching the criteria defined by the target. The identified rows for the given target should be stored in the selected_rows dictionary under the target_name key.

Parameters:
  • target_name (str) – The name of the target variable (e.g., ‘training_data’).

  • target_value (Dict) – A dictionary containing metadata or specific criteria for the target, as defined in the configuration.

Return type:

None

output_file_names: Dict[str, str]

Dictionary mapping each target name to the corresponding output Parquet file path derived from the configuration.

Type:

Dict[str, str]

process_targets()[source]

Iterate over all defined targets and call locate_target_rows() on each.

This method retrieves the target definitions (names and other metadata) from the configuration object (config) and then sequentially processes each target. The concrete logic for identifying rows per target is implemented in subclasses via the abstract locate_target_rows() method.

Return type:

None

selected_profiles: DataFrame | None

An optional Polars DataFrame of pre-selected profiles or rows that might be combined with the input data during the target-location process, or used as a filter.

Type:

Optional[polars.DataFrame]

selected_rows: Dict[str, DataFrame]

A dictionary to store the resulting target rows for each target as a Polars DataFrame, keyed by target name.

Type:

Dict[str, polars.DataFrame]

write_selected_rows()[source]

Write the identified target rows to separate Parquet files.

This method iterates through the selected_rows dictionary. For each target, it looks up the output file path in output_file_names and writes the corresponding Polars DataFrame to a Parquet file. Missing directories are created before writing.
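The write loop can be sketched as below. To stay dependency-free, the sketch takes the writer as a callable and writes placeholder bytes. With Polars, the callable could be something like lambda df, path: df.write_parquet(path). The file name template, the target names, and the write_rows helper are hypothetical.

```python
import tempfile
from pathlib import Path


def write_rows(selected_rows, output_file_names, write):
    """Write one file per target, creating parent directories as needed."""
    if not selected_rows:
        # Mirrors the documented ValueError for an empty selected_rows.
        raise ValueError("selected_rows is empty; nothing to write")
    written = []
    for target_name, rows in selected_rows.items():
        path = Path(output_file_names[target_name])
        path.parent.mkdir(parents=True, exist_ok=True)
        write(rows, path)
        written.append(path)
    return written


# Hypothetical template; {target_name} is replaced per target.
template = "{target_name}_rows.parquet"

with tempfile.TemporaryDirectory() as tmp:
    output_file_names = {
        name: str(Path(tmp) / "select" / template.format(target_name=name))
        for name in ("temp", "psal")
    }
    # Placeholder payloads stand in for Polars DataFrames.
    selected_rows = {"temp": b"temp rows", "psal": b"psal rows"}
    written = write_rows(
        selected_rows,
        output_file_names,
        lambda data, path: path.write_bytes(data),
    )
    written_names = [p.name for p in written]
    all_exist = all(p.exists() for p in written)
```

Passing the writer as an argument keeps the path handling testable on its own, independent of the file format.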

Raises:

ValueError – If the selected_rows dictionary is empty, indicating that no target rows have been identified or processed.

Return type:

None