aiqclib.prepare.step6_split_dataset package

Submodules

aiqclib.prepare.step6_split_dataset.dataset_a module

This module defines the SplitDataSetA class, which is responsible for partitioning feature data into training and test sets specifically for Copernicus CTD data. It ensures that related positive and negative samples (grouped by pair identifiers) remain together during the split and assigns k-fold indices for cross-validation.

class aiqclib.prepare.step6_split_dataset.dataset_a.SplitDataSetA(config, target_features=None)[source]

Bases: SplitDataSetBase

A subclass of SplitDataSetBase that splits feature data into training and test sets for Copernicus CTD data.

This class performs the following tasks:
  • Randomly samples a fraction of rows for the test set.

  • Ensures matching positive and negative rows are grouped by shared identifiers (e.g., pair_id).

  • Splits out the remainder into a training set.

  • Assigns k-fold indices to the training set rows.

  • Optionally drops columns that are not required for subsequent analysis.

Note

This class is specifically designed for Copernicus CTD data structures where positive and negative samples are linked via a pair_id.

Parameters:
  • config (ConfigBase)

  • target_features (Dict[str, DataFrame] | None)

add_k_fold(target_name)[source]

Assign a k-fold identifier to each row in the training set for cross-validation.

Positive samples are distributed across k folds, and negative samples are assigned the same fold index as their corresponding positive sample via pair_id.

Parameters:

target_name (str) – The target name identifying the training set within training_sets.

Return type:

None
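
The pairing rule can be sketched in plain Python. This is a minimal illustration, not the library's implementation: lists of dicts stand in for the Polars DataFrames, and the `k_fold` key, the 1-based fold numbering, and the round-robin assignment are all assumptions made for the example. It also assumes every negative row has a positive row with the same pair_id.

```python
from typing import Dict, List


def assign_folds(rows: List[dict], k: int) -> List[dict]:
    """Return copies of rows with a hypothetical 'k_fold' key in 1..k."""
    # Spread positive rows (label == 1) over the folds in round-robin order.
    positives = [r for r in rows if r["label"] == 1]
    fold_by_pair: Dict[str, int] = {
        r["pair_id"]: (i % k) + 1 for i, r in enumerate(positives)
    }
    # Each row takes the fold of the positive row that shares its pair_id,
    # so a negative row never lands in a different fold than its positive.
    return [{**r, "k_fold": fold_by_pair[r["pair_id"]]} for r in rows]


# Three positive/negative pairs distributed over two folds.
training_rows = [
    {"pair_id": pid, "label": label}
    for pid in ("p1", "p2", "p3")
    for label in (1, 0)
]
folded = assign_folds(training_rows, k=2)
```

Keeping a pair inside one fold means a validation fold never contains the negative counterpart of a positive row that the model was trained on.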

drop_col_names

Column names used for intermediate processing (e.g., to maintain matching references between positive and negative rows).

drop_columns(target_name)[source]

Remove working columns (e.g., profile_id, pair_id) from the datasets.

This cleans the DataFrames to include only features and labels required for model training and evaluation.

Parameters:

target_name (str) – The target name identifying which training and test sets to modify.

Return type:

None
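
As a rough sketch of the cleanup step, again with a list of dicts standing in for a Polars DataFrame: the helper name `drop_working_columns` and the `temp` feature column are hypothetical, and the real method operates on the stored training and test sets rather than returning a new object.

```python
from typing import Iterable, List


def drop_working_columns(rows: List[dict], drop_col_names: Iterable[str]) -> List[dict]:
    """Return copies of rows without the listed working columns."""
    drop = set(drop_col_names)
    return [{k: v for k, v in row.items() if k not in drop} for row in rows]


# 'temp' is a made-up feature column used only for this example.
raw_rows = [{"profile_id": 7, "pair_id": "p1", "temp": 12.5, "label": 1}]
cleaned_rows = drop_working_columns(raw_rows, ["profile_id", "pair_id"])
```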

expected_class_name: str = 'SplitDataSetA'

split_test_set(target_name)[source]

Split the specified target’s DataFrame into training and test sets.

The method samples positive labels (label=1) based on a configured fraction, then identifies corresponding negative labels (label=0) using pair_id to ensure consistency. The remaining data is assigned to the training set.

Parameters:

target_name (str) – The target name identifying which DataFrame in target_features to split.

Return type:

None
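
The sampling logic described above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: lists of dicts replace the Polars DataFrames, the helper name `split_by_pair` and the `seed` parameter are hypothetical, and each pair_id is assumed to belong to exactly one positive row.

```python
import random
from typing import List, Tuple


def split_by_pair(
    rows: List[dict], fraction: float, seed: int = 0
) -> Tuple[List[dict], List[dict]]:
    """Return (training_rows, test_rows) with paired rows kept together."""
    rng = random.Random(seed)
    # Sample the requested fraction of positive rows (label == 1).
    positives = [r for r in rows if r["label"] == 1]
    n_test = round(len(positives) * fraction)
    test_pair_ids = {r["pair_id"] for r in rng.sample(positives, n_test)}
    # Negative rows follow their positive via pair_id; the rest is training data.
    test = [r for r in rows if r["pair_id"] in test_pair_ids]
    train = [r for r in rows if r["pair_id"] not in test_pair_ids]
    return train, test


# Ten positive/negative pairs, 20% of the positives go to the test set.
feature_rows = [
    {"pair_id": f"p{i}", "label": label} for i in range(10) for label in (1, 0)
]
train_rows, test_rows = split_by_pair(feature_rows, fraction=0.2)
```

Splitting on pair_id rather than on individual rows prevents a positive row and its matched negative from ending up on opposite sides of the split.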

aiqclib.prepare.step6_split_dataset.dataset_all module

This module defines the SplitDataSetAll class, which is responsible for partitioning feature data into training and test sets. It provides functionality for random sampling, k-fold cross-validation index assignment, and column cleanup for Copernicus CTD datasets.

class aiqclib.prepare.step6_split_dataset.dataset_all.SplitDataSetAll(config, target_features=None)[source]

Bases: SplitDataSetBase

A subclass of SplitDataSetBase that splits feature data into training and test sets for Copernicus CTD data.

This class performs the following tasks:
  • Randomly samples a fraction of rows for the test set.

  • Ensures matching positive and negative rows are grouped by shared identifiers (e.g., pair_id).

  • Splits out the remainder into a training set.

  • Assigns k-fold indices to the training set rows.

  • Optionally drops columns that are not required for subsequent analysis.

Note

This class is specifically designed for splitting Copernicus CTD feature data into training and test sets.

Parameters:
  • config (ConfigBase)

  • target_features (Dict[str, DataFrame] | None)

add_k_fold(target_name)[source]

Assign a k-fold identifier to each row in the training set for cross-validation.

  1. Extracts rows labeled 1 (positive) and distributes them evenly across the specified number of folds.

  2. Joins negative rows based on pair_id so they share the same fold assignment.

Parameters:

target_name (str) – The target name identifying the training set within training_sets.

Return type:

None

drop_col_names

Column names used for intermediate processing (e.g., to maintain matching references between positive and negative rows).

drop_columns(target_name)[source]

Remove specified working columns from both the training and test sets, leaving only the essential columns for subsequent steps.

Parameters:

target_name (str) – The target name identifying which training and test sets to modify.

Return type:

None

expected_class_name: str = 'SplitDataSetAll'

split_test_set(target_name)[source]

Split the specified target’s DataFrame into training and test sets.

  1. A random fraction of rows labeled 1 (positive) is sampled to form the test set.

  2. Rows labeled 0 (negative) with matching pair_id are joined to that test set.

  3. The remaining rows form the training set.

Parameters:

target_name (str) – The target name identifying which DataFrame in target_features to split.

Return type:

None

aiqclib.prepare.step6_split_dataset.split_base module

This module defines the abstract base class SplitDataSetBase for managing the splitting of target feature DataFrames into training and test sets, and for assigning k-fold cross-validation labels.

It extends DataSetBase to provide a standardized structure for data splitting operations, integrating configuration management and supporting the output of processed datasets to Parquet files.

class aiqclib.prepare.step6_split_dataset.split_base.SplitDataSetBase(config, target_features=None)[source]

Bases: DataSetBase

Abstract base class to perform train/test splitting and k-fold assignment for target feature DataFrames.

This class extends aiqclib.common.base.dataset_base.DataSetBase to validate and incorporate YAML-based configuration. It provides methods for writing out the resulting training and test sets into Parquet files.

Subclasses must implement the abstract methods: split_test_set(), add_k_fold(), and drop_columns().

Note

Because this class inherits from aiqclib.common.base.dataset_base.DataSetBase and is an abstract base class, it is not meant to be instantiated directly. Concrete subclasses that are intended to be instantiated may need to define expected_class_name.

Parameters:
  • config (ConfigBase)

  • target_features (Dict[str, DataFrame] | None)

abstractmethod add_k_fold(target_name)[source]

Add k-fold cross-validation columns or labels to the training set.

Implementations typically modify the DataFrame stored in training_sets[target_name].

Parameters:

target_name (str) – The target name being processed.

Return type:

None

default_file_names: Dict[str, str]

Default file naming templates for train and test sets.

default_k_fold: int

Default number of folds for k-fold cross-validation if unspecified.

default_test_set_fraction: float

Default fraction for test sets if none is specified in the config.

abstractmethod drop_columns(target_name)[source]

Drop unnecessary columns from both training and test sets.

Parameters:

target_name (str) – The target name being processed.

Return type:

None

get_k_fold()[source]

Retrieve the number of cross-validation folds from the configuration, falling back to default_k_fold when none is specified.

Returns:

An integer representing how many folds are used during k-fold cross-validation steps.

Return type:

int

get_test_set_fraction()[source]

Retrieve the test set fraction (a value between 0 and 1) from the configuration, falling back to default_test_set_fraction when none is specified.

Returns:

A float in the range [0, 1] representing the fraction of data reserved for testing.

Return type:

float
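
The configuration-or-default lookup used by get_k_fold() and get_test_set_fraction() might look roughly like this. Everything specific here is an assumption: the default values, the `k_fold` and `test_set_fraction` key names, the plain dict used in place of ConfigBase, and the range check are illustrative and not taken from the library.

```python
from typing import Any, Dict

# Hypothetical defaults; the real values live in default_k_fold and
# default_test_set_fraction and are not stated in this documentation.
DEFAULT_K_FOLD = 10
DEFAULT_TEST_SET_FRACTION = 0.1


def resolve_k_fold(params: Dict[str, Any]) -> int:
    """Use the configured fold count, else the default."""
    return int(params.get("k_fold", DEFAULT_K_FOLD))


def resolve_test_set_fraction(params: Dict[str, Any]) -> float:
    """Use the configured fraction, else the default, and check its range."""
    fraction = float(params.get("test_set_fraction", DEFAULT_TEST_SET_FRACTION))
    if not 0.0 <= fraction <= 1.0:
        raise ValueError(f"test_set_fraction must be in [0, 1], got {fraction}")
    return fraction


configured = {"k_fold": 5, "test_set_fraction": 0.2}
k_from_config = resolve_k_fold(configured)        # value taken from the dict
k_from_default = resolve_k_fold({})               # falls back to DEFAULT_K_FOLD
fraction_from_config = resolve_test_set_fraction(configured)
```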

output_file_names: Dict[str, Dict[str, str]]

File paths for each target’s train/test sets, keyed by “train” and “test”.

process_targets()[source]

Perform test splitting, k-fold assignment, and column dropping for each target defined in the dataset configuration.

Uses the abstract methods split_test_set(), add_k_fold(), and drop_columns() for each target name.

Return type:

None
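
The relationship between process_targets() and the three abstract methods follows the template-method pattern, sketched below with Python's `abc` module. The class names `SplitSketchBase` and `RecordingSplit`, the `target_names` attribute, and the exact call order (split, then k-fold, then drop) are assumptions based on the description above, not the library's source.

```python
from abc import ABC, abstractmethod
from typing import List, Tuple


class SplitSketchBase(ABC):
    """Hypothetical stand-in for an abstract splitting base class."""

    def __init__(self, target_names: List[str]) -> None:
        self.target_names = target_names

    @abstractmethod
    def split_test_set(self, target_name: str) -> None: ...

    @abstractmethod
    def add_k_fold(self, target_name: str) -> None: ...

    @abstractmethod
    def drop_columns(self, target_name: str) -> None: ...

    def process_targets(self) -> None:
        # The base class fixes the order; subclasses supply the steps.
        for name in self.target_names:
            self.split_test_set(name)
            self.add_k_fold(name)
            self.drop_columns(name)


class RecordingSplit(SplitSketchBase):
    """Concrete subclass that only records which step ran for which target."""

    def __init__(self, target_names: List[str]) -> None:
        super().__init__(target_names)
        self.calls: List[Tuple[str, str]] = []

    def split_test_set(self, target_name: str) -> None:
        self.calls.append(("split_test_set", target_name))

    def add_k_fold(self, target_name: str) -> None:
        self.calls.append(("add_k_fold", target_name))

    def drop_columns(self, target_name: str) -> None:
        self.calls.append(("drop_columns", target_name))


# 'temp' and 'psal' are made-up target names for the example.
demo = RecordingSplit(["temp", "psal"])
demo.process_targets()
```

Instantiating `SplitSketchBase` directly raises `TypeError`, because its abstract methods are not implemented.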

abstractmethod split_test_set(target_name)[source]

Split the DataFrame for a given target into training and test sets.

Implementations must store the resulting DataFrames in training_sets and test_sets, keyed by target name.

Parameters:

target_name (str) – The identifier of the target to split.

Return type:

None

target_features: Dict[str, DataFrame] | None

A dictionary of Polars DataFrames of feature columns for all targets, if available.

test_sets: Dict[str, DataFrame]

A dictionary of Polars DataFrames holding test splits by target name.

training_sets: Dict[str, DataFrame]

A dictionary of Polars DataFrames holding training splits by target name.

write_data_sets()[source]

Write both training and test sets to disk.

Simply calls write_test_sets() and write_training_sets().

Return type:

None

write_test_sets()[source]

Write the test splits to Parquet files.

Raises:

ValueError – If test_sets is empty (i.e., no splits have been created).

Return type:

None

write_training_sets()[source]

Write the training splits to Parquet files.

Raises:

ValueError – If training_sets is empty (i.e., no splits have been created).

Return type:

None
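
The empty-check behavior of write_test_sets() and write_training_sets() can be sketched as a guard that runs before any file is written. The helper `write_sets`, its parameters, the error message, and the file name are hypothetical; `write_one` is a callback standing in for the actual Parquet write, where a real implementation might call Polars `DataFrame.write_parquet`.

```python
from typing import Any, Callable, Dict, List


def write_sets(
    sets: Dict[str, Any],
    file_names: Dict[str, str],
    write_one: Callable[[Any, str], None],
    set_name: str,
) -> List[str]:
    """Write every split via write_one and return the paths that were written."""
    # Refuse to run when no splits exist yet.
    if not sets:
        raise ValueError(f"'{set_name}' is empty; create the splits before writing.")
    written = []
    for target_name, frame in sets.items():
        path = file_names[target_name]
        write_one(frame, path)
        written.append(path)
    return written


# A list records the paths instead of touching the file system.
write_log: List[str] = []
written_paths = write_sets(
    {"temp": "frame-placeholder"},
    {"temp": "train_temp.parquet"},
    lambda frame, path: write_log.append(path),
    "training_sets",
)
```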