aiqclib.prepare.step6_split_dataset package
Submodules
aiqclib.prepare.step6_split_dataset.dataset_a module
This module defines the SplitDataSetA class, which is responsible for partitioning feature data into training and test sets specifically for Copernicus CTD data. It ensures that related positive and negative samples (grouped by pair identifiers) remain together during the split and assigns k-fold indices for cross-validation.
- class aiqclib.prepare.step6_split_dataset.dataset_a.SplitDataSetA(config, target_features=None)
Bases: SplitDataSetBase
A subclass of SplitDataSetBase that splits feature data into training and test sets for Copernicus CTD data.
This class performs the following tasks:
Randomly samples a fraction of rows for the test set.
Ensures matching positive and negative rows are grouped by shared identifiers (e.g., pair_id).
Splits out the remainder into a training set.
Assigns k-fold indices to the training set rows.
Optionally drops columns that are not required for subsequent analysis.
Note
This class is specifically designed for Copernicus CTD data structures where positive and negative samples are linked via a pair_id.
- Parameters:
config (ConfigBase)
target_features (Dict[str, DataFrame] | None)
- add_k_fold(target_name)
Assign a k-fold identifier to each row in the training set for cross-validation.
Positive samples are distributed across k folds, and negative samples are assigned the same fold index as their corresponding positive sample via pair_id.
- Parameters:
target_name (str) – The target name identifying the training set within training_sets.
- Return type:
None
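The fold assignment described above can be illustrated with a minimal, library-independent sketch. It uses plain Python dictionaries rather than the DataFrames the class operates on. The function name `assign_k_fold`, the round-robin scheme, the 1-based fold numbering, and the `k_fold` column name are assumptions made for illustration, not the library's implementation.

```python
import random
from collections import Counter


def assign_k_fold(rows, k, seed=0):
    """Give each positive row a fold, then copy that fold to paired negatives."""
    rng = random.Random(seed)
    # Collect the pair_id of every positive row (label == 1) and shuffle them.
    positive_ids = [r["pair_id"] for r in rows if r["label"] == 1]
    rng.shuffle(positive_ids)
    # Round-robin assignment keeps fold sizes within one row of each other.
    # Fold numbers start at 1 here; that is an assumption of this sketch.
    fold_of = {pid: (i % k) + 1 for i, pid in enumerate(positive_ids)}
    # Negatives look up the fold of their positive partner via pair_id.
    # A negative without a positive partner would raise KeyError here.
    return [{**r, "k_fold": fold_of[r["pair_id"]]} for r in rows]


# Hypothetical input: 10 positive/negative pairs linked by pair_id.
rows = [{"pair_id": i, "label": label} for i in range(10) for label in (1, 0)]
folded = assign_k_fold(rows, k=3)
fold_sizes = Counter(r["k_fold"] for r in folded if r["label"] == 1)
```

Keeping a positive row and its paired negative in the same fold means a validation fold never contains the partner of a row used for fitting in that round.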
- drop_col_names
Column names used for intermediate processing (e.g., to maintain matching references between positive and negative rows).
- drop_columns(target_name)
Remove working columns (e.g., profile_id, pair_id) from the datasets.
This cleans the DataFrames to include only features and labels required for model training and evaluation.
- Parameters:
target_name (str) – The target name identifying which training and test sets to modify.
- Return type:
None
- expected_class_name: str = 'SplitDataSetA'
- split_test_set(target_name)
Split the specified target's DataFrame into training and test sets.
The method samples positive labels (label=1) based on a configured fraction, then identifies corresponding negative labels (label=0) using pair_id to ensure consistency. The remaining data is assigned to the training set.
- Parameters:
target_name (str) – The target name identifying which DataFrame in target_features to split.
- Return type:
None
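The pair-aware split described above can be sketched as follows. This is a plain-Python illustration, not the library's code: the function name `split_by_pair`, the rounding rule for the sample size, the fixed seed, and the `temp` feature column are all assumptions.

```python
import random


def split_by_pair(rows, test_fraction, seed=0):
    """Sample positives for the test set, then pull in negatives by pair_id."""
    rng = random.Random(seed)
    positives = [r for r in rows if r["label"] == 1]
    # Number of positives to hold out; rounding is an assumption of this sketch.
    n_test = round(len(positives) * test_fraction)
    test_pair_ids = {r["pair_id"] for r in rng.sample(positives, n_test)}
    # Any row sharing a sampled pair_id goes to the test set, whatever its label.
    test = [r for r in rows if r["pair_id"] in test_pair_ids]
    # Everything else forms the training set.
    train = [r for r in rows if r["pair_id"] not in test_pair_ids]
    return train, test


# Hypothetical input: 10 positive/negative pairs with one feature column.
rows = [
    {"pair_id": i, "label": label, "temp": 10.0 + i}
    for i in range(10)
    for label in (1, 0)
]
train, test = split_by_pair(rows, test_fraction=0.2)
```

Splitting on `pair_id` rather than on individual rows keeps each positive sample and its matched negative on the same side of the split.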
aiqclib.prepare.step6_split_dataset.dataset_all module
This module defines the SplitDataSetAll class, which is responsible for partitioning feature data into training and test sets. It provides functionality for random sampling, k-fold cross-validation index assignment, and column cleanup for Copernicus CTD datasets.
- class aiqclib.prepare.step6_split_dataset.dataset_all.SplitDataSetAll(config, target_features=None)
Bases: SplitDataSetBase
A subclass of SplitDataSetBase that splits feature data into training and test sets for Copernicus CTD data.
This class performs the following tasks:
Randomly samples a fraction of rows for the test set.
Ensures matching positive and negative rows are grouped by shared identifiers (e.g., pair_id).
Splits out the remainder into a training set.
Assigns k-fold indices to the training set rows.
Optionally drops columns that are not required for subsequent analysis.
Note
This class, SplitDataSetAll, is specifically designed to split feature data into training and test sets with particular handling for Copernicus CTD data.
- Parameters:
config (ConfigBase)
target_features (Dict[str, DataFrame] | None)
- add_k_fold(target_name)
Assign a k-fold identifier to each row in the training set for cross-validation.
Extracts rows labeled 1 (positive) and evenly distributes them across the specified number of folds.
Joins negative rows based on pair_id so they share the same fold assignment.
- Parameters:
target_name (str) – The target name identifying the training set within training_sets.
- Return type:
None
- drop_col_names
Column names used for intermediate processing (e.g., to maintain matching references between positive and negative rows).
- drop_columns(target_name)
Remove specified working columns from both the training and test sets, leaving only the essential columns for subsequent steps.
- Parameters:
target_name (str) – The target name identifying which training and test sets to modify.
- Return type:
None
- expected_class_name: str = 'SplitDataSetAll'
- split_test_set(target_name)
Split the specified target's DataFrame into training and test sets.
A random fraction of rows labeled 1 (positive) is sampled to form the test set.
Rows labeled 0 (negative) with matching pair_id are joined to that test set.
The remaining rows form the training set.
- Parameters:
target_name (str) – The target name identifying which DataFrame in target_features to split.
- Return type:
None
aiqclib.prepare.step6_split_dataset.split_base module
This module defines the abstract base class SplitDataSetBase for managing the splitting of target feature DataFrames into training and test sets, and for assigning k-fold cross-validation labels.
It extends DataSetBase to provide a standardized structure for data splitting operations, integrating configuration management and supporting the output of processed datasets to Parquet files.
- class aiqclib.prepare.step6_split_dataset.split_base.SplitDataSetBase(config, target_features=None)
Bases: DataSetBase
Abstract base class to perform train/test splitting and k-fold assignment for target feature DataFrames.
This class extends aiqclib.common.base.dataset_base.DataSetBase to validate and incorporate YAML-based configuration. It provides methods for writing out the resulting training and test sets into Parquet files.
Subclasses must implement the abstract methods: split_test_set(), add_k_fold(), and drop_columns().
Note
Because this class inherits from aiqclib.common.base.dataset_base.DataSetBase and is an abstract base class, subclasses that are intended to be instantiated may need to define expected_class_name.
- Parameters:
config (ConfigBase)
target_features (Dict[str, DataFrame] | None)
- abstractmethod add_k_fold(target_name)
Add k-fold cross-validation columns or labels to the training set.
Typically, this method would modify the DataFrame in training_sets[target_name].
- Parameters:
target_name (str) – The target name being processed.
- Return type:
None
- default_file_names: Dict[str, str]
Default file naming templates for train and test sets.
- default_k_fold: int
Default number of folds for k-fold cross-validation if unspecified.
- default_test_set_fraction: float
Default fraction for test sets if none is specified in the config.
- abstractmethod drop_columns(target_name)
Drop unnecessary columns from both training and test sets.
- Parameters:
target_name (str) – The target name being processed.
- Return type:
None
- get_k_fold()
Retrieve the number of folds for cross-validation from the configuration, falling back to default_k_fold if unspecified.
- Returns:
An integer representing how many folds are used during k-fold cross-validation steps.
- Return type:
int
- get_test_set_fraction()
Retrieve the test set fraction (0-1) from the configuration, falling back to default_test_set_fraction if unspecified.
- Returns:
A float in the range [0, 1] representing the fraction of data reserved for testing.
- Return type:
float
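The configuration-or-default lookup described for get_k_fold() and get_test_set_fraction() can be sketched with a plain dictionary standing in for the configuration object. The helper name `get_with_fallback`, the keys `k_fold` and `test_set_fraction`, the default values, and the range check are all hypothetical; this section does not state how the configuration is actually read.

```python
# Placeholder defaults for this sketch; the real default_k_fold and
# default_test_set_fraction values are not stated in this section.
DEFAULT_K_FOLD = 10
DEFAULT_TEST_SET_FRACTION = 0.1


def get_with_fallback(step_params, key, default):
    """Return step_params[key] when it is set, otherwise the default."""
    value = step_params.get(key)
    return default if value is None else value


def get_k_fold(step_params):
    """Number of folds from the (hypothetical) config dict, or the default."""
    return get_with_fallback(step_params, "k_fold", DEFAULT_K_FOLD)


def get_test_set_fraction(step_params):
    """Test fraction from the (hypothetical) config dict, or the default."""
    fraction = get_with_fallback(
        step_params, "test_set_fraction", DEFAULT_TEST_SET_FRACTION
    )
    # Range check added for this sketch only; the documented range is [0, 1].
    if not 0.0 <= fraction <= 1.0:
        raise ValueError(f"test_set_fraction must be in [0, 1], got {fraction}")
    return fraction
```

Comparing against `None` rather than relying on truthiness lets an explicit value of 0 in the configuration override the default.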
- output_file_names: Dict[str, Dict[str, str]]
File paths for each target's train/test sets, keyed by "train" and "test".
- process_targets()
Perform test splitting, k-fold assignment, and column dropping for each target defined in the dataset configuration.
Uses the abstract methods split_test_set(), add_k_fold(), and drop_columns() for each target name.
- Return type:
None
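The relationship between process_targets() and the three abstract methods follows the template-method pattern, which can be sketched as below. The class names `SplitSketchBase` and `RecordingSplit`, the `target_names` attribute, and the `calls` log are hypothetical stand-ins; the call order is taken from the description above and is otherwise an assumption.

```python
from abc import ABC, abstractmethod


class SplitSketchBase(ABC):
    """Hypothetical stand-in for an abstract splitting base class."""

    def __init__(self, target_names):
        self.target_names = list(target_names)

    @abstractmethod
    def split_test_set(self, target_name): ...

    @abstractmethod
    def add_k_fold(self, target_name): ...

    @abstractmethod
    def drop_columns(self, target_name): ...

    def process_targets(self):
        # Fixed pipeline; subclasses supply the three steps.
        for name in self.target_names:
            self.split_test_set(name)
            self.add_k_fold(name)
            self.drop_columns(name)


class RecordingSplit(SplitSketchBase):
    """Concrete subclass that only records which step ran for which target."""

    def __init__(self, target_names):
        super().__init__(target_names)
        self.calls = []

    def split_test_set(self, target_name):
        self.calls.append(("split_test_set", target_name))

    def add_k_fold(self, target_name):
        self.calls.append(("add_k_fold", target_name))

    def drop_columns(self, target_name):
        self.calls.append(("drop_columns", target_name))


split = RecordingSplit(["temp", "psal"])
split.process_targets()
```

Because the base class declares abstract methods, Python refuses to instantiate it directly; only subclasses that implement all three steps can be created.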
- abstractmethod split_test_set(target_name)
Split the DataFrame for a given target into training and test sets.
Must store any resulting DataFrames in training_sets and test_sets using the target name as a key.
- Parameters:
target_name (str) – The identifier of the target to split.
- Return type:
None
- target_features: Dict[str, DataFrame] | None
A dictionary of Polars DataFrames of feature columns for all targets, if available.
- test_sets: Dict[str, DataFrame]
A dictionary of Polars DataFrames holding test splits by target name.
- training_sets: Dict[str, DataFrame]
A dictionary of Polars DataFrames holding training splits by target name.
- write_data_sets()
Write both training and test sets to disk.
Simply calls write_test_sets() and write_training_sets().
- Return type:
None
- write_test_sets()
Write the test splits to Parquet files.
- Raises:
ValueError – If test_sets is empty (i.e., no splits have been created).
- Return type:
None
- write_training_sets()
Write the training splits to Parquet files.
- Raises:
ValueError – If training_sets is empty (i.e., no splits have been created).
- Return type:
None
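The empty-check behaviour shared by write_test_sets() and write_training_sets() can be sketched as a guard in front of a per-target write loop. The function `write_sets`, its `writer` callback, the error message, and the file paths are hypothetical; the sketch records paths instead of writing Parquet so that it runs without any third-party package.

```python
def write_sets(data_sets, file_names, writer, attr_name):
    """Write each target's data via `writer`, refusing to run on empty input."""
    if not data_sets:
        # Mirrors the documented ValueError when no splits have been created.
        raise ValueError(f"Member variable '{attr_name}' must not be empty.")
    written = []
    for target_name, data in data_sets.items():
        path = file_names[target_name]
        writer(data, path)
        written.append(path)
    return written


# Hypothetical stand-ins: lists instead of DataFrames, a log instead of disk I/O.
log = []
test_sets = {"temp": [1, 2], "psal": [3]}
test_files = {"temp": "out/test_set_temp.parquet", "psal": "out/test_set_psal.parquet"}
written = write_sets(test_sets, test_files, lambda data, path: log.append(path), "test_sets")
```

With Polars DataFrames, the callback could be `lambda df, path: df.write_parquet(path)`, since Polars provides `DataFrame.write_parquet`.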