Configuration of Dataset Preparation

The prepare workflow (stage="prepare") is central to setting up your data for machine learning tasks within this library. It provides comprehensive control over the entire data processing pipeline, from preparing feature data sets from your raw data and creating the training, validation, and test data sets.

Core Concepts: Modular Configuration

The configuration for dataset preparation is designed around a powerful “building blocks” concept. Instead of defining a monolithic configuration, you define various sets of specialized configurations once, give each set a unique name, and then combine them as needed to construct a complete and flexible data processing pipeline. This modularity promotes reusability, simplifies experimentation, and enhances maintainability.

The primary configuration sections (building blocks) are:

path_info_sets: Defines reusable directory structures for input data and processed outputs.
target_sets: Specifies the prediction target variables, including their quality control (QC) flags.
summary_stats_sets: Configures summary statistics.
feature_sets: (Advanced) Lists the specific feature engineering methods to be applied.
feature_param_sets: Provides detailed parameters and settings for each chosen feature engineering method.
feature_stats_sets: (Advanced) Provides summary statistics values for normalizing features.
step_class_sets: (Advanced) Allows users to define custom Python classes for individual processing steps, enabling deep customization of the pipeline’s behavior.
step_param_sets: Supplies general parameters that control the behavior of the default or custom processing steps.
data_sets: The central assembly section, where you combine named blocks from the sections above to define a complete and executable data processing pipeline.

Note

dmaclib provides methods to down-sample the negative data set. Please refer to the Down-sampling of the Negative Dataset guide for details.

Detailed Configuration Sections

path_info_sets

This section defines the critical file system locations for both your raw input data and the various processed output artifacts. You can define multiple named path configurations to easily switch between different storage environments or project setups.

common.base_path: The root directory where all processed data and intermediate artifacts will be saved by this workflow.
input.base_path: The directory containing your raw input data files.
split.step_folder_name: The name of the subdirectory where the final training, validation, and test datasets will be stored (e.g., training).

path_info_sets:
  - name: data_set_1
    common:
      base_path: /path/to/data
    input:
      base_path: /path/to/input
      step_folder_name: ""
    split:
      step_folder_name: training

target_sets

This section specifies the target variables that your machine learning model will predict. For each target variable, you must also define its corresponding quality control (QC) flag column. These flags are crucial for identifying good versus bad data points, allowing the pipeline to filter or weight data appropriately. You define both positive (good) and negative (bad) flag values.

target_sets:
  - name: target_set_1
    variables:
      - name: temp
        flag: temp_qc
        pos_flag_values: [ 4, 6, 7 ]
        neg_flag_values: [ 1 ]

summary_stats_sets

This section defines summary statistics that will be used for feature values or feature normalization.

summary_stats_sets:
  - name: summary_stats_set_1
    stats:
      - name: location
        col_names: [ longitude, latitude ]
      - name: profile_summary_stats
        col_names: [ temp, psal, pres ]
      - name: basic_values3
        col_names: [ temp, psal, pres ]

aiqclib currently provides the following summary statistics.

location: global summary statistics of locations for feature normalization.
profile_summary_stats: profile level summary statistics used as features and for feature normalization.
basic_values3: global summary statistics of specified variables for feature normalization.

feature_sets & feature_param_sets

These two interconnected sections are dedicated to configuring your feature engineering process.

feature_sets: This block lists the names of the specific feature engineering methods you want to apply to your data.
feature_param_sets: This block provides the detailed parameters and configurations for each of the feature methods listed in your chosen feature_sets block. This allows for fine-grained control over how each feature is generated.

# A list of features to apply
feature_sets:
  - name: feature_set_1
    features:
      - location
      - day_of_year
      - profile_summary_stats
      - basic_values
      - flank_up
      - flank_down

# Parameters for the features listed above
feature_param_sets:
  - name: feature_set_1_param_set_1
    params:
      - feature: location
        stats_set: { type: raw }
        col_names: [ longitude, latitude ]
      - feature: day_of_year
        convert: cosine                         # or sine
        col_names: [ profile_timestamp ]
      - feature: profile_summary_stats
        stats_set: { type: raw }
        col_names: [ temp, psal, pres ]
        summary_stats_names: [ mean, median, sd, pct25, pct75 ]
      - feature: basic_values
        stats_set: { type: raw }
        col_names: [ temp, psal, pres ]
      - feature: flank_up
        flank_up: 5
        stats_set: { type: raw }
        col_names: [ temp, psal, pres ]
      - feature: flank_down
        flank_down: 5
        stats_set: { type: raw }
        col_names: [ temp, psal, pres ]

feature_stats_sets

(Advanced Use)

This section defines summary statistics that will be used for normalization or scaling of feature values. These statistics are typically derived from your dataset itself to ensure proper scaling.

feature_stats_sets:
  - name: feature_set_1_stats_set_1

Important

As it is crucial to normalize features for non-tree based machine learning methods, such as SVM and logistic regression, you need to provide summary statistics (like min/max values) of your data in the configuration file. The aiqclib library offers convenient functions to calculate the summary statistics. Please refer to the Feature Normalization guide for details.

step_class_sets

(Advanced Use) This section allows you to define and reference custom Python classes that implement the logic for specific processing steps within the data preparation pipeline. While the library provides default implementations for all steps, this block gives advanced users the flexibility to replace or extend pipeline behaviors with their own code. Each entry maps a step name (e.g., input, summary) to the name of a Python class.

step_class_sets:
  - name: data_set_step_set_1
    steps:
      input: InputDataSetA
      summary: SummaryDataSetA
      select: SelectDataSetAll
      locate: LocateDataSetAll
      extract: ExtractDataSetA
      split: SplitDataSetAll

step_param_sets

This section provides general parameters that control the behavior of the various data processing steps within the pipeline (whether default or custom step_class_sets). Examples of parameters include data filtering rules, sampling ratios, and split configurations.

steps.input.sub_steps.filter_rows: A boolean flag to enable/disable row filtering based on filter_method_dict.
steps.input.filter_method_dict.remove_years: Specifies a list of years to be excluded from the dataset.
steps.input.filter_method_dict.keep_years: Specifies a list of years to be kept for training.
steps.split.test_set_fraction: Defines the proportion of data to allocate to the test set.
steps.split.k_fold: Defines the k of k-fold cross validation

step_param_sets:
  - name: data_set_param_set_1
    steps:
      input: { sub_steps: { rename_columns: false,
                            filter_rows: true },
               rename_dict: { },
               filter_method_dict: { remove_years: [2023],
                                     keep_years: [] } }
      summary: { }
      select: { }
      locate: { }
      extract: { }
      split: { test_set_fraction: 0.1,
               k_fold: 5 }

data_sets

This is the main “pipeline assembly” section. Each entry in this list defines a complete data preparation job by linking together the named building blocks defined in the other sections. This section essentially orchestrates which specific configuration sets are used for a given dataset processing run.

name: A unique identifier for this particular dataset preparation job (e.g., dataset_0001).
dataset_folder_name: The name of the specific folder that will be created within the common.base_path to store outputs for this job (e.g., dataset_0001).
input_file_name: The specific raw data file (located in input.base_path) to be processed for this job.
path_info: The name of the path configuration to use from path_info_sets.
target_set: The name of the target configuration to use from target_sets.
…and similarly for all other configuration sets.

data_sets:
  - name: dataset_0001
    dataset_folder_name: dataset_0001
    input_file_name: nrt_cora_bo_4.parquet
    path_info: data_set_1
    target_set: target_set_1
    # ... other set references would follow here

Note

While you can define multiple data sets in the data_sets section, a specific one must be selected for subsequent processes. Please consult the dedicated Selecting Specific Configurations page for instructions on how to do this.

Full Example

Below is a complete example of a prepare_config.yaml file, demonstrating how all the building blocks are combined. The lines you will most commonly need to edit or customize are highlighted for quick reference.

Full prepare_config.yaml example

---
path_info_sets:
  - name: data_set_1
    common:
      base_path: /path/to/data # Root output directory for processed data
    input:
      base_path: /path/to/input # Directory containing raw input files
      step_folder_name: ""
    split:
      step_folder_name: training

target_sets:
  - name: target_set_1
    variables:
      - name: temp
        flag: temp_qc
        pos_flag_values: [ 4, 6, 7 ]
        neg_flag_values: [ 1 ]
      - name: psal
        flag: psal_qc
        pos_flag_values: [ 4, 6, 7 ]
        neg_flag_values: [ 1 ]
      - name: pres
        flag: pres_qc
        pos_flag_values: [ 4, 6, 7 ]
        neg_flag_values: [ 1 ]

summary_stats_sets:
  - name: summary_stats_set_1
    stats:
      - name: location
        col_names: [ longitude, latitude ]
      - name: profile_summary_stats
        col_names: [ temp, psal, pres ]
      - name: basic_values3
        col_names: [ temp, psal, pres ]

feature_sets:
  - name: feature_set_1
    features:
      - location
      - day_of_year
      - profile_summary_stats
      - basic_values
      - flank_up
      - flank_down

feature_param_sets:
  - name: feature_set_1_param_set_1
    params:
      - feature: location
        stats_set: { type: raw }
        col_names: [ longitude, latitude ]
      - feature: day_of_year
        convert: cosine                         # or sine
        col_names: [ profile_timestamp ]
      - feature: profile_summary_stats
        stats_set: { type: raw }
        col_names: [ temp, psal, pres ]
        summary_stats_names: [ mean, median, sd, pct25, pct75 ]
      - feature: basic_values
        stats_set: { type: raw }
        col_names: [ temp, psal, pres ]
      - feature: flank_up
        flank_up: 5
        stats_set: { type: raw }
        col_names: [ temp, psal, pres ]
      - feature: flank_down
        flank_down: 5
        stats_set: { type: raw }
        col_names: [ temp, psal, pres ]

feature_stats_sets:
  - name: feature_set_1_stats_set_1

step_class_sets:
  - name: data_set_step_set_1
    steps:
      input: InputDataSetA
      summary: SummaryDataSetA
      select: SelectDataSetAll
      locate: LocateDataSetAll
      extract: ExtractDataSetA
      split: SplitDataSetAll

step_param_sets:
  - name: data_set_param_set_1
    steps:
      input: { sub_steps: { rename_columns: false,
                            filter_rows: true },
               rename_dict: { },
               filter_method_dict: { remove_years: [2023],
                                     keep_years: [] } }
      summary: { }
      select: { }
      locate: { }
      extract: { }
      split: { test_set_fraction: 0.1,
               k_fold: 5 }

data_sets:
  - name: dataset_0001  # Your unique name for this dataset job
    dataset_folder_name: dataset_0001  # The folder name for output files
    input_file_name: nrt_cora_bo_4.parquet # The specific raw input file to process
    path_info: data_set_1
    target_set: target_set_1
    summary_stats_set: summary_stats_set_1
    feature_set: feature_set_1
    feature_param_set: feature_set_1_param_set_1
    feature_stats_set: feature_set_1_stats_set_1
    step_class_set: data_set_step_set_1
    step_param_set: data_set_param_set_1