Configuration of Dataset Preparation
The prepare workflow (stage="prepare") sets up your data for machine learning tasks within this library. It gives you control over the entire data processing pipeline, from building feature data sets from your raw data to creating the training, validation, and test data sets.
Core Concepts: Modular Configuration
The configuration for dataset preparation is designed around a powerful “building blocks” concept. Instead of defining a monolithic configuration, you define various sets of specialized configurations once, give each set a unique name, and then combine them as needed to construct a complete and flexible data processing pipeline. This modularity promotes reusability, simplifies experimentation, and enhances maintainability.
The primary configuration sections (building blocks) are:
path_info_sets: Defines reusable directory structures for input data and processed outputs.
target_sets: Specifies the prediction target variables, including their quality control (QC) flags.
summary_stats_sets: Configures summary statistics.
feature_sets: (Advanced) Lists the specific feature engineering methods to be applied.
feature_param_sets: Provides detailed parameters and settings for each chosen feature engineering method.
feature_stats_sets: (Advanced) Provides summary statistics values for normalizing features.
step_class_sets: (Advanced) Allows users to define custom Python classes for individual processing steps, enabling deep customization of the pipeline’s behavior.
step_param_sets: Supplies general parameters that control the behavior of the default or custom processing steps.
data_sets: The central assembly section, where you combine named blocks from the sections above to define a complete and executable data processing pipeline.
Note
aiqclib provides methods to down-sample the negative data set. Please refer to the Down-sampling of the Negative Dataset guide for details.
Detailed Configuration Sections
path_info_sets
This section defines the critical file system locations for both your raw input data and the various processed output artifacts. You can define multiple named path configurations to easily switch between different storage environments or project setups.
common.base_path: The root directory where all processed data and intermediate artifacts will be saved by this workflow.
input.base_path: The directory containing your raw input data files.
split.step_folder_name: The name of the subdirectory where the final training, validation, and test datasets will be stored (e.g., training).
path_info_sets:
  - name: data_set_1
    common:
      base_path: /path/to/data
    input:
      base_path: /path/to/input
      step_folder_name: ""
    split:
      step_folder_name: training
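To make the roles of these keys concrete, the following minimal sketch shows one way the configured values could be combined into file system paths. The folder layout (common.base_path / dataset folder / step folder) and the helper names output_step_dir and input_file_path are assumptions made for illustration; they are not taken from the library.

```python
from pathlib import PurePosixPath

# One named block from path_info_sets, written as a plain dict.
path_info = {
    "name": "data_set_1",
    "common": {"base_path": "/path/to/data"},
    "input": {"base_path": "/path/to/input", "step_folder_name": ""},
    "split": {"step_folder_name": "training"},
}


def output_step_dir(path_info, dataset_folder_name, step):
    """Hypothetical helper: output folder for one processing step.

    Assumed layout: <common.base_path>/<dataset_folder_name>/<step_folder_name>.
    """
    base = PurePosixPath(path_info["common"]["base_path"])
    folder = path_info[step]["step_folder_name"]
    return base / dataset_folder_name / folder


def input_file_path(path_info, input_file_name):
    """Hypothetical helper: location of the raw input file.

    An empty step_folder_name adds no extra path segment.
    """
    base = PurePosixPath(path_info["input"]["base_path"])
    folder = path_info["input"]["step_folder_name"]
    return base / folder / input_file_name


print(output_step_dir(path_info, "dataset_0001", "split"))
print(input_file_path(path_info, "nrt_cora_bo_4.parquet"))
```

Check the folders that the workflow actually creates before relying on this layout.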
target_sets
This section specifies the target variables that your machine learning model will predict. For each target variable, you must also define its corresponding quality control (QC) flag column. These flags are crucial for identifying good versus bad data points, allowing the pipeline to filter or weight data appropriately. You define both positive (good) and negative (bad) flag values.
target_sets:
  - name: target_set_1
    variables:
      - name: temp
        flag: temp_qc
        pos_flag_values: [ 4, 6, 7 ]
        neg_flag_values: [ 1 ]
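As an illustration of how the two flag lists partition observations, here is a minimal sketch that maps a QC flag value to a class label. The function name label_observation, the 1/0 label encoding, and the choice to return None for flag values found in neither list are assumptions for this example, not the library's documented behavior.

```python
# One entry of target_sets[].variables, written as a plain dict.
temp_target = {
    "name": "temp",
    "flag": "temp_qc",
    "pos_flag_values": [4, 6, 7],
    "neg_flag_values": [1],
}


def label_observation(flag_value, variable):
    """Hypothetical helper: return 1 (positive), 0 (negative), or None."""
    if flag_value in variable["pos_flag_values"]:
        return 1
    if flag_value in variable["neg_flag_values"]:
        return 0
    # Assumption: flag values listed in neither set are left unlabeled.
    return None


rows = [
    {"temp": 12.3, "temp_qc": 1},
    {"temp": 99.9, "temp_qc": 4},
    {"temp": 11.8, "temp_qc": 2},
]
labels = [label_observation(row[temp_target["flag"]], temp_target) for row in rows]
print(labels)  # [0, 1, None]
```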
summary_stats_sets
This section defines summary statistics that will be used for feature values or feature normalization.
summary_stats_sets:
  - name: summary_stats_set_1
    stats:
      - name: location
        col_names: [ longitude, latitude ]
      - name: profile_summary_stats
        col_names: [ temp, psal, pres ]
      - name: basic_values3
        col_names: [ temp, psal, pres ]
aiqclib currently provides the following summary statistics.
location: global summary statistics of locations for feature normalization.
profile_summary_stats: profile level summary statistics used as features and for feature normalization.
basic_values3: global summary statistics of specified variables for feature normalization.
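To show what profile-level summary statistics look like in practice, the sketch below computes them for the values of one variable in a single profile using only the Python standard library. The statistic names follow the summary_stats_names used in the feature parameter examples on this page; the use of the sample standard deviation and the "inclusive" quantile method are assumptions, and the library may compute these values differently.

```python
import statistics


def profile_summary(values):
    """Hypothetical helper: summary statistics for one variable of one profile."""
    # quantiles(n=4) returns the three quartile cut points.
    pct25, _, pct75 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),  # assumption: sample standard deviation
        "pct25": pct25,
        "pct75": pct75,
    }


temp_profile = [10.0, 12.0, 14.0, 16.0, 18.0]
summary = profile_summary(temp_profile)
print(summary)
```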
feature_sets & feature_param_sets
These two interconnected sections are dedicated to configuring your feature engineering process.
feature_sets: This block lists the names of the specific feature engineering methods you want to apply to your data.
feature_param_sets: This block provides the detailed parameters and configurations for each of the feature methods listed in your chosen feature_sets block. This allows for fine-grained control over how each feature is generated.
# A list of features to apply
feature_sets:
  - name: feature_set_1
    features:
      - location
      - day_of_year
      - profile_summary_stats
      - basic_values
      - flank_up
      - flank_down

# Parameters for the features listed above
feature_param_sets:
  - name: feature_set_1_param_set_1
    params:
      - feature: location
        stats_set: { type: raw }
        col_names: [ longitude, latitude ]
      - feature: day_of_year
        convert: cosine  # or sine
        col_names: [ profile_timestamp ]
      - feature: profile_summary_stats
        stats_set: { type: raw }
        col_names: [ temp, psal, pres ]
        summary_stats_names: [ mean, median, sd, pct25, pct75 ]
      - feature: basic_values
        stats_set: { type: raw }
        col_names: [ temp, psal, pres ]
      - feature: flank_up
        flank_up: 5
        stats_set: { type: raw }
        col_names: [ temp, psal, pres ]
      - feature: flank_down
        flank_down: 5
        stats_set: { type: raw }
        col_names: [ temp, psal, pres ]
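The convert option of the day_of_year feature suggests a cyclic encoding of the date. The sketch below shows a common way to do this, so that late December and early January map to nearby values. The function name encode_day_of_year and the 365.25-day period are assumptions; the library's exact formula is not documented on this page.

```python
import math
from datetime import datetime


def encode_day_of_year(timestamp, convert="cosine"):
    """Hypothetical helper: map a timestamp's day of year onto [-1, 1]."""
    day_of_year = timestamp.timetuple().tm_yday
    # Assumption: one full cycle per average calendar year.
    angle = 2.0 * math.pi * day_of_year / 365.25
    if convert == "cosine":
        return math.cos(angle)
    if convert == "sine":
        return math.sin(angle)
    raise ValueError(f"unsupported convert option: {convert!r}")


new_year = encode_day_of_year(datetime(2023, 1, 1))
year_end = encode_day_of_year(datetime(2023, 12, 31))
# The two dates are one day apart, so their encoded values are almost equal.
print(new_year, year_end)
```

A single cosine (or sine) value cannot distinguish every day of the year on its own, because each value occurs twice per cycle.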
feature_stats_sets
(Advanced Use)
This section defines summary statistics that will be used for normalization or scaling of feature values. These statistics are typically derived from your dataset itself to ensure proper scaling.
feature_stats_sets:
- name: feature_set_1_stats_set_1
Important
As it is crucial to normalize features for non-tree based machine learning methods, such as SVM and logistic regression, you need to provide summary statistics (like min/max values) of your data in the configuration file. The aiqclib library offers convenient functions to calculate the summary statistics. Please refer to the Feature Normalization guide for details.
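For readers unfamiliar with min/max normalization, the following minimal sketch shows the underlying arithmetic. The function name min_max_scale and the dictionary layout of the statistics are hypothetical and do not reflect the structure expected in feature_stats_sets; the Feature Normalization guide remains the authoritative reference.

```python
def min_max_scale(value, stats):
    """Hypothetical helper: scale a value into [0, 1] using precomputed min/max."""
    lower, upper = stats["min"], stats["max"]
    if upper == lower:
        # Guard against division by zero for constant columns.
        return 0.0
    return (value - lower) / (upper - lower)


# Illustrative statistics only; real values come from your own data.
temp_stats = {"min": 0.0, "max": 30.0}
print(min_max_scale(15.0, temp_stats))  # 0.5
```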
step_class_sets
(Advanced Use)
This section allows you to define and reference custom Python classes that implement the logic for specific processing steps within the data preparation pipeline. While the library provides default implementations for all steps, this block gives advanced users the flexibility to replace or extend pipeline behaviors with their own code. Each entry maps a step name (e.g., input, summary) to the name of a Python class.
step_class_sets:
  - name: data_set_step_set_1
    steps:
      input: InputDataSetA
      summary: SummaryDataSetA
      select: SelectDataSetAll
      locate: LocateDataSetAll
      extract: ExtractDataSetA
      split: SplitDataSetAll
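The mapping from step names to class names can be pictured as a simple lookup. The sketch below is a conceptual illustration only: StepStub, MyInputDataSet, STEP_CLASSES, and build_steps are all hypothetical names, and the library's real base classes, constructor signatures, and registration mechanism are not shown on this page.

```python
class StepStub:
    """Stand-in for a processing step; not the library's base class."""

    def __init__(self, params):
        self.params = params


class MyInputDataSet(StepStub):
    """A hypothetical custom replacement for the input step."""


# Hypothetical registry that maps class names from the config to classes.
STEP_CLASSES = {"StepStub": StepStub, "MyInputDataSet": MyInputDataSet}


def build_steps(step_class_set, step_param_set):
    """Instantiate one object per step, passing that step's parameters."""
    steps = {}
    for step_name, class_name in step_class_set["steps"].items():
        step_class = STEP_CLASSES[class_name]
        params = step_param_set["steps"].get(step_name, {})
        steps[step_name] = step_class(params)
    return steps


class_set = {"name": "custom_step_set", "steps": {"input": "MyInputDataSet", "split": "StepStub"}}
param_set = {"name": "custom_param_set", "steps": {"split": {"test_set_fraction": 0.1, "k_fold": 5}}}
steps = build_steps(class_set, param_set)
print({name: type(step).__name__ for name, step in steps.items()})
```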
step_param_sets
This section provides general parameters that control the behavior of the data processing steps within the pipeline, whether they use the default classes or custom classes from step_class_sets. Examples of parameters include data filtering rules, sampling ratios, and split configurations.
steps.input.sub_steps.filter_rows: A boolean flag to enable/disable row filtering based on filter_method_dict.
steps.input.filter_method_dict.remove_years: Specifies a list of years to be excluded from the dataset.
steps.input.filter_method_dict.keep_years: Specifies a list of years to be kept for training.
steps.split.test_set_fraction: Defines the proportion of data to allocate to the test set.
steps.split.k_fold: Defines the number of folds (k) used for k-fold cross-validation.
step_param_sets:
  - name: data_set_param_set_1
    steps:
      input: { sub_steps: { rename_columns: false,
                            filter_rows: true },
               rename_dict: { },
               filter_method_dict: { remove_years: [2023],
                                     keep_years: [] } }
      summary: { }
      select: { }
      locate: { }
      extract: { }
      split: { test_set_fraction: 0.1,
               k_fold: 5 }
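The effect of these parameters can be sketched in a few lines of plain Python. The helper names filter_years and split_counts are hypothetical, and two behaviors are assumed rather than documented: an empty keep_years list imposes no restriction, and the training rows left after removing the test fraction are divided as evenly as possible among the k folds.

```python
def filter_years(rows, remove_years, keep_years):
    """Hypothetical helper: apply remove_years, then keep_years if non-empty."""
    kept = [row for row in rows if row["year"] not in remove_years]
    if keep_years:  # assumption: an empty list means "keep everything"
        kept = [row for row in kept if row["year"] in keep_years]
    return kept


def split_counts(n_rows, test_set_fraction, k_fold):
    """Hypothetical helper: test set size and per-fold sizes."""
    n_test = int(n_rows * test_set_fraction)
    n_train = n_rows - n_test
    base, extra = divmod(n_train, k_fold)
    # The first `extra` folds receive one additional row.
    fold_sizes = [base + (1 if i < extra else 0) for i in range(k_fold)]
    return n_test, fold_sizes


rows = [{"year": 2021}, {"year": 2022}, {"year": 2023}, {"year": 2023}]
print(filter_years(rows, remove_years=[2023], keep_years=[]))
print(split_counts(1000, test_set_fraction=0.1, k_fold=5))
```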
data_sets
This is the main “pipeline assembly” section. Each entry in this list defines a complete data preparation job by linking together the named building blocks defined in the other sections. This section essentially orchestrates which specific configuration sets are used for a given dataset processing run.
name: A unique identifier for this particular dataset preparation job (e.g., dataset_0001).
dataset_folder_name: The name of the specific folder that will be created within the common.base_path to store outputs for this job (e.g., dataset_0001).
input_file_name: The specific raw data file (located in input.base_path) to be processed for this job.
path_info: The name of the path configuration to use from path_info_sets.
target_set: The name of the target configuration to use from target_sets.
…and similarly for all other configuration sets.
data_sets:
  - name: dataset_0001
    dataset_folder_name: dataset_0001
    input_file_name: nrt_cora_bo_4.parquet
    path_info: data_set_1
    target_set: target_set_1
    # ... other set references would follow here
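The name-based linking described above can be illustrated with a short sketch that looks up each referenced block by name. The function resolve_data_set and the SET_KEYS table are hypothetical; the key-to-section pairs follow the reference keys used in the full example on this page, but the library's own resolution logic may differ.

```python
# Reference key in a data_sets entry -> section that holds the named blocks.
SET_KEYS = {
    "path_info": "path_info_sets",
    "target_set": "target_sets",
    "summary_stats_set": "summary_stats_sets",
    "feature_set": "feature_sets",
    "feature_param_set": "feature_param_sets",
    "feature_stats_set": "feature_stats_sets",
    "step_class_set": "step_class_sets",
    "step_param_set": "step_param_sets",
}


def resolve_data_set(config, data_set_name):
    """Hypothetical helper: replace each name reference with its full block."""
    entry = next(d for d in config["data_sets"] if d["name"] == data_set_name)
    resolved = {}
    for key, section in SET_KEYS.items():
        if key not in entry:
            continue  # this sketch tolerates partial entries
        matches = [block for block in config.get(section, []) if block["name"] == entry[key]]
        if len(matches) != 1:
            raise KeyError(f"{section} must contain exactly one block named {entry[key]!r}")
        resolved[key] = matches[0]
    return resolved


config = {
    "path_info_sets": [{"name": "data_set_1", "common": {"base_path": "/path/to/data"}}],
    "target_sets": [{"name": "target_set_1", "variables": []}],
    "data_sets": [{"name": "dataset_0001", "path_info": "data_set_1", "target_set": "target_set_1"}],
}
resolved = resolve_data_set(config, "dataset_0001")
print(sorted(resolved))
```

Because blocks are referenced by name, a misspelled name fails at lookup time, which is why the sketch raises an error instead of silently skipping it.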
Note
While you can define multiple data sets in the data_sets section, a specific one must be selected for subsequent processes. Please consult the dedicated Selecting Specific Configurations page for instructions on how to do this.
Full Example
Below is a complete example of a prepare_config.yaml file, demonstrating how all the building blocks are combined. The lines you will most commonly need to edit or customize are highlighted for quick reference.
---
path_info_sets:
  - name: data_set_1
    common:
      base_path: /path/to/data  # Root output directory for processed data
    input:
      base_path: /path/to/input  # Directory containing raw input files
      step_folder_name: ""
    split:
      step_folder_name: training

target_sets:
  - name: target_set_1
    variables:
      - name: temp
        flag: temp_qc
        pos_flag_values: [ 4, 6, 7 ]
        neg_flag_values: [ 1 ]
      - name: psal
        flag: psal_qc
        pos_flag_values: [ 4, 6, 7 ]
        neg_flag_values: [ 1 ]
      - name: pres
        flag: pres_qc
        pos_flag_values: [ 4, 6, 7 ]
        neg_flag_values: [ 1 ]

summary_stats_sets:
  - name: summary_stats_set_1
    stats:
      - name: location
        col_names: [ longitude, latitude ]
      - name: profile_summary_stats
        col_names: [ temp, psal, pres ]
      - name: basic_values3
        col_names: [ temp, psal, pres ]

feature_sets:
  - name: feature_set_1
    features:
      - location
      - day_of_year
      - profile_summary_stats
      - basic_values
      - flank_up
      - flank_down

feature_param_sets:
  - name: feature_set_1_param_set_1
    params:
      - feature: location
        stats_set: { type: raw }
        col_names: [ longitude, latitude ]
      - feature: day_of_year
        convert: cosine  # or sine
        col_names: [ profile_timestamp ]
      - feature: profile_summary_stats
        stats_set: { type: raw }
        col_names: [ temp, psal, pres ]
        summary_stats_names: [ mean, median, sd, pct25, pct75 ]
      - feature: basic_values
        stats_set: { type: raw }
        col_names: [ temp, psal, pres ]
      - feature: flank_up
        flank_up: 5
        stats_set: { type: raw }
        col_names: [ temp, psal, pres ]
      - feature: flank_down
        flank_down: 5
        stats_set: { type: raw }
        col_names: [ temp, psal, pres ]

feature_stats_sets:
  - name: feature_set_1_stats_set_1

step_class_sets:
  - name: data_set_step_set_1
    steps:
      input: InputDataSetA
      summary: SummaryDataSetA
      select: SelectDataSetAll
      locate: LocateDataSetAll
      extract: ExtractDataSetA
      split: SplitDataSetAll

step_param_sets:
  - name: data_set_param_set_1
    steps:
      input: { sub_steps: { rename_columns: false,
                            filter_rows: true },
               rename_dict: { },
               filter_method_dict: { remove_years: [2023],
                                     keep_years: [] } }
      summary: { }
      select: { }
      locate: { }
      extract: { }
      split: { test_set_fraction: 0.1,
               k_fold: 5 }

data_sets:
  - name: dataset_0001  # Your unique name for this dataset job
    dataset_folder_name: dataset_0001  # The folder name for output files
    input_file_name: nrt_cora_bo_4.parquet  # The specific raw input file to process
    path_info: data_set_1
    target_set: target_set_1
    summary_stats_set: summary_stats_set_1
    feature_set: feature_set_1
    feature_param_set: feature_set_1_param_set_1
    feature_stats_set: feature_set_1_stats_set_1
    step_class_set: data_set_step_set_1
    step_param_set: data_set_param_set_1