Configuration of Dataset Preparation
======================================
The ``prepare`` workflow (``stage="prepare"``) is central to setting up your data for machine learning tasks within this library. It controls the entire data processing pipeline, from preparing feature data sets from your raw data to creating the training, validation, and test data sets.

Core Concepts: Modular Configuration
------------------------------------
The configuration for dataset preparation is built from reusable "building blocks". Instead of writing one monolithic configuration, you define named sets of specialized settings once and then combine them as needed to construct a complete data processing pipeline. This modularity promotes reuse, simplifies experimentation, and eases maintenance.

The primary configuration sections (building blocks) are:

*   **path_info_sets**: Defines reusable directory structures for input data and processed outputs.
*   **target_sets**: Specifies the prediction target variables, including their quality control (QC) flags.
*   **summary_stats_sets**: Configures the summary statistics that are used as feature values or for feature normalization.
*   **feature_sets**: (**Advanced**) Lists the specific feature engineering methods to be applied.
*   **feature_param_sets**: Provides detailed parameters and settings for each chosen feature engineering method.
*   **feature_stats_sets**: (**Advanced**) Provides summary statistics values for normalizing features.
*   **step_class_sets**: (**Advanced**) Allows users to define custom Python classes for individual processing steps, enabling deep customization of the pipeline's behavior.
*   **step_param_sets**: Supplies general parameters that control the behavior of the default or custom processing steps.
*   **data_sets**: The central assembly section, where you combine named blocks from the sections above to define a complete and executable data processing pipeline.

.. note::

   ``aiqclib`` provides methods to down-sample the negative data set. Please refer to the :doc:`../../how-to/down_sampling_negative` guide for details.

Detailed Configuration Sections
-------------------------------

`path_info_sets`
^^^^^^^^^^^^^^^^
This section defines the critical file system locations for both your raw input data and the various processed output artifacts. You can define multiple named path configurations to easily switch between different storage environments or project setups.

*   **common.base_path**: The root directory where all processed data and intermediate artifacts will be saved by this workflow.
*   **input.base_path**: The directory containing your raw input data files.
*   **split.step_folder_name**: The name of the subdirectory where the final training, validation, and test datasets will be stored (e.g., ``training``).

.. code-block:: yaml

   path_info_sets:
     - name: data_set_1
       common:
         base_path: /path/to/data
       input:
         base_path: /path/to/input
         step_folder_name: ""
       split:
         step_folder_name: training
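
The following is a minimal sketch of how these settings might combine into an output directory. It is not the library's implementation: the helper name ``split_output_dir`` is hypothetical, and the nesting order (base path, then dataset folder, then step folder) is an assumption based on the descriptions on this page.

.. code-block:: python

   from pathlib import PurePosixPath


   def split_output_dir(path_info, dataset_folder_name):
       """Build the directory for the final data sets (assumed layout)."""
       base = PurePosixPath(path_info["common"]["base_path"])
       # Assumption: the step folder is nested inside the dataset folder.
       return base / dataset_folder_name / path_info["split"]["step_folder_name"]


   path_info = {
       "name": "data_set_1",
       "common": {"base_path": "/path/to/data"},
       "input": {"base_path": "/path/to/input", "step_folder_name": ""},
       "split": {"step_folder_name": "training"},
   }

   print(split_output_dir(path_info, "dataset_0001"))
   # /path/to/data/dataset_0001/training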

`target_sets`
^^^^^^^^^^^^^
This section specifies the target variables that your machine learning model will predict. For each target variable, you must also define its corresponding quality control (QC) flag column. These flags are crucial for identifying good versus bad data points, allowing the pipeline to filter or weight data appropriately. You define both positive (good) and negative (bad) flag values.

.. code-block:: yaml

   target_sets:
     - name: target_set_1
       variables:
         - name: temp
           flag: temp_qc
           pos_flag_values: [ 4, 6, 7 ]
           neg_flag_values: [ 1 ]
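
To make the role of the two flag lists concrete, the sketch below turns a QC flag column into binary labels. The function ``label_rows`` is hypothetical, and dropping rows whose flag appears in neither list is an assumption, not documented behavior.

.. code-block:: python

   def label_rows(flags, pos_flag_values, neg_flag_values):
       """Return (row_index, label) pairs: 1 = positive, 0 = negative."""
       pos, neg = set(pos_flag_values), set(neg_flag_values)
       labelled = []
       for index, flag in enumerate(flags):
           if flag in pos:
               labelled.append((index, 1))
           elif flag in neg:
               labelled.append((index, 0))
           # Assumption: flags in neither list are left out entirely.
       return labelled


   temp_qc = [1, 4, 2, 1, 7]
   print(label_rows(temp_qc, pos_flag_values=[4, 6, 7], neg_flag_values=[1]))
   # [(0, 0), (1, 1), (3, 0), (4, 1)]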

`summary_stats_sets`
^^^^^^^^^^^^^^^^^^^^
This section defines the summary statistics that are used as feature values or for feature normalization.

.. code-block:: yaml

   summary_stats_sets:
     - name: summary_stats_set_1
       stats:
         - name: location
           col_names: [ longitude, latitude ]
         - name: profile_summary_stats
           col_names: [ temp, psal, pres ]
         - name: basic_values3
           col_names: [ temp, psal, pres ]

``aiqclib`` currently provides the following summary statistics.

*   **location**: Global summary statistics of locations, used for feature normalization.
*   **profile_summary_stats**: Profile-level summary statistics, used as features and for feature normalization.
*   **basic_values3**: Global summary statistics of the specified variables, used for feature normalization.
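
As an illustration of what profile-level summary statistics could look like, the sketch below groups observations by profile and computes the five statistics named in the feature parameters on this page. The column name ``profile_id``, the use of the sample standard deviation, and the quantile method are all assumptions; the library may compute these values differently.

.. code-block:: python

   import statistics
   from collections import defaultdict


   def profile_summary_stats(rows, col_name):
       """Compute per-profile statistics for one column (illustrative only)."""
       by_profile = defaultdict(list)
       for row in rows:
           by_profile[row["profile_id"]].append(row[col_name])

       summary = {}
       for profile_id, values in by_profile.items():
           # quantiles(n=4) returns the three quartile cut points.
           pct25, _, pct75 = statistics.quantiles(values, n=4)
           summary[profile_id] = {
               "mean": statistics.fmean(values),
               "median": statistics.median(values),
               "sd": statistics.stdev(values),  # assumption: sample SD
               "pct25": pct25,
               "pct75": pct75,
           }
       return summary


   rows = [{"profile_id": "p1", "temp": t} for t in (10.0, 12.0, 14.0, 16.0)]
   rows += [{"profile_id": "p2", "temp": t} for t in (2.0, 4.0, 6.0, 8.0)]
   stats = profile_summary_stats(rows, "temp")
   print(stats["p1"]["mean"], stats["p2"]["median"])  # 13.0 5.0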

`feature_sets` & `feature_param_sets`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
These two interconnected sections are dedicated to configuring your feature engineering process.

*   **feature_sets**: This block lists the *names* of the specific feature engineering methods you want to apply to your data.
*   **feature_param_sets**: This block provides the detailed parameters and configurations for each of the feature methods listed in your chosen ``feature_sets`` block. This allows for fine-grained control over how each feature is generated.

.. code-block:: yaml

   # A list of features to apply
   feature_sets:
     - name: feature_set_1
       features:
         - location
         - day_of_year
         - profile_summary_stats
         - basic_values
         - flank_up
         - flank_down

   # Parameters for the features listed above
   feature_param_sets:
     - name: feature_set_1_param_set_1
       params:
         - feature: location
           stats_set: { type: raw }
           col_names: [ longitude, latitude ]
         - feature: day_of_year
           convert: cosine                         # or sine
           col_names: [ profile_timestamp ]
         - feature: profile_summary_stats
           stats_set: { type: raw }
           col_names: [ temp, psal, pres ]
           summary_stats_names: [ mean, median, sd, pct25, pct75 ]
         - feature: basic_values
           stats_set: { type: raw }
           col_names: [ temp, psal, pres ]
         - feature: flank_up
           flank_up: 5
           stats_set: { type: raw }
           col_names: [ temp, psal, pres ]
         - feature: flank_down
           flank_down: 5
           stats_set: { type: raw }
           col_names: [ temp, psal, pres ]
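
The ``convert`` option of ``day_of_year`` suggests a cyclical encoding of the date. The sketch below shows one common way to do this; the function name and the year length of 365.25 days are assumptions, and the library's exact formula may differ.

.. code-block:: python

   import math
   from datetime import datetime


   def day_of_year_feature(timestamp, convert="cosine"):
       """Map a timestamp to a cyclical day-of-year value in [-1, 1]."""
       day = timestamp.timetuple().tm_yday
       angle = 2.0 * math.pi * day / 365.25  # assumed year length
       return math.cos(angle) if convert == "cosine" else math.sin(angle)


   # 31 December and 1 January map to nearly the same value,
   # which a raw day number (365 versus 1) would not achieve.
   new_year = day_of_year_feature(datetime(2023, 1, 1))
   year_end = day_of_year_feature(datetime(2023, 12, 31))
   mid_year = day_of_year_feature(datetime(2023, 7, 2))
   print(abs(new_year - year_end) < 0.01, mid_year < -0.99)  # True True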

`feature_stats_sets`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(**Advanced Use**)

This section defines summary statistics that will be used for normalization or scaling of feature values. These statistics are typically derived from your dataset itself to ensure proper scaling.

.. code-block:: yaml

   feature_stats_sets:
     - name: feature_set_1_stats_set_1

.. important::

   Non-tree-based machine learning methods, such as SVM and logistic regression, require normalized features, so you need to provide summary statistics (such as min/max values) of your data in the configuration file. The ``aiqclib`` library offers functions to calculate these summary statistics. Please refer to the :doc:`../../how-to/feature_normalization` guide for details.
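
For orientation, min-max scaling with precomputed statistics can be sketched as follows. The dictionary layout of ``feature_stats`` is hypothetical and does not reflect the actual ``feature_stats_sets`` schema, which is covered in the normalization guide.

.. code-block:: python

   def min_max_scale(values, min_value, max_value):
       """Scale values to [0, 1] using precomputed min/max statistics."""
       span = max_value - min_value
       if span == 0:
           # A constant column carries no information; map it to 0.
           return [0.0 for _ in values]
       return [(value - min_value) / span for value in values]


   # Hypothetical statistics, for illustration only.
   feature_stats = {"temp": {"min": 0.0, "max": 20.0}}

   scaled = min_max_scale(
       [0.0, 5.0, 20.0],
       feature_stats["temp"]["min"],
       feature_stats["temp"]["max"],
   )
   print(scaled)  # [0.0, 0.25, 1.0]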

`step_class_sets`
^^^^^^^^^^^^^^^^^
(**Advanced Use**)

This section allows you to define and reference custom Python classes that implement the logic for specific processing steps within the data preparation pipeline. While the library provides default implementations for all steps, this block gives advanced users the flexibility to replace or extend pipeline behaviors with their own code. Each entry maps a step name (e.g., ``input``, ``summary``) to the name of a Python class.

.. code-block:: yaml

   step_class_sets:
     - name: data_set_step_set_1
       steps:
         input: InputDataSetA
         summary: SummaryDataSetA
         select: SelectDataSetAll
         locate: LocateDataSetAll
         extract: ExtractDataSetA
         split: SplitDataSetAll
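
The idea of mapping step names to class names can be sketched with a simple registry. Everything in this example is hypothetical, including the class names, the ``run`` method, and the ``build_steps`` helper; it only illustrates the lookup pattern and says nothing about the base classes or interfaces the library actually expects.

.. code-block:: python

   class PassThroughInputStep:
       """Hypothetical default step: returns rows unchanged."""

       def run(self, rows):
           return rows


   class DropEmptyRowsInputStep(PassThroughInputStep):
       """Hypothetical custom step: removes empty rows."""

       def run(self, rows):
           return [row for row in rows if row]


   # Registry from class name (as written in the YAML) to the class itself.
   STEP_CLASSES = {
       cls.__name__: cls
       for cls in (PassThroughInputStep, DropEmptyRowsInputStep)
   }


   def build_steps(step_class_set):
       """Instantiate one object per configured step name."""
       steps = {}
       for step_name, class_name in step_class_set["steps"].items():
           if class_name not in STEP_CLASSES:
               raise KeyError(f"Unknown class for step '{step_name}': {class_name}")
           steps[step_name] = STEP_CLASSES[class_name]()
       return steps


   steps = build_steps(
       {"name": "custom_step_set", "steps": {"input": "DropEmptyRowsInputStep"}}
   )
   print(steps["input"].run([{"temp": 1.0}, {}]))  # [{'temp': 1.0}]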

`step_param_sets`
^^^^^^^^^^^^^^^^^
This section provides general parameters that control the behavior of the data processing steps within the pipeline, whether they are implemented by the default classes or by custom ones from ``step_class_sets``. Examples of parameters include data filtering rules, sampling ratios, and split configurations.

*   **steps.input.sub_steps.filter_rows**: A boolean flag to enable/disable row filtering based on ``filter_method_dict``.
*   **steps.input.filter_method_dict.remove_years**: Specifies a list of years to be excluded from the dataset.
*   **steps.input.filter_method_dict.keep_years**: Specifies a list of years to be kept for training.
*   **steps.split.test_set_fraction**: Defines the proportion of data to allocate to the test set.
*   **steps.split.k_fold**: Defines the number of folds ``k`` used in k-fold cross-validation.

.. code-block:: yaml

   step_param_sets:
     - name: data_set_param_set_1
       steps:
         input: { sub_steps: { rename_columns: false,
                               filter_rows: true },
                  rename_dict: { },
                  filter_method_dict: { remove_years: [2023],
                                        keep_years: [] } }
         summary: { }
         select: { }
         locate: { }
         extract: { }
         split: { test_set_fraction: 0.1,
                  k_fold: 5 }
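
The sketch below illustrates what the year filter and the split parameters could mean in practice. It is an assumption-laden illustration: the helper names, the ``year`` field, the rule that an empty ``keep_years`` list keeps all remaining years, and the random shuffling are not taken from the library.

.. code-block:: python

   import random


   def filter_years(rows, remove_years=(), keep_years=()):
       """Drop removed years; if keep_years is non-empty, keep only those."""
       kept = []
       for row in rows:
           if row["year"] in remove_years:
               continue
           # Assumption: an empty keep_years list means "no restriction".
           if keep_years and row["year"] not in keep_years:
               continue
           kept.append(row)
       return kept


   def split_rows(rows, test_set_fraction, k_fold, seed=0):
       """Hold out a test set, then deal the rest into k folds."""
       shuffled = list(rows)
       random.Random(seed).shuffle(shuffled)
       n_test = round(len(shuffled) * test_set_fraction)
       test, train = shuffled[:n_test], shuffled[n_test:]
       folds = [train[i::k_fold] for i in range(k_fold)]
       return test, folds


   rows = [{"id": i, "year": 2020 + i % 4} for i in range(40)]
   rows = filter_years(rows, remove_years=[2023], keep_years=[])
   test, folds = split_rows(rows, test_set_fraction=0.1, k_fold=5)
   print(len(rows), len(test), [len(fold) for fold in folds])
   # 30 3 [6, 6, 5, 5, 5]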

`data_sets`
^^^^^^^^^^^
This is the main "pipeline assembly" section. Each entry in this list defines a complete data preparation job by linking together the named building blocks defined in the other sections. This section essentially orchestrates which specific configuration sets are used for a given dataset processing run.

*   **name**: A unique identifier for this particular dataset preparation job (e.g., ``dataset_0001``).
*   **dataset_folder_name**: The name of the specific folder that will be created within the ``common.base_path`` to store outputs for this job (e.g., ``dataset_0001``).
*   **input_file_name**: The specific raw data file (located in ``input.base_path``) to be processed for this job.
*   **path_info**: The ``name`` of the path configuration to use from ``path_info_sets``.
*   **target_set**: The ``name`` of the target configuration to use from ``target_sets``.
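*   **summary_stats_set**, **feature_set**, **feature_param_set**, **feature_stats_set**, **step_class_set**, **step_param_set**: The ``name`` of the block to use from the corresponding section.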

.. code-block:: yaml

   data_sets:
     - name: dataset_0001
       dataset_folder_name: dataset_0001
       input_file_name: nrt_cora_bo_4.parquet
       path_info: data_set_1
       target_set: target_set_1
       # ... other set references would follow here
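
The name-based linking can be pictured as a lookup from each reference key to the matching named block. The sketch below works on a plain dictionary (as obtained from parsing the YAML file) and is not the library's loader; the helper ``resolve_data_set`` is hypothetical, while the reference keys are the ones used in the full example on this page.

.. code-block:: python

   # Reference key in a data_sets entry -> section holding the named blocks.
   REFERENCE_KEYS = {
       "path_info": "path_info_sets",
       "target_set": "target_sets",
       "summary_stats_set": "summary_stats_sets",
       "feature_set": "feature_sets",
       "feature_param_set": "feature_param_sets",
       "feature_stats_set": "feature_stats_sets",
       "step_class_set": "step_class_sets",
       "step_param_set": "step_param_sets",
   }


   def resolve_data_set(config, data_set_name):
       """Replace each referenced name with the block it points to."""
       entry = next(d for d in config["data_sets"] if d["name"] == data_set_name)
       resolved = dict(entry)
       for key, section in REFERENCE_KEYS.items():
           if key not in entry:
               continue
           blocks = {block["name"]: block for block in config.get(section, [])}
           if entry[key] not in blocks:
               raise KeyError(f"'{entry[key]}' is not defined in {section}")
           resolved[key] = blocks[entry[key]]
       return resolved


   config = {
       "path_info_sets": [
           {"name": "data_set_1", "common": {"base_path": "/path/to/data"}}
       ],
       "target_sets": [{"name": "target_set_1", "variables": []}],
       "data_sets": [
           {
               "name": "dataset_0001",
               "dataset_folder_name": "dataset_0001",
               "path_info": "data_set_1",
               "target_set": "target_set_1",
           }
       ],
   }

   resolved = resolve_data_set(config, "dataset_0001")
   print(resolved["path_info"]["common"]["base_path"])  # /path/to/data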

.. note::

   While you can define multiple data sets in the ``data_sets`` section, a specific one must be selected for subsequent processes. Please consult the dedicated :doc:`../../how-to/selecting_specific_configurations` page for instructions on how to do this.

Full Example
------------

Below is a complete example of a ``prepare_config.yaml`` file, demonstrating how all the building blocks are combined. The lines you will most commonly need to edit or customize are highlighted for quick reference.

.. code-block:: yaml
   :caption: Full prepare_config.yaml example
   :emphasize-lines: 5, 7, 65, 69, 90, 92, 93, 98, 99, 102, 103, 104

   ---
   path_info_sets:
     - name: data_set_1
       common:
         base_path: /path/to/data # Root output directory for processed data
       input:
         base_path: /path/to/input # Directory containing raw input files
         step_folder_name: ""
       split:
         step_folder_name: training

   target_sets:
     - name: target_set_1
       variables:
         - name: temp
           flag: temp_qc
           pos_flag_values: [ 4, 6, 7 ]
           neg_flag_values: [ 1 ]
         - name: psal
           flag: psal_qc
           pos_flag_values: [ 4, 6, 7 ]
           neg_flag_values: [ 1 ]
         - name: pres
           flag: pres_qc
           pos_flag_values: [ 4, 6, 7 ]
           neg_flag_values: [ 1 ]

   summary_stats_sets:
     - name: summary_stats_set_1
       stats:
         - name: location
           col_names: [ longitude, latitude ]
         - name: profile_summary_stats
           col_names: [ temp, psal, pres ]
         - name: basic_values3
           col_names: [ temp, psal, pres ]

   feature_sets:
     - name: feature_set_1
       features:
         - location
         - day_of_year
         - profile_summary_stats
         - basic_values
         - flank_up
         - flank_down

   feature_param_sets:
     - name: feature_set_1_param_set_1
       params:
         - feature: location
           stats_set: { type: raw }
           col_names: [ longitude, latitude ]
         - feature: day_of_year
           convert: cosine                         # or sine
           col_names: [ profile_timestamp ]
         - feature: profile_summary_stats
           stats_set: { type: raw }
           col_names: [ temp, psal, pres ]
           summary_stats_names: [ mean, median, sd, pct25, pct75 ]
         - feature: basic_values
           stats_set: { type: raw }
           col_names: [ temp, psal, pres ]
         - feature: flank_up
           flank_up: 5
           stats_set: { type: raw }
           col_names: [ temp, psal, pres ]
         - feature: flank_down
           flank_down: 5
           stats_set: { type: raw }
           col_names: [ temp, psal, pres ]

   feature_stats_sets:
     - name: feature_set_1_stats_set_1

   step_class_sets:
     - name: data_set_step_set_1
       steps:
         input: InputDataSetA
         summary: SummaryDataSetA
         select: SelectDataSetAll
         locate: LocateDataSetAll
         extract: ExtractDataSetA
         split: SplitDataSetAll

   step_param_sets:
     - name: data_set_param_set_1
       steps:
         input: { sub_steps: { rename_columns: false,
                               filter_rows: true },
                  rename_dict: { },
                  filter_method_dict: { remove_years: [2023],
                                        keep_years: [] } }
         summary: { }
         select: { }
         locate: { }
         extract: { }
         split: { test_set_fraction: 0.1,
                  k_fold: 5 }

   data_sets:
     - name: dataset_0001  # Your unique name for this dataset job
       dataset_folder_name: dataset_0001  # The folder name for output files
       input_file_name: nrt_cora_bo_4.parquet # The specific raw input file to process
       path_info: data_set_1
       target_set: target_set_1
       summary_stats_set: summary_stats_set_1
       feature_set: feature_set_1
       feature_param_set: feature_set_1_param_set_1
       feature_stats_set: feature_set_1_stats_set_1
       step_class_set: data_set_step_set_1
       step_param_set: data_set_param_set_1