Down-sampling of the Negative Dataset
=====================================

This guide demonstrates how to control the size and composition of the negative dataset during the data preparation stage.

Methods
-------

``aiqclib`` provides two methods to control the negative dataset:

1. Selection of negative profiles
2. Selection of neighboring observations within positive profiles

Preparation
-----------

The generation of negative data is controlled by a configuration file. Calling ``write_config_template`` with ``extension="reduced"`` produces a template configuration file for controlling the negative dataset.

.. code-block:: python

   import aiqclib as aq
   import os

   config_path = os.path.expanduser("~/aiqc_project/config/prepare_config.yaml")
   aq.write_config_template(
       file_name=config_path,
       stage="prepare",
       extension="reduced"
   )

The ``step_class_sets`` and ``step_param_sets`` sections in this template differ from those in the default template produced by ``extension=""``.

.. code-block:: yaml
   :emphasize-lines: 6, 7, 9

   step_class_sets:
     - name: data_set_step_set_1
       steps:
         input: InputDataSetA
         summary: SummaryDataSetA
         select: SelectDataSetA # Not SelectDataSetAll
         locate: LocateDataSetA # Not LocateDataSetAll
         extract: ExtractDataSetA
         split: SplitDataSetA # Not SplitDataSetAll

.. code-block:: yaml
   :emphasize-lines: 10, 11

   step_param_sets:
     - name: data_set_param_set_1
       steps:
         input: { sub_steps: { rename_columns: false,
                               filter_rows: true },
                  rename_dict: { },
                  filter_method_dict: { remove_years: [],
                                        keep_years: [] } }
         summary: { }
         select: { neg_pos_ratio: 5 }
         locate: { neighbor_n: 5 }
         extract: { }
         split: { test_set_fraction: 0.1,
                  k_fold: 10 }

Selection of Negative Profiles
------------------------------

Positive profiles are selected before negative profiles. A positive profile is a profile that has at least one flagged (bad/invalid) observation.

1. **Profile identification**: Positive profiles (with flagged observations) and negative profiles (without) are identified.
2. **Profile pairing**: Each positive profile is paired with several negative profiles based on date differences for contextual similarity. The number of negative profiles paired with each positive profile is defined by ``neg_pos_ratio`` in the ``step_param_sets`` section.

.. code-block:: yaml
   :emphasize-lines: 10

   step_param_sets:
     - name: data_set_param_set_1
       steps:
         input: { sub_steps: { rename_columns: false,
                               filter_rows: true },
                  rename_dict: { },
                  filter_method_dict: { remove_years: [],
                                        keep_years: [] } }
         summary: { }
         select: { neg_pos_ratio: 5 }
         locate: { neighbor_n: 5 }
         extract: { }
         split: { test_set_fraction: 0.1,
                  k_fold: 10 }

Once the pairs are formed, negative observations are selected by matching observations of similar depth between the two profiles of each pair. A profile pair usually yields one negative observation for each positive observation. For example, ``neg_pos_ratio: 5`` selects five negative profiles for each positive profile, which in turn usually produces five negative observations per positive observation.

Selection of Neighboring Observations
-------------------------------------

Negative observations can also be selected from positive profiles. When ``neighbor_n`` is set in the ``step_param_sets`` section, the observations immediately above and below each positive observation are selected as negative observations, unless they are themselves positive. The number of neighbors taken in each direction is defined by ``neighbor_n``.

.. code-block:: yaml
   :emphasize-lines: 11

   step_param_sets:
     - name: data_set_param_set_1
       steps:
         input: { sub_steps: { rename_columns: false,
                               filter_rows: true },
                  rename_dict: { },
                  filter_method_dict: { remove_years: [],
                                        keep_years: [] } }
         summary: { }
         select: { neg_pos_ratio: 5 }
         locate: { neighbor_n: 5 }
         extract: { }
         split: { test_set_fraction: 0.1,
                  k_fold: 10 }

For example, ``neighbor_n: 5`` selects up to 10 negative observations around a positive observation: up to five upward neighbors and up to five downward neighbors.
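
The neighbor rule can be illustrated with a small stand-alone sketch. This is not the ``aiqclib`` implementation: the function ``select_neighbor_negatives`` and its ``flags`` input are hypothetical names introduced here, and the sketch assumes that neighbors are counted by position within a depth-ordered profile and that overlapping windows around nearby positive observations are merged.

.. code-block:: python

   def select_neighbor_negatives(flags, neighbor_n):
       """Return indices of unflagged observations near flagged ones.

       flags is a depth-ordered list of booleans for a single profile,
       where True marks a positive (flagged) observation. Both the name
       and the input format are assumptions made for this illustration.
       """
       selected = set()
       for pos, flagged in enumerate(flags):
           if not flagged:
               continue
           # Window of neighbor_n positions above and below, clipped
           # to the bounds of the profile.
           lower = max(0, pos - neighbor_n)
           upper = min(len(flags) - 1, pos + neighbor_n)
           for idx in range(lower, upper + 1):
               # Neighbors that are themselves positive are skipped.
               if not flags[idx]:
                   selected.add(idx)
       return sorted(selected)


   # One positive observation in the middle of a 12-level profile.
   flags = [False] * 12
   flags[6] = True
   print(len(select_neighbor_negatives(flags, neighbor_n=5)))  # 10

When the positive observation lies near the top or bottom of the profile, fewer neighbors are available in that direction, which is why the count is an upper bound rather than a fixed number.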