Down-sampling of the Negative Dataset

This guide demonstrates how to control the negative dataset during the data preparation stage.

Methods

aiqclib provides two methods to control the negative dataset:

  1. Selection of negative profiles

  2. Selection of neighboring observations within positive profiles

Preparation

The generation of negative data is controlled by a configuration file. Calling write_config_template with extension="reduced", as shown in the following command, produces a template configuration file for controlling the negative dataset.

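import aiqclib as aq
import os

# Path of the configuration file for the "prepare" stage.
config_path = os.path.expanduser("~/aiqc_project/config/prepare_config.yaml")

# extension="reduced" writes the template used to control the negative dataset.
aq.write_config_template(
    file_name=config_path,
    stage="prepare",
    extension="reduced"
)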

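The step_class_sets and step_param_sets sections in this template differ from those in the default template produced by extension="". The select, locate, and split steps use different classes, and the select and locate steps take the parameters neg_pos_ratio and neighbor_n.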

step_class_sets:
  - name: data_set_step_set_1
    steps:
      input: InputDataSetA
      summary: SummaryDataSetA
      select: SelectDataSetA   # Not SelectDataSetAll
      locate: LocateDataSetA   # Not LocateDataSetAll
      extract: ExtractDataSetA
      split: SplitDataSetA     # Not SplitDataSetAll
step_param_sets:
  - name: data_set_param_set_1
    steps:
      input: { sub_steps: { rename_columns: false,
                            filter_rows: true },
               rename_dict: { },
               filter_method_dict: { remove_years:,
                                     keep_years: [] } }
      summary: { }
      select: { neg_pos_ratio: 5 }
      locate: { neighbor_n: 5 }
      extract: { }
      split: { test_set_fraction: 0.1,
               k_fold: 10 }

Selection of Negative Profiles

Positive profiles are selected before negative profiles. A positive profile is a profile that has at least one flagged (bad/invalid) observation. The selection proceeds in two steps:

  1. Profile identification: Positive profiles (with flagged observations) and negative profiles (without) are identified.

  2. Profile pairing: Each positive profile is paired with several negative profiles, chosen by date difference so that the paired profiles are contextually similar.

The number of paired negative profiles is defined by neg_pos_ratio in the step_param_sets section.

step_param_sets:
  - name: data_set_param_set_1
    steps:
      input: { sub_steps: { rename_columns: false,
                            filter_rows: true },
               rename_dict: { },
               filter_method_dict: { remove_years:,
                                     keep_years: [] } }
      summary: { }
      select: { neg_pos_ratio: 5 }
      locate: { neighbor_n: 5 }
      extract: { }
      split: { test_set_fraction: 0.1,
               k_fold: 10 }

Once the pairs are formed, negative observations are selected from each paired negative profile at depths similar to those of the positive observations. Each profile pair usually yields one negative observation for each positive observation. For example, neg_pos_ratio: 5 selects five negative profiles for each positive profile, which usually yields five negative observations per positive observation.
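
The pairing and depth matching described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the aiqclib implementation: the helper names pair_negative_profiles and nearest_depth_index are hypothetical, "date difference" is assumed to mean the absolute number of days between profiles, and a negative profile is allowed to be paired with more than one positive profile.

```python
from datetime import date


def pair_negative_profiles(positive_dates, negative_dates, neg_pos_ratio):
    """Pair each positive profile with its closest-in-date negative profiles.

    Hypothetical helper for illustration only; not part of aiqclib.
    """
    pairs = {}
    for pos_id, pos_date in positive_dates.items():
        # Rank negative profiles by absolute date difference (ties broken by id).
        ranked = sorted(
            negative_dates,
            key=lambda neg_id: (abs((negative_dates[neg_id] - pos_date).days), neg_id),
        )
        # Keep only the neg_pos_ratio closest negative profiles.
        pairs[pos_id] = ranked[:neg_pos_ratio]
    return pairs


def nearest_depth_index(depths, target_depth):
    """Return the index of the observation whose depth is closest to target_depth."""
    return min(range(len(depths)), key=lambda i: abs(depths[i] - target_depth))


positive_dates = {"P1": date(2020, 6, 15)}
negative_dates = {
    "N1": date(2020, 6, 14),
    "N2": date(2020, 6, 20),
    "N3": date(2020, 1, 1),
    "N4": date(2020, 6, 16),
    "N5": date(2021, 6, 15),
}

pairs = pair_negative_profiles(positive_dates, negative_dates, neg_pos_ratio=3)

# A positive observation at 50 m is matched to the closest depth in a negative profile.
matched = nearest_depth_index([10.0, 45.0, 60.0, 100.0], target_depth=50.0)

print(pairs)    # {'P1': ['N1', 'N4', 'N2']}
print(matched)  # 1
```

With neg_pos_ratio=3, the single positive profile is paired with three negative profiles, so one flagged observation would yield three negative observations under these assumptions.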

Selection of Neighboring Observations

Negative observations can also be selected from positive profiles. When positive observations are identified and neighbor_n is set in the step_param_sets section, the neighboring observations immediately above and below each positive observation are selected as negative observations, unless they are themselves positive.

The number of neighboring negative observations is defined by neighbor_n in the step_param_sets section.

step_param_sets:
  - name: data_set_param_set_1
    steps:
      input: { sub_steps: { rename_columns: false,
                            filter_rows: true },
               rename_dict: { },
               filter_method_dict: { remove_years:,
                                     keep_years: [] } }
      summary: { }
      select: { neg_pos_ratio: 5 }
      locate: { neighbor_n: 5 }
      extract: { }
      split: { test_set_fraction: 0.1,
               k_fold: 10 }

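For example, neighbor_n: 5 selects up to 10 negative observations around each positive observation: up to five upward neighbors and up to five downward neighbors. Fewer are selected when a neighbor is itself positive or when the positive observation lies near the top or bottom of the profile.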
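
The neighbor selection can be sketched as follows. This is a minimal illustration rather than the aiqclib implementation: the function select_neighbors is hypothetical, observations are assumed to be ordered by depth, and a neighbor that is itself positive is assumed to be skipped rather than replaced by a more distant observation.

```python
def select_neighbors(flags, neighbor_n):
    """Return indices of negative observations near positive observations.

    flags is a list of booleans ordered by depth, where True marks a
    positive (flagged) observation. Hypothetical helper; not part of aiqclib.
    """
    positives = {i for i, flagged in enumerate(flags) if flagged}
    selected = set()
    for i in positives:
        # Window of neighbor_n observations above and below, clipped to the profile.
        first = max(0, i - neighbor_n)
        last = min(len(flags) - 1, i + neighbor_n)
        for j in range(first, last + 1):
            # Positive observations are never selected as negatives.
            if j not in positives:
                selected.add(j)
    return sorted(selected)


# One positive observation in the middle of a 12-observation profile.
flags_middle = [False] * 12
flags_middle[6] = True
neighbors_middle = select_neighbors(flags_middle, neighbor_n=5)

# Two adjacent positive observations near the top of a 6-observation profile.
flags_top = [False, True, True, False, False, False]
neighbors_top = select_neighbors(flags_top, neighbor_n=2)

print(neighbors_middle)  # [1, 2, 3, 4, 5, 7, 8, 9, 10, 11]
print(neighbors_top)     # [0, 3, 4]
```

The first case reaches the maximum of 10 neighbors for neighbor_n=5. The second shows how the count drops when positives are adjacent or close to the edge of the profile.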