Down-sampling of the Negative Dataset

This guide shows how to down-sample the negative dataset during the data preparation stage.

Methods

aiqclib provides two methods to control the negative dataset:

Selection of negative profiles
Selection of neighboring observations within positive profiles

Preparation

The generation of negative data is controlled by a configuration file. Calling write_config_template with extension="reduced", as shown here, produces a template configuration file for controlling the negative dataset.

import aiqclib as aq
import os

config_path = os.path.expanduser("~/aiqc_project/config/prepare_config.yaml")
aq.write_config_template(
    file_name=config_path,
    stage="prepare",
    extension="reduced"
)

The step_class_sets and step_param_sets sections in this template differ from those in the default template produced by extension="".

step_class_sets:
  - name: data_set_step_set_1
    steps:
      input: InputDataSetA
      summary: SummaryDataSetA
      select: SelectDataSetA   # Not SelectDataSetAll
      locate: LocateDataSetA   # Not LocateDataSetAll
      extract: ExtractDataSetA
      split: SplitDataSetA     # Not SplitDataSetAll

step_param_sets:
  - name: data_set_param_set_1
    steps:
      input: { sub_steps: { rename_columns: false,
                            filter_rows: true },
               rename_dict: { },
               filter_method_dict: { remove_years: [],
                                     keep_years: [] } }
      summary: { }
      select: { neg_pos_ratio: 5 }
      locate: { neighbor_n: 5 }
      extract: { }
      split: { test_set_fraction: 0.1,
               k_fold: 10 }

Selection of Negative Profiles

Positive profiles are identified first, and negative profiles are then selected relative to them. A positive profile is a profile with at least one flagged (bad/invalid) observation.

Profile identification: Positive profiles (with flagged observations) and negative profiles (without) are identified.
Profile pairing: Each positive profile is paired with several negative profiles based on date differences for contextual similarity.

The number of paired negative profiles is defined by neg_pos_ratio in the step_param_sets section.

step_param_sets:
  - name: data_set_param_set_1
    steps:
      input: { sub_steps: { rename_columns: false,
                            filter_rows: true },
               rename_dict: { },
               filter_method_dict: { remove_years: [],
                                     keep_years: [] } }
      summary: { }
      select: { neg_pos_ratio: 5 }
      locate: { neighbor_n: 5 }
      extract: { }
      split: { test_set_fraction: 0.1,
               k_fold: 10 }

Once the pairs are formed, negative observations are taken from each paired negative profile at depths similar to those of the positive observations. Each profile pair usually yields one negative observation per positive observation. For example, neg_pos_ratio: 5 selects five negative profiles for each positive profile, which typically yields five negative observations per positive observation.
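
The pairing step can be pictured with a minimal sketch. This is not aiqclib's implementation: the function name pair_profiles, the tuple layout, and the nearest-date ranking are assumptions made for illustration, based only on the description above.

```python
from datetime import date


def pair_profiles(profiles, neg_pos_ratio):
    """Pair each positive profile with its closest-in-date negative profiles.

    Hypothetical helper, not part of aiqclib.
    profiles: list of (profile_id, observation_date, has_flagged_observation)
    Returns a dict mapping each positive profile ID to a list of negative IDs.
    """
    positives = [p for p in profiles if p[2]]
    negatives = [p for p in profiles if not p[2]]

    pairs = {}
    for pos_id, pos_date, _ in positives:
        # Rank negative profiles by absolute date difference (assumed rule).
        ranked = sorted(negatives, key=lambda n: abs((n[1] - pos_date).days))
        # Keep at most neg_pos_ratio negative profiles per positive profile.
        pairs[pos_id] = [n[0] for n in ranked[:neg_pos_ratio]]
    return pairs


profiles = [
    ("p1", date(2020, 1, 10), True),
    ("n1", date(2020, 1, 12), False),
    ("n2", date(2020, 3, 1), False),
    ("n3", date(2019, 12, 30), False),
]
pairs = pair_profiles(profiles, neg_pos_ratio=2)
print(pairs)  # {'p1': ['n1', 'n3']}
```

In this sketch a negative profile may be paired with more than one positive profile; whether aiqclib allows such reuse is not stated here.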

Selection of Neighboring Observations

Negative observations can also be selected from positive profiles. When positive observations are identified and neighbor_n is set in the step_param_sets section, several upward and downward neighboring observations are selected unless they are also positive observations.

The number of neighboring negative observations is defined by neighbor_n in the step_param_sets section.

step_param_sets:
  - name: data_set_param_set_1
    steps:
      input: { sub_steps: { rename_columns: false,
                            filter_rows: true },
               rename_dict: { },
               filter_method_dict: { remove_years: [],
                                     keep_years: [] } }
      summary: { }
      select: { neg_pos_ratio: 5 }
      locate: { neighbor_n: 5 }
      extract: { }
      split: { test_set_fraction: 0.1,
               k_fold: 10 }

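For example, neighbor_n: 5 selects up to 10 negative observations around each positive observation: up to five upward neighbors and up to five downward neighbors. Fewer are selected when a neighbor is itself a positive observation or when the profile ends.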
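
The neighbor rule can be illustrated with a short sketch. It assumes the observations of one profile are ordered by depth and represented as a list of booleans (True for a positive observation); the function name select_neighbors and this representation are hypothetical and are not taken from aiqclib.

```python
def select_neighbors(flags, neighbor_n):
    """Return indices of negative observations near positive observations.

    Hypothetical helper, not part of aiqclib.
    flags: list of bool ordered by depth; True marks a positive observation.
    A negative observation is selected when it lies within neighbor_n
    positions above or below any positive observation.
    """
    selected = set()
    for i, is_positive in enumerate(flags):
        if not is_positive:
            continue
        # Window of neighbor_n positions on each side, clipped to the profile.
        start = max(0, i - neighbor_n)
        stop = min(len(flags), i + neighbor_n + 1)
        for j in range(start, stop):
            if not flags[j]:  # skip neighbors that are themselves positive
                selected.add(j)
    return sorted(selected)


flags = [False, False, True, True, False, False, False, False]
neighbors = select_neighbors(flags, neighbor_n=2)
print(neighbors)  # [0, 1, 4, 5]
```

Because the two positive observations are adjacent, their windows overlap and each negative neighbor is counted once, which is one reason the total can fall below 2 * neighbor_n per positive observation.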