Feature Normalization
aiqclib uses XGBoost by default. As a tree-based model it does not need
normalized features, because its splits depend only on the ordering of
feature values. Non-tree-based models such as SVM do, so each feature can be
normalized independently by setting stats_set.type in
feature_param_sets.
Normalization methods
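| Type | Formula | Values come from |
|---|---|---|
| raw | no scaling | n/a |
| min_max | (x - min) / (max - min) | values you enter in feature_stats_sets |
| auto_min_max | (x - min) / (max - min) | derived from your data automatically |
| standard | (x - mean) / sd | derived from your data automatically |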
Templates from write_config_template use raw for every feature.
day_of_year is always converted with convert: cosine and is not
affected by stats_set.
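As a rough numerical illustration of the scaling types listed above, the sketch below applies the conventional min-max and standard (z-score) formulas to a few values. The function names and sample values are made up for this example, and the formulas are the textbook definitions, assumed here rather than taken from aiqclib's source; details such as population versus sample standard deviation may differ in the library.

```python
from statistics import mean, pstdev


def min_max_scale(values, lo, hi):
    """Map values linearly so that lo becomes 0.0 and hi becomes 1.0."""
    return [(v - lo) / (hi - lo) for v in values]


def standard_scale(values, mu, sd):
    """Center values on mu and divide by sd (z-score)."""
    return [(v - mu) / sd for v in values]


temp = [2.0, 6.0, 10.0]  # hypothetical temperature values

# min_max: the range is chosen by hand (here 0 to 20).
manual = min_max_scale(temp, lo=0.0, hi=20.0)

# auto_min_max: the range is taken from the data itself.
auto = min_max_scale(temp, lo=min(temp), hi=max(temp))

# standard: mean and standard deviation are taken from the data
# (population standard deviation is an assumption of this sketch).
z = standard_scale(temp, mu=mean(temp), sd=pstdev(temp))

print(manual, auto, z)
```

With a hand-picked range the scaled values need not span the full 0 to 1 interval, whereas a data-derived range always maps the observed minimum and maximum to 0 and 1.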
Automatic normalization
For auto_min_max and standard you do not enter any values. Set the
type on each feature and leave feature_stats_sets as a name-only entry:
feature_param_sets:
  - name: feature_set_1_param_set_1
    params:
      - feature: basic_values
        stats_set: { type: standard, name: basic_values }
        col_names: [ temp, psal, pres ]
      - feature: profile_summary_stats
        stats_set: { type: auto_min_max, name: profile_summary_stats }
        col_names: [ temp, psal, pres ]
        summary_stats_names: [ mean, median, sd, pct25, pct75 ]
feature_stats_sets:
  - name: feature_set_1_stats_set_1   # values are filled in automatically
During prepare the values are computed from the data and written to a
normalization file. During classify that file is read back, so the same
scaling is reapplied without re-entering values or needing the training data.
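The compute-once, reuse-later pattern described above can be sketched in plain Python. This is not aiqclib's implementation: the helper names, the sample data, and the use of a JSON file in a temporary folder (instead of the library's YAML normalization file) are all assumptions made to keep the example self-contained.

```python
import json
import tempfile
from pathlib import Path
from statistics import mean, pstdev


def fit_stats(columns):
    """Compute per-column statistics from the training data."""
    return {
        name: {"min": min(vals), "max": max(vals),
               "mean": mean(vals), "sd": pstdev(vals)}
        for name, vals in columns.items()
    }


def apply_standard(columns, stats):
    """Scale each column with previously stored mean and sd."""
    return {
        name: [(v - stats[name]["mean"]) / stats[name]["sd"] for v in vals]
        for name, vals in columns.items()
    }


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for the normalization file.
    stats_file = Path(tmp) / "normalization_stats.json"

    # "prepare" stage: derive the statistics from training data and save them.
    train = {"temp": [2.0, 6.0, 10.0], "pres": [0.0, 50.0, 100.0]}
    stats_file.write_text(json.dumps(fit_stats(train)))

    # "classify" stage: only the stored statistics are needed, not the
    # training data, so new values get exactly the same scaling.
    new_data = {"temp": [6.0, 14.0], "pres": [50.0, 150.0]}
    stored = json.loads(stats_file.read_text())
    scaled = apply_standard(new_data, stored)

print(scaled)
```

The point of the design is that new data is scaled with the training statistics, never with statistics recomputed from the new data, which would shift the feature values the model was trained on.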
The file defaults to normalization_stats.yaml in a normalize folder
under the dataset. To change the name, set it on a normalize step; its
location can be set under path_info_sets like any other step:
step_param_sets:
  - name: step_param_set_1
    steps:
      normalize: { file_name: normalization_stats.yaml }
Point the classify configuration at the same file produced during
prepare.
Manual normalization (min_max)
Use min_max to set explicit ranges yourself.
1. Generate a template with worked examples. The full template
contains a complete min_max example for every feature.
import aiqclib as aq

aq.write_config_template(file_name="prepare_config.yaml",
                         stage="prepare", extension="full")
aq.write_config_template(file_name="classification_config.yaml",
                         stage="classify", extension="full")
2. Inspect your data to choose ranges.
import aiqclib as aq
input_file = "~/aiqc_project/input/nrt_cora_bo_4.parquet"
print(aq.format_summary_stats(aq.get_summary_stats(input_file, "all")))
print(aq.format_summary_stats(aq.get_summary_stats(input_file, "profiles")))
3. Set the type/name and the values. In feature_param_sets the
name links each feature to a block in feature_stats_sets:
feature_param_sets:
  - name: feature_set_1_param_set_1
    params:
      - feature: location
        stats_set: { type: min_max, name: location }
        col_names: [ longitude, latitude ]
      - feature: basic_values
        stats_set: { type: min_max, name: basic_values3 }
        col_names: [ temp, psal, pres ]
feature_stats_sets:
  - name: feature_set_1_stats_set_1
    min_max:
      - name: location
        stats: { longitude: { min: 14.5, max: 23.5 },
                 latitude: { min: 55, max: 66 } }
      - name: basic_values3
        stats: { temp: { min: 0, max: 20 },
                 psal: { min: 0, max: 20 },
                 pres: { min: 0, max: 200 } }
      # profile_summary_stats uses a nested {variable: {stat: {min, max}}}
      # form; see the full template for the complete example.
Unlike the automatic methods, min_max values are read from each
configuration directly (no file is written), so set the same values in both
the prepare and classify configurations.
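To make the name link and the "same values in both configurations" requirement concrete, here is a small sketch that works on plain Python dictionaries shaped like the YAML above. The helper functions are hypothetical and not part of aiqclib; the configurations are written inline rather than loaded from files, and the scaling uses the textbook min-max formula as an assumption.

```python
import copy


def min_max_block(config, stats_set_name, block_name):
    """Return the min_max stats that a stats_set name points to."""
    for stats_set in config["feature_stats_sets"]:
        if stats_set["name"] != stats_set_name:
            continue
        for block in stats_set.get("min_max", []):
            if block["name"] == block_name:
                return block["stats"]
    raise KeyError(f"no min_max block {block_name!r} in {stats_set_name!r}")


def scale(value, bounds):
    """Textbook min-max scaling with explicit bounds."""
    return (value - bounds["min"]) / (bounds["max"] - bounds["min"])


# Inline stand-in for the prepare configuration.
prepare_cfg = {
    "feature_stats_sets": [{
        "name": "feature_set_1_stats_set_1",
        "min_max": [{
            "name": "location",
            "stats": {"longitude": {"min": 14.5, "max": 23.5},
                      "latitude": {"min": 55, "max": 66}},
        }],
    }]
}

# The classify configuration should carry identical ranges.
classify_cfg = copy.deepcopy(prepare_cfg)

prep = min_max_block(prepare_cfg, "feature_set_1_stats_set_1", "location")
clas = min_max_block(classify_cfg, "feature_set_1_stats_set_1", "location")
same_ranges = prep == clas

# Simulate an edit made in only one configuration to show how drift appears.
drifted_cfg = copy.deepcopy(prepare_cfg)
drifted_cfg["feature_stats_sets"][0]["min_max"][0]["stats"]["latitude"]["max"] = 70
drifted = min_max_block(drifted_cfg, "feature_set_1_stats_set_1", "location")
ranges_drifted = prep != drifted

scaled_lon = scale(19.0, prep["longitude"])  # (19.0 - 14.5) / 9.0

print(same_ranges, ranges_drifted, scaled_lon)
```

A comparison like this can be run before classification to catch ranges that were edited in only one of the two files.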