Profile Summary Statistics

The profile_summary_stats feature is a profile-level feature that represents the summary statistics of specified variables. All observations belonging to the same profile generally have the same profile_summary_stats feature values. The profile_summary_stats feature can contain the following nine statistics:

  1. min: minimum

  2. max: maximum

  3. mean: mean

  4. median: median

  5. pct25: 25th percentile

  6. pct75: 75th percentile

  7. pct2.5: 2.5th percentile

  8. pct97.5: 97.5th percentile

  9. sd: standard deviation

Configuration: Summary Statistics

The profile_summary_stats feature requires the calculation of summary statistics prior to feature extraction. This can be specified in the summary_stats_sets section of a configuration file. The variables used for the feature should be specified in col_names.

summary_stats_sets:
  - name: summary_stats_set_1
    stats:
      - name: profile_summary_stats
        col_names: [ temp, psal, pres ]

Configuration: Setup

To include the profile_summary_stats feature in your training and classification datasets, the value profile_summary_stats needs to be specified in the feature_sets section.

feature_sets:
  - name: feature_set_1
    features:
      - profile_summary_stats

Configuration: Parameters

The profile_summary_stats feature requires three mandatory parameters: col_names, summary_stats_names, and stats_set.

  • The col_names parameter specifies the column names in the input dataset that will be used for the profile_summary_stats feature.

  • The summary_stats_names parameter specifies the names of the summary statistics to be used as features.

  • The stats_set parameter specifies how the feature values are normalized. aiqclib currently supports raw and min_max as normalization methods. The name value in stats_set must correspond to a name in the feature_stats_sets section.

feature_param_sets:
  - name: feature_set_1_param_set_1
    params:
      - feature: profile_summary_stats
        col_names: [ temp, psal, pres ]
        summary_stats_names: [ mean, median, sd, pct25, pct75 ]
        stats_set: { type: min_max, name: profile_summary_stats }

Configuration: Normalization

If the normalization method is not set to raw, the summary statistics specified here will be used for normalization.

feature_stats_sets:
  - name: feature_set_1_stats_set_1
    min_max:
      - name: profile_summary_stats
        stats: { temp: { mean: { min: 0, max: 12.5 },
                         median: { min: 0, max: 15 },
                         sd: { min: 0, max: 6.5 },
                         pct25: { min: 0, max: 12 },
                         pct75: { min: 1, max: 19 } },
                 psal: { mean: { min: 2.9, max: 12 },
                         median: { min: 2.9, max: 12 },
                         sd: { min: 0, max: 4 },
                         pct25: { min: 2.5, max: 8.5 },
                         pct75: { min: 3, max: 16 } },
                 pres: { mean: { min: 24, max: 105 },
                         median: { min: 24, max: 105 },
                         sd: { min: 13, max: 60 },
                         pct25: { min: 12, max: 53 },
                         pct75: { min: 35, max: 156 } } }

Note

aiqclib offers helper functions to calculate summary statistics (like min/max values). Please refer to the Feature Normalization guide for details.