aiqclib.prepare.step2_calc_stats package

Submodules

aiqclib.prepare.step2_calc_stats.dataset_a module

This module defines the SummaryDataSetA class, a specialized implementation of SummaryStatsBase for calculating summary statistics on specific datasets, such as Copernicus CTD data, using the Polars DataFrame library. The class takes its settings from a ConfigBase configuration object.

class aiqclib.prepare.step2_calc_stats.dataset_a.SummaryDataSetA(config, input_data=None)[source]

Bases: SummaryStatsBase

Specialized class for calculating summary statistics for Copernicus CTD data.

This class extends aiqclib.prepare.step2_calc_stats.summary_base.SummaryStatsBase and leverages the Polars DataFrame library for efficient data processing. It identifies itself via the expected_class_name attribute to match corresponding YAML configuration entries.

Parameters:
  • config (ConfigBase)

  • input_data (DataFrame | None)

expected_class_name: str = 'SummaryDataSetA'
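
The matching between expected_class_name and the configuration can be pictured with the minimal sketch below. It uses stand-in classes (StatsBaseSketch, DataSetASketch) and a plain dict with a hypothetical class_name key in place of ConfigBase and the YAML file. The validation logic and error types are assumptions; the actual behavior in aiqclib may differ.

```python
from typing import Optional


class StatsBaseSketch:
    """Stand-in for a base class; NOT the real SummaryStatsBase."""

    # Subclasses are expected to override this with their own name.
    expected_class_name: Optional[str] = None

    def __init__(self, config: dict) -> None:
        if self.expected_class_name is None:
            raise NotImplementedError("Subclass must define expected_class_name.")
        # "class_name" is a hypothetical config key used for illustration.
        configured = config.get("class_name")
        if configured != self.expected_class_name:
            raise ValueError(
                f"Config names {configured!r}, expected {self.expected_class_name!r}."
            )
        self.config = config


class DataSetASketch(StatsBaseSketch):
    """Stand-in for a concrete subclass such as SummaryDataSetA."""

    expected_class_name = "SummaryDataSetA"


stats = DataSetASketch({"class_name": "SummaryDataSetA"})
```

Keeping the name as a class attribute lets a configuration loader compare a string from the file against the class before any data is read.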

aiqclib.prepare.step2_calc_stats.summary_base module

Summary Statistics Module.

This module provides the SummaryStatsBase class, which serves as a base for calculating, aggregating, and exporting summary statistics from tabular datasets using the Polars library. It handles global and per-profile calculations and supports exporting results to TSV format.

class aiqclib.prepare.step2_calc_stats.summary_base.SummaryStatsBase(config, input_data=None)[source]

Bases: DataSetBase

Abstract base class for calculating summary statistics.

This class provides a framework for generating and writing summary statistics for a dataset. It handles both global (dataset-wide) and per-profile statistics for a specified set of numeric columns. Subclasses must define an expected_class_name to be instantiated.

Variables:
  • default_file_name (str) – The default filename for the output stats file.

  • output_file_name (str) – The full path for the output summary stats file, derived from the configuration.

  • input_data (polars.DataFrame or None) – The DataFrame containing the data to be analyzed.

  • summary_stats (polars.DataFrame or None) – DataFrame holding the combined global and per-profile statistics after calculation.

  • summary_stats_observation (polars.DataFrame or None) – DataFrame holding aggregated global statistics for key variables.

  • summary_stats_profile (polars.DataFrame or None) – DataFrame holding aggregated per-profile statistics for key variables.

  • val_col_names (list[str]) – List of numeric columns for which to compute statistics.

  • stats_col_names (list[str]) – The schema (column names) for the output statistics DataFrame.

  • profile_col_names (list[str]) – List of columns used to identify unique profiles for grouping.

Parameters:
  • config (ConfigBase)

  • input_data (DataFrame | None)

calculate_global_stats(val_col_name)[source]

Compute global summary statistics for a specified column.

These statistics are calculated across the entire dataset.

Parameters:

val_col_name (str) – Name of the column for which to calculate global statistics.

Returns:

A DataFrame with one row containing the summary statistics, structured to be compatible with per-profile stats.

Return type:

DataFrame

calculate_profile_stats(grouped_df, val_col_name)[source]

Compute per-profile summary statistics for a column.

Parameters:
  • grouped_df (DataFrame) – A Polars DataFrame already grouped by profile identifier columns (e.g., platform_code, profile_no).

  • val_col_name (str) – The name of the column for which to calculate per-profile stats.

Returns:

A DataFrame containing statistics for each profile.

Return type:

DataFrame

calculate_stats()[source]

Calculate and combine global and per-profile statistics.

This method computes statistics for each column in val_col_names at both the global and per-profile level, then concatenates them into a single DataFrame stored in summary_stats.

Returns:

None

Return type:

None

create_summary_stats_observation()[source]

Create a summarized view of global observation statistics.

This method filters the main statistics table for global ("all") data, selects a subset of key metrics, and stores the result in summary_stats_observation.

Raises:

ValueError – If summary_stats has not been calculated yet.

Returns:

None

Return type:

None

create_summary_stats_profile()[source]

Create a summarized view of per-profile statistics.

This method filters the main statistics table for per-profile data, reshapes it to aggregate statistics (min, mean, max, etc.) across all profiles, and stores the result in summary_stats_profile.

Raises:

ValueError – If summary_stats has not been calculated yet.

Returns:

None

Return type:

None

default_file_name: str

static get_stats_expression(val_col_name)[source]

Build a list of Polars expressions to compute summary statistics.

Parameters:

val_col_name (str) – The name of the column to analyze.

Returns:

A list of Polars expressions for calculating min, max, mean, median, quantiles, and standard deviation.

Return type:

List[Expr]

input_data: DataFrame | None

output_file_name: str

summary_stats: DataFrame | None

summary_stats_observation: DataFrame | None

summary_stats_profile: DataFrame | None

write_summary_stats()[source]

Write the computed summary statistics to a TSV file.

The output path is determined by output_file_name.

Raises:

ValueError – If summary_stats has not been calculated yet.

Returns:

None

Return type:

None