aiqclib.prepare.step1_read_input package

Submodules

aiqclib.prepare.step1_read_input.dataset_a module

This module provides the InputDataSetA class, a specific implementation for reading and preparing Copernicus CTD data.

The module extends the base functionality provided by InputDataSetBase to implement concrete logic for data retrieval and initial processing as part of the data preparation pipeline.

class aiqclib.prepare.step1_read_input.dataset_a.InputDataSetA(config)

Bases: InputDataSetBase

A subclass of InputDataSetBase providing specific logic to read Copernicus CTD data.

This class ensures compatibility with YAML configuration files by defining the expected class name used during the dynamic instantiation process.

Variables: expected_class_name (str) – String identifier used to match configuration keys.
Parameters: config (ConfigBase)

expected_class_name: str = 'InputDataSetA'
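
The name-matching idea described above can be sketched as follows. This is only an illustration: the stand-in classes, the REGISTRY mapping, the load_input_class helper and the input_class config key are all hypothetical and are not part of aiqclib, whose actual lookup mechanism may differ.

```python
class InputDataSetBase:
    """Stand-in for the real base class (hypothetical, for illustration only)."""

    expected_class_name: str = ""

    def __init__(self, config: dict) -> None:
        self.config = config


class InputDataSetA(InputDataSetBase):
    expected_class_name: str = "InputDataSetA"


# Hypothetical registry keyed by the name that appears in the configuration.
REGISTRY = {cls.expected_class_name: cls for cls in (InputDataSetA,)}


def load_input_class(config: dict) -> InputDataSetBase:
    """Instantiate the class whose expected_class_name matches the config."""
    name = config["input_class"]  # hypothetical config key
    try:
        cls = REGISTRY[name]
    except KeyError:
        raise ValueError(f"Unknown input class: {name!r}") from None
    return cls(config)


dataset = load_input_class({"input_class": "InputDataSetA"})
print(type(dataset).__name__)
```

Keeping the identifier as a class attribute means a configuration file only needs to carry a string, and a mismatch can be reported before any data is read.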

aiqclib.prepare.step1_read_input.input_base module

This module defines the InputDataSetBase class, providing a foundational structure for loading, preprocessing, and managing input data within the DMQC library. It includes capabilities for reading various file formats, renaming columns, and filtering rows based on configurable parameters, serving as a base for domain-specific input data handling.

class aiqclib.prepare.step1_read_input.input_base.InputDataSetBase(config)

Bases: DataSetBase

Base class for input data loading.

It extends aiqclib.common.base.dataset_base.DataSetBase by adding mechanisms for reading raw data from a file, renaming columns, and filtering rows.

Subclasses can override methods such as rename_columns() and filter_rows() to handle domain-specific requirements.

Variables:

input_file_name (str) – The absolute or resolved file path from which data will be read.
input_data (Optional[polars.DataFrame]) – Polars DataFrame holding the loaded input data. Defaults to None until read_input_data() is called.

Parameters:

config (ConfigBase)

create_columns()

Derive the profile_no / observation_no identifier columns.

Runs after rename_columns() and before validate_input_columns(), so the created columns are subsequently validated. It is enabled by setting the optional input sub-step create_columns to true (it is disabled by default, so inputs that already provide these identifiers are never overwritten unintentionally).

The behaviour is tuned via the optional create_column_dict input parameter, which may contain key_columns (columns that identify a profile), sort_columns (ordering before numbering) and columns (which identifiers to create). Missing keys fall back to the documented defaults.

Raises: ValueError – If a required source column is missing.
Return type: None
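
To make the roles of key_columns and sort_columns concrete, here is a minimal plain-Python sketch that works on a list of dicts rather than a Polars DataFrame. The function name, the 1-based numbering and the choice to number profiles in sorted key order are assumptions made for illustration; the library's actual numbering scheme and defaults may differ.

```python
from itertools import groupby


def create_identifier_columns(rows, key_columns, sort_columns):
    """Return copies of rows with profile_no and observation_no added.

    Assumption: both identifiers are 1-based, and profiles are numbered
    in the sorted order of their key values.
    """
    for col in (*key_columns, *sort_columns):
        if any(col not in row for row in rows):
            raise ValueError(f"Required source column is missing: {col}")

    def profile_key(row):
        return tuple(row[c] for c in key_columns)

    def sort_key(row):
        return tuple(row[c] for c in sort_columns)

    # Sort by profile first, then by the requested ordering within a profile.
    ordered = sorted(rows, key=lambda r: (profile_key(r), sort_key(r)))

    result = []
    groups = groupby(ordered, key=profile_key)
    for profile_no, (_, group) in enumerate(groups, start=1):
        for observation_no, row in enumerate(group, start=1):
            result.append(
                {**row, "profile_no": profile_no, "observation_no": observation_no}
            )
    return result


rows = [
    {"platform": "B", "cast": 1, "pres": 5.0},
    {"platform": "A", "cast": 1, "pres": 10.0},
    {"platform": "A", "cast": 1, "pres": 2.0},
]
numbered = create_identifier_columns(rows, ["platform", "cast"], ["pres"])
```

Here the two rows of platform "A" form one profile whose observations are numbered by increasing pres, and the single row of platform "B" forms a second profile.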

filter_rows()

Filter rows in input_data based on year constraints or other rules.

If sub_steps.filter_rows is enabled and the configuration provides the corresponding year list, it either removes the listed years via remove_years() or keeps only the listed years via keep_years().

Raises: polars.exceptions.ColumnNotFoundError – If ‘profile_timestamp’ column is not present in input_data when year-based filtering is attempted.
Return type: None
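
The year-based dispatch can be sketched in plain Python as below. The config layout (a filter_params mapping), the function name and the precedence of remove_years over keep_years are assumptions for illustration only, and a list of dicts stands in for the Polars DataFrame.

```python
from datetime import datetime


def filter_rows_by_year(rows, config):
    """Return the rows that survive year-based filtering."""
    if not config.get("sub_steps", {}).get("filter_rows", False):
        return rows  # sub-step disabled: nothing to do

    params = config.get("filter_params", {})  # hypothetical config key
    if "remove_years" in params:
        drop = set(params["remove_years"])
        return [r for r in rows if r["profile_timestamp"].year not in drop]
    if "keep_years" in params:
        keep = set(params["keep_years"])
        return [r for r in rows if r["profile_timestamp"].year in keep]
    return rows


rows = [
    {"profile_no": 1, "profile_timestamp": datetime(2019, 6, 1)},
    {"profile_no": 2, "profile_timestamp": datetime(2020, 6, 1)},
    {"profile_no": 3, "profile_timestamp": datetime(2021, 6, 1)},
]

removed = filter_rows_by_year(
    rows,
    {"sub_steps": {"filter_rows": True}, "filter_params": {"remove_years": [2020]}},
)
kept = filter_rows_by_year(
    rows,
    {"sub_steps": {"filter_rows": True}, "filter_params": {"keep_years": [2020]}},
)
```

In this sketch a row without a profile_timestamp key raises KeyError, which plays the role that ColumnNotFoundError plays for the real DataFrame.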

input_data: DataFrame | None

input_file_name: str

keep_years()

Keep only data rows for years listed under keep_years in the config.

Updates input_data by retaining only the rows whose year is in the keep_years list. The year is extracted from the ‘profile_timestamp’ column, which must therefore exist in input_data.

Raises: polars.exceptions.ColumnNotFoundError – If ‘profile_timestamp’ column is not present in input_data.
Return type: None

read_input_data()

Load data from the configured file into input_data.

The method retrieves file_type and read_file_options from the config and uses aiqclib.common.utils.file.read_input_file() to read the file specified by input_file_name.

After reading the data, it calls rename_columns(), create_columns() (which derives profile_no / observation_no when enabled), validate_input_columns() (which checks the mandatory columns and corrects their types) and filter_rows() to modify the DataFrame.

Raises:

FileNotFoundError – If the specified file cannot be found.
polars.exceptions.NoDataError – If the file is empty or cannot be parsed.
Exception – For other errors during file reading or processing.

Return type:

None
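
The call order described for read_input_data() can be sketched as a small template-method class. The class below is a hypothetical stand-in that only records which step ran; the _read_file placeholder is not a library function and no real file is read.

```python
class ReadInputSketch:
    """Records the order in which the documented steps run."""

    def __init__(self):
        self.calls = []
        self.input_data = None

    def read_input_data(self):
        self.input_data = self._read_file()
        # Order taken from the description above.
        self.rename_columns()
        self.create_columns()
        self.validate_input_columns()
        self.filter_rows()

    def _read_file(self):
        # Hypothetical placeholder for the actual file reader.
        self.calls.append("read_file")
        return []

    def rename_columns(self):
        self.calls.append("rename_columns")

    def create_columns(self):
        self.calls.append("create_columns")

    def validate_input_columns(self):
        self.calls.append("validate_input_columns")

    def filter_rows(self):
        self.calls.append("filter_rows")


class CustomFilterSketch(ReadInputSketch):
    """A subclass that overrides one step while keeping the overall order."""

    def filter_rows(self):
        self.calls.append("custom_filter_rows")


base = ReadInputSketch()
base.read_input_data()
custom = CustomFilterSketch()
custom.read_input_data()
print(base.calls)
print(custom.calls)
```

Because the order is fixed in one method, a subclass can change a single step without having to repeat the sequence.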

remove_years()

Remove data rows for years listed under remove_years in the config.

Updates input_data by filtering out rows whose year is in the remove_years list. This method assumes the existence of a ‘profile_timestamp’ column in input_data to extract the year.

Raises: polars.exceptions.ColumnNotFoundError – If ‘profile_timestamp’ column is not present in input_data.
Return type: None

rename_columns()

Rename columns in input_data using rename mappings from the config.

If sub_steps.rename_columns is enabled and a rename_dict is present, columns will be renamed accordingly. Otherwise, the method does nothing.

Raises: polars.exceptions.ColumnNotFoundError – If a column specified in rename_dict for renaming does not exist in the DataFrame.
Return type: None
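
The gating and the missing-column check can be sketched in plain Python. The config layout and the function name are assumptions for illustration, a list of dicts stands in for the DataFrame, and KeyError stands in for polars.exceptions.ColumnNotFoundError.

```python
def rename_columns_sketch(rows, config):
    """Return rows with keys renamed according to config["rename_dict"]."""
    enabled = config.get("sub_steps", {}).get("rename_columns", False)
    rename_dict = config.get("rename_dict")  # hypothetical location of the mapping
    if not enabled or not rename_dict:
        return rows  # documented no-op case

    # Collect every column name that occurs in the data.
    columns = set().union(*(row.keys() for row in rows))
    missing = [old for old in rename_dict if old not in columns]
    if missing:
        raise KeyError(f"Columns not found: {missing}")

    return [{rename_dict.get(k, k): v for k, v in row.items()} for row in rows]


rows = [{"PRES": 10.0, "TEMP": 4.2}]
config = {
    "sub_steps": {"rename_columns": True},
    "rename_dict": {"PRES": "pres", "TEMP": "temp"},
}
renamed = rename_columns_sketch(rows, config)
print(renamed)
```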

validate_input_columns()

Validate mandatory input columns and correct their types in place.

Runs after rename_columns() and create_columns(), so the final column names and any derived identifier columns are checked. It verifies that every column in aiqclib.common.utils.input_validation.REQUIRED_INPUT_COLUMNS is present and, where a column’s data type does not match the expected type, attempts to convert it (helpful for TSV/CSV inputs whose numeric and datetime columns are often read as strings).

Validation can be disabled by setting the optional input sub-step validate_columns to false in the configuration; it is enabled by default when the flag is absent.

Raises: ValueError – If a required column is missing, or if a column’s type cannot be converted to the expected type.
Return type: None
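
The check-then-convert behaviour can be sketched as follows. REQUIRED_COLUMNS below is a hypothetical stand-in: the actual contents and structure of REQUIRED_INPUT_COLUMNS are not shown here, and the sketch returns new rows rather than modifying a DataFrame in place.

```python
from datetime import datetime

# Hypothetical stand-in: column name -> expected Python type.
REQUIRED_COLUMNS = {
    "profile_no": int,
    "pres": float,
    "profile_timestamp": datetime,
}


def _convert(value, expected):
    """Return value unchanged if it already has the expected type, else convert."""
    if isinstance(value, expected):
        return value
    if expected is datetime:
        return datetime.fromisoformat(value)
    return expected(value)


def validate_input_columns_sketch(rows, required=REQUIRED_COLUMNS):
    validated = []
    for row in rows:
        missing = [col for col in required if col not in row]
        if missing:
            raise ValueError(f"Missing required columns: {missing}")
        fixed = dict(row)
        for col, expected in required.items():
            try:
                fixed[col] = _convert(row[col], expected)
            except (TypeError, ValueError) as exc:
                raise ValueError(
                    f"Cannot convert {col!r} to {expected.__name__}"
                ) from exc
        validated.append(fixed)
    return validated


# Values as they might arrive from a CSV/TSV file: everything is a string.
raw = [{"profile_no": "1", "pres": "10.5", "profile_timestamp": "2021-03-04T05:06:07"}]
clean = validate_input_columns_sketch(raw)
```

After validation the string values in the example have become an int, a float and a datetime, which is the kind of correction the method description refers to for text-based inputs.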