aiqclib.prepare.step1_read_input package
Submodules
aiqclib.prepare.step1_read_input.dataset_a module
This module provides the InputDataSetA class, a specific implementation for reading and preparing Copernicus CTD data.
The module extends the base functionality provided by InputDataSetBase to implement concrete logic for data retrieval and initial processing as part of the data preparation pipeline.
- class aiqclib.prepare.step1_read_input.dataset_a.InputDataSetA(config)
Bases: InputDataSetBase
A subclass of InputDataSetBase providing specific logic to read Copernicus CTD data.
This class ensures compatibility with YAML configuration files by defining the expected class name used during the dynamic instantiation process.
- Variables:
expected_class_name (str) – String identifier used to match configuration keys.
- Parameters:
config (ConfigBase)
- expected_class_name: str = 'InputDataSetA'
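As a rough illustration of why expected_class_name matters, the sketch below resolves a class from a name stored in a configuration mapping. The registry, the build_input_dataset helper, the input_class key, and the stub classes are all hypothetical and are not part of aiqclib; only the attribute name expected_class_name and its value 'InputDataSetA' come from the documentation above.

```python
from typing import Any, Dict, Type


class InputDataSetBaseStub:
    """Hypothetical stand-in for InputDataSetBase (not the real class)."""

    expected_class_name: str = "InputDataSetBase"

    def __init__(self, config: Dict[str, Any]) -> None:
        self.config = config


class InputDataSetAStub(InputDataSetBaseStub):
    """Hypothetical stand-in for InputDataSetA."""

    expected_class_name: str = "InputDataSetA"


# Hypothetical registry keyed by each class's expected_class_name.
REGISTRY: Dict[str, Type[InputDataSetBaseStub]] = {
    cls.expected_class_name: cls for cls in (InputDataSetAStub,)
}


def build_input_dataset(config: Dict[str, Any]) -> InputDataSetBaseStub:
    """Instantiate the class whose expected_class_name matches the config."""
    name = config["input_class"]  # assumed key name, for illustration only
    try:
        cls = REGISTRY[name]
    except KeyError:
        raise ValueError(f"No input class registered under {name!r}") from None
    return cls(config)


dataset = build_input_dataset({"input_class": "InputDataSetA"})
print(type(dataset).__name__)
```

Keying the lookup on a class attribute rather than on the Python class name keeps the configuration value stable even if a class is later renamed or moved.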
aiqclib.prepare.step1_read_input.input_base module
This module defines the InputDataSetBase class, providing a foundational structure for loading, preprocessing, and managing input data within the DMQC library. It includes capabilities for reading various file formats, renaming columns, and filtering rows based on configurable parameters, serving as a base for domain-specific input data handling.
- class aiqclib.prepare.step1_read_input.input_base.InputDataSetBase(config)
Bases: DataSetBase
Base class for input data loading.
It extends aiqclib.common.base.dataset_base.DataSetBase by adding mechanisms for reading raw data from a file, renaming columns, and filtering rows.
Subclasses must implement or customize methods such as rename_columns() and filter_rows() to handle domain-specific requirements.
- Variables:
input_file_name (str) – The absolute or resolved file path from which data will be read.
input_data (Optional[polars.DataFrame]) – Polars DataFrame holding the loaded input data. Defaults to None until read_input_data() is called.
- Parameters:
config (ConfigBase)
- create_columns()
Derive the profile_no / observation_no identifier columns.
Runs after rename_columns() and before validate_input_columns(), so the created columns are subsequently validated. It is enabled by setting the optional input sub-step create_columns to true. It is disabled by default, so inputs that already provide these identifiers are never overwritten unintentionally.
The behaviour is tuned via the optional create_column_dict input parameter, which may contain key_columns (columns that identify a profile), sort_columns (ordering applied before numbering) and columns (which identifiers to create). Missing keys fall back to the documented defaults.
- Raises:
ValueError – If a required source column is missing.
- Return type:
None
- filter_rows()
Filter rows in input_data based on year constraints or other rules.
If sub_steps.filter_rows is enabled and the relevant fields exist, it either removes certain years via remove_years() or keeps only a specified set of years via keep_years().
- Raises:
polars.exceptions.ColumnNotFoundError – If the ‘profile_timestamp’ column is not present in input_data when year-based filtering is attempted.
- Return type:
None
- input_data: DataFrame | None
- input_file_name: str
- keep_years()
Keep only data rows for years listed under keep_years in the config.
Updates input_data by keeping only the rows whose year is in the keep_years list. This method assumes the existence of a ‘profile_timestamp’ column in input_data to extract the year.
- Raises:
polars.exceptions.ColumnNotFoundError – If the ‘profile_timestamp’ column is not present in input_data.
- Return type:
None
- read_input_data()
Load data from the configured file into input_data.
The method retrieves file_type and read_file_options from the config and uses aiqclib.common.utils.file.read_input_file() to read the file specified by input_file_name.
After reading the data, it calls rename_columns(), create_columns() (which derives profile_no / observation_no when enabled), validate_input_columns() (which checks the mandatory columns and corrects their types) and filter_rows() to modify the DataFrame.
- Raises:
FileNotFoundError – If the specified file cannot be found.
polars.exceptions.NoDataError – If the file is empty or cannot be parsed.
Exception – For other errors during file reading or processing.
- Return type:
None
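The sequence of sub-steps can be pictured with a small stand-in class that only records which method ran. This is a hypothetical sketch, assuming the calls happen in the order they are listed above; ReadPipelineSketch is not part of aiqclib and performs no file I/O.

```python
from typing import List


class ReadPipelineSketch:
    """Hypothetical stand-in that records the order of the sub-steps."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def rename_columns(self) -> None:
        self.calls.append("rename_columns")

    def create_columns(self) -> None:
        self.calls.append("create_columns")

    def validate_input_columns(self) -> None:
        self.calls.append("validate_input_columns")

    def filter_rows(self) -> None:
        self.calls.append("filter_rows")

    def read_input_data(self) -> None:
        # The real method reads the file here; this sketch only logs the step.
        self.calls.append("read_input_file")
        self.rename_columns()          # final column names first
        self.create_columns()          # then derive identifiers, if enabled
        self.validate_input_columns()  # then check names and types
        self.filter_rows()             # finally drop unwanted rows


sketch = ReadPipelineSketch()
sketch.read_input_data()
print(sketch.calls)
```

Because each sub-step is a separate method, a subclass can override one of them without touching the overall sequence.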
- remove_years()
Remove data rows for years listed under remove_years in the config.
Updates input_data by filtering out rows whose year is in the remove_years list. This method assumes the existence of a ‘profile_timestamp’ column in input_data to extract the year.
- Raises:
polars.exceptions.ColumnNotFoundError – If the ‘profile_timestamp’ column is not present in input_data.
- Return type:
None
- rename_columns()
Rename columns in input_data using rename mappings from the config.
If sub_steps.rename_columns is enabled and a rename_dict is present, columns will be renamed accordingly. Otherwise, the method does nothing.
- Raises:
polars.exceptions.ColumnNotFoundError – If a column specified in rename_dict for renaming does not exist in the DataFrame.
- Return type:
None
- validate_input_columns()
Validate mandatory input columns and correct their types in place.
Runs after rename_columns() (and after create_columns(), when that sub-step is enabled) so the final column names are checked. It verifies that every column in aiqclib.common.utils.input_validation.REQUIRED_INPUT_COLUMNS is present and, where a column’s data type does not match the expected type, attempts to convert it (helpful for TSV/CSV inputs whose numeric and datetime columns are often read as strings).
Validation can be disabled by setting the optional input sub-step validate_columns to false in the configuration; it is enabled by default when the flag is absent.
- Raises:
ValueError – If a required column is missing, or if a column’s type cannot be converted to the expected type.
- Return type:
None