Data Preprocessing Utilities

aiqclib expects a few identity and coordinate columns in your raw input. Most of the preparation is handled for you: after the optional column rename step, aiqclib validates the required columns, auto-corrects their data types, and can optionally create profile_no and observation_no. The one thing it cannot infer is a timestamp stored as a plain number (see below).

Required Input Data Columns

Column	Type	Description
`platform_code`	text	Identifier for the platform (e.g. a buoy ID).
`profile_no`	integer	Sequential within a `platform_code`; identifies one profile (a single measurement event).
`profile_timestamp`	datetime	The profile’s date and time (a real `Datetime`).
`longitude`	float	Longitude of the profile.
`latitude`	float	Latitude of the profile.
`observation_no`	integer	Sequential within a profile; the order of observations.
`pres`	float	Pressure for each observation.

Other columns (targets such as temp, QC flags, etc.) are passed through untouched.

What `aiqclib` does for you

The input step can perform three preprocessing tasks, in this order:

Step	Controlled by (`input` sub-step)	Default
Rename columns to the required names	`rename_columns` + `rename_dict`	off
Create `profile_no` / `observation_no`	`create_columns` + `create_column_dict`	off
Validate columns and correct their types	`validate_columns`	on

Renaming. If your raw columns use different names, map them with rename_dict ({ original_name: required_name }) and set rename_columns: true. Renaming runs first, so the create and validate steps see the final column names.

Creating identifiers. profile_no is numbered within each platform_code, and observation_no within each profile (ordered by pres). key_columns lists the columns that together identify a profile, so choose them carefully: jittered coordinates would split one profile into several, while identical timestamps at the same location would merge two profiles into one.
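The numbering rule can be illustrated with a small pure-Python sketch. It is not aiqclib's implementation: the function name `add_identifiers`, the plain-dict row format, and the 1-based numbering are assumptions made only for illustration.

```python
def add_identifiers(rows, key_columns, sort_columns):
    """Return copies of `rows` with profile_no and observation_no added."""
    ordered = sorted(rows, key=lambda r: tuple(r[c] for c in sort_columns))
    profile_no = {}    # profile key -> profile_no
    last_profile = {}  # platform_code -> highest profile_no issued so far
    last_obs = {}      # profile key -> highest observation_no issued so far
    result = []
    for row in ordered:
        key = tuple(row[c] for c in key_columns)
        if key not in profile_no:
            # First row of a new profile: take the next number for this platform.
            platform = row["platform_code"]
            last_profile[platform] = last_profile.get(platform, 0) + 1
            profile_no[key] = last_profile[platform]
        last_obs[key] = last_obs.get(key, 0) + 1
        result.append(
            {**row, "profile_no": profile_no[key], "observation_no": last_obs[key]}
        )
    return result


KEYS = ["platform_code", "profile_timestamp", "longitude", "latitude"]
rows = [
    {"platform_code": "A", "profile_timestamp": "2020-01-02", "longitude": 1.0, "latitude": 2.0, "pres": 10.0},
    {"platform_code": "A", "profile_timestamp": "2020-01-01", "longitude": 1.0, "latitude": 2.0, "pres": 20.0},
    {"platform_code": "A", "profile_timestamp": "2020-01-01", "longitude": 1.0, "latitude": 2.0, "pres": 5.0},
    {"platform_code": "B", "profile_timestamp": "2020-01-01", "longitude": 3.0, "latitude": 4.0, "pres": 5.0},
]
numbered = add_identifiers(rows, key_columns=KEYS, sort_columns=KEYS + ["pres"])
for r in numbered:
    print(r["platform_code"], r["profile_no"], r["observation_no"], r["pres"])
```

In this sample, platform A yields two profiles because its rows carry two distinct timestamps, and platform B starts again at profile 1. A coordinate that differed by a tiny amount between rows of the same cast would create an extra profile in exactly the same way.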

Validating. The required columns are checked, and mismatched types are converted where possible — handy for CSV/TSV inputs, where numbers and dates often arrive as strings.
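To make "converted where possible" concrete, here is a minimal standard-library sketch of the idea for a single row read from a CSV file. The names `CONVERTERS`, `to_datetime`, and `coerce_row` are hypothetical, the sample values are made up, and aiqclib's own conversion rules may differ.

```python
from datetime import datetime


def to_datetime(value):
    """Pass datetimes through unchanged; parse ISO 8601 strings."""
    return value if isinstance(value, datetime) else datetime.fromisoformat(value)


# Target converters mirroring the required-columns table (illustrative only).
CONVERTERS = {
    "platform_code": str,
    "profile_no": int,
    "profile_timestamp": to_datetime,
    "longitude": float,
    "latitude": float,
    "observation_no": int,
    "pres": float,
}


def coerce_row(row):
    """Check the required columns and convert their values; pass others through."""
    missing = [c for c in CONVERTERS if c not in row]
    if missing:
        raise KeyError(f"missing required columns: {missing}")
    return {c: (CONVERTERS[c](v) if c in CONVERTERS else v) for c, v in row.items()}


raw = {
    "platform_code": 6901234, "profile_no": "1",
    "profile_timestamp": "2020-01-01T12:00:00",
    "longitude": "-30.5", "latitude": "45.25",
    "observation_no": "1", "pres": "10.0", "temp": "12.3",
}
clean = coerce_row(raw)
```

The extra `temp` column is left exactly as it arrived, matching the pass-through behaviour described for non-required columns.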

Configure all three in the input step:

step_param_sets:
  - name: data_set_param_set_1
    steps:
      input:
        sub_steps:
          rename_columns: true
          filter_rows: false
          validate_columns: true      # on by default
          create_columns: true        # opt-in
        rename_dict: { date: profile_timestamp }
        create_column_dict:           # all keys optional
          key_columns:  [ platform_code, profile_timestamp, longitude, latitude ]
          sort_columns: [ platform_code, profile_timestamp, longitude, latitude, pres ]
          columns:      [ profile_no, observation_no ]

Note

validate_columns is enabled by default. create_columns is off by default, so inputs that already provide these identifiers are never overwritten. Set columns: [ observation_no ] to create only one of them.

Converting a numeric timestamp

profile_timestamp must be a real datetime. A numeric timestamp (e.g. days since 1950-01-01) is ambiguous, because neither the unit nor the reference date can be recovered from the numbers alone, so aiqclib cannot convert it automatically. Convert it yourself before running the workflow:

import polars as pl
from datetime import datetime

# 'profile_time' is days since 1950-01-01
df = df.with_columns(
    (
        pl.lit(datetime(1950, 1, 1))
        + pl.duration(days=pl.col("profile_time").floor())
        + pl.duration(
            seconds=(pl.col("profile_time") - pl.col("profile_time").floor()) * 86400
        )
    )
    .cast(pl.Datetime("ms"))
    .alias("profile_timestamp")
)
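As a quick sanity check on the conversion above, the same arithmetic can be reproduced for a single value with the standard library. This sketch assumes the 1950-01-01 reference date and the day unit from the example; `days_to_datetime` is a hypothetical helper name.

```python
from datetime import datetime, timedelta

REFERENCE = datetime(1950, 1, 1)  # assumed reference date, as in the example above


def days_to_datetime(days: float) -> datetime:
    """Convert fractional days since REFERENCE to a datetime."""
    return REFERENCE + timedelta(days=days)


# 25567 whole days after 1950-01-01 is 2020-01-01; the extra 0.5 day adds 12 hours.
print(days_to_datetime(25567.5))  # 2020-01-01 12:00:00
```

Comparing a few converted rows against values computed this way is a cheap way to catch a wrong reference date or unit before the data reaches the workflow.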

Important

Remove duplicate rows at the platform/profile level before you save the data and run the workflow. Duplicates can produce incorrect datasets even when every other setting is correct.
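One way to think about the de-duplication is "keep the first row for each combination of identifying columns". The sketch below shows that rule on plain dictionaries. Which columns define a duplicate depends on your data, so the `subset` used here is an assumption, and `drop_duplicate_rows` is a hypothetical helper.

```python
def drop_duplicate_rows(rows, subset):
    """Keep the first row seen for each combination of the `subset` columns."""
    seen = set()
    unique_rows = []
    for row in rows:
        key = tuple(row[c] for c in subset)
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)
    return unique_rows


SUBSET = ["platform_code", "profile_timestamp", "longitude", "latitude", "pres"]
sample = [
    {"platform_code": "A", "profile_timestamp": "2020-01-01", "longitude": 1.0, "latitude": 2.0, "pres": 5.0, "temp": 12.3},
    {"platform_code": "A", "profile_timestamp": "2020-01-01", "longitude": 1.0, "latitude": 2.0, "pres": 5.0, "temp": 12.3},
    {"platform_code": "A", "profile_timestamp": "2020-01-01", "longitude": 1.0, "latitude": 2.0, "pres": 20.0, "temp": 11.8},
]
deduplicated = drop_duplicate_rows(sample, SUBSET)
print(len(sample), "->", len(deduplicated))  # 3 -> 2
```

On a Polars DataFrame, `df.unique(subset=..., keep="first", maintain_order=True)` applies the same keep-first rule.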

Save the Preprocessed Data

Write the result to Parquet and point input_file_name in your prepare_config.yaml at it:

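import os

output_file = os.path.expanduser("~/aiqc_project/input/nrt_cora_bo_preprocessed.parquet")
# Create the target directory if it does not exist yet
os.makedirs(os.path.dirname(output_file), exist_ok=True)
df.write_parquet(output_file)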

Next Steps

Return to the tutorial: Step 2: Dataset Preparation.