Data Preprocessing Utilities

aiqclib expects a few identity and coordinate columns in your raw input. Most of the preparation is now handled for you: after the optional column rename step, aiqclib validates the required columns, auto-corrects their data types, and can derive profile_no and observation_no. The only thing it cannot infer is a timestamp stored as a plain number (see below).

Required Input Data Columns

Column             Type      Description
-----------------  --------  ---------------------------------------------------------------
platform_code      text      Identifier for the platform (e.g. a buoy ID).
profile_no         integer   Sequential within a platform_code; identifies one profile (a single measurement event).
profile_timestamp  datetime  The profile’s date and time (a real Datetime).
longitude          float     Longitude of the profile.
latitude           float     Latitude of the profile.
observation_no     integer   Sequential within a profile; the order of observations.
pres               float     Pressure for each observation.

Other columns (targets such as temp, QC flags, etc.) are passed through untouched.

What aiqclib does for you

The input step can perform three preprocessing tasks, in this order:

Step                                      Controlled by (input sub-step)       Default
----------------------------------------  -----------------------------------  -------
Rename columns to the required names      rename_columns + rename_dict         off
Create profile_no / observation_no        create_columns + create_column_dict  off
Validate columns and correct their types  validate_columns                     on

Renaming. If your raw columns use different names, map them with rename_dict ({ original_name: required_name }) and set rename_columns: true. Renaming runs first, so the create and validate steps see the final column names.

Creating identifiers. profile_no is numbered within each platform_code and observation_no within each profile (ordered by pres). Choose key_columns so they genuinely identify a profile: jittered coordinates would split one profile, while identical timestamps at the same location would merge two.

Validating. The required columns are checked, and mismatched types are converted where possible. This is handy for CSV/TSV inputs, where numbers and dates often arrive as strings.

Configure all three in the input step:

step_param_sets:
  - name: data_set_param_set_1
    steps:
      input:
        sub_steps:
          rename_columns: true
          filter_rows: false
          validate_columns: true      # on by default
          create_columns: true        # opt-in
        rename_dict: { date: profile_timestamp }
        create_column_dict:           # all keys optional
          key_columns:  [ platform_code, profile_timestamp, longitude, latitude ]
          sort_columns: [ platform_code, profile_timestamp, longitude, latitude, pres ]
          columns:      [ profile_no, observation_no ]

Note

validate_columns is enabled by default. create_columns is off by default, so inputs that already provide these identifiers are never overwritten. Set columns: [ observation_no ] to create only one of them.

Converting a numeric timestamp

profile_timestamp must be a real datetime. A numeric epoch (e.g. days since 1950-01-01) is ambiguous, so aiqclib cannot convert it automatically. Do this yourself before running the workflow:

import polars as pl
from datetime import datetime

# 'profile_time' is days since 1950-01-01
df = df.with_columns(
    (
        pl.lit(datetime(1950, 1, 1))
        + pl.duration(days=pl.col("profile_time").floor())
        + pl.duration(
            seconds=(pl.col("profile_time") - pl.col("profile_time").floor()) * 86400
        )
    )
    .cast(pl.Datetime("ms"))
    .alias("profile_timestamp")
)
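As a quick sanity check of the arithmetic, the same conversion can be reproduced for a single value with the standard library. The helper name days_to_datetime and the value 25567.5 are illustrative choices, not part of aiqclib.

```python
from datetime import datetime, timedelta

# Reference date assumed by this example: days are counted from 1950-01-01.
EPOCH = datetime(1950, 1, 1)


def days_to_datetime(days: float) -> datetime:
    """Convert fractional days since EPOCH to a datetime."""
    return EPOCH + timedelta(days=days)


converted = days_to_datetime(25567.5)
print(converted)  # 2020-01-01 12:00:00
```

Spot-checking a few rows of your own data this way is a cheap way to confirm that the unit and reference date you assumed are the right ones.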

Important

Remove duplicate rows at the platform/profile level first; duplicates can produce incorrect datasets even when everything else is correct.

Save the Preprocessed Data

Write the result to Parquet and point input_file_name in your prepare_config.yaml at it:

import os

output_file = os.path.expanduser("~/aiqc_project/input/nrt_cora_bo_preprocessed.parquet")
df.write_parquet(output_file)

Next Steps

Return to the tutorial: Step 2: Dataset Preparation.