Data Preprocessing Utilities
aiqclib expects a few identity and coordinate columns in your raw input.
Most of the preparation is now handled for you: after the optional column
rename step, aiqclib validates the required columns, auto-corrects their
data types, and can derive profile_no and observation_no. The only thing
it cannot infer is a timestamp stored as a plain number (see below).
Required Input Data Columns
| Column | Type | Description |
|---|---|---|
| platform_code | text | Identifier for the platform (e.g. a buoy ID). |
| profile_no | integer | Sequential within a platform_code. |
| profile_timestamp | datetime | The profile's date and time (a real datetime). |
| longitude | float | Longitude of the profile. |
| latitude | float | Latitude of the profile. |
| observation_no | integer | Sequential within a profile; the order of observations. |
| pres | float | Pressure for each observation. |
Other columns (targets such as temp, QC flags, etc.) are passed through
untouched.
What aiqclib does for you
The input step can perform three preprocessing tasks, in this order:
| Step | Controlled by (sub_steps) | Default |
|---|---|---|
| Rename columns to the required names | rename_columns | off |
| Create profile_no and observation_no | create_columns | off |
| Validate columns and correct their types | validate_columns | on |
Renaming. If your raw columns use different names, map them with
rename_dict ({ original_name: required_name }) and set
rename_columns: true. Renaming runs first, so the create and validate steps
see the final column names.
Creating identifiers. profile_no is numbered within each
platform_code and observation_no within each profile (ordered by
pres). Choose key_columns so they genuinely identify a profile: jittered
coordinates would split one profile, while identical timestamps at the same
location would merge two.
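The numbering rule described above can be sketched in plain Python. This is only an illustration of the idea, not aiqclib's implementation: the helper add_identifiers is hypothetical, and both the 1-based numbering and the ordering of profiles by their sorted key values are assumptions.

```python
from collections import defaultdict


def add_identifiers(rows, key_columns):
    """Return copies of rows with profile_no and observation_no added.

    Hypothetical helper for illustration; aiqclib may number differently.
    """
    # Rows that share the same key_columns values form one profile.
    profiles = defaultdict(list)
    for row in rows:
        profiles[tuple(row[c] for c in key_columns)].append(row)

    labelled = []
    profile_counter = defaultdict(int)  # one counter per platform_code
    for key in sorted(profiles):  # deterministic profile order (assumption)
        platform = profiles[key][0]["platform_code"]
        profile_counter[platform] += 1
        # Observations are numbered within the profile by increasing pres.
        ordered = sorted(profiles[key], key=lambda r: r["pres"])
        for obs_no, row in enumerate(ordered, start=1):
            labelled.append(
                {**row, "profile_no": profile_counter[platform], "observation_no": obs_no}
            )
    return labelled


KEY_COLUMNS = ["platform_code", "profile_timestamp", "longitude", "latitude"]
rows = [
    {"platform_code": "A", "profile_timestamp": "2020-01-01T00:00", "longitude": 10.0, "latitude": 50.0, "pres": 20.0},
    {"platform_code": "A", "profile_timestamp": "2020-01-01T00:00", "longitude": 10.0, "latitude": 50.0, "pres": 5.0},
    {"platform_code": "A", "profile_timestamp": "2020-01-02T00:00", "longitude": 10.0, "latitude": 50.0, "pres": 5.0},
    {"platform_code": "B", "profile_timestamp": "2020-01-01T00:00", "longitude": 11.0, "latitude": 51.0, "pres": 5.0},
]
labelled = add_identifiers(rows, KEY_COLUMNS)

# Jittering one longitude value splits the first profile of platform A in two.
jittered = [dict(rows[0], longitude=10.0001)] + rows[1:]
n_profiles = len({(r["platform_code"], r["profile_no"]) for r in labelled})
n_profiles_jittered = len(
    {(r["platform_code"], r["profile_no"]) for r in add_identifiers(jittered, KEY_COLUMNS)}
)
```

In this toy input the clean rows yield three profiles, while the jittered copy yields four. That is the failure mode to watch for when choosing key_columns.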
Validating. The required columns are checked, and mismatched types are
converted where possible. This is handy for CSV/TSV inputs, where numbers and
dates often arrive as strings.
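As a rough mental model of the validation step, the following plain-Python sketch checks that the required columns are present and coerces string values to the types listed in the required-columns table. The names REQUIRED_TYPES, coerce and validate_row are hypothetical, and the actual conversion rules inside aiqclib may differ. In particular, accepting only ISO 8601 strings for the timestamp is an assumption made here.

```python
from datetime import datetime

# Target types taken from the required-columns table (names are from the doc).
REQUIRED_TYPES = {
    "platform_code": str,
    "profile_no": int,
    "profile_timestamp": datetime,
    "longitude": float,
    "latitude": float,
    "observation_no": int,
    "pres": float,
}


def coerce(value, target):
    """Convert one value to the target type, or raise if that is not possible."""
    if isinstance(value, target):
        return value
    if target is datetime:
        if not isinstance(value, str):
            # A bare number has no unambiguous meaning as a timestamp.
            raise TypeError("numeric timestamps are ambiguous; convert them first")
        return datetime.fromisoformat(value)  # assumption: ISO 8601 text
    return target(value)


def validate_row(row):
    """Check required columns and return a copy with corrected types."""
    missing = [c for c in REQUIRED_TYPES if c not in row]
    if missing:
        raise KeyError(f"missing required columns: {missing}")
    fixed = dict(row)  # other columns pass through untouched
    for column, target in REQUIRED_TYPES.items():
        fixed[column] = coerce(row[column], target)
    return fixed


# A row as it might arrive from a CSV file: every value is a string.
raw = {
    "platform_code": "6901234",
    "profile_no": "1",
    "profile_timestamp": "2020-01-01T12:00:00",
    "longitude": "10.5",
    "latitude": "50.25",
    "observation_no": "3",
    "pres": "5",
    "temp": "12.3",
}
clean = validate_row(raw)
```

Here the extra temp column keeps its original string value, while a numeric profile_timestamp is rejected instead of being guessed at.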
Configure all three in the input step:
step_param_sets:
  - name: data_set_param_set_1
    steps:
      input:
        sub_steps:
          rename_columns: true
          filter_rows: false
          validate_columns: true   # on by default
          create_columns: true     # opt-in
        rename_dict: { date: profile_timestamp }
        create_column_dict:        # all keys optional
          key_columns: [ platform_code, profile_timestamp, longitude, latitude ]
          sort_columns: [ platform_code, profile_timestamp, longitude, latitude, pres ]
          columns: [ profile_no, observation_no ]
Note
validate_columns is enabled by default. create_columns is off by
default, so inputs that already provide these identifiers are never
overwritten. Set columns: [ observation_no ] to create only one of them.
Converting a numeric timestamp
profile_timestamp must be a real datetime. A numeric epoch (e.g. days since
1950-01-01) is ambiguous, so aiqclib cannot convert it automatically.
Do this yourself before running the workflow:
import polars as pl
from datetime import datetime

# 'profile_time' is days since 1950-01-01
df = df.with_columns(
    (
        pl.lit(datetime(1950, 1, 1))
        + pl.duration(days=pl.col("profile_time").floor())
        + pl.duration(
            seconds=(pl.col("profile_time") - pl.col("profile_time").floor()) * 86400
        )
    )
    .cast(pl.Datetime("ms"))
    .alias("profile_timestamp")
)
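To sanity-check a conversion like the one above, it can help to convert a single value with the standard library and compare it against the corresponding row. The sketch below assumes the same reference date (1950-01-01) and the same unit (days). The function name days_since_1950_to_datetime is illustrative only.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1950, 1, 1)  # reference date assumed for 'profile_time'


def days_since_1950_to_datetime(days: float) -> datetime:
    """Convert fractional days since 1950-01-01 to a naive datetime."""
    return EPOCH + timedelta(days=days)


# 25567 whole days after 1950-01-01 is 2020-01-01; the .5 adds twelve hours.
example = days_since_1950_to_datetime(25567.5)
print(example)  # 2020-01-01 12:00:00
```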
Important
Remove duplicate rows at the platform/profile level first; duplicates can
produce incorrect datasets even when everything else is correct.
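One way to think about this de-duplication is "keep the first row for each identifying key". The plain-Python sketch below shows the idea. Which columns define a duplicate depends on your data, so the key used here (platform, timestamp, position and pressure) is an assumption, and drop_duplicate_rows is a hypothetical helper. On a Polars DataFrame, the analogous call is DataFrame.unique(subset=..., keep="first", maintain_order=True).

```python
def drop_duplicate_rows(rows, key_columns):
    """Keep the first row seen for each combination of key_columns."""
    seen = set()
    unique_rows = []
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)
    return unique_rows


# Assumed definition of a duplicate observation; adjust to your data.
DEDUP_KEY = ["platform_code", "profile_timestamp", "longitude", "latitude", "pres"]

observations = [
    {"platform_code": "A", "profile_timestamp": "2020-01-01T00:00", "longitude": 10.0, "latitude": 50.0, "pres": 5.0, "temp": 12.3},
    {"platform_code": "A", "profile_timestamp": "2020-01-01T00:00", "longitude": 10.0, "latitude": 50.0, "pres": 5.0, "temp": 12.3},
    {"platform_code": "A", "profile_timestamp": "2020-01-01T00:00", "longitude": 10.0, "latitude": 50.0, "pres": 20.0, "temp": 11.8},
]
deduplicated = drop_duplicate_rows(observations, DEDUP_KEY)
```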
Save the Preprocessed Data
Write the result to Parquet and point input_file_name in your
prepare_config.yaml at it:
import os

output_file = os.path.expanduser("~/aiqc_project/input/nrt_cora_bo_preprocessed.parquet")
os.makedirs(os.path.dirname(output_file), exist_ok=True)  # ensure the target directory exists
df.write_parquet(output_file)
Next Steps
Return to the tutorial: Step 2: Dataset Preparation.