Data Preprocessing Utilities
============================

``aiqclib`` expects a few identity and coordinate columns in your raw input.
Most of the preparation is handled for you: after the optional column
**rename** step, ``aiqclib`` can derive ``profile_no`` and ``observation_no``,
then validates the required columns and auto-corrects their data types. The
only thing it cannot convert is a timestamp stored as a plain number (see below).

Required Input Data Columns
---------------------------

.. list-table::
   :header-rows: 1
   :widths: 24 16 60

   * - Column
     - Type
     - Description
   * - ``platform_code``
     - text
     - Identifier for the platform (e.g. a buoy ID).
   * - ``profile_no``
     - integer
     - Sequential within a ``platform_code``; identifies one profile (a single
       measurement event).
   * - ``profile_timestamp``
     - datetime
     - The profile's date and time (a real ``Datetime``).
   * - ``longitude``
     - float
     - Longitude of the profile.
   * - ``latitude``
     - float
     - Latitude of the profile.
   * - ``observation_no``
     - integer
     - Sequential within a profile; the order of observations.
   * - ``pres``
     - float
     - Pressure for each observation.

Other columns (targets such as ``temp``, QC flags, etc.) are passed through
untouched.
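
Before running the workflow, it can help to confirm which required columns your
input already has. The following is a minimal sketch and not part of
``aiqclib``: ``REQUIRED_COLUMNS`` and ``missing_required`` are hypothetical names
introduced here, and the check looks only at column names, not at types.

.. code-block:: python

   # Hypothetical helper, not an aiqclib API: the names come from the table above.
   REQUIRED_COLUMNS = [
       "platform_code",
       "profile_no",
       "profile_timestamp",
       "longitude",
       "latitude",
       "observation_no",
       "pres",
   ]


   def missing_required(columns):
       """Return the required column names that are absent from ``columns``."""
       present = set(columns)
       return [name for name in REQUIRED_COLUMNS if name not in present]


   # Example: a raw file that lacks the two identifier columns.
   raw_columns = [
       "platform_code", "profile_timestamp", "longitude", "latitude", "pres", "temp",
   ]
   print(missing_required(raw_columns))  # ['profile_no', 'observation_no']

With a polars DataFrame you could pass ``df.columns``. In this example only the
two identifiers are missing, which is the case the optional ``create_columns``
sub-step is meant for.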

What ``aiqclib`` does for you
-----------------------------

The ``input`` step can perform three preprocessing tasks, in this order:

.. list-table::
   :header-rows: 1
   :widths: 42 38 20

   * - Step
     - Controlled by (``input`` sub-step)
     - Default
   * - Rename columns to the required names
     - ``rename_columns`` + ``rename_dict``
     - off
   * - Create ``profile_no`` / ``observation_no``
     - ``create_columns`` + ``create_column_dict``
     - off
   * - Validate columns and correct their types
     - ``validate_columns``
     - on

**Renaming.** If your raw columns use different names, map them with
``rename_dict`` (``{ original_name: required_name }``) and set
``rename_columns: true``. Renaming runs first, so the create and validate steps
see the final column names.

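**Creating identifiers.** With ``create_columns: true``, ``profile_no`` is
numbered within each ``platform_code`` and ``observation_no`` within each
profile (ordered by ``pres``). Rows that share the same values in
``key_columns`` (set in ``create_column_dict``) are treated as one profile, so
choose columns that genuinely identify a profile: jittered coordinates would
split one profile into several, while two profiles with identical timestamps at
the same location would be merged.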
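
To see what this numbering means in practice, here is a minimal polars sketch
that derives both identifiers by hand. It is an illustration under stated
assumptions, not ``aiqclib``'s implementation: the sample data is made up, and
the rule that profiles are numbered in the sort order of their key values is an
assumption of this example.

.. code-block:: python

   from datetime import datetime

   import polars as pl

   # Made-up sample: platform A has two profiles, platform B has one.
   df = pl.DataFrame(
       {
           "platform_code": ["A", "A", "A", "B", "B"],
           "profile_timestamp": [
               datetime(2020, 1, 2),
               datetime(2020, 1, 1),
               datetime(2020, 1, 1),
               datetime(2020, 1, 1),
               datetime(2020, 1, 1),
           ],
           "longitude": [10.0, 10.0, 10.0, 10.0, 10.0],
           "latitude": [55.0, 55.0, 55.0, 55.0, 55.0],
           "pres": [5.0, 20.0, 10.0, 3.0, 1.0],
       }
   )

   key_columns = ["platform_code", "profile_timestamp", "longitude", "latitude"]
   sort_columns = key_columns + ["pres"]

   df = (
       df.sort(sort_columns)
       # Running counter inside each profile, following the pres order.
       .with_columns(
           pl.int_range(1, pl.len() + 1).over(key_columns).alias("observation_no")
       )
       # Each first observation starts a new profile; count them per platform.
       .with_columns(
           (pl.col("observation_no") == 1)
           .cast(pl.Int64)
           .cum_sum()
           .over("platform_code")
           .alias("profile_no")
       )
   )

   print(df.select("platform_code", "profile_no", "observation_no", "pres"))

Because the key includes ``longitude`` and ``latitude``, a small difference in
either coordinate between rows of the same cast would start a new
``profile_no`` in this sketch, which is the splitting effect described above.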

**Validating.** The required columns are checked, and mismatched types are
converted where possible — handy for CSV/TSV inputs, where numbers and dates
often arrive as strings.
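
For orientation, the sketch below shows what such a conversion amounts to when
done by hand in polars. The sample values and the timestamp format
``%Y-%m-%d %H:%M:%S`` are assumptions made for this example; ``aiqclib``'s own
conversion rules may differ.

.. code-block:: python

   import polars as pl

   # Made-up sample: every column arrives as text, as it might from a CSV file
   # read without type inference.
   raw = pl.DataFrame(
       {
           "platform_code": ["A", "A"],
           "profile_timestamp": ["2020-01-01 06:30:00", "2020-01-02 18:00:00"],
           "longitude": ["10.5", "10.6"],
           "latitude": ["55.1", "55.2"],
           "pres": ["5.0", "10.0"],
       }
   )

   typed = raw.with_columns(
       # Parse the text timestamps into real datetimes (format is an assumption).
       pl.col("profile_timestamp").str.to_datetime("%Y-%m-%d %H:%M:%S"),
       # Parse the numeric text columns into floats.
       pl.col("longitude", "latitude", "pres").cast(pl.Float64),
   )

   print(typed.schema)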

Configure all three in the ``input`` step:

.. code-block:: yaml

   step_param_sets:
     - name: data_set_param_set_1
       steps:
         input:
           sub_steps:
             rename_columns: true
             filter_rows: false
             validate_columns: true      # on by default
             create_columns: true        # opt-in
           rename_dict: { date: profile_timestamp }
           create_column_dict:           # all keys optional
             key_columns:  [ platform_code, profile_timestamp, longitude, latitude ]
             sort_columns: [ platform_code, profile_timestamp, longitude, latitude, pres ]
             columns:      [ profile_no, observation_no ]

.. note::
   ``validate_columns`` is enabled by default. ``create_columns`` is **off** by
   default, so inputs that already provide these identifiers are never
   overwritten. To create only one of them, list just that column in
   ``create_column_dict``, e.g. ``columns: [ observation_no ]``.

Converting a numeric timestamp
------------------------------

``profile_timestamp`` must be a real datetime. A numeric value (e.g. days since
1950-01-01) is ambiguous because neither the reference date nor the unit can be
recovered from the number itself, so ``aiqclib`` cannot convert it
automatically. Convert it yourself before running the workflow:

.. code-block:: python

   import polars as pl
   from datetime import datetime

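   # df is the raw input as a polars DataFrame;
   # 'profile_time' holds (fractional) days since 1950-01-01.
   df = df.with_columns(
       (
           pl.lit(datetime(1950, 1, 1))
           # whole days since the reference date
           + pl.duration(days=pl.col("profile_time").floor())
           # fractional part of the day, converted to seconds
           + pl.duration(
               seconds=(pl.col("profile_time") - pl.col("profile_time").floor()) * 86400
           )
       )
       .cast(pl.Datetime("ms"))
       .alias("profile_timestamp")
   )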
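
As a quick check of the arithmetic, the same conversion for a single value can
be written with the standard library. The value ``25567.75`` and the second
reference date (1970-01-01) are made up for illustration; they show why a bare
number cannot be interpreted without knowing its reference date.

.. code-block:: python

   from datetime import datetime, timedelta

   value = 25567.75  # made-up example value, in days

   # Same number, two different reference dates.
   as_days_since_1950 = datetime(1950, 1, 1) + timedelta(days=value)
   as_days_since_1970 = datetime(1970, 1, 1) + timedelta(days=value)

   print(as_days_since_1950)  # 2020-01-01 18:00:00
   print(as_days_since_1970)  # 2040-01-01 18:00:00

The two readings differ by twenty years, so the reference date has to come from
the data provider's documentation rather than from the values themselves.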

.. important::
   Before running the workflow, remove duplicate rows at the platform/profile
   level. Duplicates can produce incorrect datasets even when every other
   setting is correct.
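
One possible way to find and drop such duplicates with polars is sketched
below. The sample data is made up, and the choice of ``id_columns`` is an
assumption: here a row counts as a duplicate when platform, timestamp, and
pressure all repeat, which may not match how duplicates appear in your data.

.. code-block:: python

   from datetime import datetime

   import polars as pl

   # Made-up sample: the second and third rows describe the same observation.
   df = pl.DataFrame(
       {
           "platform_code": ["A", "A", "A", "B"],
           "profile_timestamp": [datetime(2020, 1, 1)] * 4,
           "pres": [5.0, 10.0, 10.0, 5.0],
           "temp": [12.1, 11.8, 11.8, 9.4],
       }
   )

   # Assumed identity of one observation; adjust to your data.
   id_columns = ["platform_code", "profile_timestamp", "pres"]

   # Count rows whose identity occurs more than once (every copy is counted).
   n_duplicates = df.select(id_columns).is_duplicated().sum()

   # Keep the first occurrence of each identity and preserve the row order.
   deduplicated = df.unique(subset=id_columns, keep="first", maintain_order=True)

   print(n_duplicates, deduplicated.height)  # 2 3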

Save the Preprocessed Data
--------------------------

Write the result to Parquet and point ``input_file_name`` in your
``prepare_config.yaml`` at it:

.. code-block:: python

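   import os

   output_file = os.path.expanduser(
       "~/aiqc_project/input/nrt_cora_bo_preprocessed.parquet"
   )
   # Create the target directory if it does not exist yet.
   os.makedirs(os.path.dirname(output_file), exist_ok=True)
   df.write_parquet(output_file)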

Next Steps
----------

Return to the tutorial: :doc:`../tutorial/preparation`.