Step 2: Dataset Preparation

The dataset preparation workflow is the first crucial step in the aiqclib pipeline. It’s designed to prepare feature data sets from your raw data. This includes creating training, validation, and test data sets.

This entire process is driven by a YAML configuration file, ensuring your data preparation is repeatable, transparent, and easy to manage across different experiments.

A Note on Running the Examples

The examples in these tutorials are presented as commands suitable for an interactive Python session (e.g., in a terminal with python or ipython, or within a Jupyter Notebook/Lab).

However, you are encouraged to use the method you are most comfortable with. The code can be run in several ways:

In an Interactive Python Session: Launch Python (python) or IPython (ipython) and paste the code line by line. This is great for quick tests and exploration.
As Python Scripts: Copy the code into a .py file (e.g., prepare_data.py) and execute it from your terminal with python your_script_name.py. This is suitable for automation and batch processing.
In a Jupyter Notebook or Lab: This is a fantastic option for experimentation, as it allows you to run code in cells, add notes, and visualize results interactively.

Feel free to adapt the examples to your preferred environment.

Getting the Example Data

This tutorial uses the Copernicus Marine NRT CTD dataset, publicly available on Kaggle. Before proceeding, let’s set up your project directory structure and download the necessary data.

First, create the directories for your project. This structure will be used consistently throughout the tutorials:

# Create a main project directory for all aiqclib outputs and configs
mkdir -p ~/aiqc_project

# Create subdirectories for configuration files and raw input data
mkdir -p ~/aiqc_project/config
mkdir -p ~/aiqc_project/input

Now, choose one of the following options to download the dataset. Both methods will place the required data in ~/aiqc_project/input/.

This method is ideal for reproducibility and for users who frequently work with Kaggle datasets.

Install and configure the Kaggle API: If you haven’t already, install the kaggle client and set up your API credentials.
```
pip install kaggle
```
Follow the official Kaggle API authentication instructions to obtain your kaggle.json file and place it in the correct location (~/.kaggle/).

Download and unzip the data: This single command downloads and extracts the dataset directly into your input folder.

kaggle datasets download -d takaya88/copernicus-marine-nrt-ctd-data-for-aiqc -p ~/aiqc_project/input --unzip

This method is the fastest way to get the data, as it requires no extra tools or setup beyond standard command-line utilities.

Download the zip file using cURL:

curl -L -o ~/aiqc_project/input/data.zip \
  https://www.kaggle.com/api/v1/datasets/download/takaya88/copernicus-marine-nrt-ctd-data-for-aiqc

Unzip the file: Extract the downloaded archive into your input directory.
```
unzip ~/aiqc_project/input/data.zip -d ~/aiqc_project/input/
```

After following either set of instructions, you should now have a file named nrt_cora_bo_4.parquet inside ~/aiqc_project/input/.

Required Input Data Structure

aiqclib expects your raw input data (a Parquet file) to contain specific columns, which are crucial for identifying unique profiles and observations. If your raw data already contains these, you’re good to go. Otherwise, you may need to preprocess your data to create them.

The required columns are:

platform_code: A unique identifier for the measurement platform (e.g., buoy, ship).
profile_no: A unique, sequential number identifying each distinct “profile” (a set of measurements taken at a specific time and location) within a platform_code.
profile_timestamp: The exact datetime of the profile. This column should be of a datetime type (e.g., Pandas/Polars datetime, or similar).
longitude: The longitude of the measurement profile.
latitude: The latitude of the measurement profile.
observation_no: A unique, sequential number identifying each individual observation (row) within a profile_no.
pres: Pressure values for each observation.

Important

If your raw data lacks profile_no, profile_timestamp, or observation_no, you will need to generate them. For detailed examples and helper code on how to perform these common data preprocessing steps (e.g., converting float timestamps, generating unique IDs), please refer to the Data Preprocessing Utilities guide.

The Dataset Preparation Workflow

The aiqclib data preparation workflow consists of three main programmatic steps: generating a configuration template, customizing this template to match your data and desired processing, and finally running the preparation script.

Step 2.1: Generate the Configuration Template

First, use aiqclib to generate a boilerplate configuration template. This file will contain all the necessary sections for the data preparation task, which you will then customize.

import aiqclib as aq
import os

config_path = os.path.expanduser("~/aiqc_project/config/prepare_config.yaml")
aq.write_config_template(
    file_name=config_path,
    stage="prepare"
)

Step 2.2: Customize the Configuration File

Now, open the newly created ~/aiqc_project/config/prepare_config.yaml in a text editor. You need to tell aiqclib where to find your input data, where to save the processed output, and define your targets and features.

You will primarily focus on updating the following sections:

path_info_sets: Define your input and output directories.
target_sets: Specify your prediction targets and their quality control flags.
summary_stats_sets: Provide settings for summary statistics.
feature_sets & feature_param_sets: List the feature engineering methods and their parameters.
feature_stats_sets: Provide statistics for feature normalization.
data_sets: Assemble the full pipeline by linking the named blocks.

Updating path_info_sets and data_sets: Update your prepare_config.yaml to match the following for the path_info_sets and data_sets sections, replacing the placeholder paths with the ones you created in “Getting the Example Data”.

path_info_sets:
  - name: data_set_1
    common:
      base_path: ~/aiqc_project/data # Root directory for all processed output data
    input:
      base_path: ~/aiqc_project/input # Directory where your raw input files are located
      step_folder_name: "" # Set to "" if input files are directly in base_path
    split:
      step_folder_name: training # Subdirectory for the final training/validation/test splits

data_sets:
  - name: dataset_0001  # A unique name for this dataset preparation job
    dataset_folder_name: dataset_0001  # The name of the output folder for this job
    input_file_name: nrt_cora_bo_4.parquet # The specific raw input file to process

Note

The prepare_config.yaml can be quite detailed. For a complete reference of all available configuration options, please consult the dedicated Configuration of Dataset Preparation page.

Note

dmaclib provides methods to down-sample the negative data set. Please refer to the Down-sampling of the Negative Dataset guide for details.

Step 2.2: Run the Preparation Process

Once you have customized your prepare_config.yaml with the correct paths, input file name, and definitions for targets, features, and summary statistics, you can execute the data preparation workflow.

Load the configuration file and then call the create_training_dataset function:

import aiqclib as aq
import os

config_path = os.path.expanduser("~/aiqc_project/config/prepare_config.yaml")
config = aq.read_config(config_path)
aq.create_training_dataset(config)

Understanding the Output

After the commands finishes, your main output directory (as defined by path_info_sets.common.base_path, e.g., ~/aiqc_project/data) will contain a new folder named dataset_0001 (derived from data_sets.dataset_folder_name). Inside this folder, you will find several subdirectories, each representing a stage of the data preparation pipeline:

summary: Contains intermediate files with summary statistics of the input data, often used for normalization or feature scaling.
select: Stores data points identified as “good” (negative samples) and “bad” (positive samples) based on your target and QC flag definitions.
locate: Contains specific observation records for both positive and negative profiles, often after a proximity-based selection.
extract: Holds the features extracted from the observation records, ready for model consumption.
training: The final output directory. This contains the split training, validation, and test datasets in Parquet format, ready for model training and evaluation.

Next Steps

Congratulations! You have successfully prepared your dataset, transforming raw data into a structured format with engineered features and appropriate splits. You are now ready to train your first machine learning model using aiqclib.

Proceed to the next tutorial: Step 3: Training & Evaluation.