Quick Start

This guide shows how to run the full aiqclib machine learning workflow with minimal configuration.

Note

This is a condensed version of the tutorial provided in the “Getting Started” section. See the Overview in the “Getting Started” section for more comprehensive explanations.

Objectives

You will learn how to run all three stages of aiqclib (prepare, train, and classify) by creating a configuration file for each stage. By the end of this guide, you will have three classifiers, one each for temp (temperature), psal (salinity), and pres (pressure), that predict QC labels for the corresponding variable.

Installation

(Optional) We recommend creating a mamba/conda environment before installing aiqclib.

# conda
conda create --name aiqclib -c conda-forge python=3.12 pip uv
conda activate aiqclib

# mamba
mamba create -n aiqclib -c conda-forge python=3.12 pip uv
mamba activate aiqclib

Use pip, conda, or mamba to install aiqclib.

# pip
pip install aiqclib

# conda
conda install -c conda-forge aiqclib

# mamba
mamba install -c conda-forge aiqclib

Download Raw Input Data

You can get the sample input data set (nrt_cora_bo_4.parquet) from Kaggle.

Prepare Directory Structure

The following Python commands create the necessary directory structure for your input and output files.

import os
import polars as pl
import aiqclib as aq

print(f"aiqclib version: {aq.__version__}")

# !! IMPORTANT: Update these placeholder paths to your actual file locations !!
data_path = "/path/to/your/data"  # This will be the root for outputs

config_path = os.path.join(data_path, "config")
os.makedirs(config_path, exist_ok=True)
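
Before moving on, it can help to confirm that the raw input file sits where the configuration files will later expect it. The following is a minimal sketch using only the standard library; the helper name `find_missing_paths` and its arguments are illustrative and are not part of aiqclib.

```python
from pathlib import Path


def find_missing_paths(data_path, input_path, input_file_name):
    """Return the expected paths that do not exist yet (hypothetical helper)."""
    expected = [
        Path(data_path) / "config",          # created by os.makedirs above
        Path(input_path) / input_file_name,  # the raw input file
    ]
    return [str(path) for path in expected if not path.exists()]
```

For example, `find_missing_paths(data_path, "/path/to/your/input", "nrt_cora_bo_4.parquet")` returns an empty list once both the config folder and the downloaded file are in place.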

Stage 1: Data Preparation Stage

The prepare workflow (stage="prepare") is the first step in the machine learning pipeline. It processes your raw data into feature sets and then splits them into training, validation, and test sets.

Template Configuration File

The following command creates a configuration template for this stage.

config_file_prepare = os.path.join(config_path, "data_preparation_config.yaml")
aq.write_config_template(file_name=config_file_prepare, stage="prepare")

Update the Configuration File

File: /path/to/your/data/config/data_preparation_config.yaml

Update Data and Input Paths: Adjust the base_path values in the path_info_sets section.

data_preparation_config.yaml: path_info_sets

path_info_sets:
  - name: data_set_1
    common:
      base_path: /path/to/your/data  # <--- Root directory for generated datasets and models
    input:
      base_path: /path/to/your/input # <--- Directory where the raw input data is located
      step_folder_name: ""

Configure the Test Data Year(s): Specify the year(s) to be held out as an independent test set. The remove_years parameter excludes these years from the training and validation sets.

data_preparation_config.yaml: step_param_sets

step_param_sets:
  - name: data_set_param_set_1
    steps:
      input: { sub_steps: { rename_columns: false,
                            filter_rows: true },
               rename_dict: { },
               filter_method_dict: { remove_years: [ 2023 ], # <--- Year(s) to set aside for the test set
                                     keep_years: [ ] } }
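
To make the effect of the two filter lists concrete, here is a plain-Python sketch of a year-based hold-out. It illustrates the idea only and is not aiqclib's implementation: the function name `filter_by_year`, the record layout, and the rule that an empty keep_years list keeps every remaining year are all assumptions.

```python
from datetime import date


def filter_by_year(records, remove_years=(), keep_years=()):
    """Drop records from remove_years; if keep_years is non-empty, keep only those years."""
    kept = []
    for record in records:
        year = record["date"].year
        if year in remove_years:
            continue  # explicitly excluded year
        if keep_years and year not in keep_years:
            continue  # a keep list is given and this year is not on it
        kept.append(record)
    return kept


# Hypothetical profile records, one per year.
profiles = [
    {"profile_id": 1, "date": date(2021, 6, 1)},
    {"profile_id": 2, "date": date(2022, 3, 15)},
    {"profile_id": 3, "date": date(2023, 1, 20)},
]

# Stage 1 style: hold out 2023 from training and validation.
train_pool = filter_by_year(profiles, remove_years=[2023])
# Stage 3 style: keep only the held-out year.
held_out = filter_by_year(profiles, keep_years=[2023])

print([p["profile_id"] for p in train_pool])  # [1, 2]
print([p["profile_id"] for p in held_out])    # [3]
```

The two calls partition the records, which is the property that keeps the test year independent of the data used for training.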

Specify Input File Name: Ensure input_file_name matches the name of your raw data file.

data_preparation_config.yaml: data_sets

data_sets:
  - name: dataset_0001
    dataset_folder_name: dataset_0001
    input_file_name: nrt_cora_bo_4.parquet # <--- Your input file's name

Run the Data Preparation Stage

Once the configuration file is updated, run the following commands to generate the training, validation, and test datasets.

config_prepare = aq.read_config(os.path.join(config_path, "data_preparation_config.yaml"))
aq.create_training_dataset(config_prepare)

Understanding the Output

After the command finishes, your main output directory (e.g., /path/to/your/data) will contain a new folder named dataset_0001. Inside this folder, you will find several subdirectories, each corresponding to one step of the data preparation stage:

summary: Contains intermediate files with summary statistics.
select: Stores data points identified as “good” (negative samples) and “bad” (positive samples).
locate: Contains specific observation records for positive and negative profiles.
extract: Holds the features extracted from the observation records.
training: The final output directory for this stage. It contains the split training, validation, and test datasets in Parquet format.
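
A quick way to confirm that the stage completed is to check for the folders listed above. This is a small standard-library sketch; the helper name `missing_step_folders` is hypothetical, and it assumes the folders sit directly under the dataset folder as described.

```python
from pathlib import Path

# Folder names listed in this guide for the prepare stage.
PREPARE_STEP_FOLDERS = ["summary", "select", "locate", "extract", "training"]


def missing_step_folders(dataset_dir, expected=PREPARE_STEP_FOLDERS):
    """Return the expected step folders that are absent from dataset_dir."""
    root = Path(dataset_dir)
    return [name for name in expected if not (root / name).is_dir()]
```

Calling `missing_step_folders("/path/to/your/data/dataset_0001")` returns an empty list when every folder is present. The `expected` argument lets you reuse the same check with the folder names produced by later stages.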

Stage 2: Training & Evaluation

The train workflow (stage="train") orchestrates the model building process. It uses the datasets from the prepare stage to perform cross-validation, train the model, and evaluate it.

Template Configuration File

The following command creates a configuration template for this stage.

config_file_train = os.path.join(config_path, "training_config.yaml")
aq.write_config_template(file_name=config_file_train, stage="train")

Update the Configuration File

File: /path/to/your/data/config/training_config.yaml

Update Data Path: Adjust the base_path in the path_info_sets section. This path must point to the same output directory (common.base_path) you defined in data_preparation_config.yaml.

training_config.yaml: path_info_sets

```
path_info_sets:
  - name: data_set_1
    common:
      base_path: /path/to/your/data # <--- Must match the common.base_path from the previous stage
```
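
Because a mismatch between the two base paths is easy to miss, a small check can help. The sketch below works on plain dictionaries shaped like the YAML fragments in this guide, for example as produced by a YAML parser such as PyYAML's `yaml.safe_load`. The helper name `common_base_paths` is hypothetical, and whether the object returned by `aq.read_config` exposes the same structure is not covered here.

```python
def common_base_paths(config):
    """Map each path_info_sets entry name to its common.base_path."""
    return {
        entry["name"]: entry["common"]["base_path"]
        for entry in config.get("path_info_sets", [])
    }


# Dictionaries mirroring the path_info_sets fragments shown in this guide.
prepare_cfg = {
    "path_info_sets": [
        {
            "name": "data_set_1",
            "common": {"base_path": "/path/to/your/data"},
            "input": {"base_path": "/path/to/your/input", "step_folder_name": ""},
        }
    ]
}
train_cfg = {
    "path_info_sets": [
        {"name": "data_set_1", "common": {"base_path": "/path/to/your/data"}}
    ]
}

print(common_base_paths(prepare_cfg) == common_base_paths(train_cfg))  # True
```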

Run the Training & Evaluation Stage

With the configuration file updated, the following commands run cross-validation, model training, and evaluation.

config_train = aq.read_config(os.path.join(config_path, "training_config.yaml"))
aq.train_and_evaluate(config_train)

Understanding the Output

After the command finishes, new folders will be created within your dataset’s output directory (e.g., /path/to/your/data/dataset_0001/). The primary outputs include:

validate: Contains detailed results from the cross-validation process, allowing you to inspect model performance across different data folds.
build: Holds a comprehensive report of the final model’s evaluation on the held-out test dataset.
model: Contains the final, trained model objects. These are the artifacts you will use in the next stage.

Stage 3: Classification

The classify workflow (stage="classify") applies a trained model to make predictions on a new, unseen dataset (e.g., the test set you held out in Stage 1).

Template Configuration File

The following command creates a configuration template for this final stage.

config_file_classify = os.path.join(config_path, "classification_config.yaml")
aq.write_config_template(file_name=config_file_classify, stage="classify")

Update the Configuration File

File: /path/to/your/data/config/classification_config.yaml

Update Paths: Adjust the base_path values for common, input, and model.

- common.base_path: The root directory for your data outputs.
- input.base_path: The location of the raw input data file.
- model.base_path: The location of the trained model from Stage 2.

classification_config.yaml: path_info_sets

path_info_sets:
  - name: data_set_1
    common:
      base_path: /path/to/your/data  # <--- Your common data root
    input:
      base_path: /path/to/your/input # <--- Location of the raw data for classification
      step_folder_name: ""
    model:
      base_path: /path/to/your/data/dataset_0001 # <--- Path to the trained model folder
      step_folder_name: "model"

Configure Classification Data Year(s): Specify the year(s) for the classification dataset using keep_years. This should correspond to the test data year(s) you excluded (remove_years) during data preparation.

classification_config.yaml: step_param_sets

step_param_sets:
  - name: data_set_param_set_1
    steps:
      input: { sub_steps: { rename_columns: false,
                            filter_rows: true },
               rename_dict: { },
               filter_method_dict: { remove_years: [],
                                     keep_years: [ 2023 ] } } # <--- Specify year(s) to *keep* for classification

Specify Input File Name: Ensure input_file_name matches the name of the data file you want to classify.

classification_config.yaml: data_sets

data_sets:
  - name: classification_0001
    dataset_folder_name: dataset_0001
    input_file_name: nrt_cora_bo_4.parquet # <--- Your input file's name

Run the Classification Stage

Once the configuration is complete, the following commands will apply the model to the specified data and generate classification results.

config_classify = aq.read_config(os.path.join(config_path, "classification_config.yaml"))
aq.classify_dataset(config_classify)

Understanding the Output

After this command finishes, the output directories will be generated within /path/to/your/data/dataset_0001/. The most important output is in the classify directory:

classify: This is the final output directory for the workflow. It contains:
- A .parquet file with the original input data augmented with new columns for the model’s predictions (e.g., temp_prediction) and prediction probabilities (e.g., temp_probability).
- A summary report detailing the classification results.

Other intermediate folders (summary, select, locate, extract) are also created, mirroring the process used during data preparation to ensure consistency.

Conclusion

Congratulations! You have successfully completed the entire aiqclib workflow, from raw data preparation to training a machine learning model and using it to generate predictions on new data.

You now have a repeatable, configurable pipeline. You can adapt the configuration files to process new datasets, experiment with different models, or integrate the pipeline into larger automated workflows.