Demo I.I - Loading the Data (Refactored)

Important - Paths and Directories

This is a bit tedious, but it needs to be defined up front, otherwise things get confusing later. We need a few important paths to be pre-defined:

Name            Variable       Purpose
Project         PROJECT_PATH   top-level directory for the project (assuming megatron)
Code            CODE_PATH      folder for any dedicated functions that we use
Raw Data        RAW_PATH       where the raw data lives; ideally we never touch this except to read
Processed Data  DATA_PATH      where the processed data is stored
Interim Data    INTERIM_PATH   where we save the training, validation and testing data
Saved Models    MODEL_PATH     where we save any trained models
Results Data    RESULTS_PATH   where we save any data results or outputs from ML models
Figures         FIG_PATH       where we store any figures plotted during any part of our ML pipeline

The next cell checks that all of the paths exist. If a path is missing, it probably means you're not working on megatron. If that's the case... well, we'll cross that bridge when we get there.

import pathlib
import sys

# define the top level directory
PROJECT_PATH = pathlib.Path("/media/disk/erc/papers/2019_ML_OCN/")
CODE_PATH = PROJECT_PATH.joinpath("ml4ocean")
sys.path.append(str(CODE_PATH))

# ml4ocean packages
from src.utils import get_paths
from src.data.world import get_full_data, world_features
from src.features.world import subset_independent_floats

PATHS = get_paths()

# standard packages
import numpy as np
import pandas as pd

%load_ext autoreload
%autoreload 2
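
Below is a minimal sketch of the existence check described above. Only PROJECT_PATH, CODE_PATH and PATHS.data_interim appear elsewhere in this notebook; any other attributes on the PATHS object are assumptions, so extend the list with whatever get_paths() actually returns.

# minimal sketch of the path check; the list of directories to verify is
# an assumption - extend it with whatever get_paths() actually returns
for path in (PROJECT_PATH, CODE_PATH, PATHS.data_interim):
    assert pathlib.Path(path).exists(), f"Missing directory: {path}"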

1. Load Processed Global Data

from src.data.world import get_input_data, get_meta_data
t = get_meta_data()
t.dtypes
wmo          int64
n_cycle      int64
N            int64
lon        float64
lat        float64
juld       float64
date        object
dtype: object
t = get_input_data()
t.dtypes
N               int64
wmo             int64
n_cycle         int64
sla           float64
PAR           float64
RHO_WN_412    float64
RHO_WN_443    float64
RHO_WN_490    float64
RHO_WN_555    float64
RHO_WN_670    float64
doy_sin       float64
doy_cos       float64
x_cart        float64
y_cart        float64
z_cart        float64
PC1           float64
PC2           float64
PC3           float64
PC4           float64
PC5           float64
PC6           float64
PC7           float64
PC1.1         float64
PC2.1         float64
PC3.1         float64
PC1.2         float64
PC2.2         float64
PC3.2         float64
PC4.1         float64
bbp           float64
bbp.1         float64
bbp.2         float64
bbp.3         float64
bbp.4         float64
bbp.5         float64
bbp.6         float64
bbp.7         float64
bbp.8         float64
bbp.9         float64
bbp.10        float64
bbp.11        float64
bbp.12        float64
bbp.13        float64
bbp.14        float64
bbp.15        float64
bbp.16        float64
bbp.17        float64
bbp.18        float64
dtype: object
full_df = get_full_data()

2. Training and Test Split

2.1 - Independent Set I (SOCA2016)

This independent set is a fixed group of floats which are excluded from the training and validation phases. These floats were used in a previous paper (Sauzede et al., 2016) and are held back for the testing phase to showcase how well the models did.

  • 6901472
  • 6901493
  • 6901523
  • 6901496

So we need to remove these floats from the data.

_, soca2016_df = subset_independent_floats(full_df, 'soca2016')
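
For reference, a rough manual equivalent of the call above, assuming full_df keeps the float id in a wmo column (it may live in the index instead, in which case filter on the index values):

# hypothetical manual subset of the SOCA2016 floats, assuming a 'wmo' column
soca2016_wmos = [6901472, 6901493, 6901523, 6901496]
soca2016_manual = full_df[full_df["wmo"].isin(soca2016_wmos)]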

2.2 - Independent Set II (ISPRS2020)

This independent set consists of floats taken from the ISPRS paper (Sauzede et al., 2020, pending...). These floats are used as the independent testing set to showcase the performance of the ML methods.

  • 6901486 (North Atlantic?)
  • 3902121 (Subtropical Gyre?)

So we need to remove these floats from the data as well.

_, isprs2020_df = subset_independent_floats(full_df, 'isprs2020')

2.3 - ML Data

Now we want to subset the data to be used for the ML models. Basically, we keep everything that is not in the independent floats. In addition, we only want the variables listed in the input features we defined earlier.

# subset non-independent floats
dataset = 'both'
ml_df, _ = subset_independent_floats(full_df, dataset)
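
A quick sanity check we can run here (a sketch, assuming the wmo column is present on each frame): the ML subset should share no floats with the independent sets, and the three subsets should account for every row of full_df.

# sanity check: no independent floats leak into the ML subset
independent_wmos = set(soca2016_df["wmo"]).union(isprs2020_df["wmo"])
assert not set(ml_df["wmo"]) & independent_wmos
# the three subsets should partition the full dataset
assert len(ml_df) + len(soca2016_df) + len(isprs2020_df) == len(full_df)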

2.4 - Inputs, Outputs

Lastly, we will need to split the data into training, validation (and possibly testing) sets. For now, we separate the inputs from the outputs; recall that the input and output feature lists were already defined above in world_features.

input_df = ml_df[world_features.input]
output_df = ml_df[world_features.output]
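
This notebook only separates inputs from outputs; purely as an illustration, a simple random train/validation split with scikit-learn might look like the sketch below. The 80/20 ratio and random_state are illustrative choices, not part of the original pipeline.

from sklearn.model_selection import train_test_split

# illustrative 80/20 random split; the real pipeline may split differently
X_train, X_valid, y_train, y_valid = train_test_split(
    input_df, output_df, train_size=0.8, random_state=42
)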

3. Final Dataset (saving)

3.1 - Print out data dimensions (w. metadata)

print("Input Data:", input_df.shape)
print("Output Data:", output_df.shape)
print("SOCA2016 Independent Data:", soca2016_df[world_features.input].shape, soca2016_df[world_features.output].shape)
print("ISPRS2016 Independent Data:", isprs2020_df[world_features.input].shape, isprs2020_df[world_features.output].shape)
Input Data: (24704, 26)
Output Data: (24704, 19)
SOCA2016 Independent Data: (378, 26) (378, 19)
ISPRS2020 Independent Data: (331, 26) (331, 19)

3.2 - Saving

  • We're going to save the data in the global/interim/ path. This is to prevent any overwrites of the processed data.
  • We also need to pass index=True when saving so that the metadata indices are preserved.

By relaxing the comparison precision by a smidge (1e-14 instead of 1e-15), we find that the saved and reloaded arrays are the same, so we can trust the save/load round trip.

input_df.to_csv(f"{PATHS.data_interim.joinpath('inputs.csv')}", index=True)
output_df.to_csv(f"{PATHS.data_interim.joinpath('outputs.csv')}", index=True)
soca2016_df.to_csv(f"{PATHS.data_interim.joinpath('soca2016.csv')}", index=True)
isprs2020_df.to_csv(f"{PATHS.data_interim.joinpath('isprs2020.csv')}", index=True)
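
To make the precision remark above concrete, a round-trip check might look like the sketch below. Note that index_col=0 assumes a single index column was written; adjust it if the saved index is a MultiIndex.

# reload the saved inputs and compare with the in-memory frame at 1e-14
inputs_loaded = pd.read_csv(PATHS.data_interim.joinpath("inputs.csv"), index_col=0)
np.testing.assert_allclose(inputs_loaded.values, input_df.values, rtol=1e-14)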