Demo I.I - Loading the Data (Refactored)¶
Important - Paths and Directories¶
This is tedious, but it needs to be defined up front; otherwise things get confusing. We need a few important paths to be pre-defined:
Name | Variable | Purpose
---|---|---
Project | PROJECT_PATH | top-level directory for the project (assuming megatron)
Code | CODE_PATH | folder of any dedicated functions that we use
Raw Data | RAW_PATH | where the raw data is; ideally, we never touch this except to read
Processed Data | DATA_PATH | where the processed data is stored
Interim Data | INTERIM_PATH | where we save the training, validation and testing data
Saved Models | MODEL_PATH | where we save any trained models
Results Data | RESULTS_PATH | where we save any data results or outputs from ML models
Figures | FIG_PATH | where we store any plotted figures during any part of our ML pipeline
This cell checks to see if all of the paths exist. If there is a path missing, it probably means you're not in megatron. If that's the case...well, we'll cross that bridge when we get there.
import pathlib
import sys
# define the top level directory
PROJECT_PATH = pathlib.Path("/media/disk/erc/papers/2019_ML_OCN/")
CODE_PATH = PROJECT_PATH.joinpath("ml4ocean")
sys.path.append(str(CODE_PATH))
# ml4ocean packages
from src.utils import get_paths
from src.data.world import get_full_data, world_features
from src.features.world import subset_independent_floats
PATHS = get_paths()
# standard packages
import numpy as np
import pandas as pd
%load_ext autoreload
%autoreload 2
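The existence check described above can be sketched with plain pathlib. The directory names below are illustrative placeholders, not the actual layout returned by `get_paths`:

```python
import pathlib

# Hypothetical project layout; adjust PROJECT_PATH to your own machine.
PROJECT_PATH = pathlib.Path("/media/disk/erc/papers/2019_ML_OCN/")

# The sub-directories we expect (names here are illustrative).
required = {
    "raw": PROJECT_PATH / "data" / "raw",
    "interim": PROJECT_PATH / "data" / "interim",
    "models": PROJECT_PATH / "models",
}

# collect the names of any directories that do not exist yet
missing = [name for name, path in required.items() if not path.exists()]
if missing:
    print(f"Missing directories: {missing} -- are you on megatron?")
```

If anything turns up in `missing`, you are probably not on the expected machine.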
1. Load Processed Global Data¶
from src.data.world import get_input_data, get_meta_data
t = get_meta_data()
t.dtypes
t = get_input_data()
t.dtypes
full_df = get_full_data()
2 - Training and Test Split¶
2.1 - Independent Set I (SOCA2016)¶
This independent set has a fixed number of independent floats which are not counted in the training or validation phase. These floats were used in a paper (Sauzede et al., 2016) and during the testing phase to showcase how well the models did.
- 6901472
- 6901493
- 6901523
- 6901496
So we need to take these away from the data.
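The actual split is done by `subset_independent_floats`, but the idea can be sketched on a toy frame. The column name `wmo` below is an assumption for illustration; the real data identifies profiles by float WMO number:

```python
import pandas as pd

# Toy stand-in for full_df (the "wmo" column name is hypothetical).
df = pd.DataFrame({
    "wmo": [6901472, 6901493, 6901523, 6901496, 1234567, 7654321],
    "value": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
})

# the four SOCA2016 independent floats listed above
soca2016_floats = [6901472, 6901493, 6901523, 6901496]

# split into the independent (test) floats and everything else
mask = df["wmo"].isin(soca2016_floats)
soca2016_df = df[mask]
rest_df = df[~mask]

print(len(soca2016_df), len(rest_df))  # 4 2
```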
_, soca2016_df = subset_independent_floats(full_df, 'soca2016')
2.2 - Independent Set II (ISPRS2020)¶
This independent set was a set of floats taken from the ISPRS paper (Sauzede et al., 2020 (pending...)). These floats were used as the independent testing set to showcase the performance of the ML methods.
- 6901486 (North Atlantic?)
- 3902121 (Subtropical Gyre?)
So we need to take these away from the data.
_, isprs2020_df = subset_independent_floats(full_df, 'isprs2020')
2.3 - ML Data¶
Now we want to subset the input data to be used for the ML models. Basically, we can subset all datasets that are not in the independent floats. In addition, we want all of the variables in the input features that we provided earlier.
# subset non-independent floats
dataset = 'both'
ml_df, _ = subset_independent_floats(full_df, dataset)
2.4 - Inputs, Outputs¶
Lastly, we need to split the data into training, validation (and possibly testing) sets. Recall that the input and output features were already defined above.
input_df = ml_df[world_features.input]
output_df = ml_df[world_features.output]
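The training/validation split itself can be sketched with a shuffled index on toy data. The 20% hold-out fraction and the toy frames below are illustrative, not the values used in this project:

```python
import numpy as np
import pandas as pd

# Toy inputs/outputs standing in for input_df / output_df.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(10, 3)), columns=["a", "b", "c"])
y = pd.DataFrame(rng.normal(size=(10, 1)), columns=["target"])

# Shuffle row positions and hold out 20% for validation.
idx = rng.permutation(len(X))
n_valid = int(0.2 * len(X))
valid_idx, train_idx = idx[:n_valid], idx[n_valid:]

X_train, X_valid = X.iloc[train_idx], X.iloc[valid_idx]
y_train, y_valid = y.iloc[train_idx], y.iloc[valid_idx]

print(X_train.shape, X_valid.shape)  # (8, 3) (2, 3)
```

The same split could also be done with `sklearn.model_selection.train_test_split`; the manual version just makes the mechanics explicit.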
3. Final Dataset (saving)¶
3.1 - Print out data dimensions (w. metadata)¶
print("Input Data:", input_df.shape)
print("Output Data:", output_df.shape)
print("SOCA2016 Independent Data:", soca2016_df[world_features.input].shape, soca2016_df[world_features.output].shape)
print("ISPRS2020 Independent Data:", isprs2020_df[world_features.input].shape, isprs2020_df[world_features.output].shape)
3.2 - Saving¶
- We're going to save the data in the `global/interim/` path. This is to prevent any overwrites.
- We also need `index=True` for the savefile in order to preserve the metadata indices.
input_df.to_csv(f"{PATHS.data_interim.joinpath('inputs.csv')}", index=True)
output_df.to_csv(f"{PATHS.data_interim.joinpath('outputs.csv')}", index=True)
soca2016_df.to_csv(f"{PATHS.data_interim.joinpath('soca2016.csv')}", index=True)
isprs2020_df.to_csv(f"{PATHS.data_interim.joinpath('isprs2020.csv')}", index=True)
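To see why `index=True` matters, here is a minimal round-trip sketch: writing a frame with a metadata index and reading it back with `index_col` recovers it exactly. The frame and the index name `n_cycle` are hypothetical stand-ins:

```python
import io
import pandas as pd

# Toy frame with a meaningful (metadata) index, standing in for input_df.
df = pd.DataFrame({"x": [1.0, 2.0]}, index=pd.Index([101, 202], name="n_cycle"))

# Write with index=True, then read back with index_col to recover the index.
buf = io.StringIO()
df.to_csv(buf, index=True)
buf.seek(0)
df2 = pd.read_csv(buf, index_col="n_cycle")

print(df.equals(df2))  # True
```

With `index=False` the index column would be dropped on write, and the metadata indices would be lost on reload.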