Demo I - Loading the Data¶
Important - Paths and Directories¶
This is tedious, but these need to be defined up front, otherwise things get confusing. We need a few important paths to be pre-defined:
Name | Variable | Purpose
---|---|---
Project | `PROJECT_PATH` | top-level directory for the project (assuming megatron)
Code | `CODE_PATH` | folder of any dedicated functions that we use
Raw Data | `RAW_PATH` | where the raw data is; ideally, we never touch this except to read
Processed Data | `DATA_PATH` | where the processed data is stored
Interim Data | `INTERIM_PATH` | where we save the training, validation and testing data
Saved Models | `MODEL_PATH` | where we save any trained models
Results Data | `RESULTS_PATH` | where we save any data results or outputs from ML models
Figures | `FIG_PATH` | where we store any figures plotted during any part of our ML pipeline
This cell checks that all of the paths exist. If a path is missing, it probably means you're not on megatron. If that's the case...well, we'll cross that bridge when we get there.
import pathlib
import sys
# define the top level directory
PROJECT_PATH = pathlib.Path("/media/disk/erc/papers/2019_ML_OCN/")
CODE_PATH = PROJECT_PATH.joinpath("ml4ocean", "src")
# check if path exists and is a directory
assert PROJECT_PATH.exists() & PROJECT_PATH.is_dir()
assert CODE_PATH.exists() & CODE_PATH.is_dir()
# add code and project paths to PYTHONPATH (to call functions)
sys.path.append(str(PROJECT_PATH))
sys.path.append(str(CODE_PATH))
# specific paths
FIG_PATH = PROJECT_PATH.joinpath("ml4ocean/reports/figures/global/")
RAW_PATH = PROJECT_PATH.joinpath("data/global/raw/")
DATA_PATH = PROJECT_PATH.joinpath("data/global/processed/")
INTERIM_PATH = PROJECT_PATH.joinpath("data/global/interim/")
MODEL_PATH = PROJECT_PATH.joinpath("models/global/")
RESULTS_PATH = PROJECT_PATH.joinpath("data/global/results/")
# check if path exists and is a directory
assert FIG_PATH.exists() & FIG_PATH.is_dir()
assert RAW_PATH.exists() & RAW_PATH.is_dir()
assert DATA_PATH.exists() & DATA_PATH.is_dir()
assert INTERIM_PATH.exists() & INTERIM_PATH.is_dir()
assert MODEL_PATH.exists() & MODEL_PATH.is_dir()
assert RESULTS_PATH.exists() & RESULTS_PATH.is_dir()
Python Packages¶
# Standard packages
import numpy as np
import pandas as pd
1. Load Processed Global Data¶
In this section, I will load the metadata and the actual data. The steps involved are:
- Define the filepath (check for existence)
- Open meta data and real data
- Check that the samples correspond to each other.
- Check that the number of features is as expected
1.1 - Meta Data¶
# name of file
meta_name = "METADATA_20200310.csv"
# get full path
meta_file = DATA_PATH.joinpath(meta_name)
# assert meta file exists
error_msg = f"File '{meta_file.name}' doesn't exist. Check name or directory."
assert meta_file.exists(), error_msg
# assert meta file is a file
error_msg = f"File '{meta_file.name}' isn't a file. Check name or directory."
assert meta_file.is_file(), error_msg
# open meta data
meta_df = pd.read_csv(f"{meta_file}", sep=',')
# NOTE (Ana): I got the error "AttributeError: 'DataFrame' object has no attribute 'to_markdown'"
# meta_df.head().to_markdown()
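# (to_markdown was added in pandas 1.0 and also requires the tabulate package to be installed)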
meta_df.head()
meta_df.shape
1.2 - Input Data¶
# name of file
data_name = "SOCA_GLOBAL2_20200310.csv"
# get full path
data_file = DATA_PATH.joinpath(data_name)
# assert exists
error_msg = f"File '{data_file.name}' doesn't exist. Check name or directory."
assert data_file.exists(), error_msg
# assert meta file is a file
error_msg = f"File '{data_file.name}' isn't a file. Check name or directory."
assert data_file.is_file(), error_msg
# load data
data_df = pd.read_csv(f"{data_file}")
# NOTE (Ana): same to_markdown error as above
#data_df.head().iloc[:, :6].to_markdown()
data_df.iloc[0:10, :6]
data_df.shape
1.3 - Checks¶
I do a number of checks to make sure that the data follows the expected standard and that I am reproducing the same results:
- the number of samples is the same for the meta data and the input data
- 7 meta features
- 48 data features (26 data + 19 levels + 3 meta)
- the feature names match the expected columns
# same number of samples
error_msg = f"Mismatch between meta and data: {data_df.shape[0]} =/= {meta_df.shape[0]}"
assert data_df.shape[0] == meta_df.shape[0], error_msg
# check number of samples
n_samples = 25413
error_msg = f"Incorrect number of samples: {data_df.shape[0]} =/= {n_samples}"
assert data_df.shape[0] == n_samples, error_msg
# check meta feature names
meta_features = ['wmo', 'n_cycle', 'N', 'lon', 'lat', 'juld', 'date']
error_msg = f"Missing features in meta data."
assert meta_df.columns.tolist() == meta_features, error_msg
# check data feature names
input_meta_features = ['N', 'wmo', 'n_cycle']
input_features = ['sla', 'PAR', 'RHO_WN_412', 'RHO_WN_443',
'RHO_WN_490', 'RHO_WN_555', 'RHO_WN_670', 'doy_sin', 'doy_cos',
'x_cart', 'y_cart', 'z_cart', 'PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'PC6',
'PC7', 'PC1.1', 'PC2.1', 'PC3.1', 'PC1.2', 'PC2.2', 'PC3.2', 'PC4.1']
output_features = ['bbp', 'bbp.1', 'bbp.2', 'bbp.3', 'bbp.4', 'bbp.5', 'bbp.6', 'bbp.7',
'bbp.8', 'bbp.9', 'bbp.10', 'bbp.11', 'bbp.12', 'bbp.13', 'bbp.14',
'bbp.15', 'bbp.16', 'bbp.17', 'bbp.18']
features = input_meta_features + input_features + output_features
error_msg = f"Missing features in input data."
assert data_df.columns.tolist() == features, error_msg
1.4 - Convert metadata to indices (Important)¶
To make our life easier, we're going to eliminate the need to keep track of the meta data all of the time. So I'm going to merge the two datasets into a single dataframe and then set the metadata values as the index. The remaining columns will be the features.
So in the end, we will have a dataframe where:
- the index is the metadata (e.g. wmo, n_cycle)
- the columns are the actual features (e.g. sla, PCA components, bbp, etc.)
# merge meta and data
full_df = pd.merge(meta_df, data_df)
# convert meta information to indices
full_df = full_df.set_index(meta_features)
# checks - check indices match metadata
meta_features = ['wmo', 'n_cycle', 'N', 'lon', 'lat', 'juld', 'date']
error_msg = f"Missing features in input data."
assert full_df.index.names == meta_features, error_msg
# checks - check column names match feature names
input_features = ['sla', 'PAR', 'RHO_WN_412', 'RHO_WN_443',
'RHO_WN_490', 'RHO_WN_555', 'RHO_WN_670', 'doy_sin', 'doy_cos',
'x_cart', 'y_cart', 'z_cart', 'PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'PC6',
'PC7', 'PC1.1', 'PC2.1', 'PC3.1', 'PC1.2', 'PC2.2', 'PC3.2', 'PC4.1']
output_features = ['bbp', 'bbp.1', 'bbp.2', 'bbp.3', 'bbp.4', 'bbp.5', 'bbp.6', 'bbp.7',
'bbp.8', 'bbp.9', 'bbp.10', 'bbp.11', 'bbp.12', 'bbp.13', 'bbp.14',
'bbp.15', 'bbp.16', 'bbp.17', 'bbp.18']
features = input_features + output_features
error_msg = f"Missing features in input data."
assert full_df.columns.tolist() == features, error_msg
full_df.columns
print('Dataframe Features:', full_df.shape)
full_df.columns.tolist()[:10]
print('Dataframe Indices (meta vars):', len(full_df.index.names))
full_df.index.names
2 - Training and Test Split¶
2.1 - Independent Set I (SOCA2016)¶
This independent set contains a fixed group of floats which are not used in the training or validation phase. These floats were used in a previous paper (Sauzede et al., 2016) and are used here during the testing phase to showcase how well the models do.
- 6901472
- 6901493
- 6901523
- 6901496
So we need to hold these floats out from the rest of the data.
# soca2016 independent floats
soca2016_floats = ["6901472", "6901493", "6901523", "6901496"]
# subset soca2016 floats
soca2016_df = full_df[full_df.index.isin(soca2016_floats, level='wmo')]
soca2016_df.shape
Checks¶
# check number of samples (meta, inputs)
n_samples = 378
error_msg = f"Incorrect number of samples for soca2016 floats: {soca2016_df.shape[0]} =/= {n_samples}"
assert soca2016_df.shape[0] == n_samples, error_msg
Ana: Why are there 378 rows if there are only 4 floats? I guess they don't all have the same number of profiles. Just to be clear.¶
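One quick way to check this (a minimal sketch that relies on the wmo index level defined above) is to count how many profiles each float contributes:
# number of samples (profiles) per soca2016 float
soca2016_df.groupby(level='wmo').size()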
2.2 - Independent Set II (ISPRS2020)¶
This independent set consists of floats taken from the ISPRS paper (Sauzede et al., 2020, pending). These floats were used as the independent testing set to showcase the performance of the ML methods.
- 6901486 (North Atlantic?)
- 3902121 (Subtropical Gyre?)
So we need to hold these floats out from the rest of the data.
# isprs2020 independent floats
isprs2020_floats = ["6901486", "3902121"]
# subset isprs2020 floats
isprs2020_df = full_df[full_df.index.isin(isprs2020_floats, level='wmo')]
isprs2020_df.shape
Checks¶
# check number of samples (meta, inputs)
n_samples = 331
error_msg = f"Incorrect number of samples for isprs2016 floats: {isprs2020_df.shape[0]} =/= {n_samples}"
assert isprs2020_df.shape[0] == n_samples, error_msg
2.3 - ML Data¶
Now we want to subset the data to be used for the ML models. Basically, we keep every profile that does not belong to one of the independent floats. In addition, we keep all of the input features that we defined earlier.
# subset non-independent floats
ml_df = full_df[~full_df.index.isin(isprs2020_floats + soca2016_floats, level='wmo')]
ml_df.shape
Checks¶
# check number of samples (meta, inputs)
n_samples = 24704
error_msg = f"Incorrect number of samples for non-independent floats: {ml_df.shape[0]} =/= {n_samples}"
assert ml_df.shape[0] == n_samples, error_msg
2.4 - Inputs, Outputs¶
Lastly, before splitting the data into training, validation (and possibly testing) sets, we separate the inputs from the outputs. Recall that the input and output feature lists were already defined above.
input_df = ml_df[input_features]
output_df = ml_df[output_features]
# checks - Input Features
n_input_features = 26
error_msg = f"Incorrect number of features for input df: {input_df.shape[1]} =/= {n_input_features}"
assert input_df.shape[1] == n_input_features, error_msg
# checks - Output Features
n_output_features = 19
error_msg = f"Incorrect number of features for output df: {output_df.shape[1]} =/= {n_output_features}"
assert output_df.shape[1] == n_output_features, error_msg
input_df.shape, output_df.shape
3. Final Dataset (saving)¶
3.1 - Print out data dimensions (w. metadata)¶
print("Input Data:", input_df.shape)
print("Output Data:", output_df.shape)
print("SOCA2016 Independent Data:", soca2016_df[input_features].shape, soca2016_df[output_features].shape)
print("ISPRS2016 Independent Data:", isprs2020_df[input_features].shape, isprs2020_df[output_features].shape)
3.2 - Saving¶
- We're going to save the data in the `global/interim/` path. This is to prevent any overwrites of the processed data.
- We also need to set `index=True` when saving in order to preserve the metadata indices.
input_df.to_csv(f"{INTERIM_PATH.joinpath('inputs.csv')}", index=True)
3.3 - Loading¶
This is a tiny bit tricky if we want to preserve the metadata as the index. We need to set the index to the same meta columns that we used before via the `.set_index(meta_features)` command.
test_inputs_df = pd.read_csv(f"{INTERIM_PATH.joinpath('inputs.csv')}")
# add index
test_inputs_df = test_inputs_df.set_index(meta_features)
QUESTION (Ana): if we have already saved the file, couldn't we still use input_df here? The one we saved is not supposed to be modified, right?¶
3.4 - Checking¶
Curiously, we cannot compare the dataframes for exact equality because writing to CSV and reading back introduces a tiny amount of floating-point round-off. If we compare the arrays element-wise, however, we find that they are almost equal. See below what happens when we compare the arrays exactly versus approximately.
# are they exactly the same? (uncomment to see the exact checks fail due to CSV round-off)
# np.testing.assert_array_equal(test_inputs_df.describe(), input_df.describe())
# np.testing.assert_array_equal(test_inputs_df.values, input_df.values)
# almost equal to 14 decimal places (note: decimal expects an integer, not a tolerance)
np.testing.assert_array_almost_equal(test_inputs_df.values, input_df.values, decimal=14)
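To see how small the discrepancy actually is, here is a minimal sketch that computes the largest element-wise difference between the reloaded and original inputs:
# largest absolute difference between the reloaded and original input values
np.abs(test_inputs_df.values - input_df.values).max()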
QUESTION (Ana): Should we save the data with that precision specified already?¶
input_df.to_csv(f"{INTERIM_PATH.joinpath('inputs_Ana.csv')}", index=True, float_format='%.14f')
3.5 - Save the rest of the data¶
input_df.to_csv(f"{INTERIM_PATH.joinpath('inputs.csv')}", index=True)
output_df.to_csv(f"{INTERIM_PATH.joinpath('outputs.csv')}", index=True)
soca2016_df.to_csv(f"{INTERIM_PATH.joinpath('soca2016.csv')}", index=True)
isprs2020_df.to_csv(f"{INTERIM_PATH.joinpath('isprs2020.csv')}", index=True)
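Later on, these files can be reloaded the same way as in section 3.3, restoring the metadata index. A minimal sketch (the test_* variable names are just placeholders, not part of the original pipeline):
# reload the saved interim files and restore the metadata index
test_outputs_df = pd.read_csv(f"{INTERIM_PATH.joinpath('outputs.csv')}").set_index(meta_features)
test_soca2016_df = pd.read_csv(f"{INTERIM_PATH.joinpath('soca2016.csv')}").set_index(meta_features)
test_isprs2020_df = pd.read_csv(f"{INTERIM_PATH.joinpath('isprs2020.csv')}").set_index(meta_features)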