Data Analysis¶
This notebook showcases some exploratory analysis of the given datasets. I will take a look at control datasets covering the following regions:
- North Atlantic
- Subtropical Gyres
import pandas as pd
import numpy as np
# preprocessing
from sklearn.preprocessing import StandardScaler
# plotting
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn')
%matplotlib inline
%load_ext autoreload
%autoreload 2
data_path = '/home/emmanuel/projects/2020_ml_ocn/data/RAW/CONTROL/'
control_gp1 = 'NORTH_ATLANTIC'
control_gp2 = 'SUBTROPICAL_GYRES'
Import Data¶
First we will load the control datasets and look at the core variables.
Core Variables¶
For the first steps, we will look at the core variables. These include:
- Latitude
- Longitude
- Spectral Variables
- 412, 443, 490, 555, 670
- PAR
- Sea Level Anomaly (SLA)
- Mixed Layer Depth (MLD)
Meta-Data
- Day of the Year
- number of cycles
- wmo
So roughly 8 core variables in total (not counting the location and meta-variables).
CORE_VARS = [
'sla', 'PAR', 'RHO_WN_412', 'RHO_WN_443', "RHO_WN_490",
"RHO_WN_555", "RHO_WN_670",
"MLD",
]
CORE_OUTS = [
'sla',
]
LOCATION_VARS = [
'lat', 'lon', 'doy',
]
META_VARS = [
'wmo', 'n_cycle'
]
def load_control_data(control='na'):
    # choose control group data
    if control == 'na':
        region = 'NORTH_ATLANTIC'
        filename_ext = 'NA'
    elif control == 'stg':
        region = 'SUBTROPICAL_GYRES'
        filename_ext = 'STG'
    else:
        raise ValueError(f"Unrecognized control group: {control}")
    # load data
    X = pd.read_csv(f"{data_path}{region}/X_INPUT_{filename_ext}.csv")
    y = pd.read_csv(f"{data_path}{region}/BBP_OUTPUT_{filename_ext}.csv")
    return X, y
X_na, Y_na = load_control_data(control='na')
X_features = X_na[CORE_VARS]
X_features.head()
General Statistics for the Datasets¶
X_features.describe()
At first glance, I think we will definitely have to normalize the data, because the columns span very different scales. Fairly standard procedure.
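To make the scale mismatch concrete, here is a minimal sketch with synthetic columns (the names and magnitudes are assumptions, loosely mimicking what `.describe()` shows: reflectances of order 1e-3 next to mixed-layer depths of order 1e2). `StandardScaler` puts every column on zero mean and unit variance:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# hypothetical columns mimicking the mixed scales in the real data
demo = pd.DataFrame({
    "RHO_WN_443": rng.normal(2e-3, 5e-4, 200),  # small-magnitude reflectance
    "MLD": rng.normal(150.0, 40.0, 200),        # large-magnitude depth (m)
})

scaled = pd.DataFrame(
    StandardScaler().fit_transform(demo), columns=demo.columns
)

# after scaling, each column has ~zero mean and unit variance,
# so no variable dominates purely because of its units
print(scaled.mean().round(6).tolist())
print(scaled.std(ddof=0).round(6).tolist())
```

This is exactly what the normalized pairplots further down rely on.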
import geopandas as gpd
from shapely.geometry import Point, Polygon
def get_geodataframe(dataframe: pd.DataFrame) -> gpd.GeoDataFrame:
    """Transform a pandas.DataFrame into a geopandas.GeoDataFrame,
    which has a special geometry column. This makes plotting
    a lot easier."""
    # build point geometries from the lon/lat columns
    geometry = [Point(xy) for xy in zip(dataframe['lon'], dataframe['lat'])]
    # coordinate reference system (WGS 84)
    crs = 'EPSG:4326'
    # create the geodataframe
    gpd_df = gpd.GeoDataFrame(
        dataframe,
        crs=crs,
        geometry=geometry,
    )
    return gpd_df
gpd_df = get_geodataframe(X_na)
gpd_df.head()
For plotting, it's a bit more involved, but we can do it quite easily. I found some help from a few online resources.
def plot_geolocations(gpd_df: gpd.GeoDataFrame) -> None:
    # get the background map
    path = gpd.datasets.get_path('naturalearth_lowres')
    world_df = gpd.read_file(path)
    # initialize figure
    fig, ax = plt.subplots(figsize=(10, 10))
    # add background world map
    world_df.plot(ax=ax, color='gray')
    # add the locations of the dataset
    gpd_df.plot(ax=ax, color='red', markersize=2)
    plt.show()
plot_geolocations(gpd_df)
Looks pretty good. All the points seem to be in the North Atlantic area. A few also sit near the equator, and there are not many of them, so I wonder whether that will affect the results.
PairPlots¶
First and foremost, I would like to look at the pairplots of the variables, just to see what we're dealing with. I'll do both the unnormalized and normalized versions.
def plot_pairplots(dataframe: pd.DataFrame) -> None:
    # seaborn's pairplot creates its own figure, so no plt.figure() is needed
    sns.pairplot(dataframe)
    plt.show()
# unnormalized
plot_pairplots(X_features)
Normalized¶
# normalized
X_norm = StandardScaler().fit_transform(X_features)
X_norm = pd.DataFrame(X_norm, columns=X_features.columns)
# plot data
plot_pairplots(X_norm)
The normalized plots look OK actually. I don't think there will be much need to actually normalize this data.
PairPlots - Outputs¶
So we were told that the output space is something to consider. We have a very high-dimensional output, which I imagine will give us some problems. We will definitely have to reduce its dimensionality.
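One standard option for that reduction is PCA on the output profiles. The sketch below uses synthetic depth profiles (the shapes, sizes, and `Y_demo` name are all assumptions, not the real `BBP_OUTPUT` data) to show how the cumulative explained variance tells us how many components we could keep:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
n_samples, n_depths = 300, 276
depths = np.linspace(0, 1000, n_depths)
# fake smooth profiles: exponential decay with random amplitude/scale + noise
amp = rng.uniform(0.5, 2.0, (n_samples, 1))
scale = rng.uniform(100, 400, (n_samples, 1))
Y_demo = amp * np.exp(-depths / scale) + rng.normal(0, 0.01, (n_samples, n_depths))

pca = PCA(n_components=10).fit(Y_demo)
cumvar = np.cumsum(pca.explained_variance_ratio_)
# number of components needed to explain 99% of the variance
n_keep = int(np.searchsorted(cumvar, 0.99) + 1)
print(n_keep, cumvar[:3].round(3))
```

Because smooth profiles are highly correlated across depth, a handful of components typically captures almost all of the variance, which is what makes this kind of output tractable.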
Profiles¶
Let's just look at the distribution of the data itself.
Y_na.head()
plt.figure(figsize=(50, 50))
plt.imshow(Y_na.drop(META_VARS, axis=1).T, cmap='viridis')
plt.show()
import matplotlib.colors as colors

def plot_bbp_profile(dataframe: pd.DataFrame) -> None:
    # log-scale the colors so small bbp values remain visible
    norm = colors.LogNorm(vmin=dataframe.values.min(), vmax=dataframe.values.max())
    fig, ax = plt.subplots(figsize=(50, 50))
    ax.imshow(dataframe.T, cmap='viridis', norm=norm)
    plt.show()
So first of all, we can't really see anything, so I'll have to adjust the scale a little. I'll use a logarithmic color normalization (`LogNorm`).
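As a quick sanity check of what `LogNorm` does: it maps values onto [0, 1] on a log scale, so data spanning several orders of magnitude shares the colormap evenly (the `vmin`/`vmax` values here are made up for illustration):

```python
import numpy as np
import matplotlib.colors as colors

# normalize values spanning four orders of magnitude
norm = colors.LogNorm(vmin=1e-4, vmax=1e0)
vals = np.array([1e-4, 1e-2, 1e0])
# each factor of 100 advances the colormap by the same amount
print(norm(vals))  # evenly spaced: [0.0, 0.5, 1.0]
```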
import matplotlib.colors as colors

Y = Y_na.drop(META_VARS, axis=1)
norm = colors.LogNorm(vmin=Y.values.min(), vmax=Y.values.max())
plt.figure(figsize=(50, 50))
plt.imshow(Y.T, cmap='viridis', norm=norm)
plt.show()
So this is a bit better. I also see far fewer extreme values now, which is quite nice.
CONTROL GROUP II - Subtropical Gyres¶
So we will do the exact same analysis as above, but for the second control group. It will be less code because the functions are already defined above.
X_na, Y_na = load_control_data(control='stg')
X_features = X_na[CORE_VARS]
X_features.head()
X_features.describe()
Immediate note: we have less data. That's never ideal, but it is what it is.
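To illustrate why less data is a concern, here is a hedged toy experiment (purely synthetic numbers, not the control datasets): estimates computed from fewer samples are noisier, with the spread of the sample mean shrinking like 1/sqrt(n):

```python
import numpy as np

rng = np.random.default_rng(42)

def mean_spread(n, trials=2000):
    # std of the sample mean across many independent draws of size n
    draws = rng.normal(0.0, 1.0, (trials, n))
    return draws.mean(axis=1).std()

big, small = mean_spread(1000), mean_spread(100)
# with 10x fewer samples, the mean estimate is ~sqrt(10) times noisier
print(small / big)
```

So any statistics we compute for the STG group will simply carry more uncertainty than the North Atlantic ones.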
gpd_df = get_geodataframe(X_na)
gpd_df.head()
plot_geolocations(gpd_df)
Definitely interesting: the locations are much more spread out here, grouped into distinct clusters, and there is also some overlap in one particular region.
plot_pairplots(X_features)
# normalized
X_norm = StandardScaler().fit_transform(X_features)
X_norm = pd.DataFrame(X_norm, columns=X_features.columns)
# plot data
plot_pairplots(X_norm)
Profiles¶
Y_na.head()
Y = Y_na.drop(META_VARS, axis=1)
plot_bbp_profile(Y)
So I see a few more values that are noticeably higher here. I'm not sure what that means, but relatively speaking we have more spread in our data.