
Data Analysis

This notebook showcases some exploratory analysis of the given datasets. I will take a look at the control datasets, which cover the following regions:

  • North Atlantic
  • Subtropical Gyres
import pandas as pd
import numpy as np

# preprocessing
from sklearn.preprocessing import StandardScaler

# plotting
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn')

%matplotlib inline
%load_ext autoreload
%autoreload 2
data_path = '/home/emmanuel/projects/2020_ml_ocn/data/RAW/CONTROL/'
control_gp1 = 'NORTH_ATLANTIC'
control_gp2 = 'SUBTROPICAL_GYRES'

Import Data

First we will look at the core variables and the meta-data.

Core Variables

For the first steps, we will look at the core variables. These include:

  • Latitude
  • Longitude
  • Spectral Variables (RHO_WN reflectances)
    • 412, 443, 490, 555, 670 nm
  • PAR
  • Sea Level Anomaly (sla)
  • Mixed Layer Depth (MLD)

Meta-Data

  • Day of the year (doy)
  • Number of cycles (n_cycle)
  • WMO float identifier (wmo)

So roughly 8 input variables in total (not including the location and meta-variables).

data_path = '/home/emmanuel/projects/2020_ml_ocn/data/RAW/CONTROL/'

CORE_VARS = [
    'sla', 'PAR', 'RHO_WN_412', 'RHO_WN_443', "RHO_WN_490",
    "RHO_WN_555", "RHO_WN_670",
    "MLD",
]

CORE_OUTS = [
    'sla',
]
LOCATION_VARS = [
    'lat', 'lon', 'doy',

]
META_VARS = [
    'wmo', 'n_cycle'
]
def load_control_data(control='na'):

    # choose control group data
    if control == 'na':
        region = 'NORTH_ATLANTIC'
        filename_ext = 'NA'
    elif control == 'stg':
        region = 'SUBTROPICAL_GYRES'
        filename_ext = 'STG'
    else:
        raise ValueError(f"Unrecognized control group: {control}")

    # Load Data
    X = pd.read_csv(f"{data_path}{region}/X_INPUT_{filename_ext}.csv")
    y = pd.read_csv(f"{data_path}{region}/BBP_OUTPUT_{filename_ext}.csv")

    return X, y
X_na, Y_na = load_control_data(control='na')
X_features = X_na[CORE_VARS]

X_features.head()
sla PAR RHO_WN_412 RHO_WN_443 RHO_WN_490 RHO_WN_555 RHO_WN_670 MLD
0 5.421500 44.7510 0.026754 0.021997 0.016902 0.004928 0.000069 50
1 5.845199 45.2395 0.025535 0.020452 0.016493 0.004793 0.000119 58
2 0.756100 50.9293 0.007764 0.009328 0.009720 0.003551 0.000421 52
3 0.326600 50.8669 0.007064 0.008832 0.009639 0.003629 0.000407 62
4 0.326300 53.1347 0.020773 0.019439 0.015983 0.005122 0.000562 54

General Statistics for the Datasets

X_features.describe()
sla PAR RHO_WN_412 RHO_WN_443 RHO_WN_490 RHO_WN_555 RHO_WN_670 MLD
count 3022.000000 3022.000000 3022.000000 3022.000000 3022.000000 3022.000000 3022.000000 3022.000000
mean -1.023435 42.587315 0.017911 0.015458 0.013138 0.006466 0.000970 92.319656
std 6.241974 13.341398 0.010524 0.007983 0.005180 0.002834 0.000447 145.681095
min -47.530300 7.658160 0.000789 0.001000 0.001829 0.002191 0.000004 10.000000
25% -4.710750 33.493225 0.010973 0.010070 0.010000 0.004785 0.000662 20.000000
50% -1.061400 44.568950 0.014952 0.013515 0.012166 0.005929 0.000991 40.000000
75% 2.866325 53.848125 0.021841 0.018442 0.015337 0.007369 0.001151 82.000000
max 29.168600 63.337100 0.063060 0.052012 0.053483 0.036277 0.005287 995.000000

At first glance, I think we will have to normalize the data for sure, because the values are all over the place. Fairly standard procedure.
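As a rough check of how different the scales really are, here is a minimal sketch (just reusing the X_features frame from above) that prints the per-column ranges. MLD spans hundreds of metres while the reflectances are of order 1e-3, so the columns differ by several orders of magnitude.

# quick look at the per-column ranges (column names as defined in CORE_VARS)
col_ranges = X_features.max() - X_features.min()
print(col_ranges.sort_values(ascending=False))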

Locations

To get a feel for the locations, we can simply use the lat, lon values to display them on a map. I used tips from a blog post on plotting with geopandas.

import geopandas as gpd
from shapely.geometry import Point, Polygon

def get_geodataframe(dataframe: pd.DataFrame) -> gpd.GeoDataFrame:
    """This function will transform the dataset from a
    pandas.DataFrame to a geopandas.GeoDataFrame which will
    have a special column for geometry. This will make plotting
    a lot easier."""
    # build point geometries from the lon/lat columns of the input dataframe
    geometry = [Point(xy) for xy in zip(dataframe['lon'], dataframe['lat'])]

    # coordinate reference system (WGS 84)
    crs = 'EPSG:4326'

    # create the geodataframe
    gpd_df = gpd.GeoDataFrame(
        dataframe,
        crs=crs,
        geometry=geometry
    )
    return gpd_df
gpd_df = get_geodataframe(X_na)

gpd_df.head()
wmo n_cycle sla PAR RHO_WN_412 RHO_WN_443 RHO_WN_490 RHO_WN_555 RHO_WN_670 MLD lon lat doy geometry
0 3902122 42 5.421500 44.7510 0.026754 0.021997 0.016902 0.004928 0.000069 50 -27.221187 12.073372 1 POINT (-27.22119 12.07337)
1 3902122 43 5.845199 45.2395 0.025535 0.020452 0.016493 0.004793 0.000119 58 -27.042640 12.199892 6 POINT (-27.04264 12.19989)
2 3902122 49 0.756100 50.9293 0.007764 0.009328 0.009720 0.003551 0.000421 52 -26.491484 11.554463 36 POINT (-26.49148 11.55446)
3 3902122 50 0.326600 50.8669 0.007064 0.008832 0.009639 0.003629 0.000407 62 -26.471429 11.441130 41 POINT (-26.47143 11.44113)
4 3902122 51 0.326300 53.1347 0.020773 0.019439 0.015983 0.005122 0.000562 54 -26.447072 11.427888 46 POINT (-26.44707 11.42789)

For plotting, it's a bit more involved, but we can do it quite easily. I found some help from a few online resources.

def plot_geolocations(gpd_df: gpd.GeoDataFrame)-> None:

    # get the background map
    path = gpd.datasets.get_path('naturalearth_lowres')
    world_df = gpd.read_file(path)

    # initialize figure
    fig, ax = plt.subplots(figsize=(10, 10))

    # add background world map
    world_df.plot(ax=ax, color='gray')

    # add the locations of the dataset
    gpd_df.plot(ax=ax, color='red', markersize=2)

    plt.show()
plot_geolocations(gpd_df)

Looks pretty good. All the points seem to be in the North Atlantic area. Some are also near the equator, and there are not many of them, so I wonder if that will affect the results.
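Out of curiosity, here is a small sketch counting how many profiles sit close to the equator; the 10-degree cutoff is an arbitrary choice, purely for illustration.

# count profiles within 10 degrees of the equator (threshold is arbitrary)
near_equator = gpd_df[gpd_df['lat'].abs() < 10]
print(f"{len(near_equator)} of {len(gpd_df)} profiles lie within 10 degrees of the equator")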

PairPlots

First and foremost, I would like to look at the pairplots of the variables, just to see what we're dealing with. I'll do both the unnormalized and normalized versions.

def plot_pairplots(dataframe: pd.DataFrame) -> None:

    # seaborn's pairplot creates its own figure, so there is no need
    # to call plt.figure() beforehand
    sns.pairplot(dataframe)

    plt.show()
# unnormalized 

plot_pairplots(X_features)

Normalized

# normalized 
X_norm = StandardScaler().fit_transform(X_features)
X_norm = pd.DataFrame(X_norm, columns=X_features.columns)

# plot data
plot_pairplots(X_norm)

The normalized plots actually look OK; apart from the change of scale the distributions look the same, so I don't think there will be much need to normalize this data.

PairPlots - Outputs

So we were told that the output layer is something to consider. We have a very high-dimensional output, which I imagine will give us some problems, so we will definitely have to reduce the dimensionality.
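As one possible way of handling this, here is a minimal sketch that compresses the bbp depth profiles with PCA. This is only an illustration of the idea (the component count is chosen arbitrarily), not a final preprocessing choice.

from sklearn.decomposition import PCA

# drop the meta columns so only the depth levels remain
Y_profiles = Y_na.drop(META_VARS, axis=1)

# compress the ~276 depth levels into a handful of principal components
pca = PCA(n_components=10)
Y_reduced = pca.fit_transform(Y_profiles)

print(Y_reduced.shape)                          # (n_samples, 10)
print(pca.explained_variance_ratio_.cumsum())   # cumulative variance retained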

Profiles

Let's just look at the distribution of the data itself.

Y_na.head()
wmo n_cycle X1 X3 X5 X7 X9 X11 X13 X15 ... X955 X960 X965 X970 X975 X980 X985 X990 X995 X1000
0 3902122 42 0.000880 0.000870 0.000860 0.00088 0.000840 0.00086 0.00087 0.000873 ... 0.000256 0.000240 0.00022 0.000225 0.000240 0.00024 0.00024 0.000240 0.000240 0.000240
1 3902122 43 0.000830 0.000820 0.000800 0.00080 0.000794 0.00080 0.00080 0.000800 ... 0.000220 0.000230 0.00023 0.000230 0.000230 0.00023 0.00023 0.000251 0.000260 0.000260
2 3902122 49 0.000790 0.000830 0.000900 0.00078 0.000848 0.00090 0.00090 0.000900 ... 0.000230 0.000210 0.00022 0.000225 0.000223 0.00024 0.00026 0.000380 0.000270 0.000270
3 3902122 50 0.000775 0.000895 0.000783 0.00079 0.000830 0.00077 0.00077 0.000780 ... 0.000228 0.000240 0.00023 0.000230 0.000230 0.00024 0.00022 0.000270 0.000280 0.000280
4 3902122 51 0.000860 0.000920 0.000880 0.00092 0.000920 0.00084 0.00084 0.000890 ... 0.000220 0.000225 0.00023 0.000230 0.000230 0.00024 0.00026 0.000235 0.000235 0.000235

5 rows × 278 columns

plt.figure(figsize=(50,50))
plt.imshow(Y_na.drop(META_VARS, axis=1).T, cmap='viridis')
# plt.xticks(labels=y.columns.values)
plt.show()
import matplotlib.colors as colors

def plot_bbp_profile(dataframe: pd.DataFrame) -> None:
    # log-scaled color normalization so the small bbp values remain visible
    norm = colors.LogNorm(vmin=dataframe.values.min(), vmax=dataframe.values.max())

    fig, ax = plt.subplots(figsize=(50, 50))
    ax.imshow(dataframe.T, cmap='viridis', norm=norm)
    plt.show()

So first of all, we can't really see anything, so I'll have to adjust the scale a little bit. I'll use a logarithmic color normalization (LogNorm).

import matplotlib.colors as colors

Y = Y_na.drop(META_VARS, axis=1)

norm = colors.LogNorm(vmin=Y.values.min(), vmax=Y.values.max())
plt.figure(figsize=(50,50))
plt.imshow(Y.T, cmap='viridis', norm=norm)
# plt.xticks(labels=y.columns.values)
plt.show()

So this is a bit better. I also see a lot fewer extreme values, which is quite nice.

CONTROL GROUP II - Subtropical Gyres

So we will do the exact same analysis as above but for the second control group. It will be less code because the functions are already defined above.

X_na, Y_na = load_control_data(control='stg')
X_features = X_na[CORE_VARS]

X_features.head()
sla PAR RHO_WN_412 RHO_WN_443 RHO_WN_490 RHO_WN_555 RHO_WN_670 MLD
0 19.850199 35.6356 0.037567 0.029403 0.020264 0.004894 0.000560 40
1 19.360900 35.7430 0.036155 0.028722 0.019437 0.004475 0.000149 10
2 3.658199 63.8234 0.056097 0.043019 0.025556 0.005367 0.000575 10
3 0.672800 63.5382 0.052678 0.039138 0.023210 0.004321 0.000358 10
4 -10.379800 62.6472 0.049627 0.038268 0.023311 0.004684 0.000510 10
X_features.describe()
sla PAR RHO_WN_412 RHO_WN_443 RHO_WN_490 RHO_WN_555 RHO_WN_670 MLD
count 1486.000000 1486.000000 1486.000000 1486.000000 1486.000000 1486.000000 1.486000e+03 1486.000000
mean 1.638224 47.677969 0.048893 0.036750 0.022760 0.004863 6.208601e-04 53.334455
std 6.830798 10.594949 0.010345 0.006920 0.002832 0.000740 2.340827e-04 31.431297
min -20.751599 27.726300 0.014931 0.015627 0.012073 0.002241 2.737260e-09 10.000000
25% -2.407725 37.646975 0.042085 0.032025 0.020937 0.004428 4.757155e-04 32.000000
50% 1.422450 45.771700 0.048925 0.036802 0.022995 0.004894 5.831605e-04 48.000000
75% 5.376425 58.763150 0.057000 0.041904 0.024736 0.005192 7.523037e-04 71.000000
max 33.062599 67.356800 0.077000 0.055167 0.033292 0.008618 1.549230e-03 194.000000

Immediate note: we have less data here (1486 samples versus 3022 for the North Atlantic). That's never ideal, but it is what it is.
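For reference, a quick sketch that makes the size difference explicit by reloading both control groups (the variable names here are only for illustration):

# compare sample counts of the two control groups
X_na_ctrl, _ = load_control_data(control='na')
X_stg_ctrl, _ = load_control_data(control='stg')

print(f"North Atlantic:    {X_na_ctrl.shape[0]} samples")
print(f"Subtropical Gyres: {X_stg_ctrl.shape[0]} samples")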

gpd_df = get_geodataframe(X_na)

gpd_df.head()
wmo n_cycle sla PAR RHO_WN_412 RHO_WN_443 RHO_WN_490 RHO_WN_555 RHO_WN_670 MLD lon lat doy geometry
0 1901347 12 19.850199 35.6356 0.037567 0.029403 0.020264 0.004894 0.000560 40 78.149099 -20.609199 192 POINT (78.14910 -20.60920)
1 1901347 16 19.360900 35.7430 0.036155 0.028722 0.019437 0.004475 0.000149 10 78.030200 -20.651700 194 POINT (78.03020 -20.65170)
2 1901347 186 3.658199 63.8234 0.056097 0.043019 0.025556 0.005367 0.000575 10 72.087800 -20.514000 4 POINT (72.08780 -20.51400)
3 1901347 189 0.672800 63.5382 0.052678 0.039138 0.023210 0.004321 0.000358 10 71.725799 -20.484700 13 POINT (71.72580 -20.48470)
4 1901347 195 -10.379800 62.6472 0.049627 0.038268 0.023311 0.004684 0.000510 10 71.283800 -20.956100 31 POINT (71.28380 -20.95610)
plot_geolocations(gpd_df)

It's definitely interesting that the locations are much more spread out, yet grouped into distinct clusters. There is also some overlap between floats in one particular region.
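Each cluster presumably corresponds to the trajectory of a single float, so a quick sketch counting profiles per wmo identifier can check that (this uses only the columns already in the dataframe):

# number of distinct floats and profiles per float in this control group
print(gpd_df['wmo'].nunique(), "distinct floats")
print(gpd_df.groupby('wmo')['n_cycle'].count().sort_values(ascending=False))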

plot_pairplots(X_features)
# normalized 
X_norm = StandardScaler().fit_transform(X_features)
X_norm = pd.DataFrame(X_norm, columns=X_features.columns)

# plot data
plot_pairplots(X_norm)

Profiles

Y_na.head()
wmo n_cycle X1 X3 X5 X7 X9 X11 X13 X15 ... X955 X960 X965 X970 X975 X980 X985 X990 X995 X1000
0 1901347 4 0.00060 0.00060 0.00060 0.00060 0.00060 0.000600 0.000600 0.000600 ... 0.00028 0.000280 0.00028 0.00028 0.00028 0.00028 0.00028 0.00028 0.00028 0.00028
1 1901347 12 0.00061 0.00061 0.00061 0.00061 0.00061 0.000610 0.000604 0.000594 ... 0.00028 0.000271 0.00028 0.00028 0.00028 0.00028 0.00028 0.00028 0.00028 0.00028
2 1901347 16 0.00060 0.00060 0.00060 0.00060 0.00060 0.000600 0.000595 0.000590 ... 0.00027 0.000270 0.00027 0.00027 0.00027 0.00027 0.00027 0.00027 0.00027 0.00027
3 1901347 24 0.00059 0.00059 0.00059 0.00059 0.00059 0.000590 0.000590 0.000590 ... 0.00027 0.000270 0.00027 0.00027 0.00028 0.00028 0.00028 0.00028 0.00028 0.00028
4 1901347 28 0.00059 0.00059 0.00059 0.00059 0.00059 0.000585 0.000580 0.000585 ... 0.00027 0.000270 0.00027 0.00027 0.00027 0.00027 0.00027 0.00027 0.00027 0.00027

5 rows × 278 columns

Y = Y_na.drop(META_VARS, axis=1)

plot_bbp_profile(Y)

So I see a few more values that are noticeably higher here. I'm not sure what that means, but relatively speaking we have more spread in this data.
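To put a rough number on that, here is a small sketch comparing the overall spread of the bbp profiles in the two control groups. Both groups are reloaded because Y_na was overwritten above, and the variable names are only for illustration.

# coefficient of variation of the bbp profiles for each control group
_, Y_na_ctrl = load_control_data(control='na')
_, Y_stg_ctrl = load_control_data(control='stg')

for name, df in [('North Atlantic', Y_na_ctrl), ('Subtropical Gyres', Y_stg_ctrl)]:
    profiles = df.drop(META_VARS, axis=1).values
    print(f"{name}: std={profiles.std():.2e}, cv={profiles.std() / profiles.mean():.2f}")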