Preprocessing Steps¶
In this notebook, I outline a few key preprocessing steps that could give us a better representation of our data.
import sys
import numpy as np
import pandas as pd  # used further down (pd.DataFrame)
sys.path.insert(0, '/home/emmanuel/projects/2020_ml_ocn/ml4ocean/src')
from data.make_dataset import DataLoad
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')
%load_ext autoreload
%autoreload 2
Inputs¶
dataloader = DataLoad()
X, y = dataloader.load_control_data('na')
X = X[dataloader.core_vars + dataloader.loc_vars]
y = y.drop(dataloader.meta_vars, axis=1)
Visualize PairPlots¶
fig = plt.figure(figsize=(8,8))
pts = sns.pairplot(X)
plt.show()
- We need to normalize
- There are some seriously skewed distributions
- MLD
Standard Preprocessing¶
Standardization¶
This is a very common transformation, and it's typically the one we start with. It involves subtracting the mean \mu from each feature and dividing by its standard deviation \sigma.
We can use sklearn.preprocessing.StandardScaler to perform the standardization.
Note: I don't transform the lat, lon, and doy coordinates here. I think there are smarter transformations for those variables, which are outlined below.
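To make the formula concrete, here is a minimal sketch that standardizes by hand with NumPy. The array `X_demo` and its column scales are made up for illustration and are not the notebook's data. With its default settings, `StandardScaler` computes the same quantity, using the population standard deviation (`ddof=0`).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: three columns with very different means and scales
X_demo = rng.normal(loc=[10.0, -3.0, 500.0], scale=[2.0, 0.5, 100.0], size=(1000, 3))

mu = X_demo.mean(axis=0)      # per-column mean
sigma = X_demo.std(axis=0)    # per-column standard deviation (ddof=0)
X_std = (X_demo - mu) / sigma

# After standardization every column has mean ~0 and standard deviation ~1
print(X_std.mean(axis=0).round(6))
print(X_std.std(axis=0).round(6))
```

Keeping `mu` and `sigma` around matters in practice: they should be estimated on the training data only and then reused on any test data.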
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer, make_column_transformer
transformer = StandardScaler(with_mean=True, with_std=True)
X_transformer = make_column_transformer((transformer, dataloader.core_vars))
X_ = X_transformer.fit_transform(X)
X_norm = X.copy()
X_norm[dataloader.core_vars] = X_
X_norm.shape
fig = plt.figure(figsize=(8,8))
pts = sns.pairplot(X_norm)
plt.show()
Log Transformation¶
This is a very simple but very common transformation in the sciences. I think it will highlight some aspects near the surface, because it stretches the region near the low end of the distribution and compresses the tail. This is very similar to what we observed in our outputs. It might not be necessary for all of our inputs, but it will probably be necessary for the outputs. I have heard that transforming the Mixed Layer Depth (MLD) might also improve the representation.
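As a rough illustration of why the log helps with skewed variables, the sketch below draws a synthetic, strictly positive, right-skewed sample and compares its skewness before and after the transform. The sample `mld_demo` is a stand-in drawn from a lognormal distribution, not the real MLD values, and `skewness` is a small helper written for this example.

```python
import numpy as np

def skewness(a):
    """Sample skewness: third central moment over variance**1.5."""
    a = np.asarray(a, dtype=float)
    centered = a - a.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

rng = np.random.default_rng(0)

# Hypothetical stand-in for a skewed, strictly positive variable
mld_demo = rng.lognormal(mean=3.0, sigma=0.8, size=5000)

skew_raw = skewness(mld_demo)
skew_log = skewness(np.log(mld_demo))

print(f"skewness before log: {skew_raw:.2f}")
print(f"skewness after log:  {skew_log:.2f}")
```

Note that `np.log` requires strictly positive values. If a variable contains zeros, `np.log1p` is a common alternative.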
Inputs¶
We can use sklearn.preprocessing.FunctionTransformer to apply the logarithmic transformation. We can also use sklearn.compose.ColumnTransformer to ensure that we only apply it to the columns that interest us.
Variables
- MLD
X_mld = X['MLD']
# Do a histogram plot
fig, ax = plt.subplots(ncols=2)
ax[0].hist(X_mld, bins=100)
ax[1].hist(np.log(X_mld), bins=100)
plt.show()
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.preprocessing import FunctionTransformer
transformer = FunctionTransformer(np.log, validate=True)
mld_transformer = make_column_transformer((transformer, ['MLD']))
# Use original data before the normalization above
X_mld = mld_transformer.fit_transform(X)
X_norm = X.copy()
X_norm['MLD'] = X_mld
# # perform standardization on entire dataset
# X_norm = normalizer.fit_transform(X_norm)
# X_norm = pd.DataFrame(X_norm, columns=X.columns)
fig = plt.figure(figsize=(8,8))
pts = sns.pairplot(X_norm)
plt.show()
X_norm.describe()
Coordinate Transformation¶
fig = plt.figure(figsize=(8,8))
pts = sns.pairplot(X)
plt.show()
from features.build_features import times_2_cycles, geo_2_cartesian
X = geo_2_cartesian(X)
X.describe()
from mpl_toolkits import mplot3d
fig = plt.figure()
ax = plt.axes(projection='3d')
# pass the color by keyword; the 4th positional argument of a 3D scatter is not the color
ax.scatter(X['x'], X['y'], X['z'], c='gray')
fig = plt.figure(figsize=(8,8))
# X_rad is never defined in this notebook; plot the transformed X instead
pts = sns.pairplot(X)
plt.show()
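The implementation of `geo_2_cartesian` is not shown in this notebook. As a minimal sketch of what such a coordinate transformation could look like, the hypothetical helper `latlon_to_unit_sphere` below maps latitude and longitude (in degrees) to points on the unit sphere. The function name, the unit radius, and the demo coordinates are all assumptions made for illustration.

```python
import numpy as np

def latlon_to_unit_sphere(lat_deg, lon_deg):
    """Map latitude/longitude in degrees to (x, y, z) on the unit sphere."""
    lat = np.deg2rad(np.asarray(lat_deg, dtype=float))
    lon = np.deg2rad(np.asarray(lon_deg, dtype=float))
    x = np.cos(lat) * np.cos(lon)
    y = np.cos(lat) * np.sin(lon)
    z = np.sin(lat)
    return np.stack([x, y, z], axis=-1)

# Hypothetical points: two of them sit on either side of the date line
lat_demo = np.array([0.0, 0.0, 0.0, 90.0])
lon_demo = np.array([0.0, 179.9, -179.9, 45.0])

xyz = latlon_to_unit_sphere(lat_demo, lon_demo)

# Longitudes 179.9 and -179.9 differ by 359.8 as raw numbers,
# but the corresponding points are almost identical in Cartesian space
gap = np.linalg.norm(xyz[1] - xyz[2])
print(xyz.round(3))
print(gap)
```

The motivation is that raw longitude has an artificial discontinuity at ±180 degrees, which distance-based models would otherwise treat as a large separation.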
Day of Year Transformation¶
For the day of year, a cyclic transformation is preferable, so that the last days of one year end up next to the first days of the following year.
times = ['doy']
X = times_2_cycles(X, times)
X.describe()
plt.scatter(X['doy_cos'], X['doy_sin'])
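The implementation of `times_2_cycles` is not shown here. A minimal sketch of a cyclic encoding, assuming a fixed period of 365 days (leap years ignored), is given below. The helper name `encode_cyclic` and the demo values are hypothetical.

```python
import numpy as np

def encode_cyclic(values, period):
    """Encode a periodic variable as a (cos, sin) pair on the unit circle."""
    angle = 2.0 * np.pi * np.asarray(values, dtype=float) / period
    return np.cos(angle), np.sin(angle)

# Hypothetical days of year: start, spring, mid-year, end
doy_demo = np.array([1, 92, 183, 365])
doy_cos, doy_sin = encode_cyclic(doy_demo, period=365.0)

# Distance between day 1 and day 365 (one day apart across the year boundary)
d_wrap = np.hypot(doy_cos[0] - doy_cos[3], doy_sin[0] - doy_sin[3])
# Distance between day 1 and day 183 (half a year apart)
d_half = np.hypot(doy_cos[0] - doy_cos[2], doy_sin[0] - doy_sin[2])

print(d_wrap, d_half)
```

In the encoded space, day 1 and day 365 are close together while days half a year apart are far apart, which is the behaviour a raw integer `doy` column cannot express.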
Clustering¶
Another preprocessing technique is to look at clusters of the data points. We first find the clusters and then plot their locations to see whether that helps.
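Before running this on the real data, the idea can be sketched on synthetic data. The blobs in `X_demo` are made up for illustration. The sketch also standardizes the features first, which is an assumption on my part rather than something the cells below do. K-means relies on Euclidean distances, so features with large scales would otherwise dominate the clustering.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: three well-separated 2D blobs of 100 points each
centers = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])

# Put both features on a comparable scale before clustering
X_scaled = StandardScaler().fit_transform(X_demo)

km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X_scaled)

# Number of points assigned to each cluster
counts = np.bincount(labels, minlength=3)
print(counts)
```

The resulting `labels` array can be used as a boolean mask (for example `labels == 0`) to select and plot the rows belonging to each cluster.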
from sklearn.cluster import KMeans
from sklearn.preprocessing import KBinsDiscretizer
from features.build_features import get_geodataframe
from visualization.visualize import plot_geolocations
# verbose expects an integer (0 = silent), not None
clf = KMeans(init='k-means++', n_clusters=3, n_init=10, max_iter=1_000, verbose=0)
clf.fit(X)
clusters = clf.predict(X)
clusters_geo = pd.DataFrame(clusters, columns=['clusters'])
clusters_geo['lat'] = X['lat']
clusters_geo['lon'] = X['lon']
clusters_geo = get_geodataframe(clusters_geo)
plot_geolocations(clusters_geo[clusters_geo['clusters']==0], color='red')
plot_geolocations(clusters_geo[clusters_geo['clusters']==1], color='red')
plot_geolocations(clusters_geo[clusters_geo['clusters']==2], color='red')
import matplotlib.colors as colors

# define the helper before it is used below
def plot_bbp_profile(dataframe: pd.DataFrame):
    # logarithmic color scale; the small offset keeps vmin strictly positive
    norm = colors.LogNorm(vmin=dataframe.values.min()+1e-10, vmax=dataframe.values.max())
    fig, ax = plt.subplots(figsize=(50,50))
    ax.imshow(dataframe.T, cmap='viridis', norm=norm)
    plt.show()

# plot clusters: KMeans was fit with n_clusters=3, so the labels are 0, 1, 2
cluster_colors = ['red', 'blue', 'orange']
for k, color in enumerate(cluster_colors):
    plot_geolocations(clusters_geo[clusters_geo['clusters']==k], color=color)
    plot_bbp_profile(y[clusters == k])
y.shape
Outputs¶
We can use the sklearn function sklearn.
to create
Log Transform¶
Standard Scaling (Mean only...)¶
KBinsDiscretization¶