Sea states clustering GM vs KM #254

Open
dtgaebe opened this issue Aug 9, 2023 · 6 comments

dtgaebe commented Aug 9, 2023

I was working with the PacWave_resource_characterization_example, but tried using sklearn.cluster.KMeans instead of sklearn.mixture.GaussianMixture to cluster the data. Note that KMeans requires the data to be normalized prior to clustering.
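As a minimal sketch of that normalization step (using sklearn's MinMaxScaler here instead of the manual scaling in the full code further down; `data_clean` comes from the example notebook):

```python
# Minimal sketch: scale Hm0/Te to [0, 1] before KMeans, then map the cluster
# centers back to physical units.
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

X = data_clean[["Hm0", "Te"]].values        # `data_clean` from the example notebook
scaler = MinMaxScaler()
X_norm = scaler.fit_transform(X)

km = KMeans(n_clusters=8, random_state=1).fit(X_norm)
centers = scaler.inverse_transform(km.cluster_centers_)  # back to Hm0 [m], Te [s]
```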

I found that KMeans needs fewer clusters than GM to converge to the total amount of energy in the data. This is illustrated here as the ratio between the energy of the representative sea states and the total energy (as in the example):

| Nr. clusters | ratio GM | ratio KM |
|---|---|---|
| 4 | 0.9099 | 0.9331 |
| 6 | 0.9126 | 0.9679 |
| 8 | 0.9370 | 0.9859 |
| 12 | 0.9657 | 1.0006 |
| 16 | 0.9822 | 1.0078 |
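For reference, the ratio in the table is computed as in the code below:

$$\mathrm{ratio} = \frac{N_{\mathrm{records}} \sum_k w_k J_k}{J_{\mathrm{total}}}$$

where $w_k$ and $J_k$ are the weight and energy flux of cluster $k$, $N_{\mathrm{records}}$ is the number of sea state records (`len(data_clean)`), and $J_{\mathrm{total}}$ is the total energy in the data (the `Total` variable from the example).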

I also sorted the clusters by weight and visualized the weights with the color scale in the plots:
[Figure: Hm0 vs. Te scatter plots for each number of clusters, GM clusters (left column) and KMeans clusters (right column), with cluster centers marked and points colored by cluster weight]
The two clustering algorithms give different results, but I don't know whether we should be considering other criteria besides how well the clusters represent the total amount of energy.

The quick-and-dirty code extending the PacWave_resource_characterization_example to reproduce the figure and table is:

```python
# Compute Gaussian Mixture Model and KMeans clusters for each number of clusters
Ns = [4, 6, 8, 12, 16]
X = np.vstack((data_clean.Te.values, data_clean.Hm0.values)).T

## Normalize the data for the KMeans comparison
Hm0_max = data_clean.Hm0.max()
Te_max = data_clean.Te.max()
# Use data_clean so the normalized rows line up with the scatter data below
data_norm = pd.concat(
    [data_clean.Hm0 / Hm0_max, data_clean.Te / Te_max], axis=1
).dropna().sort_index()
from sklearn.cluster import KMeans
import string

results_km={}


fig, axs = plt.subplots(len(Ns),2, figsize=(16, 24), sharex=True)

results={}
for N in Ns:
    gmm = GaussianMixture(n_components=N).fit(X)

    # Save centers and weights
    result = pd.DataFrame(gmm.means_, columns=['Te','Hm0'])
    result['weights'] = gmm.weights_

    result['Tp'] = result.Te / 0.858
    
    labels = gmm.predict(X)
    # Sort clusters by weight (descending) and map each sample's label to the
    # row index of the sorted result so the scatter colors match the clusters
    result.sort_values('weights', inplace=True, ascending=False)
    result['labels'] = list(string.ascii_uppercase[0:N])
    idx = result.index
    idx = [int(np.where(idx == i)[0]) for i in np.arange(N)]
    idx = [idx[i] for i in labels]
    result.reset_index(drop=True, inplace=True)
    results[N] = result

    i = Ns.index(N)
    gm_sc = axs[i,0].scatter(data_clean.Te.values, data_clean.Hm0.values, c=result['weights'][idx], s=0.5, cmap='viridis')
    axs[i,0].plot(result.Te, result.Hm0, 'm+')
    axs[i,0].title.set_text(f'{N} GM Clusters')
    cbar = fig.colorbar(gm_sc, ticks=result["weights"], label='Weight [ ]')
    cbar.ax.set_yticklabels([result['labels'][i]+': '+str(round(result["weights"][i],3)) for i in range(N)])
    for x, y, lbl in zip(result['Te'], result.Hm0, result['labels'] ):
        axs[i,0].text(x+0.1, y+0.1, lbl)
    plt.setp(axs[i,0], ylabel='Sig. wave height, $H_{m0}$ [m]')

    ## kmeans comparison
    km = KMeans(n_clusters=N, random_state=1).fit(data_norm[["Hm0", 'Te']])
    # Un-normalize clustered data
    km.cluster_centers_[:, 0] = km.cluster_centers_[:, 0]*Hm0_max
    km.cluster_centers_[:, 1] = km.cluster_centers_[:, 1]*Te_max
    result_km = pd.DataFrame(km.cluster_centers_, columns=["Hm0", 'Te'])
    result_km['weights'] = [(km.labels_ == i).sum() / len(km.labels_) for i in range(N)]
    result_km['Tp'] = result_km.Te / 0.858

    # Sort clusters by weight (descending), as for the GMM results
    result_km.sort_values('weights', inplace=True, ascending=False)
    result_km['labels'] = list(string.ascii_uppercase[0:N])
    idx_km = result_km.index
    idx_km = [int(np.where(idx_km == i)[0]) for i in np.arange(N)]
    idx_km = [idx_km[i] for i in km.labels_]
    result_km.reset_index(drop=True, inplace=True)
    results_km[N] = result_km

    km_sc = axs[i,1].scatter(data_clean.Te, data_clean.Hm0, c=result_km['weights'][idx_km], s=0.5, cmap='viridis', rasterized=True, vmin = 0)
    axs[i,1].scatter(result_km['Te'], result_km['Hm0'], s=40, marker="x", color="w")
    axs[i,1].title.set_text(f'{N} KMeans Clusters')
    cbar = fig.colorbar(km_sc, ticks=result_km["weights"], label='Weight [ ]')
    cbar.ax.set_yticklabels([result_km['labels'][i]+': '+str(round(result_km["weights"][i],3)) for i in range(N)])
    for x, y, lbl in zip(result_km['Te'], result_km.Hm0, result_km['labels'] ):
        axs[i,1].text(x+0.1, y+0.1, lbl)

plt.setp(axs[len(Ns)-1,0], xlabel='Energy period, $T_e$ [s]')
plt.setp(axs[len(Ns)-1,1], xlabel='Energy period, $T_e$ [s]')

# Frequency vector for the JONSWAP spectra
w = ndbc_data[year].columns.values
f = w / (2 * np.pi)   # convert angular frequency to Hz (assumes w is in rad/s)


# Compute the energy flux J of each representative sea state from a JONSWAP spectrum
for N in results:
    result = results[N]
    J = []
    for i in range(len(result)):
        b = resource.jonswap_spectrum(f, result.Tp[i], result.Hm0[i])
        J.extend([resource.energy_flux(b, h=399.).values[0][0]])
    
    result['J']  = J
    results[N] = result

# Same for the KMeans clusters
for N in results_km:
    result_km = results_km[N]
    J=[]
    for i in range(len(result_km)):
        b = resource.jonswap_spectrum(f, result_km.Tp[i], result_km.Hm0[i])
        J.extend([resource.energy_flux(b, h=399.).values[0][0]])
    
    result_km['J']  = J
    results_km[N] = result_km

# Ratio of the energy represented by the clusters to the total energy in the data
ratios = {}
for N in results:
    J_hr = results[N].J * len(data_clean)   # scale by the number of sea state records
    total_weighted_J = (J_hr * results[N].weights).sum()
    normalized_weighted_J = total_weighted_J / Total
    ratios[N] = np.round(normalized_weighted_J, 4)

ratios_km={}
for N in results_km:
    J_hr = results_km[N].J*len(data_clean)
    total_weighted_J= (J_hr * results_km[N].weights).sum()
    normalized_weighted_J = total_weighted_J / Total
    ratios_km[N] = np.round(normalized_weighted_J, 4)
    
(pd.Series(ratios), pd.Series(ratios_km))
```

dtgaebe commented Aug 9, 2023

@ssolson, @ryancoe, @cmichelenstrofer, @jtgrasb just tagging you because we have probably all talked about this at some point.


ryancoe commented Aug 9, 2023

In my mind, what you want to do is study how your quantities of interest (e.g., AEP) change when you vary the number of clusters and change the algorithm. @ssolson - Didn't you write a paper where you did this? I can't seem to find it.
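Roughly something like this sketch (hypothetical: `estimate_aep` stands in for whatever performance model you would use, and `results`/`results_km` are the cluster summaries from the code above):

```python
# Hypothetical sketch: recompute a quantity of interest (e.g., AEP) for each
# clustering algorithm and number of clusters, then compare.
# `estimate_aep` is a placeholder for the device performance model; it takes a
# DataFrame of representative sea states (Hm0, Tp, weights) and returns AEP.
import pandas as pd

aep = {}
for N, sea_states in results.items():       # GMM cluster summaries
    aep[("GMM", N)] = estimate_aep(sea_states)
for N, sea_states in results_km.items():    # KMeans cluster summaries
    aep[("KMeans", N)] = estimate_aep(sea_states)

print(pd.Series(aep).unstack(level=0))      # rows: number of clusters, columns: algorithm
```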


ssolson commented Aug 11, 2023

> In my mind, what you want to do is study how your quantities of interest (e.g., AEP) change when you vary the number of clusters and change the algorithm. @ssolson - Didn't you write a paper where you did this? I can't seem to find it.

@ryancoe I stopped the study I wrote up at the sea state characterization. I did forward you some preliminary results I found using WecOptTool, although I don't think they will be of much help.

The useful part of the paper I wrote is captured in https://github.com/MHKiT-Software/MHKiT-Python/blob/master/examples/PacWave_resource_characterization_example.ipynb


ssolson commented Aug 11, 2023

@dtgaebe that is a cool idea to use an idealized spectrum and then compare the clusters to the original sea state.

Based on what you show here, it looks like it makes sense to, at a minimum, extend the PacWave example to include your comparison using the ratio, and potentially to consolidate a couple of the steps into functions for MHKiT.

I'm interested to hear your thoughts on next steps for you and perhaps where MHKiT could help with the data processing.


ryancoe commented Aug 11, 2023

Note also that @cmichelenstrofer is working on something that intersects with this, where he's considering expanding the parameterization of sea states based on machine learning.


dtgaebe commented Aug 11, 2023

@ssolson I think that cool idea came from you, or whoever wrote the PacWave resource characterization example, because the ratio is already in there.
I just used the same methodology for the KMeans-clustered data.

Personally, I find it very useful to sort and visualize the clusters by weight, independent of the algorithm, so this might be a nice addition to MHKiT.
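Roughly what I have in mind for such a helper (just a sketch, not an existing MHKiT function):

```python
import string
import pandas as pd

def sort_clusters_by_weight(result: pd.DataFrame) -> pd.DataFrame:
    """Sketch of a possible helper: sort a cluster summary (Hm0, Te, weights, ...)
    by weight in descending order and label the clusters A, B, C, ..."""
    result = result.sort_values("weights", ascending=False).reset_index(drop=True)
    result["labels"] = list(string.ascii_uppercase[: len(result)])
    return result
```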

I would think that developers would use sea state clusters for gain scheduling. So, as Ryan suggested, it would be interesting to see how the different clusters impact performance and predicted performance, and how sensitive the results are to the number of clusters and the algorithm. We did a study with a sensitivity analysis of bulk spectral parameters here: OCEAN2021. We could potentially do something similar here.
