Sea states clustering GM vs KM #254

Open
dtgaebe opened this issue Aug 9, 2023 · 6 comments

dtgaebe commented Aug 9, 2023

I was working with the PacWave_resource_characterization_example, but tried using sklearn.cluster.KMeans instead of sklearn.mixture.GaussianMixture to cluster the data. Note that KMeans requires the data to be normalized prior to clustering.
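As a minimal sketch of that normalization step (using sklearn's MinMaxScaler here instead of the manual scaling in the full code further down; `data_clean` comes from the example notebook):

```python
# Minimal sketch: scale Hm0/Te to [0, 1] before KMeans, then map the cluster
# centers back to physical units.
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

X = data_clean[["Hm0", "Te"]].values        # `data_clean` from the example notebook
scaler = MinMaxScaler()
X_norm = scaler.fit_transform(X)

km = KMeans(n_clusters=8, random_state=1).fit(X_norm)
centers = scaler.inverse_transform(km.cluster_centers_)  # back to Hm0 [m], Te [s]
```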

I found that KMeans needs fewer clusters than GM to converge to the total amount of energy in the data. This is illustrated here as the ratio between the energy of the representative sea states and the total energy (as in the example):

| Nr. clusters | ratio GM | ratio KM |
|---|---|---|
| 4 | 0.9099 | 0.9331 |
| 6 | 0.9126 | 0.9679 |
| 8 | 0.9370 | 0.9859 |
| 12 | 0.9657 | 1.0006 |
| 16 | 0.9822 | 1.0078 |
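For reference, the ratio in the table is computed as in the code below:

$$\mathrm{ratio} = \frac{N_{\mathrm{records}} \sum_k w_k J_k}{J_{\mathrm{total}}}$$

where $w_k$ and $J_k$ are the weight and energy flux of cluster $k$, $N_{\mathrm{records}}$ is the number of sea state records (`len(data_clean)`), and $J_{\mathrm{total}}$ is the total energy in the data (the `Total` variable from the example).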

I also sorted the clusters by weight and visualized the weights with the color scale in the plots:
[Figure: Hm0 vs. Te scatter plots for each number of clusters, GM clusters (left column) and KMeans clusters (right column), with cluster centers marked and points colored by cluster weight]
The two clustering algorithms give different results, but I don't know whether we should be considering other criteria besides how well the clusters represent the total amount of energy.

The quick-and-dirty code extending the PacWave_resource_characterization_example to reproduce the figure and table is:

```python
# Compute Gaussian Mixture Model and KMeans clusters for each number of clusters
Ns = [4, 6, 8, 12, 16]
X = np.vstack((data_clean.Te.values, data_clean.Hm0.values)).T

## Normalize the data for the KMeans comparison
Hm0_max = data_clean.Hm0.max()
Te_max = data_clean.Te.max()
# Use data_clean so the normalized rows line up with the scatter data below
data_norm = pd.concat(
    [data_clean.Hm0 / Hm0_max, data_clean.Te / Te_max], axis=1
).dropna().sort_index()
from sklearn.cluster import KMeans
import string

results_km={}


fig, axs = plt.subplots(len(Ns),2, figsize=(16, 24), sharex=True)

results={}
for N in Ns:
    gmm = GaussianMixture(n_components=N).fit(X)

    # Save centers and weights
    result = pd.DataFrame(gmm.means_, columns=['Te','Hm0'])
    result['weights'] = gmm.weights_

    result['Tp'] = result.Te / 0.858
    
    labels = gmm.predict(X)
    # Sort clusters by weight (descending) and map each sample's label to the
    # row index of the sorted result so the scatter colors match the clusters
    result.sort_values('weights', inplace=True, ascending=False)
    result['labels'] = list(string.ascii_uppercase[0:N])
    idx = result.index
    idx = [int(np.where(idx == i)[0]) for i in np.arange(N)]
    idx = [idx[i] for i in labels]
    result.reset_index(drop=True, inplace=True)
    results[N] = result

    i = Ns.index(N)
    gm_sc = axs[i,0].scatter(data_clean.Te.values, data_clean.Hm0.values, c=result['weights'][idx], s=0.5, cmap='viridis')
    axs[i,0].plot(result.Te, result.Hm0, 'm+')
    axs[i,0].title.set_text(f'{N} GM Clusters')
    cbar = fig.colorbar(gm_sc, ticks=result["weights"], label='Weight [ ]')
    cbar.ax.set_yticklabels([result['labels'][i]+': '+str(round(result["weights"][i],3)) for i in range(N)])
    for x, y, lbl in zip(result['Te'], result.Hm0, result['labels'] ):
        axs[i,0].text(x+0.1, y+0.1, lbl)
    plt.setp(axs[i,0], ylabel='Sig. wave height, $H_{m0}$ [m]')

    ## kmeans comparison
    km = KMeans(n_clusters=N, random_state=1).fit(data_norm[["Hm0", 'Te']])
    # Un-normalize clustered data
    km.cluster_centers_[:, 0] = km.cluster_centers_[:, 0]*Hm0_max
    km.cluster_centers_[:, 1] = km.cluster_centers_[:, 1]*Te_max
    result_km = pd.DataFrame(km.cluster_centers_, columns=["Hm0", 'Te'])
    result_km['weights'] = [(km.labels_ == i).sum() / len(km.labels_) for i in range(N)]
    result_km['Tp'] = result_km.Te / 0.858

    # Sort clusters by weight (descending), as for the GMM results
    result_km.sort_values('weights', inplace=True, ascending=False)
    result_km['labels'] = list(string.ascii_uppercase[0:N])
    idx_km = result_km.index
    idx_km = [int(np.where(idx_km == i)[0]) for i in np.arange(N)]
    idx_km = [idx_km[i] for i in km.labels_]
    result_km.reset_index(drop=True, inplace=True)
    results_km[N] = result_km

    km_sc = axs[i,1].scatter(data_clean.Te, data_clean.Hm0, c=result_km['weights'][idx_km], s=0.5, cmap='viridis', rasterized=True, vmin = 0)
    axs[i,1].scatter(result_km['Te'], result_km['Hm0'], s=40, marker="x", color="w")
    axs[i,1].title.set_text(f'{N} KMeans Clusters')
    cbar = fig.colorbar(km_sc, ticks=result_km["weights"], label='Weight [ ]')
    cbar.ax.set_yticklabels([result_km['labels'][i]+': '+str(round(result_km["weights"][i],3)) for i in range(N)])
    for x, y, lbl in zip(result_km['Te'], result_km.Hm0, result_km['labels'] ):
        axs[i,1].text(x+0.1, y+0.1, lbl)

plt.setp(axs[len(Ns)-1,0], xlabel='Energy period, $T_e$ [s]')
plt.setp(axs[len(Ns)-1,1], xlabel='Energy period, $T_e$ [s]')

# Frequency vector for the JONSWAP spectra
w = ndbc_data[year].columns.values
f = w / (2 * np.pi)   # convert angular frequency to Hz (assumes w is in rad/s)


# Compute the energy flux J of each representative sea state from a JONSWAP spectrum
for N in results:
    result = results[N]
    J = []
    for i in range(len(result)):
        b = resource.jonswap_spectrum(f, result.Tp[i], result.Hm0[i])
        J.extend([resource.energy_flux(b, h=399.).values[0][0]])
    
    result['J']  = J
    results[N] = result

# Same for the KMeans clusters
for N in results_km:
    result_km = results_km[N]
    J=[]
    for i in range(len(result_km)):
        b = resource.jonswap_spectrum(f, result_km.Tp[i], result_km.Hm0[i])
        J.extend([resource.energy_flux(b, h=399.).values[0][0]])
    
    result_km['J']  = J
    results_km[N] = result_km

# Ratio of the energy represented by the clusters to the total energy in the data
ratios = {}
for N in results:
    J_hr = results[N].J * len(data_clean)   # scale by the number of sea state records
    total_weighted_J = (J_hr * results[N].weights).sum()
    normalized_weighted_J = total_weighted_J / Total
    ratios[N] = np.round(normalized_weighted_J, 4)

ratios_km={}
for N in results_km:
    J_hr = results_km[N].J*len(data_clean)
    total_weighted_J= (J_hr * results_km[N].weights).sum()
    normalized_weighted_J = total_weighted_J / Total
    ratios_km[N] = np.round(normalized_weighted_J, 4)
    
(pd.Series(ratios), pd.Series(ratios_km))
```

dtgaebe commented Aug 9, 2023

@ssolson, @ryancoe, @cmichelenstrofer, @jtgrasb just tagging you because we have probably all talked about this at some point.


ryancoe commented Aug 9, 2023

In my mind, what you want to do is study how your quantities of interest (e.g., AEP) change when you vary the number of clusters and change the algorithm. @ssolson - Didn't you write a paper where you did this? I can't seem to find it.
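Roughly something like this sketch (hypothetical: `estimate_aep` stands in for whatever performance model you would use, and `results`/`results_km` are the cluster summaries from the code above):

```python
# Hypothetical sketch: recompute a quantity of interest (e.g., AEP) for each
# clustering algorithm and number of clusters, then compare.
# `estimate_aep` is a placeholder for the device performance model; it takes a
# DataFrame of representative sea states (Hm0, Tp, weights) and returns AEP.
import pandas as pd

aep = {}
for N, sea_states in results.items():       # GMM cluster summaries
    aep[("GMM", N)] = estimate_aep(sea_states)
for N, sea_states in results_km.items():    # KMeans cluster summaries
    aep[("KMeans", N)] = estimate_aep(sea_states)

print(pd.Series(aep).unstack(level=0))      # rows: number of clusters, columns: algorithm
```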


ssolson commented Aug 11, 2023

> In my mind, what you want to do is study how your quantities of interest (e.g., AEP) change when you vary the number of clusters and change the algorithm. @ssolson - Didn't you write a paper where you did this? I can't seem to find it.

@ryancoe I stopped the study I wrote up at the sea state characterization. I did forward you some preliminary results I found using WecOptTool, although I don't think they will be of much help.

The useful part of the paper I wrote is captured in https://github.com/MHKiT-Software/MHKiT-Python/blob/master/examples/PacWave_resource_characterization_example.ipynb


ssolson commented Aug 11, 2023

@dtgaebe that is a cool idea to use an idealized spectrum and then compare the clusters to the original sea state.

Based on what you show here, it looks like it makes sense to, at a minimum, extend the PacWave example to include your comparison using the ratio, and potentially to consolidate a couple of the steps into functions for MHKiT.

I'm interested to hear your thoughts on next steps for you and perhaps where MHKiT could help with the data processing.


ryancoe commented Aug 11, 2023

Note also that @cmichelenstrofer is working on something that intersects with this, where he's considering expanding the parameterization of sea states based on machine learning.


dtgaebe commented Aug 11, 2023

@ssolson I think that cool idea came from you, or whoever wrote the PacWave resource characterization example, because the ratio is already in there.
I just used the same methodology for the KMeans-clustered data.

Personally, I find it very useful to sort and visualize the clusters by weight, independent of the algorithm, so this might be a nice addition to MHKiT.
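Roughly what I have in mind for such a helper (just a sketch, not an existing MHKiT function):

```python
import string
import pandas as pd

def sort_clusters_by_weight(result: pd.DataFrame) -> pd.DataFrame:
    """Sketch of a possible helper: sort a cluster summary (Hm0, Te, weights, ...)
    by weight in descending order and label the clusters A, B, C, ..."""
    result = result.sort_values("weights", ascending=False).reset_index(drop=True)
    result["labels"] = list(string.ascii_uppercase[: len(result)])
    return result
```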

I would think that developers would use sea state clusters for gain scheduling. So, as Ryan suggested, it would be interesting to see how the different clusters impact performance and predicted performance, and how sensitive the results are to the number of clusters and the algorithm. We did a study with a sensitivity analysis of bulk spectral parameters here: OCEAN2021. We could potentially do something similar here.
