PCA Contours With Tp Data #249
Thanks, @Graham-EGI, for bringing this up. On first look at the issue, my guess is similar to yours: the binning of Te is causing the behavior you are seeing. @ryancoe, can you comment on whether this is a known behavior for the modified PCA approach we take here?
Referring back to Aubrey's paper on the PCA contours, I suspect it may have to do with the binning that's done to fit the second principal component/dimension (see Section 3.2), which looks like it is set by an optional argument. Try playing around with this. MHKiT-Python/mhkit/wave/contours.py, Line 100 in e591600
Eckert-Gallup, Aubrey C., et al. "Application of principal component analysis (PCA) and improved joint probability distributions to the inverse first-order reliability method (I-FORM) for predicting extreme sea states." Ocean Engineering 112 (2016): 307-319.
Wanted to take a quick look at whether this was something that could be tuned on my end when calling the function.

```python
from mhkit.wave import resource, contours
import matplotlib.pyplot as plt
from mhkit.wave.io import ndbc
import pandas as pd
import numpy as np

parameter = "swden"
buoy_number = "46054"
ndbc_available_data = ndbc.available_data(parameter, buoy_number)
years_of_interest = ndbc_available_data[ndbc_available_data.year < 2022]
filenames = years_of_interest["filename"]
ndbc_requested_data = ndbc.request_data(parameter, filenames)

ndbc_data = {}
# Create a Datetime Index and remove NOAA date columns for each year
for year in ndbc_requested_data:
    year_data = ndbc_requested_data[year]
    ndbc_data[year] = ndbc.to_datetime_index(parameter, year_data)

Hm0_list = []
Te_list = []
#! ADDITION
Tp_list = []
# Iterate over each year and save the result in the initialized lists
for year in ndbc_data:
    year_data = ndbc_data[year]
    Hm0_list.append(resource.significant_wave_height(year_data.T))
    Te_list.append(resource.energy_period(year_data.T))
    Tp_list.append(resource.peak_period(year_data.T))

# Concatenate list of Series into a single DataFrame
Te = pd.concat(Te_list, axis=0)
Hm0 = pd.concat(Hm0_list, axis=0)
Tp = pd.concat(Tp_list, axis=0)
Hm0_Te = pd.concat([Hm0, Te, Tp], axis=1)

# Drop any NaNs created from the calculation of Hm0 or Te
Hm0_Te.dropna(inplace=True)
# Sort the DateTime index
Hm0_Te.sort_index(inplace=True)

# Return period (years) of interest
period = 50

# Remove Tp outliers (30 removes the Hm0 outliers as well)
Hm0_Te_clean = Hm0_Te[Hm0_Te.Tp <= 30]

# Get only the values from the DataFrame
Hm0 = Hm0_Te_clean.Hm0.values
Te = Hm0_Te_clean.Te.values
Tp = Hm0_Te_clean.Tp.values

# Delta time of sea-states
dt = (Hm0_Te_clean.index[2] - Hm0_Te_clean.index[1]).seconds

# Get the contour values
# ! Tp IS THE ONE WITH ODD BEHAVIOR
bin_sizes = {i: None for i in range(0, 1100, 100)}
failed = []
for k in bin_sizes.keys():
    try:
        copula_tp = contours.environmental_contours(
            Hm0, Tp, dt, period, "PCA", return_PCA=True, PCA_bin_size=k
        )
    except Exception:
        failed.append(k)
        continue
    Hm0_contour_tp = copula_tp["PCA_x1"]
    Te_contour_tp = copula_tp["PCA_x2"]
    bin_sizes[k] = (Te_contour_tp, Hm0_contour_tp)
print(failed)

fig, ax = plt.subplots(ncols=2, nrows=1, figsize=(18, 6))
ax[0].scatter(Tp, Hm0, color="blue", alpha=0.5)
for k, v in bin_sizes.items():
    if v is not None:
        ax[0].plot(v[0], v[1], label=f"Bin Size = {k}")
ax[0].legend()
ax[0].grid(True)
ax[0].set_xlabel("Tp (s)")
ax[0].set_ylabel("Hm0 (m)")
h = ax[1].hist2d(Tp, Hm0, (100, 100), cmap=plt.cm.jet, cmin=1)
ax[1].set_xlabel("Tp (s)")
ax[1].set_ylabel("Hm0 (m)")
plt.colorbar(h[3])
plt.show()
```
Nice, Graham. The y-shape is improved for larger bin sizes. My guess is that the next issue may be that the PCA assumes equal bin sizes; perhaps you could try setting independent bin sizes there. That said, looking at it more, we didn't see any changes along the x-axis for varying bin sizes in your analysis. But you could potentially get better results below the linked line by adjusting the fit. Very interested in any findings you have.
Yeah... I'm honestly not sure the bin size is really the fundamental problem here. I think we need to look a bit more closely before we go down a rabbit hole on this. I know they're quite busy with other things these days, but perhaps @aubreyeckert or even @nevmart have some idea as to what's going on here?
Got it, thanks guys!
Hey @ryancoe and all. Just chiming in that we're hitting this bug as well. Over the last week we've been looking into using this tool in one of our projects. It would be awesome to get to the bottom of it.
@mattEhall @jtgrasb @mbruggs (in place of Graham) @ryancoe I have been poking at this issue today and yesterday. I have not identified exactly where the issue occurs in the method, but I wanted to check the waters on the level of interest in this issue. One of the first things I checked was the histogram of the data. As shown in the first two plots below, it is clear that Hm0 and Te are not perfectly Gaussian; each shows skewness (they do fail the Shapiro-Wilk test included in the code below). However, the contour method works with these datasets. Looking at the third figure, the Tp Q-Q plot does not seem particularly worse and shows less skewness, but the histogram has far more than a single peak. As this is the data causing the issue, my current thought is that the multiple peaks in this non-Gaussian distribution lead to the poor contours observed in the original issue. This matters because the PCA method assumes that the data, when transformed into the PCA space, can be modeled with a Gaussian distribution. This is evident in the use of stats.invgauss.ppf as the inverse cumulative distribution function (CDF) to calculate the component 1 values. If Tp does not follow a Gaussian distribution, the results will not reflect the actual distribution of the data. A linear relationship is then assumed between the two components to convert the second component into the PCA space. If the relationship between Tp and the first component is not linear, this step will not capture the true dependency, leading to incorrect contours. MHKiT-Python/mhkit/wave/contours.py, Line 330 in 2ccc286
As stated above, I have started following the changes through the method when using Te vs. Tp, but I have not identified exactly where things break down yet. I am happy to keep poking at this if there is still interest. Let me know if you have thoughts on this initial look at the data, or whether you think it is a good use of time to identify exactly where this binned Tp data causes issues vs. simply using Te.

```python
from mhkit.wave import resource, contours, graphics
import matplotlib.pyplot as plt
from mhkit.wave.io import ndbc
import pandas as pd
import numpy as np
import scipy.stats as stats
import seaborn as sns

parameter = 'swden'
buoy_number = '46054'
ndbc_available_data = ndbc.available_data(parameter, buoy_number)
years_of_interest = ndbc_available_data[ndbc_available_data.year < 2022]
filenames = years_of_interest['filename']
ndbc_requested_data = ndbc.request_data(parameter, filenames)

ndbc_data = {}
# Create a Datetime Index and remove NOAA date columns for each year
for year in ndbc_requested_data:
    year_data = ndbc_requested_data[year]
    ndbc_data[year] = ndbc.to_datetime_index(parameter, year_data)

Hm0_list = []
Te_list = []
#! ADDITION
Tp_list = []
# Iterate over each year and save the result in the initialized lists
for year in ndbc_data:
    year_data = ndbc_data[year]
    Hm0_list.append(resource.significant_wave_height(year_data.T))
    Te_list.append(resource.energy_period(year_data.T))
    Tp_list.append(resource.peak_period(year_data.T))

# Concatenate list of Series into a single DataFrame
Te = pd.concat(Te_list, axis=0)
Hm0 = pd.concat(Hm0_list, axis=0)
Tp = pd.concat(Tp_list, axis=0)
Hm0_Te = pd.concat([Hm0, Te, Tp], axis=1)

# Drop any NaNs created from the calculation of Hm0 or Te
Hm0_Te.dropna(inplace=True)
# Sort the DateTime index
Hm0_Te.sort_index(inplace=True)

# Return period (years) of interest
period = 50

# Remove Tp outliers (30 removes the Hm0 outliers as well)
Hm0_Te_clean = Hm0_Te[Hm0_Te.Tp <= 30]

# Get only the values from the DataFrame
Hm0 = Hm0_Te_clean.Hm0.values
Te = Hm0_Te_clean.Te.values
Tp = Hm0_Te_clean.Tp.values

# Delta time of sea-states
dt = (Hm0_Te_clean.index[2] - Hm0_Te_clean.index[1]).seconds

# Define function to create histogram and Q-Q plot for given data
def plot_histogram_qq(data, title):
    sns.set(style="whitegrid")
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
    # Histogram with kernel density estimate
    sns.histplot(data, kde=True, ax=ax1)
    ax1.set_title(f'Histogram of {title}')
    # Q-Q plot against a normal distribution
    stats.probplot(data, dist="norm", plot=ax2)
    ax2.set_title(f'Q-Q Plot of {title}')
    plt.tight_layout()

plot_histogram_qq(Hm0, 'Hm0')
plot_histogram_qq(Te, 'Te')
plot_histogram_qq(Tp, 'Tp')

# Perform a Shapiro-Wilk test on each variable to check for normality
def perform_shapiro_wilk_test(data, name):
    stat, p = stats.shapiro(data)
    alpha = 0.05
    if p > alpha:
        return f"{name} looks Gaussian (fail to reject H0) with p-value = {p}"
    else:
        return f"{name} does not look Gaussian (reject H0) with p-value = {p}"

sw_Hm0 = perform_shapiro_wilk_test(Hm0, 'Hm0')
sw_Te = perform_shapiro_wilk_test(Te, 'Te')
sw_Tp = perform_shapiro_wilk_test(Tp, 'Tp')
print(sw_Hm0)
print(sw_Te)
print(sw_Tp)
plt.show()
```
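Incidentally, the Gaussianity assumption described above applies to the data *after* the PCA rotation, not to the raw marginals, so it can be checked directly on the rotated components. Below is a minimal sketch on synthetic data (the lognormal Hm0, the Gaussian-with-0.5 s-banding Tp, and the sample size are assumptions for illustration, not the NDBC data or MHKiT's internal implementation):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 2000

# Synthetic sea-state-like variables: skewed Hm0 and a "banded" Tp
# snapped to a 0.5 s grid, mimicking the banding seen in the NDBC data
hm0 = rng.lognormal(mean=0.5, sigma=0.4, size=n)
tp = np.round(rng.normal(loc=10.0, scale=2.0, size=n) * 2) / 2

X = np.column_stack([hm0, tp])
Xc = X - X.mean(axis=0)

# PCA via eigendecomposition of the covariance matrix
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]  # largest-variance component first
eigvecs = eigvecs[:, order]

# Project into PCA space and run the same Shapiro-Wilk check per component
components = Xc @ eigvecs
for i in range(components.shape[1]):
    stat, p = stats.shapiro(components[:, i])
    verdict = "looks Gaussian" if p > 0.05 else "does not look Gaussian"
    print(f"component {i}: {verdict} (p = {p:.3g})")
```

If the banded Tp dominates a component, that component inherits the multi-peaked marginal, which is the failure mode hypothesized above.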
Hi @ssolson, thanks for your message and the informative observations. This isn't a topic I know a lot about or have had the chance to get into in more detail, so I don't follow everything (but I see what you mean about the skewness, and that the data might not fit the assumptions of the method well). For what it's worth, I am still very interested in using this functionality. We haven't found another way to create these contours.
Hi @ssolson, thanks for the update. I am also still interested in this functionality. It's not a pressing issue for the project I was using it for, as we were able to use the KDE method with a different dataset and get okay results, but it would be nice to have a workflow for such data in the future.
Hi everyone. |
Hey @fxcorreiadafonseca, thanks for keeping the conversation going and for letting us know there is interest in further development. New method development is beyond the scope of the core MHKiT team, but I will prioritize getting any well-researched methods included in MHKiT, and I hope this issue can serve as a place to discuss new functions or extensions to the current examples.
Hi @ssolson, thanks for the reply. I am considering somehow interpolating my Hs-Tp data to get a constant binning interval for Tp, which I am convinced is affecting the results. I am not completely sure whether this solves the problem, nor whether it is a "statistically sound" approach.
@fxcorreiadafonseca I was considering creating a Gaussian representation of the binned data as well. Please let us know if it works for you!
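One simple form of the "de-banding" idea discussed above is to spread each binned Tp value across its bin with random jitter, so the marginal is no longer a comb of spikes. A minimal sketch with synthetic data (the 0.5 s band spacing and the jitter choice are assumptions, not a validated preprocessing step for the contour method):

```python
import numpy as np

rng = np.random.default_rng(42)
bin_width = 0.5  # assumed Tp band spacing

# Synthetic banded Tp: values snapped to a 0.5 s grid
tp_banded = np.round(rng.normal(10.0, 2.0, size=1000) / bin_width) * bin_width

# Uniform jitter within each bin; Gaussian jitter (sigma ~ bin_width / 4)
# would be the "Gaussian representation" variant mentioned above
tp_jittered = tp_banded + rng.uniform(-bin_width / 2, bin_width / 2,
                                      size=tp_banded.size)

print("distinct banded values:  ", np.unique(tp_banded).size)
print("distinct jittered values:", np.unique(tp_jittered).size)
```

Whether this is statistically sound for extreme-value contours is exactly the open question above; the jitter changes the marginal distribution only within each bin, but it does inject artificial smoothness.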
Describe the bug:
PCA environmental contours return odd results when Tp is used as the period value (rather than Te).
To Reproduce:
This is just a slight modification to the code in the example notebook examples.environmental_contours_example.ipynb to calculate Tp from the spectral data returned from the NDBC swden parameter. Not all buoys have this same behavior; the example is of one that does.
Expected behavior:
PCA contours with similar shapes whether Tp or Te is used as the period value.
Screenshots:
Bad contour with Tp as period value:
Good contour with Te as period value:
Additional context:
There are quite a few of these NDBC swden datasets that produce this contour shape when using Tp calculated from MHKiT. It appears to be related to the banding and the density of the data.
Created with (where the Tp and Hm0 variables are from the example above):