Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add k nearest neighbors and cross validation bandwidth methods #97

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

inti-abbate
Copy link

Two automatic bandwidth selection methods were added:

  • k nearest neighbors: Calculates the bw for each data point as the distance to the kth neighbor
  • cross validated bandwidth: based on a (new) grid_search_cv function, wich computes the score over a grid of bandwidths. A score method was added to BaseKDE for that purpose. Since grid_search_cv requires a model, the cv bw method could not be implemented as an independent function. Instead it is a method of BaseKDE that calls grid_search_cv with model=self (thru and intermediate function cross_val, which chooses the best bw according to the grid search).

Since both method require parameters other than data and weights, a slight change was introduced to the API:
fit method now have the signature fit(self, data, weights=None, **kwargs), where **kwargs are the keyword arguments to be passed to the bw selection method. If no **kwargs is passed, the default parameters will be used, so the change is backwards compatible.

There is an issue in which I'd like to know your opinion. Both methods are quite computationally expensive (for large datasets), so they could benefit from CPU parallelization, which is quite easy to add with joblib library. On the other hand, this would go against the guideline "Import as few external dependencies as possible". Let me know which option you'd prefer.

Two automatic bandwidth selection methods were added:
 - k nearest neighbors
 - cross validated bandwidth
The second is based on a grid_search_cv method, wich computes the score
over a grid of bandwidth. A score method was added to BaseKDE, for that
purpose.
@tommyod
Copy link
Owner

tommyod commented Jun 2, 2021

Hi @inti-abbate . Thanks for the PR. Letting you know now that I've seen it, and I will look more closely over it once I find the time. Regarding the last question, I think adding joblib is the way to go - it introduces a dependency, but it's common in similar libraries and it helps speed up computations.

Tests are failing due to black. Please install black and run python -m black KDEpy -l 120 to auto-format the code.

Note to self: see this comment also

@inti-abbate
Copy link
Author

Thanks for the answer. Of course, take your time.
I'll (try to) fix the failing checks, and add parallelization with joblib.

@inti-abbate
Copy link
Author

Sorry about the previous commit, which I force-pushed it back. I just applied black, but flake8 still finds error E203. As I could find in google, this a known issue between black and flake8, and according to here, E203 should be ignored.

@tommyod
Copy link
Owner

tommyod commented Jun 2, 2021

Feel free to add E203 to flake8 ignore list.

@matfax
Copy link

matfax commented Jun 14, 2023

May I ask what happened to this PR? If it's just about the formatting, I could give it a try. After all, the black issue might be resolved at this point, after two years.

@inti-abbate
Copy link
Author

Hi @matfax. By the time of the last commit, the implemented BW selection methods were ready to be used. Regarding the formatting, I could not manage to avoid the flake8 errors, but yes, maybe it was a black issue that is already resolved.
I think it would be great if you can fix the formatting issue (and rebase all the new commits) so that the PR can be merged!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants