Seems like a lot of duplicate effort #12

Open
gottacatchenall opened this issue Sep 27, 2024 · 8 comments

Comments

@gottacatchenall

gottacatchenall commented Sep 27, 2024

There is a pretty robust codebase in https://github.com/PoisotLab/SpeciesDistributionToolkit.jl that already covers many of the stated goals of this package. What functionality are you interested in that isn't in SDT? We could always work to add that functionality to SDT rather than duplicate a bunch of effort...

@tiemvanderdeure
Owner

tiemvanderdeure commented Sep 27, 2024

It's a fair point, but I think it extends to more than just those two packages. SpeciesDistributionToolkit.jl implements a bunch of features that are already in Rasters.jl or RasterDataSources.jl. Not to mention GBIF.jl and GBIF2.jl.

> What functionality are you interested in that isn't in SDT?

Basically, to make a package that makes it easy to actually fit, evaluate, and project ensemble models, similar to e.g. biomod2 in R. The most obvious way to do this (it seems to me) is to depend on MLJ. I needed that functionality for my own research and figured I might as well make a package.

SDT and related packages offer some tools to load and manipulate spatial and occurrence data, but don't actually fit models (and explicitly say that fitting models is not the goal).

The reason I started it as a separate package is that I would much rather depend on packages like Rasters.jl and MLJBase.jl with a lot of uses and users, instead of packages like SimpleSDMLayers.jl.

That said, I can see I ended up adding some tools here (e.g. spatial thinning) that could also fit in SDT.

Does that make sense?

@gottacatchenall
Author

Sure, if it makes things easier for your research, you should use whatever gets the job done most efficiently; that makes sense.

That being said, over the next couple of years of my postdoc fellowship, a major goal is to add interfaces to MLJ (and other computer-vision-specific tools) in SDT, likely building on the API in the (very recently released) SDeMo.jl subpackage of SDT.

> That said, I can see I ended up adding some tools here (e.g. spatial thinning) that could also fit in SDT.

Happy to accept any contributions (and spatial thinning is a great one).

@tiemvanderdeure
Owner

I didn't know about SDeMo.jl, and I'm happy to hear that you will add interfaces to MLJ.

I have tried as much as possible to fit new functionality into existing packages, so this package can be as small as possible (and to avoid siloing and duplicate efforts).

The MLJ ecosystem already has a lot of really useful functionality. An example is a whole package that deals with confusion matrices and performance measures: https://github.com/JuliaAI/StatisticalMeasures.jl, which I can see SDeMo.jl re-implements from scratch.

I can see a use case for a very lightweight package with a few functions that really are specific to SDMs (and that both this package and SDT could depend on). But I can't think of very many functions that wouldn't fit into some other already existing package.

@tiemvanderdeure
Owner

I would love to offer things like BIOCLIM, which I can see you have implemented. But that would be much easier if it were registered as a standalone package (ideally with an interface to MLJ - but I understand if that's not a priority) so it would be much more lightweight.

I implemented Maxnet for instance, but as a completely separate package: https://github.com/tiemvanderdeure/maxnet.jl

@rafaqz
Collaborator

rafaqz commented Sep 27, 2024

I would just like to point out that @tiemvanderdeure is a serial contributor to Rasters.jl, RasterDataSources.jl, the wider geospatial ecosystem, and the MLJ.jl ecosystem, specifically because this package is intentionally a small component of a wider functioning ecosystem. It's not really fair to say what this package should be without understanding that context.

See:
https://github.com/search?q=org%3AJuliaAI++tiemvanderdeure&type=pullrequests
https://github.com/EcoJulia/RasterDataSources.jl/pulls?q=is%3Apr+author%3Atiemvanderdeure+
https://github.com/rafaqz/Rasters.jl/pulls?q=is%3Apr+author%3Atiemvanderdeure+

Maxnet.jl was specifically written for this package to build on, but you can use it too. To get faster point extraction here, Tiem and I fixed extract in Rasters.jl (kadyb/raster-benchmark#18); it's looking like the fastest algorithm in the world now.

To me it seems that in the long term the approach taken here is better for science in Julia than the approach of SpeciesDistributionToolkit.jl (and involves much less duplication of effort in total).

In Julia we are limited by a small community but our strength is the ability to reuse code across domains by building small, modular components that leverage other packages. See https://www.youtube.com/watch?v=kc9HwsxE1OY

@gottacatchenall
Author

gottacatchenall commented Oct 2, 2024

> I would love to offer things like BIOCLIM, which I can see you have implemented. But that would be much easier if it were registered as a standalone package (ideally with an interface to MLJ - but I understand if that's not a priority) so it would be much more lightweight.

I think we're speaking past each other a bit here. All of the subpackages within SDT.jl are registered separately (though I get why having a BIOCLIM.jl-specific package makes sense).

The reason for combining these packages into the meta-package SDT.jl (like Tidier.jl) is that SDT.jl is not designed to be used by seasoned Julia programmers; it is largely for teaching purposes. Most people who work with SDMs do not know (or perhaps have never even heard of!) Julia. SDT is meant as an 'all-in-one' introduction to the language for the median SDM practitioner who works exclusively in R. People who know Julia will know how to install the specific sub-packages they need for a given purpose.

> To me it seems that in the long term the approach taken here is better for science in Julia than the approach of SpeciesDistributionToolkit.jl (and involves much less duplication of effort in total).

I agree the modular approach is generally better for scientific software and the Julia ecosystem as a whole. Still, I've found that teaching people to use these tools requires meeting them halfway, which is why I think convenient (though inefficient) meta-packages like SDT.jl will bring more users to the language.

> I can see a use case for a very lightweight package with a few functions that really are specific to SDMs (and that both this package and SDT could depend on). But I can't think of very many functions that wouldn't fit into some other already existing package.

Agreed. It'd be nice to have a common API for operations on SDMLayer/Rasters, and for shared calls between this package and the prototype model-fitting API in SDeMo.

@rafaqz
Collaborator

rafaqz commented Oct 3, 2024

> SDT.jl is not designed to be used by seasoned Julia programmers

Rasters.jl and SpeciesDistributionModels.jl are also not designed to be used by seasoned programmers. They are pretty easy to use these days. To smooth the transition from other languages we are also part of cross-language initiatives, like this work-in-progress book: https://geocompx.org/jl

> meta-packages like SDT.jl will bring more users to the language

I'm not sure. The thing that almost always brings people to Julia is a paradigm shift in scale or performance.

To focus on the topic of the issue: what I look for to reduce duplication of effort is a focus on the key underlying tools that we share and that matter to the wider ecosystem. That means fixing bugs and performance in MLJ.jl and other stats packages, or helping us on the geospatial packages you rely on in SpeciesDistributionToolkit.jl, which we currently maintain with a few others in our spare time. Making high-level packages like SDM/SDT work together is harder and has lower returns.

@tiemvanderdeure
Owner

I know most people who use SDMs use R, and I really hope to be able to contribute to changing that.

I just don't think making a whole ecosystem of packages that is completely parallel to other Julia packages is the way to go, though.

It doesn't take me long to find code in SDT.jl to broadcast over rasters, mask them, and read in climate data, while Rasters.jl and RasterDataSources.jl already do that. I think that's a shame, both because it's a huge amount of duplicate effort to write and maintain, and because it's never going to compete with Rasters.jl in terms of speed (or indeed user-friendliness), because Rasters.jl has thousands of people using it and maybe dozens contributing.

So to go back to your question of what this package should do that SDT.jl doesn't do: I wanted to make a package that works with the existing ecosystem of spatial data and machine learning packages. I knew that SDT.jl existed, but didn't see it as an option because it exists within a parallel universe.
