-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implementation of new data interface to rater() #75
Conversation
I like |
All small improvments the rater_fit classes: * draws -> samples * more use of @nord * simpler calling of the new_* constructors
4186f15
to
83ef75d
Compare
Also update documentation
Also, remove inaccurate documentation
The vignette will be completely overhauled/replaced at some point so for now we only do the bare minimum.
c471148
to
99ff1f8
Compare
So currently we have: devtools::load_all("/Users/jeffreypullin/Documents/R/rater")
#> Loading rater
#> * The rater package uses `Stan` to fit bayesian models.
#> * If you are working on a local, multicore CPU with excess RAM please call:
#> * options(mc.cores = parallel::detectCores())
#> * This will allow Stan to run inference on multiple cores in parallel.
optim_fit <- rater(anesthesia, dawid_skene(), method = "optim")
grouped_fit <- rater(caries, dawid_skene(), method = "optim",
data_format = "grouped")
wrong_grouped_fit <- rater(caries, dawid_skene(), method = "optim")
#> Error in validate_data(data, data_format): `data` must have exactly three columns. Created on 2020-05-28 by the reprex package (v0.3.0) This implementation adds an argument to Strictly, we don't need the argument, we could use heuristics to decide whether the data is probably long or grouped. I.e. if it has three columns assume that the is long, if more that is grouped etc. We don't implement this currently because
Perhaps it would be better to simplify the argument even more by just removing the argument entirely and using heuristics. What are your thoughts @dvukcevic In addition, note that we only currently support grouped data for the base Dawid-Skene model. |
Great work with this! I agree that it would be best to require users to be explicit, at least to start with. If we find that being explicit is cumbersome (and it doesn't seem likely in this case), or many users request it, then we can look to implementing some heuristics. Is there any limitation to grouped data that would stop us from implementing it for the other models? Is it just a matter of getting around to it, or is there something special about the Dawid-Skene model in particular that allows it? |
There is nothing mathematical stopping us from using the grouped data in the other models, though I am somewhat hesitant to do so. The reason is that to include a grouped data version we would need to include a whole new Stan file for each model, this would necessitate a lot more compilation when rater is built, which I am somewhat wary of. |
Okay, good to know. Let's just put those on the roadmap, to do sometime in the future. |
Oh, I see you've already done this with #78! |
The only other reaming decision here (I think) is whether we allow non-numeric data. Currently we don't, but we discussed allowing it: Pros: Easier for the user, no need to convert i.e. rater names to numbers Cons: The large problem with this is how we would display the correspondence between the passed data and the output parameters. I.e. raters are named which element of theta corresponds to which name. We've already discussed labeling the output e.g. #66 (comment) and don't have really any good way of doing it. I don't think we should allow non-numeric data if our output essentially ignores it. |
I am definitely in favour of using non-numeric data, although we don't need to implement this with too much urgency. I often find that using non-numeric labels helps me to keep clarity on variables that are categorical (at least, if they are nominal, but even if they are ordinal). It would be convenient, therefore, to allow users to use non-numeric data. Using the For example, we could:
What do you think? |
I agree that we should support non-numeric data, I just currently don't how to implement your third bullet. That's probably just a lack of imagination on my part though. Let's discuss separately - I've opened an issue (#81) so I can merge this PR. |
We will eventually support non-numeric data (#81) but for now we error.
Hooray!!!!!!! |
Also, need to investigate performance change with skin data using this branch.... |
We update the configure scripts using 2.1.0 and add RcppParrallel to Imports and LinkingTo Try importing/linking to RcppParallel The addition of RcppParallel seems to be the main change in StanHeaders and the CI error we are getting is related to missing tbb (a parallelism library) files. Debugging... WIP WIP WIP Force latest rstantools Revert change to description and try now
685fd17
to
565768a
Compare
I double checked this and it seems fine - not sure what the problem I saw earlier was... |
This doesn't deal with the actual column types of the passed data.
Currently the user can specify what format they are passing through the
data_format
flagTo consider:
What names should we expect for the passed long/grouped data? (Currently we want `c("item", "rater", "rating") for the long data)