Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementation of new data interface to rater() #75

Merged
merged 19 commits into from
Jun 18, 2020
Merged

Conversation

jeffreypullin
Copy link
Owner

This doesn't deal with the actual column types of the passed data.

Currently the user can specify what format they are passing through the data_format flag

To consider:

What names should we expect for the passed long/grouped data? (Currently we want `c("item", "rater", "rating") for the long data)

@jeffreypullin jeffreypullin changed the title Initial implementation of new data interface to rater Implementation of new data interface to rater() May 27, 2020
@dvukcevic
Copy link
Collaborator

I like c("item", "rater", "rating"). That's fine as a default. If there's an easy way for users to specify other names, that would be handy, but it's fine for that to be a feature for later on the roadmap.

All small improvments the rater_fit classes:

* draws -> samples
* more use of @nord
* simpler calling of the new_* constructors
@jeffreypullin jeffreypullin force-pushed the new-interface branch 2 times, most recently from c471148 to 99ff1f8 Compare May 28, 2020 07:46
@jeffreypullin
Copy link
Owner Author

So currently we have:

devtools::load_all("/Users/jeffreypullin/Documents/R/rater")
#> Loading rater
#> * The rater package uses `Stan` to fit bayesian models.
#> * If you are working on a local, multicore CPU with excess RAM please call:
#> * options(mc.cores = parallel::detectCores())
#> * This will allow Stan to run inference on multiple cores in parallel.

optim_fit <- rater(anesthesia, dawid_skene(), method = "optim")

grouped_fit <- rater(caries, dawid_skene(), method = "optim", 
                     data_format = "grouped")

wrong_grouped_fit <- rater(caries, dawid_skene(), method = "optim")
#> Error in validate_data(data, data_format): `data` must have exactly three columns.

Created on 2020-05-28 by the reprex package (v0.3.0)

This implementation adds an argument to rater(): data_format which controls which format of data rater() expects. It (currently) has two possible values: "long" and "grouped".

Strictly, we don't need the argument, we could use heuristics to decide whether the data is probably long or grouped. I.e. if it has three columns assume that the is long, if more that is grouped etc. We don't implement this currently because

  • I'm not really a fan of functions that make different actions depending on what they deduce from the input - I generally think it's a good idea to force the user to be explicit

  • It simplifies the code I had to write 😄

Perhaps it would be better to simplify the argument even more by just removing the argument entirely and using heuristics. What are your thoughts @dvukcevic

In addition, note that we only currently support grouped data for the base Dawid-Skene model.

@dvukcevic
Copy link
Collaborator

Great work with this!

I agree that it would be best to require users to be explicit, at least to start with. If we find that being explicit is cumbersome (and it doesn't seem likely in this case), or many users request it, then we can look to implementing some heuristics.

Is there any limitation to grouped data that would stop us from implementing it for the other models? Is it just a matter of getting around to it, or is there something special about the Dawid-Skene model in particular that allows it?

@jeffreypullin
Copy link
Owner Author

There is nothing mathematical stopping us from using the grouped data in the other models, though I am somewhat hesitant to do so. The reason is that to include a grouped data version we would need to include a whole new Stan file for each model, this would necessitate a lot more compilation when rater is built, which I am somewhat wary of.

@dvukcevic
Copy link
Collaborator

Okay, good to know. Let's just put those on the roadmap, to do sometime in the future.

@dvukcevic
Copy link
Collaborator

Oh, I see you've already done this with #78!

@jeffreypullin
Copy link
Owner Author

The only other reaming decision here (I think) is whether we allow non-numeric data. Currently we don't, but we discussed allowing it:

Pros:

Easier for the user, no need to convert i.e. rater names to numbers

Cons:

The large problem with this is how we would display the correspondence between the passed data and the output parameters. I.e. raters are named which element of theta corresponds to which name. We've already discussed labeling the output e.g. #66 (comment) and don't have really any good way of doing it. I don't think we should allow non-numeric data if our output essentially ignores it.

@dvukcevic
Copy link
Collaborator

I am definitely in favour of using non-numeric data, although we don't need to implement this with too much urgency.

I often find that using non-numeric labels helps me to keep clarity on variables that are categorical (at least, if they are nominal, but even if they are ordinal). It would be convenient, therefore, to allow users to use non-numeric data.

Using the factor class seems ideal here. It has an underlying numerical coding, but character labels.

For example, we could:

  • Convert all non-numeric variables into a factor.

  • Retain a copy of this (or, at least the mapping between labels and numeric codings) in the returned model fit (I think this already happens, since you in fact you keep the whole input data?).

  • Ensure that any output uses these labels in the right way, e.g. through naming of rows/columns, or by changing the output to also be in the same factor class where appropriate.

What do you think?

@jeffreypullin
Copy link
Owner Author

I agree that we should support non-numeric data, I just currently don't how to implement your third bullet. That's probably just a lack of imagination on my part though. Let's discuss separately - I've opened an issue (#81) so I can merge this PR.

We will eventually support non-numeric data (#81) but for now we error.
@jeffreypullin
Copy link
Owner Author

Hooray!!!!!!!

@jeffreypullin
Copy link
Owner Author

Also, need to investigate performance change with skin data using this branch....

We update the configure scripts using 2.1.0 and add RcppParrallel to
Imports and LinkingTo

Try importing/linking to RcppParallel

The addition of RcppParallel seems to be the main change in StanHeaders
and the CI error we are getting is related to missing tbb (a parallelism
library) files.

Debugging...

WIP

WIP

WIP

Force latest rstantools

Revert change to description and try now
@jeffreypullin
Copy link
Owner Author

Also, need to investigate performance change with skin data using this branch....

I double checked this and it seems fine - not sure what the problem I saw earlier was...

@jeffreypullin jeffreypullin merged commit 99e5321 into master Jun 18, 2020
@jeffreypullin jeffreypullin deleted the new-interface branch January 20, 2021 00:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants