Implementation of new data interface to rater() #75

jeffreypullin · 2020-05-27T09:33:23Z

This doesn't deal with the actual column types of the passed data.

Currently the user can specify which format they are passing through the `data_format` argument.

To consider:

What names should we expect for the passed long/grouped data? (Currently we want `c("item", "rater", "rating")` for the long data.)

dvukcevic · 2020-05-27T12:19:57Z

I like c("item", "rater", "rating"). That's fine as a default. If there's an easy way for users to specify other names, that would be handy, but it's fine for that to be a feature for later on the roadmap.
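A minimal sketch of how user-specified names could be mapped onto the defaults later on. The helper `standardise_names()` and its `cols` argument are hypothetical and not part of rater; only the default names `c("item", "rater", "rating")` come from this thread.

```r
# Hypothetical helper (not part of rater): map user-specified column names
# onto the default names expected for long data.
standardise_names <- function(data,
                              cols = c(item = "item",
                                       rater = "rater",
                                       rating = "rating")) {
  expected <- c("item", "rater", "rating")
  # `cols` must name exactly the three expected roles.
  stopifnot(setequal(names(cols), expected))
  missing_cols <- setdiff(cols, names(data))
  if (length(missing_cols) > 0) {
    stop("`data` is missing columns: ",
         paste(missing_cols, collapse = ", "), call. = FALSE)
  }
  # Select the columns in the canonical order, then rename them.
  out <- data[, unname(cols[expected]), drop = FALSE]
  names(out) <- expected
  out
}

ratings <- data.frame(
  patient = c(1, 1, 2),
  doctor = c(1, 2, 1),
  score = c(3, 3, 1)
)
long <- standardise_names(
  ratings,
  cols = c(item = "patient", rater = "doctor", rating = "score")
)
names(long)
```

Keeping the renaming in one small function would let `rater()` itself keep assuming the default names.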


jeffreypullin · 2020-05-28T11:02:56Z

So currently we have:

devtools::load_all("/Users/jeffreypullin/Documents/R/rater")
#> Loading rater
#> * The rater package uses `Stan` to fit bayesian models.
#> * If you are working on a local, multicore CPU with excess RAM please call:
#> * options(mc.cores = parallel::detectCores())
#> * This will allow Stan to run inference on multiple cores in parallel.

optim_fit <- rater(anesthesia, dawid_skene(), method = "optim")

grouped_fit <- rater(caries, dawid_skene(), method = "optim", 
                     data_format = "grouped")

wrong_grouped_fit <- rater(caries, dawid_skene(), method = "optim")
#> Error in validate_data(data, data_format): `data` must have exactly three columns.

Created on 2020-05-28 by the reprex package (v0.3.0)

This implementation adds an argument to rater(), `data_format`, which controls which format of data rater() expects. It (currently) has two possible values: "long" and "grouped".
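To make the explicit approach concrete, here is a rough sketch of format validation. The function name `validate_data_sketch()` is hypothetical and the real `validate_data()` in rater may look quite different; only the two format values and the three-column error message are taken from this thread.

```r
# Hypothetical sketch of explicit format validation; the real
# validate_data() in rater may differ.
validate_data_sketch <- function(data, data_format = c("long", "grouped")) {
  # match.arg() errors on any value other than "long" or "grouped" and
  # defaults to "long" when the argument is not supplied.
  data_format <- match.arg(data_format)
  if (!is.data.frame(data)) {
    stop("`data` must be a data.frame.", call. = FALSE)
  }
  if (data_format == "long" && ncol(data) != 3) {
    stop("`data` must have exactly three columns.", call. = FALSE)
  }
  invisible(data_format)
}

long_data <- data.frame(item = c(1, 1), rater = c(1, 2), rating = c(2, 2))
wide_data <- data.frame(rater_1 = 1, rater_2 = 2, rater_3 = 1, n = 10)

validate_data_sketch(long_data, "long")
validate_data_sketch(wide_data, "grouped")

# Passing grouped-shaped data without saying so is an error.
result <- tryCatch(
  validate_data_sketch(wide_data),
  error = function(e) conditionMessage(e)
)
result
```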

Strictly, we don't need the argument: we could use heuristics to decide whether the data is probably long or grouped, i.e. if it has three columns assume that it is long, if it has more assume that it is grouped, etc. We don't implement this currently because:

* I'm not really a fan of functions that take different actions depending on what they deduce from the input. I generally think it's a good idea to force the user to be explicit.
* It simplifies the code I had to write 😄

Perhaps it would be better to simplify the interface even more by removing the argument entirely and using heuristics. What are your thoughts, @dvukcevic?

In addition, note that we only currently support grouped data for the base Dawid-Skene model.
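For comparison, the column-count heuristic discussed above could be sketched as follows. This is deliberately not what the PR implements; `guess_data_format()` is a hypothetical name, and the grouped layout used in the example (one column per rater plus a count column `n`) is an assumption.

```r
# Hypothetical heuristic (NOT what this PR implements): guess the data
# format from the number of columns alone.
guess_data_format <- function(data) {
  if (ncol(data) == 3) {
    "long"
  } else if (ncol(data) > 3) {
    "grouped"
  } else {
    stop("Cannot guess the format of `data`.", call. = FALSE)
  }
}

# Assumed grouped layout: one column per rater plus a count column `n`.
# With only two raters the data has three columns, so the heuristic
# silently treats it as long data.
two_rater_grouped <- data.frame(rater_1 = c(1, 2), rater_2 = c(1, 1), n = c(5, 3))
guess_data_format(two_rater_grouped)
```

Under that assumed layout the guess is wrong for two raters, which is one argument for keeping the argument explicit.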

dvukcevic · 2020-05-28T13:24:53Z

Great work with this!

I agree that it would be best to require users to be explicit, at least to start with. If we find that being explicit is cumbersome (and it doesn't seem likely in this case), or many users request it, then we can look into implementing some heuristics.

Is there any limitation to grouped data that would stop us from implementing it for the other models? Is it just a matter of getting around to it, or is there something special about the Dawid-Skene model in particular that allows it?

jeffreypullin · 2020-05-28T23:53:22Z

There is nothing mathematical stopping us from using the grouped data in the other models, though I am somewhat hesitant to do so. The reason is that to include a grouped data version we would need to include a whole new Stan file for each model. This would necessitate a lot more compilation when rater is built, which I am somewhat wary of.
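As an illustration only, grouped data could also be expanded to long format in R, which would avoid extra Stan files but presumably give up whatever efficiency the grouped representation offers. The sketch assumes a grouped layout of one column per rater plus a count column `n`; `grouped_to_long()` is a hypothetical helper, not something in rater.

```r
# Assumed grouped layout: one column per rater holding a rating pattern,
# plus a count column giving how many items share that pattern.
grouped_to_long <- function(grouped, count_col = "n") {
  rater_cols <- setdiff(names(grouped), count_col)
  # Repeat each pattern row `n` times so that every row is one item.
  rows <- rep(seq_len(nrow(grouped)), times = grouped[[count_col]])
  patterns <- grouped[rows, rater_cols, drop = FALSE]
  n_items <- nrow(patterns)
  n_raters <- length(rater_cols)
  # unlist() stacks the rater columns one after another, which matches
  # the `rater` index built with `each = n_items`.
  data.frame(
    item = rep(seq_len(n_items), times = n_raters),
    rater = rep(seq_len(n_raters), each = n_items),
    rating = unlist(patterns, use.names = FALSE)
  )
}

grouped <- data.frame(rater_1 = c(1, 2), rater_2 = c(1, 1), n = c(2, 1))
long <- grouped_to_long(grouped)
long
```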

dvukcevic · 2020-05-29T01:33:34Z

Okay, good to know. Let's just put those on the roadmap, to do sometime in the future.

dvukcevic · 2020-05-29T01:35:10Z

Oh, I see you've already done this with #78!

jeffreypullin · 2020-05-29T01:49:02Z

The only other remaining decision here (I think) is whether we allow non-numeric data. Currently we don't, but we discussed allowing it:

Pros:

Easier for the user: no need to convert e.g. rater names to numbers.

Cons:

The big problem with this is how we would display the correspondence between the passed data and the output parameters, i.e. if raters are named, which element of theta corresponds to which name? We've already discussed labelling the output (e.g. #66 (comment)) and don't really have any good way of doing it. I don't think we should allow non-numeric data if our output essentially ignores it.
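Since non-numeric data is currently not allowed, a check along the following lines could reject it up front. This is only a sketch: `check_numeric_columns()` and its error message are hypothetical, not the code in this PR.

```r
# Hypothetical check; the helper name and the message are illustrative.
check_numeric_columns <- function(data) {
  # TRUE for each column that is numeric (integer or double).
  is_num <- vapply(data, is.numeric, logical(1))
  if (!all(is_num)) {
    stop("All columns in `data` must be numeric; non-numeric columns: ",
         paste(names(data)[!is_num], collapse = ", "), call. = FALSE)
  }
  invisible(data)
}

ok <- data.frame(item = c(1, 2), rater = c(1, 1), rating = c(2, 3))
bad <- data.frame(item = c(1, 2), rater = c("a", "b"), rating = c(2, 3))

check_numeric_columns(ok)
msg <- tryCatch(check_numeric_columns(bad), error = function(e) conditionMessage(e))
msg
```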

dvukcevic · 2020-05-29T05:12:13Z

I am definitely in favour of using non-numeric data, although we don't need to implement this with too much urgency.

I often find that using non-numeric labels helps me to keep clarity on variables that are categorical (at least, if they are nominal, but even if they are ordinal). It would be convenient, therefore, to allow users to use non-numeric data.

Using the factor class seems ideal here. It has an underlying numerical coding, but character labels.

For example, we could:

* Convert all non-numeric variables into a factor.
* Retain a copy of this (or at least the mapping between labels and numeric codings) in the returned model fit (I think this already happens, since you in fact keep the whole input data?).
* Ensure that any output uses these labels in the right way, e.g. through naming of rows/columns, or by changing the output to also be in the same factor class where appropriate.

What do you think?
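The factor idea above can be sketched in base R. The rater names and the `theta_hat` values are made up purely for illustration (they are not model output), and how rater would actually store or use the mapping is left open.

```r
# Made-up rater labels, as a user might supply them.
rater_names <- c("alice", "bob", "alice", "carol")

# Step 1: convert to a factor; the integer codes are what a model would see.
rater_factor <- factor(rater_names)
rater_codes <- as.integer(rater_factor)

# Step 2: keep the mapping between codes and labels.
rater_labels <- levels(rater_factor)

# Step 3: use the labels to name per-rater output. `theta_hat` is a
# stand-in vector with one value per rater, not real model output.
theta_hat <- c(0.9, 0.7, 0.8)
names(theta_hat) <- rater_labels
theta_hat[["bob"]]

# The original labels can be recovered from the codes.
rater_labels[rater_codes]
```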

jeffreypullin · 2020-06-02T06:30:45Z

I agree that we should support non-numeric data, I just currently don't know how to implement your third bullet. That's probably just a lack of imagination on my part though. Let's discuss separately - I've opened an issue (#81) so I can merge this PR.


jeffreypullin · 2020-06-18T00:10:26Z

Hooray!!!!!!!

jeffreypullin · 2020-06-18T00:10:55Z

Also, need to investigate performance change with skin data using this branch....

We update the configure scripts using 2.1.0 and add RcppParallel to Imports and LinkingTo.

* Try importing/linking to RcppParallel. The addition of RcppParallel seems to be the main change in StanHeaders, and the CI error we are getting is related to missing tbb (a parallelism library) files.
* Debugging...
* WIP
* WIP
* WIP
* Force latest rstantools
* Revert change to description and try now

jeffreypullin · 2020-06-18T05:07:38Z

> Also, need to investigate performance change with skin data using this branch....

I double checked this and it seems fine - not sure what the problem I saw earlier was...

jeffreypullin added 2 commits May 27, 2020 19:29

Initial implementation of new interface

4227081

Remove unneeded files

5c3320e

jeffreypullin changed the title ~~Initial implementation of new data interface to rater~~ Implementation of new data interface to rater() May 27, 2020

jeffreypullin added 2 commits May 28, 2020 15:30

Record data format in the rater_fit objects

02c072e

All small improvements to the rater_fit classes:

* draws -> samples
* more use of @nord
* simpler calling of the new_* constructors

Document

83ef75d

jeffreypullin force-pushed the new-interface branch from 4186f15 to 83ef75d May 28, 2020 05:30

jeffreypullin added 8 commits May 28, 2020 15:36

Update anesthesia data to new naming convention

48c4c14

Also update documentation

Update the caries data to the new format

6075aa7

Also, remove inaccurate documentation

Convert setup for testing to new interface

1a6fa1b

Remove data type testing

027c38c

Update extraction of draws -> samples for fits

d37a722

Convert internal data to be data.frames

f90dcc5

Fixes and test updates

3cefdc9

Get vignette building with new interface

99ff1f8

The vignette will be completely overhauled/replaced at some point so for now we only do the bare minimum.

jeffreypullin force-pushed the new-interface branch 2 times, most recently from c471148 to 99ff1f8 May 28, 2020 07:46

jeffreypullin added 3 commits May 28, 2020 17:48

Fix roxygen typo and redocument

0d4273e

Fix point estimate example

c7895c7

Add error tests for rater()

ed455d6

jeffreypullin mentioned this pull request Jun 2, 2020

Allow non-numeric data as input #81

Open

Update rater() doc

507ebb0

jeffreypullin added 2 commits June 2, 2020 16:38

Fix error message

72500b7

Add test for numeric data

4d0ea45

We will eventually support non-numeric data (#81) but for now we error.

jeffreypullin force-pushed the new-interface branch from 685fd17 to 565768a June 18, 2020 01:15

jeffreypullin merged commit 99e5321 into master Jun 18, 2020

jeffreypullin deleted the new-interface branch January 20, 2021 00:54
