
Suggestion: Remove AIC, BIC, and DIC #9

Open
ParadaCarleton opened this issue Jun 4, 2021 · 54 comments

@ParadaCarleton

AIC, BIC, and DIC have all been fully supplanted by WAIC, WBIC, and LOO-IC, and there's no longer any real reason to use them. I would suggest removing them from the package to avoid confusing beginners. Someone seeing them used in the package may be led to believe that using them is a good idea, that they provide information different from that provided by the newer information criteria, or that there must be some reason they're in the package. BIC is especially problematic: The name simultaneously gives new users the false impression that the BIC is actually Bayesian and that WAIC/LOO-IC are not Bayesian.

It might be a good idea to implement WBIC as a replacement for BIC, since BIC has a different use case from the other information criteria in this package (asymptotically maximizing the probability of selecting the correct model, rather than maximizing the expected predictive accuracy of the model). I would also suggest referring to it as the Watanabe-Schwarz Information Criterion to avoid giving the impression that WBIC is "especially" Bayesian.
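
For reference, a minimal sketch of how WBIC could be computed, as I understand Watanabe's definition: the expectation of the negative total log-likelihood, taken over a posterior tempered with inverse temperature β = 1/log(n). The matrix name and layout below are assumptions for illustration, not an existing API.

using Statistics

# Sketch only. Assumes `log_lik_tempered` is an S × n matrix of pointwise
# log-likelihoods evaluated at S draws from the posterior raised to the power
# β = 1 / log(n) (the "tempered" posterior). WBIC is then the average, over
# those draws, of the negative total log-likelihood of the data.
function wbic(log_lik_tempered::AbstractMatrix{<:Real})
    mean(-sum(log_lik_tempered, dims = 2))
end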

@goedman
Member

goedman commented Jun 5, 2021

@itsdfish

Thank you Carlos. I tend to agree with you that today WAIC and LOO-IC (and WBIC, which I don't know) are the information criteria to use. I'm including Chris in this response because he specifically asked to include these.

As an interim solution we could also include a clear warning, similar to your input above or Richard McElreath's positioning of these older ICs. From Chris' thumbs-up I assume he also agrees with your message.

I will take a look at WBIC although right now I have a bit too much on my plate.

EDIT: Given that this package is GPL3-licensed, it might still be better to develop a new, MIT-licensed package directly from the descriptions of the different ICs. From the StatisticalRethinking introductory/learning point of view MIT licensing is not very important, but for general use it is.

Rob

@itsdfish
Contributor

itsdfish commented Jun 5, 2021

Rob, I agree also.

I was in the process of reimplementing PSIS-LOO but ran into some difficulties. Do you know anyone who is familiar with the algorithm?

@goedman
Member

goedman commented Jun 5, 2021

What about Aki Vehtari? Are you using "Vehtari, A., A. Gelman, and J. Gabry. 2015. Efficient implementation of leave-one-out cross-validation and WAIC for evaluating fitted Bayesian models"?

@itsdfish
Contributor

itsdfish commented Jun 5, 2021

Yeah. I am using that article. My sticking point is the procedure for fitting the inverse Pareto.

@ParadaCarleton
Author

Funny coincidence -- I'm actually reimplementing it as well, since the R LOO package is looking at using Julia in its backend to speed up its calculations. I've got the code for fitting the inverse Pareto written up already, and I added some more comments that should make it easier to understand.

@ParadaCarleton
Author

ParadaCarleton commented Jun 5, 2021

Note that the generalized Pareto distribution is like a slightly wonky monomial, and you can easily fit a monomial between any pair of points. The GPD fitting procedure takes advantage of this in a way that works a bit like the Theil-Sen estimator. You start by taking the first quartile as point A, and then for each individual point B in your sample, you fit a GPD going through A and B. This gives you a set of pointwise estimates for the shape parameter k. When this is done, you take the average of all of these pointwise estimators, weighted by the likelihood that they would produce this data set. At the end, an (optional) adjustment to k-hat is used, where you average k-hat together with a weakly informative prior; this reduces variance at the cost of a little bit of bias.

using Statistics, LinearAlgebra  # mean, dot product (⋅)

function gpdfit(x::AbstractArray, min_grid_pts::Int = 30, sort_x::Bool = true; wip::Bool = true)

    # x must be sorted, but we can skip the sort if x is already sorted
    if sort_x
        sort!(x)
    end

    n = length(x)
    m = min_grid_pts + isqrt(n)  # isqrt = integer (floor) square root
    prior = 3
    n_0 = 10  # determines how strongly to nudge kHat towards .5

    # Estimate parameters

    xStar = x[(n + 2) ÷ 4]  # take the first sample quartile
    # Build pointwise estimates of k and θ by using each element of the sample.
    vectorθ = @. 1 / $(x[n]) + (1 - sqrt(m / $((1:m) .- 0.5))) / prior / xStar
    vectorK = vec(mean(log1p.(-vectorθ .* x'), dims = 2))  # take the mean of each row
    # Log-likelihood of the whole sample at each pointwise estimate (note the factor n)
    logLikelihood = @. n * (log(-vectorθ / vectorK) - vectorK - 1)
    # Normalize the likelihoods into weights, working with differences for numerical stability
    weights = vec(1 ./ sum(exp.(logLikelihood .- logLikelihood'), dims = 1))

    θHat = weights ⋅ vectorθ  # dot product of weights and pointwise estimates of θ gives the full estimate

    kHat = mean(log1p.(-θHat .* x))
    σ = -kHat / θHat
    # Drag kHat towards .5 to reduce variance for small n (weakly informative prior)
    if wip
        kHat = (kHat * n + 0.5 * n_0) / (n + n_0)
    end

    return kHat, σ
end

(Note that I haven't tested this code yet.)

@itsdfish
Contributor

itsdfish commented Jun 6, 2021

@ParadaCarleton, thank you for sharing your code. Given that you know more about PSIS-LOO and related metrics than I do, it might make sense to collaborate. Our main goal was to avoid the GNU license and use the MIT license instead. I know that the community has expressed interest in having these metrics, but no one until now has had the expertise to put it together.

@ParadaCarleton
Author

ParadaCarleton commented Jun 6, 2021

I'd love to help with this -- I'm pretty sure I can get PSIS-LOO working, along with some related measures like WAIC/WSIC. I do have a bit less experience with specific implementations of MCMC samplers (especially Stan -- most of my work has been in PyMC3+Turing) and Julia itself. If I could get some kind of function that would take objects produced by Stan, Turing, or Soss and convert them into standard objects I could work with, I'd be happy to build the code that takes these and actually calculates the information criteria for them.

@itsdfish
Contributor

itsdfish commented Jun 6, 2021

@ParadaCarleton, that sounds great. I will provide a simple example that can be adapted to Stan, Turing and Soss. I can post something here within the next 24 - 36 hours.

Our original plan was to develop a generic method that would operate on arrays and then add conversion methods that would interface with MCMCChains and any other data type. For example:

A generic method:

function psisloo(x::Array, args...) 
        ...
end

A method to interface with MCMCChains:

function psisloo(x::Chains, args...) 
        # some code to extract and format the array
        ...
        # call the generic method
        psisloo(extracted_array, args...)
end

With this approach, the model comparison metrics will work with any data format. We can define the most common data formats, such as MCMCChains. Others can develop custom functions as needed. Does this still sound like a good approach to you?

@ParadaCarleton
Author

I think this sounds good, although I would make the generic method use the AbstractMCMC API rather than just an array, and convert other objects to AbstractMCMC objects. The problem with a basic array is that it gets very confusing to keep track of what each dimension and index represents.
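
To illustrate the dispatch idea, here is a rough sketch; the (data, draws, chains) dimension order and the extract_log_lik helper are assumptions for illustration, not an existing API.

import AbstractMCMC

# Core method: operates on a plain 3-D array of pointwise log-likelihoods,
# assumed here to be ordered as (data, draws, chains).
function psisloo(log_lik::AbstractArray{<:Real, 3})
    # ... core PSIS-LOO computation ...
end

# Thin wrapper for anything satisfying the AbstractMCMC chains interface.
# `extract_log_lik` is a hypothetical helper; how the log-likelihood values
# are pulled out depends on how the model stored them.
function psisloo(chns::AbstractMCMC.AbstractChains)
    log_lik = extract_log_lik(chns)
    psisloo(log_lik)
end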

@goedman
Member

goedman commented Jun 8, 2021

@ParadaCarleton

I'm trying to wrap my mind around your two (very interesting) questions:

  1. use the AbstractMCMC API
  2. Switching to Makie/AlgebraOfGraphics

On the first item, in Stan you can:

chns = read_samples(m1_1s; output_format=:mcmcchains)

where m1_1s is the sampled Stan model. Typically m1_1 is the Stan language model.

This would provide a chns object with typeof(chns) <: AbstractMCMC.AbstractChains. Would that be enough for your approach to LOO & company?

I'm not sure if e.g. stepper functionality makes sense, although maybe it is possible to come up with an mcmcsample method.

Your second question, on the MCMCChains.jl repo, is also very interesting. My main objection to MCMCChains.jl has always been the AxisArray base. But with your suggested approach we could, once convinced the chains are proper, use Makie Layers.

Maybe this combination could result in two new packages: BayesMakiePlots.jl and a new, MIT-licensed version of StatsModelComparisons.jl (or a better name for the latter repo)?

@ParadaCarleton
Author

Regarding the first question, I actually made a mistake -- since I wasn't familiar with the AbstractMCMC or MCMCChains APIs, I didn't realize they don't really do what I need. I'm still trying to figure out how you get the log-pdf from Stan, Turing, and Soss models after leaving one observation out, which is the main thing I need to build PSIS-LOO.

As for the second thing, that's a longer-term interest that could be discussed in another thread -- for now, I'm just trying to get PSIS-LOO working.

@goedman
Member

goedman commented Jun 8, 2021

Would it help if I augmented the arsenic example in StatsModelComparisons.jl (using Stan) to generate an MCMCChain object with the log-pdfs? Maybe Chris can do the same with Turing (using VarInfo?). I can take a look at Soss as well, or maybe ask Chad.

I would store the log-pdfs in a separate section of the MCMCChain.

I fully agree with the second part of your reply, although I find the pk-plots useful.

@ParadaCarleton
Author

ParadaCarleton commented Jun 9, 2021

I think Soss support would be great eventually, but Soss is definitely the least popular of the PPLs -- we can focus on getting Turing and Stan working first, then find a way to include Soss. It looks like the relevant objects are the MCMCChain object, which holds a sample from the posterior distribution, and the VarInfo object, which holds information about the full model. I'm going through ArviZ.jl at the moment to see how they did it.

As for turning this into a full package, the inference algorithm itself is pretty simple, but looking through the LOO package in R, things like "Check to make sure the user didn't make a mistake when they input their data" or "Clean the data so it's usable" seem to make up most of the actual code and could be pretty time-consuming to write.
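
As an illustration of the kind of input checking meant here (a sketch, not code from any existing package):

# Illustrative only: reject obviously unusable log-likelihood input before any
# smoothing is attempted. Real packages spend a surprising amount of code here.
function check_log_lik(log_lik::AbstractArray{<:Real, 3})
    isempty(log_lik) && throw(ArgumentError("log-likelihood array is empty"))
    any(isnan, log_lik) && throw(ArgumentError("log-likelihood array contains NaN values"))
    any(isinf, log_lik) && throw(ArgumentError("log-likelihood array contains infinite values"))
    return nothing
end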

@ParadaCarleton
Author

ParadaCarleton commented Jun 13, 2021

I'm currently implementing PSIS-LOO for matrices, following the API for the loo package in R as closely as possible. Functions converting from ArviZ.jl and Stanfit objects will probably end up being useful, although InferenceData will probably be getting a revamp fairly soon to use Julia objects, so I'd hold off on ArviZ objects.

@goedman
Member

goedman commented Jun 14, 2021

That looks pretty easy to do for Stan.jl (I don't know the details of ArviZ.jl).

@goedman
Member

goedman commented Jun 17, 2021

Hi Carlos, not sure if you follow TuringLang/MCMCChains.jl#270 (comment).

Rob

@ParadaCarleton
Author

Hi Rob, I hadn't followed it, but I've made some comments now.

@ParadaCarleton
Author

@itsdfish I have the PSIS code up here. Pull requests adding support for Chains objects from MCMCChains or for Stanfit objects would be very welcome.

@itsdfish
Contributor

Thanks @ParadaCarleton. I will have some time later this week. Would you consider the test file to be a good example of how to use the package?

@goedman
Member

goedman commented Jun 30, 2021

Hi Carlos, thanks for this work!

I'm trying to temporarily incorporate your work into StatsModelComparisons.jl by adding ParetoSmooth.jl and RData.jl as dependencies:

(StatsModelComparisons) pkg> st
     Project StatsModelComparisons v1.0.2
      Status `~/.julia/dev/StatsModelComparisons/Project.toml`
  [336ed68f] CSV v0.8.5
  [a68b5a21] ParetoSmooth v0.1.0 `~/.julia/dev/ParetoSmooth`
  [df47a6cb] RData v0.8.2
  [3cdcf5f2] RecipesBase v1.1.1
  [c1514b29] StanSample v3.1.0
  [4c63d2b9] StatsFuns v0.9.8
  [10745b16] Statistics
  [8dfed614] Test

When I run a slightly modified version of your runtests.jl:

using ParetoSmooth
using Test
using Statistics

import RData

ProjDir = @__DIR__
cd(ProjDir)

let ogArray = RData.load("Example_Log_Likelihood_Array.RData")["x"]
    global logLikelihoodArray = copy(permutedims(ogArray, [3, 1, 2]))
end
let ogWeights = RData.load("weightMatrix.RData")["weightMatrix"]
    global rWeights = exp.(permutedims(reshape(ogWeights, 500, 2, 32), [3, 1, 2]))
end
rel_eff = RData.load("Rel_Eff.RData")["rel_eff"]
rPsis = RData.load("Psis_Object.RData")["psisObject"]
juliaPsis = psis(logLikelihoodArray)
relEffSpecified = psis(logLikelihoodArray, rel_eff)

@testset "ParetoSmooth.jl" begin
    @test sqrt(mean((relEffSpecified.weights - rWeights).^2)) ≤ .0001  # RMSE ≤ .0001 (~10%)
end

I get:

julia> include("/Users/rob/.julia/dev/StatsModelComparisons/research/paretosmooth.jl");
[ Info: Adjusting for autocorrelation. If the posterior samples are not autocorrelated, specify the source of the posterior sample using the keyword argument `source`. MCMC samples are always autocorrelated; VI samples are not.
ERROR: LoadError: MethodError: no method matching check_input_validity_psis(::Matrix{Float64}, ::Vector{Float64})
Closest candidates are:
  check_input_validity_psis(::AbstractArray{T, 3}, ::AbstractVector{T}) where T<:AbstractFloat at /Users/rob/.julia/dev/ParetoSmooth/src/ImportanceSampling.jl:236
Stacktrace:
 [1] psis(log_ratios::Array{Float64, 3}, rel_eff::Vector{Float64}; source::String)
   @ ParetoSmooth.ImportanceSampling ~/.julia/dev/ParetoSmooth/src/ImportanceSampling.jl:80
 [2] psis(log_ratios::Array{Float64, 3}, rel_eff::Vector{Float64}) (repeats 2 times)
   @ ParetoSmooth.ImportanceSampling ~/.julia/dev/ParetoSmooth/src/ImportanceSampling.jl:44
 [3] top-level scope
   @ ~/.julia/dev/StatsModelComparisons/research/paretosmooth.jl:18
 [4] include(fname::String)
   @ Base.MainInclude ./client.jl:444
 [5] top-level scope
   @ REPL[6]:1
in expression starting at /Users/rob/.julia/dev/StatsModelComparisons/research/paretosmooth.jl:18

If I understand the setup right, source=:mcmc is not required for this example. I tried different options.

I get a similar error if I try this with a reshaped log_lik matrix generated in Stan in the cars_waic example. I think I need to reshape that into an Array with dimensions [1000, 50, 4]. log_lik contains an appended-chains version.
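
A minimal sketch of that reshape, assuming the 4000 rows of the appended-chains log_lik matrix are the four chains of 1000 draws stacked one after another, and that psis expects a (data, draws, chains) array:

log_lik = randn(4000, 50)           # stand-in for the real 4000 × 50 matrix (draws × observations)
ll = reshape(log_lik, 1000, 4, 50)  # split the appended chains: (draws, chains, observations)
ll = permutedims(ll, (3, 1, 2))     # reorder to (observations, draws, chains)
size(ll)                            # (50, 1000, 4)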

@goedman
Member

goedman commented Jun 30, 2021

In order to be able to run the cars_waic example in SMC.jl I also temporarily added StanSample.jl, but that does not affect the initial test script above.

@goedman
Member

goedman commented Jun 30, 2021

Aah:

julia> size(rPsis["log_weights"])
(1000, 32)

julia> size(rPsis["diagnostics"]["pareto_k"])
(32,)

julia> size(rPsis["diagnostics"]["n_eff"])
(32,)

it looks like my reshaping of log_lik is wrong and I need to permute the dims.

@goedman
Member

goedman commented Jun 30, 2021

Ok, the problem is the shape of log_ratios when check_input_validity_psis() is called.

If I use the original format of log_ratios (size(log_ratios) = (32, 500, 2)) it passes that test and:

ParetoSmooth.jl: Test Failed at /Users/rob/.julia/dev/StatsModelComparisons/research/paretosmooth.jl:30
  Expression: sqrt(mean((relEffSpecified.weights - rWeights) .^ 2)) ≤ 0.0001
   Evaluated: 0.00040040987501849365 ≤ 0.0001

But with the upper bound at 0.001 it will succeed. Not sure if that upper bound is too liberal.

@ParadaCarleton
Author

Sorry about that -- I accidentally messed up my GitHub version control and used a version that had a bug in it. Version 0.1.1 should be working. RData shouldn't be necessary; it's just a test requirement.

@goedman
Member

goedman commented Jun 30, 2021

No problem, I needed to look at it more closely anyway. It seems to be working now with the log_lik array generated in the cars_waic example, although the results appear rather different.

I'll also do some more checks tomorrow.

Edit: And the test is now indeed within 10%.

@goedman
Member

goedman commented Jul 1, 2021

Attached are pk_plots from two runs, where pk shows the k values from the current StatsModelComparisons.jl version (top plot) and pareto_k shows the ParetoSmooth.jl k values (bottom plot). Of course the pk_plots will differ from run to run, but they definitely look reasonable.

(Plots attached: pk and pareto_k.)

@itsdfish
Contributor

itsdfish commented Jul 1, 2021

Hi Rob,

Can you seed the RNG so that both receive the same input? That might be a more direct comparison.

@goedman
Member

goedman commented Jul 1, 2021

In a sense both do get the exact same log_lik matrix as input (4000 x 50). The WAICs are of course identical.

I'm trying to figure out how to compute this from the Psis structure fields:

ll = reshape(nt_cars.log_lik, 50, 1000, 4);  # log_lik is 50 x 4000 in the NamedTuple
                                             # (will soon switch to the Tables.jl api)
psis_ll = psis(ll)
(;weights, pareto_k, ess, tail_len, posterior_sample_size, data_size) = psis_ll;  # Only in Julia 1.7 and up

Adding Random.seed!(123) both before calling psisloo() and before calling psis() makes no difference.

@itsdfish
Contributor

itsdfish commented Jul 1, 2021

Sorry for the noise. I thought that you were passing two different sets of chains, which would produce different results. I don't know much about the diagnostics. Do you have reason to expect the diagnostic plots to differ when given the same input?

@goedman
Member

goedman commented Jul 1, 2021

I was trying to compute the WAIC based on the weights computed by psis(), and I think I can do something like:

lwp = deepcopy(ll)
lwp += psis_ll.weights;
lwpt = reshape(lwp, 50, 4000)';
loos = reshape(logsumexp(lwpt; dims=1), size(lwpt, 2));
@show loo = sum(loos)
@show 2loo

which results in:


-2 * (sum(lppd) - sum(pwaic)) = 421.73093081708123

waic(log_lik) = (WAIC = 421.7287980706158, lppd = -206.5999724777712, penalty = 4.264426557536728, std_err = 16.297027050407028)

-2loo = 421.1547017803024

[ Info: Adjusting for autocorrelation. If the posterior samples are not autocorrelated, specify the source of the posterior sample using the keyword argument `source`. MCMC samples are always autocorrelated; VI samples are not.

loo = sum(loos) = 208.11604949369575
2loo = 416.2320989873915

which is close. But is that close enough? The R results were near 421, but I'll check that.

To answer your question, no, I don't think so. The plots only change when a new set of draws is generated.
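
For comparison, the estimator in the Vehtari et al. paper combines the smoothed weights and the likelihood as a weighted average per observation. A sketch, assuming psis_ll.weights has the same (data, draws, chains) shape as ll and holds natural-scale (not log) weights; if they are log-weights, exponentiate them first:

using StatsFuns  # logsumexp

ll_mat = reshape(ll, size(ll, 1), :)               # data × (draws * chains)
w_mat  = reshape(psis_ll.weights, size(ll, 1), :)  # data × (draws * chains)
w_norm = w_mat ./ sum(w_mat, dims = 2)             # normalize each observation's weights to sum to one
# elpd_loo_i = log( Σ_s w_is · p(y_i | θ_s) ), computed stably on the log scale
pointwise = [logsumexp(log.(w_norm[i, :]) .+ ll_mat[i, :]) for i in axes(ll_mat, 1)]
@show elpd_loo = sum(pointwise)
@show -2elpd_loo  # deviance scale, comparable to the WAIC numbers above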

@goedman
Member

goedman commented Jul 1, 2021

R results:

> data(cars)
> m <- quap(
+     alist(
+         dist ~ dnorm(mu,sigma),
+         mu <- a + b*speed,
+         a ~ dnorm(0,100),
+         b ~ dnorm(0,10),
+         sigma ~ dexp(1)
+     ) , data=cars )
> set.seed(94)
> post <- extract.samples(m,n=1000)
> 
> ## R code 7.20
> n_samples <- 1000
> logprob <- sapply( 1:n_samples ,
+     function(s) {
+         mu <- post$a[s] + post$b[s]*cars$speed
+         dnorm( cars$dist , mu , post$sigma[s] , log=TRUE )
+     } )
> 
> ## R code 7.21
> n_cases <- nrow(cars)
> lppd <- sapply( 1:n_cases , function(i) log_sum_exp(logprob[i,]) - log(n_samples) )
> 
> ## R code 7.22
> pWAIC <- sapply( 1:n_cases , function(i) var(logprob[i,]) )
> 
> ## R code 7.23
> -2*( sum(lppd) - sum(pWAIC) )
[1] 423.3169

But now using Stan proper instead of the quadratic approximation, and taking 4000 samples:

> m <- ulam(
+     alist(
+         dist ~ dnorm(mu,sigma),
+         mu <- a + b*speed,
+         a ~ dnorm(0,100),
+         b ~ dnorm(0,10),
+         sigma ~ dexp(1)
+     ) , data=cars  , chains=4 , iter=2000)
recompiling to avoid crashing R session

SAMPLING FOR MODEL 'cd82d23a4157bed9d477761c28767a10' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 1.6e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.16 seconds.
Chain 1: Adjust your expectations accordingly!
.. <snipped>
Chain 4: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 4: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 4: 
Chain 4:  Elapsed Time: 0.838425 seconds (Warm-up)
Chain 4:                0.825601 seconds (Sampling)
Chain 4:                1.66403 seconds (Total)
Chain 4: 

> 
> 
> post <- extract.samples(m, n=4000)
> n_samples <- 4000
> logprob <- sapply( 1:n_samples ,
+     function(s) {
+         mu <- post$a[s] + post$b[s]*cars$speed
+         dnorm( cars$dist , mu , post$sigma[s] , log=TRUE )
+     } )
> 
> ## R code 7.21
> n_cases <- nrow(cars)
> lppd <- sapply( 1:n_cases , function(i) log_sum_exp(logprob[i,]) - log(n_samples) )
> 
> ## R code 7.22
> pWAIC <- sapply( 1:n_cases , function(i) var(logprob[i,]) )
> 
> ## R code 7.23
> -2*( sum(lppd) - sum(pWAIC) )
[1] 421.0427

@itsdfish
Contributor

itsdfish commented Jul 1, 2021

A difference of about 4 seems quite large if the R and Julia code are using the same algorithm and settings with the same inputs. It is definitely more than rounding error. Hmmm.

@goedman
Member

goedman commented Jul 1, 2021

I'll try to use the loglik matrix from R to check both psis() and psisloo().

@goedman
Member

goedman commented Jul 2, 2021

Using the R generated matrix:

julia> include("/Users/rob/.julia/dev/StatisticalRethinkingStan/research/cars_waic_R.jl");
-2 * (sum(lppd) - sum(pwaic)) = 421.0426501217422

waic(log_lik) = (WAIC = 421.040729342191, lppd = -206.67976595825237, penalty = 3.840598712843116, std_err = 16.161917924434487)

-2loo = 420.55434215126473
[ Info: Adjusting for autocorrelation. If the posterior samples are not autocorrelated, specify the source of the posterior sample using the keyword argument `source`. MCMC samples are always autocorrelated; VI samples are not.
loo = sum(loos) = 208.03612793433302
2loo = 416.07225586866605

@itsdfish
Contributor

itsdfish commented Jul 2, 2021

Those values look much closer.

@goedman
Member

goedman commented Jul 2, 2021

Well, with the existing psisloo(). But still somewhat off with the new psis().

@ParadaCarleton
Author

ParadaCarleton commented Jul 3, 2021

Sorry, are these comparisons of the WAIC with PSIS-LOO, or of the R and Julia versions, or of the new and old Julia versions? And are they being calculated with the newest version on GitHub (the main version, not the 1.0 release)?

@goedman
Member

goedman commented Jul 3, 2021

These are comparisons between psis() in ParetoSmooth.jl and psisloo() in StatsModelComparisons.jl.

Initially I compared them using Julia & StanJulia but the last results above were using a loglik matrix generated by R.

The results of today's released version (v0.1.0) of ParetoSmooth.jl are identical to the results above.

@ParadaCarleton
Author

Can you try again with v0.1.1?

@goedman
Member

goedman commented Jul 3, 2021

Hmm, this is the result:

julia> include("/Users/rob/.julia/dev/StatisticalRethinkingStan/research/cars_waic_R.jl");
-2 * (sum(lppd) - sum(pwaic)) = 421.0426501217422

waic(log_lik) = (WAIC = 421.040729342191, lppd = -206.67976595825237, penalty = 3.840598712843116, std_err = 16.161917924434487)

-2loo = 420.55434215126473
[ Info: Adjusting for autocorrelation. If the posterior samples are not autocorrelated, specify the source of the posterior sample using the keyword argument `source`. MCMC samples are always autocorrelated; VI samples are not.
loo = sum(loos) = 208.03612793433302
2loo = 416.07225586866605

from running this:

using StatsModelComparisons, Statistics
using StatsFuns, CSV, DataFrames, Random
using ParetoSmooth
#using StatisticalRethinking

ProjDir = @__DIR__

df = CSV.read(joinpath(ProjDir, "cars_logprob.csv"), DataFrame)

log_lik = Matrix(df)';
n_sam, n_obs = size(log_lik)
lppd = reshape(logsumexp(log_lik .- log(n_sam); dims=1), n_obs);

pwaic = [var(log_lik[:, i]) for i in 1:n_obs];
@show -2(sum(lppd) - sum(pwaic))
println()

@show waic(log_lik)
println()

#Random.seed!(123)
loo, loos, pk = psisloo(log_lik);
@show -2loo

#Random.seed!(123)
ll = reshape(Matrix(df), 50, 1000, 4);
psis_ll = psis(ll);

lwp = deepcopy(ll);
lwp += psis_ll.weights;
lwpt = Matrix(reshape(lwp, 50, 4000)');
loos = reshape(logsumexp(lwpt; dims=1), size(lwpt, 2));

@show loo = sum(loos)
@show 2loo

if isdefined(Main, :StatisticalRethinking)
    pk_plot(pk)
    savefig(joinpath(ProjDir, "pk.png"))

    pk_plot(psis_ll.pareto_k)
    savefig(joinpath(ProjDir, "pareto_k.png"))
end

with installed packages:

(StatisticalRethinkingStan) pkg> st
      Status `~/.julia/dev/StatisticalRethinkingStan/Project.toml`
  [80f14c24] AbstractMCMC v3.2.1
  [6e4b80f9] BenchmarkTools v1.1.0
  [5f4fecfd] BrowseTables v0.3.0
  [336ed68f] CSV v0.8.5
  [31c24e10] Distributions v0.25.7
  [38e38edf] GLM v1.5.1
  [c7f686f2] MCMCChains v4.13.0
  [d9ec5142] NamedTupleTools v0.13.7
  [d96e819e] Parameters v0.12.2
  [a68b5a21] ParetoSmooth v0.1.1 `~/.julia/dev/ParetoSmooth`
  [08abe8d2] PrettyTables v1.1.0
  [276daf66] SpecialFunctions v1.5.1
  [fbd8da12] StanOptimize v2.3.0
  [e4723793] StanQuap v1.0.5
  [c1514b29] StanSample v3.1.0
  [2d09df54] StatisticalRethinking v3.4.3
  [4c63d2b9] StatsFuns v0.9.8
  [854dedd9] StatsModelComparisons v1.0.1
  [a41e6734] StructuralCausalModels v1.0.8
  [bd369af6] Tables v1.4.4

@goedman
Member

goedman commented Jul 3, 2021

If you have rethinking installed in R, this is the slightly modified code I run:

## R code 7.19
data(cars)
m <- ulam(
    alist(
        dist ~ dnorm(mu,sigma),
        mu <- a + b*speed,
        a ~ dnorm(0,100),
        b ~ dnorm(0,10),
        sigma ~ dexp(1)
    ) , data=cars  , chains=4 , iter=2000)
set.seed(94)
post <- extract.samples(m, n=1000)

## R code 7.20
n_samples <- 1000
logprob <- sapply( 1:n_samples ,
    function(s) {
        mu <- post$a[s] + post$b[s]*cars$speed
        dnorm( cars$dist , mu , post$sigma[s] , log=TRUE )
    } )

to generate the logprob matrix. Modified to use ulam() and not quap().

@ParadaCarleton
Author

ParadaCarleton commented Jul 3, 2021

@goedman

Carlos, you probably understand this much better than I do, but both PSIS and Waic are to compare models, so a constant offset in the values might be less important?

That's correct -- you might want to double-check that's not the problem. If the answers are off by a constant offset, this shouldn't matter. In addition, are the rel_eff arguments specified and the same between Julia and R? If not, there might be a problem with MCMCChains' ESS method returning different answers, since there are lots of methods for calculating ESS.

If neither of these work, try changing these two lines:

  1. In the file GPD.jl, replace line 58, which is:
    @turbo @. θ_hats = 1 / sample[len] + (1 - sqrt((m+1) / $(1:m))) / prior / quartile
    with
    @turbo @. θ_hats = 1 / sample[len] + (1 - sqrt(m / $((1:m) - .5))) / prior / quartile
  2. In file ImportanceSampling.jl, replace line 205, which is:
    @turbo @. tail = GPD.gpd_quantile($(1:len) / (len + 1), ξ, σ) + cutoff
    with:
    @turbo @. tail = GPD.gpd_quantile(($(1:len) - .5) / len, ξ, σ) + cutoff
    Try changing one line at a time, then both lines if neither change alone works. Neither of these should make a meaningful difference -- they're slightly different ECDF interpolations that should be off by at worst 1/1000 = .1% -- but I might be wrong about one or both of them. When I tested them they didn't seem to result in a big error relative to the R version, so I'd be very surprised if this causes a 4-point difference in the ELPD calculation.

@goedman
Member

goedman commented Jul 3, 2021

Hi Carlos,

Have tried all three, no change in the loo value in either step.

The update to GPD.jl needed to be:

    @turbo @. θ_hats = 1 / sample[len] + (1 - sqrt(m / $((1:m) .- .5))) / prior / quartile

Not quite sure why I needed to add the broadcast '.' before the - .5.

| In addition, are the rel_eff arguments specified and the same between Julia and R?

I believe the settings in R and psisloo() in StatsModelComparisons.jl are identical; at least the values were a good match in the 8 or so example cases I used for testing before releasing StatsModelComparisons.jl. At the same time I have a hard time comparing the handling of rel_eff between the two Julia versions. psisloo() certainly does not use MCMCChains' ESS method. Doing the Turing test with the cars data might shed more light on this.

@ParadaCarleton
Author

ParadaCarleton commented Jul 3, 2021

So, are the comparisons between psisloo() and the new psis()? The R version of Loo has had some pretty significant changes over the past few years, and from what I can see the new psis() seems to give weights indistinguishable from those of loo, so it might be that the old psisloo() has drifted from the new R package? Alternatively, it might be that the testing code has a bug in it.

(As for the broadcast, my original had a typo -- $((1:m) - .5), rather than ($(1:m) - .5). In the former, the $ prevents broadcasting over the minus sign, while in the latter it only keeps it from broadcasting over the sequence 1:m.)
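
A small demonstration of the difference, assuming a recent Julia where subtracting a scalar from a range without the dot is a MethodError:

m = 5
# `$` protects the whole parenthesized expression from `@.`, so the subtraction
# inside it is evaluated as written and needs its own broadcast dot:
a = @. sqrt(m / $((1:m) .- 0.5))
# Here only `1:m` is protected, so `@.` rewrites the `-` into `.-` automatically:
b = @. sqrt(m / ($(1:m) - 0.5))
a == b  # true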

@goedman
Member

goedman commented Jul 4, 2021

The results of the tests are that psisloo() and R’s psis remain identical (I just reran that after upgrading all R packages). The new psis() (yours) gives a slightly different value. I suggest we wait until you are done with the loo extensions because it is very possible my test code is at fault. If that’s not the case we should look at the ess values.

In the research subdirectory of StatisticalRethinkingStan I added a 2nd check using the chimpanzees dataset from R's rethinking. Here the K_L_divergence differences are quite big:

julia> include("/Users/rob/.julia/dev/StatisticalRethinkingStan/research/chimpanzees/chimpanzees_psis");

/var/folders/l7/pr04h0650q5dvqttnvs8s2c00000gn/T/jl_xxoffZ/m11_4s.stan updated.
-2 * (sum(lppd) - sum(pwaic)) = 680.4253578535058

waic(log_lik) = (WAIC = 680.4243767462555, lppd = -338.25046442596323, penalty = 1.9617239471644632, std_err = 9.255022377796303)

-2loo = 680.1994143478744
[ Info: Adjusting for autocorrelation. If the posterior samples are not autocorrelated, specify the source of the posterior sample using the keyword argument `source`. MCMC samples are always autocorrelated; VI samples are not.
loo = sum(loos) = 3842.0770399859157
2loo = 7684.154079971831

If you have rethinking installed in R, ?PSIS shows the example R code to run it. I'll check this example with Chris' code.

Given that you mentioned the possibility of more packages for model selection purposes, maybe it makes sense to create a ModelSelection GitHub organization to have an anchor point? Just a thought.

@ParadaCarleton
Author

ParadaCarleton commented Jul 5, 2021

Ahh, I just plan to build one package for model selection purposes, separately from the implementation of PSIS. This way, people can use PSIS for things like VI+Importance Sampling without having to install a bunch of model comparison tools.

Note that when rel_eff is specified, ESS methods shouldn't matter (since ESS is only used in calculations of rel_eff).
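
For anyone following along, a sketch of the rel_eff convention used by the R loo package (and assumed here to carry over): the relative efficiency of each observation is the effective sample size of exp(log-likelihood) divided by the total number of draws. `my_ess` is a hypothetical ESS estimator for a draws × chains matrix, not part of either package.

# Sketch only; `my_ess` is a placeholder for whichever ESS estimator you trust.
function relative_efficiency(log_lik::AbstractArray{<:Real, 3}, my_ess)
    n_obs, n_draws, n_chains = size(log_lik)
    [my_ess(exp.(log_lik[i, :, :])) / (n_draws * n_chains) for i in 1:n_obs]
end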

The model comparison package has been put here, and now implements a LOO method. The answers looked sensible but I'm not 100% sure they're good either.

@goedman
Member

goedman commented Jul 5, 2021

Thank you.

I modified the function definition a bit:

    function psis_loo(log_likelihood::ArrayType,
        score::Function=identity;  # Default to log_likelihood (elpd) score function 
        rel_eff=similar(log_likelihood, 0)
        ) where {F<:AbstractFloat, ArrayType<:AbstractArray{F, 3}}

because loo in similar(loo, 0) is undefined.

The result I'm getting for the cars example:

-2 * (sum(lppd) - sum(pwaic)) = 421.45579489140215

waic(log_lik) = (WAIC = 421.453730532524, lppd = -206.59917968923025, penalty = 4.127685577031707, std_err = 16.19157000365693)

-2loo = 420.8822100662317

Using ParetoSmooth's psis()

[ Info: Adjusting for autocorrelation. If the posterior samples are not autocorrelated, specify the source of the posterior sample using the keyword argument `source`. MCMC samples are always autocorrelated; VI samples are not.

loo = sum(loos) = 208.1168127112213
2loo = 416.2336254224426

Using Bombe's psis_loo()

[ Info: Adjusting for autocorrelation. If the posterior samples are not autocorrelated, specify the source of the posterior sample using the keyword argument `source`. MCMC samples are always autocorrelated; VI samples are not.

cars_loo = psis_loo(ll);
sum(cars_loo.pointwise.estimate) = -204.9399874711436

@ParadaCarleton
Author

ParadaCarleton commented Jul 5, 2021

Thanks. I have no idea how my testing didn't catch that. The function shouldn't have run at all when I tried that.

So it looks like the problem isn't in the addition code.

I've tried updating the code in ParetoSmooth; can you see if the new version works? It seems to give identical weights (up to floating point error) to R's Loo package. (isapprox returns true)

@goedman
Member

goedman commented Jul 5, 2021

Makes little difference. I think, given your check on the weights above, your code at this point is more trustworthy than the old code. I now get:

-2 * (sum(lppd) - sum(pwaic)) = 421.46518110423153

waic(log_lik) = (WAIC = 421.46312317812544, lppd = -206.61673833982252, penalty = 4.114823249240203, std_err = 16.209320889781484)

-2loo = 420.900376736517

Using ParetoSmooth's psis()

[ Info: Adjusting for autocorrelation. If the posterior samples are not autocorrelated, specify the source of the posterior sample using the keyword argument `source`. MCMC samples are always autocorrelated; VI samples are not.

loo = sum(loos) = 208.0992394555812
2loo = 416.1984789111624

Using Bombe's psis_loo()

[ Info: Adjusting for autocorrelation. If the posterior samples are not autocorrelated, specify the source of the posterior sample using the keyword argument `source`. MCMC samples are always autocorrelated; VI samples are not.

sum(cars_loo.pointwise.estimate) = -204.91024688501258

For completeness, the pareto_k plots. The first plot is basically cars_loo.pointwise.pareto_k; the second plot is pk as computed in the old psisloo(). I have often wondered why the old pk values go up that much. Maybe your autocorrelation correction is better?

(Plots attached: pareto_k_plot and pk_plot.)
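
As a quick check on either set of k values, the usual rule of thumb from the PSIS literature is that observations with k above roughly 0.7 have unreliable estimates:

# Flag observations whose Pareto k exceeds the usual 0.7 reliability threshold.
high_k = findall(k -> k > 0.7, cars_loo.pointwise.pareto_k)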

@goedman
Member

goedman commented Jul 5, 2021

Carlos, can you describe how you compare the weights between Julia and R? Just for my education.

@ParadaCarleton
Author

ParadaCarleton commented Jul 6, 2021

Just using count(r_weights .≈ jul_weights), which indicated all the weights were approximately equal.

The weights and pareto_k values are the same, but the code in Bombe has an error somewhere that I'm trying to find -- the LOO-CV is off from the value returned by R by about 4. I suspect we're both making a similar mistake somewhere in our LOO code.

@goedman
Member

goedman commented Jul 6, 2021

Noticed you just did a major addition.

Thanks for the above suggestions; this gives me a better target to look for if the new version doesn't solve it!

Thanks for this work, in my opinion a major step forward.

@ParadaCarleton
Author

ParadaCarleton commented Jul 6, 2021

Yep, I found the mistake and fixed it. Because the LOO code ended up being smaller than I expected, I've decided to incorporate it into ParetoSmooth.jl instead of spinning it off. The code can be found here. However, I think it doesn't run at the moment -- I'm trying to sort out a bug that keeps me from running tests on my own computer...
