Update index.qmd #548

Open · wants to merge 2 commits into master
Conversation

willtebbutt (Member)

Addresses part of #547.

@gdalle does this read more correctly?

github-actions bot (Contributor) commented Nov 4, 2024

Preview the changes: https://turinglang.org/docs/pr-previews/548
Please avoid using the search feature and navigation bar in PR previews!

@gdalle commented Nov 4, 2024

I think there might still be a misunderstanding here. The `compile` keyword argument in `AutoReverseDiff(; compile)` only controls whether the tape is optimized after being recorded. It doesn't control whether a recorded tape is used only once or reused across function calls (which can induce control-flow problems).
Tape reuse is instead determined by how the `ADgradient` is created, and whether or not an example vector `x` is passed to it. See these key lines in LogDensityProblemsAD.jl (a rough sketch of the two construction paths follows the questions below):

https://github.com/tpapp/LogDensityProblemsAD.jl/blob/e3401f21b5a065df0d5de38b37fad0e6650618f3/ext/LogDensityProblemsADADTypesExt.jl#L52-L54

https://github.com/tpapp/LogDensityProblemsAD.jl/blob/e3401f21b5a065df0d5de38b37fad0e6650618f3/ext/LogDensityProblemsADReverseDiffExt.jl#L45-L56

So the real questions are:

  • in which places does Turing "prepare" the `ADgradient` with an example `x`?
  • can the user of the library do anything about it, if they know that their loglikelihood has value-dependent control flow?
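To make the two construction paths concrete, here is a hedged sketch (the `ToyProblem` type is invented for illustration; the `x` keyword is the one forwarded in the extension lines linked above):

```julia
using LogDensityProblems, LogDensityProblemsAD, ADTypes, ReverseDiff

# Hypothetical log-density (a standard normal) implementing the
# LogDensityProblems.jl interface, just to have something to differentiate.
struct ToyProblem
    dim::Int
end
LogDensityProblems.dimension(p::ToyProblem) = p.dim
LogDensityProblems.capabilities(::Type{ToyProblem}) = LogDensityProblems.LogDensityOrder{0}()
LogDensityProblems.logdensity(::ToyProblem, x) = -sum(abs2, x) / 2

ℓ = ToyProblem(3)

# Without an example input, nothing can be recorded ahead of time.
∇ℓ = ADgradient(AutoReverseDiff(; compile=true), ℓ)

# With an example input `x`, a tape is recorded up front from that `x`;
# whether it is then reused is what the rest of this thread pins down.
∇ℓ_prepared = ADgradient(AutoReverseDiff(; compile=true), ℓ; x=zeros(3))

LogDensityProblems.logdensity_and_gradient(∇ℓ_prepared, randn(3))
```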

@willtebbutt (Member, Author)

Ah, I see. Maybe @torfjelde is the right person to ask about this? I've not yet dug into exactly where the ReverseDiff gradient handling happens in Turing.jl.

@torfjelde (Member)

> in which places does Turing "prepare" the `ADgradient` with an example `x`?

Typically this occurs in the initial step of the sampler, i.e. once (there are exceptions, but in those cases we can't really do much better anyway).

> can the user of the library do anything about it, if they know that their loglikelihood has value-dependent control flow?

Unfortunately not for ReverseDiff.jl in compiled mode (they can however just not use compiled mode).
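To make the caveat concrete, here is a hedged sketch (hypothetical model, not taken from this PR; assumes the `adtype` keyword on the sampler constructor and that Turing re-exports the ADTypes constructors):

```julia
using Turing, ReverseDiff

# The branch taken depends on the sampled value of `x`, so the sequence of
# operations can change between model evaluations.
@model function branching(y)
    x ~ Normal(0, 1)
    if x > 0
        y ~ Normal(x, 1)
    else
        y ~ Normal(-x, 2)
    end
end

# Safe: with `compile=false`, a fresh tape is recorded for every gradient call.
chain = sample(branching(0.5), NUTS(; adtype=AutoReverseDiff(; compile=false)), 1000)

# Not safe for this model: with `compile=true`, the tape recorded at the initial
# step keeps replaying whichever branch was taken when it was recorded.
# sample(branching(0.5), NUTS(; adtype=AutoReverseDiff(; compile=true)), 1000)
```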


index.qmd (excerpt under review):

Cached tapes should only be used if you are absolutely certain that the sequence of operations performed in your code does not change between different executions of your model.
Thus, e.g., in the model definition and all implicitly and explicitly called functions in the model, all loops should be of fixed size, and `if`-statements should consistently execute the same branches.
For instance, `if`-statements with conditions that can be determined at compile time or conditions that depend only on fixed properties of the data will always execute the same branches during sampling (if the data is constant throughout sampling and, e.g., no mini-batching is used).

Review comment (Member):
Suggested change:
- For instance, `if`-statements with conditions that can be determined at compile time or conditions that depend only on fixed properties of the data will always execute the same branches during sampling (if the data is constant throughout sampling and, e.g., no mini-batching is used).
+ For instance, `if`-statements with conditions that can be determined at compile time or conditions that depend only on fixed properties of the model, e.g. fixed data.

I don't think there's much point mentioning minibatching here, as there's no "easy to use" support for it in Turing.jl, so it's really not something people do much (I don't think I've ever seen anyone actually do it in applications with Turing.jl).

@torfjelde left a comment (Member)

Added some comments 👍

@@ -20,11 +20,12 @@ As of Turing version v0.30, the global configuration flag for the AD backend has
Users can pass the `adtype` keyword argument to the sampler constructor to select the desired AD backend, with the default being `AutoForwardDiff(; chunksize=0)`.

For `ForwardDiff`, pass `adtype=AutoForwardDiff(; chunksize)` to the sampler constructor. A `chunksize` of 0 permits the chunk size to be automatically determined. For more information regarding the selection of `chunksize`, please refer to [related section of `ForwardDiff`'s documentation](https://juliadiff.org/ForwardDiff.jl/dev/user/advanced/#Configuring-Chunk-Size).
Review comment (Member):
Suggested change:
- For `ForwardDiff`, pass `adtype=AutoForwardDiff(; chunksize)` to the sampler constructor. A `chunksize` of 0 permits the chunk size to be automatically determined. For more information regarding the selection of `chunksize`, please refer to [related section of `ForwardDiff`'s documentation](https://juliadiff.org/ForwardDiff.jl/dev/user/advanced/#Configuring-Chunk-Size).
+ For `ForwardDiff`, pass `adtype=AutoForwardDiff(; chunksize)` to the sampler constructor. A `chunksize` of `nothing` permits the chunk size to be automatically determined. For more information regarding the selection of `chunksize`, please refer to [related section of `ForwardDiff`'s documentation](https://juliadiff.org/ForwardDiff.jl/dev/user/advanced/#Configuring-Chunk-Size).
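For context, a hedged usage sketch of the wording above (toy model and data invented; assumes the `adtype` keyword on the sampler constructor described in the docs excerpt):

```julia
using Turing

@model function toy(y)
    μ ~ Normal(0, 1)
    y ~ Normal(μ, 1)
end

# `chunksize=nothing` (or simply omitting it) lets ForwardDiff determine the
# chunk size automatically.
chain = sample(toy(1.5), NUTS(; adtype=AutoForwardDiff(; chunksize=nothing)), 1000)
```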

Review comment (Member):

Maybe also add this: as @gdalle pointed out, this is not actually the case anymore (it was a left-over from using LogDensityProblemsAD.jl, I believe, but that changed when everything moved to being ADTypes.jl-backed).

@gdalle commented Nov 6, 2024

> Unfortunately not for ReverseDiff.jl in compiled mode (they can however just not use compiled mode).

@torfjelde I think that's the key confusion between us. My understanding is that, regardless of whether it is compiled or not, the tape only captures execution for one version of the control flow. Thus, the tape always becomes invalid if the control flow changes (e.g. a different branch of an if statement). Compilation just speeds up the differentiation but does not affect anything else, especially not correctness. Do you have a different understanding?

Source: https://juliadiff.org/ReverseDiff.jl/dev/api/#The-AbstractTape-API
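A minimal sketch of that point (toy function invented; the tape captures only the branch taken at recording time):

```julia
using ReverseDiff

# Value-dependent control flow: which expression gets differentiated depends on x[1].
f(x) = x[1] > 0 ? sum(abs2, x) : sum(x)

# Recording at x = [1.0, 2.0] captures only the `sum(abs2, x)` branch.
tape = ReverseDiff.GradientTape(f, [1.0, 2.0])

# Replaying the tape at an input that should take the other branch silently
# returns the gradient of the recorded branch ([-2.0, 4.0]) instead of [1.0, 1.0].
ReverseDiff.gradient!(zeros(2), tape, [-1.0, 2.0])
```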

@devmotion (Member)

The tape is purely internal and neither stored nor reused if `compile = Val(false)`.

> Tape reuse is instead determined by how the `ADgradient` is created, and whether or not an example vector `x` is passed to it

This is wrong. The tape is only reused if `compile = Val(true)`: https://github.com/tpapp/LogDensityProblemsAD.jl/blob/e3401f21b5a065df0d5de38b37fad0e6650618f3/ext/LogDensityProblemsADReverseDiffExt.jl#L51-L56
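Put differently, a hedged sketch of the behaviour being described (deliberately simplified; not the actual LogDensityProblemsAD.jl source):

```julia
using ReverseDiff

# Simplified picture of the two code paths behind the linked lines.
function make_gradient(f, x_example; compile::Bool)
    if compile
        # Tape recorded once from the example input, compiled, and reused on every call.
        tape = ReverseDiff.compile(ReverseDiff.GradientTape(f, x_example))
        return x -> ReverseDiff.gradient!(similar(x), tape, x)
    else
        # No stored tape: ReverseDiff records a fresh internal tape on every call.
        return x -> ReverseDiff.gradient(f, x)
    end
end
```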

@gdalle commented Nov 6, 2024

Okay, then IIUC it's a matter of inconsistency between

  • the `compile` kwarg here and in LogDensityProblemsAD (which determines whether a tape is reused)
  • the compilation as defined by ReverseDiff, which is about speeding up an existing tape (and disconnected from the reuse issue)

@gdalle commented Nov 6, 2024

Note that in ADTypes, the `compile` argument to `AutoReverseDiff` is defined ambiguously too. So we should perhaps add more details to the struct, something like `AutoReverseDiff(tape=true, compile=false)`?

@gdalle commented Nov 6, 2024

I'd love for you to chime in here @devmotion @torfjelde: SciML/ADTypes.jl#91

@torfjelde (Member)

> Note that in ADTypes, the `compile` argument to `AutoReverseDiff` is defined ambiguously too. So we should perhaps add more details to the struct, something like `AutoReverseDiff(tape=true, compile=false)`?

I'm not too opinionated about this, as compilation without caching seems somewhat useless? Are there scenarios where you'd like to do that?

@gdalle commented Nov 6, 2024

> I'm not too opinionated about this, as compilation without caching seems somewhat useless? Are there scenarios where you'd like to do that?

Indeed you can't compile a tape you never record in the first place. In any case, I think the ambiguous terminology was fixed by SciML/ADTypes.jl#91. It's just a shame that the word "compile" was chosen instead of "record", given how both are used in ReverseDiff's documentation. But it's a sunk cost now.
