
Allow static covariates in BGNBDModel #1390

Open · wants to merge 28 commits into base: main
Conversation

@PabloRoque (Contributor) commented Jan 16, 2025

Description

Allows static covariates in BetaGeoModel

NOTE: It seems there are convergence issues with the dropout-covariate-related params a|b, related to similar observations by @juanitorduz here. As a consequence, the last two assertions in test_distribution_method are a dubious hack.

Related Issue

Checklist

Modules affected

  • MMM
  • CLV
  • Customer Choice

Type of change

  • New feature / enhancement
  • Bug fix
  • Documentation
  • Maintenance
  • Other (please specify):

📚 Documentation preview 📚: https://pymc-marketing--1390.org.readthedocs.build/en/1390/

codecov bot commented Jan 16, 2025

Codecov Report

Attention: Patch coverage is 85.91549% with 10 lines in your changes missing coverage. Please review.

Project coverage is 94.48%. Comparing base (cd09bd6) to head (a8c1328).

Files with missing lines Patch % Lines
pymc_marketing/clv/models/beta_geo.py 85.91% 10 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1390      +/-   ##
==========================================
- Coverage   94.61%   94.48%   -0.14%     
==========================================
  Files          50       50              
  Lines        5441     5493      +52     
==========================================
+ Hits         5148     5190      +42     
- Misses        293      303      +10     


@wd60622 wd60622 changed the title [DRAFT]: Allow static covariates in BGNBDModel Allow static covariates in BGNBDModel Jan 17, 2025
@ColtAllen (Collaborator)

Merging #1375 has created a merge conflict in clv.distributions.py, but it shouldn't be hard to fix. It seems this branch was created off of that one.

@ColtAllen ColtAllen added the enhancement New feature or request label Jan 18, 2025
@ColtAllen ColtAllen added this to the 0.12.0 milestone Jan 18, 2025
@ColtAllen (Collaborator)

The _extract_predictive_variables internal method for covariates can probably be moved into the Base CLV class to reduce repetition, because static covariates can be included in all CLV models.

@PabloRoque PabloRoque marked this pull request as ready for review January 22, 2025 18:18
@PabloRoque (Author)

> Merging #1375 has created a merge conflict in clv.distributions.py, but shouldn't be hard to fix. Seems like this branch was created off of that one.

Solved

@PabloRoque (Author)

> The _extract_predictive_variables internal method for covariates can probably be moved into the Base CLV class to reduce repetition, because static covariates can be included in all CLV models.

I don't see a fast and easy way forward. Different models have different params, and different relations between covariates and params. We could:

  • Implement some sort of config for these relationships
  • Iterate over the different RVs and perform some checks

I'd much rather address this in a separate PR. What do you think, @ColtAllen?
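For illustration, the first option (a config for the covariate-to-param relationships) might look something like the sketch below. Every name here is hypothetical, chosen only to show the shape of the idea; it is not the actual pymc-marketing API:

```python
# Hypothetical sketch: each model declares which of its params each kind of
# covariate acts on, so a shared base-class method could look the mapping up
# instead of hard-coding per-model logic. All names are illustrative.
COVARIATE_PARAM_MAP = {
    "BetaGeoModel": {
        "purchase_covariates": ["alpha"],
        "dropout_covariates": ["a", "b"],
    },
    "ParetoNBDModel": {
        "purchase_covariates": ["r", "alpha"],
        "dropout_covariates": ["s", "beta"],
    },
}


def params_for(model_name: str, covariate_kind: str) -> list[str]:
    """Look up which model params a given covariate kind modifies."""
    return COVARIATE_PARAM_MAP[model_name][covariate_kind]
```

A base-class `_extract_predictive_variables` could then iterate over `params_for(...)` rather than branching on the model class, at the cost of keeping the config in sync with each model's actual parameterization.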

@PabloRoque PabloRoque requested a review from ColtAllen January 24, 2025 14:18
@ColtAllen (Collaborator)

> > The _extract_predictive_variables internal method for covariates can probably be moved into the Base CLV class to reduce repetition, because static covariates can be included in all CLV models.
>
> I don't see a fast and easy way forward. Different models have different params, and different relations between covariates and params. We could:
>
> • Implement some sort of config for these relationships
> • Iterate over the different RVs and perform some checks
>
> I'd much rather address this in a separate PR. What do you think, @ColtAllen?

I agree it should be left to a separate PR, but before creating an issue let me give it some more thought. This would require refactoring to pass arguments into the internal method for the specific model param names, and probably conditionals on the child model class instance itself. I'm not sure an overly complex, hard-to-maintain base class method is worth eliminating boilerplate in the child classes.

@ColtAllen (Collaborator) left a comment

Mostly minor comments around things like renaming variables, but the poor convergence in distribution_new_customers needs more investigation and plotting of visuals in a notebook.

On that note, a covariate example will need to be added to the notebooks for BGNBD and/or the Quickstart. You can work off the example in the PNBD notebook.

I have an idea on why distribution_new_customers may not be converging well. I'll create an issue for it.

@@ -140,6 +144,9 @@ class BetaGeoModel(CLVModel):
Error Problem." http://brucehardie.com/notes/027/bgnbd_num_error.pdf.
.. [4] Fader, P. S. & Hardie, B. G. (2019) "A Step-by-Step Derivation of the BG/NBD
Model." https://www.brucehardie.com/notes/039/bgnbd_derivation__2019-11-06.pdf
.. [5] Fader, Peter & G. S. Hardie, Bruce (2007).
ColtAllen (Collaborator):

Can we add an in-line citation for this reference at the top level of the docstring?

Comment on lines 217 to 219
purchase_coefficient_gamma1 = self.model_config[
"purchase_coefficient_prior"
].create_variable("purchase_coefficient_gamma1")
ColtAllen (Collaborator):

Why is the gamma1 suffix being used here?

Contributor:

Possible to use model coordinates instead?

Comment on lines 249 to 254
dropout_coefficient_gamma2 = self.model_config[
"dropout_coefficient_prior"
].create_variable("dropout_coefficient_gamma2")
dropout_coefficient_gamma3 = self.model_config[
"dropout_coefficient_prior"
].create_variable("dropout_coefficient_gamma3")
ColtAllen (Collaborator):

Can we change these _gamma2/_gamma3 suffixes to _alpha and _beta? Gamma is a confusing term because it pertains to the purchasing process in the research.

PabloRoque (Author):

I was trying to follow the convention here [image]. There is no beta, but a and b (I can call them coefficient_a and coefficient_b if you like).

  • We can rename them purchase_coefficient and dropout_coefficient to follow the implementation in ParetoNBDModel. But then the gamma2 and gamma3 coefficients must be equal. This is in fact how R's CLVTools implements it, btw.
  • I tried that implementation, and it does not help with the convergence.

ColtAllen (Collaborator):

Oh my mistake - I meant _a and _b.

Are you saying CLVTools fixes these coefficients to be equal to each other? They share the same data, but this doesn't seem to align with the research note.

Also, which implementation is not helping with convergence?

@PabloRoque (Author) commented Jan 24, 2025:

> Are you saying CLVTools fixes these coefficients to be equal to each other? They share the same data, but this doesn't seem to align with the research note.

That is indeed the case. See here and here

> Also, which implementation is not helping with convergence?

Using the same implementation as CLVTools, fixing gamma2 = gamma3. It did not help with the test issues.

ColtAllen (Collaborator):

My best guess as to why the CLVTools developers did this is easier interpretability and/or faster model fits. What's weird is why they kept it if convergence is negatively impacted. This could be a good selling point for pymc-marketing compared to other open-source tools.

Explaining covariate impacts on overall dropout in terms of separate a and b coefficients will be tricky, but not impossible. Generally, if a > b, then p increases, and vice versa. The greater both values are, the narrower the distribution.
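That rule of thumb follows from the moments of the Beta distribution (mean a/(a+b), variance shrinking as a+b grows) and can be checked numerically. A minimal scipy illustration, not code from this PR:

```python
from scipy.stats import beta

# Mean of Beta(a, b) is a / (a + b): if a > b, dropout p shifts above 0.5.
assert beta(3, 1).mean() > 0.5 > beta(1, 3).mean()

# Scaling a and b up together keeps the mean but narrows the distribution:
# var = a*b / ((a+b)**2 * (a+b+1)) decreases as a + b grows.
assert beta(30, 30).var() < beta(3, 3).var()
```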

Comment on lines 289 to 294
dropout_coefficient_gamma2 = self.model_config[
"dropout_coefficient_prior"
].create_variable("dropout_coefficient_gamma2")
dropout_coefficient_gamma3 = self.model_config[
"dropout_coefficient_prior"
].create_variable("dropout_coefficient_gamma3")
ColtAllen (Collaborator):

See previous comment.

This hierarchical pooling conditional block may be worth its own internal method because it appears in several models, but we can leave that as a separate PR.

Comment on lines 965 to 975
# NOTE: These are the problematic tests due to poor convergence.
# We would need to test e.g.:
# assert (res_zero["dropout"] < res_high["dropout"]).all()
# Instead we test "less than" within tolerance
assert (
(
res_zero["recency_frequency"].sel(obs_var="recency")
- res_high["recency_frequency"].sel(obs_var="recency")
)
< 0.35
).all()
ColtAllen (Collaborator):

We may need to break this into multiple tests, because there are 9 separate assert statements happening here.

We should also investigate this particular assert in a notebook because the corresponding test in ParetoNBDModel doesn't have this problem.

PabloRoque (Author):

Related to the notebook: I am drafting something in #1430. That PR is more focused on:

  • Fixing plotting to allow covariates
  • Introducing a new dataset ApparelTrans exported from CLVTools's covariates examples

I wanted to keep this in a separate PR, so I will add here only a small gist with convergence plots.

PabloRoque (Author):

I replaced these final tests altogether, taking into account the properties of the Beta distribution. They should make sense now.

@ColtAllen (Collaborator)

Created an issue that may be related to the poor convergence: #1431

reviewnb bot: Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter notebooks.

@github-actions github-actions bot added the docs Improvements or additions to documentation label Feb 2, 2025
@PabloRoque (Author) commented Feb 2, 2025

@ColtAllen

OK, the plot kind of thickened.

> the poor convergence in distribution_new_customers needs more investigation and plotting of visuals in a notebook.

I added docs/notebooks/dev/clv/dev/bg_nbg_covariates_test_issues.ipynb. Findings:

  • We were assuming the tests to be equivalent to those in ParetoNBD, but this should not be the case.
  • Whenever a = b, the Beta distribution is symmetric with E[X] = 0.5.
  • Under this condition, the only effect of introducing covariates is the narrowing of the distribution around E[X] = 0.5.
  • I believe this has implications for the results from CLVTools. For them gamma2 = gamma3, so if a0 = b0 you basically have a fancy 50/50 coin toss determining your dropout probability.
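The symmetry point above can be verified directly; a small scipy sketch (illustrative only, not the PR's test code):

```python
import numpy as np
from scipy.stats import beta

# With a == b the Beta distribution is symmetric around 0.5, so a covariate
# term that multiplies both parameters equally cannot move the mean at all.
for covariate_scale in [1.0, np.exp(1.0), np.exp(2.0)]:
    a = b = 3.0 * covariate_scale
    assert np.isclose(beta(a, b).mean(), 0.5)

# The only covariate effect left is narrowing: larger a = b, smaller variance.
assert beta(3.0 * np.e, 3.0 * np.e).var() < beta(3.0, 3.0).var()
```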

> Generally, if a > b, then p increases, and vice versa. The greater both values are, the narrower the distribution.

This was the whole thing! I was not taking into account the properties of the Beta distribution and was following the ParetoNBD implementation blindly. Working on the notebook, I thought I was going crazy: E["dropout"] ~ 0.5 always, regardless of the covariates I was using. Note that in the test setup we have equal coefficients: dropout_coefficient_a=np.array([3.0]), dropout_coefficient_b=np.array([3.0]). This was meant to be the case from the beginning.

Note, however, one more finding. We are using a|b = pm.Flat("a|b") in distribution_new_customer. If we do not pass data to the function, we will have issues, because Flat and Beta don't get along too well.

@PabloRoque PabloRoque requested a review from ColtAllen February 2, 2025 11:32
Labels
CLV docs Improvements or additions to documentation enhancement New feature or request tests
3 participants