Add support for multi-stage formulas. #24

Open
matthewwardrop opened this issue Aug 30, 2020 · 7 comments · May be fixed by #108

@matthewwardrop
Owner

In some of my work I am interested in exploring two-stage least-squares regression on sparse data, and thus in making Formulaic able to handle it nicely.

My plan is to allow formulas of the form:
y ~ a + [b + c ~ z1 + z2] | a + [e + f ~ z1 + z2] | d + [b + c ~ z1 + z2] | d + [e + f ~ z1 + z2]
In my proposed grammar, this would also be equivalent to:
y ~ (a|d) + [b + c | e + f ~ z1 + z2]
Using multipart syntax in the rhs of nested formulas would be forbidden.

The API for accessing the various pieces of this Formula is as yet not fully fleshed out, and naming has not been properly considered, but would be something like:

f = Formula('y ~ (a|d) + [b + c | e + f ~ z1 + z2]')
f.formula_for(rhs_part=0, stage=0)  # b + c ~ z1 + z2
f.formula_for(rhs_part=0, stage=1)  # y ~ a + b + c
f.formula_for(rhs_part=1, stage=0)  # e + f ~ z1 + z2
f.formula_for(rhs_part=1, stage=1)  # y ~ a + e + f
f.formula_for(rhs_part=0)  # y ~ a + [b + c ~ z1 + z2]

f = Formula('y ~ x + z')
f.formula_for() # y ~ x + z

On a multipart formula like this one, calls to get_model_matrix will need to specify the part and stage for which the model matrix should be generated. If there is only one part or stage, this will not be necessary. Formulaic explicitly will not attempt to do any modeling with this, and will expect users of the library to do any memoisation that is required for two-stage least-squares to work when pumping new data sets through a pre-trained model.
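To make the proposed part/stage decomposition concrete, here is a minimal sketch that works on plain strings and does not use Formulaic at all. The helper name `split_stages` is hypothetical, and it assumes a single-part formula with at most one bracketed sub-formula; it only illustrates what a call like `formula_for(rhs_part=0, stage=...)` might return.

```python
import re

# Hypothetical helper, not part of Formulaic: split a formula containing one
# bracketed sub-formula, e.g. "y ~ a + [b + c ~ z1 + z2]", into its two stages.
NESTED = re.compile(r"\[([^\[\]~]+)~([^\[\]]+)\]")


def split_stages(formula: str) -> list:
    match = NESTED.search(formula)
    if match is None:
        # No nested formula, so there is only one stage.
        return [formula.strip()]
    endog = match.group(1).strip()
    instruments = match.group(2).strip()
    # Stage 0 regresses the bracketed left-hand side on the instruments.
    stage0 = f"{endog} ~ {instruments}"
    # In the final stage the bracketed block is replaced by its left-hand side.
    stage1 = formula[: match.start()] + endog + formula[match.end():]
    return [stage0, stage1.strip()]


stages = split_stages("y ~ a + [b + c ~ z1 + z2]")
print(stages[0])  # b + c ~ z1 + z2
print(stages[1])  # y ~ a + b + c
```

A real implementation would do this in the parser rather than with a regular expression, so that multi-part right-hand sides and operator precedence are handled properly.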

I'm especially keen to know what @bashtage thinks about this, given that this is something he has explored a lot more in linearmodels.

@matthewwardrop matthewwardrop added the enhancement New feature or request label Aug 30, 2020
@matthewwardrop matthewwardrop self-assigned this Aug 30, 2020
@bashtage
Contributor

What is the intention of the first formula? What is exogenous and what is endogenous? Clearly the Z are instruments.

@matthewwardrop matthewwardrop modified the milestones: 0.3.x, 0.3.0 Oct 17, 2021
@matthewwardrop
Owner Author

matthewwardrop commented Mar 16, 2022

Returning to this after several years 😓.

Multi-part formulas are already implemented as of v0.3.0: y ~ a | b | c does the right thing.

@bashtage: If I were to take this further, I'd look to implement something like: y ~ 1 + x1 + x2 + x3 + [x4 + x5 ~ z1 + z2 + z3], exactly as you have done here. The results would be made available on the Structured instance as something like:

.lhs
    y
.rhs
    1 + x1 + x2 + x3 + IV[x4] + IV[x5]
    .iv_x4:
        z1 + z2 + z3
    .iv_x5:
        z1 + z2 + z3

This is within reach of the parser now, but I'd love your take on this (given that you have much more experience in this space).
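As a rough illustration of the structure sketched above, the following dependency-free snippet builds the same shape from already-separated pieces. The function `expand_iv` and the dictionary layout are assumptions for illustration only; they are not Formulaic's API, and a real implementation would return a Structured instance rather than a dict.

```python
# Hypothetical sketch: build the proposed IV structure from lists of names.
def expand_iv(lhs, exog, endog, instruments):
    # Endogenous variables appear in the main equation wrapped as IV[...].
    terms = " + ".join(exog + [f"IV[{name}]" for name in endog])
    rhs = {"terms": terms}
    # Each endogenous variable gets its own first-stage instrument list.
    for name in endog:
        rhs[f"iv_{name}"] = " + ".join(instruments)
    return {"lhs": lhs, "rhs": rhs}


result = expand_iv(
    lhs="y",
    exog=["1", "x1", "x2", "x3"],
    endog=["x4", "x5"],
    instruments=["z1", "z2", "z3"],
)
print(result["rhs"]["terms"])  # 1 + x1 + x2 + x3 + IV[x4] + IV[x5]
print(result["rhs"]["iv_x4"])  # z1 + z2 + z3
```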

@matthewwardrop matthewwardrop changed the title Add support for multi-stage and multi-part formulas. Add support for multi-stage formulas. Mar 16, 2022
@bashtage
Contributor

An advanced syntax would be great. I have a few current uses.

  1. IV like you have above.
  2. Absorbing regression, where high-dimensional fixed effects are absorbed. Something like y ~ x + [eff1 + eff2 + eff3], where the eff# are usually categorical variables that are then encoded to sparse arrays.
  3. Systems of equations. I currently use a dictionary. These models have multiple equations, something like y1 ~ x + z, y2 ~ x + w. Not sure if it would make sense to have something like this as a syntax.

@matthewwardrop
Owner Author

matthewwardrop commented Mar 17, 2022

Nice. I don't yet know how much it makes sense to always have these advanced operators in place (versus having a family of parsers that extend some common set), but I'll definitely be working toward making the parser capable of generating formulae for these kinds of situations.

For further clarity:
On 2. Absorbing regression is just your usual fixed-effects regression, right? Where you demean the data based on a set of covariates prior to modelling, perhaps using another regression? What would you want output in that case? Something like:

.lhs
    y_residuals
    .fixed_effects
         eff1 + eff2 + eff3
.rhs
    x

On 3. Would a Structured instance of a tuple of formulas work? That could be implemented trivially today (either in formulaic or downstream by adding the , operator):

[0]
    .lhs
        y1
    .rhs
        x + z
[1]
    ...
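A minimal string-level sketch of the tuple-of-formulas idea, assuming a top-level `,` separates equations. The name `parse_system` is hypothetical, and the naive split would break on commas inside function calls, which a parser-level `,` operator would handle correctly.

```python
# Hypothetical sketch: split a system of equations into per-equation lhs/rhs.
def parse_system(spec: str) -> tuple:
    equations = []
    for part in spec.split(","):
        # Split each equation on the first "~" only.
        lhs, rhs = part.split("~", 1)
        equations.append({"lhs": lhs.strip(), "rhs": rhs.strip()})
    return tuple(equations)


system = parse_system("y1 ~ x + z, y2 ~ x + w")
print(system[0])  # {'lhs': 'y1', 'rhs': 'x + z'}
print(system[1])  # {'lhs': 'y2', 'rhs': 'x + w'}
```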

@bashtage
Contributor

I haven't really thought about it. I could imagine that formulas could be nested. For example

y ~ 1 + x + [w ~ z]

could be something like

.lhs
   y
.rhs
   1 + x + [w ~ z]

and when you access .rhs it would be [1{Term}, x{Term}, [w ~ z]{Formula}] so that one could handle nested formulas with some recursion, e.g.

for term_or_fmla in formula.rhs.terms:
    if isinstance(term_or_fmla, Term):
        ...  # Do something
    else:
        ...  # Handle nested formula, probably using recursion

Maybe too complicated.
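The recursive handling described above could look roughly like the following. `ToyTerm` and `ToyFormula` are toy stand-ins defined here for illustration; they are not Formulaic's `Term` or `Formula` classes, and the nesting of a formula inside a term list is the assumption being explored.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTerm:
    name: str


@dataclass
class ToyFormula:
    lhs: list  # list of ToyTerm
    rhs: list = field(default_factory=list)  # ToyTerm or nested ToyFormula


def collect_names(node) -> list:
    """Walk a (possibly nested) formula and return all term names in order."""
    if isinstance(node, ToyTerm):
        return [node.name]
    names = []
    # A nested formula is handled by recursing into both of its sides.
    for child in node.lhs + node.rhs:
        names.extend(collect_names(child))
    return names


# y ~ 1 + x + [w ~ z]
nested = ToyFormula(lhs=[ToyTerm("w")], rhs=[ToyTerm("z")])
outer = ToyFormula(lhs=[ToyTerm("y")], rhs=[ToyTerm("1"), ToyTerm("x"), nested])
print(collect_names(outer))  # ['y', '1', 'x', 'w', 'z']
```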

@matthewwardrop matthewwardrop modified the milestones: 0.3.x, 0.4.x Jun 20, 2022
@matthewwardrop matthewwardrop modified the milestones: 0.4.x, 1.1.0 Dec 20, 2023
@GuiMarthe

Just to add to this style of syntax, mlogit uses something similar for multinomial choice models. Not saying it should be implemented here, but there is another use case for the | syntax.
In that literature, y ~ x | z | w is the notation used for the anatomy of utility functions and translates to

choice ~ alternative vars. with generic coefficients |
                individual vars. with specific coefficients |
                alternative vars. with specific coefficients

@matthewwardrop
Owner Author

matthewwardrop commented Mar 8, 2024

@GuiMarthe This is actually already implemented in Formulaic (leaving the interpretation to the calling library).


The wrapping library would then just need to validate that the formula has the expected structure (it could also, if desired, disable the intercept additions in the formula parser).
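The kind of structural validation a wrapping library might perform can be sketched as follows. This is a string-level stand-in with a hypothetical helper name, `split_utility_parts`; a real wrapper would presumably inspect the structured object that Formulaic returns rather than splitting the string itself.

```python
# Hypothetical sketch: check that a formula has exactly three "|"-separated
# right-hand-side parts, as in the mlogit-style notation y ~ x | z | w.
def split_utility_parts(formula: str) -> dict:
    lhs, rhs = formula.split("~", 1)
    parts = [part.strip() for part in rhs.split("|")]
    if len(parts) != 3:
        raise ValueError(f"expected 3 right-hand-side parts, got {len(parts)}")
    keys = ("alt_generic", "individual_specific", "alt_specific")
    return {"choice": lhs.strip(), **dict(zip(keys, parts))}


parts = split_utility_parts("y ~ x | z | w")
print(parts["choice"])               # y
print(parts["individual_specific"])  # z
```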
