Add support for multi-stage formulas. #24

matthewwardrop · 2020-08-30T05:32:51Z

In some of my work I am interested in exploring two-stage least-squares regression on sparse data, and thus in making Formulaic able to handle it nicely.

My plan is to allow formulas of the form:
y ~ a + [b + c ~ z1 + z2] | a + [e + f ~ z1 + z2] | d + [b + c ~ z1 + z2] | d + [e + f ~ z1 + z2]
In my proposed grammar, this would also be equivalent to:
y ~ (a|d) + [b + c | e + f ~ z1 + z2]
Using multipart syntax in the rhs of nested formulas would be forbidden.

The API for accessing the various pieces of this Formula is as yet not fully fleshed out, and naming has not been properly considered, but would be something like:

f = Formula('y ~ (a|d) + [b + c | e + f ~ z1 + z2]')
f.formula_for(rhs_part=0, stage=0)  # b + c ~ z1 + z2
f.formula_for(rhs_part=0, stage=1)  # y ~ a + b + c
f.formula_for(rhs_part=1, stage=0)  # e + f ~ z1 + z2
f.formula_for(rhs_part=1, stage=1)  # y ~ a + e + f
f.formula_for(rhs_part=0) # y ~ a + [b + c ~ z1 + z2]

f = Formula('y ~ x + z')
f.formula_for() # y ~ x + z
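
To make the intended stage semantics concrete, here is a minimal sketch using plain string handling. `split_stages` is a hypothetical helper, not part of Formulaic and not the proposed `formula_for` implementation; it assumes exactly one bracketed block and no multi-part `|` syntax.

```python
from __future__ import annotations

import re

# Hypothetical helper, not part of Formulaic. Handles exactly one
# bracketed `[endog ~ instruments]` block and no multi-part `|` syntax.
NESTED = re.compile(r"\[([^\[\]~]+)~([^\[\]~]+)\]")


def split_stages(formula: str) -> tuple[str, str]:
    """Return (stage 0, stage 1) formula strings for a nested formula."""
    match = NESTED.search(formula)
    if match is None:
        raise ValueError("no nested [lhs ~ rhs] block found")
    endog, instruments = (part.strip() for part in match.groups())
    stage0 = f"{endog} ~ {instruments}"
    # Stage 1 replaces the bracketed block with its left-hand side.
    stage1 = formula[: match.start()] + endog + formula[match.end():]
    return stage0, stage1


stage0, stage1 = split_stages("y ~ a + [b + c ~ z1 + z2]")
print(stage0)  # b + c ~ z1 + z2
print(stage1)  # y ~ a + b + c
```

A real implementation would work on the parsed token tree rather than on strings, so that nesting and multi-part expansion are handled by the grammar itself.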

On a multi-part formula like this one, calls to get_model_matrix will need to specify the part and stage for which the model matrix should be generated. If there is only one part or stage, this will not be necessary. Formulaic will explicitly not attempt to do any modelling itself, and will expect users of the library to do whatever memoisation two-stage least squares requires when passing new data sets through a pre-trained model.
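
As one possible reading of the memoisation left to downstream users, the sketch below stores the first-stage coefficients at fit time and reuses them for new data. It is a toy with one endogenous regressor and one instrument, written in pure Python; `fit_line` and `TwoStageModel` are hypothetical names and nothing here is Formulaic API.

```python
from statistics import fmean


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = fmean(x), fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


class TwoStageModel:
    """Toy 2SLS with one endogenous regressor x and one instrument z."""

    def fit(self, y, x, z):
        # Stage 0: regress the endogenous x on the instrument z.
        self.first_stage = fit_line(z, x)
        # Stage 1: regress y on the fitted values of x.
        self.second_stage = fit_line(self.fitted_x(z), y)
        return self

    def fitted_x(self, z):
        # Reuses the stored first-stage coefficients; nothing is refitted.
        intercept, slope = self.first_stage
        return [intercept + slope * zi for zi in z]


model = TwoStageModel().fit(y=[2, 8, 14, 20], x=[1, 3, 5, 7], z=[0, 1, 2, 3])
print(model.fitted_x([4]))  # [9.0]
```

The point of the design is that new data only ever passes through `fitted_x`, so the stage-1 inputs for a new data set are generated from coefficients learned on the training data.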

I'm especially keen to know what @bashtage thinks about this, given that this is something he has explored a lot more in linearmodels.


bashtage · 2020-09-14T12:39:39Z

What is the intention of the first formula? What is exogenous and what is endogenous? Clearly the Z are instruments.

matthewwardrop · 2022-03-16T21:53:35Z

Returning to this after several years 😓 .

Multi-part formulas are already implemented as of v0.3.0: y ~ a | b | c does the right thing.

@bashtage: If I were to take this further, I'd look to implement something like: y ~ 1 + x1 + x2 + x3 + [x4 + x5 ~ z1 + z2 + z3], exactly as you have done here. The results would be made available on the Structured instance as something like:

.lhs
    y
.rhs
    1 + x1 + x2 + x3 + IV[x4] + IV[x5]
    .iv_x4:
        z1 + z2 + z3
    .iv_x5:
        z1 + z2 + z3

This is within reach of the parser now, but I'd love your take on this (given that you have much more experience in this space).

bashtage · 2022-03-16T23:10:06Z

An advanced syntax would be great. I have a few current uses.

1. IV like you have above.
2. Absorbing regression, where high-dimensional fixed effects are absorbed. Something like y ~ x + [eff1 + eff2 + eff3], where the eff# are usually categorical variables that are then encoded to sparse arrays.
3. Systems of equations. I currently use a dictionary. These models have multiple equations, something like y1 ~ x + z, y2 ~ x + w. Not sure if something like this would make sense to have as a syntax.

matthewwardrop · 2022-03-17T02:12:48Z

Nice. I don't yet know how much it makes sense to always have these advanced operators in place (versus having a family of parsers that extend some common set), but I'll definitely be working toward making the parser capable of generating formulae for these kinds of situations.

For further clarity:
On 2. Absorbing regression is just your usual fixed-effects regression, right? Where you demean the data based on a set of covariates prior to modelling, perhaps using another regression? What would you want output in that case? Something like:

.lhs
    y_residuals
    .fixed_effects
         eff1 + eff2 + eff3
.rhs
    x
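
For reference, the demeaning step described above can be sketched for a single fixed effect as follows. This is a pure-Python illustration with a hypothetical `demean_by_group` helper; with several fixed effects the demeaning generally has to be iterated, which is not shown.

```python
from collections import defaultdict
from statistics import fmean


def demean_by_group(values, groups):
    """Subtract each group's mean from its members (one fixed effect)."""
    by_group = defaultdict(list)
    for value, group in zip(values, groups):
        by_group[group].append(value)
    means = {group: fmean(members) for group, members in by_group.items()}
    return [value - means[group] for value, group in zip(values, groups)]


y = [1.0, 3.0, 10.0, 14.0]
eff1 = ["a", "a", "b", "b"]
print(demean_by_group(y, eff1))  # [-1.0, 1.0, -2.0, 2.0]
```

The same transformation would be applied to each column of the right-hand side before fitting y_residuals on x.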

On 3. Would a Structured instance of a tuple of formulas work? That could be implemented trivially today (either in formulaic or downstream by adding the , operator):

[0]
    .lhs
        y1
    .rhs
        x + z
[1]
    ...
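
A rough sketch of what a `,` operator could produce, using naive string splitting. `split_system` is a hypothetical helper, not Formulaic API, and it would break on commas inside function calls, which a real parser would handle.

```python
from __future__ import annotations


def split_system(spec: str) -> list[dict[str, str]]:
    """Split 'y1 ~ ..., y2 ~ ...' into one lhs/rhs mapping per equation."""
    equations = []
    for equation in spec.split(","):
        lhs, sep, rhs = equation.partition("~")
        if not sep:
            raise ValueError(f"not a two-sided formula: {equation.strip()!r}")
        equations.append({"lhs": lhs.strip(), "rhs": rhs.strip()})
    return equations


for index, equation in enumerate(split_system("y1 ~ x + z, y2 ~ x + w")):
    print(index, equation)
```

Each element can then be handed to the existing single-equation machinery independently.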

bashtage · 2022-03-17T09:47:33Z

I haven't really thought about it. I could imagine that formulas could be nested. For example

y ~ 1 + x + [w ~ z]

could be something like

.lhs
   y
.rhs
   1 + x + [w ~ z]

and when you access .rhs it would be [1{Term}, x{Term}, [w ~ z]{Formula}], so that one could handle nested formulas with some recursion, e.g.

for term_or_fmla in formula.rhs.terms:
    if isinstance(term_or_fmla, Term):
        ...  # Do something with the plain term.
    else:
        ...  # Handle the nested formula, probably using recursion.

Maybe too complicated.
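
A self-contained sketch of that recursion, under the assumption that nested formulas appear as items in the right-hand side. `Term` and `NestedFormula` here are hypothetical stand-ins defined for the example, not Formulaic's own classes.

```python
from __future__ import annotations

from dataclasses import dataclass, field


# Hypothetical stand-ins; these are not Formulaic's own classes.
@dataclass
class Term:
    name: str


@dataclass
class NestedFormula:
    lhs: list[Term]
    rhs: list[Term | NestedFormula] = field(default_factory=list)


def collect_stages(formula, depth=0, stages=None):
    """Walk the right-hand side, recording (depth, lhs names, rhs names)."""
    stages = [] if stages is None else stages
    rhs_names = []
    for item in formula.rhs:
        if isinstance(item, Term):
            rhs_names.append(item.name)
        else:
            # Nested formula: recurse, then stand in its lhs terms.
            collect_stages(item, depth + 1, stages)
            rhs_names.extend(term.name for term in item.lhs)
    stages.append((depth, [term.name for term in formula.lhs], rhs_names))
    return stages


# y ~ 1 + x + [w ~ z]
inner = NestedFormula(lhs=[Term("w")], rhs=[Term("z")])
outer = NestedFormula(lhs=[Term("y")], rhs=[Term("1"), Term("x"), inner])
print(collect_stages(outer))
# [(1, ['w'], ['z']), (0, ['y'], ['1', 'x', 'w'])]
```

Inner stages are emitted before the formulas that depend on them, which is the order in which a multi-stage estimator would need to fit them.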

GuiMarthe · 2024-03-01T20:55:50Z

Just to add to this style of syntax, mlogit uses something similar for multinomial choice models. Not saying it should be implemented here, but there is another use case for the | syntax.
In that literature, y ~ x | z | w is the notation used for the anatomy of utility functions, and translates to

choice ~ alternative vars. with generic coefficients |
                individual vars. with specific coefficients |
                alternative vars. with specific coefficients

matthewwardrop · 2024-03-08T05:51:29Z

@GuiMarthe This is actually already implemented in Formulaic (leaving the interpretation to the calling library).

The wrapping library would then just need to validate that the formula has the expected structure (it could also, if desired, disable the intercept additions in the formula parser).
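
A minimal sketch of such a structural check, written against plain strings so that it does not assume anything about Formulaic's API. `split_parts`, `validate_choice_formula`, and the part names are hypothetical; a real wrapper would inspect the parsed formula instead, since naive splitting on `|` breaks if that character appears inside a term.

```python
from __future__ import annotations


def split_parts(formula: str) -> tuple[str, list[str]]:
    """Split 'lhs ~ p1 | p2 | ...' into the lhs and its rhs parts."""
    lhs, sep, rhs = formula.partition("~")
    if not sep:
        raise ValueError("expected a two-sided formula containing '~'")
    return lhs.strip(), [part.strip() for part in rhs.split("|")]


def validate_choice_formula(formula: str) -> dict[str, str]:
    """Check for exactly three rhs parts and label them mlogit-style."""
    lhs, parts = split_parts(formula)
    if len(parts) != 3:
        raise ValueError(f"expected 3 right-hand-side parts, got {len(parts)}")
    names = ("alt_generic", "individual", "alt_specific")
    return {"choice": lhs, **dict(zip(names, parts))}


print(validate_choice_formula("choice ~ x | z | w"))
```

Keeping the interpretation in the wrapper means the formula parser itself stays agnostic about what each part means.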

matthewwardrop added the enhancement New feature or request label Aug 30, 2020

matthewwardrop self-assigned this Aug 30, 2020

matthewwardrop modified the milestones: 0.3.x, 0.3.0 Oct 17, 2021

matthewwardrop changed the title ~~Add support for multi-stage and multi-part formulas.~~ Add support for multi-stage formulas. Mar 16, 2022

matthewwardrop modified the milestones: 0.3.x, 0.4.x Jun 20, 2022

matthewwardrop linked a pull request Sep 25, 2022 that will close this issue

Add support for nested formulae (useful e.g. in IV contexts). #108

Open

matthewwardrop modified the milestones: 0.4.x, 1.1.0 Dec 20, 2023
