
Assigning a state to a record, randomly, based on probability of the record being from particular states (versus distributing portions of records to states) #2

Open
donboyd5 opened this issue Dec 23, 2017 · 7 comments

Comments

@donboyd5
Collaborator

I am moving Dan's comments on this point here. Here is his initial comment (PSLmodels/taxdata#138 (comment)).

It is possible to assign a single state to each record in an unbiased manner. The way I have done this is to calculate a probability of a record being in each of the 50 states, and assign it to one of those states in proportion to those probabilities. That is, if a record has high state income tax the procedure will show high probabilities for New York, California, etc. and low probabilities (but not zero) for Florida and Texas. Then the computer will select New York or California with high probability and Florida or Texas with low probability. In expectation the resulting totals will be the same as in the "long" format, but with some unbiased error. I have done this and find that state-level aggregates match nearly as well as summing over all possible states. If desired, one could take 2 draws, or any other number. It would not be necessary to multiply the workload by 51.

@donboyd5
Collaborator Author

Ernie asked for code that does this. Here is the code Dan provided (PSLmodels/taxdata#138 (comment)).

I only have Stata code. The variables below (cumulative, onestate, etc.) are vectors with an element for each record. p1 through p51 are the estimated probabilities of a record being in each state. r is a random value uniform on [0, 1].

gen cumulative = 0
gen byte onestate = 1
quietly {
    forvalues state = 1/51 {
        * running total of the record's state probabilities p1..p51
        replace cumulative = cumulative + p`state'
        gen cum`state' = cumulative
        * push the assignment forward while the cumulative probability is still
        * below the uniform draw r; onestate ends up as the first state whose
        * cumulative probability reaches r
        replace onestate = `state' + 1 if cum`state' < r
    }
}

quietly {
    forvalues state = 1/52 {
        * `vars' is a local macro holding the list of variables to split
        foreach v in `vars' {
            generate p`v'`state' = `v'*p`state'
        }
    }
}

It should be straightforward to do this in R or any language.

dan
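For reference, a minimal sketch of the same draw in Python/numpy (this is not code from the thread; probs is a random stand-in for the estimated probabilities p1-p51, and all names are illustrative):

import numpy as np

rng = np.random.default_rng(0)
n_records, n_states = 150_000, 51
# stand-in for the estimated probability of each record being in each state;
# rows sum to 1
probs = rng.dirichlet(np.ones(n_states), size=n_records)

# one uniform draw per record, then pick the first state whose cumulative
# probability reaches r (the same logic as the Stata loop above)
r = rng.uniform(size=(n_records, 1))
cumulative = np.cumsum(probs, axis=1)
onestate = (cumulative < r).sum(axis=1) + 1   # 1-based state index, 1..51

In expectation each record lands in a state in proportion to its probabilities, matching the description in the first comment.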

@donboyd5
Collaborator Author

donboyd5 commented Dec 23, 2017

Dan,

Thanks. A few questions:

  • I presume the probabilities come from a multinomial logit, similar to some of our prior discussions? Or did you generate them some other way?

  • The cumulative method would seem to favor whatever states were ordered earliest (e.g., alpha order), since you make the state assignment (in essence) as soon as you've accumulated probabilities for a given record that sum to at least the value of r, the uniform random variable. Is my interpretation correct? If so, does that concern you? If so, would it make sense, for each record, to order the states by descending probability? (The ordering of states would differ from record to record.)

  • I don't understand what (or why) you're doing in your second step. It looks like you're taking each variable (such as wages?) and creating 52 new variables, one for each state, with the amount of the variable given to the state based upon that state's probability of being chosen. Am I interpreting this correctly? I don't follow why you do this. Or are the "vars" different weight variables, and you are distributing them to states according to the probabilities, in which case this would be an alternative to the single-state assignment?

  • Let's say you did the single-state assignment based upon probabilities, and did it 10 times, each time with different r values and/or without replacement, so that a state already assigned to a record could not be assigned to that record again in a later draw. (A sketch of this follows the list.) This would yield a file with 1.5m records (if we started with 150k records), with the most populous states generally having the most records (right?). (Perhaps we did it 10 times because we wanted to ensure that the smaller states had "enough" records, somehow defined.)

  • If we did that, and we summarized the file (after scaling weights appropriately), we would find that our sums for each state and income range do not hit our desired targets, although they might not be terribly far off because we based them on reasonably estimated probabilities. Those of us who want a file that hits those targets (certainly me) would want a second stage, where we adjusted weights to hit the targets. So I think of the approach you have defined above as a first stage, similar in objective to my first stage (a simple scaling of weights), but instead using probabilities. But we still need a second stage. Presumably we like the weights we have obtained in the first stage so we would adjust the weights in a way that minimizes a penalty function based on how far the weights move from the first-stage weights. Do you agree?
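For illustration, a hedged sketch of the multiple-draw idea in the fourth bullet, in Python/numpy (mine, not from the thread; the probabilities and weights are random stand-ins, the even weight split is an assumption, and the record count is kept small here rather than the roughly 150k on the real file):

import numpy as np

rng = np.random.default_rng(0)
n_records, n_states, n_draws = 1_000, 51, 10
probs = rng.dirichlet(np.ones(n_states), size=n_records)   # stand-in probabilities
weight = rng.uniform(100, 500, size=n_records)              # stand-in record weights

# draw 10 distinct states per record, in proportion to the record's probabilities
states = np.empty((n_records, n_draws), dtype=int)
for i in range(n_records):
    states[i] = rng.choice(n_states, size=n_draws, replace=False, p=probs[i])

# split each record's weight evenly across its 10 record-state pairs, giving a
# file 10 times as long; high-probability (generally populous) states receive
# the most pairs
split_weight = np.repeat(weight / n_draws, n_draws)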

@feenberg

feenberg commented Dec 23, 2017 via email

@donboyd5
Collaborator Author

donboyd5 commented Dec 23, 2017

Yes, unfortunately, the fact that we don't have state codes on the newer files means that we can't estimate probabilities that way. What do you think of approaches that try to construct a measure of how "close" a record is to a particular state?

Let's take the $50-75k income range as an example:

  1. Using the SOI summary data, get measures of what average returns in this range look like in each state: the % of returns that have wages and the average wages of those that do; the % with capital gains and the average capital gains of those that do; the % with interest income and the average interest income of those that do; the % with a SALT deduction and the average SALT deduction of those that do; and so on.

  2. For each of the (approx.) 40k PUF records in this range, compute Euclidean distances from each of the 50 states on the 20 or so attributes defined in step 1. (If the record has no SALT deduction, then on the SALT-proportion component it will be farther from NY, where let's say the proportion is 30%, than from MS, where let's say it is 10%. If the record has capital gains and the amount is large, then on the two capital-gains components it will be closer to NY, where the capital-gains % is, let's say, 20% and the average is high, than to MS.) We end up with 50 distances for each of 40k records. (A sketch of this calculation follows the list.)
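Here is a sketch of step 2 in Python/numpy (my illustration; the inputs are random stand-ins and the standardization choice is an assumption, not something settled in this thread):

import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
n_records, n_states, n_attrs = 40_000, 50, 20
state_profile = rng.uniform(size=(n_states, n_attrs))   # stand-in for the step-1 state measures
record_attrs = rng.uniform(size=(n_records, n_attrs))   # stand-in for the same measures on each PUF record

# standardize each attribute so dollar amounts do not swamp the percentages,
# then compute the Euclidean distance from every record to every state profile
mean, std = record_attrs.mean(axis=0), record_attrs.std(axis=0)
distance = cdist((record_attrs - mean) / std, (state_profile - mean) / std)   # 40k x 50

How to translate these distances into probabilities (or into penalties) is the question taken up in the rest of this comment and in the replies below.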

This has information that, I would argue, is meaningful - information that max entropy would not have.

The question, then, is what to do with this information?

One approach would be to use it in a first stage, similar to the probability approach Dan outlined above, to assign a state to each record (perhaps multiple times, as he notes). We would (or at least I would) still need a second stage to adjust weights to hit targets - an NLP (nonlinear programming) approach.

Another approach is to use the distance measures as a component of the objective function (the penalty function to be minimized) so that distributing weight to a low-distance state is penalized less than distributing weight to a high-distance state.
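One hedged way to write that down (my notation, not something from the thread): let w_{is} be the weight of record i distributed to state s, w_i the record's national weight, d_{is} the record-to-state distance, x_{ij} targeted variable j on record i, and T_{sj} the SOI target for state s and variable j. A possible second-stage problem is

\min_{w \ge 0} \sum_i \sum_s d_{is} \, w_{is}^2 \quad \text{subject to} \quad \sum_s w_{is} = w_i \ \forall i, \qquad \sum_i w_{is} \, x_{ij} = T_{sj} \ \forall s, j,

so that weight pushed toward a high-distance state costs more than weight pushed toward a low-distance state; a penalty on deviations from first-stage weights, as discussed earlier in the thread, could be blended into the same objective.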

Thoughts?

@feenberg

feenberg commented Dec 23, 2017 via email

@donboyd5
Collaborator Author

donboyd5 commented Dec 24, 2017

I've created a new issue to take up the question of how to implement a maximum-entropy constrained NLP approach, as it is really not about assigning a state randomly to a record (this issue), and deserves a full discussion: #3.

Before that, a few quick comments back:

"You want to use the count and amount for each of a dozen or more state by income class aggregates. That will be the basis of any imputation with no help from SOI."

I want to start by creating targets (constraints) based upon counts and amounts for aggregates from the publicly available SOI summaries at https://www.irs.gov/statistics/soi-tax-stats-historic-table-2. That provides a rich set of targets. It's not that I don't want help from SOI. If they (the people at SOI) can provide better information for targeting, or if you, through your work with them, can develop better information for targeting, then that is a big plus. I am a believer in iterative refinement: we should start by doing the best we can with what we have now. If we can get better information from the people at SOI, then one of the iterative improvements would be to incorporate that information when it becomes available.

"You don't say how you would translate Euclidean distances into probabilities. Any way you do it, if you assign taxpayers to the state most like themselves, the variation within a state will be attenuated."

Yes, but isn't that true of any approach that assigns records to states based upon probability rather than distributing portions of records to states (the latter allows portions of low-probability records to be distributed to a state, the former does not), including the approach you outlined? My intent, however we define probabilities or distance, would be to distribute portions of records to states, rather than to uniquely assign each record to a state, to avoid this problem.

@donboyd5 donboyd5 changed the title Assigning a state to a record, randomly, based on probability of the record being from particular states Assigning a state to a record, randomly, based on probability of the record being from particular states (versus distributing portions of records to states) Dec 24, 2017
@feenberg

feenberg commented Dec 24, 2017 via email
