Assigning a state to a record, randomly, based on probability of the record being from particular states (versus distributing portions of records to states) #2
Ernie asked for code that does this. Here is the code Dan provided (PSLmodels/taxdata#138 (comment)):

I only have Stata code. The variables below (cumulative, onstate, etc.) are

gen cumulative = 0
quietly {

It should be straightforward to do this in R or any language.

dan
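The Stata listing did not survive the copy above, so here is a minimal sketch of the random-assignment step as I read Dan's description (a Python translation, not his actual code; the probability matrix and all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(20171223)  # fixed seed so the draws are reproducible

def assign_states(probs):
    """Draw one state per record in proportion to its state probabilities.

    probs : (n_records, n_states) array whose rows sum to 1.
    Returns the index of the drawn state for each record (roughly the
    role of the `onstate` variable in the Stata code).
    """
    # running sum of probabilities across states, like the `cumulative` variable
    cum = np.cumsum(probs, axis=1)
    draw = rng.random((probs.shape[0], 1))
    # the drawn state is the first one whose cumulative probability
    # reaches the uniform draw
    return np.minimum((draw > cum).sum(axis=1), probs.shape[1] - 1)

# toy example: 3 records, 4 hypothetical states
p = np.array([[0.70, 0.10, 0.10, 0.10],
              [0.10, 0.60, 0.20, 0.10],
              [0.25, 0.25, 0.25, 0.25]])
print(assign_states(p))
```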
Dan,
Thanks. A few questions:
* I presume the probabilities come from a multinomial logit, similar to some of our prior discussions?
On Sat, 23 Dec 2017, Don Boyd wrote:
Dan,
Thanks. A few questions:
* I presume the probabilities come from a multinomial logit, similar to some of our prior
discussions?
Yes. I ran a logit on the <200K taxpayers and then used that to impute a
state of residence. I then tested the state aggregates for a few variables
such as tax paid on the <200K and >200K taxpayers separately. It did
well for the estimation sample, and poorly for the >200K sample. So I do
not propose to use the coefficients from 2008 on the 2011 PUF. That is why
I am working on the MaxEnt procedure, and also why I have applied to SOI
to run the logit regressions on recent confidential data. I have heard
back from Barry that he intends to respond favorably to that application,
but hasn't done anything definite.
dan
Yes, unfortunately, the fact that we don't have state codes on the newer files means that we can't estimate probabilities that way. What do you think of approaches that try to construct a measure of how "close" a record is to a particular state? Let's take the $50-75k income range as an example:

1. Using the SOI summary data, get measures of what average returns in this range look like in each state: the % of returns that have wages and the average wages of those that do; the % that have capgains and the average capgains of those that do; the % that have interest income and the average interest income of those that do; the % that have a SALT deduction and the average SALT deduction of those that do; and so on.
2. For each of the (approx) 40k PUF records in this range, compute Euclidean distances from each of the 50 states on the 20 or so attributes defined in step 1. (If the record has no SALT deduction, then the SALT-proportion component of its distance will be farther from NY, where let's say the proportion is 30%, than from MS, where let's say the proportion is 10%. If the record has capgains and the amount is large, then the 2 capgains components of the distance will be closer to NY, where the capgains % is, let's say, 20% and the capgains average is high, than to MS.) We end up with 50 distances for each of the 40k records.

This has information that, I would argue, is meaningful - information that max entropy would not have. The question, then, is what to do with this information? One approach would be to use it in a first stage, similar to the probability approach Dan outlined above, to assign a state to each record (perhaps multiple times, as he notes). We would (or at least I would) still need a 2nd stage, to adjust weights to hit targets - an NLP approach. Another approach is to use the distance measures as a component of the objective function (the penalty function to be minimized), so that distributing weight to a low-distance state is penalized less than distributing weight to a high-distance state. Thoughts?
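A minimal sketch of step 2, assuming a state-profile table from step 1 (one row per state, the ~20 count/amount attributes as columns) and a PUF extract with matching columns; all names here are illustrative, not actual PUF or SOI variable names:

```python
import numpy as np
import pandas as pd

def state_distances(puf, state_profiles):
    """Euclidean distance from each PUF record to each state's profile.

    puf            : DataFrame, one row per record, columns = the ~20
                     attributes from step 1 measured on the record
    state_profiles : DataFrame, one row per state (index = state code),
                     same columns, built from the SOI summary data
    Returns a records-by-states DataFrame of distances.
    """
    # standardize each attribute so no single component (e.g. average
    # capgains, which is in dollars) dominates the distance
    mu = state_profiles.mean()
    sd = state_profiles.std().replace(0, 1.0)
    x = ((puf[state_profiles.columns] - mu) / sd).to_numpy()   # (n_records, n_attrs)
    s = ((state_profiles - mu) / sd).to_numpy()                # (n_states, n_attrs)

    diff = x[:, None, :] - s[None, :, :]                       # pairwise differences
    d = np.sqrt((diff ** 2).sum(axis=2))                       # (n_records, n_states)
    return pd.DataFrame(d, index=puf.index, columns=state_profiles.index)
```

For the roughly 40k records in the income range and 50 states, this yields the 40,000 x 50 matrix of distances described in step 2.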
On Sat, 23 Dec 2017, Don Boyd wrote:
Yes, unfortunately, the fact that we don't have state codes on the newer files means that we
can't estimate probabilities that way.
I assume by "that way" you mean logit.
What do you think of approaches that try to construct a measure of how "close" a record is to
a particular state? Let's take the $50-75k income range as an example:
1. Using the SOI summary data, get measures of what average returns in this range look like in each
state: the % of returns that have wages and the average wages of those that do; the % that have
capgains and the average capgains of those that do; the % that have interest income and the average
interest income of those that do; the % that have a SALT deduction and the average SALT deduction
of those that do; and so on.
You want to use the count and amount for each of a dozen or more state by
income class aggregates. That will be the basis of any imputation with no
help from SOI.
2. For each of the (approx) 40k PUF records in this range, compute Euclidean distances from each of the 50
states on the 20 or so attributes defined in step 1. (If the record has no SALT deduction, then the
SALT-proportion component of its distance will be farther from NY, where let's say the proportion is 30%,
than from MS, where let's say the proportion is 10%. If the record has capgains and the amount is
large, then the 2 capgains components of the distance will be closer to NY, where the capgains % is,
let's say, 20% and the capgains average is high, than to MS.) We end up with 50 distances for each
of the 40k records.
This has information that, I would argue, is meaningful - information that max entropy would not have.
MaxEnt has all of that information if it is in the constraints. We want to
maximize the entropy of the probability assignments constrained by the
necessity of matching the published aggregates for multiple variables. So
the information is used.
The question, then, is what to do with this information?
One approach would be to use it in a first stage, similar to the probability approach Dan outlined above, to
assign a state to each record (perhaps multiple times). We would (or at least I would) still need a 2nd stage, to
adjust weights to hit targets - an NLP approach.
You don't say how you would translate Euclidean distances into
probabilities. Any way you do it, if you assign taxpayers to the state
most like themselves, the variation within a state will be attenuated.
Each state will have taxpayers that look like the average for that state.
The MaxEnt approach ensures that each state has the amount of variation
that the data implies. Since we don't have specific information on the
within state variation, the within state variation is that which results
when no additional information beyond the taxpayer data and state level
aggregates is imposed on the result. Using the minimum Euclidean distance
imposes a strong requirement on the result.
Another approach is to use the distance measures as a component of the objective function (the penalty
function to be minimized) so that distributing weight to a low-distance state is penalized less than
distributing weight to a high-distance state.
MaxEnt makes the best estimate of the proportion of good to bad
taxpayer to state matches that can be made without additional information.
dan
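For reference, the MaxEnt procedure Dan sketches in this reply can be written roughly as the following program (my notation, not his): let $p_{is}$ be the probability that record $i$ belongs to state $s$, $w_i$ the record's national weight, $x_{ik}$ its value for variable $k$, and $A_{sk}$ the published aggregate of variable $k$ for state $s$. Then

$$
\max_{p}\; -\sum_{i,s} w_i\, p_{is} \ln p_{is}
\quad\text{subject to}\quad
\sum_{s} p_{is} = 1 \;\;\forall i,
\qquad
\sum_{i} w_i\, p_{is}\, x_{ik} = A_{sk} \;\;\forall s,k.
$$

That is, maximize the entropy of the probability assignments subject to hitting the published state aggregates; nothing beyond the taxpayer data and the state-level totals is imposed on the within-state variation.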
I've created a new issue to take up the question of how to implement a maximum-entropy constrained NLP approach, as it is really not about assigning a state randomly to a record (this issue), and deserves a full discussion: #3. Before that, a few quick comments back:
I want to start by creating targets (constraints) based upon counts and amounts for aggregates from the publicly available SOI summaries at https://www.irs.gov/statistics/soi-tax-stats-historic-table-2. That provides a rich set of targets. It's not that I don't want help from SOI. If they (the people at SOI) can provide better information for targeting, or if you, through your work with them, can develop better information for targeting, then that is a big plus. I am a believer in iterative refinement: we should start by doing the best we can with what we have now. If we can get better information from the people at SOI, then one of the iterative improvements would be to incorporate that information when it becomes available.
Yes, that's true of any approach that assigns whole records to states based upon probability rather than distributing portions of records to states (the latter allows portions of low-probability records to be distributed to a state; the former does not), isn't it, including the approach you outlined? My intent, however we define probabilities or distances, would be to distribute portions of records to states, rather than to uniquely assign records to a state, to avoid this problem.
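To make the distinction concrete, here is a small sketch of the two options under the same illustrative assumptions as the earlier sketches (a record-by-state probability matrix; none of these names come from the thread). Distributing splits each record's weight across all states in proportion to its probabilities, so even low-probability states receive a small share; assigning gives the whole weight to one randomly drawn state.

```python
import numpy as np

def distribute_weights(weights, probs):
    """Split each record's weight across states in proportion to its
    probabilities (the distributed-portions, or 'long', option)."""
    return weights[:, None] * probs                    # (n_records, n_states)

def assign_weights(weights, probs, rng):
    """Give each record's full weight to one randomly drawn state
    (the single-assignment option discussed in this issue)."""
    n, k = probs.shape
    states = (rng.random((n, 1)) > np.cumsum(probs, axis=1)).sum(axis=1)
    states = np.minimum(states, k - 1)                 # guard against round-off
    out = np.zeros((n, k))
    out[np.arange(n), states] = weights
    return out
```

Both produce the same expected state totals; the assignment version adds unbiased sampling error, which is the trade-off being discussed here.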
On Sun, 24 Dec 2017, Don Boyd wrote:
I've created a new issue to take up the question of how to implement a
maximum-entropy constrained NLP approach, as it is really not about
assigning a state randomly to a record (this issue), and deserves a
full discussion: #3.
Before that, a few quick comments back:
You want to use the count and amount for each of a dozen
or more state by income class aggregates. That will be the
basis of any imputation with no help from SOI.
I want to start by creating targets (constraints) based upon counts
and amounts for aggregates from the publicly available SOI summaries
at https://www.irs.gov/statistics/soi-tax-stats-historic-table-2. That
provides a rich set of targets. It's not that I don't want help from
SOI. If they (the people at SOI) can provide better information for
targeting, or if you, through your work with them, can develop better
information for targeting, then that is a big plus. I am a believer in
iterative refinement: we should start by doing the best we can with
what we have now. If we can get better information from the people at
SOI, then one of the iterative improvements would be to incorporate
that information when it becomes available.
I have no confidence that Barry will come through on his promise, so I
would never suggest we wait on him. Eventually we might have his
cooperation.
You don't say how you would translate Euclidean distances
into probabilities. Any way you do it, if you assign
taxpayers to the state most like themselves, the variation
within a state will be attenuated.
Yes, that's true of any approach that assigns whole records to states
based upon probability rather than distributing portions of records to
states (the latter allows portions of low-probability records to be
distributed to a state; the former does not), isn't it, including the
approach you outlined? My intent, however we define probabilities or
distances, would be to distribute portions of records to states, rather
than to uniquely assign records to a state, to avoid this problem.
I guess my response is that I expect the best way to distribute the
weights will be given by MaxEnt, unless some additional information is
available.
dan
I am moving Dan's comments on this point here. Here is his initial comment (PSLmodels/taxdata#138 (comment)):
It is possible to assign a single state to each record in an unbiased manner. The way I have done this is to calculate a probability of a record being in each of the 50 states, and assign it to one of those states in proportion to those probabilities. That is, if a record has high state income tax the procedure will show high probabilities for New York, California, etc. and low probabilities (but not zero) for Florida and Texas. Then the computer will select New York or California with high probability and Florida or Texas with low probability. In expectation the resulting totals will be the same as the "long" format, but with some unbiased error. I have done this and find that state-level aggregates match nearly as well as summing over all possible states. If desired, one could take 2 draws, or any other number. It would not be necessary to multiply the workload by 51.
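A sketch of the multiple-draws idea at the end of Dan's comment, under the same illustrative assumptions as the earlier sketches: take a small number of independent draws per record and give each draw an equal share of the record's weight, which shrinks the (still unbiased) sampling error without carrying 51 copies of every record.

```python
import numpy as np

def multi_draw_state_weights(weights, probs, n_draws=2, seed=20171224):
    """Average n_draws independent random state assignments, giving each
    draw 1/n_draws of every record's weight."""
    rng = np.random.default_rng(seed)
    n, k = probs.shape
    out = np.zeros((n, k))
    cum = np.cumsum(probs, axis=1)
    for _ in range(n_draws):
        states = (rng.random((n, 1)) > cum).sum(axis=1)
        states = np.minimum(states, k - 1)      # guard against round-off
        out[np.arange(n), states] += weights / n_draws
    return out
```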