Fleshing out a maximum-entropy constrained NLP approach fully enough to allow implementation #3

Open
donboyd5 opened this issue Dec 24, 2017 · 0 comments

donboyd5 commented Dec 24, 2017

Summary of previous relevant discussion at #2.

In #2 we discussed assigning PUF records to individual states, either based on probability that a record in the PUF actually is from a particular state or based on a measure of the record's (Euclidean) distance from summary characteristics of records from individual states, which is really a similar concept.

Probabilities might be estimated from other, similar microdata that include state codes, using a multinomial logit approach. At the moment, we probably don't have data that are sufficient for this.

Distances might be estimated by comparison to summary state-level data such as those at https://www.irs.gov/statistics/soi-tax-stats-historic-table-2, which loses the richness of microdata but has the advantage of feasibility (we have the necessary data).
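To make the distance idea concrete, here is a minimal sketch in Python. It is not from the discussion in #2: the state profiles, the two-variable record, and the inverse-distance conversion to shares are all hypothetical choices made for illustration, and the variables are assumed to be already standardized so that no single one dominates the distance.

```python
import math

def euclidean_distance(record, state_profile):
    """Euclidean distance between a record and a state's summary characteristics."""
    return math.sqrt(sum((r - s) ** 2 for r, s in zip(record, state_profile)))

def distance_shares(record, profiles, eps=1e-9):
    """Turn distances into shares that sum to 1 (closer state -> larger share).

    Inverse-distance weighting is an assumption made for this sketch;
    other mappings from distance to share are possible.
    """
    inverse = {
        state: 1.0 / (euclidean_distance(record, profile) + eps)
        for state, profile in profiles.items()
    }
    total = sum(inverse.values())
    return {state: value / total for state, value in inverse.items()}

# Hypothetical standardized state profiles (e.g. scaled wages, scaled interest income).
profiles = {"NY": [1.0, 0.5], "TX": [0.2, 0.1], "CA": [0.9, 0.7]}
record = [0.95, 0.55]

shares = distance_shares(record, profiles)
closest_state = max(shares, key=shares.get)
```

The resulting shares can play the same role as estimated probabilities in the assignment or distribution steps discussed next.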

We discussed two further, related, issues:

  1. If we only assign each record to its highest-probability or closest-distance state, then each state will look like its average taxpayer and we will lose important variation that exists in the real world, because we do not include low-probability records. This is undesirable. We discussed two ways to avoid this:
    (a) Assign records to states randomly in a manner that makes it likely that records are assigned to high-probability states, but allows them to be assigned to low-probability states (this is the Stata code that Dan provided; the mechanism was assigning states to records rather than records to states, but it is the same thing). This assignment can be repeated multiple times. Or,
    (b) Distribute portions of records to states based upon probabilities (or distances), so that each record can be assigned to multiple states, with higher portions likely to go to the high probability (or low distance) states. This allows portions of low-probability records to be distributed to states, so that we get variation.

  2. Let's assume we have addressed the first point, either by multiple assignments of records to states or by distribution of portions of records to states. We now have a file that in some sense is representative of the 50 states. It would have more records than the initial file, which has, let's say, 150k records. If we used the assignment approach 10 times, it would have 1.5 million records. If we used the distribution approach and included all 50 states in each record's distribution, it would have 7.5 million records. The records generally would be consistent with the characteristics of states, with variation. But there is no reason to believe this file would hit the targets we have for the 50 states from the SOI summary data, although it should have moved in that direction.
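The two options in point 1, and the record counts in point 2, can be sketched as follows. This is only an illustration in Python, not Dan's Stata code: the function names, the tiny two-record file, and the choice to split each record's weight equally across repeated draws are assumptions.

```python
import random

def assign_randomly(records, probs, n_draws, seed=0):
    """Option (a): draw a state for each record n_draws times, in proportion
    to its probabilities. Each draw carries weight / n_draws (an assumption)
    so that the total weight of the file is preserved."""
    rng = random.Random(seed)
    out = []
    for rec_id, weight in records:
        states = list(probs[rec_id])
        p = [probs[rec_id][s] for s in states]
        for state in rng.choices(states, weights=p, k=n_draws):
            out.append((rec_id, state, weight / n_draws))
    return out

def distribute_portions(records, probs):
    """Option (b): split each record's weight across all states by probability."""
    return [
        (rec_id, state, weight * p)
        for rec_id, weight in records
        for state, p in probs[rec_id].items()
    ]

# Hypothetical file: (record id, weight), with per-record state probabilities.
records = [(1, 100.0), (2, 250.0)]
probs = {
    1: {"NY": 0.7, "TX": 0.1, "CA": 0.2},
    2: {"NY": 0.2, "TX": 0.5, "CA": 0.3},
}

assigned = assign_randomly(records, probs, n_draws=10)
distributed = distribute_portions(records, probs)

# File sizes from point 2, for a 150k-record starting file.
n_records = 150_000
size_assignment = n_records * 10    # 10 repeated assignments
size_distribution = n_records * 50  # every record split across 50 states
```

Under option (b) a low-probability state still receives a small portion of every record, which is what preserves the variation.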

For people who want a file that hits known/estimated totals (me), this is a problem to be solved. We talked about adjusting record weights from this point using a constrained NLP approach to ensure that targets are hit. Dan proposed a maximum-entropy objective function.
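As a starting point for discussion, here is one way the reweighting step could look for a single state. The exact objective Dan has in mind is not spelled out here, so this sketch assumes a common cross-entropy form: choose new weights w_i that minimize sum_i [w_i * ln(w_i / w0_i) - w_i + w0_i] subject to sum_i w_i * x_ij = t_j for each target j. Under that assumption the solution has the form w_i = w0_i * exp(sum_j lambda_j * x_ij), and the multipliers can be found with Newton's method. The data, function names, and tolerances are hypothetical, and a real implementation would more likely hand the problem to an NLP solver.

```python
import math

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / m[r][r]
    return x

def maxent_reweight(x, w0, targets, tol=1e-8, max_iter=100):
    """Return weights w_i = w0_i * exp(lambda . x_i) that hit the targets."""
    k = len(targets)
    lam = [0.0] * k

    def weights(l):
        return [w * math.exp(sum(lj * xij for lj, xij in zip(l, xi)))
                for xi, w in zip(x, w0)]

    def gap(l):
        # Weighted totals minus targets; all zeros when the targets are hit.
        w = weights(l)
        return [sum(wi * xi[j] for wi, xi in zip(w, x)) - targets[j]
                for j in range(k)]

    for _ in range(max_iter):
        g = gap(lam)
        if max(abs(v) for v in g) < tol:
            break
        w = weights(lam)
        hess = [[sum(wi * xi[a] * xi[b] for wi, xi in zip(w, x))
                 for b in range(k)] for a in range(k)]
        step = solve_linear(hess, g)
        # Halve the Newton step until the squared gap shrinks.
        t, norm0 = 1.0, sum(v * v for v in g)
        while t > 1e-6:
            trial = [l - t * s for l, s in zip(lam, step)]
            if sum(v * v for v in gap(trial)) < norm0:
                break
            t *= 0.5
        lam = [l - t * s for l, s in zip(lam, step)]
    return weights(lam)

# Hypothetical state file: each row is [1 (return count), income in $100k].
x = [[1.0, 0.2], [1.0, 0.5], [1.0, 0.8], [1.0, 1.5], [1.0, 3.0]]
w0 = [100.0] * 5          # initial weights: 500 returns, 600 total income
targets = [520.0, 650.0]  # made-up targets for returns and total income

new_w = maxent_reweight(x, w0, targets)
```

The exponential form keeps every weight positive, and records with no pull toward the targets stay close to their initial weights. Variables should be scaled (here, income in $100k) so the exponent stays small, and the targets must be jointly achievable for a solution to exist.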

Is this an accurate summary? I'll propose some next steps but would love to see feedback first.

donboyd5 changed the title from "Fleshing out a maximum-entropy approach" to "Fleshing out a maximum-entropy constrained NLP approach" on Dec 24, 2017
donboyd5 changed the title from "Fleshing out a maximum-entropy constrained NLP approach" to "Fleshing out a maximum-entropy constrained NLP approach fully enough to allow implementation" on Dec 24, 2017