This is a project to create Taxsim 22-variable files from the March CPS files. I start with IPUMS data. While a program (cps229.sas) exists, it has not been fully debugged and tested. I became discouraged when the following problems became apparent:

DEPSTAT: This is an indicator that is supposed to point from the dependent to the taxpayer, but it has several problems beyond the difficulty of knowing whether a person receives half or more of his support from the taxpayer:

1) Available 2004+ only.
2) For 2004-2005 the taxpayer's spouse is (always?) shown as a dependent, which is not consistent with the tax law.
3) For 2006 all married taxpayers are listed as dependents of each other, which is confusing as well as wrong.
4) For 2004-2006 many taxpayers are shown as dependent on themselves. In 2004 and 2005 many of these are very young, with a momloc or poploc shown and usually little or no income. In 2006 this appears to indicate a child with significant income.
5) For 2007+ this seems correct.
6) For all years, DEPSTAT points to the person line number (lineno) rather than to the person number (pernum), making for rather complex programming to unite parents and children.

Proposed corrections: set DEPSTAT to zero for all married persons; for persons shown as dependent on themselves, set DEPSTAT to zero if they have income, and to momloc or poploc if they don't. An individual with no income and no parent (no momloc or poploc) would be an uncorrectable error.

FILESTAT: The correspondence between AGI and filing isn't consistent through time. There is legitimate ambiguity as to which low-income individuals will file, and the CPS changes its decision rule for low-income individuals through time. But sometimes high-income individuals are not shown as filing, which can't be right.

7) For 2007+ a small percentage of non-filers have non-zero AGI and FEDTAX.
Of course, a small AGI is compatible with non-filing; however, 47 of these have AGI so high that it is top-coded, and anyone with FEDTAX should surely file.
8) For 2004 and 2005 there are 74,000+ non-filers with AGI of zero, and only 7 (2004) and 4,472 (2005) filers with AGI of zero. For all other years there are 25,000 to 42,000 filers with AGI of zero.

Provided one doesn't try to count filers, and calculates taxes for all individuals with non-zero, non-missing income, no correction is needed.

PROPTAX and HOUSRET: It isn't clear where these values come from; they are some sort of imputation, and there is no indication of what HOUSRET is intended to proxy. It can be negative, so perhaps it includes a capital gain or loss. Oddly, the values for both of these items are repeated for all dependents of the taxpayer, at the same value imputed to the taxpayer. That is, for a taxpayer imputed $1,000 in PROPTAX and with 2 dependents, each dependent will also have $1,000 for PROPTAX. These are the only two tax variables treated in this way. This oddity seems to be consistent from 1992 on.

Summary: It is not possible to use the information in the tax variables of the March CPS to form tax families, and any attempt to do so will lead to inconsistent treatment across time. Therefore, it makes sense to form your own tax families, using the relationship information that IPUMS provides consistently through the years in the momloc and poploc variables. However, this sacrifices any ability to allocate adult dependents to taxpayers.

Comparison of Aggregates: We can compare the aggregate income and tax liability from the tax data to known aggregates; however, since the CPS values are top-coded, it can be important to take that into consideration. Using the IRS Statistics of Income Division Public Use Files (PUF), I can top-code samples of actual tax returns to create top-coded versions of the public use files, and from them top-coded aggregates.
I do this in two ways. First, I simply drop all values over $99,997. Second, I replace all top-coded values with $99,997 and drop all missing values.

The results are surprising. In 1998, income and tax in the CPS are about half what they are in the comparably top-coded PUF. In all other years, the CPS values for tax are 20% to 35% higher than the PUF values for the top-coded sample, and 5% to 23% higher for the sample where top-coded values are dropped entirely. I don't find these differences large, considering the difficulties in survey data; however, the large year-to-year variation is a concern.

Daniel Feenberg
August 2010
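
As an illustration of the two top-code treatments described under Comparison of Aggregates, here is a minimal sketch in Python. It is not the code used for the comparison above. The function names, the use of None for a missing value, the optional weights argument, and the sample amounts are all assumptions made for the example; only the $99,997 threshold comes from the text. The first function also skips missing values, on the assumption that they cannot enter a sum.

```python
TOP_CODE = 99_997  # top-code threshold quoted in the text


def aggregate_dropping_topcoded(values, weights=None, cap=TOP_CODE):
    """First treatment: drop every value above the cap, then sum.

    Missing values (None) are skipped as well (an assumption).
    """
    if weights is None:
        weights = [1.0] * len(values)  # unweighted by default (assumption)
    return sum(v * w for v, w in zip(values, weights)
               if v is not None and v <= cap)


def aggregate_capping_topcoded(values, weights=None, cap=TOP_CODE):
    """Second treatment: replace values above the cap with the cap,
    drop missing values (None), then sum."""
    if weights is None:
        weights = [1.0] * len(values)  # unweighted by default (assumption)
    return sum(min(v, cap) * w for v, w in zip(values, weights)
               if v is not None)


# Made-up amounts for illustration only; not CPS or PUF data.
sample = [20_000, 150_000, None, 99_997, 250_000]
dropped = aggregate_dropping_topcoded(sample)
capped = aggregate_capping_topcoded(sample)
```

Applying the same function to both the CPS values and the PUF values gives two aggregates that have been top-coded in the same way and can be compared directly. In practice the sample weights from each file would be passed in rather than left at the default.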