SOI State Aggregates by Income Class

Each year the Statistics of Income (SOI) division of the Internal Revenue Service provides aggregate values for a number of tax statistics by state and income class. These are presented as publication-quality tables in published volumes and on the Taxstats website, but this format is not friendly for analysis over time. After saving the Excel files in CSV format, the problems revealed are:
  1. In early years, the data are divided into two arbitrary panels.
  2. Numeric fields are often contaminated by footnotes.
  3. Suppression indicators are placed in the spreadsheet cell format.
  4. Indicators for disclosure suppression are inconsistent or unclear.
  5. Suppressed values are aggregated into non-adjacent cells.
  6. Placement of the value from suppressed cells is inconsistent or unclear.
  7. Zero is confused with disclosure suppression in some years.
  8. Variables have descriptions but no names until 2012.
  9. Variable descriptions are unstable.

Here we provide the data for 51 states (plus "other") for 1997 through 2015 in a single file, using a panel format. That is, each record in the file refers to a single year, state and income class and provides values for 60-109 variables. Many variables repeat every year, but many are transient or appear only in recent years. Within a year, the list of variables is the same for all states. Income class is specified by separate variables for the lower and upper limit of AGI. This panel format is natural for regressions where each state by year by income class can be treated as an observation. Addressing the complication that the income classes change in both nominal and real values is left as an exercise for the reader. For cross-year comparisons the user may wish to interpolate "virtual" real thresholds by assuming a density function.
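As a rough illustration of the record layout, the sketch below reads a few rows in this panel format and indexes them by year, state and income class. It is only a sketch under stated assumptions: the three rows are copied from the Wyoming 2014 listing further down, the helper names (parse_value, load_panel) are hypothetical, and the representation of a missing value in the .csv (an empty field or ".") is an assumption that should be checked against the actual file.

```python
import csv
import io

# Three rows copied from the Wyoming 2014 listing in this document.
# Assumption: a missing value appears in the .csv as an empty field or ".".
SAMPLE = """year,state,lowerlim,upperlim,a03220
2014,WY,500000,1000000,
2014,WY,500000,1e20,30
2014,WY,1000000,1e20,
"""

KEY_FIELDS = ("year", "state", "lowerlim", "upperlim")


def parse_value(text):
    """Return a float, or None when the field is missing."""
    text = text.strip()
    return None if text in ("", ".") else float(text)


def load_panel(handle):
    """Index records by (year, state, lowerlim, upperlim)."""
    panel = {}
    for row in csv.DictReader(handle):
        key = (int(row["year"]), row["state"],
               float(row["lowerlim"]), float(row["upperlim"]))
        # Everything that is not part of the key is a tax variable.
        panel[key] = {name: parse_value(value)
                      for name, value in row.items()
                      if name not in KEY_FIELDS}
    return panel


panel = load_panel(io.StringIO(SAMPLE))
print(panel[(2014, "WY", 500000.0, 1e20)]["a03220"])  # 30.0
```

With the real file, the io.StringIO(SAMPLE) argument would be replaced by an open file handle on the .csv.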

In the published tables for 1997-2003 some cells are suppressed, and the contents of those cells are present only in the summary total. We replace the suppressed value with a missing value indicator, further unbalancing the panel, but in a natural way. Note that suppression indicates a "very small" number of taxpayers, but not necessarily a small dollar amount.

After 2003 suppressed cells are combined with "adjacent" cells, but the exact meaning of "adjacent" is a bit unclear. I have done my best to determine what the actual combined categories are and have added an additional record for each such situation. My correspondence with SOI to clarify the grouping rules is here, here and here.

In our file all the unsuppressed variables share a common record for each income class, just as with the earlier data. For any suppressed data there is an additional record with the correct income range. For example, in Wyoming 2014 the million dollar+ field was combined with 500K-1,000K in the published table for variable A03220. So we have no data for either of those ranges alone, but we do have the combined amount. Here we add an additional record with the known lower and upper limits and the known value of A03220:

    use aggregates
    keep if state=="WY"&year==2014&lowerlim>=500000
    list year state lowerlim upperlim a03220

        year   state   lowerlim   upperlim   a03220
        2014      WY     500000    1000000        .
        2014      WY     500000       1e20       30
        2014      WY    1000000       1e20        .

Here we can see that A03220 is suppressed for the highest income class, so it is missing in that record. Instead of adding the suppressed amount to the 500000-1000000 income class, as is done in the published data, we add an additional record with a larger income range just for A03220 and leave the other variables missing. If no variables are suppressed, there are no extra records. If two variables have the same pattern of suppression, they are represented in a single additional record. Note that the dollar values are never changed or interpolated. All that happens is that the tab variable (the income range) is shown correctly.
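One way to work with these extra records is to pick, for a given variable, only the records where that variable is present, and then check that the selected income ranges do not overlap. The following is a minimal sketch, not the conversion program itself: the rows are the three Wyoming 2014 records from the listing above, None stands in for a missing value, and the function names (classes_for, has_overlap) are hypothetical.

```python
# Rows from the Wyoming 2014 listing; None marks a missing value.
records = [
    {"lowerlim": 500000, "upperlim": 1000000, "a03220": None},
    {"lowerlim": 500000, "upperlim": 1e20, "a03220": 30},
    {"lowerlim": 1000000, "upperlim": 1e20, "a03220": None},
]


def classes_for(records, variable):
    """Return (lowerlim, upperlim, value) for every record in which
    the variable is present, sorted by lower limit."""
    found = [(r["lowerlim"], r["upperlim"], r[variable])
             for r in records if r[variable] is not None]
    return sorted(found)


def has_overlap(classes):
    """True if any two income ranges in the sorted list overlap.
    With ranges sorted by lower limit it is enough to compare
    each range with the one that follows it."""
    return any(prev[1] > cur[0] for prev, cur in zip(classes, classes[1:]))


a03220 = classes_for(records, "a03220")
print(a03220)               # [(500000, 1e+20, 30)]
print(has_overlap(a03220))  # False
```

For A03220 only the combined 500000+ record survives, so summing the selected values does not double count the suppressed amount.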

This presentation has the advantage that each record displays a correct amount, without modifying footnotes, and can therefore be used with confidence for tabbing other data or in a regression.

Some cells are shown only as "less than .5", in which case we have recorded .25 as a numeric value. There are no other instances of fractional values. In some years true zeroes are indicated by a footnote. We have recoded those as "0".
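Because .25 is the only fractional value in the file, a user who wants to treat the "less than .5" cells differently can recognize them by value. The sketch below is one possible approach rather than part of the distributed files: the names LESS_THAN_HALF and bounds are hypothetical, None stands in for a missing value, and the interval from 0 to .5 is simply a reading of "less than .5".

```python
# Value recorded in the file for cells published as "less than .5".
LESS_THAN_HALF = 0.25


def bounds(value):
    """Return (low, high) for the published cell.

    The placeholder .25 becomes the interval 0 to .5, any other
    value is returned as an exact amount, and a missing value
    (None) stays missing."""
    if value is None:
        return None
    if value == LESS_THAN_HALF:
        return (0.0, 0.5)
    return (value, value)


print(bounds(0.25))  # (0.0, 0.5)
print(bounds(30.0))  # (30.0, 30.0)
print(bounds(None))  # None
```

Carrying the interval instead of the point value makes it easy to check whether a result is sensitive to the choice of .25.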

The suppression rules lead to complex patterns of footnotes and values. Take for example variable MVITA for Wyoming 2015. There are 40 returns reported for the sum of the 2 lowest income categories, then 20 returns for the next 2. Then a "20" in the box labeled "25,000 less than $50,000" covers all income categories above $25,000 EXCEPT "$200,000 less than $500,000" which is reported as a true 0. In our conversion this last bit is not done as specified. Our output format would not support income ranges with holes, so $200,000+ is reported as zero, and $25,000-$200,000 as "20". In the MVITA case this is probably closer to the truth (do we really believe there are returns prepared by military volunteers for taxpayers with AGI greater than a million dollars?) but that may not always be the case. Users of this data are urged to be on the lookout for errors that may have been introduced by my conversion program misinterpreting the suppression fields. If you do see anything suspicious in these files, please compare to the tables online at SOI before contacting me.

Variable names are translated to lowercase for .dta and .csv.

We are aware that the published tables are suitable for visual inspection. These tables are for computer consumption.


Daniel Feenberg

Last revised 24 July 2018 by drf