Taxsim supplement
to the
Survey of Consumer Finances
Provided in Stata, SAS and CSV formats
The SCF is a survey of income and wealth conducted by the US Federal Reserve
Board every three years. It is particularly appropriate for tax analysis
because it over-samples high-income households, which are the source of
the majority of tax revenue. Here we offer small files with just the
taxsim-relevant variables, and large files with those variables joined
to the full SCF public use files. Although the SCF began in 1963, here
we have only 1989 and later.
Downloads
There is taxsim data for each SCF from 1989-2019. These files
contain the necessary variables to submit to the taxsim32
calculator to calculate federal income tax liability.
The files "byhousehold" have one record per SCF respondent. Since a single
respondent may encompass several taxpaying units, the taxsim data is
aggregated to the household level. Variables that don't make sense to be
summed are omitted. These files have all the SCF variables, so even if
you don't want taxsim, they may be convenient for Stata or other
packages that can't read sas7bdat format.
The files "bytaxunit" include a record for each combination of household,
replication and taxunit. This means that you need to be careful comparing
taxsim variables with household variables. For example, if you add an SCF
income item over all records, that double counts the income of households
with more than one tax unit. Nor would you want to find an average tax
rate by dividing tax liability by household income.
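The double counting described above can be seen in a small made-up
example. This is only a sketch: the records and the variable name
hhincome are invented for illustration and are not taken from the SCF
files. It counts each household-level value once by tagging a single
record per y1.

```stata
* Toy data: household 11 has one tax unit, household 21 has two.
* hhincome is a hypothetical household-level income item.
clear
input long y1 byte taxunit double hhincome double fiitax
11 0 50000 5000
21 1 90000 6000
21 2 90000 3000
end

* Naive sum: household 21's income is counted once per tax unit.
quietly summarize hhincome
display "naive total household income:   " r(sum)

* Tag one record per y1 and sum only the tagged records.
egen first = tag(y1)
quietly summarize hhincome if first
display "correct total household income: " r(sum)
```

Tax-unit variables such as fiitax can still be summed over all records,
since each tax unit appears only once.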
Note that the SCF does not provide child care expense, so dep13 isn't
provided. Nor are the business tax deduction variables available in
2019. As a result -taxsim27.ado- and -taxsim32.ado- should generate the
same tax liability.
Recall that naive calculations of the standard error will be incorrect
because each actual interview is divided into 5 records (implicates) to
allow the calculation of standard errors for imputed variables. The SCF
website has a document about calculating standard errors.
If you are using Stata you may wish to install -micombine- to simplify
the calculation:
ssc install micombine
An example showing the calculation of total federal income tax
liability by filing status is simply:
log using example1,text replace
use "https://www.nber.org/taxsim/to-taxsim/scf27-32/taxsim/txpydata19"
generate weight=x42001/5
taxsim32,replace full
table mstat [pw=weight],c(sum fiitax)
Taxsim only uses 32 variables, while the ./bytaxunit and ./byhousehold
files include all the thousands of variables in the SCF with the
addition of computed variables suitable for taxsim. None of the SCF
variables can go directly into taxsim. The ./taxsim directory provides
just the taxsim variables in the ./taxsim/txpydataNN.dta files.
In the example, dividing the original weight by 5 compensates for the
five implicates in the calculation of the sum, and since fiitax is a
tax-unit variable there is no problem with multiple taxunits in a single
household.
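The effect of dividing the weight by 5 can be checked on made-up data.
The sketch below assumes two households with five implicates each, every
record carrying the same x42001 and the same fiitax across implicates.
All values are invented for illustration.

```stata
* Toy data: 2 households x 5 implicates. Each household stands for
* 100 households in the population (x42001 = 100 on every record).
clear
set obs 10
generate long yy1 = ceil(_n/5)
generate byte rep = _n - 5*(yy1 - 1)
generate long y1 = 10*yy1 + rep
generate double x42001 = 100
generate double fiitax = 1000*yy1      // same tax in every implicate

generate double weight = x42001/5
generate double wtax_raw = x42001*fiitax
generate double wtax = weight*fiitax

* The true population total is 100*1000 + 100*2000.
quietly summarize wtax_raw
display "total with original weight: " r(sum)   // five times too large
quietly summarize wtax
display "total with weight/5:        " r(sum)
```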
But perhaps you will need something from the SCF. Here we
tabulate by "Expect inheritance" (x5819). Note the -set maxvar- line:
the full file has more variables than Stata's default limit allows.
log using example,text replace
set maxvar 8000
use "https://www.nber.org/taxsim/to-taxsim/scf27-32/bytaxunit/dta/scftax19"
generate weight=x42001/5
taxsim32,replace full
table x5819 [pw=weight],c(sum fiitax)
Because of the multiple implicates, the -table- command isn't
sufficient to get the correct standard error of an estimate. The -micombine-
command is an improvement on multiplying the estimated standard error
by the square root of 5. Here we find the standard error of the mean
of income tax liability with:
micombine regress fiitax [aw=x42001],obsid(y1) impid(rep)
This takes advantage of the fact that with no independent variables, the
constant (_cons) is just the mean. A multivariate regression just
adds independent variables. Here we regress liability on AGI.
micombine regress fiitax v10 [aw=x42001],obsid(y1) impid(imp)
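For readers who want to see what combining across implicates involves,
here is a sketch of Rubin's rules on made-up numbers. My understanding
is that -micombine- combines estimates along these lines, but treat that
as an assumption and check its help file. The five estimates and
standard errors below are invented, not SCF results.

```stata
* Hypothetical per-implicate estimates of a mean and their standard errors.
clear
input byte rep double est double se
1 10.2 1.10
2  9.8 1.05
3 10.5 1.20
4 10.0 1.00
5  9.9 1.15
end

generate double se2 = se^2
quietly summarize est
scalar qbar = r(mean)            // combined point estimate
scalar B = r(Var)                // between-implicate variance
quietly summarize se2
scalar W = r(mean)               // average within-implicate variance
* Rubin's total variance with m = 5 implicates
scalar T = scalar(W) + (1 + 1/5)*scalar(B)
display "combined estimate: " scalar(qbar)
display "combined std err:  " sqrt(scalar(T))
```

The combined standard error is never smaller than the square root of
the average within-implicate variance, because the between-implicate
term is added on top.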
Packages -scfses- and -scfcombo- take account of the complex sample
design, which -micombine- does not. This requires merging the weighting
matrix with the survey. Here we do the same regression, with standard
errors corrected for survey design:
set maxvar 10000
use "http://www.nber.org/taxsim/to-taxsim/scf27-32/wts/dta/p19_rw1"
merge 1:m y1 using "http://www.nber.org/taxsim/to-taxsim/scf27-32/orig/dta/p19i6"
generate weight = x42001/5
generate rep = y1 - 10*int(y1/10)
scfses fiitax [pw=weight],p(mean)
scfcombo fiitax v10 [aw=weight],command(reg)
So far the examples are based on tax filing units. For households,
we need to sum the taxunit variables to the level of the survey record.
collapse (mean) x5729 weight rep (sum) fiitax,by(y1)
scfcombo fiitax x5729 [aw=weight], command(reg)
Within each level of y1, x5729 will be constant, so the mean will do
for that variable. The sum of fiitax over y1 will give total tax
liability in that household. Clustered standard errors are left as
an exercise for the reader.
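The -collapse- step can be tried on a few made-up records before running
it on the real file. The values below are invented. Only the variable
names come from the example above.

```stata
* Toy data: record 11 has two tax units, record 21 has one.
clear
input long y1 byte taxunit double x5729 double weight double fiitax
11 1 90000 20 6000
11 2 90000 20 3000
21 0 50000 30 5000
end

* Household-level items are constant within y1, so take the mean;
* tax-unit liabilities are summed to the household.
collapse (mean) x5729 weight (sum) fiitax, by(y1)
list, noobs
```

After the collapse there is one record per y1, with fiitax equal to the
household's total liability.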
Notes for merging ../txpydataNN.dta with ../orig:
- x42001 sums to the number of respondents in the universe.
- y1 indexes household*implicate and identifies each record in ./wts.
- yy1 indexes household.
- rep indexes implicate number (1-5).
- taxsimid is 1000*y1+100*imp+taxunit and identifies each record in
  ./txpydata and ./bytaxunit.
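The relation between the identifiers can be sketched as follows. The
implicate number is taken as the last digit of y1, as in the -generate
rep- line earlier. That yy1 is the remaining leading digits is my
assumption here, and the y1 values are made up.

```stata
* Hypothetical y1 values; long storage keeps large IDs exact.
clear
input long y1
11
15
4305
end

generate long yy1 = int(y1/10)     // assumed: household = leading digits
generate byte rep = mod(y1, 10)    // implicate = last digit

* Same result as the expression used earlier in this file.
assert rep == y1 - 10*int(y1/10)
assert y1 == 10*yy1 + rep
list, noobs
```

Storing IDs as long (or double) rather than the default float matters
once values such as taxsimid exceed what a float can hold exactly.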
The SAS program frbscftax.sas is by Kevin Moore of the Federal Reserve
Board, and I thank him for making it available. His comments on the
taxsim integration follow: ...
1) In my SAS program, if you set HTAXFILE=YES then the program looks
for the file from TAXSIM, reads it in and then creates three different
datasets (see line 3839 in the SAS program). Basically the three
datasets are 1) tax units, 2) combine all the tax units in the primary
economic unit (PEU) back into a household, 3) same as 2), but also adds
any tax units from the non-primary economic unit (NPEU) back into the
household. The PEU is the main household in the survey, the NPEU consists
of individuals who are not financially dependent on the PEU. An example
is a financially independent sibling (over 18) that lives with his brother
and the brother's spouse or partner. I know it may seem a bit confusing,
but I was trying to give users as much flexibility as possible.
2) So your Stata code in the readme file will produce dataset 3), which
includes PEU and NPEU tax units. If you didn't want to keep the NPEU
tax units, you could only keep observations from the TAXSIM file where the
last digit of the taxsimid is less than 3, as that would exclude all
the NPEU tax units. Tax units from the PEU have a last taxsimid digit of
0, 1, or 2, where 0=tax unit and household are the same, and 1 or 2 means
the original household has been split into 2 tax units.
...
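Following the comment above, dropping the NPEU tax units might look like
this in Stata. It is a sketch on made-up taxsimid values and relies only
on the last-digit rule quoted above.

```stata
* Hypothetical taxsimid values: last digit 0, 1 or 2 is a PEU tax unit,
* 3 or more is an NPEU tax unit.
clear
input long taxsimid
11100
21101
21102
21103
end

generate byte lastdigit = mod(taxsimid, 10)
keep if lastdigit < 3
list, noobs
```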
Sources:
https://www.federalreserve.gov/econres/scfindex.htm
Much more documentation is available at that URL.
URL of this directory:
http://www.nber.org/taxsim/to-taxsim/scf27-32
Daniel Feenberg
feenberg@nber.org
May 28, 2021