Guidelines for non-NBER users of -taxpuf.ado-

From time to time I have run regressions or tables on the SOI Individual Income Tax Public Use Files for other researchers.

These guidelines are for researchers sending me Stata .do files to run against the SOI PUF files kept here at NBER. I'd like to keep my role quick and mechanical so that turnaround for users is minimal.

I will send each such user a .zip file with test data. Once you are satisfied with the tests, send me the .do file and I will run it against the full sample, including more recent files, and return the results to you. Turnaround should be a day or less, but no promises.


The sample file of test data is fully random. The range and mean of each variable will be similar to the real data, but there will be zero correlation among the variables or across years. Tax calculations will not provide sensible aggregates. The only function of the test data is to allow you to confirm that you are not referring to missing data or confusing codes and values. Note that different years have slightly different variable sets. Information about the NBER collection of SOI PUF data is at:

The original SOI files cannot be used as input for taxsim, which requires files that are more uniform through time.

SOI files did not name the variables until recently, and data elements were inconsistent through time. The NBER has created .dta files with a highly consistent naming (actually numbering) convention through time for a subset of the original variables. Wages are always "data11". Full long-term gains are calculated by dividing the SOI-supplied amount by 1.0, 0.5, or 0.4 and are stored as "data70". Similar calculations are done for various items subject to a floor or ceiling.
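The gains adjustment can be illustrated with a small sketch. This is not the NBER code: the variable names soi_ltgain and inclusion are hypothetical, the three records are made up, and the year cutoffs that pick each divisor are assumptions for illustration only. Only the divisors (1.0, 0.5, 0.4) and the names data70 and data103 come from these guidelines.

```stata
* Made-up records: file year and the long-term gain as supplied by SOI.
clear
input data103 soi_ltgain
1975  500
1982  400
1990 1000
end

* ASSUMPTION: which years use which divisor is illustrative, not NBER's rule.
generate double inclusion = 1.0
replace inclusion = 0.5 if data103 <= 1978
replace inclusion = 0.4 if inrange(data103, 1979, 1986)

* Full long-term gain: the supplied amount divided by the included share.
generate double data70 = soi_ltgain / inclusion
list data103 soi_ltgain inclusion data70
```

In the NBER .dta files data70 is already computed, so a submitted .do file would not normally need this step.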

Tax Calculator

-taxpuf.ado- is a tax calculator that can use the PUF data. Note that the better-known Internet Taxsim operates with a subset of the data, while -taxpuf.ado- uses all the data available in the PUF, with the variable names established at NBER back in 1976.

From within Stata, install taxpuf with:

    net from ""
    net install taxpuf, all


Your .do file should be named "", where N is an integer that increments with each submission and userid is an id that will help me keep track of who is submitting what. For example "" would be Joseph's first attempt.

Your program should start with a log command:

    log using joseph1, text replace

This will help me keep track of what is happening. I will be keeping the .do files, but please send a complete file each time; don't ask me to edit your earlier file. The "text replace" options are important to me.

Programs may read only from /home/data/soi/taxsim/dta. Use a macro to specify the filename so that changing to the NBER system requires minimal edits:

    local taxfile "D:\Data\tax\randomtax"
    if c(username)=="feenberg" {
        local taxfile "/home/data/soi/taxsim/dta/s2008"
    }
    ...
    use "`taxfile'"
    keep if data103==2008

With this specification, the directory for SOI data will be set correctly without my editing the program. The -keep- line selects a single year from the random test file, which holds all years. Also, it is simple for me to change from the subsets to the full dataset by changing "s" to "x" in one place if the program appears to be working.

You can read multiple years in a loop:

    clear
    forvalues i=1965/1991 {
        append using /home/data/soi/taxsim/dta/s`i'
    }

data103 always gives the file year. The file of random data will include records for all years.

Programs may write only to the current directory or /tmp (Stata temp files are ok).

If you want .dta files returned, please place a -summarize- command after the -save- command so that I can look in the log to see what is happening. Name the files so that I can zip them up easily:

    zip a joseph*

Before sending me a program, look it over carefully for signs of code that is specific to your system, such as directory names, system commands, etc. Do not use uppercase letters or spaces in filenames! That will make me mad.
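The -save- followed by -summarize- pattern requested above might look like the sketch below. The toy data and the output name joseph1_results are hypothetical; the name simply starts with the userid so that a joseph* wildcard picks it up, and it avoids uppercase letters and spaces.

```stata
* Toy data standing in for results built from the PUF (values are made up).
clear
set obs 5
generate data11 = 1000 * _n
generate data103 = 2008

* Hypothetical output name: starts with the userid, lowercase, no spaces.
save joseph1_results, replace

* Summarize right after saving so the log shows what went into the file.
summarize
```

Saving to a bare filename writes to the current directory, which is one of the two locations programs are allowed to write to.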

This is a new service, so expect changes in the guidelines as I gain experience.

Daniel Feenberg