Self-Censoring Stata Output
CMS requires that statistical tables produced from Medicare billing data
not include any results based on cells with fewer than 10 respondents. In
this project we attempt to provide some Stata procedures that directly
enforce this requirement and report only a missing value code for those
cells. Additionally, the equivalent restriction is imposed on dummy variables
in regression output.
These programs are intended to make examination of output for compliance with CMS
standards faster and more reliable, and to prevent inadvertent violations of the
standard. They are not intended to prevent deliberate disclosures, and a determined
user could certainly pass a disclosure through this filter. Since the user already
has access to the original micro-data, however, there is no incentive to disclose
via the more complicated official release channel.
These programs depend on the global macro $mincellsize, which defaults to
10. Any cell based on fewer than $mincellsize records is replaced with the .s
missing value code.
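The suppression rule can be illustrated with a minimal sketch in plain Stata. This is not the code used by -stable-. It assumes only the nlsw88 example data shipped with Stata, and the variable names mean_wage and n_cell are made up for the illustration.

```stata
* Illustrative sketch only; not the implementation used by -stable-.
sysuse nlsw88, clear

* Fall back to the documented default if the global is not set.
if "$mincellsize" == "" global mincellsize 10

* One row per grade: the cell mean and the number of records behind it.
* mean_wage and n_cell are hypothetical names chosen for this example.
collapse (mean) mean_wage = wage (count) n_cell = wage, by(grade)

* Replace any statistic based on too few records with the .s code.
replace mean_wage = .s if n_cell < $mincellsize

* Show only the censored statistic, not the cell counts themselves.
list grade mean_wage, noobs
```

The cell counts are deliberately left out of the listing, since a count below the threshold is itself the kind of result the rule is meant to withhold.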
Each table they create includes the line ".s indicates a statistic based on fewer
than $mincellsize records". If no other tables are included in $release.log, then
that file should be suitable for release. Exceptions would be if similar tables with
cells differing by a single record were produced, or if there were published
aggregates for the sample. These are things that should be obvious to any reviewer.
Some of these programs append censored output to a log file, whose name (less
filetype) is given by the global macro $release, and whose default value is "release".
This is in addition to any log file maintained by the user's program itself.
Appending allows output from multiple programs to accumulate in a single release.log
file, but it means starting a new log file is the responsibility of the user.
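One way such an append-only release log could be handled is sketched below with Stata's named logs. How the programs here actually open the file is not documented in this page, so the log name "rel" and the explicit erase step are assumptions made for the illustration.

```stata
* Illustrative sketch only; the log name "rel" is a hypothetical choice.
if "$release" == "" global release release

* Starting a fresh release log is left to the user: erase any old copy.
capture erase "${release}.log"

* Open a second, named log alongside whatever log the user already has.
* With the append option the file is created if it does not yet exist.
log using "${release}.log", append text name(rel)
display "censored output would be written here"
log close rel
```

Because the log is opened with its own name, closing it leaves the user's own log file untouched.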
Programs producing non-ASCII output have options and defaults of their own for
directing output. These procedures do not edit program output. They either redo the
computations with additional restrictions (-stable- and -ssummtab-) or edit saved
results (-sreg-). Parsing Stata log files is difficult and unreliable.
The Programs available now
-stable-
-stable- is a modification of the Stata -table- command. It works like
-table- except that the "row" and "column" options are not supported.
This is fortuitous, since with totals available some suppressed cells
could be reconstructed by subtracting the remaining cells from the
total. -stable- places output in the user's log file and in
$release.log. For example, the commands:
sysuse nlsw88
stable grade, contents(mean wage iqr wage max tenure)
yield the output:
-------------------------------------------------
  current |
    grade |
completed | mean(wage)    iqr(wage)   max(tenure)
----------+--------------------------------------
        0 |         .s           .s            .s
        4 |   3.011271     .6441219            .s
        5 |         .s           .s            .s
        6 |    3.82026     1.513687            .s
        7 |   3.797682     1.731077            .s
        8 |      5.437     2.986808            .s
        9 |   5.655415     2.198068            .s
       10 |   4.692721     2.553535            .s
       11 |   5.688235     2.801929            .s
       12 |   6.638048     3.663443            .s
       13 |   8.315217      4.49275            .s
       14 |   9.130599     4.806763            .s
       15 |   9.885779      4.42029            .s
       16 |   9.806044     5.809176            .s
       17 |   10.43081     6.070848            .s
       18 |   11.60784     4.609798            .s
-------------------------------------------------
.s indicates fewer than 3 records
Note that statistics max, min, median, first and last are always suppressed, while
interquartile range is allowed. I think that conforms to the spirit of the
regulations, though perhaps not the letter.
-ssummtab-
-ssummtab- is a modification of the SSC -summtab- command for
creating publication quality summary statistics in Word or Excel
formats, as would typically accompany a paper with regression
results. If any cells are missing, all cells in that column will be
missing, suggesting the table is not yet suitable for publication.
-ssummtab- includes an option to name the output file but does not put
any results in either log, as it has no option for ASCII output. For
example, the commands:
keep if race==2
ssummtab, by(union) mean word replace contvars(age married grade south wage hours tenure)
yield the output (after conversion from xlsx to html with st):
                         | nonunion       | union
                         | (N = 16)       | (N = 8)
-------------------------+----------------+---------
age in current year      |                |
  Mean (SD)              | 39.06 (3.64)   | .s (.s)
married                  |                |
  Mean (SD)              | 0.69 (0.48)    | .s (.s)
current grade completed  |                |
  Mean (SD)              | 13.44 (4.47)   | .s (.s)
lives in south           |                |
  Mean (SD)              | 0.19 (0.40)    | .s (.s)
hourly wage              |                |
  Mean (SD)              | 9.46 (5.90)    | .s (.s)
usual hours worked       |                |
  Mean (SD)              | 39.88 (10.50)  | .s (.s)
job tenure (years)       |                |
  Mean (SD)              | 4.28 (3.22)    | .s (.s)
-sreg-
-sreg- does not do regressions, but creates reports based on
information stored by Stata from most (all?) regression commands.
Running -sreg- after a regression command will append the results to
$release.log with non-releasable factor variable parameter estimates
suppressed. The suppressed variables would include _cons and any
factor variable with fewer than 10 non-zero or zero values in the
estimation sample. At this time -sreg- does not yet examine
non-factor variables so some care is required before output is
released. An enhancement to detect non-factor variable dummies is
feasible. For example, the commands:
sysuse nlsw88
regress wage tenure i.grade
sreg
yield the output:
------------------------------------------------------------------------------
        wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      tenure |   .1070933   .0346732     3.09   0.002     .0389881    .1751984
             |
       grade |
           4 |         .s          .        .       .            .           .
           6 |    .569052   4.819114     0.12   0.906     -8.89666    10.03476
           7 |    1.07241   4.751872     0.23   0.822    -8.261224    10.40604
           8 |   .8121087   4.638867     0.18   0.861    -8.299561    9.923779
           9 |   1.669255   4.581578     0.36   0.716    -7.329887     10.6684
          10 |     1.1163   4.577486     0.24   0.807    -7.874805    10.10741
          11 |   2.001984   4.554536     0.44   0.660    -6.944043    10.94801
          12 |   2.413749   4.520139     0.53   0.594    -6.464716    11.29221
          13 |   4.101939   4.575359     0.90   0.370    -4.884988    13.08887
          14 |   5.155073   4.562176     1.13   0.259     -3.80596    14.11611
          15 |   6.131421   4.622188     1.33   0.185    -2.947488    15.21033
          16 |   8.011438   4.559408     1.76   0.079    -.9441576    16.96703
          17 |   6.613028   4.607801     1.44   0.152    -2.437623    15.66368
          18 |   8.428397   4.599476     1.83   0.067    -.6059005    17.46269
       _cons |         .s          .        .       .            .           .
------------------------------------------------------------------------------
.s indicates fewer than 3 records
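The cell-size test for factor variables described above can be sketched as follows. This is a hedged illustration, not the code inside -sreg-: it only reports which levels of grade would fall below the threshold, and the local macro names n1 and n0 are made up for the example.

```stata
* Illustrative sketch only; not the implementation used by -sreg-.
sysuse nlsw88, clear
if "$mincellsize" == "" global mincellsize 10
regress wage tenure i.grade

* For each level of the factor variable, count estimation-sample records
* where the implied dummy is one (n1) and where it is zero (n0).
quietly levelsof grade if e(sample), local(levels)
foreach g of local levels {
    quietly count if e(sample) & grade == `g'
    local n1 = r(N)
    local n0 = e(N) - `n1'
    if `n1' < $mincellsize | `n0' < $mincellsize {
        display "grade `g': coefficient would be suppressed"
    }
}
```

Only the names of the affected levels are displayed, not the counts, so the check itself does not reveal small cells.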
Users of -outreg- or -estout- can use -sreg- to process the VCV matrix
before those procedures see it. For example:
global release results1
reg y x dummy
sreg
outreg2 using results2
estimates store m
estout m using results3,style(html)
Ritchie points out that merely removing the coefficient for the
constant is actually sufficient to prevent disclosure and would be
considerably cheaper, as detecting dummy variables is potentially
expensive.
To Do
-table- and -summtab- are .ado files, with source supplied by StataCorp,
making them feasible to modify. Only a few lines were changed in each
program and it is likely that many other Stata programs can be treated
similarly. Please contact me with suggestions. -sreg- works because Stata
has standard return values for estimation commands, and those matrices
can be modified for printing. Many programs are Stata built-ins that cannot
be modified and do not return any results in machine-readable format.
Those I can't do much with.
Examples
Sources and Patches
References
Felix Ritchie and Mark Elliot, Principles- Versus Rules-Based Output Statistical
Disclosure Control In Remote Access Environments
https://iassistdata.org/sites/default/files/iqvol_39_2_ritchie.pdf
Felix Ritchie, Output-based disclosure control for regressions
http://www2.uwe.ac.uk/faculties/BBS/BUS/Research/economics2012/1209.pdf
Felix Ritchie, Analyzing the disclosure risk of regression coefficients
TRANSACTIONS ON DATA PRIVACY 12 (2019) 145-173
Bleninger P., Drechsler J., and Ronning G. (2011) Remote data access and the risk
of disclosure from linear regression, Stat. and Op. Res. Trans. Special Issue:
Privacy in statistical databases, pp 7-24
http://www.idescat.cat/sort/sortspecial2011/DataPrivacy.1.bleninger-etal.pdf
last modified 10 May 2020 by drf