Uncommon Reaction: Geekfest: Bootstrapped Credible Intervals of the Mean in SAS

Its not called 'Uncommon Reaction(s)' for nothing. This will be weird to many visitors, but that's ok. If you're not into this kind of thing just move along to the next post. If you don't care for the talky parts, I've pasted the full code without interruption at the bottom.

I wrote a little program to calculate bootstrapped nonparametric confidence intervals of the mean in SAS. Yes, I know STATA users, STATA apparently has a command that just does this, but you know what? I don't use STATA and I don't feel like learning. Besides, when I write it in SAS it reinforces my understanding of what I'm doing.

I've been using this with simulation data, but you could use the same program with observational data. Its not different. There is nothing particularly special about this, I just combined a bunch of different pieces of code from different SUGI doc's. Its just when I looked there was nothing

Here goes (and please feel free to post questions in the comments).

/*the first step of is computationally intensive, so I suggest you add this first step to reduce your data set to only the variables that you'd like compute confidence intervals for*/

data new;
set old;
keep var_1 var_2. . . var_n;
run;

/*Ok, so here we go. The first real step of the process is to create the random samples. In case you need a little refresher, the method of bootstrapping reuses your existing data set to create a bunch of new data samples by selecting observations at random from the original data set. The idea is that by evaluating your statistic across these psuedo-samples you can get an idea of the true sampling distribution of that statistic, thus negating the need to use parametric assumptions to estimate your confidence intervals.

This is the step that creates those samples, therefore it is computationally intensive. It takes time to create all those datasets. However, the code for this step is very simple*/

%let rep=1000;

proc surveyselect data=new out=bootsample

seed=1234 method=urs

samprate = 1 outhits rep=&rep;

run;

/*Even when its simple, its nice to know what you're doing.
Rep= sets the number of new psuedo or in SAS-terms 'replicate' datasets you'd like to create. The more replications you do, the better your estimates will be, but the longer it takes to run. If you have a small sample to begin with, you'll probably want to do more replications. Given you have a reasonably decent computer, it doesn't take too long regardless. I've been setting it to 1,000 and then just running it before I go to lunch, or before I leave the office for the day.

Seed= of course sets the random seed. It can be any integer, but it needs to be an integer, and you need to CHANGE it between rounds or you'll keep on generating the same so-called random process.

Method= sets the sampling method. URS stands for unrestricted random sampling with replacement which (to simplify things a great deal) is the proper method to use for bootstrapping.

Samprate= sets the size of the replicate database you'd like to generate. When this is set to 1, then each replicate is going to be the same size as the original. When its set to 0.5, each replicate set will be half as big. Your sample size of your new data set after running this set will be equal to your original n times the samprate value times the number of replications you choose.

For more information on surveyselect, click here.

Now that you have a sample data set its time to summarize it. In this case I am interested in means, so I will set my proc summary step as follows*/

proc summary data=bootsample nway;

by replicate;

var var1 var2 . . . var_n;

output out=out1 (drop=_type_ _freq_) mean= ;

run;

/*The proc surveymeans command automatically outputs a variables called 'replicate' which is equal to 1 for observations in the first replicate sample, 2 in the second replicate sample, etc.

This step simply calculates the mean value of each variable for each replicate sample and outputs them as one row per replicate. A dataset with 1,000 replicates would produce 1,000 rows of mean values. This final step actually produces what you are looking for*/

Proc univariate data=out1 noprint;
var var_1;
output out=ci pctlpts=5 95 pctlpre=ci_;
proc print;
title 'var_1';
run;

/* This is actually some pretty nifty SAS code. Basically the deal with proc univariate is that it creates this massive output. This code suppresses that output and tells SAS to only keep the values for where the percentile point is 5% and 95%. For strict Bayesians who stay awake at night with nightmares that someone is going to mistake them for a Frequentist, you can set these values to anything you want. I've seen Bayesians use 10% and 90% just to make the point that these are not Frequentist estimates, but heck, knock yourself out. Use 4% and 96% for all I care. Also, you're not limited to just two points. You can output as many as you'd like. The pctlpre statement is just a nice feature that will stick a prefix at the beginning of each credible interval to identify it in output.

The proc print feature just prints your values, and your title gives it a name. Unfortunately, I don't right now know how to output the univariate step for multiple variables. I have just been running this last step multiple times for all the variables I need credible intervals for, but I'd love it if someone would post a fix for this below.

When you're done, your full code should look like this.*/

data b;
set a;
keep var_1 var_2;
run;

%let rep=1000;
proc surveyselect data=b out=bootsample
seed =1234 method=urs
samprate = 1 outhits rep=&rep;
run;

proc summary data=bootsample nway;
by replicate;
var var_1 var_2;
output out=out1 (drop=_type_ _freq_) mean= ;
run;

Proc univariate data=out1 noprint;
var var_1;
output out=ci pctlpts=5 95 pctlpre=ci_;
proc print;
title 'Var_1';
run;

Proc univariate data=out1 noprint;
var var_2;
output out=ci pctlpts=5 95 pctlpre=ci_;
proc print;
title 'Var_2';
run;

etc. . .

Have fun and please post a note if this was helpful.

Thursday, February 4, 2010

Geekfest: Bootstrapped Credible Intervals of the Mean in SAS

No comments:

Post a Comment

Blog Archive

Followers