Estimating Sampling Variation

Danny Kaplan and Randall Pruim, Fri Jan 4 10:33:33 2013

Sampling variation refers to the uncertainty created by the sampling process. If your data is just a sample, another data set will produce a somewhat different result. The question we seek to answer is, “How much is somewhat?”

The method used here is called bootstrapping. Bootstrapping relies on a simulation of the sampling process called resampling: sampling from the sample.

Obtaining a model from data

To illustrate, consider estimating the power-law exponent on the relationship between species number and land area.

Species = fetchData("http://www.math.smith.edu/~bbaumer/mth247/SpeciesArea.csv")
plotPoints(Species ~ Area, data=Species)

plot of chunk unnamed-chunk-2

The model we seek to fit is \[ \mathrm{Species} = a \mathrm{Area}^b \;. \] Judging from the shape of the graph, Species goes roughly as the square-root of area, so use \( b=0.5 \) as the initial guess for the nonlinear parameter. Finding the parameters is a straightforward application of fitModel:

SpeciesModel = fitModel( Species ~ a*Area^b, 
          data=Species, 
          start=list(b=.5) )
SpeciesModel
## function (Area, b = 0.256503447940427, a = 4.10981633631802) 
## a * Area^b
## <environment: 0x103b576e0>
## attr(,"coefficients")
##      b      a 
## 0.2565 4.1098
coef(SpeciesModel)  # if we just want the coeffiecients
##      b      a 
## 0.2565 4.1098

It looks like a fourth-root model is appropriate. But how precise is the estimate of the coefficient \( b \)?

Estimating parameter precision

We begin by repeating what is above, but wrapping it in curly braces. This tells R to treat the entire code block as a single expression. We'll make use of this in the next step.

{ SpeciesModel = fitModel( Species ~ a*Area^b, 
                  data=Species, 
                  start=list(b=.5) )
coef(SpeciesModel)
}
##      b      a 
## 0.2565 4.1098

Apply resampling and iterate

Creating a resampling distribution is straightforward once we have gotten this far. Edit the code above as follows:

ResamplingDist = do(500)*
  { SpeciesModel = fitModel( Species ~ a*Area^b, 
                  data=resample(Species), 
                  start=list(b=.5) )
              coef(SpeciesModel) 
  }

Now we have a set of 500 resampling realizations. We begin by looking at the variability in our estimated power graphically.

densityplot( ~ b, data=ResamplingDist )  # we could also have done a histogram

plot of chunk unnamed-chunk-6

plotPoints( b ~ a, data=ResamplingDist)  

plot of chunk unnamed-chunk-6

Notice that the coefficient estimates are not independent of each other.

One common way to quantify sampling variability is with the standard error. The standard error is simply the standard deviation of the sampling distribution, which we can estimate using our resampling distribution.

sd( ~ b, data=ResamplingDist )
## [1] 0.02109

Another standard way to present the precision is with a confidence interval.
For instance, to find the 95% confidence interval, use confint.

confint(ResamplingDist)
##   name  lower  upper
## 1    b 0.2114 0.2943
## 2    a 2.4933 6.0975

A 95% confidence (when based on normal distribution theory) stretches roughly 2 standard errors in each direction from the estimate.

se <- sd( ~b, data=ResamplingDist )
coef(SpeciesModel)['b'] + c(-1,1) * 2 * se
## [1] 0.2327 0.3171

Alternatively, we could take the middle 95% of the samping distribution as our interval.

qdata(c(0.025,0.975), b, data=ResamplingDist)
##   2.5%  97.5% 
## 0.1886 0.2838

Using the model to make predictions

A point estimate

Suppose that we were interested in predicting the number of species in a land with area \( 2 \times 10^6 \) km2. This would be an extrapolation, and extrapolations are generally unreliable, but here goes …

The procedure is essentially the same as before, but now we compute an estimated response from our model rather that estimated coefficients.

First we do this with our original data to get a point estimate, a single number that is our best guess but does not reveal anything about its precision.

{ SpeciesModel = fitModel( Species ~ a*Area^b, 
                  data=Species, 
                  start=list(b=.5) )
SpeciesModel( Area=2e6 ) 
}
## [1] 169.8

Interval estimates via resampling

Now we would like to get an interval estimate via resampling.

ResamplingDist2 = do(500)*
  { SpeciesModel = fitModel( Species ~ a*Area^b, 
                  data=resample(Species), 
                  start=list(b=.5) )
              SpeciesModel(2e6) 
  }

Now we have a set of 500 resampling realizations.

densityplot( ~ result, data=ResamplingDist2)

plot of chunk unnamed-chunk-13

To find the 95% confidence interval, use confint.

confint(ResamplingDist2)
##     name lower upper
## 1 result 135.9 198.3

Your turn

Fit a model to data of your own choosing. Give point and interval estimates for the parameters of your model and for an estimated response produced by your model. You might like to return to Eyeballing Parameters for some exmple data sets and models.