After running a (linear) regression, I regularly want to know how many observations were actually used in the estimation, i.e. the size of the “estimation sample”.
Similarly, I like to be able to identify the observations used in the estimation, e.g. to subset my data. Stata users know this as “e(sample)”, which allows one to generate an indicator variable to be used in subsequent operations:
    sysuse auto
    reg price mpg in 1/70
    gen sampleid = e(sample)
    label define sampleid 1 "Included in estimation" 0 "Not included in estimation"
    label values sampleid sampleid
R’s lm() function does not produce an object with this information automatically. Instead I have found this useful:
    fit <- lm(speed ~ dist, data = cars)

    # - N used
    esample.n <- nobs(fit)

    # - Sample identifier, a set of row names which can be used to subset the corresponding dataframe
    esample <- rownames(as.matrix(resid(fit)))

    # E.g. subsetting
    cars[esample, ]  # trivial here since all obs are included
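Since the cars example keeps every row, here is a less trivial sketch in which some observations are dropped. It assumes R's default missing-value handling (na.action = na.omit), and the names cars2, fit2 and the esample column are made up for illustration. The row names are taken from model.frame() rather than from the residuals, because under na.exclude the residuals would be padded back to full length; treat this as one possible way to mimic e(sample), not the canonical one.

```r
# Copy of the built-in data with a few missing values introduced
# (illustration only; cars itself has no NAs)
cars2 <- cars
cars2$dist[c(3, 10, 25)] <- NA

# With the default na.action (na.omit), rows containing NA are dropped
fit2 <- lm(speed ~ dist, data = cars2)

# Logical indicator, analogous to Stata's e(sample):
# TRUE for rows that ended up in the model frame
cars2$esample <- rownames(cars2) %in% rownames(model.frame(fit2))

# The number of TRUE values should agree with nobs()
nobs(fit2)
sum(cars2$esample)

# Subset to the estimation sample
cars2.used <- cars2[cars2$esample, ]
```

Storing the indicator as a column, rather than as a separate vector of row names, keeps it attached to the data through later sorting or merging, which is closer to how e(sample) is typically used.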