Estimation sample information from a linear regression in R using lm(), aka Stata’s e(sample)


After running a (linear) regression I regularly want to know the size of the sample used in the estimation, the “estimation sample”.

Similarly, I like to be able to identify the observations used in the estimation, e.g. to subset my data. Stata users know this as “e(sample)”, which allows one to generate an identifier variable to be used in subsequent operations:

sysuse auto
reg price mpg in 1/70
gen sampleid = e(sample)
label define sampleid 1 "Included in estimation" 0 "Not included in estimation"
label values sampleid sampleid

R’s lm() function does not produce an object with this information automatically. Instead I have found this useful:

fit <- lm(speed ~ dist, data = cars)
# - N used
esample.n <- nobs(fit)
# - Sample identifier, a set of row names which can be used to subset the corresponding dataframe
esample <- rownames(model.frame(fit))
# E.g. subsetting
cars[esample, ] # trivial here since all obs are included
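
The example above is trivial because no observations are dropped. A minimal sketch of a less trivial case follows. It assumes R's default na.action (na.omit), so rows with missing values are silently excluded from the fit. The names cars2, fit2, esample2 and sampleid are made up for illustration. It builds a 0/1 indicator along the lines of Stata's e(sample):

```r
# Hypothetical data: a copy of the built-in cars data with three missing values
cars2 <- cars
cars2$dist[c(3, 10, 25)] <- NA

# With the default na.action (na.omit), rows 3, 10 and 25 are dropped from the fit
fit2 <- lm(speed ~ dist, data = cars2)

# Row names of the observations actually used in the estimation
esample2 <- rownames(model.frame(fit2))

# 0/1 identifier variable, analogous to: gen sampleid = e(sample)
cars2$sampleid <- as.integer(rownames(cars2) %in% esample2)

# N used, and a tabulation of included vs. excluded rows
nobs(fit2)
table(cars2$sampleid)

# Subset to the estimation sample
cars2.used <- cars2[cars2$sampleid == 1, ]
```

Taking the row names from model.frame() rather than from the residuals is a design choice: the model frame contains only the rows that entered the fit, whereas residuals() may be padded with NA when na.action = na.exclude is used.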