Chapter 4 Model Validation
So far we have seen how to calibrate or fit regression models using a few approaches. However, as already discussed, measuring the degree of fit or the performance of a model with the same sample used to fit it is not appropriate. We must use an independent data sample to conduct that evaluation.
The process of measuring the degree of fit of a model is also known as validation. Therefore, from now on we will always distinguish between calibration and validation samples and, thus, between model calibration (using the calibration sample) and model validation (using the validation sample).
4.1 Sampling our data
As previously mentioned, in order to carry out a validation process it is mandatory to use an independent data sample. What does this mean? Basically, we cannot use the data with which we calibrated the model.
By using the calibration sample to validate a model we skew the result, forcing the validation to look better than it should. This is because the model is adapted to those very data and, therefore, validating with the same sample is not realistic. This is particularly important in the case of predictive models, such as the linear regression model of temperature prediction we fitted in Fitting a linear regression model.
The solution is simple: it is enough to split the original data into two sub-samples, using one to adjust/calibrate the model and reserving the other to carry out the validation procedure¹¹. There are different ways to carry out this process. The most common is to randomly separate our data, using different percentages:
- 75% calibration and 25% validation.
- 70% calibration and 30% validation.
- 60% calibration and 40% validation.
But what percentage share is appropriate? The truth is that there is no single answer to this question, although there is some consensus that the larger the data sample, the greater the proportion of data we can allocate to validation.
How do we conduct the sampling process in R? No wonder, there is a specific function to do this. The most basic process involves using sample() on a data.frame (or any table-like object) to obtain a vector of random row indexes and later split the data accordingly. The basic syntax of the function looks like this:
sample(vector, size=number_of_elements)
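For instance, the following toy call draws three random elements from the numbers 1 to 10 (your result will differ from run to run, since the draw is random):
sample(1:10, size = 3)  # e.g. three values between 1 and 10, in random order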
In general, the result is a vector of size elements randomly selected from vector. To apply this to an array or data.frame, we draw a random sample of row numbers instead, retrieving the total number of rows with nrow(). Once this is done, we select the records of our array or data.frame that correspond to the randomly selected row numbers. Let's look at an example. Execute the following instructions:
regression <- read.table('Data/regression.txt', header = TRUE, sep = '\t', dec = ',')
sample(nrow(regression), size = 50)
## [1] 98 161 207 54 124 68 173 39 42 45 191 4 222 82 71
## [16] 6 186 132 74 62 130 198 116 149 143 79 150 125 3 200
## [31] 167 174 192 166 48 21 119 91 234 196 47 121 64 41 102
## [46] 100 16 52 201 89
Here we randomly selected 50 rows from the object regression. If you are following along at home, do not expect to obtain the same numbers: the selection is RANDOM, so your output will most likely differ from the one shown here.
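If you need a reproducible draw, for instance to share results with colleagues, R's set.seed() fixes the state of the random number generator so that subsequent calls to sample() return the same values (the seed value 42 below is an arbitrary choice):
set.seed(42)  # any integer works; fixes the RNG state
sample(nrow(regression), size = 50)  # now reproducible across runs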
To turn this into an effective random selection, we store the row numbers in an object and then use that object to extract data from regression. We will extract 50 records as the validation sample and keep the remaining ones for calibration.
val.sample <- sample(nrow(regression), size = 50)  # random row indexes for validation
regression.val <- regression[val.sample, ]         # keep those rows
nrow(regression.val)
## [1] 50
Once we have retrieved the validation sample, we go for the calibration one, excluding the validation rows with negative indexing (note the minus sign):
regression.cal<-regression[-val.sample,]
nrow(regression.cal)
## [1] 184
Now we have two samples: one to calibrate a model with lm() (regression.cal) and another to apply predict() to (regression.val).
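As a minimal sketch of that workflow (assuming, hypothetically, that regression contains a response column named y and a predictor named x; replace these with your actual variable names):
model <- lm(y ~ x, data = regression.cal)                # fit on the calibration sample
predictions <- predict(model, newdata = regression.val)  # predict on the validation sample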
The only issue here is that we have specified the sample size as an absolute number, whereas earlier I was talking about percentages. Well, we can work this out using a couple of functions:
- nrow(): returns the number of rows, i.e. the records, of a table-like object.
- floor(): returns the integer part of a number.
Therefore, if we use nrow() to retrieve the number of records, multiply that number by a percentage/ratio, and then take the integer part of the operation, we obtain the size corresponding to a given ratio.
numberOfRows<-nrow(regression)
numberOfRows
## [1] 234
ratio<-0.4 #40% of records
sizeValue<-floor(numberOfRows*ratio)
sizeValue
## [1] 93
Then we modify the sample extraction as follows:
numberOfRows<-nrow(regression)
ratio<-0.4 #40% of records
sizeValue<-floor(numberOfRows*ratio)
val.sample<-sample(nrow(regression), size = sizeValue)
regression.val<-regression[val.sample,]
nrow(regression.val)
## [1] 93
regression.cal<-regression[-val.sample,]
nrow(regression.cal)
## [1] 141
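The same steps can be wrapped into a small helper function so that the split becomes a one-liner. The name splitSample below is a hypothetical choice for illustration, not a built-in function:
# Split a table-like object into calibration and validation sub-samples
splitSample <- function(data, ratio) {
  sizeValue <- floor(nrow(data) * ratio)            # validation size for the given ratio
  val.idx <- sample(nrow(data), size = sizeValue)   # random validation row indexes
  list(cal = data[-val.idx, ], val = data[val.idx, ])
}
# Usage: samples <- splitSample(regression, 0.4); nrow(samples$val) gives 93 with this dataset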
EXERCISE 6:
Modify the script for linear regression to include a random sample procedure:
- Use a calibration sample with 75% of the data to fit the model.
- Use the remaining 25% as validation sample and calculate the RMSE.
Deliverables:
- Create a script (.R file) with the instructions used.
- Document the script adding comments (#) describing each step of the procedure.
4.2 Cross-validation
We have seen how to split our data randomly, but if we want to make sure that our result is not influenced by random effects in the sampling, we must repeat the process several times and check to what extent the result varies.
This can be done manually, or by designing a procedure or script that performs the process recursively. An alternative, which requires less effort and knowledge, is to use a predesigned function. Specifically, we will use a function that performs cross-validation. Cross-validation (CV) is a technique used to evaluate the results of a statistical analysis and ensure that they are independent of the partition between training (calibration) and test (validation) data. It consists of repeating the evaluation on different partitions and calculating the arithmetic mean of the resulting evaluation measures. It is used in settings where the main objective is prediction, since we most likely need to estimate the accuracy of our model in some way.
CV is a technique widely used in artificial intelligence projects to validate models. In our case, we can apply this procedure to any of the regression types we have seen so far. To do this in R, we will install an additional package called boot. Follow the steps below to carry out the process:
#install.packages('boot')
library(boot)
logit <- read.csv2('Data/logit.csv')
mylogit <- glm(logit_1_0 ~ ., data = logit, family = binomial)  # fit the logistic model
cv.err <- cv.glm(logit, mylogit)  # leave-one-out CV by default
cv.err$delta
## [1] 0.1018896 0.1018894
The execution might take a while. Note that delta contains two numbers: the raw cross-validation estimate of the prediction error and a bias-adjusted version of it. At this stage I will assume that, before you ran the above chunk of code, you already took a look at help(cv.glm), which is a new instruction and, most importantly, the one that allows us to conduct the CV. Just to summarize it:
- We need our data object (logit in the example) as the first parameter.
- We also need a fitted regression model (mylogit).
If we call cv.glm() with these two parameters alone, we are conducting a leave-one-out (LOO) cross-validation. This is the most basic form of CV: all records but one are used to fit a model, the result is then predicted for the left-out record and its residual is calculated. This is repeated for each record and, at the end, we get the average residual.
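To make the idea concrete, here is a conceptual sketch of LOO for an ordinary linear model. It only illustrates the procedure, not how cv.glm() is actually implemented, and the column names y and x are again hypothetical placeholders:
# Conceptual LOO sketch: fit on all rows but i, predict row i, square the error
looErrors <- sapply(seq_len(nrow(regression)), function(i) {
  fit <- lm(y ~ x, data = regression[-i, ])        # hypothetical columns y and x
  pred <- predict(fit, newdata = regression[i, ])  # predict the left-out record
  (regression$y[i] - pred)^2                       # squared residual
})
mean(looErrors)  # average prediction error over all records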
Another option is to use the optional argument K and pass it an integer. By doing this we change to a k-fold CV, based on partitioning the data into \(K\) groups, with \(2 \le K < N\) and \(N\) the number of records, and proceeding as in the LOO case using groups instead of single records.
[Figure: diagram of the k-fold cross-validation scheme (source: https://i.stack.imgur.com/fhMza.png)]
#install.packages('boot')
library(boot)
mylogit <- glm(logit_1_0 ~ ., data = logit, family = binomial)
cv.err <- cv.glm(logit, mylogit, K = 4)  # 4-fold cross-validation
cv.err$delta
## [1] 0.1017795 0.1016166
The latter approach is significantly faster. In any case, we will see later how to conduct a more appropriate k-fold CV, improving the error measurement. With this function we only calculate the average value of the squared residuals, which is informative but nothing more.
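Since the default cost function of cv.glm() is the average squared error, taking the square root of delta puts the error on the scale of the response, making it comparable to an RMSE:
sqrt(cv.err$delta)  # square root of the average squared error (RMSE-like)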
Whichever splitting process we eventually employ.↩