So you have your own idea for a heuristic. Just implement a few functions– a fitting function and a predicting function– and you can evaluate performance with heuristica and compare it with other models.
First, write a function to fit data. All of the heuristica models have three required arguments, and they are recommended:
The function is required to output a structure with these elements:
You can also output:
In this vignette, we will build a bare bones model first (and a more realistic model later). This model has no extra parameters in its fit:
myRandModel <- function(train_data, criterion_col, cols_to_fit) {
# We will fill in a more interesting version below.
structure(list(criterion_col=criterion_col, cols_to_fit=cols_to_fit),
class="myRandModel")
}
The class name– myRandModel– will tell heuristica how to find the predicting functions. But if you are curious, you can read more about S3 classes.
Most statistical models in R implement “predict,” which takes N rows of data and outputs N predictions. Heuristica, however, is focused on predicting pairs of rows, so its models implement a “predict pair” function. Two rows are passed in, row1 or row2, and the model must predict which has a higher criterion. This is a categorical question: row1 or row2? Some models answer this by first estimating the criterion for the two rows, which is quantitative, but models like Take The Best do not.
PredictPair refers to predicting which of the pair of rows, row1 or row2, is greater. In this case the outputs have the following meaning:
To make predictions with heuristica, implement a function called predictPairInternal.yourClassName. The “internal” in the name is supposed to indicate that you never call this function yourself– you call predictPair instead. The inputs for this function should be:
The output should be:
If you output anything else, the aggregating functions will not correctly calculate accuracy.
For our bare-bones example, we will randomly guess -1 or 1. 1 means we predict the criterion in row1 will be greater. -1 means we predict the criterion in row2 will be greater– that is, the rows should be reversed.
Now to use this function, we call predictPair and pass in the fitted
model. We could call predictPairInternal directly, but predictPair takes
care of some housekeeping. (It calls
predictPairInternal(object, row1[cols_to_fit], row2[cols_to_fit])
,
makes sure the output has the dimension of one row, names its column
with the model class or fit_name
.) Let’s get some data and
run our model on it.
Consider a subset of the high school dropout data included with this package, focusing on just 5 schools. The first column has the school name. The drop-out rates are in column 2, and we will fit them using columns 3-5, namely Enrollment, Attendance Rate, and Low Income Students.
## Name Dropout_Rate Enrollment Attendance_Rate Low_Income_Students
## 1 Austin 39.4 1403 72.3 63.7
## 2 Farragut 34.0 1839 76.4 95.9
## 3 Fenger 28.7 1098 79.5 63.2
## 4 Crane 28.6 969 65.5 74.8
## 5 Orr 26.3 1388 68.8 74.6
To analyze myRandModel on this data requires these steps: 1. “Fit” myModel to the schools data. 2. Ask the model to predict whether Austin has a higher dropout rate than Farrgut (the first two schools in our data).
## Name Dropout_Rate Enrollment Attendance_Rate Low_Income_Students
## 1 Austin 39.4 1403 72.3 63.7
## Name Dropout_Rate Enrollment Attendance_Rate Low_Income_Students
## 2 Farragut 34 1839 76.4 95.9
## [1] 1
We can see in the original data.frame that Austin had the higher dropout rate, meaning 1 would have been correct output. We can get heuristica to show this using a more general rowPair function to view the correct choice based on the criterion– which we tell it is in row 2 using correctGreater. It outputs the correct answer of the rowPair, which could be -1 or 1, and we see it turned out to be 1.
myFit <- myRandModel(schools, 2, c(3:5))
myData <- rbind(oneRow(schools, 1), oneRow(schools, 2))
rowPairApply(myData, correctGreater(2), heuristics(myFit))
## CorrectGreater myRandModel
## [1,] 1 -1
What if we want to see results for all pairs of the first 5 schools? Use rowPairApply.
## CorrectGreater myRandModel
## [1,] 1 -1
## [2,] 1 1
## [3,] 1 -1
## [4,] 1 -1
## [5,] 1 -1
## [6,] 1 1
## [7,] 1 1
## [8,] 1 -1
## [9,] 1 -1
## [10,] 1 -1
This outputs 10 predictions because there are 5*4/2 = 10 row pairs for 5 schools. Notice that because the data is sorted, CorrectGreater is always 1. (If we had a different sorting, we would see some -1’s.)
Heuristica can assess the performance of this categorizing of ranking. First, there is a confusion matrix, as in many machine learning problems, except this one assumes you always want to see the output for -1, 0, 1 always. We need to give it the correct answers, which are based on column 2, so we use correctGreater(2). And we need to give it the predictions, which we generate with heuristics(myFit). We pass these generators into rowPairApply. Then we send the output to our confusion matrix function.
set.seed(1)
predictions <- data.frame(rowPairApply(schools, correctGreater(2), heuristics(myFit)))
confusionMatrixFor_Neg1_0_1(predictions$CorrectGreater, predictions$myRandModel)
## predictions
## correct -1 0 1
## -1 0 0 0
## 0 0 0 0
## 1 4 0 6
Assuming you used the same random seed as this module, the matrix shows that out of the 10 cases, 6 were predicted correctly (the prediction was 1 and the correct answer was 1) and 4 were not (the prediction was -1 but the correct answer was 1).
What percent correct is this? In this case, it’s an easy division of 6/10 = 0.6. But you can get this result in one step– for many fitted models– with percentCorrect. Below is the call (with the random seed set again).
## myRandModel
## 1 60
That was a toy example. What if you want to wrap a real model, like lasso regression? That is implemented in another package, glmnet, so if necessary, install it.
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following object is masked from 'package:reshape':
##
## expand
## Loaded glmnet 4.1-8
The fitting function is easy. Wrap the lasso regression and then add just a little bit of extra information, including the subclass name, criterion column, and columns to fit. Be sure to keep the class of the original output!
lassoModel <- function(train_data, criterion_col, cols_to_fit) {
# glmnet can only handle matrices, not data.frames.
cvfit <- suppressWarnings(cv.glmnet(y=as.matrix(train_data[,criterion_col]),
x=as.matrix(train_data[,cols_to_fit])))
# Make lassoModel a subclass. Be sure to keep the original class, glmnet.
class(cvfit) <- c("lassoModel", class(cvfit))
# Functions in this package require criterion_col and cols_to_fit.
cvfit$criterion_col <- criterion_col
cvfit$cols_to_fit <- cols_to_fit
return(cvfit)
}
A fit should now include the extra information we add.
my_data <- cbind(y=c(4, 3, 2, 1), x1=c(1.2, 1.1, 1.0, 1.0), x2=c(1, 0, 1, 1))
lasso <- lassoModel(my_data, 1, c(2,3))
lasso$criterion_col
## [1] 1
## [1] 2 3
## [1] "lassoModel" "cv.glmnet"
And if you correctly kept the original glmnet class, you can still use all the functions it offers.
## 3 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -6.264828
## x1 8.153328
## x2 .
## lambda.1se
## [1,] 3.519166
## [2,] 2.703833
## [3,] 1.888500
## [4,] 1.888500
The task is selecting between two rows, so lasso should predict each row and choose the one with the higher criterion. Below is an example implementation of predictPairInternal that will work, although it is not very efficient.
First, prove we can predict one row pair.
## [1] 1
Now predict all row pairs in our data set. It got 91% correct– pretty good!
## lassoModel
## 1 91.66667
Which comparison did it miss? Check out the helper function below. We predict all row pairs and then find the row pairs where Lasso did not agree with the Correct answer. Lasso missed only one comparison, namely row 3 vs. row 4, where it guessed. It predicted the other 5 row pairs correctly!
out <- data.frame(rowPairApply(my_data, rowIndexes(), heuristics(lasso), correctGreater(lasso$criterion_col)))
out[out$lassoModel != out$CorrectGreater,]
## Row1 Row2 lassoModel CorrectGreater
## 6 3 4 0 1
If you run lasso’s predictions for even a moderately large data set, it will take a while. For example, for the schools data set, it took 20 seconds on a Macbook air.
The solution is to make the prediction step purely a matrix calculation. (This might require saving and caching extra information in the fitting step so it does not have to be recalculated on every prediction.)