Linear regression is a very useful predictive modeling tool for making predictions based on historical data, but it does have its limitations. One notable shortcoming is that a categorical target variable violates the assumptions of linear regression. As a workaround, a person could assign numerical values to the categories, such as 1 for “yes” and 0 for “no”. So using a categorical target with linear regression is certainly possible, but not ideal. More suitable predictive methods are available, and this tutorial will introduce logistic regression as a more appropriate method for categorical variables, specifically binary variables.

My logistic regression tutorial is a very condensed rendition of the tutorial offered in Customer and Business Analytics: Applied Data Mining for Business Decision Making Using R by Daniel S. Putler and Robert E. Krider. We will use the BCA package in R, which was created by the authors to be used alongside the text. I strongly recommend this textbook to anyone seeking to learn the theories behind data mining while also getting hands-on experience with the R language. The tutorials at the end of every chapter provide a great means of applying the concepts. This particular tutorial is from Chapter 5’s introduction to logistic regression as a predictive method.
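To make the point about coding a binary outcome as 1 and 0 more concrete, here is a minimal sketch in base R. It uses simulated data rather than anything from the textbook, and the variable names (x, y, linear_fit, logistic_fit) are my own placeholders. It fits both a linear and a logistic regression to the same 0/1 target and compares the range of the fitted values.

```r
set.seed(123)  # reproducible simulated data

# Hypothetical predictor and a binary target coded 1 = "yes", 0 = "no"
x <- seq(-4, 4, length.out = 200)
y <- rbinom(length(x), size = 1, prob = plogis(1.5 * x))

# Linear regression treats the 0/1 codes as ordinary numbers
linear_fit <- lm(y ~ x)

# Logistic regression models the probability that y equals 1
logistic_fit <- glm(y ~ x, family = binomial)

range(fitted(linear_fit))    # can fall below 0 or above 1
range(fitted(logistic_fit))  # always stays between 0 and 1
```

The linear fit can produce "probabilities" outside the 0 to 1 range, which is one practical symptom of the mismatch between linear regression and a binary target.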
The dataset used to demonstrate logistic regression in R comes from a project conducted for the Canadian Charitable Society, or CCS. CCS’s goal was to create a model that would provide insight into which of their donors should be contacted to join a monthly donation program. CCS had data suggesting that the donors who joined the monthly donation program tended to donate more, and with a very low attrition rate to boot. Unfortunately, only about one percent of current donors chose to join the monthly donation program when it was previously offered as a payment option. CCS concluded that donors needed to be better targeted for recruitment into the program and decided upon a mailing campaign to accomplish this goal. However, mailing the promotion to each and every donor would not be cost-effective in the slightest. Furthermore, the success rate would most likely be very low. With this in mind, CCS hoped to target only the donors who were more likely to join the monthly donation program.
A database of donor data was put together that included each donor’s donation history up to the point of receiving the promotion, along with a variable indicating whether the donor joined the monthly donation program after receiving the promotion. In this tutorial, we will use this database to construct a model to predict which of the charitable society’s other donors should be targeted.
Before diving into the tutorial, there are two concepts of database marketing that deserve a brief mention: oversampling and model validation. As mentioned earlier, the portion of donors that actually joined the monthly donation program was extremely low. In turn, it is tough to determine which of the variables have the most impact on a donor’s successful recruitment to the program. To adjust for this, a new database was created containing an equal number of donors who joined and donors who did not join the monthly program. This is referred to as oversampling the target population, and in our case it provides significantly better predictive performance due to a more balanced set of data. Through oversampling, our model is more capable of detecting key differences between the two types of donors.
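As a rough illustration of the idea, the balanced database described above could be built along these lines in base R. This is only a sketch on simulated data: the data frame donors and its columns (donor_id, avg_gift, month_give) are hypothetical stand-ins, not the actual CCS data, and the book’s own sampling procedure may differ.

```r
set.seed(42)  # make the simulated example reproducible

# Hypothetical donor table: roughly 1% of donors joined the monthly program
n <- 10000
donors <- data.frame(
  donor_id   = seq_len(n),
  avg_gift   = round(rexp(n, rate = 1 / 40), 2),
  month_give = factor(ifelse(runif(n) < 0.01, "Yes", "No"),
                      levels = c("No", "Yes"))
)

joined     <- donors[donors$month_give == "Yes", ]
not_joined <- donors[donors$month_give == "No", ]

# Keep every joiner and draw an equally sized random sample of non-joiners
keep_rows         <- sample(nrow(not_joined), nrow(joined))
not_joined_sample <- not_joined[keep_rows, ]
balanced          <- rbind(joined, not_joined_sample)

table(donors$month_give)    # heavily imbalanced
table(balanced$month_give)  # equal counts in both classes
```

Because the balanced sample no longer reflects the true one percent join rate, predicted probabilities from a model fit on it are inflated relative to the full donor population, which is worth keeping in mind when interpreting them.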
A great way to prevent the “overfitting” of a model to a particular set of data is to validate the model using two (or more) samples of the data. Each sample contains the same set of variables, and the oversampling rate remains constant. However, each sample contains a completely different set of individuals. The first sample, called the estimation sample, is used to construct and calibrate our potential models. The second sample, the validation sample, is withheld from this process and is instead used to aid in the selection of the best model. The validation sample’s predictor variables are used to calculate predicted target values. Keep in mind that we also have observed target values (which were not used to create the model) that we can benchmark our model’s predictions against. The model we select, estimated using the estimation sample, will be the one that best predicts the target value in our validation sample. Often, more than two samples will be created; a third sample is frequently referred to as a holdout sample. For this tutorial, though, we will only use estimation and validation samples.
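The estimation and validation workflow can be sketched in base R as follows. Again, this is an assumption-laden illustration on simulated data: the columns avg_gift, num_gifts, and month_give are hypothetical, the 50/50 split is my own choice, and plain classification accuracy stands in here for whatever model comparison measure you prefer.

```r
set.seed(1)  # reproducible simulated data and split

# Hypothetical donor sample; column names are illustrative only
n <- 800
donors <- data.frame(
  avg_gift  = rexp(n, rate = 1 / 40),
  num_gifts = rpois(n, lambda = 3)
)
# Simulate a 0/1 target that depends on the predictors
p <- plogis(-1.5 + 0.02 * donors$avg_gift + 0.2 * donors$num_gifts)
donors$month_give <- rbinom(n, size = 1, prob = p)

# Randomly assign half of the individuals to each sample
in_est     <- sample(n, size = n / 2)
estimation <- donors[in_est, ]
validation <- donors[-in_est, ]

# Fit the logistic regression on the estimation sample only
fit <- glm(month_give ~ avg_gift + num_gifts,
           data = estimation, family = binomial)

# Predict join probabilities for the withheld validation sample
pred <- predict(fit, newdata = validation, type = "response")

# Benchmark predictions against the observed validation targets
accuracy <- mean((pred > 0.5) == (validation$month_give == 1))
accuracy
```

With several candidate models, the same steps would be repeated for each one, and the model with the best validation performance would be kept.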
Enough said, let’s jump into the tutorial.

In the next tutorial that I followed from this textbook, I’ll assess the performance of these models using lift charts in R.