The online gambling industry seeks the highest possible precision when forecasting the outcome of a race, in order to set proper odds for each bet. In this model we focus on car racing and, in particular, on Formula One. Bookmakers compete to offer the best odds to their customers, yet must also predict the outcomes of events accurately and consistently to ensure a profit for the house. Any improvement in prediction accuracy could translate into greater market share, through more generous odds to consumers, and potentially higher profits depending on the costs associated with each bet. With this in mind, we set out to evaluate the data and develop a model for predicting Formula One race outcomes, with the intent of improving the accuracy of odds setting in the industry.
By analyzing a database containing information about many races, described along several dimensions (the independent variables), we set up a model able to forecast the odds of a particular driver winning. We used circuit, car, driver, geography and other variables to establish a baseline for the analysis; from that point, we identify the least predictive variables and simplify the model. The dependent variable is the position obtained by a car in each race. A successful model will predict the outcome of races with a higher degree of accuracy than existing market models, allowing its owner to offer better rates than the competition.
We load the information from several databases:

- Circuits: information about the race's circuit (name, location)
- Constructors: information about the car's constructor (name, nationality, Wikipedia reference)
- Drivers: information about the drivers (name, DOB, nationality)
- Races: information about the race (date, time, circuit)
- Results: information about the results (race, constructor, driver, position)
## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
## dec, : embedded nul(s) found in input
## circuitId circuitRef name location
## 1 1 albert_park Albert Park Grand Prix Circuit Melbourne
## 2 2 sepang Sepang International Circuit Kuala Lumpur
## 3 3 bahrain Bahrain International Circuit Sakhir
## country lat lng alt
## 1 Australia -37.84970 144.9680 10
## 2 Malaysia 2.76083 101.7380 NA
## 3 Bahrain 26.03250 50.5106 NA
## url
## 1 http://en.wikipedia.org/wiki/Melbourne_Grand_Prix_Circuit
## 2 http://en.wikipedia.org/wiki/Sepang_International_Circuit
## 3 http://en.wikipedia.org/wiki/Bahrain_International_Circuit
## constructorId constructorRef name nationality
## 1 1 mclaren McLaren British
## 2 2 bmw_sauber BMW Sauber German
## 3 3 williams Williams British
## url X
## 1 http://en.wikipedia.org/wiki/McLaren NA
## 2 http://en.wikipedia.org/wiki/BMW_Sauber NA
## 3 http://en.wikipedia.org/wiki/Williams_Grand_Prix_Engineering NA
## driverId driverRef number code forename surname dob nationality
## 1 1 hamilton 44 HAM Lewis Hamilton 07/01/1985 British
## 2 2 heidfeld NA HEI Nick Heidfeld 10/05/1977 German
## 3 3 rosberg 6 ROS Nico Rosberg 27/06/1985 German
## url
## 1 http://en.wikipedia.org/wiki/Lewis_Hamilton
## 2 http://en.wikipedia.org/wiki/Nick_Heidfeld
## 3 http://en.wikipedia.org/wiki/Nico_Rosberg
## raceId year round circuitId name date time
## 1 1 2009 1 1 Australian Grand Prix 2009-03-29 06:00:00
## 2 2 2009 2 2 Malaysian Grand Prix 2009-04-05 09:00:00
## 3 3 2009 3 17 Chinese Grand Prix 2009-04-19 07:00:00
## url
## 1 http://en.wikipedia.org/wiki/2009_Australian_Grand_Prix
## 2 http://en.wikipedia.org/wiki/2009_Malaysian_Grand_Prix
## 3 http://en.wikipedia.org/wiki/2009_Chinese_Grand_Prix
## resultId raceId driverId constructorId number grid position positionText
## 1 1 18 1 1 22 1 1 1
## 2 2 18 2 2 3 5 2 2
## 3 3 18 3 3 7 7 3 3
## positionOrder points laps time milliseconds fastestLap rank
## 1 1 10 58 34:50.6 5690616 39 2
## 2 2 8 58 5.478 5696094 41 3
## 3 3 6 58 8.163 5698779 41 5
## fastestLapTime fastestLapSpeed statusId
## 1 01:27.5 218.3 1
## 2 01:27.7 217.586 1
## 3 01:28.1 216.719 1
The datasets are merged, the unnecessary columns are deleted, and some checks are performed to verify the data type of each column.
## Warning in merge.data.frame(Temp03, Circuits, by = "circuitId"): column
## names 'url.x', 'url.y' are duplicated in the result
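As a sketch, the joins can be reproduced with base R `merge` calls on toy versions of the five tables loaded above (the object names and the tiny data frames here are assumptions for illustration, not the actual script):

```r
# Toy tables mirroring the column names shown in the printouts above.
Races    <- data.frame(raceId = 1:2, circuitId = c(1, 2), year = 2009)
Results  <- data.frame(resultId = 1:2, raceId = c(1, 2),
                       driverId = c(1, 2), position = c(1, 3))
Drivers  <- data.frame(driverId = 1:2, driverRef = c("hamilton", "rosberg"))
Circuits <- data.frame(circuitId = 1:2,
                       circuitRef = c("albert_park", "sepang"))

Temp01 <- merge(Results, Races,    by = "raceId")    # attach race info
Temp02 <- merge(Temp01,  Drivers,  by = "driverId")  # attach driver info
Full   <- merge(Temp02,  Circuits, by = "circuitId") # attach circuit info
str(Full)  # check the data type of each resulting column
```

Since the tables share column names such as `url` and `name`, `merge` suffixes the duplicates with `.x` and `.y`, which is what the warning above refers to.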
## 'data.frame': 23777 obs. of 38 variables:
## $ circuitId : int 1 1 1 1 1 1 1 1 1 1 ...
## $ constructorId : int 1 15 205 6 131 9 5 9 7 3 ...
## $ driverId : int 57 64 5 4 3 817 26 14 23 811 ...
## $ raceId : int 224 158 338 338 880 926 55 55 71 860 ...
## $ resultId : int 4385 2942 20359 20350 21731 22543 795 794 1150 21247 ...
## $ number.x : int 7 16 19 8 9 3 21 14 17 19 ...
## $ grid : int 5 19 19 3 6 6 18 11 15 14 ...
## $ position : int 5 NA 13 4 NA 6 9 8 12 16 ...
## $ positionText : Factor w/ 39 levels "1","10","11",..: 29 38 5 28 38 30 33 32 4 8 ...
## $ positionOrder : int 5 12 13 4 20 6 9 8 12 16 ...
## $ points : num 2 0 0 12 0 8 0 1 0 0 ...
## $ laps : int 58 41 56 58 26 57 57 57 56 52 ...
## $ time.x : Factor w/ 5759 levels "","+ 1:06.7",..: 1243 1 1 2268 1 1 5630 5162 1 1 ...
## $ milliseconds : int 5665471 NA NA 5632835 NA NA 5746670 5721770 NA NA ...
## $ fastestLap : int NA NA 52 47 18 46 26 32 56 49 ...
## $ rank : int NA NA 14 10 18 10 14 12 9 13 ...
## $ fastestLapTime : Factor w/ 552 levels "","01:07.4","01:07.5",..: 1 1 251 212 238 243 199 198 180 224 ...
## $ fastestLapSpeed: Factor w/ 5145 levels "","01:42.6","100.615",..: 1 1 2744 3734 3142 2972 4047 4069 4364 3496 ...
## $ statusId : int 1 7 12 1 10 11 1 1 11 4 ...
## $ year : int 1996 2000 2010 2010 2013 2015 2006 2006 2005 2012 ...
## $ round : int 1 1 2 2 1 1 3 3 1 1 ...
## $ name.x : Factor w/ 42 levels "Abu Dhabi Grand Prix",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ date : Factor w/ 997 levels "1950-05-13","1950-05-21",..: 582 647 822 822 879 917 753 753 732 859 ...
## $ time.y : Factor w/ 21 levels "","03:00:00",..: 1 1 5 5 5 4 14 14 14 5 ...
## $ driverRef : Factor w/ 842 levels "abate","abecassis",..: 337 205 433 17 675 659 732 177 639 120 ...
## $ number.y : int NA NA NA 14 6 3 NA NA NA NA ...
## $ code : Factor w/ 82 levels "","ALB","ALG",..: 1 1 38 4 60 59 66 14 64 65 ...
## $ forename : Factor w/ 466 levels "Adolf","Adolfo",..: 313 349 175 131 326 83 401 87 363 51 ...
## $ surname : Factor w/ 785 levels "Abate","Abecassis",..: 349 209 404 15 619 607 678 164 652 659 ...
## $ dob : Factor w/ 824 levels "","01/01/1909",..: 748 563 500 767 708 18 608 700 789 389 ...
## $ nationality.x : Factor w/ 41 levels "American","American-Italian",..: 17 8 17 36 19 5 1 9 19 8 ...
## $ constructorRef : Factor w/ 208 levels "adams","afm",..: 131 166 116 69 136 163 194 163 195 206 ...
## $ name.y : Factor w/ 208 levels "Adams","AFM",..: 133 167 111 73 138 164 195 164 196 206 ...
## $ nationality.y : Factor w/ 24 levels "American","Australian",..: 6 24 17 15 11 3 15 3 16 6 ...
## $ circuitRef : Factor w/ 73 levels "adelaide","ain-diab",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ name : Factor w/ 73 levels "A1-Ring","Adelaide Street Circuit",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ location : Factor w/ 70 levels "Abu Dhabi","Adelaide",..: 37 37 37 37 37 37 37 37 37 37 ...
## $ country : Factor w/ 32 levels "Argentina","Australia",..: 2 2 2 2 2 2 2 2 2 2 ...
## circuitId constructorId driverId raceId resultId number.x grid position
## 1 1 1 57 224 4385 7 5 5
## 2 1 15 64 158 2942 16 19 NA
## 3 1 205 5 338 20359 19 19 13
## positionText positionOrder points laps time.x milliseconds fastestLap
## 1 5 5 2 58 +1:35.071 5665471 NA
## 2 R 12 0 41 NA NA
## 3 13 13 0 56 NA 52
## rank fastestLapTime fastestLapSpeed statusId year round
## 1 NA 1 1996 1
## 2 NA 7 2000 1
## 3 14 01:33.6 203.878 12 2010 2
## name.x date time.y driverRef number.y code
## 1 Australian Grand Prix 1996-03-10 hakkinen NA
## 2 Australian Grand Prix 2000-03-12 diniz NA
## 3 Australian Grand Prix 2010-03-28 06:00:00 kovalainen NA KOV
## forename surname dob nationality.x constructorRef name.y
## 1 Mika Häkkinen 28/09/1968 Finnish mclaren McLaren
## 2 Pedro Diniz 22/05/1970 Brazilian sauber Sauber
## 3 Heikki Kovalainen 19/10/1981 Finnish lotus_racing Lotus
## nationality.y circuitRef name location
## 1 British albert_park Albert Park Grand Prix Circuit Melbourne
## 2 Swiss albert_park Albert Park Grand Prix Circuit Melbourne
## 3 Malaysian albert_park Albert Park Grand Prix Circuit Melbourne
## country Podium
## 1 Australia 0
## 2 Australia 0
## 3 Australia 0
We set up two new variables:

- Pilot year winning ratio: percentage of wins over races entered this year
- Pilot lifetime winning ratio: percentage of total wins over all races so far
## driverId year circuitId constructorId raceId resultId number.x grid
## 1 1 2007 6 1 40 458 2 2
## 2 1 2007 11 1 46 589 2 1
## 3 1 2007 2 1 37 392 2 4
## position positionText positionOrder points laps time.x milliseconds
## 1 2 12 2 8 78 4109 6033424
## 2 1 1 1 10 70 3775 5752991
## 3 2 12 2 8 56 2346 5552487
## fastestLap rank fastestLapTime fastestLapSpeed statusId round name.x
## 1 28 2 69 328 1 5 28
## 2 13 2 117 1951 1 11 19
## 3 22 1 282 3063 1 2 26
## date time.y driverRef number.y code forename surname dob nationality.x
## 1 773 12 340 44 30 271 326 162 9
## 2 779 12 340 44 30 271 326 162 9
## 3 770 6 340 44 30 271 326 162 9
## constructorRef name.y nationality.y circuitRef name location country
## 1 131 133 6 40 21 40 19
## 2 131 133 6 26 38 10 12
## 3 131 133 6 59 64 29 17
## Podium yr_count yr_win cumu_count cumu_win year_rwin_ratio
## 1 1 17 12 17 12 0.7058824
## 2 1 17 12 17 12 0.7058824
## 3 1 17 12 17 12 0.7058824
## life_win_ratio
## 1 0.7058824
## 2 0.7058824
## 3 0.7058824
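The two ratios can be sketched on toy data as follows (column names follow the printout above; the helper logic shown here is an assumption about how such features would be computed):

```r
# Toy results: driverId, season and finishing position; a 'win' is position 1.
d <- data.frame(driverId = c(1, 1, 1, 2),
                year     = c(2007, 2007, 2008, 2007),
                position = c(1, 2, 1, 1))
d$win <- as.integer(d$position == 1)
d$one <- 1L  # race counter

# Wins and races per driver-year
yr <- aggregate(cbind(yr_win = win, yr_count = one) ~ driverId + year,
                data = d, FUN = sum)
yr$year_rwin_ratio <- yr$yr_win / yr$yr_count

# Lifetime ratio: cumulative wins over cumulative races, per driver
yr <- yr[order(yr$driverId, yr$year), ]
yr$cumu_win       <- ave(yr$yr_win,   yr$driverId, FUN = cumsum)
yr$cumu_count     <- ave(yr$yr_count, yr$driverId, FUN = cumsum)
yr$life_win_ratio <- yr$cumu_win / yr$cumu_count
```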
Remove single occurrences: some of the drivers or constructors were present for only a few races. Consequently, when dividing the data into estimation, testing and validation sets, some records of such drivers or constructors cannot be found in a given subset. We therefore removed drivers with fewer than 6 races and constructors with fewer than 10.
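A minimal sketch of this filter, using the thresholds from the text on a toy merged data set (the object name `Full` is an assumption):

```r
# Toy merged data: driver 1 has 6 races, driver 2 only 4;
# the single constructor appears 10 times.
Full <- data.frame(driverId      = c(rep(1, 6), rep(2, 4)),
                   constructorId = rep(10, 10))

driver_n      <- table(Full$driverId)
constructor_n <- table(Full$constructorId)

# Keep rows whose driver has >= 6 races and constructor >= 10
keep <- Full$driverId      %in% names(driver_n)[driver_n >= 6] &
        Full$constructorId %in% names(constructor_n)[constructor_n >= 10]
Full <- Full[keep, ]  # driver 2 (only 4 races) is dropped
```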
Our model uses an 80% probability threshold for assigning Class 1, which in our case means finishing on the podium.
Let’s look into the data. This is how the first 10 of the 22100 rows look (transposed, for convenience):
| | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| driverId | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| year | 2007.00 | 2007.00 | 2007.00 | 2007.00 | 2007.00 | 2007.00 | 2007.00 | 2007.00 | 2007.00 | 2007.00 |
| circuitId | 6.00 | 11.00 | 2.00 | 8.00 | 17.00 | 5.00 | 9.00 | 3.00 | 18.00 | 16.00 |
| constructorId | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| year_rwin_ratio | 0.71 | 0.71 | 0.71 | 0.71 | 0.71 | 0.71 | 0.71 | 0.71 | 0.71 | 0.71 |
| life_win_ratio | 0.71 | 0.71 | 0.71 | 0.71 | 0.71 | 0.71 | 0.71 | 0.71 | 0.71 | 0.71 |
In this model, we will consider two classification methods: logistic regression and classification and regression trees (CART).
Logistic regression is a method similar to linear regression except that the dependent variable is discrete (e.g., 0 or 1). Logistic regression estimates the coefficients of a linear model using the selected independent variables while optimizing a classification criterion. For example, these are the logistic regression parameters for our data:
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | 45.9 | 23.3 | 2.0 | 0.0 |
| driverId2 | 0.7 | 0.4 | 1.6 | 0.1 |
| driverId3 | 0.6 | 0.4 | 1.5 | 0.1 |
| driverId4 | 0.0 | 0.3 | 0.1 | 0.9 |
| driverId5 | 0.3 | 0.7 | 0.4 | 0.7 |
| driverId6 | -15.7 | 1995.4 | 0.0 | 1.0 |
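The fit itself can be sketched with `glm` and a binomial family. The data below is simulated for illustration (the real model also includes factors such as driverId and circuitId):

```r
# Simulated estimation data with the engineered ratio variables.
set.seed(1)
est <- data.frame(Podium          = rbinom(200, 1, 0.3),
                  grid            = sample(1:22, 200, replace = TRUE),
                  year_rwin_ratio = runif(200),
                  life_win_ratio  = runif(200))

fit <- glm(Podium ~ grid + year_rwin_ratio + life_win_ratio,
           data = est, family = binomial)
summary(fit)$coefficients   # Estimate, Std. Error, z value, Pr(>|z|)

# Estimated probability of class 1 (podium) for new observations:
p <- predict(fit, newdata = est[1:10, ], type = "response")
pred_class <- as.integer(p > 0.8)   # classify with the 80% threshold
```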
Given a set of independent variables, the output of the estimated logistic regression (the sum of the products of the independent variables with the corresponding regression coefficients) can be used to assess the probability an observation belongs to one of the classes. Specifically, the regression output can be transformed into a probability of belonging to, say, class 1 for each observation. The estimated probability that a validation observation belongs to class 1 (e.g., the estimated probability that the driver finishes on the podium) for the first few validation observations, using the logistic regression above, is:
| | Actual Class | Predicted Class | Probability of Class 1 |
|---|---|---|---|
| Obs 1 | 0 | 1 | 0.84 |
| Obs 2 | 0 | 0 | 0.23 |
| Obs 3 | 1 | 0 | 0.43 |
| Obs 4 | 1 | 0 | 0.47 |
| Obs 5 | 0 | 0 | 0.23 |
| Obs 6 | 0 | 0 | 0.24 |
| Obs 7 | 0 | 0 | 0.22 |
| Obs 8 | 1 | 0 | 0.21 |
| Obs 9 | 1 | 0 | 0.24 |
| Obs 10 | 1 | 1 | 0.84 |
Running a basic CART model with complexity control cp=0.001, leads to the following tree (NOTE: for better readability of the tree figures below, we will rename the independent variables as IV1 to IV6 when using CART):
## Warning: labs do not fit even at cex 0.15, there may be some overplotting
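The two CART fits can be sketched with `rpart`, whose `cp` argument is the complexity control discussed in the text (the data below is simulated, and `IV1`/`IV2` stand in for the renamed independent variables):

```r
library(rpart)

# Simulated estimation data with two illustrative independent variables.
set.seed(1)
est <- data.frame(Podium = factor(rbinom(300, 1, 0.3)),
                  IV1 = runif(300), IV2 = runif(300))

cart1 <- rpart(Podium ~ IV1 + IV2, data = est, method = "class",
               control = rpart.control(cp = 0.001))  # first, larger tree
cart2 <- rpart(Podium ~ IV1 + IV2, data = est, method = "class",
               control = rpart.control(cp = 0.008))  # second, pruned tree

# Leaf purities double as estimated class-1 probabilities:
p <- predict(cart1, newdata = est[1:10, ], type = "prob")[, "1"]
```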
The leaves of the tree indicate the number of estimation data observations that “reach that leaf” that belong to each class. A perfect classification would only have data from one class in each of the tree leaves. However, such a perfect classification of the estimation data would most likely not be able to classify well out-of-sample data due to overfitting of the estimation data.
One can estimate larger trees by changing the tree’s complexity control parameter (in this case the rpart.control argument cp). For example, this is how the tree looks if we set cp=0.008:
One can also use the percentage of data in each leaf of the tree to get an estimate of the probability that an observation belongs to a given class: the purity of a leaf indicates the probability that an observation that “reaches that leaf” belongs to that class. In our case, the probability that our validation data belong to class 1 (i.e., a driver’s likelihood of a podium finish) for the first few validation observations, using the first CART above, is:
| | Actual Class | Predicted Class | Probability of Class 1 |
|---|---|---|---|
| Obs 1 | 0 | 0 | 0.71 |
| Obs 2 | 0 | 0 | 0.25 |
| Obs 3 | 1 | 0 | 0.20 |
| Obs 4 | 1 | 1 | 1.00 |
| Obs 5 | 0 | 0 | 0.25 |
| Obs 6 | 0 | 0 | 0.25 |
| Obs 7 | 0 | 0 | 0.31 |
| Obs 8 | 1 | 0 | 0.25 |
| Obs 9 | 1 | 0 | 0.20 |
| Obs 10 | 1 | 1 | 0.82 |
The table above assumes that the probability threshold for considering an observation as “class 1” is 0.8. In practice we need to select the probability threshold: this is an important choice that we discuss below.
We have already discussed feature selection and complexity control for classification methods. In our case, we can see the relative importance of the independent variables using the `variable.importance` element of the CART trees (see `help(rpart.object)` in R) or the z-scores from the output of logistic regression. For easier visualization, we scale all values between -1 and 1 (the scaling is done for each method separately; note that CART does not provide the sign of the “coefficients”). From this table we can see the key drivers of the classification according to each of the methods used here.
| | Logistic Regression | CART 1 | CART 2 |
|---|---|---|---|
| driverId2 | 0.05 | 0.40 | 0.26 |
| driverId3 | 0.05 | 0.03 | 0.00 |
| driverId4 | 0.00 | 0.15 | 0.00 |
| driverId5 | 0.01 | 0.03 | 0.00 |
| driverId6 | 0.00 | 0.00 | 0.00 |
| driverId7 | 0.00 | 0.00 | 0.00 |
This is the percentage of observations that have been correctly classified (i.e., the predicted class and the actual class are the same). We simply count the number of validation observations correctly classified and divide this number by the total number of validation observations, for the two CART models and the logistic regression above. For a probability threshold of 80%, the results are as follows:
For the validation data, the hit rates are:
| | Hit Ratio |
|---|---|
| Logistic Regression | 88.42476 |
| First CART | 87.92149 |
| Second CART | 86.81429 |
For the estimation data, the hit rates are:
| | Hit Ratio |
|---|---|
| Logistic Regression | 88.17024 |
| First CART | 89.29290 |
| Second CART | 86.95264 |
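The hit-ratio computation itself is a one-liner. As a sketch, using the ten example validation observations shown earlier for the logistic regression:

```r
# Actual classes and class-1 probabilities from the ten example
# logistic-regression validation observations above.
actual <- c(0, 0, 1, 1, 0, 0, 0, 1, 1, 1)
prob1  <- c(0.84, 0.23, 0.43, 0.47, 0.23, 0.24, 0.22, 0.21, 0.24, 0.84)

predicted <- as.integer(prob1 > 0.8)          # 80% probability threshold
hit_ratio <- 100 * mean(predicted == actual)  # percent correctly classified
```

On these ten rows the hit ratio is 50%, far below the full-sample figures above, as one would expect from such a tiny subset.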
The confusion matrix shows for each class the number (or percentage) of the data that are correctly classified for that class. For example, for the method above with the highest hit rate in the validation data (among logistic regression and the 2 CART models), and for probability threshold 80%, the confusion matrix for the validation data is:
| | Predicted 1 (Podium) | Predicted 0 (No Podium) |
|---|---|---|
| Actual 1 (Podium) | 15.27 | 84.73 |
| Actual 0 (No Podium) | 0.46 | 99.54 |
When we change the probability threshold we get different values of hit rate, false positive and false negative rates, or any other performance metric. We can plot, for example, how the false positive versus true positive rates change as we alter the probability threshold, and generate the so-called ROC curve.
The ROC curves for the validation data for the logistic regression as well as both the CARTs above are as follows:
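A base-R sketch of tracing such a curve, threshold by threshold, on the ten example observations above (packages such as pROC or ROCR would produce the same curve):

```r
# Actual classes and class-1 probabilities from the example observations.
actual <- c(0, 0, 1, 1, 0, 0, 0, 1, 1, 1)
prob1  <- c(0.84, 0.23, 0.43, 0.47, 0.23, 0.24, 0.22, 0.21, 0.24, 0.84)

# Sweep the probability threshold and record true/false positive rates.
thresholds <- seq(0, 1, by = 0.05)
tpr <- sapply(thresholds, function(t) mean(prob1[actual == 1] > t))
fpr <- sapply(thresholds, function(t) mean(prob1[actual == 0] > t))

plot(fpr, tpr, type = "l", xlab = "False positive rate",
     ylab = "True positive rate", main = "ROC curve")
abline(0, 1, lty = 2)   # random-classifier reference line
```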
In the betting environment case we are studying, the gains charts for the validation data for our three classifiers are the following:
Notice that if we were to examine cases selected at random, instead of selecting the “best” ones using an informed classifier, the “random prediction” gains chart would be a straight 45-degree line.
So how should a good gains chart look? The further above this 45-degree reference line our gains curve lies, the better the “gains”. Moreover, much like for the ROC curve, one can choose the percentage of all cases examined so that any point on the gains curve is selected.
Finally, we can generate the so-called profit curve, which we often use to make our final decisions.
We can estimate the profit obtained if we only select the top drivers and races in terms of the probability of winning assigned by our classifier.
We can plot the profit curve by changing, as we did for the gains chart, the percentage of cases we select, and calculating the corresponding total estimated profit (or loss) we would generate.
| | Predict 1 (Podium) | Predict 0 (No Podium) |
|---|---|---|
| Actual 1 (Podium) | 220 | 0 |
| Actual 0 (No Podium) | -100 | 0 |
Based on these profit and cost estimates, the profit curves for the validation data for the three classifiers are:
We can then select the percentage of selected races that corresponds to the maximum estimated profit (or minimum loss, if necessary).
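A sketch of that selection rule, using the payoff matrix above (+220 for each true podium picked, -100 for each non-podium) on the ten example observations:

```r
# Actual classes and class-1 probabilities from the example observations.
actual <- c(0, 0, 1, 1, 0, 0, 0, 1, 1, 1)
prob1  <- c(0.84, 0.23, 0.43, 0.47, 0.23, 0.24, 0.22, 0.21, 0.24, 0.84)

ord    <- order(prob1, decreasing = TRUE)    # pick the most likely cases first
payoff <- ifelse(actual[ord] == 1, 220, -100)
profit <- cumsum(payoff)                     # total profit when picking top k

best_k <- which.max(profit)                  # optimal share = best_k / length(actual)
plot(seq_along(profit) / length(profit), profit, type = "l",
     xlab = "Fraction of cases selected", ylab = "Estimated profit")
```

On this toy sample the maximum estimated profit is reached when selecting the top 6 of 10 cases.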
Having iterated steps 2-5 until we are satisfied with the performance of our selected model on the validation data, in this step the performance analysis outlined in step 5 needs to be done with the test sample. This is the performance that best mimics what one should expect in practice upon deployment of the classification solution, assuming (as always) that the data used for this performance analysis are representative of the situation in which the solution will be deployed.
Let’s see in our case how the hit ratio, confusion matrix, ROC curve, gains chart, and profit curve look like for our test data. For the hit ratio and the confusion matrix we use 80% as the probability threshold for classification.
| | Hit Ratio |
|---|---|
| Logistic Regression | 87.50566 |
| First CART | 87.46039 |
| Second CART | 86.28339 |
The confusion matrix for the model with the best test data hit ratio above:
| | Predicted 1 (Podium) | Predicted 0 (No Podium) |
|---|---|---|
| Actual 1 (Podium) | 12.87 | 87.13 |
| Actual 0 (No Podium) | 0.63 | 99.37 |
ROC curves for the test data:
Gains chart for the test data:
Finally the profit curves for the test data, using the same profit/cost estimates as we did above:
Our current model achieves some success in predicting who will be on the podium, but these estimates could be improved with some further feature engineering.
Some paths to explore would be:
- Hot streak: some drivers may have a higher chance to win if they have won the last few races. With the momentum of past victories, they may be more likely to win again.
- Home-field advantage: a driver may train where the race is held, or be from the same country. With the additional encouragement from their countrymen, they may win more often.
- Fatigue: based on the number of races in a set time period, some racers might suffer from fatigue and not perform as well.
- Age: age may play a role in performance. There could be a peak period where most racers perform best, after which performance declines.
- Pit stops: based on past averages, we could estimate the number of pit stops per race and its effect on performance.
- Qualifying: further analysis could be done on the qualifying session data. Since qualifying determines the starting grid for the race, it could give further insight into performance.
- Lap times: the average or fastest lap time could offer insight into race performance.