Business Problem:

The online gambling industry seeks the highest possible precision when forecasting the outcome of a race, in order to set the proper odds for each bet. In this model, we chose to focus on car racing and, in particular, on Formula One. The gambling industry competes to provide the best odds to its customers. The industry must also accurately and consistently predict the outcomes of events to ensure a profit for the house. Any improvement in the accuracy of predictions could result in enhanced market share, through more generous odds to consumers, and potentially increased profits depending on the costs associated with each bet. With this knowledge, we set out to evaluate the data and develop a model for predicting Formula One race outcomes, with the intent of improving the accuracy of odds setting in the industry.

Business Solution:

By analyzing a database containing information about many races, described along different dimensions (independent variables), we set up a model able to forecast the odds of a particular racer winning. To perform this analysis, we used circuit, car, driver, geography and other variables to establish a baseline for analysis. From that point, we identify which variables are least predictive and simplify the model. In this analysis, the dependent variable is the position obtained by a car in each race. A successful model will predict the outcome of races with a higher degree of accuracy than existing market models, allowing the owner to offer better rates than the competition.

Load raw data files

We load the information from several data files:

- Circuits: information about the race circuit (name, location)
- Constructors: information about the car's constructor (name, nationality, Wikipedia reference)
- Drivers: information about the drivers (name, date of birth, nationality)
- Races: information about the race (date, time, circuit)
- Results: information about the results (race, constructor, driver, position)
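The loading step can be sketched as below. The file names and the use of base `read.csv` are assumptions (the data resemble the public Ergast/Kaggle F1 CSV exports); only the five data frames named above are required downstream.

```r
# Assumed file names; adjust paths as needed.
Circuits     <- read.csv("circuits.csv")
Constructors <- read.csv("constructors.csv")
Drivers      <- read.csv("drivers.csv")
Races        <- read.csv("races.csv")
Results      <- read.csv("results.csv")
```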

## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
## dec, : embedded nul(s) found in input

Preliminary Summary Statistics

Show the first 3 rows in each dataframe

##   circuitId  circuitRef                           name     location
## 1         1 albert_park Albert Park Grand Prix Circuit    Melbourne
## 2         2      sepang   Sepang International Circuit Kuala Lumpur
## 3         3     bahrain  Bahrain International Circuit       Sakhir
##     country       lat      lng alt
## 1 Australia -37.84970 144.9680  10
## 2  Malaysia   2.76083 101.7380  NA
## 3   Bahrain  26.03250  50.5106  NA
##                                                          url
## 1  http://en.wikipedia.org/wiki/Melbourne_Grand_Prix_Circuit
## 2  http://en.wikipedia.org/wiki/Sepang_International_Circuit
## 3 http://en.wikipedia.org/wiki/Bahrain_International_Circuit
##   constructorId constructorRef       name nationality
## 1             1        mclaren    McLaren     British
## 2             2     bmw_sauber BMW Sauber      German
## 3             3       williams   Williams     British
##                                                            url  X
## 1                         http://en.wikipedia.org/wiki/McLaren NA
## 2                      http://en.wikipedia.org/wiki/BMW_Sauber NA
## 3 http://en.wikipedia.org/wiki/Williams_Grand_Prix_Engineering NA
##   driverId driverRef number code forename  surname        dob nationality
## 1        1  hamilton     44  HAM    Lewis Hamilton 07/01/1985     British
## 2        2  heidfeld     NA  HEI     Nick Heidfeld 10/05/1977      German
## 3        3   rosberg      6  ROS     Nico  Rosberg 27/06/1985      German
##                                           url
## 1 http://en.wikipedia.org/wiki/Lewis_Hamilton
## 2  http://en.wikipedia.org/wiki/Nick_Heidfeld
## 3   http://en.wikipedia.org/wiki/Nico_Rosberg
##   raceId year round circuitId                  name       date     time
## 1      1 2009     1         1 Australian Grand Prix 2009-03-29 06:00:00
## 2      2 2009     2         2  Malaysian Grand Prix 2009-04-05 09:00:00
## 3      3 2009     3        17    Chinese Grand Prix 2009-04-19 07:00:00
##                                                       url
## 1 http://en.wikipedia.org/wiki/2009_Australian_Grand_Prix
## 2  http://en.wikipedia.org/wiki/2009_Malaysian_Grand_Prix
## 3    http://en.wikipedia.org/wiki/2009_Chinese_Grand_Prix
##   resultId raceId driverId constructorId number grid position positionText
## 1        1     18        1             1     22    1        1            1
## 2        2     18        2             2      3    5        2            2
## 3        3     18        3             3      7    7        3            3
##   positionOrder points laps    time milliseconds fastestLap rank
## 1             1     10   58 34:50.6      5690616         39    2
## 2             2      8   58   5.478      5696094         41    3
## 3             3      6   58   8.163      5698779         41    5
##   fastestLapTime fastestLapSpeed statusId
## 1        01:27.5           218.3        1
## 2        01:27.7         217.586        1
## 3        01:28.1         216.719        1

Clean the data and incorporate all relevant variables in one unique dataset

The datasets are merged, the unnecessary columns are deleted, and some checks are performed to verify the data type of each column.

## Warning in merge.data.frame(Temp03, Circuits, by = "circuitId"): column
## names 'url.x', 'url.y' are duplicated in the result
## 'data.frame':    23777 obs. of  38 variables:
##  $ circuitId      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ constructorId  : int  1 15 205 6 131 9 5 9 7 3 ...
##  $ driverId       : int  57 64 5 4 3 817 26 14 23 811 ...
##  $ raceId         : int  224 158 338 338 880 926 55 55 71 860 ...
##  $ resultId       : int  4385 2942 20359 20350 21731 22543 795 794 1150 21247 ...
##  $ number.x       : int  7 16 19 8 9 3 21 14 17 19 ...
##  $ grid           : int  5 19 19 3 6 6 18 11 15 14 ...
##  $ position       : int  5 NA 13 4 NA 6 9 8 12 16 ...
##  $ positionText   : Factor w/ 39 levels "1","10","11",..: 29 38 5 28 38 30 33 32 4 8 ...
##  $ positionOrder  : int  5 12 13 4 20 6 9 8 12 16 ...
##  $ points         : num  2 0 0 12 0 8 0 1 0 0 ...
##  $ laps           : int  58 41 56 58 26 57 57 57 56 52 ...
##  $ time.x         : Factor w/ 5759 levels "","+ 1:06.7",..: 1243 1 1 2268 1 1 5630 5162 1 1 ...
##  $ milliseconds   : int  5665471 NA NA 5632835 NA NA 5746670 5721770 NA NA ...
##  $ fastestLap     : int  NA NA 52 47 18 46 26 32 56 49 ...
##  $ rank           : int  NA NA 14 10 18 10 14 12 9 13 ...
##  $ fastestLapTime : Factor w/ 552 levels "","01:07.4","01:07.5",..: 1 1 251 212 238 243 199 198 180 224 ...
##  $ fastestLapSpeed: Factor w/ 5145 levels "","01:42.6","100.615",..: 1 1 2744 3734 3142 2972 4047 4069 4364 3496 ...
##  $ statusId       : int  1 7 12 1 10 11 1 1 11 4 ...
##  $ year           : int  1996 2000 2010 2010 2013 2015 2006 2006 2005 2012 ...
##  $ round          : int  1 1 2 2 1 1 3 3 1 1 ...
##  $ name.x         : Factor w/ 42 levels "Abu Dhabi Grand Prix",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ date           : Factor w/ 997 levels "1950-05-13","1950-05-21",..: 582 647 822 822 879 917 753 753 732 859 ...
##  $ time.y         : Factor w/ 21 levels "","03:00:00",..: 1 1 5 5 5 4 14 14 14 5 ...
##  $ driverRef      : Factor w/ 842 levels "abate","abecassis",..: 337 205 433 17 675 659 732 177 639 120 ...
##  $ number.y       : int  NA NA NA 14 6 3 NA NA NA NA ...
##  $ code           : Factor w/ 82 levels "","ALB","ALG",..: 1 1 38 4 60 59 66 14 64 65 ...
##  $ forename       : Factor w/ 466 levels "Adolf","Adolfo",..: 313 349 175 131 326 83 401 87 363 51 ...
##  $ surname        : Factor w/ 785 levels "Abate","Abecassis",..: 349 209 404 15 619 607 678 164 652 659 ...
##  $ dob            : Factor w/ 824 levels "","01/01/1909",..: 748 563 500 767 708 18 608 700 789 389 ...
##  $ nationality.x  : Factor w/ 41 levels "American","American-Italian",..: 17 8 17 36 19 5 1 9 19 8 ...
##  $ constructorRef : Factor w/ 208 levels "adams","afm",..: 131 166 116 69 136 163 194 163 195 206 ...
##  $ name.y         : Factor w/ 208 levels "Adams","AFM",..: 133 167 111 73 138 164 195 164 196 206 ...
##  $ nationality.y  : Factor w/ 24 levels "American","Australian",..: 6 24 17 15 11 3 15 3 16 6 ...
##  $ circuitRef     : Factor w/ 73 levels "adelaide","ain-diab",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ name           : Factor w/ 73 levels "A1-Ring","Adelaide Street Circuit",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ location       : Factor w/ 70 levels "Abu Dhabi","Adelaide",..: 37 37 37 37 37 37 37 37 37 37 ...
##  $ country        : Factor w/ 32 levels "Argentina","Australia",..: 2 2 2 2 2 2 2 2 2 2 ...
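The merge chain can be sketched as follows. The intermediate object names are assumptions (the warning above suggests a `Temp03` object merged with `Circuits`); the `.x`/`.y` suffixes in the structure listing come from `merge()` disambiguating columns such as `url`, `name` and `nationality` that appear in more than one source table.

```r
# Join the five tables on their shared keys; merge() appends .x/.y
# suffixes when column names collide.
Temp01 <- merge(Results, Races,        by = "raceId")
Temp02 <- merge(Temp01,  Drivers,      by = "driverId")
Temp03 <- merge(Temp02,  Constructors, by = "constructorId")
Data   <- merge(Temp03,  Circuits,     by = "circuitId")
str(Data)
```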

Set new variable for classification: is the driver on the podium for a certain race?

##   circuitId constructorId driverId raceId resultId number.x grid position
## 1         1             1       57    224     4385        7    5        5
## 2         1            15       64    158     2942       16   19       NA
## 3         1           205        5    338    20359       19   19       13
##   positionText positionOrder points laps    time.x milliseconds fastestLap
## 1            5             5      2   58 +1:35.071      5665471         NA
## 2            R            12      0   41                     NA         NA
## 3           13            13      0   56                     NA         52
##   rank fastestLapTime fastestLapSpeed statusId year round
## 1   NA                                       1 1996     1
## 2   NA                                       7 2000     1
## 3   14        01:33.6         203.878       12 2010     2
##                  name.x       date   time.y  driverRef number.y code
## 1 Australian Grand Prix 1996-03-10            hakkinen       NA     
## 2 Australian Grand Prix 2000-03-12               diniz       NA     
## 3 Australian Grand Prix 2010-03-28 06:00:00 kovalainen       NA  KOV
##   forename    surname        dob nationality.x constructorRef  name.y
## 1     Mika   Häkkinen 28/09/1968       Finnish        mclaren McLaren
## 2    Pedro      Diniz 22/05/1970     Brazilian         sauber  Sauber
## 3   Heikki Kovalainen 19/10/1981       Finnish   lotus_racing   Lotus
##   nationality.y  circuitRef                           name  location
## 1       British albert_park Albert Park Grand Prix Circuit Melbourne
## 2         Swiss albert_park Albert Park Grand Prix Circuit Melbourne
## 3     Malaysian albert_park Albert Park Grand Prix Circuit Melbourne
##     country Podium
## 1 Australia      0
## 2 Australia      0
## 3 Australia      0
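One plausible definition of the new target variable is sketched below. Using `positionOrder` (which is always populated) rather than `position` (which is `NA` for retirements) is an assumption based on the output above.

```r
# Binary classification target: 1 if the driver finished in the top three.
Data$Podium <- ifelse(Data$positionOrder <= 3, 1, 0)
```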

FEATURE ENGINEERING:

Set up two new variables:

- Driver year winning ratio (% of wins over races run this year)
- Driver lifetime winning ratio (% of total wins over all races so far)

##   driverId year circuitId constructorId raceId resultId number.x grid
## 1        1 2007         6             1     40      458        2    2
## 2        1 2007        11             1     46      589        2    1
## 3        1 2007         2             1     37      392        2    4
##   position positionText positionOrder points laps time.x milliseconds
## 1        2           12             2      8   78   4109      6033424
## 2        1            1             1     10   70   3775      5752991
## 3        2           12             2      8   56   2346      5552487
##   fastestLap rank fastestLapTime fastestLapSpeed statusId round name.x
## 1         28    2             69             328        1     5     28
## 2         13    2            117            1951        1    11     19
## 3         22    1            282            3063        1     2     26
##   date time.y driverRef number.y code forename surname dob nationality.x
## 1  773     12       340       44   30      271     326 162             9
## 2  779     12       340       44   30      271     326 162             9
## 3  770      6       340       44   30      271     326 162             9
##   constructorRef name.y nationality.y circuitRef name location country
## 1            131    133             6         40   21       40      19
## 2            131    133             6         26   38       10      12
## 3            131    133             6         59   64       29      17
##   Podium yr_count yr_win cumu_count cumu_win year_rwin_ratio
## 1      1       17     12         17       12       0.7058824
## 2      1       17     12         17       12       0.7058824
## 3      1       17     12         17       12       0.7058824
##   life_win_ratio
## 1      0.7058824
## 2      0.7058824
## 3      0.7058824
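The two ratio features can be sketched with base R grouping, assuming the merged `Data` frame with the `Podium` column above. Whether "winning" here means a podium finish or first place only is an assumption; a true "so far" lifetime ratio would additionally require a running (cumulative) count rather than the per-driver totals used in this sketch.

```r
# Per-driver, per-year race counts and win counts.
Data$yr_count <- ave(Data$Podium, Data$driverId, Data$year, FUN = length)
Data$yr_win   <- ave(Data$Podium, Data$driverId, Data$year, FUN = sum)
Data$year_rwin_ratio <- Data$yr_win / Data$yr_count

# Per-driver lifetime counts (totals, as a simplification).
Data$cumu_count <- ave(Data$Podium, Data$driverId, FUN = length)
Data$cumu_win   <- ave(Data$Podium, Data$driverId, FUN = sum)
Data$life_win_ratio <- Data$cumu_win / Data$cumu_count
```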

STRATIFICATION

Remove rare occurrences: some of the drivers or constructors were present for only a few races. Consequently, when dividing the data into estimation, testing and validation sets, some records of such drivers or constructors cannot be found in every subset. Drivers with fewer than 6 races and constructors with fewer than 10 races were therefore removed.
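The filtering step can be sketched as below, assuming the thresholds stated above (6 races for drivers, 10 for constructors):

```r
# Count appearances, then keep only sufficiently frequent levels.
driver_counts      <- table(Data$driverId)
constructor_counts <- table(Data$constructorId)
keep_drivers      <- names(driver_counts)[driver_counts >= 6]
keep_constructors <- names(constructor_counts)[constructor_counts >= 10]
Data <- Data[Data$driverId %in% keep_drivers &
             Data$constructorId %in% keep_constructors, ]
```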

Set up model

Our model is based on an 80% probability threshold for being in Class 1, or in our case, for being on the podium.

The Data

Let’s look into the data. This is how the first 10 of the total of 22100 rows look (transposed, for convenience):

01 02 03 04 05 06 07 08 09 10
driverId 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
year 2007.00 2007.00 2007.00 2007.00 2007.00 2007.00 2007.00 2007.00 2007.00 2007.00
circuitId 6.00 11.00 2.00 8.00 17.00 5.00 9.00 3.00 18.00 16.00
constructorId 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
year_rwin_ratio 0.71 0.71 0.71 0.71 0.71 0.71 0.71 0.71 0.71 0.71
life_win_ratio 0.71 0.71 0.71 0.71 0.71 0.71 0.71 0.71 0.71 0.71

Classification and Interpretation

In this model, we will consider two classification methods: logistic regression and classification and regression trees (CART).

Logistic Regression

Logistic regression is a method similar to linear regression, except that the dependent variable is discrete (e.g., 0 or 1). It estimates the coefficients of a linear model using the selected independent variables while optimizing a classification criterion. For example, these are the logistic regression parameters for our data:

Estimate Std. Error z value Pr(>|z|)
(Intercept) 45.9 23.3 2.0 0.0
driverId2 0.7 0.4 1.6 0.1
driverId3 0.6 0.4 1.5 0.1
driverId4 0.0 0.3 0.1 0.9
driverId5 0.3 0.7 0.4 0.7
driverId6 -15.7 1995.4 0.0 1.0
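A fit like the one summarized above can be sketched with `glm`. The exact formula and the name `estimation_data` are assumptions; the coefficient table above suggests `driverId` enters the model as a factor.

```r
# Logistic regression on the estimation subset; Podium is the 0/1 target.
logreg <- glm(Podium ~ factor(driverId) + factor(constructorId) +
                factor(circuitId) + year_rwin_ratio + life_win_ratio,
              data = estimation_data, family = binomial(link = "logit"))
summary(logreg)
```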

Given a set of independent variables, the output of the estimated logistic regression (the sum of the products of the independent variables with the corresponding regression coefficients) can be used to assess the probability an observation belongs to one of the classes. Specifically, the regression output can be transformed into a probability of belonging to, say, class 1 for each observation. The estimated probability that a validation observation belongs to class 1 (e.g., the estimated probability that the driver finishes on the podium) for the first few validation observations, using the logistic regression above, is:

Actual Class Predicted Class Probability of Class 1
Obs 1 0 1 0.84
Obs 2 0 0 0.23
Obs 3 1 0 0.43
Obs 4 1 0 0.47
Obs 5 0 0 0.23
Obs 6 0 0 0.24
Obs 7 0 0 0.22
Obs 8 1 0 0.21
Obs 9 1 0 0.24
Obs 10 1 1 0.84
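The mapping from regression output to class probability, and then to a predicted class, can be sketched as below, assuming the fitted `logreg` model and a `validation_data` frame (both names are assumptions):

```r
# type = "response" applies the inverse logit, returning P(class 1).
prob1 <- predict(logreg, newdata = validation_data, type = "response")
# Classify using the 80% threshold adopted in this report.
pred_class <- ifelse(prob1 > 0.80, 1, 0)
```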

CART

Running a basic CART model with complexity control cp=0.001 leads to the following tree (NOTE: for better readability of the tree figures below, we will rename the independent variables as IV1 to IV6 when using CART):

## Warning: labs do not fit even at cex 0.15, there may be some overplotting

The leaves of the tree indicate the number of estimation data observations that “reach that leaf” that belong to each class. A perfect classification would only have data from one class in each of the tree leaves. However, such a perfect classification of the estimation data would most likely not be able to classify well out-of-sample data due to overfitting of the estimation data.

One can estimate larger trees by changing the tree’s complexity control parameter (in this case the rpart.control argument cp). For example, this is how the tree would look if we set cp=0.008:
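Fitting the two trees can be sketched with the `rpart` package; the formula mirrors the logistic regression sketch and is an assumption, as is the use of `rpart.plot::prp` for the figures.

```r
library(rpart)
library(rpart.plot)  # assumed plotting package for the tree figures

# CART with a loose complexity control (larger tree)...
cart1 <- rpart(Podium ~ factor(driverId) + factor(constructorId) +
                 factor(circuitId) + year_rwin_ratio + life_win_ratio,
               data = estimation_data, method = "class",
               control = rpart.control(cp = 0.001))
# ...and with a stricter one (smaller, more interpretable tree).
cart2 <- rpart(Podium ~ factor(driverId) + factor(constructorId) +
                 factor(circuitId) + year_rwin_ratio + life_win_ratio,
               data = estimation_data, method = "class",
               control = rpart.control(cp = 0.008))
prp(cart1, extra = 1)  # leaves show observation counts per class
```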

Changing tree complexity

Purity of the leaf evaluation

Probability that an observation that “reaches that leaf” belongs to a class.

One can also use the percentage of data in each leaf of the tree to get an estimate of the probability that an observation (e.g., a driver in a given race) belongs to a given class. The purity of the leaf indicates the probability that an observation that “reaches that leaf” belongs to a class. In our case, the probability that our validation data belong to class 1 (i.e., a driver’s likelihood of a podium finish) for the first few validation observations, using the first CART above, is:

Actual Class Predicted Class Probability of Class 1
Obs 1 0 0 0.71
Obs 2 0 0 0.25
Obs 3 1 0 0.20
Obs 4 1 1 1.00
Obs 5 0 0 0.25
Obs 6 0 0 0.25
Obs 7 0 0 0.31
Obs 8 1 0 0.25
Obs 9 1 0 0.20
Obs 10 1 1 0.82

The table above assumes that the probability threshold for considering an observation as “class 1” is 0.8. In practice we need to select the probability threshold: this is an important choice that we will discuss below.

Key drivers of the classification according to each of the methods used

We have already discussed feature selection and complexity control for classification methods. In our case, we can see the relative importance of the independent variables using the variable.importance of the CART trees (see help(rpart.object) in R) or the z-scores from the output of logistic regression. For easier visualization, we scale all values between -1 and 1 (the scaling is done for each method separately - note that CART does not provide the sign of the “coefficients”). From this table we can see the key drivers of the classification according to each of the methods we used here.

Logistic Regression CART 1 CART 2
driverId2 0.05 0.40 0.26
driverId3 0.05 0.03 0.00
driverId4 0.00 0.15 0.00
driverId5 0.01 0.03 0.00
driverId6 0.00 0.00 0.00
driverId7 0.00 0.00 0.00
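Extracting and scaling the importances can be sketched as below, assuming the fitted `cart1` and `logreg` objects from the sketches above:

```r
# CART importances are non-negative; scale to [0, 1].
imp <- cart1$variable.importance
round(imp / max(imp), 2)

# Logistic regression: use signed z-scores, scaled to [-1, 1].
z <- summary(logreg)$coefficients[, "z value"]
round(z / max(abs(z)), 2)
```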

Validation accuracy

This is the percentage of observations that have been correctly classified (i.e., the predicted class and the actual class are the same). We simply count the number of validation observations correctly classified and divide by the total number of validation observations, using the two CART models and the logistic regression above. For a probability threshold of 80%, the results are as follows:

For the validation data, the hit rates are:

Hit Ratio
Logistic Regression 88.42476
First CART 87.92149
Second CART 86.81429

For the estimation data, the hit rates are:

Hit Ratio
Logistic Regression 88.17024
First CART 89.29290
Second CART 86.95264
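The hit ratio computation can be sketched in one line, assuming the `pred_class` vector from the threshold sketch above:

```r
# Percentage of observations whose predicted class matches the actual class.
hit_ratio <- function(actual, predicted) 100 * mean(actual == predicted)
hit_ratio(validation_data$Podium, pred_class)
```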

Confusion matrix

The confusion matrix shows, for each class, the number (or percentage) of observations that are correctly classified for that class. For example, for the method with the highest hit rate in the validation data (among the logistic regression and the two CART models), and for a probability threshold of 80%, the confusion matrix for the validation data is:

Predicted 1 (Podium) Predicted 0 (No Podium)
Actual 1 (Podium) 15.27 84.73
Actual 0 (No Podium) 0.46 99.54
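A confusion matrix in row percentages, matching the layout above (each row sums to 100), can be sketched as:

```r
cm <- table(Actual = validation_data$Podium, Predicted = pred_class)
round(100 * prop.table(cm, margin = 1), 2)  # percentages within each actual class
```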

ROC curve

When we change the probability threshold we get different values of the hit rate, false positive and false negative rates, or any other performance metric. We can plot, for example, how the false positive versus true positive rates change as we alter the probability threshold, generating the so-called ROC curve.

The ROC curves for the validation data for the logistic regression as well as both the CARTs above are as follows:
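A manual ROC sketch, sweeping the threshold and recording true and false positive rates, assuming the `prob1` vector from the prediction sketch above:

```r
actual <- validation_data$Podium  # assumed 0/1 actual-class vector
thresholds <- seq(0, 1, by = 0.01)
tpr <- sapply(thresholds, function(t) mean(prob1[actual == 1] > t))
fpr <- sapply(thresholds, function(t) mean(prob1[actual == 0] > t))
plot(fpr, tpr, type = "l",
     xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)  # chance-level reference line
```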

Gains chart

In the betting environment case we are studying, the gains charts for the validation data for our three classifiers are the following:

Notice that if we were to examine cases selected at random, instead of selecting the “best” ones using an informed classifier, the “random prediction” gains chart would be a straight 45-degree line.

So what should a good gains chart look like? The further above this 45-degree reference line our gains curve is, the better the “gains”. Moreover, much like for the ROC curve, one can select the percentage of all cases examined so that any point of the gains curve is reached.
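A gains chart can be sketched by ranking cases by predicted probability and tracking the cumulative share of podiums captured, again assuming the `prob1` and `actual` vectors from the sketches above:

```r
ord   <- order(prob1, decreasing = TRUE)          # best cases first
gains <- 100 * cumsum(actual[ord]) / sum(actual)  # % of podiums captured
pct   <- 100 * seq_along(gains) / length(gains)   # % of cases examined
plot(pct, gains, type = "l",
     xlab = "% of cases examined", ylab = "% of podiums captured")
abline(0, 1, lty = 2)  # random-selection reference line
```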

Profit curve

Finally, we can generate the so-called profit curve, which we often use to make our final decisions.

We can compute a measure of profit if we select only the top drivers and races in terms of the probability of winning assigned by our classifier.

We can plot the profit curve by changing, as we did for the gains chart, the percentage of cases we select, and calculating the corresponding total estimated profit (or loss) we would generate.

Predict 1 (Podium) Predict 0 (No Podium)
Actual 1 (Podium) 220 0
Actual 0 (No Podium) -100 0

Based on these profit and cost estimates, the profit curves for the validation data for the three classifiers are:

We can then select the percentage of selected races that corresponds to the maximum estimated profit (or minimum loss, if necessary).
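The profit curve can be sketched with the payoff matrix above (+220 for a correctly selected podium, -100 for a selected non-podium), assuming the `prob1` and `actual` vectors from the earlier sketches:

```r
ord    <- order(prob1, decreasing = TRUE)        # rank cases by predicted probability
payoff <- ifelse(actual[ord] == 1, 220, -100)    # payoff matrix above
profit <- cumsum(payoff)
pct    <- 100 * seq_along(profit) / length(profit)
plot(pct, profit, type = "l",
     xlab = "% of cases selected", ylab = "Estimated profit")
```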

Test Accuracy

Having iterated steps 2-5 until we are satisfied with the performance of our selected model on the validation data, in this step the performance analysis outlined in step 5 needs to be done with the test sample. This is the performance that best mimics what one should expect in practice upon deployment of the classification solution, assuming (as always) that the data used for this performance analysis are representative of the situation in which the solution will be deployed.

Let’s see in our case how the hit ratio, confusion matrix, ROC curve, gains chart, and profit curve look like for our test data. For the hit ratio and the confusion matrix we use 80% as the probability threshold for classification.

Hit Ratio
Logistic Regression 87.50566
First CART 87.46039
Second CART 86.28339

The confusion matrix for the model with the best test data hit ratio above:

Predicted 1 (Podium) Predicted 0 (No Podium)
Actual 1 (Podium) 12.87 87.13
Actual 0 (No Podium) 0.63 99.37

ROC curves for the test data:

Gains chart for the test data:

Finally the profit curves for the test data, using the same profit/cost estimates as we did above:

Next steps

Our current model achieves some success in predicting who will be on the podium, but these estimates could be improved with some further feature engineering.

Some paths to explore would be:

Hot streak: some drivers may have a higher chance of winning if they have won the last few races. With the momentum of past victories, they may be more likely to win again.

Home field advantage: a driver may train at the circuit hosting the race, or be from the same country. With the additional encouragement of his or her compatriots, the driver may perform better.

Fatigue: based on the number of races in a set time period, some racers might suffer from fatigue and not perform as well.

Age: Age may play a role in the performance. There could be a peak time where most racers perform well, and then that performance could decrease.

Pit stops: based on the average of past pit stops, we could estimate the number of pit stops per race, and its effect on performance.

Qualifying: some further analysis could be done on the qualifying races data. Based on those races (which determine the start position of the actual races), we could get further insight on performance.

Lap times: average or maximum lap time could be insights for race performance.