Before building our own predictive model, it is important to understand what already exists.
We reviewed state-of-the-art approaches before modeling the World Cup with the data we collected. The goal was to build a foundation for our model so that it could do well in areas where previous models struggled.
Zeileis A, Leitner C, Hornik K (2018). “Probabilistic Forecasts for the 2018 FIFA World Cup Based on the Bookmaker Consensus Model”, Working Paper 2018-09, Working Papers in Economics and Statistics, Research Platform Empirical and Experimental Economics, Universität Innsbruck. https://doi.org/10.1016/j.ijforecast.2009.10.001
This work details the use of a bookmaker consensus model to predict the 2018 World Cup. The model, proposed by Leitner, Zeileis, and Hornik, computes a model-based average of each team's log-odds of winning the tournament. The average is taken over the odds quoted by different bookmakers.
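To make the averaging idea concrete, here is a minimal sketch of a log-odds consensus. It is a simplification and not the authors' implementation: it assumes decimal odds, removes the bookmaker margin by simple normalization, and takes a plain mean of the log-odds. The function names, the bookmaker labels, and the example odds are all hypothetical.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def inv_logit(x):
    """Inverse of logit: map log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def consensus_probabilities(quoted_odds):
    """Average winning log-odds across bookmakers (simplified sketch).

    quoted_odds maps bookmaker -> {team: decimal odds}. Every bookmaker
    is assumed to quote the same set of teams.
    """
    teams = sorted(next(iter(quoted_odds.values())))
    logits = {team: [] for team in teams}
    for odds in quoted_odds.values():
        # Implied probabilities usually sum to more than 1 because of the
        # bookmaker's margin; normalizing is an assumed, simple correction.
        raw = {team: 1.0 / odds[team] for team in teams}
        total = sum(raw.values())
        for team in teams:
            logits[team].append(logit(raw[team] / total))
    # Plain mean of the log-odds, then back to the probability scale.
    consensus = {t: inv_logit(sum(v) / len(v)) for t, v in logits.items()}
    norm = sum(consensus.values())
    return {t: p / norm for t, p in consensus.items()}


# Hypothetical odds for a three-team "tournament".
example_odds = {
    "bookmaker_1": {"A": 1.8, "B": 3.2, "C": 4.5},
    "bookmaker_2": {"A": 1.9, "B": 3.0, "C": 4.8},
}
probs = consensus_probabilities(example_odds)
```

Averaging on the log-odds scale rather than on the probability scale keeps very small and very large probabilities from being dominated by a single bookmaker.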
The aggregated forecasts performed well in earlier tournaments: the model correctly predicted the winner of the 2010 World Cup and three of the four semifinalists at the 2014 World Cup.
One limitation I noticed is that this model only uses past results as predictors. Its future performance will most likely suffer because it lacks important but constantly changing variables such as player statistics.
Gilch LA, Müller S (2018-06). “On Elo Based Prediction Models for the FIFA World Cup”, Working Paper 2018-09, Universität Passau. https://arxiv.org/pdf/1806.01930.pdf
This work dives deeper into pairwise ratings of countries as a basis for simulating the outcome of the 2018 World Cup. It discusses the unpredictable events that can determine the outcome of a game, such as extraordinary individual performances, injuries, and errors. It also notes that underdogs are known to win tournaments, while powerhouse teams sometimes drop out very early.
The main idea of Elo-based prediction models is that they calculate, for each pair of teams, the probability that one team beats the other. This type of model leaves a lot of room for creativity, especially in how features are represented in the individual match predictions.
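As an illustration of the pairwise idea, the sketch below uses the standard Elo expected-score formula with the conventional scale of 400. It is not the paper's exact model, and the team names and ratings are hypothetical. Note that the Elo expected score counts a draw as half a win, so it is an expected result rather than a pure win probability.

```python
def elo_expected_score(rating_a, rating_b, scale=400.0):
    """Standard Elo expected score of team A against team B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def pairwise_expectations(ratings):
    """Expected score for every ordered pair of distinct teams."""
    return {
        (a, b): elo_expected_score(ra, rb)
        for a, ra in ratings.items()
        for b, rb in ratings.items()
        if a != b
    }


# Hypothetical ratings, for illustration only.
ratings = {"Team A": 2000.0, "Team B": 1900.0, "Team C": 1600.0}
table = pairwise_expectations(ratings)
```

Only the rating difference matters here, which is why richer models add further match-level features on top of the ratings.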
Their models are built on pre-determined Elo ratings from elorating.net. They then used a Poisson regression model, which includes several other soccer-specific variables, to determine the outcome of a match between teams A and B. They reached an accuracy of 56%, which is impressive given the unpredictable factors described above.
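The sketch below shows how Poisson goal rates can be turned into match outcome probabilities. It assumes the two teams' goal counts are independent Poisson variables, which is a simplification, and it takes the rates as given. In a regression setting the rates would be predicted from Elo ratings and other covariates. The function names and the example rates are hypothetical.

```python
import math


def poisson_pmf(k, rate):
    """Probability of exactly k goals under a Poisson distribution."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def match_outcome_probabilities(rate_a, rate_b, max_goals=15):
    """Return (P(A wins), P(draw), P(B wins)) for independent Poisson goals.

    Scorelines above max_goals are ignored; for typical soccer goal
    rates the truncated probability mass is negligible.
    """
    win_a = draw = win_b = 0.0
    for goals_a in range(max_goals + 1):
        p_a = poisson_pmf(goals_a, rate_a)
        for goals_b in range(max_goals + 1):
            p = p_a * poisson_pmf(goals_b, rate_b)
            if goals_a > goals_b:
                win_a += p
            elif goals_a == goals_b:
                draw += p
            else:
                win_b += p
    return win_a, draw, win_b


# Hypothetical expected goals for teams A and B.
p_win_a, p_draw, p_win_b = match_outcome_probabilities(1.6, 1.1)
```

Predicting the most likely of the three outcomes for each match and comparing it with the actual result is one way an accuracy figure like the one above could be computed.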
Groll A, Ley C, Schauberger G, Van Eetvelde H (2018-06-08). “Prediction of the FIFA World Cup 2018 – A random forest approach with an emphasis on estimated team ability parameters”. https://arxiv.org/pdf/1806.03208.pdf
This paper focuses on a Random Forest approach to predicting the World Cup winner. The authors aggregated a collection of features into one dataset and trained the model on it. Because each simulated match result is random, they simulated the tournament 100,000 times and ranked the teams by how often each one won.
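The repeated-simulation idea can be sketched as a small Monte Carlo loop. This is a simplified illustration, not the paper's procedure: it covers only a single-elimination bracket without a group stage, and it uses an Elo-style logistic formula as a hypothetical stand-in for the trained model's match predictions. The team names, strengths, and function names are assumptions.

```python
import random
from collections import Counter


def win_probability(strength_a, strength_b):
    """Hypothetical stand-in for a trained model's match prediction."""
    return 1.0 / (1.0 + 10.0 ** ((strength_b - strength_a) / 400.0))


def simulate_knockout(teams, strengths, rng):
    """Play one single-elimination bracket and return the champion.

    The number of teams is assumed to be a power of two, and adjacent
    teams in the list meet in the first round.
    """
    remaining = list(teams)
    while len(remaining) > 1:
        next_round = []
        for i in range(0, len(remaining), 2):
            a, b = remaining[i], remaining[i + 1]
            p_a = win_probability(strengths[a], strengths[b])
            next_round.append(a if rng.random() < p_a else b)
        remaining = next_round
    return remaining[0]


def estimate_title_chances(teams, strengths, n_simulations=10_000, seed=0):
    """Share of simulated tournaments won by each team."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    wins = Counter(
        simulate_knockout(teams, strengths, rng) for _ in range(n_simulations)
    )
    return {team: wins[team] / n_simulations for team in teams}


# Hypothetical strengths for a four-team bracket.
strengths = {"A": 2000, "B": 1900, "C": 1850, "D": 1700}
chances = estimate_title_chances(list(strengths), strengths)
```

With more simulations the estimated shares stabilize, which is the reason for running a large number of repetitions.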
The paper lists 16 features, organized into five groups: economic factors, sportive factors, home advantage, team structure, and coach statistics. The economic factors are GDP per capita and population. The sportive factors are the FIFA ranking and the bookmakers' odds. Home advantage is captured by a host dummy variable and the team's confederation/continent. The team structure variables are the maximum number of teammates from the same club in each national team, the average age, the number of Champions League players, and the number of foreign players. Finally, the coach statistics are age and duration of tenure.
In terms of data, the paper used the available FIFA data from the past 8 World Cups, reasoning that older data is obsolete and unnecessary. The authors also used the FIFA ranking dataset. Finally, they used a team squad strength dataset that was calculated with a separate Poisson ranking model.
For their final model, they approached the task as a classification problem: determining how many goals the home and away teams will score in a simulated World Cup match. They then created a variable importance chart and found that team ability was the most important variable in determining the outcome of a World Cup match. They predicted that Spain was the most likely to win the World Cup with a 17.8% chance, followed by Germany at 17.1% and Brazil at 12.3%.
Although most of our features were chosen through EDA and general knowledge about football, some were informed by the literature review. All three papers helped us understand the significance and limitations of different models. The first paper, “Probabilistic Forecasts for the 2018 FIFA World Cup Based on the Bookmaker Consensus Model”, pointed us toward using recent statistics rather than relying only on past results. The second paper, “On Elo Based Prediction Models for the FIFA World Cup”, gave us insight into how much a model can gain from using rankings as a dominant predictor. That informed our comparison of models with and without rankings. The third paper, “Prediction of the FIFA World Cup 2018 – A random forest approach with an emphasis on estimated team ability parameters”, helped us explore more features, which we included in our models, such as economic factors, location factors, and player/team statistics. Furthermore, as Patrick mentioned, random forest turned out to be the best method for our predictive model, and this was also backed up by the literature.