G’day, my name's Cameron and I work in the marketing department of PI.EXCHANGE. Our product, The AI & Analytics Engine, is a platform where you can make machine learning models without coding, which is handy because I’m not great at coding.
I'm a massive football fan: I started playing when I was four years old and still follow the major competitions. One week ago, I asked my manager if I could start a project using the Engine to predict the upcoming FIFA World Cup. She thought it was a great idea, so I got to work looking for data.
Before I get into the data and machine learning and predictions, I’ll quickly explain how the World Cup is organized for those who don’t know.
The tournament begins with the group stage, in which the 32 qualifying national teams are split into eight groups of four. Each team plays the other three teams in its group once, earning 3 points for a win, 1 for a draw, and 0 for a loss. The top two teams from each group advance to the knockout stage.
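For readers who like to see the arithmetic spelled out, here is a minimal Python sketch of the points system described above. The team names "A" to "D" and their results are made up purely for illustration, and real tie-breakers such as goal difference are ignored.

```python
from collections import defaultdict

def group_table(results):
    """Return (team, points) pairs sorted by points, highest first.

    `results` is a list of (home_team, away_team, outcome) tuples, where
    outcome is "win", "draw" or "lose" from the home team's point of view.
    """
    points = defaultdict(int)
    for home, away, outcome in results:
        points[home] += 0  # make sure every team appears in the table
        points[away] += 0
        if outcome == "win":
            points[home] += 3
        elif outcome == "lose":
            points[away] += 3
        else:  # anything else is treated as a draw: one point each
            points[home] += 1
            points[away] += 1
    # Simplification: teams level on points are not separated by tie-breakers.
    return sorted(points.items(), key=lambda item: item[1], reverse=True)

# Made-up group of four placeholder teams playing six games in total.
example_results = [
    ("A", "B", "win"),
    ("C", "D", "draw"),
    ("A", "C", "win"),
    ("B", "D", "lose"),
    ("A", "D", "draw"),
    ("B", "C", "lose"),
]
table = group_table(example_results)
qualifiers = [team for team, _ in table[:2]]  # top two advance
```

In this made-up group, team A finishes on 7 points and team D on 5, so those two would advance.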
The knockout stage consists of elimination games. There are no draws: if the scores are level, games go to extra time and then a penalty shootout. The rounds of the knockout stage are the Round of 16, the Quarter Finals, the Semi Finals, and the Final. All a team has to do to win the World Cup is win 4 games in a row. Easier said than done.
The Fuel (Dataset)
I’m training the machine learning model using a dataset containing 23,000 international football games going back to 1993. Each game has information about who’s playing and what their FIFA ranking was, where it’s being played, what tournament it’s in, and most importantly what the result was.
A classification machine learning model predicts the class label (dependent variable) of a given datapoint. It trains on a dataset and learns how the features (independent variables) affect the class label. You give the model features and it tells you what it thinks the class will be. Simple, right?
In our case, the features we are giving the model are the details of the upcoming group stage games, like who’s playing, what their FIFA rankings are, and so on. The class we are predicting is the home team's result: win, lose or draw.

Excerpt of two datapoints from the training dataset
The Technical Part
Now full disclosure, I’ll reiterate that I do work for PI.EXCHANGE, but I do think the AI & Analytics Engine is pretty cool. The first step is to upload the training data, which I did by uploading in CSV format because it’s easy, although files can also be imported from a database if you’re a computer whiz. Next is creating the app, which is mostly specifying that it’s a classification problem and that we’re trying to predict the result column.
Then it’s time to create the models. I trained multiple models, each using a different classification algorithm. Each algorithm has its own method of predicting the class label, and therefore a different level of accuracy. The algorithms I tried included K-nearest neighbors, Random Forest, and Logistic Regression. However, the best performing model used the LightGBM algorithm, which is based on decision trees, so I proceeded with that model.

Machine learning process
It’s important to understand how each model decides on its prediction. The Engine helps you understand why a trained model performs as it does, under the feature importance tab. It displays a summary of which features in the training data affect the predicted class the most. For the home team to win, the difference between the FIFA rankings is by far the most important variable. There does seem to be a home team advantage, because the neutral location variable is the second most important.

Feature Importance in the Engine
The Evaluation
The model uses 80% of the training data to learn, but saves 20% in order to evaluate itself. We can see this in the displayed confusion matrix. Put simply, the confusion matrix visualizes the model's performance by comparing the predicted and actual classes. The model is most accurate when predicting the label “win” and tends to predict that class most often. The model also rarely picks a draw, which can be seen in the group stage predictions.
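To make the idea concrete, here is a minimal Python sketch of an 80/20 hold-out split and a confusion matrix. It only illustrates the concepts and is not how the Engine does it internally. The function names and the six example labels are assumptions invented for this example.

```python
import random
from collections import Counter

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and hold back `test_fraction` of them for evaluation."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = len(rows) - round(len(rows) * test_fraction)
    return rows[:cut], rows[cut:]

def confusion_matrix(actual, predicted, labels):
    """Rows are the actual classes, columns are the predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]

labels = ["win", "draw", "lose"]
# Made-up actual results and model predictions for six held-out games.
actual = ["win", "win", "draw", "lose", "win", "lose"]
predicted = ["win", "win", "win", "lose", "win", "win"]

matrix = confusion_matrix(actual, predicted, labels)
# Correct predictions sit on the diagonal of the matrix.
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(actual)

for label, row in zip(labels, matrix):
    print(f"actual {label:>4}: {row}")
```

In this toy example the "draw" column is all zeros, which is the same pattern described above: a model that leans towards "win" and almost never predicts a draw.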

Multiclass Confusion Matrix in The Engine
World Cup Group Stage Predictions
Now for the good stuff. With the model ready, it was time to get predictions for each group stage game by uploading a CSV of the upcoming games in the same data schema as the training data (the same format of columns, for the unacquainted). The model gives a probability for each class and chooses the most likely outcome. This means we can calculate the expected points (XPTS) with the formula XPTS = P(Win) * 3 + P(Draw).
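As a quick illustration of the formula, here is a small Python sketch. The probabilities are invented for the example and are not actual model outputs, and the function names are my own.

```python
def most_likely(probabilities):
    """Return the class with the highest predicted probability."""
    return max(probabilities, key=probabilities.get)

def expected_points(probabilities):
    """XPTS = P(Win) * 3 + P(Draw); a loss is worth 0 points."""
    return 3 * probabilities["win"] + probabilities["draw"]

# Invented class probabilities for one game, from one team's point of view.
example = {"win": 0.5, "draw": 0.3, "lose": 0.2}

predicted_class = most_likely(example)  # "win"
xpts = expected_points(example)         # 3 * 0.5 + 0.3, i.e. about 1.8

print(predicted_class, round(xpts, 2))
```

Adding up a team's XPTS over its three group games gives its expected group stage total. Since the model predicts the home team's result, the away team's chance of winning is the predicted probability of "lose".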
These are the results:

Group Stage Predictions
Big surprise: many of the highest ranked teams are projected to win all of their group stage games, as the FIFA ranking difference is the most predictive feature. There were a few exceptions, though. Some notable upsets include USA (ranked 16th) defeating England (5th), Canada (41st) defeating Morocco (22nd), and Germany (11th) defeating Spain (7th). Finally, we have to mention that my own country, Australia (38th), was predicted to defeat Tunisia (30th), although it wouldn't be enough to make it past the group stage 😔
France has the highest XPTS, meaning that it is the strongest team in comparison to its group, closely followed by Brazil and Belgium.
World Cup Knockout Stage Predictions
Now the model has predicted which teams move on to the first round of the knockout stage. I created the test data for those games according to the predictions from the previous (group) stage, then repeated the process for the quarter finals, semi finals, and *drumroll* the final.
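The round-by-round process can be sketched in a few lines of Python. This is only an illustration under heavy assumptions: the 16 placeholder teams "T1" to "T16", their rankings, and the `pick_winner` rule (the better ranked team always wins) are all made up and stand in for the model's actual predictions.

```python
def play_round(teams, pick_winner):
    """Pair up neighbouring teams and keep the predicted winner of each game."""
    return [pick_winner(teams[i], teams[i + 1]) for i in range(0, len(teams), 2)]

def run_knockout(teams, pick_winner):
    """Repeat the rounds until one team is left; return the teams in every round."""
    rounds = [list(teams)]
    while len(rounds[-1]) > 1:
        rounds.append(play_round(rounds[-1], pick_winner))
    return rounds

# Placeholder teams with made-up rankings (lower number = better ranked).
ranks = {f"T{i}": i for i in range(1, 17)}

def pick_winner(team_a, team_b):
    # Stand-in for the model: the better ranked team always goes through.
    return team_a if ranks[team_a] < ranks[team_b] else team_b

rounds = run_knockout(list(ranks), pick_winner)
for teams in rounds:
    print(len(teams), teams)
```

Starting from 16 teams, the list halves four times (Round of 16, Quarter Finals, Semi Finals, Final), leaving a single predicted champion.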

Batch Predictions in The Engine
Round of 16:
The USA upset the higher ranked Netherlands, who topped Group A without losing a game. Serbia, the lowest ranked team remaining, were also able to upset Uruguay. Argentina, Brazil, England, France and Belgium were all favorites and progressed, while Croatia were able to defeat similarly ranked Germany.
Quarter Finals:
USA’s streak of luck ended, as they fell short against 3rd ranked Argentina. The top two ranked teams, Brazil and Belgium, were able to defeat Croatia and Serbia respectively. Finally, cross-channel rivals France and England were ranked 4th and 5th respectively, and the French progressed to the Semi Finals.
Semi Finals and Final:
All four semi-finalists were ranked in the top five by FIFA, reiterating how heavily the model weighs that feature. Predictably, first ranked Brazil and second ranked Belgium progressed to the Final, where Brazil is predicted to win.

Knockout stage predictions
So that’s it. With the given data, the Engine predicts that Brazil will win the 2022 FIFA World Cup. However, sports are notoriously hard to predict, with a million different variables at play. I’d love to improve the model by adding more variables to the training data and making a more sophisticated model.
I’ll be writing a follow up to see how the predictions ended up going, so stick around for that.
Cameron.
Interested in trying the Engine for yourself? Get started with a two-week free trial!
