An Automated Framework for Predicting Sports Results
How I built an automated machine learning framework to predict the results of international rugby matches.
Chris Brownlie
I recently decided to start a side project which combined my love of rugby with my love of Data Science - and so Mel Rugby was born. Mel is a framework I created in the R programming language which first scrapes data on rugby matches from various online sources, then runs it through an ensemble of pre-trained machine learning models to produce predictions of the final score, and finally tweets the predictions and their accuracy to the @mel_rugby Twitter account.
In this article I will discuss some of the important points I had to consider, how I approached the development of the framework (as a non-expert in machine learning) and finally, some lessons I have learned so far.
Motivation
I have played rugby since I was 4 years old, so when I was thinking of a potential project about a month ago it was one of the first topics that came to mind. Given the Rugby World Cup (RWC) was due to start a couple of weeks later it seemed to be perfect timing. So I set about doing some research on how best to approach predicting the results of sports matches - most of this focussed on football (soccer) or the NFL. From previous experience, I knew that the best thing to do next was just start and see how far I could get, so that's what I did!
Tools
For this project, I used R, for no other reason than it is the language I am most proficient in. The same results could be (perhaps more efficiently) obtained using Python, but personally, that would have taken me far longer. The main packages I used were: rvest, xml2 & RSelenium for scraping the data; dplyr, tidyr, stringr & lubridate for data wrangling; caret & neuralnet for training the model; and rtweet for communicating the results on Twitter. It is also worth noting at this point that I set out to make the whole framework as automated as possible. So the whole project is essentially a set of nested functions which call each other in turn. Below is a sketch of what the flow currently looks like:
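In code terms the nesting looks roughly like the sketch below - note the function names here are illustrative stand-ins rather than the exact ones in the project:

```r
# Top-level function: each step calls the next in turn
master <- function() {
  raw_data   <- scrape_matches()            # rvest/xml2/RSelenium scraping
  clean_data <- wrangle_data(raw_data)      # dplyr/tidyr/stringr/lubridate wrangling
  preds      <- predict_scores(clean_data)  # ensemble of pre-trained models
  tweet_predictions(preds)                  # post results via rtweet
  invisible(preds)
}
```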
At present then, all that is required is to run ‘master()’ once every few days - this can be scheduled or hosted online for full automation.
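For example, on a Unix-like machine the cronR package can register that run as a scheduled cron job (this is my suggestion rather than what the project currently uses; taskscheduleR does a similar job on Windows, and the script path below is a placeholder):

```r
library(cronR)

# Wrap the framework in a script that simply calls master(), then schedule it
cmd <- cron_rscript("~/mel_rugby/run_master.R")  # placeholder path
cron_add(command = cmd, frequency = "daily", at = "8AM", id = "mel_rugby")
```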
Getting the data
The first and foremost problem to overcome was obtaining some data I could use to model rugby matches. I decided early on that looking into every rugby match ever played would increase the time and effort required considerably, not only because it would mean scraping data with different structures, but also because the factors affecting international matches and domestic/club matches are unlikely to be identical, so considering both would probably require separate models. With the World Cup approaching and international match data being the easiest to obtain, I decided to only consider international matches, with a view to extending the framework to domestic games at a later date if possible. After looking online I found a website which contained basic data for all international matches going back to the early 20th century.
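As a rough illustration of the scraping step, the rvest pattern is short - the URL and selector below are placeholders rather than the actual source I used:

```r
library(rvest)

# Placeholder URL - not the actual data source
page <- read_html("https://example.com/international-results")

# html_table() converts an HTML <table> into a data frame
results <- page %>%
  html_element("table") %>%
  html_table()
```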
Choice of variables
There were two general approaches I could have used when considering variables - focussing on either team-level variables or individual player variables. Here I was constrained by the data that was available: individual player data for every match is harder to obtain and far less reliable (at least for the data source I was using). By player data I mean information such as height, weight and recent performances at the individual level. I was, however, able to obtain the number of appearances each player had previously made for their country, and this was incorporated into my models.
Once I decided to focus on team-level variables, I wrangled the various data sources to get the obvious characteristics of a team which might affect the outcome of a match, including:
- their total number of caps (which I eventually split into total caps of forwards and total caps of backs);
- their recent performance (over the last 5 games);
- their current ranking (for this I used my own - very similar but slightly different - version of the official IRB rankings);
- and whether they were playing at home, away or on neutral ground (as is the case for most of the RWC games), among others.
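As a sketch of how one of these might be built - the recent-performance feature, say - the wrangling looks something like this (column names are hypothetical, and zoo's rolling mean is my addition rather than one of the packages listed above):

```r
library(dplyr)
library(zoo)

team_features <- match_history %>%
  group_by(team) %>%
  arrange(match_date, .by_group = TRUE) %>%
  mutate(
    # Mean points difference over the previous 5 matches,
    # lagged so the current match doesn't leak into its own feature
    recent_form = lag(rollapplyr(points_diff, width = 5, FUN = mean, partial = TRUE))
  ) %>%
  ungroup()
```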
Choice of models
This part of the process was surprisingly straightforward and quick. I am currently in the process of improving the existing models and hope to have 'Phase 2' models implemented soon. I am by no means an expert in machine learning, so once I had done some cursory research into how the problem had previously been approached, I started playing around with different models and combinations of models until I found a combination that worked relatively well. For those who are interested, I started out using a principal components neural network (PCANNet) and a linear Support Vector Machine (LSVM), both of which take the characteristics of both teams as inputs and output a predicted score for each. I then took a simple average of the two score predictions, as this proved to be most accurate.
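In caret terms, the two models and the averaging step look roughly like the sketch below (the data frame and column names are placeholders, and it shows the prediction of one team's score only; "pcaNNet" and "svmLinear" are caret's method strings for these model types):

```r
library(caret)

ctrl <- trainControl(method = "cv", number = 5)

# Principal components neural network (PCA pre-processing + nnet)
pcannet_fit <- train(home_score ~ ., data = train_df,
                     method = "pcaNNet", trControl = ctrl, trace = FALSE)

# Linear support vector machine (via kernlab)
lsvm_fit <- train(home_score ~ ., data = train_df,
                  method = "svmLinear", trControl = ctrl)

# Simple average of the two predictions
ensemble_pred <- (predict(pcannet_fit, test_df) + predict(lsvm_fit, test_df)) / 2
```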
Soon after this first phase of models had been trained and tested on a few upcoming matches, it became clear that one area in which the models were failing was matches where one team 'runs away with it' - games where one team gains momentum and ends up winning very comfortably (by more than 20 points). I termed these games 'big victories' and decided to develop some models in an attempt to redress this, which led to the production of:
- A lasso regression model for predicting the score, given the assumption that one of the teams will win by an exceptionally large margin.
- A random forest classifier which attempts to determine whether a team is likely to win by an exceptional margin in a given game.
This means that for every game, both a predicted score and a predicted 'big victory' score are produced for each team. If the random forest then suggests that one of the teams in a match will completely annihilate their opposition, I use the 'big victory' score as the prediction. This has had a positive effect - for example, at the time of writing New Zealand had recently beaten Canada 63–0. The original model predicted a 43–5 victory for New Zealand, but the random forest classified the game as likely to end in a big victory, so the 'big victory' prediction of 66–0 was selected, considerably improving the model's performance.
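Pulled together, the selection logic works something like this sketch (placeholder names again; in caret, "glmnet" with alpha = 1 gives a lasso and "rf" a random forest):

```r
library(caret)

ctrl <- trainControl(method = "cv", number = 5)

# Classifier: will this match end in a 'big victory' (margin > 20 points)?
# big_victory is assumed to be a yes/no factor column
big_win_clf <- train(big_victory ~ ., data = train_df,
                     method = "rf", trControl = ctrl)

# Lasso regression trained under the assumption of a big victory
lasso_fit <- train(home_score ~ ., data = big_victory_train_df,
                   method = "glmnet",
                   tuneGrid = expand.grid(alpha = 1,  # alpha = 1 => lasso
                                          lambda = 10^seq(-3, 0, length.out = 20)),
                   trControl = ctrl)

# Use the 'big victory' prediction only when the classifier calls for it
standard_pred <- ensemble_pred                 # from the Phase 1 models above
big_win_pred  <- predict(lasso_fit, test_df)
is_big_win    <- predict(big_win_clf, test_df) == "yes"
final_pred    <- ifelse(is_big_win, big_win_pred, standard_pred)
```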
Communicating outputs
From the beginning of this project I wanted to ensure there was a public record of the predictions the framework made, both for transparency and to increase engagement with the project. I decided on Twitter for a few reasons, most importantly because of how easy the rtweet package makes automating a Twitter account. So I set up the @mel_rugby Twitter account and wrote some functions which take the predictions and tweet them, adding in emojis depending on how well the predictions are doing to make Mel seem a bit more sentient and engaging.
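The tweeting side really is short with rtweet; a minimal version, assuming authentication is already configured and using a made-up helper and message format, looks like this:

```r
library(rtweet)

# Hypothetical helper: compose and post a single prediction
tweet_prediction <- function(home, away, home_pred, away_pred) {
  status <- sprintf("Mel predicts: %s %d - %d %s", home, home_pred, away_pred, away)
  post_tweet(status = status)
}

tweet_prediction("New Zealand", "Canada", 66, 0)
```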
Lessons learned
I am now at a point where I manually open up the project every three days and run the master function. I still do this in case an error appears unexpectedly (this hasn't happened for several runs, so fingers crossed that continues!), but in reality I could schedule the framework to run on my laptop at set times, or pay for it to be hosted online so it runs without any involvement from me. One of the key lessons I have learned from this project is something I have heard time and time again from Data Science experts but have only just come to appreciate: data sourcing and cleaning takes considerably more time and effort than actually training the machine learning model.
Current performance and future additions
Mel is currently a very mixed bag in terms of predictions: many have been bang on, others less so. I am excited for the Phase 2 models to be put into production, as I believe they will improve the predictions considerably. For more information on how Mel is doing with predicting results and scores, check out the Mel Rugby Twitter account!
In terms of future additions to the project, I have started work on a web application which would allow people to interact with Mel and see how score predictions change based on, for example: the exact players picked in the matchday squad; the outcomes of games to be played between now and the game in question; the weather on the day; and other variables. As mentioned above, I am also working on Phase 2 of the models, where I am looking to a) improve the quality and quantity of data being used as inputs, b) improve prediction of 'upsets', where a low-ranked team beats a high-ranked team, and c) investigate player-level models of performance and how they can be used to augment score predictions. I am also planning to increase the sophistication of Mel the Twitter bot so that it can be interacted with and respond to queries. This will probably culminate in Mel being hosted somewhere so I can leave it running indefinitely!
Conclusion
Thanks for reading, I hope you found this interesting! If you’d like me to write another post elaborating on any of the above then please leave a comment below. I am considering a more detailed post either about setting up a Twitter bot, or going into more detail about the data transformation process. If you found this useful or interesting please do ‘clap’ and share this with anyone else who might enjoy it!
P.S. If you're wondering about the name: I originally called the project rugbyML (Machine Learning) but wanted a human name for the bot in order to give it more personality, so 'ML' -> Mel!
This article was originally published on Medium.