As explained in my article The CFB Ranking Problem, the nature of college football scheduling and competition makes objectively rating team skill very difficult. Forward Progress is my attempt to create a data-driven computer model for ranking CFB teams in the vein of the systems used in the now-defunct Bowl Championship Series.
The goal of the algorithm, as with most computer football polls, is to order a population of teams by skill. There are a number of strategies for determining this order, but my approach was to build a model that can predict the winner of a theoretical matchup, then pair each team with every other team and total the predicted wins for each team. I decided to predict winners by predicting how many points a given team would score against any other team and comparing those scores for each matchup.
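The round-robin tallying idea can be sketched in a few lines of R. Everything here is illustrative: `predict_score` is a hypothetical stand-in for the fitted model described later, and the skill values are made up.

```r
# Hypothetical stand-in for the score-prediction model described below.
# The hard-coded "skills" are invented numbers purely for illustration.
predict_score <- function(team, opponent) {
  skills <- c(Alabama = 35, Ohio_State = 33, Rutgers = 17)
  skills[[team]] - 0.2 * skills[[opponent]] + 21
}

# Pair every team with every other team, predict both scores in each
# matchup, and total the predicted wins for each team.
rank_teams <- function(teams) {
  wins <- setNames(numeric(length(teams)), teams)
  for (a in teams) for (b in teams) {
    if (a == b) next
    if (predict_score(a, b) > predict_score(b, a)) wins[a] <- wins[a] + 1
  }
  sort(wins, decreasing = TRUE)
}

rank_teams(c("Alabama", "Ohio_State", "Rutgers"))
```

The ranking then falls out of the win totals: the team predicted to beat the most opponents sits at the top.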
The algorithm is built in R, a scripting language focused on statistics and data science.
I started by scraping publicly available websites for game-level statistics from the previous five years. This included basic data like yardage, total passes and completions, rushes, penalties, and scores. As in any analysis project, the next step was to clean the data and fix mistakes, though fortunately the datasets I collected were almost entirely intact with no missing values.
I created a few new features, like completion percentage and pass/rush balance, that I felt would be informative, then removed features that were highly correlated with one another to avoid clouding the final model.
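A minimal sketch of that feature-engineering step might look like the following. The column names (`completions`, `pass_att`, `rush_att`, `pass_yds`) and the 0.95 correlation cutoff are assumptions for illustration, not the actual scraped schema or threshold.

```r
# Toy game-level data; column names are illustrative, not the real schema.
games <- data.frame(
  completions = c(18, 22, 25, 12),
  pass_att    = c(30, 35, 40, 20),
  rush_att    = c(38, 30, 25, 45),
  pass_yds    = c(210, 260, 310, 130)
)

# Derived features: completion percentage and pass/rush balance.
games$comp_pct          <- games$completions / games$pass_att
games$pass_rush_balance <- games$pass_att / (games$pass_att + games$rush_att)

# Drop one column from any pair correlated above an arbitrary threshold.
cors      <- cor(games)
high      <- which(abs(cors) > 0.95 & upper.tri(cors), arr.ind = TRUE)
drop_cols <- unique(colnames(cors)[high[, "col"]])
games_pruned <- games[, setdiff(names(games), drop_cols)]
```

On this toy data the raw passing columns are nearly collinear with each other, so the pruning step removes the redundant ones while keeping the derived completion percentage.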
Because Forward Progress is designed to predict scores, the model cannot use statistics generated in a game to predict the score of that same game. Training a model that way would predict a game’s score from its own stat line, which is interesting but not useful for games that haven’t been played. Instead I created linear models that predicted each feature based on the team that produced the statistic, its opponent, and whether the team was playing at home.
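One of those per-feature models could be sketched with a plain `lm()` call. The `game_log` data frame, its columns, and the single-letter team names are invented stand-ins for the real dataset.

```r
# Invented stand-in for the scraped game log: team identity, opponent,
# home-field flag, and one statistical feature to predict.
game_log <- data.frame(
  team     = c("A", "A", "B", "B", "C", "C", "A", "B"),
  opponent = c("B", "C", "A", "C", "A", "B", "B", "A"),
  home     = c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE),
  pass_yds = c(250, 280, 180, 210, 150, 140, 265, 175)
)

# One linear model per feature: the team, its opponent, and home field
# enter as predictors; the in-game stat is the response.
pass_model <- lm(pass_yds ~ team + opponent + home, data = game_log)

# Predict the passing yards team A should manage at home against C.
predict(pass_model, data.frame(team = "A", opponent = "C", home = TRUE))
```

Each statistical feature would get its own model of this shape, so a future matchup can be described entirely by predicted stats rather than stats that don't exist yet.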
At this point in the model build I was at a crossroads. If I trained these feature models on the entire dataset, the models would be highly biased, as team performance waxes and wanes over multiple seasons; assuming that a team’s skill in a given feature is constant would introduce significant error into future predictions. On the other hand, limiting the feature models to only the most recent game results would cause overfitting and highly variable predictions. As a compromise, I generated weekly models for each feature that were trained on the previous 10 games for each team and tied those predicted features to that team’s score in its next matchup.
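The rolling-window compromise can be sketched as a function that fits each weekly feature model on only a team's last ten games. The single-feature formula and the synthetic game log are illustrative assumptions.

```r
# Fit a feature model on only the most recent `window` games for one team.
# The single-predictor formula is a simplification for illustration.
rolling_feature_model <- function(team_games, window = 10) {
  # team_games: one team's game log, ordered oldest to newest
  recent <- tail(team_games, window)
  lm(pass_yds ~ home, data = recent)
}

# Synthetic 15-game log for a single team.
set.seed(1)
one_team <- data.frame(
  week     = 1:15,
  home     = rep(c(TRUE, FALSE), length.out = 15),
  pass_yds = round(rnorm(15, mean = 240, sd = 30))
)

fit <- rolling_feature_model(one_team)
nrow(model.frame(fit))                 # only the 10 most recent games are used
predict(fit, data.frame(home = TRUE))  # predicted feature for the next game
```

Refitting these windows week by week lets the predicted features track a team's current form instead of a multi-season average.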
Once the prediction data was created, I trained a Least Angle Regression model to predict the scores for each game. The final Root Mean Squared Error for those score predictions was around 12.5, meaning the model can predict the number of points a team will score against a given opponent to within 12.5 points on average. Not bad, but not perfect either. After translating predicted scores into matchup outcomes, the model correctly classifies the winner 72.5% of the time with a kappa of 0.43. Overall I’m pretty pleased with these results for a first attempt at modeling something as unpredictable as college football, and I look forward to seeing how the model performs for the remainder of the 2017 season.
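For reference, the RMSE metric quoted above is straightforward to compute. The real model is Least Angle Regression (available in R through the `lars` package); in this sketch an ordinary `lm()` stands in so the example runs with base R alone, and the data are synthetic.

```r
# Synthetic predicted-feature data standing in for the real training set.
set.seed(42)
n   <- 200
sim <- data.frame(
  pred_pass_yds = rnorm(n, 250, 40),
  pred_rush_yds = rnorm(n, 150, 30)
)
sim$score <- 0.05 * sim$pred_pass_yds + 0.08 * sim$pred_rush_yds + rnorm(n, 0, 10)

# Ordinary least squares here; the actual project used Least Angle
# Regression (e.g. lars::lars) for the score model.
score_model   <- lm(score ~ pred_pass_yds + pred_rush_yds, data = sim)
fitted_scores <- predict(score_model)

# Root Mean Squared Error: average distance between predicted and actual scores.
rmse <- sqrt(mean((sim$score - fitted_scores)^2))
```

An RMSE of roughly 12.5 points on real games means a predicted score is a noisy but usable signal, which is why the win/loss translation is evaluated separately with accuracy and kappa.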