It’s not easy to use machine learning to model natural processes. If you’re a data scientist, you’ll spend a lot of time assembling and cleaning data. You also have to decide how to model the data appropriately and what assumptions you’re making about it. And it’s important to examine the results of the study for model biases and to ask whether the best available data was used.
In this blog, I’m going to give it a whirl and see how it goes. I’ll continue my discussion of machine learning in professional sports, focusing on the National Basketball Association (NBA). This time, I’m building an algorithm that predicts game outcomes using only play-by-play data from the first half.
A wide range of factors can affect the outcome of a sporting event. Depending on the sport, big plays, penalties, or pure luck can all tip the scales in favor of one team or the other. Could an algorithm possibly see through all of this?
Let us posit a theory.
Let’s begin by establishing a basic grasp of the data that will be used.
Basketball is a simple game: work with your teammates to put the ball in the hoop while stopping the opposing team from doing the same. As the game progresses, organized teams execute “plays” to move the ball closer to their opponent’s basket or to protect their own. However, a play doesn’t always go in a team’s favor. As noted above, teams are vulnerable to big plays from either side, which can drastically alter the course of a game. Even though we only have play-by-play data from the first half of games, it’s in our best interest to have as much information as possible to better understand these shifts in the tide.
NBA plays are documented when one or more of the following occurs:
- Basket is scored (free throws included)
- Shot attempt is blocked or rebounded
- A penalty or foul is called by a referee
- Timeout is called from either team
- Turnover (e.g. a steal)
- Player substitution
Plays can arise by chance or by deliberate planning, but that is irrelevant here. What matters is whether a play indicates that a team is getting closer to victory or defeat. The only item in the above list that isn’t indicative of a win or a loss is a player substitution (this is an assumption, but a fairly safe one). The rest are justifiable as well:
- Scoring points puts a team in a better position to win
- Blocking or rebounding shots prevents the other team from scoring
- Penalties put teams in tough situations, as well as provide opportunities for the opposing team
- Teams often call a timeout to ease the pressure on themselves, make adjustments, or slow down the other team
- Turnovers present opportunities for opponents to score
Since the outcome of a game is to win or lose, “winning plays” and “losing plays” are appropriate categories for this data. With Team A competing against Team B, I set it up as follows:
Team A’s “winning” plays (which are simultaneously Team B’s “losing” plays):
- Team A makes a field goal.
- Team A blocks a shot.
- Team A gets a rebound, offensive or defensive. (Note that an offensive rebound implies an earlier shot attempt was missed, so the two plays roughly cancel each other out. For the sake of having the data, I decided to leave them in.)
- A foul or penalty is assessed to Team B. (In most cases this leads to turnovers or scoring opportunities for Team A.)
- Team B calls a timeout. (This assumes the timeout is called in response to Team A gaining an advantage.)
- Team B commits a turnover, creating an opportunity for Team A.
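The categorization above can be sketched as a simple labeling function. The event names here are hypothetical stand-ins, not the actual schema of any NBA play-by-play feed:

```python
# Hypothetical sketch of the play labeling described above. Event names
# ("A_field_goal", "B_turnover", ...) are illustrative, not the real schema.

WINNING_FOR_A = {"A_field_goal", "A_block", "A_rebound",
                 "B_foul", "B_timeout", "B_turnover"}
LOSING_FOR_A = {"B_field_goal", "B_block", "B_rebound",
                "A_foul", "A_timeout", "A_turnover"}

def label_play(event: str) -> int:
    """+1 for a winning play for Team A, -1 for a losing play,
    0 for neutral events such as substitutions."""
    if event in WINNING_FOR_A:
        return 1
    if event in LOSING_FOR_A:
        return -1
    return 0

labels = [label_play(e) for e in
          ["A_field_goal", "B_turnover", "A_substitution", "B_field_goal"]]
# labels == [1, 1, 0, -1]
```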
Now the NBA’s play-by-play data can be arranged so that a machine learning model can understand it. That said, there are a few more decisions and data points from this experiment I’d like to share.
To determine the winner of a basketball game, only points matter. Once the clock strikes zero, it doesn’t matter how many penalties, substitutions, or turnovers occurred. So the most important types of plays get a separate “scoring plays” variable: if a “winning play” also scores points, it is counted as a “scoring play” instead. The goal is to underline the relevance of scoring as the model learns.
Depending on how much time is left in a game, some plays have a greater impact than others. In a tight game, a late basket or block can decide the outcome. As a result, I added the number of seconds left in regulation (48 minutes) as an additional variable.
Momentum is extremely tough to measure in the NBA. Players feel it when a string of good or bad plays happens in quick succession, and research papers have been published on the topic after years of investigation. Here, I use a simpler proxy: the number of scoring, winning, and losing plays that happened in the final 60 seconds of play. Each type of play is scaled by a different weight to underline its significance. I used this momentum measure throughout my experiments.
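The exact momentum equation isn’t reproduced here, so the sketch below uses assumed weights for the three play types; only the trailing 60-second window comes from the description above:

```python
# Sketch of a momentum feature: count scoring, winning, and losing plays in
# the trailing 60-second window and combine them with weights. The weights
# (3, 1, 1) are assumptions for illustration, not the article's values.

def momentum(plays, now, window=60.0, w_score=3.0, w_win=1.0, w_loss=1.0):
    """plays: list of (timestamp_sec, kind), kind in {"score", "win", "loss"}."""
    recent = [kind for t, kind in plays if now - window <= t <= now]
    return (w_score * recent.count("score")
            + w_win * recent.count("win")
            - w_loss * recent.count("loss"))

plays = [(100, "win"), (140, "score"), (150, "loss"), (155, "win")]
momentum(plays, 160)  # 3*1 + 1*2 - 1*1 = 4.0
momentum(plays, 230)  # all plays fell out of the window -> 0.0
```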
It took me a full day to arrange this dataset so that an algorithm could learn from it. I found that there are approximately 200 plays in the first half of an NBA game, and since every club plays 82 games in a regular season, we have a lot of play-by-play data to work with.
There’s no way to predict the outcome from a single piece of data. This reminded me of machine learning on image data, where you’re generally dealing with a million or more pixels, and a single pixel tells you very little about what an image shows. So I began with a neural network.
A neural network’s fundamental structure. The author created this image.
The input layer of a neural network is where data is fed into the system, and the network then “mathematically” traverses a series of “hidden” layers made up of nodes. A prediction for the input data (e.g. “dog” or “cat” for an image, or “win” or “loss” for NBA data) is made at the output layer.
Predictions are typically probabilistic, e.g. 70 percent “dog” and 30 percent “cat.” A procedure known as “back-propagation” is then used to correct the errors made in the most recent prediction. This, in essence, is the training process. After training, new data can be fed into the input layer so that a prediction is produced at the output layer.
Using a neural network turned out to be a bad decision. For this project, I used scikit-learn’s Multi-layer Perceptron (MLP) classifier. I believe my model under-fitted the NBA data: changing the number of hidden layers or the number of nodes in each hidden layer had little effect on the predictions.
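For reference, a minimal version of that setup with scikit-learn’s MLPClassifier looks like the following. The features are synthetic stand-ins, since the real play-by-play feature extraction isn’t reproduced here:

```python
# Minimal MLP sketch on synthetic stand-in features (e.g. winning plays,
# losing plays, scoring plays, seconds remaining). Not the article's data.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))              # 200 fake game snapshots, 4 features
y = (X[:, 0] - X[:, 1] > 0).astype(int)    # 1 = win, 0 = loss

clf = MLPClassifier(hidden_layer_sizes=(32, 16), solver="lbfgs",
                    max_iter=2000, random_state=0)
clf.fit(X, y)
acc = clf.score(X, y)  # training accuracy on this easy toy problem is high
```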
I also increased the amount of data provided from half to three-quarters of each game. This 50% increase in data did not improve the model’s predictions. If a team had a solid record (winning roughly 65 percent of games) in the 2020–2021 NBA season, the model would simply predict that they always win; teams with losing records would always be predicted to lose. This suggests that the data I gave the MLP classifier was insufficient.
I’ll have to start from scratch.
Support Vector Machines
Only two outcomes mattered in this experiment: winning and losing. That is why I assumed a machine learning model could distinguish between instances of each. This leads me to the Support Vector Machine (SVM).
In an SVM, the hyperplane separates game results by their margin of victory or defeat.
To learn more about SVMs, check out this excellent post by Rushikesh Purple. In short, the SVM algorithm finds the hyperplane that best separates our data into classes. In other words, suppose I plot all the data points that resulted in a win alongside all the data points that resulted in a loss; the SVM finds the plane (or line) on that plot that best distinguishes wins from losses.
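That idea can be seen on toy data. Here, scikit-learn’s LinearSVC (used purely for illustration, on made-up 2-D points) recovers a separating line between two labeled clusters:

```python
# Toy illustration of an SVM finding a separating hyperplane between
# "win" points and "loss" points. The data is synthetic.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
wins = rng.normal(loc=[2, 2], scale=0.5, size=(50, 2))      # "win" games
losses = rng.normal(loc=[-2, -2], scale=0.5, size=(50, 2))  # "loss" games
X = np.vstack([wins, losses])
y = np.array([1] * 50 + [0] * 50)

svm = LinearSVC().fit(X, y)
acc = svm.score(X, y)  # the clusters are far apart, so accuracy is very high
```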
I replaced the MLP classifier with scikit-learn’s LinearSVC classifier. It was a game-changer from the get-go: teams were no longer uniformly forecast to win or lose based purely on their records. Results from the first half of games (i.e. approximately the first 200 plays) are shown below. Without any hyperparameter tuning, the following results were obtained:
The 2020–2021 Nets, Pistons, Rockets, Clippers, Thunder, Kings, and Raptors all have high accuracy rates, so things are looking up. League-wide, accuracy is 60.56 percent. The SVM hyperparameters were not tuned, so there may be room for improvement. Before moving on, let’s consider what a good success metric looks like.
Ideally, each of the teams above would be predicted with 100% accuracy. Given the nature and context of competitive sports, the available data, and the assumptions made, that is highly impractical. Achieving 100% accuracy for certain teams while achieving 20% for others is similarly wrong: a model with perfect accuracy would be overly tailored to the current season and incapable of extrapolating to other seasons in which rosters and player experience have changed. What I optimized for instead was the following:
Every night in the NBA, a team faces another team from the league, and every game produces exactly one winner and one loser. That means the league averages 41 wins and 41 losses per team over an 82-game season. If this model is sound, it should forecast a league average of 41 wins, or a win rate of .500 if we are not predicting a full 82 games. After running the model, we’ll compare its results to the league’s average number of victories.
Rather than focusing on picking each team’s games exactly, this metric focuses on creating a collective expectation for the entire league. The NBA is what it is because of the interactions between its teams; no one team operates in a vacuum.
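As a sketch, that sanity check is just the fraction of predicted wins across all team-games (the function name here is mine):

```python
# League-wide sanity check: across all predicted games, the fraction of
# predicted wins should sit near .500, since every game has exactly one
# winner and one loser.

def league_avg_win_rate(predictions):
    """predictions: list of 0/1 win predictions across all team-games."""
    return sum(predictions) / len(predictions)

preds = [1, 0, 1, 1, 0, 0, 1, 0]
league_avg_win_rate(preds)  # 0.5
```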
Hyper Parameter Optimization (Validation)
The findings above came from an untuned LinearSVC model. Switching to sklearn’s SVC classifier gives us numerous parameters to examine. Mohtadi Ben Fraj’s post “In-Depth: Parameter Tuning for SVC” will serve as my guide for sweeping hyperparameters across the polynomial and RBF kernels of the SVM.
C is a regularization hyperparameter; the strength of the regularization is inversely proportional to C. Gamma is a kernel coefficient used for non-linear hyperplanes; the higher its value, the more the model tries to match the training data exactly. Both are swept from 0.1 to 1000 on a logarithmic scale.
Using filename conventions for cross-referencing, I did sweeps on the following cases:
- gamma sweep for RBF kernel “svc_rbf_gamma_x”
- gamma sweep for polynomial kernel (for degrees 0–6) “svc_poly_gamma_x_degree_y”
- C sweep for RBF kernel “svc_rbf_c_x”
- C sweep for polynomial kernel (for degrees 0–6) “svc_rbf_c_x_degree_y”
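A hedged sketch of what those sweeps look like in code, using synthetic stand-in features and the ranges from above; the dictionary keys mirror the filename convention:

```python
# Sketch of the hyperparameter sweeps: gamma on a log scale from 0.1 to
# 1000, RBF and polynomial kernels, polynomial degrees 0-6. The features
# are synthetic stand-ins, not the real play-by-play data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X_train = rng.normal(size=(100, 4))
y_train = (X_train[:, 0] > 0).astype(int)

results = {}
for gamma in np.logspace(-1, 3, 5):          # 0.1, 1, 10, 100, 1000
    # RBF kernel gamma sweep
    clf = SVC(kernel="rbf", gamma=gamma).fit(X_train, y_train)
    results[f"svc_rbf_gamma_{gamma}"] = clf.score(X_train, y_train)
    # polynomial kernel gamma sweep, degrees 0-6
    for degree in range(7):
        clf = SVC(kernel="poly", gamma=gamma, degree=degree).fit(X_train, y_train)
        results[f"svc_poly_gamma_{gamma}_degree_{degree}"] = clf.score(X_train, y_train)
```

The C sweeps follow the same pattern with `C=` substituted for `gamma=`.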
The notebook’s full run took roughly 30 hours on my MacBook, producing enough data for 16 different charts. Posting all of them would be overkill, but you can find the complete source code, along with the data and charts, via the link at the end of this post. The more relevant findings are discussed below.
The polynomial sweep for C shows virtually no correlation between C and either in-season prediction accuracy or league win rate. This was disheartening, but I decided to dig a little deeper for less obvious patterns. Trend lines capture the overall rate of change across the whole sweep. Every trend line for Average Season Prediction Accuracy (ASPA) slopes downward. Since ASPA rises as C drops (i.e. as regularization strengthens), lower values of C might produce a better fit to the data.
For League Average Win Rate (LAWR), three trend lines are declining and four are rising. Here the trend line’s direction matters less: the score should be close to 0.500, which implies the model predicts an equal number of winners and losers across the league. The LAWR trend for a degree-4 polynomial is the closest to 0.500, and in keeping with the ASPA tendency, its trend line also slopes downward. A degree-4 polynomial with lower values of C may produce better results.
Now, let’s have a look at the gamma sweep results.
The gamma parameter controls how closely the model tries to match the training data. Across all polynomial sweeps, the trend line decreases slightly or remains flat, which tells us that overfitting the training data typically reduces accuracy. Here, gamma is best omitted or kept at lower values.
Beyond gamma and C, it is worth noting that LAWR is generally better for higher-degree polynomials: degrees 4, 5, and 6 produce trend lines closer to 0.500 than the lower degrees do.
I’d also note that the degree-0 polynomials almost certainly underfit the data, because there aren’t nearly enough model parameters (as opposed to hyperparameters) to make sense of the training data. Their results should be taken with a grain of salt.
With an RBF kernel, unlike the polynomial one, gamma has a strong effect: LAWR climbs by around 2.5 percent for every power-of-10 increase in gamma, and gamma=10 hits the sweet spot for the target LAWR of 0.500. Clearly, gamma’s usefulness depends on which SVM kernel is employed.
Sweeping C for the RBF kernel, LAWR approaches 0.500 at higher values of C. Regularization thus has the opposite effect on the RBF kernel than it did on the polynomial kernel: here, LAWR (and ASPA) improve as regularization diminishes.
For testing, I’m going to train an SVM with C=0.01 and gamma=0.001 and see how accurately it predicts game results from the 2021–22 season.
Testing requires a setup comparable to training. I built a model based on the first 25 games of the 2021–22 NBA season. Because we’re making predictions for a new season, the predictions must account for the shifting makeup of the league’s rosters. With the model set up as described above, the ASPA was 0.5842 and the LAWR was 0.499.
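The final configuration, as a sketch. Only C=0.01 and gamma=0.001 come from the text; the RBF kernel and the synthetic features are my assumptions here:

```python
# Sketch of the final test configuration (C=0.01, gamma=0.001).
# Kernel choice and features are assumptions; data is synthetic.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X_train = rng.normal(size=(150, 4))      # stand-in first-half features
y_train = (X_train[:, 0] > 0).astype(int)
X_test = rng.normal(size=(50, 4))        # stand-in 2021-22 games

clf = SVC(kernel="rbf", C=0.01, gamma=0.001).fit(X_train, y_train)
preds = clf.predict(X_test)              # one win/loss call per game
```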
At best, this is a modest model. It works well for teams with strongly winning or losing records; in other words, its predictions line up with trends already present in the data. Teams with a win–loss ratio near 1.0 are harder to predict: for them, the model must rely less on recent game trends and more on the play-by-play data itself. The standard deviation of test results for each team can be used to quantify this effect.
A team’s predicted win total can range from 0 to 25. With C=0.01 and gamma=0.001, the standard deviation is 12.21. Since this is close to the mean win value of 12.5, we may assume there is a low amount of noise in the data. With very modest adjustments to the settings, a standard deviation of 9.65 was achieved. With a LAWR of 0.499, predictive performance is equivocal, but the results are consistent in their accuracy.
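To make the spread metric concrete, here is how the per-team standard deviation would be computed. The win counts below are made-up illustrations, not the article’s actual predictions:

```python
# Spread of per-team predicted win counts out of 25 games.
# These numbers are hypothetical, chosen only so the mean lands at 12.5.
import statistics

predicted_wins = [25, 24, 20, 14, 12, 11, 5, 1, 0, 13]

mean = statistics.mean(predicted_wins)       # 12.5
spread = statistics.pstdev(predicted_wins)   # population standard deviation
```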
Next Steps and Conclusion
Using only play-by-play data from the first half of an NBA game, a machine learning algorithm can predict the outcome of the game, though with a degree of uncertainty. An appropriate next step would be to re-evaluate the data used in model training: extracting more features from the play-by-play data may let the model learn more about the game.
I’d also like to propose a whole new paradigm: looking at this from the player’s point of view. A model built this way would be more stable across seasons and would account for each player’s contribution to the squad they are currently on. Jayson Tatum, for example, has the most missed shots in the league at the time of this writing, while Kevin Durant is the league’s leading scorer with the most attempted shots. Because Durant will continue to score throughout the rest of a game, the significance of Tatum missing a shot is negligible by comparison. The challenge here would be shifting from 30 teams to over 500 players while maintaining a rich data set.
Despite its lackluster results, this project is nonetheless interesting to me because of what I’ve learned. There is a great deal of room for experimentation here, from feature extraction to hyperparameter tuning. For future research, I think this would be an excellent experiment to pick back up and delve into further. :)
All of the source code used for this post is available in this GitHub repository.