This is a group project from Data Science Workflows in UBC’s Master of Data Science program.
We developed machine learning models to predict the court minutes of NBA players in future games. The whole project workflow is reproducible with Make and Docker. More details are specified as follows:
Authors
- Jarvis Nederlof, Roc Zhang, Jack Tan
Project’s GitHub repo
- Project’s GitHub repo can be found here.
Project Report
- The final report can be found here.
About
We built a light gradient boosting regression model to predict the number of minutes an NBA player is expected to play in an upcoming game. Our final model performed well on an unseen test data set, achieving a mean squared error ($MSE$) of 38.24 and a coefficient of determination ($R^2$) of 0.65. Both metrics improve on our baseline, a player’s 5-game average of minutes played, which scored an $MSE$ of 50.24 and an $R^2$ of 0.55. These results are valuable in the context of Daily Fantasy Sports, and the prediction model could be used as is. However, we note possible areas of further improvement that, if explored, could provide better predictions and more value.
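To make the comparison above concrete, here is a minimal sketch of how a rolling-average baseline can be scored with $MSE$ and $R^2$. It is not the project's code: the function names, the choice to average over fewer than five games early in a player's history, and the sample minutes are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence


def rolling_mean_baseline(minutes: Sequence[float], window: int = 5) -> List[Optional[float]]:
    """Predict each game's minutes as the mean of up to `window` previous games.

    Returns None for a game with no earlier games to average over.
    (Assumption: shorter histories are averaged as-is.)
    """
    preds: List[Optional[float]] = []
    for i in range(len(minutes)):
        history = minutes[max(0, i - window):i]  # strictly earlier games only
        preds.append(sum(history) / len(history) if history else None)
    return preds


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up minutes for one hypothetical player (not project data).
minutes = [30.0, 32.0, 28.0, 35.0, 31.0, 33.0, 29.0]
baseline = rolling_mean_baseline(minutes)

# Score only the games that have a baseline prediction.
pairs = [(t, p) for t, p in zip(minutes, baseline) if p is not None]
y_true = [t for t, _ in pairs]
y_pred = [p for _, p in pairs]
print(f"baseline MSE={mse(y_true, y_pred):.2f}, R^2={r2(y_true, y_pred):.2f}")
```

A model's predictions can be passed to the same two functions, so that the model and the baseline are scored on identical games.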
The data set used in this project is the NBA Enhanced Box Score and Standings (2012 - 2018) data set, created by Paul Rossotti and hosted on Kaggle.com. It was sourced using APIs from xmlstats. A copy of this data set is hosted on a separate remote repository located here to allow easy download without authenticating a Kaggle account. The particular data file used can be accessed here. Each row in the data set represents a player’s box score statistics for a particular game. The box score statistics are determined by statisticians working for the NBA. The data set contains 151,493 examples (rows).
Usage
You can run this analysis in a few different ways. Start by cloning or downloading this repository, then navigate to the root of the project using the command line.
Run with Docker
To run the analysis using Docker, type the following (replace <PATH_ON_YOUR_COMPUTER> with the path to the root of this project on your computer):
> docker run --rm -v <PATH_ON_YOUR_COMPUTER>:/home/nba_minutes jnederlo/nba_minutes make -C '/home/nba_minutes' all
To clean up the analysis, type:
> docker run --rm -v <PATH_ON_YOUR_COMPUTER>:/home/nba_minutes jnederlo/nba_minutes make -C '/home/nba_minutes' clean
The Docker container is hosted on Docker Hub and can be viewed here. The Dockerfile can be viewed here.
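If typing the path by hand is error-prone, the Docker commands above can be assembled programmatically. The following is only a sketch: the helper name `docker_make_command` is hypothetical, and it assumes the script is started from the root of the cloned repository. The image name and mount point are taken from the commands above.

```python
from pathlib import Path
from typing import List

IMAGE = "jnederlo/nba_minutes"       # image name from the commands above
CONTAINER_DIR = "/home/nba_minutes"  # mount point from the commands above


def docker_make_command(project_root: Path, target: str) -> List[str]:
    """Build the `docker run` argument list for a given make target.

    Hypothetical helper: it mounts the absolute project path into the container.
    """
    return [
        "docker", "run", "--rm",
        "-v", f"{project_root.resolve()}:{CONTAINER_DIR}",
        IMAGE, "make", "-C", CONTAINER_DIR, target,
    ]


if __name__ == "__main__":
    # Assumes the current working directory is the project root.
    cmd = docker_make_command(Path.cwd(), "all")
    print(" ".join(cmd))
    # To actually run it (requires Docker to be installed):
    # import subprocess; subprocess.run(cmd, check=True)
```

Passing "clean" instead of "all" produces the clean-up command.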
Run with Make
Alternatively, you can use make commands from the root directory of this project to reproduce the analysis. The commands are listed as follows:
##### General commands #####
# Run the whole workflow
make all
# Clean all of the workflow outputs
make clean
##### Run the workflow one step at a time, in order #####
# Download the data and save to file
make data/2012-18_playerBoxScore.csv
# Wrangle and preprocess the data - generate features and save data to a file
make data/player_data_ready.csv
# Run the Exploratory Data Analysis (EDA) - save results in a file
make results/EDA-correl_df_neg_9.csv results/EDA-correl_df_pos_20.csv results/EDA-feat_corr.png results/EDA-hist_y.png
# Train the models and make predictions - generate figures for final report
make results/modelling-gbm_importance.png results/modelling-residual_plot.png results/modelling-score_table.csv
# Generate the final report
make report.pdf
You can view the Makefile here.
If running locally, and not with Docker, make sure you have the required dependencies installed.
Dependencies
Python / R / System
- Python 3.7.5 and Python packages:
  - pandas==0.25.2
  - numpy==1.17.2
  - docopt==0.6.2
  - requests==2.20.0
  - tqdm==4.41.1
  - selenium==3.141.0
  - altair==4.0.1
  - scikit-learn==0.22.1
  - matplotlib==3.1.2
  - termcolor==1.1.0
  - jupyterlab==1.2.3
  - lightgbm==2.3.1
  - xgboost==0.90
- R version 3.6.1 and R packages:
  - tidyverse==1.2.1
  - docopt==0.6.2
- System requirements:
  - ChromeDriver==79.0.3945.36 (installable with Homebrew: brew cask install chromedriver); click here for more information
  - LaTeX (TeX Live 2019); click here for more information
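When running locally, a quick way to confirm that the Python pins above are satisfied is to compare them against the installed versions. The sketch below is an illustration, not part of the project: `find_mismatches` is a hypothetical helper, and `PINNED` holds only a subset of the list. Note that `importlib.metadata` requires Python 3.8 or newer, which is later than the 3.7.5 listed here; on 3.7 the `importlib_metadata` backport would be needed instead.

```python
from importlib import metadata  # standard library from Python 3.8
from typing import Callable, Dict, List

# A subset of the pins listed above; extend as needed.
PINNED: Dict[str, str] = {
    "pandas": "0.25.2",
    "numpy": "1.17.2",
    "scikit-learn": "0.22.1",
    "lightgbm": "2.3.1",
}


def find_mismatches(
    pinned: Dict[str, str],
    get_version: Callable[[str], str] = metadata.version,
) -> List[str]:
    """Return one message per package that is missing or at a different version."""
    problems: List[str] = []
    for name, wanted in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed (want {wanted})")
            continue
        if found != wanted:
            problems.append(f"{name}: found {found}, want {wanted}")
    return problems


if __name__ == "__main__":
    issues = find_mismatches(PINNED)
    print("\n".join(issues) if issues else "All checked packages match their pins.")
```

The `get_version` parameter exists only so the check can be exercised without installing anything; by default it queries the active environment.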
Licence
The NBA Minutes Predictor materials here are licensed under the MIT License. If re-using or re-mixing, please provide attribution and a link to this repository.