This is a group project from Data Science Workflows in UBC’s Master of Data Science program.
We developed machine learning models to predict how many minutes NBA players will spend on court in future games. The entire workflow is reproducible with Make and Docker. Details follow.

Authors

  • Jarvis Nederlof, Roc Zhang, Jack Tan

Project’s GitHub repo

  • The project’s GitHub repo can be found here.

Project Report

  • The final report can be found here.

About

We built a regression model using a light gradient boosting model to predict the number of minutes an NBA player is expected to play in an upcoming game. Our final model performed well on an unseen test data set, achieving a mean squared error ($MSE$) of 38.24 and a coefficient of determination ($R^2$) of 0.65. Both metrics improve on the baseline we evaluated against, a player’s 5-game average of minutes played, which scored an $MSE$ of 50.24 and an $R^2$ of 0.55. The results represent significant value in the context of Daily Fantasy Sports, and the prediction model could be used as is. However, we note possible areas of further improvement that, if explored, could yield better predictions and more value.
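To make the baseline comparison concrete, here is a minimal sketch of how a 5-game average baseline and the two metrics could be computed. It is not the project’s code: the function names are hypothetical, and it assumes the baseline averages a player’s previous games (up to five) and skips games with no history, which the text above does not specify.

```python
from collections import defaultdict, deque


def rolling_baseline(games, window=5):
    """Predict each game's minutes as the mean of that player's previous games.

    `games` is an iterable of (player, minutes) tuples in chronological order.
    At most `window` previous games are averaged (assumption); a player's
    first game has no history, so its prediction is None.
    """
    history = defaultdict(lambda: deque(maxlen=window))
    preds = []
    for player, minutes in games:
        past = history[player]
        preds.append(sum(past) / len(past) if past else None)
        past.append(minutes)
    return preds


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy data for illustration only, not taken from the project's data set.
toy_games = [("A", 30.0), ("A", 34.0), ("A", 32.0), ("B", 10.0), ("A", 36.0)]
toy_preds = rolling_baseline(toy_games)
scored = [(m, p) for (_, m), p in zip(toy_games, toy_preds) if p is not None]
y_true = [m for m, _ in scored]
y_pred = [p for _, p in scored]
print(mse(y_true, y_pred), r2(y_true, y_pred))
```

A lower $MSE$ and a higher $R^2$ than this kind of baseline is what the comparison above reports for the final model.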

The data set used in this project is the NBA Enhanced Box Score and Standings (2012 - 2018) data set created by Paul Rossotti and hosted on Kaggle.com. It was sourced using APIs from xmlstats. A copy of this data set is hosted on a separate remote repository located here to allow easy download without authenticating a Kaggle account. The particular data file used can be accessed here. Each row in the data set represents a player’s box score statistics for a particular game. The box score statistics are determined by statisticians working for the NBA. The data set contains 151,493 examples (rows).

Usage

You can run this analysis in a few different ways. Start by cloning or downloading this repository, then navigate to the root of the project on the command line.

Run with Docker

To run the analysis using Docker, type the following (replace <PATH_ON_YOUR_COMPUTER> with the absolute path to the root of this project on your computer):

> docker run --rm -v <PATH_ON_YOUR_COMPUTER>:/home/nba_minutes jnederlo/nba_minutes make -C '/home/nba_minutes' all

To clean up the analysis, type:

> docker run --rm -v <PATH_ON_YOUR_COMPUTER>:/home/nba_minutes jnederlo/nba_minutes make -C '/home/nba_minutes' clean
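If you would rather not type the absolute path by hand, the two Docker commands above can be assembled programmatically. This is a minimal sketch, not part of the project: the helper name `docker_make_command` is hypothetical, and only the image name, mount point, and make targets are taken from the commands above. `shlex.join` requires Python 3.8 or later.

```python
import shlex
from pathlib import Path

IMAGE = "jnederlo/nba_minutes"         # image name from the commands above
CONTAINER_ROOT = "/home/nba_minutes"   # mount point from the commands above


def docker_make_command(project_root, target="all"):
    """Build the `docker run ... make -C ... <target>` argument list.

    `project_root` may be relative; it is resolved to an absolute path
    because the bind mount is given as an absolute host path.
    """
    host_path = Path(project_root).resolve()
    return [
        "docker", "run", "--rm",
        "-v", f"{host_path}:{CONTAINER_ROOT}",
        IMAGE,
        "make", "-C", CONTAINER_ROOT, target,
    ]


# Print the command for the current directory; pass the list to
# subprocess.run(cmd, check=True) to execute it instead.
print(shlex.join(docker_make_command(".", "all")))
```

Passing the arguments as a list avoids the shell quoting mistakes that are easy to make when editing the command by hand.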

The Docker image is hosted on Docker Hub and can be viewed here. The Dockerfile can be viewed here.

Run with Make

Alternatively, you can run make commands from the root directory of this project to reproduce the analysis. The commands are as follows:

##### General commands #####
# Run the whole workflow
make all

# Clean all of the workflow outputs
make clean

##### Run the workflow one step at a time, in order #####
# Download the data and save to file
make data/2012-18_playerBoxScore.csv

# Wrangle and preprocess the data - generate features and save data to a file
make data/player_data_ready.csv

# Run the Exploratory Data Analysis (EDA) - save results in a file
make results/EDA-correl_df_neg_9.csv results/EDA-correl_df_pos_20.csv results/EDA-feat_corr.png results/EDA-hist_y.png

# Train the models and make predictions - generate figures for final report
make results/modelling-gbm_importance.png results/modelling-residual_plot.png results/modelling-score_table.csv 

# Generate the final report
make report.pdf

You can view the Makefile here.

If running locally, and not with Docker, make sure you have the required dependencies installed.
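One way to check a local Python environment against the pinned versions listed under Dependencies is sketched below. It is an illustration rather than part of the project: `check_pins` is a hypothetical helper, only three of the pins are shown, and `importlib.metadata` is in the standard library from Python 3.8 (on Python 3.7 the `importlib_metadata` backport offers the same interface).

```python
from importlib import metadata

# A few of the pins from the Dependencies list; extend as needed.
PINNED = {"pandas": "0.25.2", "numpy": "1.17.2", "lightgbm": "2.3.1"}


def check_pins(pinned, get_version=metadata.version):
    """Return {package: (expected, found)} for every mismatch.

    `found` is None when the package is not installed. `get_version`
    is injectable so the check can be exercised without the packages.
    """
    problems = {}
    for name, expected in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != expected:
            problems[name] = (expected, found)
    return problems


for package, (expected, found) in check_pins(PINNED).items():
    print(f"{package}: expected {expected}, found {found}")
```

An empty result means every listed package is installed at its pinned version.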

Dependencies

Python / R / System

  • Python 3.7.5 and Python packages:
    • pandas==0.25.2
    • numpy==1.17.2
    • docopt==0.6.2
    • requests==2.20.0
    • tqdm==4.41.1
    • selenium==3.141.0
    • altair==4.0.1
    • scikit-learn==0.22.1
    • matplotlib==3.1.2
    • termcolor==1.1.0
    • jupyterlab==1.2.3
    • lightgbm==2.3.1
    • xgboost==0.90
  • R version 3.6.1 and R packages:
    • tidyverse==1.2.1
    • docopt==0.6.2
  • System requirement:

Licence

The NBA Minutes Predictor materials here are licensed under the MIT License. If re-using or re-mixing, please provide attribution and a link to this repository.