Predicting a Fitness Center’s Class Attendance with Machine Learning

For this hands-on exam, I was tasked with analyzing and then predicting the attendance rates for a Fitness Center’s group classes. Guide questions were provided to move the analysis along. Each column came with given criteria, so I used Pandas to look for data points that violated them.

Analysis

I created an attendance rate column by dividing the attendance number by the class capacity. I then computed the correlation of each numerical feature with the attendance rate.
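As a rough illustration of this step, here is a minimal pandas sketch. The column names and the toy values are assumptions made up for the example; the actual dataset's schema is not shown in this write-up.

```python
import pandas as pd

# Hypothetical column names and toy values, for illustration only.
classes = pd.DataFrame({
    "capacity":    [20, 20, 25, 25, 30],
    "attendance":  [10, 15, 20, 12, 27],
    "avg_age":     [41.0, 35.5, 29.0, 44.0, 27.5],
    "new_members": [2, 4, 6, 3, 8],
})

# Attendance rate = number of attendees divided by class capacity.
classes["attendance_rate"] = classes["attendance"] / classes["capacity"]

# Pearson correlation of each remaining numerical feature with the rate.
correlations = (
    classes.drop(columns=["attendance", "capacity"])
    .corr()["attendance_rate"]
    .drop("attendance_rate")
)
print(correlations)
```

The raw attendance and capacity columns are dropped before correlating because the rate is derived directly from them.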

I also used qq-plots to check whether the columns are normally distributed.
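One way to run this kind of check numerically is `scipy.stats.probplot`, which returns the points of a qq-plot together with the correlation of its straight-line fit. The sketch below uses synthetic samples as stand-ins for the real columns, so the variable names and numbers are assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-ins: one roughly normal column, one roughly uniform column.
normal_like = rng.normal(loc=35, scale=5, size=500)
uniform_like = rng.uniform(low=0.0, high=1.0, size=500)

# probplot returns the qq-plot points and a least-squares line fit (slope, intercept, r).
(theoretical_q, ordered_values), (slope, intercept, r_normal) = stats.probplot(
    normal_like, dist="norm"
)
_, (_, _, r_uniform) = stats.probplot(uniform_like, dist="norm")

# An r close to 1 means the points hug the normal line.
print(f"normal-like sample:  r = {r_normal:.3f}")
print(f"uniform-like sample: r = {r_uniform:.3f}")
```

Passing a matplotlib axes object through the `plot` argument of `probplot` draws the plot itself instead of only returning the numbers.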

For the categorical columns, I used box plots to show the distribution of attendance rates within each category.
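The quartiles that a box plot draws can also be read off as a table. The following is a small sketch with invented category labels and values, not the exam data:

```python
import pandas as pd

# Toy data with hypothetical column names and category labels.
df = pd.DataFrame({
    "category": ["Yoga", "Yoga", "Yoga", "Cycling", "Cycling", "Cycling"],
    "attendance_rate": [0.7, 0.8, 0.9, 0.3, 0.4, 0.5],
})

# Quartiles per category: the same numbers a box plot shows as box edges and median line.
summary = df.groupby("category")["attendance_rate"].describe()[["25%", "50%", "75%"]]
print(summary)

# To draw the plot itself (requires matplotlib):
# df.boxplot(column="attendance_rate", by="category")
```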

The following were some of my insights:

  • 75% of the classes have an attendance rate below 66% (the 75th percentile).
  • Based on the histograms in the pairplot, attendance rates are almost uniformly distributed, with a few outlier values. Age, new members, and over-6-month members follow an approximately normal distribution. The qq-plots confirm this, as the points fit the normal line closely.
  • From both the scatterplots in the pairplot and the correlation computation, classes with higher average ages tend to have lower attendance rates (negative correlation), and classes with more sign-ups (both new and over-6-month members) tend to have higher attendance rates (positive correlation).
  • The box plots break down the distribution of attendance rates across categories.
  • Tuesdays and Fridays have higher median attendance rates than the other days.
  • Attendance rates for AM and PM classes are very similar.
  • In terms of class activity, strength and yoga classes have the highest attendance rates, while cycling has the lowest.

Modeling & Evaluation

Evaluation Strategy

  • First, I set aside 10% of my data as a final test set. I then created a naive model that always outputs the mean attendance rate of the training data.
  • Next, for validation, I used 10-fold cross-validation (repeated 3 times). Since the data set is small and the chosen models aren’t computationally expensive, I validated the models on random subsets multiple times to generate a distribution of model performance. This showed me not only the average error but also its spread. I was looking for a model with a low average error and a small spread.
  • Finally, I fit the models on the full training set and then tested them against the held-out test set to see their performance on never-before-seen data.
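The three steps above can be sketched with scikit-learn as follows. This is a minimal illustration on synthetic data, not the code from the exam: the features, the coefficients, and the choice of `LinearRegression` as the comparison model are assumptions, and the final refit uses the training portion only so that the held-out set stays unseen.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score, train_test_split

# Synthetic stand-in for the class data (3 made-up features, rate-like target).
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
y = 0.5 + 0.1 * X[:, 0] - 0.05 * X[:, 1] + rng.normal(scale=0.05, size=300)

# Step 1: hold out 10% as the final test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=42
)

# Naive baseline: always predicts the mean of the training target.
models = {
    "naive mean": DummyRegressor(strategy="mean"),
    "linear": LinearRegression(),
}

# Step 2: 10-fold cross-validation repeated 3 times gives 30 RMSE scores per model.
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=42)
cv_rmse = {}
for name, model in models.items():
    scores = -cross_val_score(
        model, X_train, y_train, cv=cv, scoring="neg_root_mean_squared_error"
    )
    cv_rmse[name] = scores
    print(f"{name}: mean RMSE = {scores.mean():.4f}, spread (std) = {scores.std():.4f}")

# Step 3: refit on the training portion and score on the held-out test set.
test_rmse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    residuals = y_test - model.predict(X_test)
    test_rmse[name] = float(np.sqrt(np.mean(residuals ** 2)))
    print(f"{name}: test RMSE = {test_rmse[name]:.4f}")
```

Keeping all 30 scores per model, rather than only their mean, is what makes it possible to compare the spread as well as the average error.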

Metric

  • I chose root mean square error (RMSE) as my evaluation metric.
  • It is a very popular metric for regression problems. It measures the distance of the predictions from the correct values, and because the errors are squared before averaging, it penalizes far-off predictions more heavily.
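To make the penalty on far-off predictions concrete, here is a small sketch with made-up numbers. Both sets of predictions have the same total absolute error (0.4), but the one that concentrates it in a single prediction gets twice the RMSE.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error: square root of the mean squared difference."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up attendance rates, for illustration only.
y_true = [0.5, 0.5, 0.5, 0.5]
even_errors = [0.6, 0.6, 0.6, 0.6]    # four errors of 0.1 each
one_big_error = [0.5, 0.5, 0.5, 0.9]  # a single error of 0.4

print(f"{rmse(y_true, even_errors):.3f}")    # prints 0.100
print(f"{rmse(y_true, one_big_error):.3f}")  # prints 0.200
```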

Outcome

  • All models performed better than the naive model.
  • In both validation and testing, the fine-tuned Ridge Regression model came out on top, but its RMSE was only better by about 0.001.
  • The Random Forest model was not able to outperform the linear models.
  • Since we predicted attendance rates, converting the error back to a number of students means our model’s predictions can be off by 2-3 students.
  • I chose the Ridge Regression model as the better-performing approach. At the scale of this dataset, it is not much more complex than linear regression, and fine-tuning gives slightly better performance.
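For readers curious what fine-tuning Ridge Regression can look like, below is one possible sketch using scikit-learn's `GridSearchCV` over the regularization strength `alpha`. The synthetic data, the alpha grid, the scaling step, and the class capacity of 25 used to convert the error to students are all assumptions for illustration. They are not the settings or results from the exam.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, RepeatedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (3 made-up features).
rng = np.random.default_rng(0)
X = rng.normal(size=(270, 3))
y = 0.5 + 0.1 * X[:, 0] - 0.05 * X[:, 1] + rng.normal(scale=0.05, size=270)

# Scale features so the Ridge penalty treats them comparably.
pipeline = make_pipeline(StandardScaler(), Ridge())

# Hypothetical grid of regularization strengths to try.
param_grid = {"ridge__alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

search = GridSearchCV(
    pipeline,
    param_grid,
    cv=RepeatedKFold(n_splits=10, n_repeats=3, random_state=0),
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)

best_alpha = search.best_params_["ridge__alpha"]
best_rmse = -search.best_score_  # scores are negated RMSE, so flip the sign
print(f"best alpha = {best_alpha}, cross-validated RMSE = {best_rmse:.4f}")

# Converting a rate error into students for a hypothetical class capacity of 25.
capacity = 25
print(f"approximate error in students: {best_rmse * capacity:.1f}")
```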

 
