Predicting a Fitness Center’s Class Attendance with Machine Learning

For this hands-on exam, I was tasked with analyzing and then predicting the attendance rates of a Fitness Center’s group classes. Guide questions were provided to move the analysis along. Each column came with given criteria, so I used Pandas to look for data points that violated them.

Analysis

I created an attendance rate column by dividing the attendance number by the class capacity. I then computed the correlation of each numerical feature with the attendance rate.
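
A minimal sketch of this step, assuming hypothetical column names (attendance, capacity, avg_age, new_members) and a small made-up table rather than the exam's actual data:

```python
import pandas as pd

# Made-up toy data; the column names are assumptions, not the exam's schema.
classes = pd.DataFrame({
    "attendance": [10, 18, 6, 20],
    "capacity": [20, 24, 12, 25],
    "avg_age": [41.0, 29.5, 45.2, 27.8],
    "new_members": [3, 7, 2, 9],
})

# Attendance rate = attendees / capacity, a value between 0 and 1.
classes["attendance_rate"] = classes["attendance"] / classes["capacity"]

# Pearson correlation of each numerical feature with the attendance rate.
features = ["avg_age", "new_members", "attendance_rate"]
correlations = (
    classes[features].corr()["attendance_rate"]
    .drop("attendance_rate")
    .sort_values()
)
print(correlations)
```

In this toy table, avg_age comes out negatively correlated and new_members positively correlated, which only mirrors the direction of the findings and says nothing about their size.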

I also used Q-Q plots to check whether the numerical columns are normally distributed.
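
As a rough illustration of what a Q-Q check computes, the sketch below uses scipy.stats.probplot on a synthetic, normally distributed stand-in for an age column. The sample and its parameters are assumptions made up for the example, not the exam data.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for a numerical column (assumed values, not real data).
rng = np.random.default_rng(0)
avg_age = rng.normal(loc=35, scale=5, size=200)

# probplot pairs theoretical normal quantiles with the sorted sample values
# and fits a least-squares line through those points.
(theoretical_q, ordered_values), (slope, intercept, r) = stats.probplot(
    avg_age, dist="norm"
)

# r close to 1 means the points hug the reference line, i.e. the sample
# is close to normally distributed.
print(f"Q-Q line fit: r = {r:.3f}")
```

Passing plot=plt (with matplotlib.pyplot imported as plt) to the same call draws the familiar plot instead of only returning the numbers.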

For the categorical columns, I used box plots to show the distribution of attendance rates across each category.
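
The numbers behind a box plot can also be read off directly. This is a small sketch with a made-up day column and made-up rates; the column names are assumptions.

```python
import pandas as pd

# Made-up toy data; "day" and "attendance_rate" are assumed column names.
classes = pd.DataFrame({
    "day": ["Mon", "Tue", "Tue", "Fri", "Mon", "Fri"],
    "attendance_rate": [0.40, 0.70, 0.60, 0.65, 0.50, 0.75],
})

# Quartiles per category: the same values a box plot draws as the box edges
# (25% and 75%) and the middle line (50%, the median).
summary = classes.groupby("day")["attendance_rate"].describe()[["25%", "50%", "75%"]]
print(summary)
```

With matplotlib installed, classes.boxplot(column="attendance_rate", by="day") draws the corresponding plot.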

The following were some of my insights:

75% of the classes have an attendance rate below 66% (the 75th percentile).
In the histograms of the pairplot, attendance rates are almost uniformly distributed, with a few outliers. Age, new members, and over-6-month members are approximately normally distributed. The Q-Q plots support this, as the points lie close to the normal reference line.
Both the scatterplots in the pairplot and the correlation computation show that classes with higher average ages tend to have lower attendance rates (negative correlation), while classes with more sign-ups (both new and over-6-month members) tend to have higher attendance rates (positive correlation).
The box plots break down the distribution of attendance rates by category:
Tuesdays and Fridays have higher median attendance rates.
Attendance rates for AM and PM classes are very similar.
In terms of class activity, strength and yoga classes have higher attendance rates, while cycling has the lowest.

Modeling & Evaluation

Evaluation Strategy

First, I set aside 10% of my data as a final test set. I then created a naive baseline model that always outputs the mean attendance rate of the training data.
Next, for validation, I used 10-fold cross-validation, repeated 3 times. Since the data set is small and the chosen models aren’t computationally expensive, I could validate the models on random subsets multiple times to generate a distribution of model performance. This showed me not only the average error but also its spread. I was looking for a model with a low average error and a small spread.
Finally, I fit the models on the whole training set and then tested them against the held-out test set to see their performance on never-before-seen data.
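
The three steps above can be sketched with scikit-learn as follows. This is an illustration under assumptions, not the code used for the exam: the data is synthetic, the Ridge alpha is an arbitrary placeholder, and the final fit is assumed to use only the training portion so that the test set stays unseen.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import RepeatedKFold, cross_val_score, train_test_split

# Synthetic stand-in data (assumption): 3 numeric features, a rate-like target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 0.5 + X @ np.array([0.10, -0.05, 0.08]) + rng.normal(scale=0.05, size=300)

# Step 1: hold out 10% as the final test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=0
)

# Step 2: 10-fold cross-validation repeated 3 times gives 30 scores per model.
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=0)
models = {
    "naive mean": DummyRegressor(strategy="mean"),  # always predicts the training mean
    "ridge": Ridge(alpha=1.0),                      # alpha is a placeholder value
}
cv_rmse = {}
for name, model in models.items():
    # scikit-learn returns negated RMSE so that higher is better; flip the sign.
    cv_rmse[name] = -cross_val_score(
        model, X_train, y_train, cv=cv, scoring="neg_root_mean_squared_error"
    )
    print(f"{name}: mean RMSE {cv_rmse[name].mean():.4f}, spread {cv_rmse[name].std():.4f}")

# Step 3: refit on the full training portion and score once on the held-out set.
test_rmse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    residuals = y_test - model.predict(X_test)
    test_rmse[name] = float(np.sqrt(np.mean(residuals ** 2)))
    print(f"{name}: test RMSE {test_rmse[name]:.4f}")
```

The mean of the 30 scores is the average error and their standard deviation is the spread, so a model can be compared on both at once.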

Metric

I chose root mean square error (RMSE) as my evaluation metric.
It is a very popular metric for regression problems. It measures the distance of the predictions from the correct values, and because the errors are squared before averaging, it penalizes far-off predictions more heavily.
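
To make the penalty concrete, here is a small sketch with made-up numbers. Both sets of predictions have the same total absolute error, but the one with a single large miss gets a higher RMSE.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error: square the errors, average them, take the root."""
    errors = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(errors ** 2)))

# Made-up attendance rates, for illustration only.
y_true = [0.5, 0.5, 0.5, 0.5]
even_misses = [0.6, 0.6, 0.6, 0.6]   # four misses of 0.1 each
one_big_miss = [0.5, 0.5, 0.5, 0.9]  # one miss of 0.4, same total absolute error

print(round(rmse(y_true, even_misses), 3))   # 0.1
print(round(rmse(y_true, one_big_miss), 3))  # 0.2
```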

Outcome

All models performed better than the naive model.
In both validation and testing, the fine-tuned Ridge Regression model came out on top, but its error is lower by only about 0.001.
The Random Forest model was not able to outperform the linear models.
Since the model predicts attendance rates, converting its error back to a number of students means its predictions can be off by 2-3 students.
I chose the Ridge Regression model as the better-performing approach. At the scale of this dataset it is not much more complex than linear regression, and fine-tuning gives slightly better performance.
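
The conversion from a rate error back to a head count is a single multiplication by the class capacity. The sketch below uses a hypothetical error of 0.10 and hypothetical capacities of 15 and 25; these values are assumptions chosen only to show the arithmetic and are not the exam's results.

```python
def rate_error_to_students(rate_error, capacity):
    """Convert an error in attendance rate (0 to 1) into a number of students."""
    return rate_error * capacity

# Hypothetical values for illustration; not taken from the exam data.
hypothetical_rmse = 0.10
for capacity in (15, 25):
    students = rate_error_to_students(hypothetical_rmse, capacity)
    print(f"capacity {capacity}: off by about {students:.1f} students")
```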
