Classification is a classic machine learning scenario: a model is trained on labeled data, and its performance is evaluated using a set of measurements called "metrics".
Let's consider the simplest case, where our collection contains only two "types" of data and we want to evaluate a model trained on this dataset. In this demo, the two "types" correspond to the two broad sentiment categories (positive and negative) in movie reviews. Each "type" is called a class, and since there are two classes that our model must separate, this task is known as binary classification.
So, what do we mean by measurements? In simple terms, if the model predicts that a review has a positive sentiment and it actually does, then the model is correct for that particular review. If we aggregate the model's performance across all the reviews, we obtain its performance on the entire dataset.
But this description only covers one scenario, the one where the actual and predicted labels are both 1. There are three other possibilities. The "confusion matrix" shown here illustrates the four possible scenarios and their names.
Now, with the understanding that model predictions can be true/false positives/negatives depending on the actual and predicted class labels, we can explore some of the fundamental metrics that are commonly used to evaluate model performance.
The LIT tool in our current demo presents four metrics: Accuracy, Precision, Recall, and F1. All four can be expressed as ratios of the four quantities we just described (true/false positives/negatives).
Accuracy is the ratio of correctly predicted observations to total observations. Precision is the ratio of correctly predicted positive observations to all predicted positive observations. Recall is the ratio of correctly predicted positive observations to all observations whose actual label is 1. And the F1 score is the harmonic mean of Precision and Recall.
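To make these formulas concrete, here is a minimal Python sketch that computes all four metrics from the confusion-matrix counts. The TP/FP/TN/FN numbers below are made-up placeholders, not the actual counts from this demo.

```python
# Minimal sketch of the four metrics, using placeholder confusion-matrix counts.
# These numbers are illustrative only, not the values from the LIT demo.
tp, fp, tn, fn = 40, 10, 45, 5

accuracy = (tp + tn) / (tp + tn + fp + fn)          # correct predictions / all predictions
precision = tp / (tp + fp)                          # correct positives / predicted positives
recall = tp / (tp + fn)                             # correct positives / actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```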
By default, the Metrics component of the LIT tool displays the calculated values of the metrics across all the samples in the dataset (here N = 872). These can also be verified manually using the above formulae, by plugging in the values of TN, TP, FP, and FN from the confusion matrix. But LIT Metrics can do more.
For example, we can choose to facet the calculations by label. This shows the metrics for each class label separately, in addition to the overall values. Here, the accuracy for class label 0 is 0.808 and the accuracy for class label 1 is 0.833. This particular dataset is well balanced, with nearly a 1:1 ratio of samples between the two classes, but the class distribution in the input is not always balanced. Faceting by label surfaces this problem easily: if we take no counter-measures against class imbalance, the model will over-perform or under-perform on one of the classes.
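As a rough sketch of what faceting by label amounts to, we can slice the examples by their actual class and compute accuracy on each slice separately. The labels and predictions below are toy data, not the demo's outputs.

```python
# Rough sketch of "faceting by label": compute accuracy separately for each class.
# The labels and predictions here are made-up toy data, not the demo's outputs.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]

for label in (0, 1):
    # Keep only the examples whose actual class is `label`, then measure accuracy on that slice.
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t == label]
    acc = sum(t == p for t, p in pairs) / len(pairs)
    print(f"accuracy for class {label}: {acc:.3f}")
```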
Another thing we can do with the LIT tool is check the metrics for individual examples or a small subset of examples. This is a very useful way to see whether our model learnt useful features during training. For example, we can generate a counterfactual statement using one of the many generators, such as HotFlip, and check whether the metrics remain comparable under perturbation. In this demo, I have generated a single HotFlip example where the word "affecting" is replaced by the word "worse", and the model correctly classifies this new sample as having negative sentiment.
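The sketch below illustrates the general idea behind such counterfactual checks, not LIT's actual HotFlip generator: perturb one word and see whether the prediction changes in a sensible way. The toy_sentiment_model function is a made-up stand-in for a real trained classifier.

```python
# Toy illustration of the counterfactual idea (not LIT's HotFlip generator):
# perturb one word in a review and check whether the prediction changes sensibly.
def toy_sentiment_model(text: str) -> str:
    # Made-up stand-in for a real trained classifier.
    negative_words = {"worse", "bad", "boring"}
    return "negative" if any(w in text.lower().split() for w in negative_words) else "positive"

original = "a deeply affecting performance"
perturbed = original.replace("affecting", "worse")  # single-word substitution, HotFlip-style

print(toy_sentiment_model(original))   # positive
print(toy_sentiment_model(perturbed))  # negative
```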
Also, accuracy is usually only a useful metric when false positives and false negatives have a similar cost. Other metrics like Precision and Recall are more meaningful when the FP/FN cost ratio is not 1. The Binary Classifier Threshold module can be used to modify the FP/FN cost ratio and then calculate an optimal threshold based on the new ratio. This offers a very helpful way to try out different desired cost ratios and thresholds and observe their impact on the model's metrics.
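Here is a hedged sketch of the underlying idea, not LIT's exact procedure: given the model's positive-class scores and a chosen FP/FN cost ratio, sweep candidate thresholds and keep the one with the lowest total cost. The scores, labels, and costs below are toy values.

```python
# Sketch of threshold tuning under an FP/FN cost ratio (not LIT's exact procedure).
# The labels, scores, and costs below are made-up toy values.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.45, 0.6, 0.9, 0.55]  # model's positive-class probabilities

fp_cost, fn_cost = 1.0, 5.0  # e.g. missing a positive is 5x worse than a false alarm

def total_cost(threshold: float) -> float:
    # Count false positives and false negatives at this threshold and weight them by cost.
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < threshold)
    return fp * fp_cost + fn * fn_cost

best = min((t / 100 for t in range(101)), key=total_cost)
print(f"lowest-cost threshold on this toy data: {best:.2f}")
```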
Deciding which metric to rely on, and when, is indeed a difficult choice. As this answer from Stack Overflow demonstrates, different use cases have different needs, which leads to different metrics being preferred.
To conclude, this was a brief introduction to metrics and model evaluation, with a focus on how to use the Metrics component of the LIT tool to better understand the performance of our trained models. Thank you :)