Model calibration is the process of ensuring that the probabilities a model predicts match the frequencies with which the predicted outcomes actually occur. A well-calibrated model outputs probabilities that can be read directly as likelihoods. For example, among all the cases to which a model assigns a probability of 0.8 for a positive outcome, roughly 80% should turn out to be positive.
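To make that definition concrete, here is a minimal sketch of a calibration check that groups predictions into bins and compares, per bin, the average predicted probability with the observed share of positives. The function name `reliability_table`, the ten equal-width bins, and the synthetic data are all assumptions made for illustration, not part of any particular library.

```python
import random

def reliability_table(y_true, y_prob, n_bins=10):
    """Compare mean predicted probability with observed positive rate per bin.

    Returns one (mean_predicted, observed_rate, count) tuple per non-empty bin.
    """
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((y, p))

    table = []
    for members in bins:
        if members:
            n = len(members)
            mean_predicted = sum(p for _, p in members) / n
            observed_rate = sum(y for y, _ in members) / n
            table.append((mean_predicted, observed_rate, n))
    return table

# Synthetic, illustrative data: each label is drawn with exactly the stated
# probability, so these scores are calibrated by construction.
rng = random.Random(0)
probs = [rng.random() for _ in range(20000)]
labels = [1 if rng.random() < p else 0 for p in probs]

for mean_predicted, observed_rate, n in reliability_table(labels, probs):
    print(f"predicted {mean_predicted:.2f}  observed {observed_rate:.2f}  n={n}")
```

For a calibrated model the two columns stay close in every bin. A systematic gap, such as predicted values sitting well above observed rates, is the signature of miscalibration.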

A binary classification model can have high precision and high recall and still be poorly calibrated, because those metrics only measure how well the model separates the classes, not whether its probabilities match reality. Consider a model for predicting credit default, which financial institutions use to estimate the likelihood that a borrower will default on a loan. These models are usually trained on large datasets of historical credit data, using features such as income, credit history, employment status, and more.

However, if the model lacks proper calibration, it may misstate the probability of default even though it separates positive from negative examples accurately. For instance, the model may predict that a borrower has a 90% chance of defaulting when in reality only about 50% of comparable borrowers default. Decisions based on such inflated probabilities will be poor ones, resulting in financial losses for the institution and adverse consequences for borrowers.
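The gap between separating classes well and being calibrated can be illustrated with a small sketch. It assumes synthetic data and a hypothetical distortion (`p ** 0.25`) that inflates every probability without changing the ordering of borrowers. The AUC, a ranking metric, is unchanged by the distortion, while the Brier score, which penalizes probability errors, gets worse.

```python
import random

def auc(y_true, scores):
    """Probability that a random positive outranks a random negative.

    Rank-sum (Mann-Whitney) formulation; assumes there are no tied scores.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    rank_sum = sum(rank for rank, i in enumerate(order, start=1) if y_true[i] == 1)
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def brier(y_true, probs):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(probs)

# Synthetic, illustrative data: labels are drawn from the true probabilities.
rng = random.Random(0)
true_prob = [rng.random() for _ in range(5000)]
labels = [1 if rng.random() < p else 0 for p in true_prob]

# Hypothetical miscalibrated model: same ranking, inflated probabilities
# (a true probability of 0.5 is reported as roughly 0.84).
inflated = [p ** 0.25 for p in true_prob]

print("AUC   calibrated:", auc(labels, true_prob), " inflated:", auc(labels, inflated))
print("Brier calibrated:", brier(labels, true_prob), " inflated:", brier(labels, inflated))
```

Because any strictly increasing transformation preserves the ranking, metrics built on ranking alone cannot reveal this kind of problem.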

I recently wrote about a price optimization model that I had developed in the past. In that particular model, it was also critical to ensure we had calibrated scores, as any inaccuracies in the predicted probability of a sale would have led to flawed calculations of the expected revenue (sale probability * potential revenue = expected revenue), ultimately resulting in suboptimal pricing decisions.
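As a rough sketch of why this matters for pricing, the snippet below picks the price that maximizes sale probability * price, once with a "true" sale probability and once with an inflated one. The demand curve, the distortion, and the candidate prices are all hypothetical and chosen only for illustration; they are not taken from the model described above.

```python
import math

def true_sale_prob(price):
    # Hypothetical demand curve: the chance of a sale falls as price rises.
    return 1.0 / (1.0 + math.exp((price - 50) / 10))

def inflated_sale_prob(price):
    # Hypothetical miscalibrated model: same ordering of prices,
    # but every probability is overstated.
    return true_sale_prob(price) ** 0.5

def expected_revenue(prob_fn, price):
    # sale probability * potential revenue = expected revenue
    return prob_fn(price) * price

def best_price(prob_fn, prices):
    # Choose the candidate price with the highest expected revenue.
    return max(prices, key=lambda price: expected_revenue(prob_fn, price))

candidate_prices = list(range(10, 101, 5))

price_from_true = best_price(true_sale_prob, candidate_prices)
price_from_inflated = best_price(inflated_sale_prob, candidate_prices)

print("price chosen with calibrated probabilities:", price_from_true)
print("price chosen with inflated probabilities:  ", price_from_inflated)
print("true expected revenue at each choice:",
      expected_revenue(true_sale_prob, price_from_true),
      expected_revenue(true_sale_prob, price_from_inflated))
```

In this toy setup the inflated probabilities push the chosen price higher, and the revenue actually realized at that price is lower than at the price chosen with calibrated probabilities.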

To ensure accurate probabilities and sound decision making, it is often important to calibrate a model. Techniques such as Platt scaling or isotonic regression can be used to adjust the model's predictions so that they align with the true probabilities. Platt scaling fits a logistic regression that maps the scores of a binary classification model to the observed outcomes, while isotonic regression fits a free-form, non-decreasing step function to the same mapping. Both techniques look for a transformation of the predicted scores that brings them closer to the true probabilities while preserving their order.
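A minimal sketch of both techniques follows, assuming scikit-learn and NumPy are available. The data is synthetic, the raw scores are deliberately inflated with a hypothetical distortion, and the calibrators are fitted on one half of the data and evaluated on the other. In practice the raw scores would come from an existing model applied to a held-out calibration set.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for a model's output: correctly ranked but inflated scores.
true_prob = rng.uniform(0.0, 1.0, size=20000)
labels = rng.binomial(1, true_prob)
raw_scores = true_prob ** 0.25  # hypothetical miscalibration

# Fit the calibrators on one half, evaluate on the other.
cal = slice(0, 10000)
ev = slice(10000, None)

# Platt scaling: a one-feature logistic regression from score to outcome.
platt = LogisticRegression()
platt.fit(raw_scores[cal].reshape(-1, 1), labels[cal])
platt_probs = platt.predict_proba(raw_scores[ev].reshape(-1, 1))[:, 1]

# Isotonic regression: a non-decreasing step function from score to outcome.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(raw_scores[cal], labels[cal])
iso_probs = iso.predict(raw_scores[ev])

def brier(y, p):
    """Mean squared error between predicted probabilities and outcomes."""
    return float(np.mean((p - y) ** 2))

print("Brier raw:     ", brier(labels[ev], raw_scores[ev]))
print("Brier Platt:   ", brier(labels[ev], platt_probs))
print("Brier isotonic:", brier(labels[ev], iso_probs))
```

Platt scaling has only two parameters, so it tends to be the safer choice when calibration data is scarce, whereas isotonic regression is more flexible but needs more data to avoid overfitting. scikit-learn also wraps both approaches in `CalibratedClassifierCV`, which may be more convenient than fitting the calibrators by hand.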

In conclusion, score calibration is a crucial step whenever a model's probabilities, and not just its rankings, feed into decisions. A well-calibrated model generates probabilities that accurately reflect the likelihood of each outcome. Consider how the model scores will be used, and where the use case calls for it, build calibration into both the development and the deployment of machine learning models.

This was always one of the first failure modes I saw in my discipline of Operations Research. They'd never recalibrate.