Classification vs. Prediction

Frank Harrell:

The field of machine learning arose somewhat independently of the field of statistics. As a result, machine learning experts tend not to emphasize probabilistic thinking. Probabilistic thinking and understanding uncertainty and variation are hallmarks of statistics. By the way, one of the best books about probabilistic thinking is Nate Silver’s The Signal and The Noise: Why So Many Predictions Fail But Some Don’t. In the medical field, a classic paper is David Spiegelhalter’s Probabilistic Prediction in Patient Management and Clinical Trials.

By not thinking probabilistically, machine learning advocates frequently utilize classifiers instead of using risk prediction models. The situation has gotten acute: many machine learning experts actually label logistic regression as a classification method (it is not). It is important to think about what classification really implies. Classification is in effect a decision. Optimum decisions require making full use of available data, developing predictions, and applying a loss/utility/cost function to make a decision that, for example, minimizes expected loss or maximizes expected utility. Different end users have different utility functions. In risk assessment this leads to their having different risk thresholds for action. Classification assumes that every user has the same utility function and that the utility function implied by the classification system is that utility function.