We argue that, when establishing and benchmarking Machine Learning (ML) models, the research community should favour evaluation metrics that better capture the value delivered by their models in practical applications. For a specific class of use cases – selective classification – we show that not only is this simple to do, but that it has important consequences and provides insights into what to look for in a ``good'' ML model.