Back to the index

Measuring quality of predictions

How do we know if Cassandra is making good predictions? We have to be careful not to fall into the ‘Trump trap' here. 538's forecast for the 2016 US election gave Clinton a 71% probability of victory, and people were outraged when she did not win. They forgot that things with a 30% chance of happening happen 3 times out of 10. (Brexit was at 80% or so on the markets, I believe, so 1 time in 5 for that.)

When Cassandra gives a prediction of 80% YES, this means that out of 5 markets with such a prediction, we expect one to resolve NO.

This means that we cannot rely on any one forecast to judge our performance; instead we need to compare our predictions with the frequency with which the markets resolve YES or NO. This is our calibration.

If events that we predict will occur 60% of the time occur 60% of the time, and events that we predict will occur 70% of the time occur 70% of the time, and events that we predict will occur 80% of the time occur 80% of the time, and so on, then we have perfect calibration.

It is useful to track calibration to see whether we are over- or under-confident. (People are often over-confident in some ranges and under-confident in others. For example, I might be over-confident when I give predictions around 90% and under-confident when I give predictions around 60%. Tracking calibration allows us to notice this and adjust.)
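Tracking calibration can be sketched in a few lines: bucket predictions by their stated probability and compare each bucket's probability with the observed YES frequency. This is just an illustration with made-up numbers; `calibration_table` and the bin width are my own choices, not part of any Cassandra tooling.

```python
from collections import defaultdict

def calibration_table(forecasts, outcomes, bin_width=0.1):
    """Group forecasts into probability bins and report the observed
    frequency of YES (1) outcomes in each bin."""
    bins = defaultdict(list)
    for p, o in zip(forecasts, outcomes):
        # Round each forecast to the nearest bin centre, e.g. 0.58 -> 0.6
        centre = round(round(p / bin_width) * bin_width, 2)
        bins[centre].append(o)
    return {c: sum(os) / len(os) for c, os in sorted(bins.items())}

# Hypothetical data: calibrated at 60%, over-confident at 90%
forecasts = [0.6, 0.6, 0.6, 0.6, 0.6, 0.9, 0.9, 0.9, 0.9, 0.9]
outcomes  = [1,   1,   1,   0,   0,   1,   1,   1,   0,   0]
print(calibration_table(forecasts, outcomes))  # -> {0.6: 0.6, 0.9: 0.6}
```

Here the 60% bucket resolves YES 60% of the time (well calibrated), but the 90% bucket also resolves YES only 60% of the time, which is the over-confidence we would want to notice and adjust for.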

However, this is not the only thing we care about. For example, if we were predicting the chance of rain, and we know that it rains 50% of the time, and we simply always predict a 50% chance of rain, then we have perfect calibration: it will indeed rain 50% of the time. Unfortunately this is not useful information. We want to know which days it will rain on. We also want high discrimination.

Discrimination, or resolution, is how good the forecast is at assigning high probabilities when the event occurs. Perfect discrimination would require that we assigned 100% probability on the days when it rained. The farther away from the base rate (i.e. from 50%) you get in the right direction, the better your resolution (and the more useful your predictions).

Ideally we want some way of tracking both our discrimination and our calibration, and seeing how close we are to perfection (always assigning 100% if the event happens and 0% if it does not, i.e. becoming Cassandra herself). A common measure is the Brier score, created to track the accuracy of weather forecasts. The Brier score of a prediction is the square of the difference between the outcome (YES = 1, NO = 0) and the assigned probability. For example:

- If the forecast is 100% and it rains, the Brier score is 0, the best score achievable.
- If the forecast is 100% and it doesn't rain, the Brier score is 1, the worst score achievable.
- If the forecast is 70% and it rains, the Brier score is (0.70−1)² = 0.09.
- If the forecast is 30% and it rains, the Brier score is (0.30−1)² = 0.49.
- If the forecast is 50%, the Brier score is (0.50−1)² = (0.50−0)² = 0.25, regardless of the outcome.

A lower Brier score is better. To get Cassandra's overall Brier score we take the mean of all the individual scores.
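The worked examples above, and the mean over a set of markets, reduce to a one-line formula. A minimal sketch (the `(forecast, outcome)` pairs at the end are hypothetical, not real Cassandra markets):

```python
def brier(forecast, outcome):
    """Squared difference between the assigned probability and the
    outcome (YES = 1, NO = 0). Lower is better."""
    return (forecast - outcome) ** 2

print(brier(1.0, 1))  # 0.0 -- 100% and it rains: best possible
print(brier(1.0, 0))  # 1.0 -- 100% and it doesn't rain: worst possible
print(brier(0.7, 1))  # ~0.09
print(brier(0.3, 1))  # ~0.49
print(brier(0.5, 1), brier(0.5, 0))  # 0.25 either way

# Overall score: the mean over all resolved markets
predictions = [(0.8, 1), (0.6, 0), (0.9, 1)]  # made-up (forecast, outcome) pairs
mean_score = sum(brier(f, o) for f, o in predictions) / len(predictions)
print(mean_score)  # ~0.137
```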

(If you had no knowledge beyond the base rate, your best strategy would be to predict 50%, which results in a Brier score of 0.25. If you had access to BBC weather, you might be able to improve on that by giving >50% predictions on days it rains. To get a perfect Brier score of 0, you would need to predict 100% on every day it ended up raining, and 0% on every day it didn't!)

So what is a ‘good' Brier score? This is hard to say. If we score over 0.25, we would be better off just saying ‘everything has a 50% probability'. If we score over 0.33, we would be better off giving completely random predictions. It will be interesting to track our score over time and see whether we improve as an organisation. (Hopefully we are below 0.25 at least!)
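These two baselines can be checked with a quick simulation. A sketch under made-up 50/50 outcomes (`mean_brier`, `always_half`, and `uniform_random` are my own names, chosen for illustration):

```python
import random

random.seed(0)
# Simulated events with a 50% base rate
outcomes = [random.random() < 0.5 for _ in range(100_000)]

def mean_brier(forecaster, outcomes):
    """Mean Brier score of a zero-knowledge forecaster over the outcomes."""
    return sum((forecaster() - o) ** 2 for o in outcomes) / len(outcomes)

always_half = lambda: 0.5              # always say 'everything is 50%'
uniform_random = lambda: random.random()  # completely random prediction

print(mean_brier(always_half, outcomes))     # exactly 0.25
print(mean_brier(uniform_random, outcomes))  # ~0.33
```

Always predicting 50% scores exactly 0.25 regardless of the outcomes, while predicting uniformly at random averages about 1/3, which is where those two thresholds come from.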

(You can in fact decompose the Brier score into multiple components which correspond to calibration and discrimination. If you're interested in learning more, there is good info under ‘Decompositions' in the Wikipedia article on the Brier score.)
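One standard version of that decomposition (often attributed to Murphy) splits the mean Brier score into reliability (calibration) − resolution (discrimination) + uncertainty. A sketch with hypothetical forecasts, grouping predictions that share the same stated probability; `murphy_decomposition` is my own name:

```python
from collections import defaultdict

def murphy_decomposition(forecasts, outcomes):
    """Split the mean Brier score into reliability - resolution + uncertainty.
    Reliability (lower = better calibrated), resolution (higher = better
    discrimination), uncertainty (fixed by the base rate)."""
    n = len(outcomes)
    base_rate = sum(outcomes) / n
    groups = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        groups[f].append(o)
    reliability = sum(len(os) * (f - sum(os) / len(os)) ** 2
                      for f, os in groups.items()) / n
    resolution = sum(len(os) * (sum(os) / len(os) - base_rate) ** 2
                     for os in groups.values()) / n
    uncertainty = base_rate * (1 - base_rate)
    return reliability, resolution, uncertainty

forecasts = [0.8, 0.8, 0.8, 0.8, 0.8, 0.3, 0.3, 0.3, 0.3, 0.3]
outcomes  = [1,   1,   1,   1,   0,   0,   0,   1,   0,   0]
rel, res, unc = murphy_decomposition(forecasts, outcomes)
mean_brier = sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(outcomes)
# The identity  mean Brier = reliability - resolution + uncertainty  holds
print(rel, res, unc, mean_brier)
```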


Last modified 2019-07-29 Mon 20:45.