Verification: assessment of calibration and accuracy

In ensemble forecasting, forecast verification methods are needed to diagnose both the need for statistical post-processing, and the effectiveness of the post-processing methods in producing calibrated and accurate forecasts. This chapter discusses an array of techniques that can be used in this context, making the distinction between verification tools that are useful for ranking competing forecasters and those that are more appropriate for improving our understanding of the performance of a single method. With a focus on continuous variables, verification methods for both univariate and multivariate forecasts are discussed, including approaches that are specifically tailored to the evaluation of extreme events.


Introduction
In a discussion article on the application of mathematics in meteorology, Bigelow (1905) describes the fundamentals of modeling in a timeless manner: "There are three processes that are generally essential for the complete development of any branch of science, and they must be accurately applied before the subject can be considered to be satisfactorily explained. The first is the discovery of a mathematical analysis, the second is the discussion of numerous observations, and the third is a correct application of the mathematics to the observations, including a demonstration that these are in agreement." The topic of this chapter is methods for carrying out the last item on Bigelow's list, that is, methods to demonstrate the agreement between a model and a set of observations. Ensemble prediction systems and statistically postprocessed ensemble forecasts provide probabilistic predictions of future weather. Verification methods applied to these systems should thus be equipped to handle both the verification of the best prediction derived from the ensemble and the verification of the associated prediction uncertainty. Murphy (1993) argues that a general prediction system should strive to perform well on three types of goodness: there should be consistency between the forecaster's judgment and the forecast, there should be correspondence between the forecast and the observation, and the forecast should be informative for the user. Similarly, Gneiting et al. (2007) state that the goal of probabilistic forecasting should be to maximize the sharpness of the predictive distribution subject to calibration. Here, calibration refers to the statistical consistency between the forecast and the observation, while sharpness refers to the concentration of the forecast uncertainty; the sharper the forecast, the higher its information value. The prediction goal of Gneiting et al. (2007) is thus equivalent to Murphy's second and third types of goodness.
We focus on verification methods for probabilistic predictions of continuous variables in one or more dimensions under the general framework described by Murphy (1993) and Gneiting et al. (2007). Specifically, we denote an observation by y = (y_1, . . . , y_d) ∈ Ω^d for d = 1, 2, . . ., where Ω denotes either the real axis R, the non-negative real axis R_{≥0}, the positive real axis R_{>0}, or an interval on R. A probabilistic forecast for y is given by a distribution function F on Ω^d. Section 2 discusses methods for assessing the calibration of such forecasts. Section 3 treats proper scoring rules and related tools that assess accuracy by comparing the empirical distribution of a series of events against a predictive distribution. The scores may focus on certain aspects of the forecast, such as the tails, and it is important to also assess the uncertainty in the scores. The properties of various univariate scores are compared in a simulation study. While the methods in Section 3 provide a decision-theoretically coherent approach to model evaluation and model ranking, they may hide key information about model performance, such as the direction of a bias. Additional evaluation may thus be needed to better understand the performance of a single model. Approaches for this are discussed in Section 4. The chapter then closes with a summary in Section 5.

Calibration
Calibration, or reliability, is the most fundamental aspect of forecast skill for probabilistic forecasts as it is a necessary condition for the optimal use and value of the forecast. Calibration refers to the statistical compatibility between the forecast and the observation; the forecast is calibrated if the observation cannot be distinguished from a random draw from the predictive distribution.

Univariate calibration
Several alternative notions of univariate calibration exist for both a single forecast (Tsyplakov, 2013) and a group of forecasts (Strähl and Ziegel, 2017). We focus on the so-called probabilistic calibration as suggested by Dawid (1984); F is probabilistically calibrated if the probability integral transform (PIT) F(Y), the value of the predictive cumulative distribution function in the random observation Y, is uniformly distributed. If F has a discrete component, a randomized version of the PIT given by

Z = F(Y−) + V [F(Y) − F(Y−)],

where F(Y−) denotes the left-hand limit of F at Y and V ∼ U([0, 1]) is independent of Y, may be used, see Gneiting and Ranjan (2013).
Assume our test set consists of n observations y 1 , . . . , y n . For a forecasting method issuing continuous univariate predictive distributions F 1 , . . . , F n , calibration can be assessed empirically by plotting the histogram of the PIT values F 1 (y 1 ), . . . , F n (y n ).
A forecasting method that is calibrated on average will return a uniform histogram; a ∩-shape indicates overdispersion and a ∪-shape indicates underdispersion, while a systematic bias results in a triangle-shaped histogram. Some examples of miscalibration are shown in Figure 1, including an example of a multiply mis-specified forecast where the left tail is too light, the main bulk of the distribution lacks mass, and the right tail is too heavy.
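As a small illustration (not from the chapter; the Gaussian forecast and observation models below are chosen purely for demonstration), PIT values and their histogram frequencies can be computed as follows:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
n = 20000
y = rng.normal(0.0, 1.0, size=n)  # observations from a standard Gaussian

def pit_values(mu, sigma, obs):
    """PIT values F_i(y_i) for Gaussian predictive distributions N(mu, sigma^2)."""
    return norm.cdf(obs, loc=mu, scale=sigma)

def bin_frequencies(pit, bins=10):
    """Relative frequencies of the PIT values in equally sized bins."""
    counts, _ = np.histogram(pit, bins=bins, range=(0.0, 1.0))
    return counts / counts.sum()

# Calibrated forecaster: issues the true distribution N(0, 1).
freq_cal = bin_frequencies(pit_values(0.0, 1.0, y))
# Underdispersive forecaster: predictive spread is too small.
freq_und = bin_frequencies(pit_values(0.0, 0.5, y))
```

For the underdispersive forecaster the outer bins collect excess mass, producing the ∪-shape described above, while the calibrated forecaster yields approximately uniform frequencies.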
The discrete equivalent of the PIT histogram, which applies to ensemble forecasts, is the verification rank histogram (Anderson, 1996; Hamill and Colucci, 1997). It shows the distribution of the ranks of the observations within the corresponding ensembles and has the same interpretation as the PIT histogram.
The information provided by a rank histogram may also be summarized numerically by the reliability index (RI), which is defined as

RI = Σ_{i=1}^{I} | ζ_i − 1/I |,

where I is the number of (equally-sized) bins in the histogram and ζ_i is the observed relative frequency in bin i = 1, . . . , I. The reliability index thus measures the departure of the rank histogram from uniformity (Delle Monache et al., 2006).
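A minimal sketch of this summary measure, operating on the observed bin frequencies (the function name is our own):

```python
import numpy as np

def reliability_index(freqs):
    """Sum of absolute departures of the observed bin frequencies from the
    uniform frequency 1/I, where I is the number of bins."""
    freqs = np.asarray(freqs, dtype=float)
    n_bins = len(freqs)
    return float(np.abs(freqs - 1.0 / n_bins).sum())

# A perfectly uniform histogram has RI = 0; departures increase the index.
ri_uniform = reliability_index([0.2] * 5)
ri_biased = reliability_index([0.4, 0.2, 0.2, 0.1, 0.1])
```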

Multivariate calibration
For assessing the calibration of multivariate forecasts, we focus on four different approaches that follow a general two-step framework formalized by Gneiting et al. (2008). Further approaches are discussed in Gneiting et al. (2008), Ziegel and Gneiting (2014) and Wilks (2016). Let S = {x_1, . . . , x_K, y} denote a set of K + 1 points in Ω^d comprising an ensemble forecast with K members and the corresponding observation y. The rank of y in S, rank_S(y), is calculated in two steps: (i) apply a pre-rank function ρ_S : Ω^d → R_{≥0} to calculate the pre-rank ρ_S(u) of every u ∈ S, resulting in a univariate value for each u; (ii) set the rank of the observation y equal to the rank of ρ_S(y) among the pre-ranks {ρ_S(u) : u ∈ S}, with ties resolved at random.
The difference between the four approaches thus lies in the definition of the pre-rank function ρ_S in step (i). The multivariate ranking of Gneiting et al. (2008) is defined using the pre-rank function

ρ^mv_S(u) = Σ_{v ∈ S} 1{v ⪯ u},    (1)

where 1 denotes the indicator function and v ⪯ u if and only if v_i ≤ u_i in all components i = 1, . . . , d. Gneiting et al. (2008) further consider an optional initial step in the ranking procedure in which the data is normalized in each component before the ranking. The average ranking proposed by Thorarinsdottir et al. (2016) provides a similar ascending rank structure and is given by the average over the univariate ranks. That is, let

rank_{S,i}(u_i) = Σ_{v ∈ S} 1{v_i ≤ u_i}

denote the standard univariate rank of the ith component of u among the values in S. The multivariate average rank is then defined using the pre-rank function

ρ^avg_S(u) = (1/d) Σ_{i=1}^{d} rank_{S,i}(u_i).    (2)

Two further approaches assess the centrality of the observation within the ensemble. Under minimum spanning tree ranking, the pre-rank function ρ^mst_S(u) is given by the length of the minimum spanning tree of the set S \ {u} (Smith and Hansen, 2004; Wilks, 2004). Here, a spanning tree of the set S \ {u} is a collection of K − 1 edges such that all points in S \ {u} are connected. The spanning tree with the smallest length is then the minimum spanning tree (Kruskal, 1956); it may e.g. be calculated using the R package vegan (Oksanen et al., 2017; R Core Team, 2016). Alternatively, the band depth ranking proposed by Thorarinsdottir et al. (2016) uses a pre-rank function that calculates the proportion of components of u ∈ S inside bands defined by pairs of points from S. It can be written as

ρ^bd_S(u) = (1/d) Σ_{i=1}^{d} #{ {v, w} ⊆ S : min(v_i, w_i) ≤ u_i ≤ max(v_i, w_i) }.    (3)

If u_i ≠ v_i with probability 1 for all u, v ∈ S with u ≠ v and i = 1, . . . , d, the formula in (3) may, up to terms that do not affect the resulting ranking, be simplified to

ρ^bd_S(u) = (1/d) Σ_{i=1}^{d} [ K + 1 − rank_{S,i}(u_i) ] [ rank_{S,i}(u_i) − 1 ].    (4)

This implies that the formula in (3) should be used for forecasts with a discrete component, e.g. precipitation forecasts.
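As an illustration, the multivariate and average pre-rank functions can be sketched for a small point set as follows (our own helper names; each row of S is one point in Ω^d):

```python
import numpy as np

def multivariate_prerank(S):
    """Multivariate pre-rank of Gneiting et al. (2008): for each point u in S,
    count the points v in S with v_i <= u_i in every component."""
    S = np.asarray(S, dtype=float)                        # shape (K + 1, d)
    dominated = (S[None, :, :] <= S[:, None, :]).all(axis=2)
    return dominated.sum(axis=1)

def average_prerank(S):
    """Average pre-rank of Thorarinsdottir et al. (2016): the mean of the
    standard univariate ranks over the d components."""
    S = np.asarray(S, dtype=float)
    univariate_ranks = (S[None, :, :] <= S[:, None, :]).sum(axis=1)
    return univariate_ranks.mean(axis=1)
```

For three mutually non-dominating points such as (0, 2), (1, 1) and (2, 0), all multivariate pre-ranks equal one, so the rank of the observation would be resolved at random in step (ii), whereas the average pre-ranks still reflect the univariate rank structure.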
While all four methods return a uniform rank histogram for a calibrated forecast, the interpretation of the histogram shape for a misspecified forecast varies between the methods as demonstrated in the following example.

Example: Comparing multivariate ranking methods
The four multivariate ranking methods are compared in Figure 2 for several different settings where y ∈ R^d can be thought of as a temporal trajectory of a real-valued variable observed at d = 10 equidistant time points t = 1, . . . , 10. In the first two examples (rows one and two), y is a realization of a zero-mean Gaussian AR(1) (autoregressive) process Y with a covariance function given by

c(s, t) = exp( −τ |s − t| ).    (5)

Figure 2. Average ranking (first column), band depth ranking (second column), multivariate ranking (third column) and minimum spanning tree ranking (fourth column). 10,000 simulated observations of dimension 10 are compared against ensemble forecasts with 50 members. In the top two rows, the observations are realizations of a zero-mean Gaussian AR(1) process with the covariance function in (5) where τ = 3. The forecasts follow the same model with τ = 1.5 (first row) and τ = 5 (second row). In the bottom two rows, the observations are iid standard Gaussian variables while the forecasts have variance 1.25² (third row) and 0.85² (fourth row). The theoretically optimal histograms are indicated with dashed lines.
The process Y thus has standard Gaussian marginal distributions while the parameter τ controls how fast correlations decay with time lag. We set τ = 3 for Y and consider ensemble forecasts with 50 members of the same type but with a different parameter value τ . That is, we set τ = 1.5 in row one (too strong correlation) and τ = 5 in row two (too weak correlation). It follows from this construction that a univariate calibration test at a fixed time point would not detect any miscalibration in the forecasts.
While all four methods are able to detect the misspecification in the correlation structure, the resulting histograms vary in shape. The shape of the average rank histograms and the band depth rank histograms offer a similar interpretation as that of the univariate rank histograms in Figure 1 with a ∪-shape when the correlation is too strong (underdispersion across components) and a ∩-shape when the correlation is too weak (overdispersion across components). In these 10-dimensional examples, the pre-rank ordering of the multivariate rank histograms (1) is only able to detect miscalibration related to the highest ranks, see also the discussion in Pinson and Girard (2012) and Thorarinsdottir et al. (2016). Under minimum spanning tree ranking, too many observations have high ranks when the correlation in the forecasts is too strong and the opposite holds for the example with too weak correlation in the forecasts.
In the latter two examples in Figure 2 (rows three and four), both observations and forecasts are iid variables in ten dimensions. However, the marginal distributions of the ensemble forecasts are misspecified: the observations follow a standard Gaussian distribution, while the forecasts in row three have a standard deviation of 1.25 (overdispersion) and the forecasts in row four have a standard deviation of 0.85 (underdispersion). The shape of the average rank histograms is exactly that of their univariate counterparts in Figure 1, indicating that the ranking method cannot distinguish between miscalibration in the marginals and the higher order structure. For the two ranking methods based on centrality, the marginal overdispersion results in too many high ranks while the marginal underdispersion results in too many low ranks. For this dimensionality, the multivariate ranking is unable to detect the miscalibration.
Further comparison of the four ranking methods is provided in Thorarinsdottir et al. (2016) and Wilks (2016). In general, it is a challenging task to represent and compare a multi-faceted higher order structure with a single value. As the different methods vary in their strengths and weaknesses, it is recommended to apply several of these methods when assessing multivariate calibration. Furthermore, a prior assessment of the marginal calibration may increase the information value in the multivariate rank histograms and ease the interpretation of the resulting shapes, as these methods perform a simultaneous assessment of the marginal and the higher-order calibration.

Accuracy
In this section, we discuss methods for assessing forecast accuracy that are appropriate for ranking and comparing competing forecasting methods.

Scoring rules
Scoring rules assess the accuracy of probabilistic forecasts by assigning a numerical penalty to each forecast-observation pair. Specifically, a scoring rule is a mapping

S : F × Ω → R ∪ {∞},    (6)

where, in our notation, a smaller penalty indicates a better prediction. A scoring rule is proper relative to the class F if

E_G S(G, Y) ≤ E_G S(F, Y)    (7)

for all probability distributions F, G ∈ F, that is, if the expected score for a random observation Y is optimized when the true distribution of Y is issued as the forecast. The scoring rule is strictly proper relative to the class F if (7) holds with equality only if F = G. Propriety encourages honesty and prevents hedging, which coincides with Murphy's first type of goodness (Murphy, 1993).
Competing forecasting methods are compared based on a proper scoring rule by comparing the mean score over an out-of-sample test set, and the method with the smallest mean score is preferred. Formal tests of the null hypothesis of equal predictive performance can also be employed, see Section 3.5. While average scores are directly comparable if they refer to the same set of forecast situations, this may no longer hold for distinct sets of forecast cases, for instance, due to spatial and temporal variability in the predictability of weather. For ease of interpretability and to address this issue, verification results are commonly represented as a skill score of the form

SS_A = ( S̄(F_ref) − S̄(F_A) ) / ( S̄(F_ref) − S̄_opt )    (8)

for the forecasting method A, where F_ref denotes the forecast from a reference method, S̄ denotes the mean score over the test set, and S̄_opt denotes the best attainable mean score. The skill score is standardized such that it takes the value 1 for an optimal forecast and the value 0 for the reference forecast. Negative values thus indicate that the forecasting method A is of a lesser quality than the reference forecast. However, it is vital to select the reference forecast with care (Murphy, 1974, 1992) as skill scores of the form (8) may be improper even if the underlying scoring rule S is proper (Murphy, 1973a).
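A minimal sketch of this skill-score convention, assuming a score whose optimal mean value is known (e.g. zero for the CRPS):

```python
def skill_score(mean_score, mean_score_ref, optimal_score=0.0):
    """Skill of method A relative to a reference method, standardized to
    1 for an optimal forecast and 0 for the reference forecast."""
    return (mean_score_ref - mean_score) / (mean_score_ref - optimal_score)

# A mean CRPS of 0.8 against a reference mean CRPS of 1.0 yields a skill of 0.2.
skill = skill_score(0.8, 1.0)
```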
The most popular proper scoring rules for univariate real-valued quantities are the ignorance (or logarithmic) score (IGN) and the continuous ranked probability score; see Gneiting and Raftery (2007) for a more comprehensive list. IGN is defined as

IGN(F, y) = − log f(y),    (9)

where f denotes the density of F (Good, 1952). It thus applies to absolutely continuous distributions only and cannot be applied directly to ensemble forecasts. For a large enough ensemble, the density of the ensemble forecast may potentially be approximated using e.g. kernel density estimation or by fitting a parametric distribution. Alternatively, IGN may be replaced by the Dawid-Sebastiani (DS) score (Dawid and Sebastiani, 1999),

DS(F, y) = ( (y − µ_F) / σ_F )² + 2 log σ_F,    (10)

where µ_F denotes the mean value of F and σ²_F its variance. While the proper DS score equals IGN, up to an affine transformation, for a Gaussian predictive distribution F, it only requires the estimation of the ensemble mean and variance.
The continuous ranked probability score (CRPS; Matheson and Winkler, 1976) is of particular interest in that it simultaneously assesses both calibration and sharpness, and thus all three types of goodness discussed by Murphy (1993). The CRPS applies to probability distributions with finite mean and has three equivalent definitions (Gneiting and Ranjan, 2011; Hersbach, 2000; Laio and Tamea, 2007),

CRPS(F, y) = E|X − y| − ½ E|X − X′|    (11)
           = ∫ ( F(x) − 1{y ≤ x} )² dx    (12)
           = ∫₀¹ 2 ( 1{y ≤ F⁻¹(q)} − q ) ( F⁻¹(q) − y ) dq.    (13)

Here, X and X′ denote two independent random variables with distribution F, and 1{y ≤ x} denotes the indicator function which is equal to 1 if y ≤ x and 0 otherwise. It follows directly from (12) and (13) that the CRPS is tightly linked to other proper scores that focus on specific parts of the predictive distribution. The form in (12) can be interpreted as the integral over the Brier score (Brier, 1950), which assesses the predictive probability of threshold exceedance. The Brier score is usually written in the form

BS(p_u, y) = ( 1{y > u} − p_u )²    (14)

for a threshold u with p_u = 1 − F(u). Similarly, the integrand in (13) equals the quantile score (Friederichs and Hense, 2007),

QS_q(F⁻¹(q), y) = 2 ( 1{y ≤ F⁻¹(q)} − q ) ( F⁻¹(q) − y ),    (15)

which assesses the predicted quantile F⁻¹(q) for a probability level q ∈ (0, 1).
When the predictive distribution F is given by a finite ensemble {x_1, . . . , x_K}, the CRPS representation in (11) is equal to

CRPS(F, y) = (1/K) Σ_{k=1}^{K} |x_k − y| − (1 / (2K²)) Σ_{k=1}^{K} Σ_{l=1}^{K} |x_k − x_l|.    (16)

For small ensembles, Ferro et al. (2008) propose a fair approximation given by

CRPS_fair(F, y) = (1/K) Σ_{k=1}^{K} |x_k − y| − (1 / (2K(K − 1))) Σ_{k=1}^{K} Σ_{l=1}^{K} |x_k − x_l|.    (17)

For large ensembles, a more computationally efficient calculation is based on the form in (12) (Hersbach, 2000). Specifically, let x_{(1)} ≤ · · · ≤ x_{(K)} denote the order statistics of x_1, . . . , x_K. Then

CRPS(F, y) = Σ_{i=0}^{K} [ α_i (i/K)² + β_i (1 − i/K)² ],    (18)

where

α_i = x_{(i+1)} − x_{(i)} and β_i = 0 if y ≥ x_{(i+1)},
α_i = y − x_{(i)} and β_i = x_{(i+1)} − y if x_{(i)} ≤ y ≤ x_{(i+1)},
α_i = 0 and β_i = x_{(i+1)} − x_{(i)} if y ≤ x_{(i)},

for i = 1, . . . , K − 1, and, for i = 0 and i = K,

α_0 = 0, β_0 = x_{(1)} − y if y < x_{(1)} and β_0 = 0 otherwise,
β_K = 0, α_K = y − x_{(K)} if y > x_{(K)} and α_K = 0 otherwise.

The formula in (18) is implemented in the R package scoringRules (Jordan et al., 2017) together with exact formulas for a large class of parametric families of distributions, see Table 1. Optimal approximations for both IGN and CRPS when the distribution F is the posterior predictive distribution from a Bayesian analysis are discussed in Krüger et al. (2016). If F_L with parameters θ is the distribution model used in the likelihood and {θ_i}_{i=1}^{I} is a sample from the posterior parameter distribution, F is optimally approximated by the mixture-of-parameters technique,

F̂(x) = (1/I) Σ_{i=1}^{I} F_L(x | θ_i).
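A small sketch of the ensemble CRPS in its kernel form, together with the fair version of Ferro et al. (2008) (function names are our own):

```python
import numpy as np

def crps_ensemble(ens, y):
    """Ensemble CRPS in kernel form: mean |x_k - y| minus half the mean
    absolute difference over all member pairs (divisor K^2)."""
    ens = np.asarray(ens, dtype=float)
    term1 = np.abs(ens - y).mean()
    term2 = np.abs(ens[:, None] - ens[None, :]).mean() / 2.0
    return float(term1 - term2)

def crps_ensemble_fair(ens, y):
    """Fair version: the pair term uses the divisor K(K - 1) instead of K^2."""
    ens = np.asarray(ens, dtype=float)
    K = ens.size
    term1 = np.abs(ens - y).mean()
    term2 = np.abs(ens[:, None] - ens[None, :]).sum() / (2.0 * K * (K - 1))
    return float(term1 - term2)
```

For the two-member ensemble {0, 1} and observation 0.5, the plain ensemble CRPS is 0.25 while the fair version is 0, illustrating how the correction rewards small ensembles drawn from a calibrated distribution.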

Scoring rules derived from scoring functions
The quality of a deterministic forecast x is typically assessed by applying a scoring function s(x, y), that assigns a numerical score based on x and the corresponding observation y.
As in the case of proper scoring rules, competing forecasting methods are compared and ranked in terms of the mean score over the cases in a test set. Popular scoring functions include the squared error, s(x, y) = (x − y) 2 , and the absolute error, s(x, y) = |x − y|.
A scoring function can be applied to a probabilistic prediction F ∈ F if it is consistent for a functional T relative to the class F in the sense that

E_F s(T(F), Y) ≤ E_F s(x, Y)    (20)

for all x ∈ Ω and F ∈ F. A consistent scoring function becomes a proper scoring rule if the functional T in (20) is used as the derived deterministic prediction based on F. That is, if S(F, y) = s(T(F), y). The squared error proper scoring rule is given by

S(F, y) = ( mean(F) − y )²,    (21)

where mean(F) denotes the mean value of F, and the absolute error proper scoring rule becomes

S(F, y) = | med(F) − y |,    (22)

where med(F) denotes the median of F.
One appealing property of scoring rules that derive from scoring functions is thus the possibility to compare deterministic and probabilistic forecasts. See Gneiting (2011) for an extensive discussion of the use of scoring functions to evaluate probabilistic predictions.
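To make the notion of consistency concrete, the following sketch (with an arbitrary skewed distribution and grid, not taken from the chapter) checks numerically that the expected squared error is minimized near the mean and the expected absolute error near the median:

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.lognormal(mean=0.0, sigma=1.0, size=50000)  # skewed "observations"

# Evaluate the expected score of each candidate point forecast x on a grid.
grid = np.linspace(0.1, 5.0, 491)
exp_sq = np.array([np.mean((x - sample) ** 2) for x in grid])
exp_abs = np.array([np.mean(np.abs(x - sample)) for x in grid])

best_sq = float(grid[np.argmin(exp_sq)])    # minimizer of the squared error
best_abs = float(grid[np.argmin(exp_abs)])  # minimizer of the absolute error
```

For this skewed sample, best_sq lands near the sample mean and best_abs near the sample median, so the two scoring functions reward two different functionals of the predictive distribution.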

Assessing extreme events
Forecasts specifically aimed at predicting extreme events can be assessed in a standard manner e.g. by using the scoring rules discussed in Sections 3.1.1 and 3.1.2 above (Friederichs and Thorarinsdottir, 2012). However, restricting conventional forecast evaluation to subsets of extreme observations, that is, selecting the extreme observations after the fact and discarding the non-extreme ones before proceeding with standard evaluation tools, invalidates the theoretical properties of these tools and encourages hedging strategies (Lerch et al., 2017).
Specifically, Gneiting and Ranjan (2011) show that a proper scoring rule S is rendered improper if the product with a non-constant weight function w is formed, where w depends on the observed value y. That is, consider the weighted scoring rule

S⁰(F, y) = w(y) S(F, y).    (23)

Then, if Y has density g, the expected score E_g S⁰(F, Y) is minimized by the predictive distribution F* with density

f*(y) = w(y) g(y) / ∫ w(z) g(z) dz,    (24)

which is proportional to the product of the weight function w and the true density g. In particular, if w(y) = 1{y ≥ u} for some high threshold value u, then S⁰ corresponds to evaluating F only on observed values exceeding u under the scoring rule S.
Instead, one can apply proper weighted scoring rules that are tailored to emphasize specific regions of interest. Diks et al. (2011) propose two weighted versions of the ignorance score that correct for the result in (24). The conditional likelihood (CL) score is given by

CL(F, y) = − w(y) log( f(y) / ∫ w(z) f(z) dz ),    (25)

and the censored likelihood (CSL) score is defined as

CSL(F, y) = − w(y) log f(y) − (1 − w(y)) log( 1 − ∫ w(z) f(z) dz ).    (26)

Here, w is a weight function such that 0 ≤ w(z) ≤ 1 and ∫ w(z) f(z) dz > 0 for all potential predictive distributions F ∈ F. When w(y) ≡ 1, both the CL and the CSL score reduce to the unweighted ignorance score in (9).
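As a sketch for a Gaussian predictive distribution and the indicator weight w(z) = 1{z ≥ u}, the tail integral ∫ w(z) f(z) dz reduces to the Gaussian survival function (function names are our own):

```python
import numpy as np
from scipy.stats import norm

def cl_score(mu, sigma, y, u):
    """Conditional likelihood score for a Gaussian predictive N(mu, sigma^2)
    and weight w(z) = 1{z >= u}; zero when the observation is below u."""
    if y < u:
        return 0.0
    tail_mass = norm.sf(u, loc=mu, scale=sigma)  # integral of w(z) f(z) dz
    return float(-(norm.logpdf(y, loc=mu, scale=sigma) - np.log(tail_mass)))

def csl_score(mu, sigma, y, u):
    """Censored likelihood score for the same predictive and weight."""
    tail_mass = norm.sf(u, loc=mu, scale=sigma)
    if y >= u:
        return float(-norm.logpdf(y, loc=mu, scale=sigma))
    return float(-np.log1p(-tail_mass))
```

Pushing the threshold u far into the lower tail makes the weight effectively one everywhere, and both scores then coincide with the unweighted ignorance score.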
Gneiting and Ranjan (2011) propose the threshold-weighted continuous ranked probability score (twCRPS), defined as

twCRPS(F, y) = ∫ w(z) ( F(z) − 1{y ≤ z} )² dz,    (27)

where, again, w is a non-negative weight function, see also Matheson and Winkler (1976). When w(z) ≡ 1, the twCRPS reduces to the unweighted CRPS in (12), while the indicator weight w(z) = 1{z ≥ u} emphasizes the region above a threshold u. Non-stationarity in the mean climate, e.g. due to spatial heterogeneity, may render it difficult to define a common threshold value u over a large number of forecast cases. Here, it may be more natural to define a weight function in quantile space using the CRPS representation in (13),

twCRPS(F, y) = ∫₀¹ w(τ) 2 ( 1{y ≤ F⁻¹(τ)} − τ ) ( F⁻¹(τ) − y ) dτ,    (28)

where w is a non-negative weight function on the unit interval (Gneiting and Ranjan, 2011; Matheson and Winkler, 1976). Setting w(τ) ≡ 1 retrieves the unweighted CRPS in (13), while this definition of the twCRPS with w(τ) = 1{τ = q} equals the quantile score in (15). Examples of more general weight functions for this setting include w(τ) = 1{τ ≥ q} and w(τ) = τ² for the upper tail, and w(τ) = 1{τ ≤ q} and w(τ) = (1 − τ)² for the lower tail with appropriate threshold values q, see also Gneiting and Ranjan (2011).
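A numerical-integration sketch of the threshold-weighted CRPS for a Gaussian predictive distribution with weight w(z) = 1{z ≥ u}; the grid limits and resolution are arbitrary illustrative choices:

```python
import numpy as np
from scipy.stats import norm

def tw_crps_gaussian(mu, sigma, y, u=-np.inf, zmin=-15.0, zmax=15.0, n=40001):
    """Threshold-weighted CRPS for a Gaussian predictive N(mu, sigma^2) with
    weight w(z) = 1{z >= u}, via trapezoidal integration of
    w(z) * (F(z) - 1{y <= z})^2 over a wide grid."""
    z = np.linspace(zmin, zmax, n)
    integrand = (norm.cdf(z, loc=mu, scale=sigma) - (y <= z)) ** 2 * (z >= u)
    dz = z[1] - z[0]
    return float(((integrand[:-1] + integrand[1:]) / 2.0).sum() * dz)

# With u = -inf the weight is identically one and the unweighted CRPS results.
full_crps = tw_crps_gaussian(0.0, 1.0, 0.5)
upper_tail = tw_crps_gaussian(0.0, 1.0, 0.5, u=0.0)
```

Restricting the weight to the upper tail strictly reduces the score, since only part of the squared-difference integral is accumulated.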
Lerch et al. (2017) find that there are limited benefits in using weighted scoring rules compared to using standard, unweighted scoring rules when testing for equal predictive performance. However, the application of weight functions as described here may facilitate interpretation of the forecast skill.

Simulation study: Comparing univariate scoring rules
The purpose of this simulation study is to demonstrate a coherent approach to using proper scores and rank or PIT histograms in practice, while highlighting some of the difficulties that might arise when working with limited data sets. In particular, we investigate how different scoring rules rank forecasts according to their skill, and how this differs with the amount of available data.
We start by generating two sets of observation data, drawn randomly from a fixed "true" distribution. The first set consists of 100 values, which will serve as verifying observations, while the second set, the training data, consists of 300 values for each of the 100 observations. Our goal is to issue forecasts matching the observations, based on the information contained in the training data. For the first part of the simulation study, the true distribution is normal, with a random mean µ ∼ N (25, 1) and fixed standard deviation σ = 3. In the second part, the truth is a Gumbel distribution, with the mean following a N (25, 1) distribution and the scale parameter fixed to 3, see Table 2.
Table 2. The true distributions used in the two parts of the simulation study, where γ denotes the Euler-Mascheroni constant.

Part 1: Normal N(µ, σ²) with µ ∼ N(25, 1) and σ² = 9.
Part 2: Gumbel G(µ, σ) with µ + σγ ∼ N(25, 1) and (π²/6) σ² = 3π²/2.

Using a method-of-moments approach, we estimate four competing forecast distributions per observation, which are listed in Table 3. The distribution parameters are calculated by plugging the sample mean and sample standard deviation from the training data into the equations for mean and variance. For the non-central t-distribution, the parameters are obtained numerically, while setting the degrees of freedom ν ≥ 3 to ensure that mean and variance exist. As a fifth forecaster, we use the true distribution from which the observations are generated. An ensemble of 50 members is drawn randomly from each of the forecast distributions and is then paired with the observations.
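The moment-matching step described above can be sketched as follows (standard moment formulas; the non-central t fit is omitted because the text obtains it numerically):

```python
import numpy as np

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def fit_normal(m, s):
    """Normal N(mu, sigma^2): the moments are the parameters themselves."""
    return {"mu": m, "sigma": s}

def fit_gumbel(m, s):
    """Gumbel with mean mu + gamma * beta and variance (pi^2 / 6) beta^2."""
    beta = s * np.sqrt(6.0) / np.pi
    return {"mu": m - EULER_GAMMA * beta, "beta": beta}

def fit_lognormal(m, s):
    """Log-normal with mean exp(mu + sigma^2 / 2), matched to mean m and sd s."""
    sigma2 = np.log(1.0 + (s / m) ** 2)
    return {"mu": np.log(m) - sigma2 / 2.0, "sigma": np.sqrt(sigma2)}

# Plugging in the target moments of the truth from Table 2, part 1:
params_gumbel = fit_gumbel(25.0, 3.0)
```

By construction, the fitted Gumbel has mean µ + γβ = 25 and variance (π²/6) β² = 9, matching the target moments exactly.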
The performance of the five forecasters is evaluated using the absolute error, the squared error (both Section 3.1.2), the ignorance score, the CRPS (both Section 3.1.1) and the PIT histogram (Section 2.1). We also produced rank histograms, but they turned out to be almost identical to the PIT histograms. As we encountered variations in the scores depending on the initial random seed, the whole process is repeated 10 times with different initial seeds, so that the final number of forecast-observation pairs comes to 1000.

Figure 3. Top row: Mean absolute error, CRPS and ignorance score, and the 95% bootstrap confidence interval for the five forecast distributions, if the true distribution is normal. Scores are based on 1000 forecast-observation pairs. Bottom row: Same as above, but scores are based on 1 million forecast-observation pairs.
In order to understand the true ranking of the five forecasting methods in terms of skill, we reproduce the simulation study with 10 times 100 000 forecasts. For the case where the true distribution is normal, Figure 3 shows the mean absolute error, mean CRPS and mean ignorance score, along with a 95% bootstrap confidence interval computed from 1000 bootstrap samples. We have omitted the squared error from this plot, as its values are on a much larger scale than the other scores. Looking at the results for the small sample size (top row), all scores assign the lowest mean value, and therefore the highest skill, to the normal distribution with the true parameters. However, if no knowledge about the true distribution were available, as would be the case in a real forecast setting, the absolute error and the CRPS would prefer the log-normal distribution over all other forecasters, while the ignorance score judges the normal distribution with estimated parameters to be the best.
When comparing to the bottom panel of Figure 3, which shows the results from the study with a very large sample size, only the ignorance score manages to identify the correct order of the forecasters. All scores rightly find the Gumbel distribution, which has a completely different shape and tail behaviour than the truth, to be the worst forecast. Due to assigning large penalties to outliers, the ignorance score is able to discriminate between the shapes of the forecast distributions and shows a significant difference at the 95% level between the Gumbel and the normal, log-normal and true distributions. Judging from Figure 4, which shows PIT histograms for the small-sample study with a normal true distribution, we cannot make any statements about the forecast ranking, except that the Gumbel distribution forecast is clearly uncalibrated. Only when looking at the large-sample equivalent in Figure 5 do we see that the normal and the true forecasters stand out as the best calibrated. Figure 6 shows the scores as functions of the observed value for an example forecast, a N(27.16, 9) distribution. While the score minima largely coincide for the true and the t-distribution, it becomes clear from the shape of the ignorance score why it is much better at identifying the Gumbel distribution as an inferior forecaster than the other scores: because of the non-symmetry, forecasts will receive a much higher penalty if the observation lies left of the distribution mode than if it lies to the right.

Figure 7. Top row: Mean absolute error, CRPS and ignorance score, and the 95% bootstrap confidence interval for the five forecast distributions, if the true distribution is a Gumbel distribution. Scores are based on 1000 forecast-observation pairs. Bottom row: Same as above, but scores are based on 1 million forecast-observation pairs.
For the second part of the simulation study, we used a Gumbel distribution as truth, where the mean is distributed as N(25, 1) and the scale parameter is 3. The same kinds of forecasts are produced again: normal, non-central t, log-normal and Gumbel distributions based on the sample mean and variance of the training data. In Figure 7, the outcome of the study is shown for a small sample size (top row) and a very large sample size (bottom row). As before, the ignorance score is able to reproduce the true forecast ranking, as indicated by the bottom plots. In a reversal of the first part of the simulation study, the normal forecaster now has the lowest skill, which is recognized by all scores.
However, while the ignorance score suggests that the true Gumbel distribution and the Gumbel distribution with estimated parameters are of a similar quality and outperform the other forecasters, the true distribution is only ranked third by the absolute error and the CRPS, behind the estimated Gumbel and non-central t-distributions. This is of course concerning and hints at the fact that even for a data set of apparently sufficient size, like the 1000 50-member ensembles used here, scores do not necessarily provide robust and proper results. As in the first part, we cannot really judge from the PIT histograms in Figure 8 which of the forecasters has the highest degree of calibration, except for the uncalibrated normal distribution. One could make a case that the histogram for the true distribution looks slightly flatter than the other ones, but not with great certainty. It becomes clear from Figure 9, however, that the forecasts based on non-central t and log-normal distributions also suffer from multiple types of miscalibration. Picking an example forecast from the data set, as in Figure 10, shows that the ignorance score for the two Gumbel distribution forecasters is again non-symmetric, and therefore attains its minimum at a different value than the CRPS. In general, the ignorance score takes its minimum value at the mode of the distribution, and the CRPS at the median.
We can gather from this simulation study that even proper scores can behave very differently depending on the size of the underlying data set, and are not necessarily able to rank competing forecasters according to their actual skill. We therefore suggest always using a combination of scoring rules to extract a maximum amount of information about the performance of a particular model or forecaster. The ignorance score is more sensitive to the shape of a distribution and thus suitable for checking whether a chosen distribution actually fits the data. The CRPS is very useful for comparing models when the forecasts do not take the form of a standard probability distribution, or if for a given data set such a distribution cannot be perfectly specified.
This also has implications for the ongoing discussion of whether to use maximum likelihood methods or to minimize the CRPS when estimating model parameters (Gneiting et al., 2005), in that there might not be a definitive answer. Depending on the forecast situation and model choice, it could be preferable to switch between the two approaches. A case can be made for performing a thorough exploratory analysis of the data at hand before fitting any distributions, so as to find one that matches the data best. If it is difficult to select one distribution over the other, the simpler model should be preferred.
In all circumstances, the ranking of forecasters should not be solely based on the mean score, even if the sample size seems to be sufficiently large, but confidence intervals should be given, e.g. by applying bootstrapping techniques. We found that even for 1 million data points, differences between the forecast scores were often not significant at the 95% level.

Multivariate assessment
Two general approaches can be employed to assess multivariate forecasts with scoring rules: use specialized multivariate scores, or reduce the multivariate forecast to a univariate quantity and subsequently apply the univariate scores discussed above. For the latter approach, the appropriate univariate quantities depend on the context. Multivariate forecasts of single weather quantities usually take the form of temporal trajectories, spatial fields or space-time fields. Here, it can, for instance, be useful to assess the predictive performance of derived quantities such as maxima, minima and accumulated totals, all of which depend on accurate modelling of both the marginal and higher-order structures; see e.g. Feldmann et al. (2015) for an assessment of spatial forecast fields for temperature.
Scores that directly assess multivariate forecasts are rather scarce and, as noted by Gneiting and Katzfuss (2014), there is a need to develop further decision-theoretically principled methods for multivariate assessment. The univariate Dawid-Sebastiani score in (10) can be applied in a multivariate setting with
$$ \mathrm{dss}(F, y) = \log \det \Sigma_F + (y - \mu_F)^\top \Sigma_F^{-1} (y - \mu_F), $$
where $\mu_F$ is the mean vector and $\Sigma_F$ the covariance matrix of the predictive distribution, with $\det \Sigma_F$ denoting the determinant of $\Sigma_F$ (Dawid and Sebastiani, 1999). However, note that unless the sample size is much larger than the dimension of the multivariate quantity, sampling errors can affect the calculation of $\det \Sigma_F$ and $\Sigma_F^{-1}$ (see e.g. Table 2 in Feldmann et al., 2015). Similarly, if the multivariate predictive density is available, the ignorance score in (9) can be employed (Roulston and Smith, 2002). Gneiting and Raftery (2007) propose the energy score (ES) as a multivariate generalization of the CRPS. It is given by
$$ \mathrm{es}(F, y) = \mathbb{E} \| X - y \| - \frac{1}{2} \mathbb{E} \| X - X' \|, $$
where $X$ and $X'$ are two independent random vectors distributed according to $F$ and $\| \cdot \|$ is the Euclidean norm. For ensemble forecasts the natural analogue of the formula in (16) applies. If the multivariate observation space $\Omega^d$ consists of weather variables on varying scales, the margins should be standardized before computing the joint energy score for these variables (Schefzik et al., 2013). This can be done using the marginal means and standard deviations of the observations in the test set. The energy score has been developed with low-dimensional quantities in mind and it may lose discriminatory power in higher dimensions (Pinson, 2013). Scheuerer and Hamill (2015) propose a multivariate scoring rule that considers pairwise differences of the components of the multivariate quantity. In its general form, the variogram score (VS) of order $p$ is given by
$$ \mathrm{vs}_p(F, y) = \sum_{i=1}^{d} \sum_{j=1}^{d} \omega_{ij} \big( |y_i - y_j|^p - \mathbb{E} |X_i - X_j|^p \big)^2, $$
where $X_i$ and $X_j$ are the $i$th and $j$th components of a random vector $X$ that is distributed according to $F$, and $\omega_{ij}$ are nonnegative weights.
Scheuerer and Hamill (2015) compare different choices of the order p and find that the best results in terms of discriminative power are obtained with p = 0.5. Furthermore, they recommend using weights proportional to the inverse distance between the components unless prior knowledge regarding the correlation structure is available.
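For ensemble forecasts, both scores can be computed directly from the members. The following numpy sketch (function names are ours) uses the ensemble analogues: the expectations in the energy and variogram scores are replaced by averages over the $m$ members; for $d = 1$ the energy score reduces to the ensemble form of the CRPS.

```python
import numpy as np

def energy_score(ens, obs):
    """Energy score of an ensemble forecast.

    ens : (m, d) array of m ensemble members in d dimensions.
    obs : (d,) observation vector.
    """
    m = ens.shape[0]
    # Mean distance of the members to the observation.
    term1 = np.linalg.norm(ens - obs, axis=1).mean()
    # Mean pairwise distance between members (zero diagonal included).
    diffs = ens[:, None, :] - ens[None, :, :]
    term2 = np.linalg.norm(diffs, axis=2).sum() / (2 * m**2)
    return term1 - term2

def variogram_score(ens, obs, p=0.5, weights=None):
    """Variogram score of order p with nonnegative weights w_ij."""
    d = obs.size
    if weights is None:
        weights = np.ones((d, d))
    # Observed and ensemble-mean variograms of order p for all pairs (i, j).
    vg_obs = np.abs(obs[:, None] - obs[None, :]) ** p
    vg_ens = (np.abs(ens[:, :, None] - ens[:, None, :]) ** p).mean(axis=0)
    return (weights * (vg_obs - vg_ens) ** 2).sum()
```

Both scores are nonnegative and vanish for a degenerate ensemble that exactly matches the observation; in line with the recommendation above, `weights` would typically be set proportional to inverse pairwise distances between locations.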

Divergence functions
In some cases, in particular in climate modelling, it is of interest to compare the predictive distribution $F$ against the true distribution of the observations, which is commonly approximated by the empirical distribution function of the available observations $y_1, \ldots, y_n$,
$$ G_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{ y_i \leq x \}. $$
The two distributions, $F$ and $G_n$, can be compared using a divergence function $D(F, G_n)$, where $D(F, F) = 0$.
Assume that the observations $y_1, \ldots, y_n$ forming the empirical distribution function $G_n$ are independent with distribution $G \in \mathcal{F}$. A propriety condition for divergences corresponding to that for scoring rules (7) states that the divergence $D$ is $k$-proper for a positive integer $k$ if
$$ \mathbb{E}_G \, D(G, G_k) \leq \mathbb{E}_G \, D(F, G_k), $$
and asymptotically proper if
$$ \lim_{n \to \infty} \mathbb{E}_G \, D(G, G_n) \leq \lim_{n \to \infty} \mathbb{E}_G \, D(F, G_n) $$
for all probability distributions $F, G \in \mathcal{F}$. A score divergence that assesses the full distributions is the integrated quadratic divergence,
$$ d(F, G) = \int_{-\infty}^{\infty} \big( F(x) - G(x) \big)^2 \, dx, $$
the score divergence of the continuous ranked probability score (12). However, in the setting where the empirical distribution function $G_n$ is used, it is not practical to employ the Kullback-Leibler divergence, the score divergence of the ignorance score (9), unless the observations are categorical variables.
Alternative score divergences that assess specific properties of the predictive distribution include the mean value divergence
$$ d(F, G) = ( \mu_F - \mu_G )^2, $$
the divergence associated with the squared error scoring rule (21), and the score divergence of the Brier score (14),
$$ d(F, G) = \big( F(u) - G(u) \big)^2, $$
for some threshold $u$.
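The integrated quadratic divergence between a forecast CDF and the empirical distribution of the observations can be approximated numerically. The sketch below (function name ours) evaluates both distribution functions on a fine grid and integrates the squared difference with a simple Riemann sum; the grid bounds are a pragmatic choice, not prescribed by the definition.

```python
import numpy as np

def iq_divergence(cdf, obs, grid_size=2000):
    """Integrated quadratic divergence between a forecast CDF and the
    empirical distribution function G_n of the observations.

    cdf : vectorized callable returning F(x).
    obs : 1-D array of observations y_1, ..., y_n.
    """
    obs = np.sort(np.asarray(obs, dtype=float))
    # Grid wide enough that both CDFs are ~0 / ~1 at the ends.
    lo = obs[0] - 5 * obs.std() - 1.0
    hi = obs[-1] + 5 * obs.std() + 1.0
    x = np.linspace(lo, hi, grid_size)
    ecdf = np.searchsorted(obs, x, side="right") / obs.size
    # Riemann-sum approximation of the integral of (F - G_n)^2.
    dx = x[1] - x[0]
    return float(np.sum((cdf(x) - ecdf) ** 2) * dx)
```

As expected of a proper divergence, a forecast CDF from the data-generating distribution yields a smaller value than a biased alternative.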

Testing equal predictive performance
As demonstrated in the simulation study in Section 3.2, the estimation of the mean score over a test set may be associated with a large uncertainty. A simple bootstrapping procedure over the individual scores may be used to assess the uncertainty in the mean score, see e.g. Friederichs and Thorarinsdottir (2012). However, some care is needed if the forecast errors, and thus the resulting scores, are correlated. A comprehensive overview of bootstrapping methods for dependent data is given in Lahiri (2003).
Formal statistical tests can be applied to test equal predictive performance of two competing methods under a proper scoring rule. The most commonly applied test is the Diebold-Mariano test (Diebold and Mariano, 1995), which applies in the time series setting. Consider two competing forecasting methods F and G that for each time step $t = 1, \ldots, n$ issue forecasts $F_t$ and $G_t$, respectively, for an observation $y_{t+k}$ that lies $k$ time steps ahead. The mean scores under a scoring rule $S$ are given by
$$ \bar{S}_n^F = \frac{1}{n} \sum_{t=1}^{n} S(F_t, y_{t+k}) \quad \text{and} \quad \bar{S}_n^G = \frac{1}{n} \sum_{t=1}^{n} S(G_t, y_{t+k}). $$
The Diebold-Mariano test uses the test statistic
$$ t_n = \sqrt{n} \, \frac{\bar{S}_n^F - \bar{S}_n^G}{\hat{\sigma}_n}, \qquad (35) $$
where $\hat{\sigma}_n^2$ is an estimator of the asymptotic variance of the score difference. Under the null hypothesis of equal predictive performance and standard regularity conditions, the test statistic $t_n$ in (35) is asymptotically standard normal (Diebold and Mariano, 1995). When the null hypothesis is rejected in a two-sided test, F is preferred if $t_n$ is negative and G is preferred if $t_n$ is positive. Diebold and Mariano (1995) note that for ideal k-step-ahead forecasts, the forecast errors are at most $(k-1)$-dependent. An estimator for the asymptotic variance based on this assumption is given by
$$ \hat{\sigma}_n^2 = \hat{\gamma}_0 + 2 \sum_{j=1}^{k-1} \hat{\gamma}_j, $$
where $\hat{\gamma}_j$ denotes the lag $j$ sample autocovariance of the sequence of score differences $\{ S(F_t, y_{t+k}) - S(G_t, y_{t+k}) \}_{t=1}^{n}$ for $j = 0, 1, 2, \ldots$ (Gneiting and Ranjan, 2011). Alternative estimators are discussed in Diks et al. (2011) and Lerch et al. (2017).
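The test can be implemented in a few lines. The sketch below (function name ours) uses the variance estimator based on the $(k-1)$-dependence assumption described above and returns the statistic together with a two-sided normal p-value.

```python
import numpy as np
from scipy import stats

def diebold_mariano(scores_f, scores_g, k=1):
    """Diebold-Mariano test of equal predictive performance.

    scores_f, scores_g : sequences S(F_t, y_{t+k}) and S(G_t, y_{t+k}).
    k : lead time; forecast errors are assumed at most (k-1)-dependent.
    Returns (t_n, p); negative t_n favours F, positive t_n favours G.
    """
    d = np.asarray(scores_f, dtype=float) - np.asarray(scores_g, dtype=float)
    n = d.size
    dc = d - d.mean()
    # Sample autocovariances of the score differences up to lag k-1.
    gamma = [dc @ dc / n] + [dc[j:] @ dc[:-j] / n for j in range(1, k)]
    sigma2 = gamma[0] + 2 * sum(gamma[1:])
    t_n = np.sqrt(n) * d.mean() / np.sqrt(sigma2)
    p = 2 * stats.norm.sf(abs(t_n))
    return t_n, p
```

For correlated scores in more general settings, the alternative variance estimators cited above should be substituted for the simple autocovariance sum.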
In the spatial setting, Hering and Genton (2011) propose the spatial prediction comparison test which accounts for spatial correlation in the score values without imposing assumptions on the underlying data or the resulting score differential field. This test is implemented in the R package SpatialVx (Gilleland, 2017). Weighted scoring rules and their connection to hypothesis testing are discussed in Holzmann and Klar (2017).

Understanding model performance
When assessing the performance of an individual model, e.g. to identify weaknesses and test potential improvements, it can be useful to look at tools that do not necessarily follow the principles of propriety described in Section 3. One of the most popular measures used by national weather services is the anomaly correlation coefficient (ACC), a valuable tool for tracking the gain in forecast skill over time (Jolliffe and Stephenson, 2012).
The ACC quantifies the correlation between forecast anomalies and the anomalies of the observation, typically an analysis. Anomalies are defined as the difference between the forecast or analysis and the climatology for a given time and location. Usually, the climatology is based on the model climate, calculated from the range of values predicted by the NWP model over a long time period.
For a deterministic forecast $f_i$, valid at time $i$, with a corresponding analysis $a_i$ and climate statistic $c_i$, there are two equivalent definitions for the anomaly correlation coefficient (e.g. Miyakoda et al., 1972); in its uncentered form,
$$ \mathrm{ACC} = \frac{\sum_i f'_i \, a'_i}{\sqrt{\sum_i (f'_i)^2 \, \sum_i (a'_i)^2}}, $$
where $f'_i = f_i - c_i$ is the forecast anomaly and $a'_i = a_i - c_i$ the anomaly of the analysis. The ACC is a preferred evaluation tool for gridded forecasts and spatial fields, as these are usually compared against an analysis or a similar gridded observation product.
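The uncentered form above translates directly into code. The following numpy sketch (function name ours) computes the ACC over a flattened field of grid points, given matching forecast, analysis and climatology arrays.

```python
import numpy as np

def anomaly_correlation(forecast, analysis, climatology):
    """Uncentered anomaly correlation coefficient over a field.

    All three inputs are arrays of the same shape (e.g. a gridded field).
    """
    # Anomalies relative to the climatological reference.
    fa = np.asarray(forecast) - np.asarray(climatology)
    aa = np.asarray(analysis) - np.asarray(climatology)
    return (fa * aa).sum() / np.sqrt((fa**2).sum() * (aa**2).sum())
```

A perfect forecast yields ACC = 1 and perfectly anti-correlated anomalies yield ACC = -1; as noted below, a high ACC alone says nothing about bias or amplitude errors.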
However, there are certain limits and pitfalls one has to be aware of when using this measure. Being a correlation coefficient, the ACC does not give any information about forecast biases and errors in scale, so that it can overestimate the forecast skill (Murphy and Epstein, 1989). As such, it should always be used in conjunction with an estimate of the actual bias, or applied to previously bias-corrected data.
It has been established empirically that an anomaly correlation of 0.6 corresponds to a limit in usefulness for a medium-range forecast. Murphy and Epstein (1989) warn, however, that this is rather an upper limit of the actual skill and that the ACC should be seen as a measure of potential skill. Naturally, the ACC relies to a large extent on the underlying climatology used to compute the anomalies.
When evaluating forecast skill with proper scores, it is often useful to obtain separate indicators for the degree of calibration and the sharpness of the forecast. The well-known and widely used decomposition of the Brier score by Murphy (1973b) separates the score value into three parts, quantifying reliability, resolution and uncertainty.
Consider a forecast sample of size $N$, where probability forecasts $p_u = 1 - F(u)$ are computed for exceeding a threshold $u$ and binary observations take the form $o = \mathbb{1}\{ y \geq u \}$. The forecasts take $K$ unique values and $n_k$ denotes the number of forecasts within category $k$. Then the Brier score can be written as
$$ \mathrm{BS} = \frac{1}{N} \sum_{k=1}^{K} n_k (p_k - \bar{o}_k)^2 - \frac{1}{N} \sum_{k=1}^{K} n_k (\bar{o}_k - \bar{o})^2 + \bar{o}(1 - \bar{o}), \qquad (37) $$
where $\bar{o}_k$ is the event frequency for each of the forecast values and $\bar{o} = \frac{1}{N} \sum_{i=1}^{N} o_i$ the climatological event frequency, computed from the sample. The first part of the sum in (37) relates to the reliability or calibration, the second, having a negative effect on the total score, to the resolution or sharpness, and the last part is the climatological uncertainty of the event.
This representation of the Brier score relies on the number of discrete forecast values K being relatively small. If p u takes continuous values, care must be taken when binning the forecast into categories, so as not to introduce biases (Bröcker, 2008; Stephenson et al., 2008). Several analogue decompositions have been proposed for other scores, such as the CRPS (Hersbach, 2000), the quantile score (Bentzien and Friederichs, 2014) and the ignorance score (Weijs et al., 2010). Recently, Siegert (2017) formulated a general framework allowing for the decomposition of arbitrary scores.
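For forecasts that take a small number of distinct values, the decomposition can be computed exactly by grouping the sample by forecast value. The numpy sketch below (function name ours) returns the three components; reliability minus resolution plus uncertainty recovers the Brier score.

```python
import numpy as np

def brier_decomposition(p, o):
    """Murphy (1973b) decomposition of the Brier score.

    p : forecast probabilities, assumed to take few distinct values.
    o : binary observations (0/1).
    Returns (reliability, resolution, uncertainty), with
    BS = reliability - resolution + uncertainty.
    """
    p = np.asarray(p, dtype=float)
    o = np.asarray(o, dtype=float)
    n = p.size
    obar = o.mean()                  # climatological event frequency
    rel = res = 0.0
    for pk in np.unique(p):
        mask = p == pk
        nk = mask.sum()              # number of forecasts in this category
        ok = o[mask].mean()          # conditional event frequency
        rel += nk * (pk - ok) ** 2
        res += nk * (ok - obar) ** 2
    unc = obar * (1 - obar)
    return rel / n, res / n, unc
```

For continuous forecast probabilities, the probabilities must first be binned, with the bias caveats discussed above.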
While it is common and advisable to look at a model's performance in certain weather situations or for certain periods of time, it is important to be aware of Simpson's paradox (Simpson, 1951). It describes the phenomenon that an effect appearing in several subsamples may not be found when these subsamples are combined, or that the combined sample may even show the opposite effect.
For example, a forecast model can have superior skill over all four seasons, compared to another model, but still be worse when assessed over the whole year. Hamill and Juras (2006) showed this to be true for a synthetic data set of temperature forecasts on two islands. In this case, the climatologies of the two islands were so different that the values of the performance measures were misleadingly improved. Fricker et al. (2013) found that this spurious skill does not affect proper scores derived from scoring rules, but care should be taken when using scores derived from a contingency table, which are not proper, and skill scores in general.
In general, it is recommended to use statistical significance testing to evaluate potential model improvements. Differences in scores are often very small, and it is hard to judge whether they are caused by genuine improvement or by chaotic error growth. Geer (2016) investigates a version of Student's t-test that is modified for multiple models and accounts for autocorrelation in the scores, and finds that in order to detect an improvement of 0.5%, at least 400 forecast fields on a global grid would be required. This confirms our findings from Section 3.2 that careful consideration of the experiment sample size is essential for generating meaningful and robust results.

Summary
In this chapter, a variety of methods to assess different aspects of forecast goodness were presented and discussed. Calibration errors can be diagnosed with the help of histograms, in both univariate and multivariate settings. It is recommended to use multiple such diagnostics, especially in the multivariate case, as different tools highlight different types of miscalibration.
Scoring rules provide information about the accuracy of a forecast and are valuable tools for comparing forecasting methods. In this context, only proper scores should be used, as they ensure that the forecast based on the best knowledge will receive the best score. There are many such scores available, with the CRPS and the ignorance score being amongst the most popular. However, we found that just looking at the mean of one such score can be misleading, even if the underlying sample seems to be of sufficient size. Therefore it is crucial to also provide information about the error of the mean score, and to base decisions about model preference on the evaluation of multiple scoring rules, if possible. If we do not want to compare models, but rather to understand the behaviour of a model, it can be helpful to use measures that are not necessarily proper. In particular, skill scores and the anomaly correlation coefficient are widely used.
By adding appropriate weight functions to the CRPS and the ignorance score, it is possible to evaluate extreme event forecasts in a proper way. These weight functions can be designed to emphasize, for example, different parts of the climatological distribution. Scores for multivariate quantities not only give information about the calibration and sharpness of the forecast, but also assess the correct representation of the covariance structure between locations, forecast times or variables. However, some of them have limitations and do not work well if the number of dimensions is large.
Given the multitude of available evaluation tools and scores, constantly growing due to new research and applications, it is essential to be aware of their properties and how to choose a suitable measure. To make sure that all aspects of a forecast's performance are addressed, a number of scores should be calculated and a quantification of the associated uncertainty given.
