Rapid adjustment and post‐processing of temperature forecast trajectories

Modern weather forecasts are commonly issued as consistent multi‐day forecast trajectories with a time resolution of 1–3 hours. Prior to issuing, statistical post‐processing is routinely used to correct systematic errors and misrepresentations of the forecast uncertainty. However, once the forecast has been issued, it is rarely updated before it is replaced in the next forecast cycle of the numerical weather prediction (NWP) model. This paper shows that the error correlation structure within the forecast trajectory can be utilized to substantially improve the forecast between the NWP forecast cycles by applying additional post‐processing steps each time new observations become available. The proposed rapid adjustment is applied to temperature forecast trajectories from the UK Met Office's convective‐scale ensemble MOGREPS‐UK. MOGREPS‐UK is run four times daily and produces hourly forecasts for up to 36 hours ahead. Our results indicate that the rapidly adjusted forecast from the previous NWP forecast cycle outperforms the new forecast for the first few hours of the next cycle, or until the new forecast itself can be rapidly adjusted, suggesting a new strategy for updating the forecast cycle.


INTRODUCTION
Weather forecasts resulting from numerical weather prediction (NWP) models are traditionally post-processed using statistical approaches in order to correct potential systematic biases in the forecasts (Glahn and Lowry, 1972). Roughly 15 years ago, the first papers on statistical post-processing methods yielding full predictive distributions, correcting both systematic biases and assessments of forecast uncertainty, appeared in the literature (Raftery et al., 2005). Since then, approaches of this type have become increasingly common in both the literature and operational forecasting for NWP forecasts and forecast ensembles (Vannitsem et al., 2018). Originally, the methods applied to marginal predictive distributions of individual weather variables at individual locations (Raftery et al., 2005). More recent work has produced consistent probabilistic predictions for temporal trajectories (Hemri et al., 2015), spatial forecast fields (Berrocal et al., 2008; Feldmann et al., 2015) and multiple variables (Schuhen et al., 2012; Möller et al., 2013; Sloughter et al., 2013). Vannitsem et al. (2018) give a recent overview of statistical post-processing methods for ensemble forecasts.

FIGURE 1 Diagram of a typical forecast cycle for hourly forecasts issued every 6 hr. The MOGREPS-UK version used in this paper is configured in this way.

The aim of probabilistic forecasting is to "maximize the sharpness of the predictive distribution subject to calibration". Here, calibration, or reliability, refers to the statistical consistency between the forecast and the observation; a forecast is (probabilistically) calibrated if events predicted to have probability P are realized with the same relative frequency in the observations.
A calibrated forecast should then provide as much information regarding future weather as possible; the smaller the forecast uncertainty, or the higher the sharpness of the predictive distribution, the more information regarding future weather is contained in the forecast. In practice, the NWP model outputs a forecast trajectory for multiple lead times. As soon as the model output is available, the forecasts of the entire trajectory are post-processed using the most recent available pairs of previous forecasts and verifying observations to obtain calibrated and sharp forecasts for all lead times. A new, post-processed forecast is then issued for all future time points corresponding to the lead times of the original NWP forecast. An example of such a setting is shown in Figure 1 for an hourly forecast where a new forecast is issued every 6 hr.
In the standard setting demonstrated in Figure 1, the published forecast is not updated until it is replaced in the next forecast cycle of the NWP model. However, new information in the form of new observations becomes available every hour. In the current paper, we propose an approach for Rapid Adjustment of Forecast Trajectories (RAFT), where, in addition to standard post-processing, we regularly update the forecast every time a new piece of information becomes available by utilizing the correlation of the forecast errors within an NWP forecast trajectory. The idea behind RAFT is related to that of data assimilation, for example Mitchell and Houtekamer (2000) who developed a method to account for model error in the context of an ensemble Kalman filter technique. Here, our main priority is computational efficiency to minimize the time needed for each adjustment. We thus propose an efficient adjustment approach that is adapted to each forecast cycle, hour and lead time separately. In a case-study, we apply the method to hourly temperature forecasts from the MOGREPS-UK ensemble from the UK Met Office whose schedule follows the forecast cycle shown in Figure 1.
The remainder of the paper is organized as follows. In Section 2, we introduce the MOGREPS-UK (Met Office Global and Regional Ensemble Prediction System) forecast ensemble and the corresponding observations, and review the classical Ensemble Model Output Statistics (EMOS) post-processing method as well as the validation metrics used in our study. We further show the skill of the post-processed EMOS forecasts. In Section 3, we introduce our proposed method for RAFT. Results at Heathrow Airport as well as those over the entire study region are presented in Section 4. Finally, the paper concludes with a summary and discussion in Section 5.

DATA AND CONVENTIONAL POST-PROCESSING

MOGREPS-UK
Our dataset consists of surface temperature forecasts and observations for 150 locations in the UK and the Republic of Ireland. The forecasts are provided by the UK Met Office's convective-scale ensemble MOGREPS-UK (Hagelin et al., 2017), which has been running operationally since July 2012. The dataset covers a period of 30 months between January 2014 and June 2016, during which the ensemble had a horizontal resolution of 2.2 km and produced hourly forecasts for up to 36 hr. During this time, MOGREPS-UK was run four times daily, at 0300, 0900, 1500 and 2100 UTC. The initial and boundary conditions were originally provided by the global MOGREPS-G ensemble, but since March 2016 the initial conditions have been created by adding the MOGREPS-G perturbations to the analysis of the high-resolution deterministic UK variable-resolution (UKV) model, while the boundary data continue to be provided by MOGREPS-G. The ensemble consists of one control forecast and eleven perturbed members, which we treat as twelve exchangeable ensemble members.
In this study, we consider site-specific data only, interpolated by the Met Office from model grid to observation locations. During this process, forecasts are corrected for local effects and the height differences between station and model orography. The observations are extracted from SYNOP messages at the 150 locations in Figure 2 and Met Office quality controls have been applied. We separate the data into a training set (January to December 2014) with approximately 1,300 forecast trajectories for each location, or a total of 7,018,719 forecast-observation pairs, and a test set (January 2015 to June 2016) with approximately 2,096 forecast trajectories for each location, or a total of 11,320,762 forecast-observation pairs. Although there have been several operational changes to the MOGREPS-UK model during these periods, we treat the dataset as homogeneous over the entire study period.

FIGURE 2 Map of the 150 observation locations in the UK and the Republic of Ireland used in this study. The sites are divided into three categories: coastal (circles), inland (triangles) and mountain (squares) sites. The black triangle marks Heathrow Airport.

Ensemble model output statistics
For all their benefits, weather forecast ensembles are usually too confident and produce underdispersed forecasts (Hamill, 2001). This means that the ensemble spread does not cover all sources of uncertainty in a given weather situation and is therefore on average too narrow. Like all weather prediction models, ensembles are also subject to a deterministic bias, depending on the model's skill in varying weather situations. To correct for the bias and the underdispersion, we first apply statistical post-processing to the raw ensemble forecasts before using the new RAFT error correction method. EMOS, sometimes called non-homogeneous Gaussian regression, has successfully been applied to multiple forecast models (e.g., Kann et al., 2009; Scheuerer and Büermann, 2014; Feldmann et al., 2015) and is a suitable method to calibrate MOGREPS-UK forecasts.
We denote a future temperature observation for a specific location and time by $Y$ and the corresponding ensemble forecast members by $X_1, \ldots, X_{12}$. The EMOS predictive distribution of $Y$ conditional on $X_1, \ldots, X_{12}$ is then defined as a Gaussian distribution:

$$Y \mid X_1, \ldots, X_{12} \sim \mathcal{N}(\mu, \sigma^2). \tag{1}$$

The moments of this distribution are modelled using the ensemble forecast's statistics; the predictive mean is a linear function of the ensemble mean $\bar{X} = \frac{1}{m} \sum_{i=1}^{m} X_i$,

$$\mu = a + b^2 \bar{X}, \tag{2}$$

and the predictive variance is a linear function of the ensemble variance $S^2$,

$$\sigma^2 = c^2 + d^2 S^2. \tag{3}$$

Here, $m = 12$ is the number of ensemble members and the coefficients $a$, $b$, $c$ and $d$ are real numbers. For estimating $a$, $b$, $c$ and $d$, we use minimum score estimation (Dawid et al., 2016) and optimize the continuous ranked probability score (CRPS; Matheson and Winkler, 1976) based on training data as suggested by Gneiting et al. (2005). Gebetsberger et al. (2018) give a comprehensive comparison of minimum CRPS and maximum likelihood estimation. The parameters in Equation (3) are squared to ensure that the predictive variance is non-negative. In Equation (2), $b$ is constrained in the same way, making it easier to interpret.
All runs of the NWP model and all forecast lead times are calibrated separately using a rolling training period of 40 days. This means that for each run and each lead time, we collect all forecast-observation pairs from the last 40 days, where the forecasts were initialized at the same time of day and are valid for the same lead time. These data comprise the basis for the estimation of the EMOS coefficients. The current ensemble forecasts are then plugged into Equations (2) and (3) to obtain the full EMOS predictive distribution $\mathcal{N}(\mu, \sigma^2)$. We follow the local EMOS approach, in that all stations are treated on an individual basis. This accounts for local effects and turns out to produce much better results than a regional approach, where data from different sites are pooled together. In order to have a full set of training data for the first model runs in 2014, some dates from the end of 2013 are used.
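The plug-in step can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual EMOS parameterization in which the predictive mean is a + b² times the ensemble mean and the predictive variance is c² + d² times the ensemble variance. The function name, the coefficient values, the example ensemble and the use of the sample variance (divisor m − 1) are assumptions made for illustration only; in practice the coefficients would come from the minimum CRPS fit over the 40-day training window.

```python
from statistics import NormalDist, fmean, variance


def emos_distribution(ensemble, a, b, c, d):
    """Return the Gaussian EMOS predictive distribution for one forecast case.

    Assumed parameterization: mean = a + b^2 * ensemble mean,
    variance = c^2 + d^2 * ensemble variance.
    """
    ens_mean = fmean(ensemble)
    ens_var = variance(ensemble)  # sample variance, divisor m - 1 (assumption)
    mu = a + b**2 * ens_mean
    sigma2 = c**2 + d**2 * ens_var
    return NormalDist(mu, sigma2**0.5)


# Hypothetical 12-member temperature ensemble (degrees Celsius) and coefficients.
members = [4.1, 4.3, 3.9, 4.0, 4.6, 4.2, 3.8, 4.4, 4.1, 4.0, 4.5, 4.3]
dist = emos_distribution(members, a=0.2, b=1.0, c=0.8, d=1.0)
print(round(dist.mean, 3), round(dist.stdev, 3))
```

Squaring b, c and d inside the function mirrors the constraint described above, so any real-valued coefficients yield a valid, non-negative variance.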

Verification methods and EMOS forecast skill
To evaluate the effectiveness of the EMOS method, we compare the predictive skill of the post-processed forecasts to the raw MOGREPS-UK ensemble. The tools used here, as well as for evaluating the RAFT forecasts in Section 4, are the root-mean-square error (RMSE), the CRPS and the rank and probability integral transform (PIT) histograms. Both the RMSE and the CRPS are proper scoring rules; they measure the skill of a forecast by assigning a numerical penalty depending on how well the forecasts match the observations. It is essential that they are proper, as this guarantees that the best forecast model will receive the best score and prohibits hedging. While the RMSE assesses the deterministic forecast accuracy of the mean of a predictive probability distribution $F$, the CRPS evaluates the probabilistic skill of the whole distribution, which can also be represented by a discrete ensemble. The RMSE is defined as the square root of the average squared distance between the mean forecasts $\mu_i$ and the observations $y_i$:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\mu_i - y_i)^2}, \tag{4}$$

where $n$ is the number of data points or forecast cases. In its general form, the CRPS can be expressed as the integrated squared difference between a forecast cumulative distribution function (CDF) $F$ and the empirical CDF of the observation $y$ or, equivalently, in terms of two expected values (Thorarinsdottir and Schuhen, 2018):

$$\mathrm{CRPS}(F, y) = \int_{-\infty}^{\infty} \left( F(x) - \mathbb{1}\{x \geq y\} \right)^2 \mathrm{d}x = \mathbb{E}_F |X - y| - \frac{1}{2} \mathbb{E}_F |X - X'|, \tag{5}$$

where $\mathbb{E}_F$ denotes the expected value with respect to $F$ and $X$, $X'$ are independent random variables with distribution $F$. Here, we use the closed form for a Gaussian distribution (Equation (6)) to evaluate the EMOS forecasts and an approximation for the MOGREPS-UK forecasts, where the distribution is given by an ensemble (Equation (7)):

$$\mathrm{CRPS}(\mathcal{N}(\mu, \sigma^2), y) = \sigma \left[ z \left( 2\Phi(z) - 1 \right) + 2\varphi(z) - \frac{1}{\sqrt{\pi}} \right], \quad z = \frac{y - \mu}{\sigma}, \tag{6}$$

$$\mathrm{CRPS}_{\mathrm{ENS}}(X_1, \ldots, X_m, y) = \frac{1}{m} \sum_{i=1}^{m} |X_i - y| - \frac{1}{2m^2} \sum_{i=1}^{m} \sum_{j=1}^{m} |X_i - X_j|. \tag{7}$$

The functions $\Phi(\cdot)$ and $\varphi(\cdot)$ in Equation (6) indicate the CDF and the probability density function (PDF) of a standard Gaussian distribution, respectively. As noted by Ferro et al. (2008), the size of the ensemble may influence $\mathrm{CRPS}_{\mathrm{ENS}}$ in Equation (7), in that larger ensembles are likely to obtain a better score.

Gneiting et al. (2005) give a derivation of the result in Equation (6) and Grimit et al. (2006) a derivation of the result in Equation (7). Table 1 summarizes both scores for the EMOS and the raw MOGREPS-UK forecasts. The margin of error based on a 95% bootstrap interval is less than 0.002. We divide the forecast lead times into three categories, early (1 to 12 hr), mid-range (13 to 24 hr) and later lead times (25 to 36 hr) and average the scores over each of the categories. As can be expected, the scores deteriorate with increasing lead time, for both EMOS and raw ensemble forecasts. By applying the EMOS post-processing technique, the probabilistic forecast skill is improved by around 20% and the deterministic skill of the mean forecast by around 10%.
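For readers who want to reproduce the scores, the following Python sketch implements the RMSE, the standard closed-form CRPS of a Gaussian forecast and the standard ensemble approximation of the CRPS (mean absolute error of the members minus half their mean pairwise absolute difference). The function names are illustrative and the code is not taken from the study.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard Gaussian, provides cdf() and pdf()


def rmse(mean_forecasts, observations):
    """Root-mean-square error of mean forecasts against observations."""
    n = len(observations)
    total = sum((m - y) ** 2 for m, y in zip(mean_forecasts, observations))
    return math.sqrt(total / n)


def crps_gaussian(mu, sigma, y):
    """Closed-form CRPS of a Gaussian N(mu, sigma^2) forecast for observation y."""
    z = (y - mu) / sigma
    return sigma * (z * (2 * _STD_NORMAL.cdf(z) - 1)
                    + 2 * _STD_NORMAL.pdf(z)
                    - 1 / math.sqrt(math.pi))


def crps_ensemble(members, y):
    """Ensemble CRPS: mean |X_i - y| minus half the mean pairwise |X_i - X_j|."""
    m = len(members)
    term1 = sum(abs(x - y) for x in members) / m
    term2 = sum(abs(xi - xj) for xi in members for xj in members) / (2 * m * m)
    return term1 - term2


print(rmse([1.0, 2.0], [1.0, 4.0]))
print(crps_gaussian(0.0, 1.0, 0.0))
print(crps_ensemble([0.0, 2.0], 1.0))
```

Both CRPS variants are negatively oriented and reduce to the absolute error when the forecast collapses to a single value, which makes them directly comparable with deterministic errors.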
To assess the calibration of the probabilistic forecasts, we use the PIT histogram (Thorarinsdottir and Schuhen, 2018). For a perfectly calibrated forecast, the PIT values, computed by evaluating the forecast CDFs at the observations, should form a flat histogram. The equivalent method for discrete ensemble forecasts is the verification rank histogram (Anderson, 1996; Hamill and Colucci, 1997; Talagrand et al., 1997), which measures the distribution of the observation rank in the set of ensemble forecasts. Both histograms are interpreted in the same way.
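As an illustration of the two diagnostics, the sketch below computes a PIT value for a Gaussian forecast, the verification rank of an observation within an ensemble, and a simple histogram of PIT values. The helper names are hypothetical, ties between the observation and ensemble members are ignored for brevity, and the synthetic data at the end are generated to be calibrated by construction.

```python
import random
from statistics import NormalDist


def pit_value(mu, sigma, y):
    """PIT value: the Gaussian predictive CDF evaluated at the observation."""
    return NormalDist(mu, sigma).cdf(y)


def verification_rank(members, y):
    """Rank of the observation within the ensemble, from 1 to m + 1 (ties ignored)."""
    return 1 + sum(1 for x in members if x < y)


def histogram(values, n_bins):
    """Count values in [0, 1] in n_bins equally wide bins."""
    counts = [0] * n_bins
    for v in values:
        counts[min(int(v * n_bins), n_bins - 1)] += 1
    return counts


# Synthetic check: observations drawn from the forecast distribution itself
# should give a roughly flat PIT histogram.
rng = random.Random(1)
pits = [pit_value(0.0, 1.0, rng.gauss(0.0, 1.0)) for _ in range(2000)]
print(histogram(pits, 10))
```

A U-shaped histogram would indicate underdispersion, a hump overdispersion, and a slope a systematic bias.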
In Figure 3, the PIT histograms for the EMOS forecasts are shown. Overall, they seem reasonably flat, although small miscalibrations remain; in particular, too many observations land in the lower tail of the predictive distribution. There is almost no difference in the degree of calibration for the different lead time categories. These results indicate that a major jump in forecast skill can be achieved by applying EMOS to the raw ensemble. In a next step, the forecast trajectories provided by the EMOS mean are successively updated using the RAFT technique. Therefore, EMOS forms a baseline against which all further error reduction is measured.

FIGURE 3 PIT histograms for the EMOS forecasts. Forecast cases are aggregated over all sites and forecast runs in the test set for (a) early, (b) mid-range and (c) later lead times.

RAPID ADJUSTMENT OF FORECAST TRAJECTORIES
The new RAFT technique is applied directly to the mean of the EMOS forecast distribution, in order to increase the deterministic skill of the EMOS forecasts even further when new information becomes available. This in turn also leads to a reduction in the CRPS (Equation (5)). More specifically, the goal of the new RAFT method is to adjust and improve forecast trajectories over time by using the part of the trajectory that has already verified, in conjunction with the matching observations. First we need to establish the relationship between forecast errors at different lead times. The forecast error $e_{t,l}$ is here defined as the difference between the observation $y_{t+l}$ and the EMOS mean forecast $\mu_{t,l}$, where the forecast is initialized at time $t$ and valid at lead time $l$:

$$e_{t,l} = y_{t+l} - \mu_{t,l}. \tag{8}$$

Figure 4a shows the Pearson correlation coefficient matrix of the forecast errors at Heathrow Airport (marked with a black triangle in Figure 2) for the 0300 UTC model run. To create the plot, the error correlations for all possible pairs of lead times were computed over the training set, as well as the corresponding p-values. Only correlations that are statistically significant at the 90% level are shown. The correlation between lead times 1 and 36 is slightly negative and significant, but is left out for clarity and ultimately has no relevance for this study.
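A correlation matrix of this kind can be computed as in the following sketch, which takes a set of forecast-error trajectories (one list of errors per model run, indexed by lead time) and returns the Pearson correlation for every pair of lead times. The function names and the toy data are assumptions for illustration; the significance test behind the p-values is omitted here.

```python
import math


def forecast_errors(observations, emos_means):
    """Errors e = y - mu for one trajectory (lists indexed by lead time)."""
    return [y - mu for y, mu in zip(observations, emos_means)]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def error_correlation_matrix(error_trajectories):
    """Correlation of forecast errors between every pair of lead times."""
    n_leads = len(error_trajectories[0])
    by_lead = [[traj[l] for traj in error_trajectories] for l in range(n_leads)]
    return [[pearson(by_lead[i], by_lead[j]) for j in range(n_leads)]
            for i in range(n_leads)]


# Toy example: three runs, three lead times (hypothetical error values).
runs = [[1.0, 2.0, 0.0], [2.0, 4.0, 1.0], [3.0, 6.0, 0.0]]
corr = error_correlation_matrix(runs)
print([round(c, 3) for c in corr[0]])
```

In the setting described here, one such matrix would be computed per site and per initialization time, with 36 lead times per trajectory.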
In all instances, there is a positive correlation between the errors at a certain lead time and its immediate neighbours. This means that the errors at two lead times, if close enough, are so strongly connected that we can make inference about the forecast skill at a future lead time by observing the error at the earlier lead time. Formally, there is a period preceding each forecast $\mu_{t,l}$, during which the recently measured forecast error $e_{t,l^*}$, with $l^* < l$, provides useful information for a forecast adjustment at time $t + l$ and thus can reduce the subsequent error $e_{t,l}$.
The size of these temporal neighbourhoods varies greatly with the time of day. At lead times 8 to 11, corresponding to midday, the relationship between the forecasts is weakest with only 4 to 5 hr of significant correlation, while the largest predictability of 15 to even 27 hr can be found at lead times 28 to 31, in the early morning. In the MOGREPS-UK setting, this makes the RAFT method work on a rather short time-scale, adjusting forecasts sometimes at only a couple of hours in advance. However, RAFT adapts to the scale and context of the application; for example, for daily weather forecasts, the potential time range of adjustment increases to a few days.
Based on the correlation structure in Figure 4a, we can now define the RAFT model, establishing the relationship between forecast errors at two different lead times by linear regression. The estimated future error $\hat{e}_l$ at lead time $l = 1, \ldots, 36$ is written as a linear function of the observed error at earlier lead times $l^*$:

$$\hat{e}_l = \hat{\alpha} + \hat{\beta}\, e_{l^*} + \varepsilon. \tag{9}$$

The error term $\varepsilon$ is normally distributed with mean zero and both coefficient estimates $\hat{\alpha}$ and $\hat{\beta}$ are determined by the least squares approach based on the training dataset. All lead time combinations, sites and the four NWP model initialization times are treated separately. We omit the index $t$ for the model run from Equation (9) for simplicity. Regression equations using multiple lead times as predictors were also investigated, but did not yield any improvement, as the newest observation always contains the most useful information.

Not all of the possible lead time combinations produce valid and useful results. As seen in Figure 4a, the correlation between lead times, and therefore predictability, is irregular and depends on various factors. Consequently, we define for each lead time $l$ an adjustment period of length $p$, consisting of the preceding lead times for which there is a strong enough correlation to affect the forecast skill. Starting at $l - p + 1$, the forecast for lead time $l$ is repeatedly adjusted in hourly steps, each time using the most recent available forecast error information. Here, we allow for a processing time of one hour after an observation has been recorded, which means that the final adjustment for a forecast valid at lead time $l$ is made at $l - 1$, using the error at $l - 2$.
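The regression between the errors at two lead times can be sketched as an ordinary least-squares fit with one predictor. In this illustration, `alpha` and `beta` are assumed names for the intercept and slope, and the training errors are made-up numbers; in the setting described above, one such fit would be made for every combination of site, initialization time, target lead time and earlier lead time.

```python
def fit_raft_regression(earlier_errors, later_errors):
    """Least-squares fit of later_error = alpha + beta * earlier_error."""
    n = len(earlier_errors)
    mx = sum(earlier_errors) / n
    my = sum(later_errors) / n
    sxx = sum((x - mx) ** 2 for x in earlier_errors)
    sxy = sum((x - mx) * (y - my)
              for x, y in zip(earlier_errors, later_errors))
    beta = sxy / sxx          # slope: how strongly the earlier error carries over
    alpha = my - beta * mx    # intercept: remaining systematic offset
    return alpha, beta


# Hypothetical training errors (degrees Celsius) at lead times l* and l.
errors_earlier = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
errors_later = [-0.7, -0.2, 0.1, 0.3, 0.9, 1.1]
alpha, beta = fit_raft_regression(errors_earlier, errors_later)
print(round(alpha, 3), round(beta, 3))
```

Because each fit involves a single predictor, the coefficients follow from a few sums, which is in keeping with the emphasis on computational efficiency.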
To establish the length of the adjustment period for each location and lead time numerically, we need an algorithm that ensures that any adjustments are not based on random effects, but genuine additional error information about future lead times. Therefore we look at the slope coefficient $\hat{\beta}$ in the RAFT model (Equation (9)) and determine for which lead time combinations the estimate is significantly different from zero. This corresponds to a large enough error correlation between the two lead times at hand to justify a RAFT adjustment. As to the level of significance for $\hat{\beta}$, we want to be a little lenient if the temporal difference between lead times is small, starting with a level of 90%, and become stricter with increasing distance to the predicted lead time, ending with 99%. Experiments have shown that legitimate connections at small lead time differences can be missed if the required level is set too high and spurious correlations at far apart lead times can lead to excessively long adjustment periods without real improvement if the level is set too low.
With our forecast trajectories spanning 36 hr, we need to account for the fact that multiple lead times correspond to the same time of day. As we treat each of the four forecast runs individually, there is for every lead time combination separated by more than 24 hr a corresponding combination from the run initialized one day earlier with a time difference of less than 24 hr, which will on average provide a more skilful forecast. Therefore, the maximum length of the adjustment period is 22 hr, with one hour allowed for processing the observations.
The algorithm for obtaining the optimal length of the adjustment period is then defined as follows:

1. Run the linear regression in Equation (9) using all lead times $l^* \in [l - 23, l - 2]$ as predictors. For negative lead times, add 24 hours, so that lead time 23 is followed by lead time 0, 1, etc.

2. (a) Working backwards, find the first instance of $l^*$ in $[l - 11, l - 2]$ where the regression slope coefficient $\hat{\beta}$ is not significantly different from zero at the 90% level. If a result can be found, we denote it by $l_p$.
   (b) If such an $l_p$ cannot be found, find the first instance of $l^*$ in $[l - 19, l - 12]$ where $\hat{\beta}$ is not significantly different from zero at the 95% level. If a result can be found, we denote it by $l_p$.
   (c) If such an $l_p$ cannot be found, find the first instance of $l^*$ in $[l - 23, l - 20]$ where $\hat{\beta}$ is not significantly different from zero at the 99% level.

3. Set $p = l - l_p$. If no value for $l_p$ is found after Step 2, then $p$ is the average of the adjustment period lengths of the neighbouring lead times $l - 1$ and $l + 1$. In case this does not produce a valid number, $p$ is set to 22, the maximum possible length for the adjustment period.
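The selection logic in Steps 2 and 3 can be sketched as follows. This is an illustrative reading of the algorithm, not the authors' code: it assumes that the p-values of the regression slopes have already been computed and are supplied per offset k = l − l* (from 2 to 23), and that the neighbour average in Step 3 is rounded to a whole number of hours. All names are hypothetical.

```python
# Offsets k = l - l* searched in Steps 2(a)-(c), each paired with the
# threshold (1 - confidence level) applied to the slope's p-value.
SEARCH_STAGES = [
    (range(2, 12), 0.10),   # l* in [l - 11, l - 2], 90% level
    (range(12, 20), 0.05),  # l* in [l - 19, l - 12], 95% level
    (range(20, 24), 0.01),  # l* in [l - 23, l - 20], 99% level
]
MAX_PERIOD = 22


def find_period(p_values):
    """Step 2 for one lead time; p_values maps offset k (2..23) to a p-value.

    Returns p = l - l_p, or None if every tested slope is significant.
    """
    for offsets, threshold in SEARCH_STAGES:
        for k in offsets:  # working backwards in time from l - 2
            if p_values[k] > threshold:  # slope not significantly nonzero
                return k
    return None


def fill_missing(periods):
    """Step 3 fallback: average the neighbours' periods, else use the maximum."""
    filled = list(periods)
    for i, p in enumerate(periods):
        if p is None:
            left = periods[i - 1] if i > 0 else None
            right = periods[i + 1] if i + 1 < len(periods) else None
            if left is not None and right is not None:
                filled[i] = round((left + right) / 2)  # rounding is an assumption
            else:
                filled[i] = MAX_PERIOD
    return filled


example = {k: 0.001 for k in range(2, 24)}
example[7] = 0.40
print(find_period(example))  # 7
print(fill_missing([5, None, 7, None]))  # [5, 6, 7, 22]
```

Keeping the search stages in a small table makes it easy to experiment with other significance levels or offset ranges.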
This somewhat arbitrary algorithm was designed so that it works well for a multitude of sites in our dataset with very different correlation patterns. It can be replaced by any other method for identifying a suitable adjustment period. Figure 4b shows the adjustment periods for the 0300 UTC run at Heathrow Airport produced by the algorithm above. It is clear that there is a strong connection between the correlation pattern in Figure 4a and the adjustment period length, in that large p correspond to longer periods of predictability. Note that for a stable estimation, the algorithm is applied only once to the entire training set (data from January to December 2014) with the obtained parameter estimates used for all data in the test set, as opposed to the rolling training period approach used for the EMOS post-processing.
The adjustment period refers to the time points when the observations used in the adjustment are recorded, and not the time points when the adjustments are carried out. As we allow an extra hour for the processing of the observations, the actual correction is made one hour after the observation time, starting at l − p + 1. For example, we see from Figure 4b that the ideal length of the adjustment period for lead time l = 25 here is p = 9. This means that the first correction to a forecast valid at t + 25 is made at t + l − p + 1 = t + 17 using the observation collected 1 hr earlier, at t + 16. From there on, an adjustment takes place every hour, each time using the newest error information available at that moment, until the time t + 24, where we adjust the forecast for a final time based on the error measured at t + 23. Clearly this last observation gives us the most accurate information about the expected forecast error, as it is closest in time to the forecast. This means that we get the most gain in forecast skill if RAFT is applied in the very short term.
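The timing in this example can be written out with a small helper. The function below is a sketch with a hypothetical name; it lists, for a given lead time and adjustment period length, the hour at which each correction is made together with the hour of the observation it uses, both counted in hours after the initialization time t.

```python
def adjustment_schedule(lead, period):
    """List (adjustment hour, observation hour) pairs, in hours after t."""
    first = lead - period + 1  # first correction, one hour after the first observation
    return [(hour, hour - 1) for hour in range(first, lead)]


schedule = adjustment_schedule(lead=25, period=9)
print(schedule[0], schedule[-1])  # (17, 16) (24, 23)
```

For the lead time 25 example, this gives hourly corrections from t + 17 to t + 24, each based on the observation recorded one hour earlier.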
Obviously, there is a gap during the first 2 hr of the forecast trajectory, where no forecast data from the current run are available to adjust the forecasts at t + 1 and t + 2. In this case, we instead use forecasts from the run that was initialized 24 hr earlier which are valid at the same time as the missing forecasts. Of course this does not lead to the same kind of improvement in forecast skill, as the current forecast run might exhibit a very different error characteristic from the one from 24 hr ago.
To obtain the size and direction of the forecast adjustment for a certain forecast run $t$ and lead time $l$, we first calculate the observed error $e_{t,l-k}$ at lead time $l - k$ according to Equation (8), where $k \leq p$ and the time $l - k$ thus lies within the adjustment period. Then we plug the observed error into the regression equation for the predicted error $\hat{e}_{t,l}$ at the future time point $t + l$:

$$\hat{e}_{t,l} = \hat{\alpha} + \hat{\beta}\, e_{t,l-k}. \tag{10}$$

The regression coefficient estimates $\hat{\alpha}$ and $\hat{\beta}$ are unique for each lead time combination, forecast initialization time and location, and were calculated in the first step of the algorithm to find the optimal adjustment period. Once we have established the predicted error in this way, we add it to the EMOS mean forecast $\mu_{t,l}$ and obtain the adjusted RAFT forecast $\hat{\mu}_{t,l}$:

$$\hat{\mu}_{t,l} = \mu_{t,l} + \hat{e}_{t,l}. \tag{11}$$
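A single adjustment step then amounts to two lines of arithmetic, as the following sketch shows. The function and argument names are hypothetical, `alpha` and `beta` stand for the fitted intercept and slope of the regression for this lead time pair, and the numbers are invented for illustration.

```python
def raft_adjust(obs_earlier, emos_mean_earlier, emos_mean_target, alpha, beta):
    """Return (predicted error, adjusted forecast) for one target lead time."""
    observed_error = obs_earlier - emos_mean_earlier  # observation minus forecast
    predicted_error = alpha + beta * observed_error   # regression prediction
    return predicted_error, emos_mean_target + predicted_error


# Hypothetical case: the forecast was 1.0 degree too cold two hours ago.
pred_err, adjusted = raft_adjust(
    obs_earlier=6.0, emos_mean_earlier=5.0, emos_mean_target=4.0,
    alpha=0.1, beta=0.6,
)
print(round(pred_err, 2), round(adjusted, 2))
```

With a positive slope, an observed underprediction shifts the later forecast upwards by a fraction of the observed error, so the correction shrinks as the error correlation weakens.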
The resulting adjusted mean forecast is generated from data that have passed through multiple levels of post-processing. First, while applying EMOS, the performance of the raw ensemble over the past 40 days is analyzed and the results are used to improve the deterministic and probabilistic forecast skill. This post-processing method uses forecasts and observations from a rolling training period and is carried out right after the NWP model run has finished and before the forecast is issued. When the first forecast from the trajectory verifies 2 hr later, we make the first RAFT adjustment and continue in the same manner in hourly intervals (Figure 5). The level of RAFT error correction only relies on the performance of the EMOS forecast mean during the current forecast run, using very short-term information not available when the NWP model was initialized and when EMOS was applied. The combined EMOS/RAFT predictive distribution consisting of the RAFT forecast as mean and the EMOS variance can produce a more accurate forecast than both the raw ensemble and the unadjusted EMOS forecast, while remaining calibrated.

RESULTS
In the previous section, we described how the RAFT method can be combined with post-processing methods like EMOS to provide an additional short-term error correction. We now show comprehensive results, first for Heathrow Airport and then for all sites in the dataset.

FIGURE 5 Diagram of a forecast cycle for an hourly forecast issued every 6 hr with rapid adjustment of the forecast trajectory (RAFT) applied as new observations become available. Forecasts in grey are only used as predictors by means of their observed error and are not adjusted themselves.

Results for Heathrow Airport
Heathrow is the busiest airport in the UK, so accurate weather forecasts for this site are of major importance, especially for the very short term (e.g., Ghirardelli and Glahn, 2010). Therefore, we investigate the impact of RAFT on forecast quality at this site separately. From Figure 4, we know how the relationship between forecast errors at different lead times can be used to define the RAFT regression model and corresponding adjustment periods. This analysis is done only once and the parameters are then valid until there are significant changes in the forecast models or the local error characteristics.
In the following example, we illustrate how RAFT works in a real-time setting. Figure 6 is a snapshot, taken at 2300 UTC on 14 January 2016 at Heathrow Airport. The light grey dashed line depicts a forecast trajectory, issued at 0300 UTC the same day and post-processed using EMOS as described in Section 2.2. Over time, temperature values (represented by the black solid line) are observed for the 36 lead times of the trajectory. However, at the time of the snapshot, they are only available up to 1 hr before. The dot-dashed dark grey line is the RAFT forecast and consists of two parts. The trajectory left of the black vertical line is a combination of the most recent RAFT forecasts at each lead time, i.e., the forecast issued 1 hr earlier, using the error information from 2 hr before the valid time. These are the optimal RAFT forecasts, as they contain the most information and are very short-term.
The right side of the black vertical line is the current RAFT trajectory, showing the best possible forecast we can make with the information we have at this point in time. Depending on the length of the adjustment period, the forecasts from here to the end of the original forecast trajectory are adjusted using the most recent error information. For example, the forecast at t + 28 is being adjusted, while the forecast at t + 33 is not. For the first 12 hr, the uncorrected trajectory agrees well with the observations and only small corrections are made. Between lead times 15 and 30, corresponding to evening and night-time, the EMOS forecast underpredicts the temperature. As soon as larger errors are observed, the RAFT adjustment to the original forecast also becomes larger and after a short time manages to counter the underprediction. This example illustrates how RAFT is able to quickly correct forecast errors a few hours ahead, whereas the unadjusted forecast would continue to underpredict the temperature for a further 15 hr.
To evaluate the performance of RAFT over the entire test period, we look at the root-mean-square error of the RAFT-adjusted forecasts and compare to the unadjusted EMOS mean forecasts. Figure 7 shows the RMSE at Heathrow, averaged over all cases in the test period where the NWP model was initialized at 0300 UTC. In both plots, the solid line is identical and represents the performance of the EMOS-post-processed forecast trajectories, and the dashed line is the RMSE of the RAFT forecasts. The difference between the plots lies in the fact that they are snapshots taken at different points in the forecast cycle. Figure 7a depicts the level of forecast skill if we stopped applying RAFT after lead time t + 15. This would mean that all forecasts to the left of the vertical line have been adjusted according to the forecast error measured 2 hr earlier. As the most recent observed error is registered at t + 14, all forecasts to the right of the vertical line are adjusted using this error information (depending on the length of the respective adjustment periods). This means that on the left side, the difference between the two curves is the maximum improvement obtainable by applying RAFT.
For the first few hours, there is only very little improvement, as we do not yet have any information about the current run's forecast error, and we have to rely on the information from the run started 24 hr earlier. However, as soon as the new error information is available, RAFT shows a considerable reduction in forecast error, even up to 20%. On the right side, the largest benefit can be seen in the next few hours, as the correlation is strongest between close lead times. After about 5 hr, RAFT falls back to the skill level of the EMOS forecasts. Interestingly, for the period between t + 28 and t + 32, there appears to be a significant correlation to the error at t + 14. Thus we see a small error reduction 14 to 18 hr ahead.
In Figure 7b, a different snapshot is shown. Now we apply RAFT to the full forecast cycle, that is, we let it run [...] perform worse due to more time having passed since the model initialization. With RAFT, however, this forecast was adjusted with a very recently measured observation error, whereas the t + 2 forecast could only be adjusted using the data from the model run initialized 24 hr prior. As a result, the t + 26 error is lower than the one at t + 2 and, consequently, a forecast for t + 26 of an older model run will on average have more forecast skill than the t + 2 forecast from the next (and newer) model run. This means that there is a transition period at the beginning of every NWP model run, where an old run provides better forecasts until the point is reached where the forecasts from the new model run can be used for the RAFT adjustment.

Figure 8 illustrates the relationship between all four initialization times, depicting the average RAFT RMSE as a function of the time of day in UTC. The times when a new model run is started are marked by dashed vertical lines. Again, the RMSE is computed using the most recent adjusted and optimal forecast. Here, the mean score is shown, as well as 90% confidence intervals based on 1,000 bootstrap samples.
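Bootstrap intervals of this kind can be obtained as in the sketch below. It assumes a simple percentile bootstrap over forecast cases, which is one common choice and not necessarily the exact procedure used for Figure 8; the function name, the fixed seed and the example errors are illustrative.

```python
import math
import random


def bootstrap_rmse_interval(errors, n_boot=1000, level=0.90, seed=0):
    """Percentile bootstrap interval for the RMSE of a list of forecast errors."""
    rng = random.Random(seed)
    n = len(errors)
    squared = [e * e for e in errors]
    # Resample forecast cases with replacement and recompute the RMSE each time.
    stats = sorted(
        math.sqrt(sum(rng.choices(squared, k=n)) / n) for _ in range(n_boot)
    )
    lower_idx = int((1 - level) / 2 * n_boot)
    upper_idx = int((1 + level) / 2 * n_boot) - 1
    return stats[lower_idx], stats[upper_idx]


# Hypothetical forecast errors (degrees Celsius) for one hour of the day.
sample_errors = [0.4, -1.2, 0.8, 0.1, -0.6, 1.5, -0.3, 0.9, -0.2, 0.7]
low, high = bootstrap_rmse_interval(sample_errors)
print(round(low, 3), round(high, 3))
```

Two runs can then be compared by checking whether their intervals overlap at a given time of day.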
At first glance, there is a strong diurnal variation in all four runs, with the lowest predictability around midday and the highest during the early morning. We are interested in the ranking of the four runs in terms of forecast skill. Ordinarily, we would expect the newest run to be the best, but as seen in Figure 7b, there is a short period during which an older run produces better forecasts. For the first few hours of the day, the ranking is as expected, in that the 2100 UTC run has the lowest RMSE and the 0300 UTC run the highest. When the first forecast from the new 0300 UTC run comes in at 0400 UTC, the skill decreases considerably, instead of improving. This is because no recent forecast data are available for the RAFT adjustment and we have to rely on the error information from 24 hr before. For 2 hr after the initialization of the 0300 UTC run, the 2100 UTC run remains the best forecast; the score difference between the two runs is actually significant at the 90% level. After 0600 UTC, the model runs rank in the expected order.
A similar pattern can be noticed every time a new model run is produced, with the exception of the 1500 UTC run. This run actually ranks best, or at least close to the others, from the first forecast, coinciding with the increase in predictability in the afternoon. We can conclude that the four daily model runs have comparable forecast quality after applying RAFT, apart from a transition period of about 2 hr. During this period, forecasts from an older run should be preferred to the newest.

Results for all sites
After presenting the results for Heathrow Airport, we now discuss how RAFT performs for all observation sites available. The dataset covers the British Isles (Figure 2) and displays a wide variety of local characteristics, such as sites in the Scottish mountains at elevations above 1000 m or coastal towns. Figures 9a,b compare the average RMSE of the EMOS and RAFT forecasts for the 2100 UTC model run, similar to Figure 7. Again, they represent snapshots at different times in the RAFT adjustment process. In Figure 9a, we see the maximum achievable RAFT improvement over the EMOS mean if we only applied the adjustment once at the moment the first forecast becomes valid at t + 1. At that time, no observations are available yet for the new run, so we have to rely solely on error information from the run initialized 24 hr earlier. Those RAFT forecasts for which the adjustment period extends beyond the beginning of the run have been adjusted using the observation made at t + 0, combined with the old run's t + 24 forecast. While the benefit from applying RAFT in this way is considerably smaller than the improvement we see as soon as the new forecast data are used, there is still a reduction in the RMSE for the next 12 hr. We notice an interesting detail between t + 20 and t + 23 (corresponding to 1700 UTC and 2000 UTC, respectively). In this period of high predictability, the RAFT scores are actually slightly worse than the EMOS scores, but revert to being equal with the next RAFT adjustment at t + 2 (not shown). This pattern can be observed at a handful of sites, where the error correlation between the lead times is particularly strong and the corresponding adjustment periods quite long. The RAFT algorithm described in Section 3 is applied in the same form to all locations and lead times. This does not take into account any potential stark differences in correlation patterns between the sites which in turn might require slightly different stopping rules or significance levels for an optimal performance. It might therefore be advisable to look into adjusting the algorithm if interest is in optimizing the performance for specific locations.

FIGURE 9
(a) RAFT error corrections are carried out only once at lead time t + 1. (b) is as (a), but RAFT is carried out for all lead times until the end of the trajectory.
In Figure 9b, we again see the outcome if RAFT is applied every hour up until the last installment at t + 35. This represents the maximum and most short-term gain in forecast skill achievable at every lead time and is not a continuous trajectory. We will use these forecasts for the entire subsequent analysis. At the beginning of the forecast cycle, there is a sharp drop in the RMSE, immediately after we are able to use data from the current run. Afterwards, the RAFT skill remains relatively constant, with small variations due to the diurnal cycle, whereas the EMOS skill fluctuates considerably. Especially during the last 12 hr of the forecast cycle, the improvement of RAFT over EMOS is quite substantial, as the short-term RAFT forecast corrections manage to cancel out the skill deterioration usually occurring with increasing lead time.
All observation sites in the study can be separated into three categories based on their location: coastal, inland and mountain sites. In Figure 10, the RMSE and CRPS scores for all locations are aggregated over all four model runs and the RAFT scores are plotted against the EMOS scores. The CRPS for RAFT is calculated by plugging the RAFT mean into the EMOS predictive distribution. For both the RMSE and the CRPS, we see an improvement for all sites after applying RAFT, in particular at locations where the error was high in the first place. In fact, the improvements seem to follow the same linear trend, apart from a group of five mountain sites (located in Scotland and Cumbria), which receive a somewhat larger benefit from RAFT than the other sites. This hints at some location-specific issues not resolved by EMOS or the original ensemble.
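The CRPS calculation described above, in which the RAFT mean replaces the EMOS mean while the EMOS spread is retained, can be sketched as follows. The sketch assumes a Gaussian predictive distribution, for which the CRPS has a well-known closed form; the distributional family, the function names and the numerical values are assumptions for illustration only.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def crps_normal(mu, sigma, y):
    """Closed-form CRPS of the forecast N(mu, sigma^2) for observation y."""
    z = (y - mu) / sigma
    return sigma * (
        z * (2.0 * normal_cdf(z) - 1.0)
        + 2.0 * normal_pdf(z)
        - 1.0 / math.sqrt(math.pi)
    )


# Hypothetical values in degrees Celsius.
emos_mean, emos_sd = 14.0, 1.2
raft_mean = 12.9  # adjusted mean; the spread is left unchanged
observation = 12.6

crps_emos = crps_normal(emos_mean, emos_sd, observation)
crps_raft = crps_normal(raft_mean, emos_sd, observation)
print(f"CRPS EMOS = {crps_emos:.3f}, CRPS RAFT = {crps_raft:.3f}")
```

Because the spread is held fixed, any difference between the two scores in this sketch comes solely from the shift in the mean.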
In Figure 3, we showed that EMOS produces nearly calibrated forecasts and naturally we want to preserve this level of calibration with RAFT. Therefore we compare the rank and PIT histograms of the raw ensemble, EMOS and the distribution consisting of the RAFT mean and the EMOS predictive variance. Figure 11 shows these histograms divided by site type. For all three forecasting methods, there is only very little difference in calibration between coastal, inland and mountain sites. The raw ensemble is, as expected, uncalibrated and very underdispersive, recognizable by the characteristic U-shape. EMOS is fairly calibrated, although there is still some hint of a bias and underdispersion. In contrast, RAFT is slightly overdispersive, meaning that the variance of the distribution is on average too large. This is not surprising, given that the mean of the distribution now has much better deterministic skill, but the corresponding EMOS variance has not changed. An additional adjustment of the EMOS variance to counteract the induced overdispersion is a potential subject for further study.
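To make the link between histogram shape and dispersion concrete, the following sketch computes PIT values for Gaussian predictive distributions and bins them. It uses synthetic data in which the forecast spread is deliberately too large, so that the overdispersive case discussed above can be reproduced; the Gaussian family, the function names and all numbers are assumptions and are not taken from the study.

```python
import random
from statistics import NormalDist


def pit_values(means, sds, observations):
    """PIT value F(y) for Gaussian predictive distributions N(mean, sd^2)."""
    return [NormalDist(m, s).cdf(y) for m, s, y in zip(means, sds, observations)]


def pit_histogram(pits, n_bins=10):
    """Relative frequency of PIT values in equally wide bins on [0, 1]."""
    counts = [0] * n_bins
    for u in pits:
        # A PIT value of exactly 1.0 is assigned to the last bin.
        counts[min(int(u * n_bins), n_bins - 1)] += 1
    return [c / len(pits) for c in counts]


# Synthetic demonstration: the observations have standard deviation 1,
# but every forecast claims a standard deviation of 1.5 (overdispersive).
rng = random.Random(42)
n = 5000
observations = [rng.gauss(0.0, 1.0) for _ in range(n)]
hist = pit_histogram(pit_values([0.0] * n, [1.5] * n, observations))
print([round(h, 3) for h in hist])
```

A calibrated forecast would give roughly equal frequencies in all bins, an underdispersive one a U-shape, and an overdispersive one, as simulated here, a hump in the central bins.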
Another indicator of calibration is the actual coverage of the prediction interval compared to the nominal value. The ensemble members span a prediction interval with a nominal coverage of 11∕13 ≈ 84.62%, which is the coverage that would be attained under perfect calibration. However, the raw MOGREPS-UK ensemble only reaches a coverage of 52.24%, whereas the EMOS coverage is 79.29% and the RAFT prediction intervals cover 87.31%. Although one is under- and the other overdispersive, both EMOS and RAFT are nearly calibrated, with the coverage for RAFT being slightly closer to the correct value.
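The nominal value of 11∕13 follows from a rank argument: if an observation and m ensemble members are exchangeable, the observation falls within the ensemble range with probability (m − 1)∕(m + 1). The sketch below assumes m = 12, which is the ensemble size implied by 11∕13, and a Gaussian predictive distribution for the central interval; the function names and the example forecasts are hypothetical.

```python
from statistics import NormalDist


def nominal_range_coverage(n_members):
    """Probability that the observation lies within the ensemble range
    when the observation and the members are exchangeable."""
    return (n_members - 1) / (n_members + 1)


def central_interval(mean, sd, coverage):
    """Central prediction interval of N(mean, sd^2) with the given coverage."""
    tail = (1.0 - coverage) / 2.0
    dist = NormalDist(mean, sd)
    return dist.inv_cdf(tail), dist.inv_cdf(1.0 - tail)


def empirical_coverage(intervals, observations):
    """Fraction of observations that fall inside their prediction interval."""
    hits = sum(lo <= y <= hi for (lo, hi), y in zip(intervals, observations))
    return hits / len(observations)


nominal = nominal_range_coverage(12)  # 11/13, about 84.62%

# Hypothetical predictive means, standard deviations and observations.
forecasts = [(10.0, 1.0), (12.5, 1.5), (8.0, 0.8), (15.0, 1.2)]
observations = [10.4, 15.9, 7.5, 14.1]
intervals = [central_interval(m, s, nominal) for m, s in forecasts]
print(f"nominal = {nominal:.4f}, empirical = {empirical_coverage(intervals, observations):.2f}")
```

Comparing the empirical with the nominal value over a large test set gives the kind of coverage figures quoted above; with only four cases, as here, the result is purely illustrative.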
FIGURE 11
Verification rank histograms for the raw ensemble (top row) and PIT histograms for the EMOS (middle row) and RAFT (bottom row) forecasts. The RAFT predictive distribution is generated by using the EMOS predictive variance. The histograms are divided by site type and data are aggregated over all dates, lead times and model runs in the test dataset.

Finally, we look at how RAFT performs during different seasons of the year. The test set contains two full spring seasons, and one full winter, summer and autumn. Figure 12 depicts the RMSE skill score, the relative improvement of the RAFT over the EMOS mean, for the four seasons. A score of 1 would mean a perfect forecast and a score of 0 no improvement over the reference forecast. Again, all four runs and all sites have been aggregated. The largest gain in forecast skill occurs during the night and is very similar for all seasons. The same pattern holds for the time between 1200 and 1600 UTC, where the skill score values are very close. In the morning, however, the scores for summer and winter behave very differently; they both decrease, but the summer skill score much faster and further than the winter score. This is due to the fact that in summer, the diurnal cycle plays a much more prominent role (not shown) and the predictability during night is much higher than during the day. In winter, the RMSE is more stable and there is only very little difference in predictability. The deterioration in the skill score during the early morning in summer coincides with a period of large change in predictability. It seems that during this time predictability changes so fast that even the very short-term RAFT adjustment can only improve the forecast skill by a small amount. Therefore, it might be advantageous to look into obtaining separate RAFT coefficients for the different seasons. This is not possible in the context of the current study, however, as a much larger training dataset would be required.
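The RMSE skill score used in this comparison can be written as 1 − RMSE_RAFT∕RMSE_EMOS, which is the standard form consistent with the interpretation given above (1 for a perfect forecast, 0 for no improvement over the reference). A minimal sketch, with hypothetical function names and temperature values:

```python
import math


def rmse(forecasts, observations):
    """Root mean squared error of point forecasts."""
    n = len(observations)
    return math.sqrt(sum((f - y) ** 2 for f, y in zip(forecasts, observations)) / n)


def rmse_skill_score(rmse_forecast, rmse_reference):
    """1 for a perfect forecast, 0 for no improvement, negative if worse."""
    return 1.0 - rmse_forecast / rmse_reference


# Hypothetical temperatures in degrees Celsius.
observations = [10.0, 12.0, 15.0, 13.0]
emos_mean = [11.0, 13.5, 14.0, 12.0]
raft_mean = [10.5, 12.5, 14.6, 12.8]

score = rmse_skill_score(rmse(raft_mean, observations), rmse(emos_mean, observations))
print(f"RMSE skill score of RAFT relative to EMOS: {score:.2f}")
```

Under this definition, a skill score of 0.2 corresponds to a 20% reduction in RMSE relative to the EMOS mean.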

CONCLUSIONS AND DISCUSSION
This paper presents a new post-processing approach for NWP forecasts, rapid adjustment of forecast trajectories (RAFT), which is applied on top of the traditional post-processing approach EMOS once new information pertaining to the current forecast trajectory becomes available. By utilizing the forecast error correlation structure in the post-processed NWP forecast trajectories, the EMOS mean forecasts of the not-yet-realized part of the trajectory are adjusted in every time step of the forecast based on the forecast errors that have already been realized. This computationally efficient way of using the newest available information provides an appealing alternative to computationally costly rapid ensemble cycles (Lu et al., 2007; Benjamin et al., 2016), and the older forecast gains skill in the time between initialization and release of the next NWP forecast cycle. While the precise set-up described here may have some operational restrictions due to computing and observation processing time if applied at a large number of locations, our results provide a convincing proof-of-concept. For example, as shown in Figure 9b, the forecast skill may be improved by over 40% on average in terms of RMSE when a 32-hour-old forecast is supplemented with the most recent available information an hour before it is realized. In an operational setting, the amount of benefit from the RAFT approach will depend heavily on the operational set-up of the forecast system. The MOGREPS-UK data used here were run on a 6-hourly basis, which is quite typical for an NWP system. For this type of set-up, our results at Heathrow Airport suggest a potential new strategy for updating the forecast cycle in that a delay in introducing the new NWP forecast may be preferred if RAFT is employed. Since spring 2019, MOGREPS-UK has run on an hourly-updating cycle, with three members run every hour and an 18-member ensemble formed by time-lagging of six cycles. In such cases, it might be beneficial to apply RAFT to the older members of the time-lagged ensemble; Schuhen (2019) gives an application of RAFT to individual ensemble members.
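The adjustment principle summarized at the start of this section, correcting a not-yet-realized forecast using an error that has already been observed, can be illustrated with a single linear regression between the errors at two lead times. This is a deliberately simplified sketch and not the algorithm of Section 3, which additionally determines adjustment periods and stopping rules; the function names, the single-predictor form and the training values are assumptions.

```python
def fit_error_regression(earlier_errors, later_errors):
    """Ordinary least squares fit of later_error = a + b * earlier_error."""
    n = len(earlier_errors)
    mean_x = sum(earlier_errors) / n
    mean_y = sum(later_errors) / n
    sxx = sum((x - mean_x) ** 2 for x in earlier_errors)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(earlier_errors, later_errors))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def adjust_forecast(emos_mean_later, observed_error_earlier, a, b):
    """Subtract the predicted error from the EMOS mean at the later lead time.

    Errors are defined as forecast minus observation.
    """
    predicted_error = a + b * observed_error_earlier
    return emos_mean_later - predicted_error


# Hypothetical training errors at two lead times, taken from past runs.
errors_t = [-1.0, 0.0, 1.0, 2.0]
errors_t_plus_k = [-0.5, 0.0, 0.5, 1.0]
a, b = fit_error_regression(errors_t, errors_t_plus_k)

# The current run was 2.0 K too warm at lead time t, so the EMOS mean
# at lead time t + k is lowered by the predicted error.
adjusted = adjust_forecast(10.0, 2.0, a, b)
print(f"a = {a:.2f}, b = {b:.2f}, adjusted forecast = {adjusted:.2f}")
```

The weaker the error correlation between the two lead times, the closer the slope is to zero and the smaller the adjustment, which matches the observation that the largest benefit is obtained for lead times close to the most recent observation.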
RAFT is easily implemented at individual locations and could be especially useful for forecast users in applications such as aviation and renewable energy production where decision-making relies on location-specific skilful weather forecasts. Here, the forecast user commonly has access to their own observations in close to real time while the NWP forecast may be delivered with a small time lag, or a decision needs to be made in the middle of a forecast cycle, making the setting ideal for a RAFT application. In such cases, observation frequency may also be higher than the time resolution of the NWP forecast, a situation to which RAFT can easily be adapted.
In the EMOS post-processing procedure, each lead time is corrected independently based on forecast errors pertaining to that same lead time in older forecasts. As noted by, for example, Schefzik et al. (2013), this may lead to physical inconsistencies between lead times so that the EMOS mean trajectory over all lead times may not be a physically consistent forecast trajectory. One potential inconsistency is unrealistically large jumps in the temperature between lead times. Using the convergence index proposed by Ehret (2010), we compared the temporal stability, or the jumpiness, of the EMOS mean trajectory and the last RAFT trajectory and found the RAFT trajectory to be less jumpy than the original EMOS mean trajectory for almost all sites. This indicates that RAFT might correct some of the physical inconsistencies across lead times introduced in the univariate EMOS post-processing. An approach that combines RAFT with the ensemble copula coupling (ECC) approach of Schefzik et al. (2013) to generate physically consistent trajectories for wind forecasts is proposed in Schuhen (2019).
In our analysis, we update only the mean of the EMOS forecasts while the variance remains unchanged. The original EMOS forecasts are slightly underdispersive and biased; a similar effect has been reported in previous applications of EMOS to individual locations (e.g., Thorarinsdottir and Gneiting, 2010). The RAFT procedure reduces the bias and improves the overall calibration, while changing the sign of the miscalibration to slightly overdispersive, cf. Figure 11. This effect is robust across all lead times as the EMOS forecast uncertainty is nearly constant across the relatively short lead times of 1-36 hr, except for minor diurnal differences related to the diurnal predictability pattern displayed in Figure 8. Our experiments to update the EMOS spread simultaneously with the mean were not successful in that they did not result in further skill improvement. One potential explanation for this is the consistency of the EMOS spread across the lead times; as the EMOS spread for 1 hr ahead forecasts is similar to that for 36 hr ahead forecasts, we do not necessarily expect to be able to improve upon the spread for the 36 hr ahead forecasts, even if their means are updated to become 1 hr ahead predictions. However, a joint approach for mean and spread might be worth investigating further in cases where the originally post-processed forecast is nearly perfectly calibrated, or slightly overdispersive.