Prior distributions expressing ignorance about convex increasing failure rates

This paper deals with the specification of probability distributions expressing ignorance concerning annual or otherwise discretized failure or mortality rates, when these rates can safely be assumed to be increasing and convex, but are completely unknown otherwise. Such distributions can be used as noninformative priors for Bayesian analysis of failure data. We demonstrate why a uniform distribution used in earlier work is unsatisfactory, especially from the point of view of insensitivity with respect to the time scale that is chosen for the problem at hand. We suggest alternative distributions based on Dirichlet distributed weights for the extreme points of relevant convex sets, and discuss which consequences a requirement for scale neutrality has for the choice of Dirichlet parameters.

rate." Various parametric distributions have been used in different applications, such as the Gompertz-Makeham distribution (Bowers et al., 1986) customarily used in the insurance industry. A common flexible framework for the various applications can be built by treating each of the mortality or failure rates as a parameter and introducing dependence between these parameters through restrictions on the parameter space. Such a model is introduced in Gelman et al. (1996), where, in accordance with the increasing convexity of the hazard rate function of the Gompertz-Makeham distribution, the constraint is that the mortality rate curve should be increasing and convex. This would seem to be a natural restriction in many cases. Considering a data set from life insurance, Gelman et al. (1996) perform both a maximum likelihood and a Bayesian analysis based on this model. The Bayesian analysis requires a prior distribution, which is updated by the likelihood according to Bayes' theorem to form the posterior, which is the basis for the inference. The analysis can be made "objective" by choosing a noninformative prior, expressing ignorance about the mortality rates. How to do this is not at all obvious, however. In the present paper, we examine how a distribution expressing ignorance for a vector of increasing, convex mortality rates can be specified.
Since the mortality rate is the probability of dying within a given time interval for a randomly chosen individual from a given age class, the choice of a noninformative prior distribution for the parameter $p$ of a binomial distribution is particularly relevant in our context. The conjugate prior is the beta distribution beta($p; \alpha, \beta$) with density proportional to $p^{\alpha-1}(1-p)^{\beta-1}$. With $\alpha = \beta = \epsilon$ close to 0, the posterior expectation of $p$ based on a data set with $y$ successes among $N$ trials is $(\alpha + y)/(\alpha + \beta + N)$, which approximates the maximum likelihood estimate $y/N$. Note that, if we change the time scale by, for example, considering monthly rather than annual mortality rates, the posterior estimate based on this noninformative prior is on average scaled appropriately down in accordance with the smaller number of deaths over the shorter intervals. This is in accordance with the desired goal expressed in Gelman (1996) "that as the scale of discretization changes with fixed data the strength of the prior distribution, the Bayesian analysis would remain roughly constant." In other words, we seek a prior that is approximately neutral with respect to the time scale that is chosen for the given problem. However, even though we may lack useful prior information about the level of each mortality rate individually, it would be unreasonable to use the noninformative beta prior to estimate the mortality rate for each age class separately. The constraint introduced in Gelman et al. (1996) of an increasing and convex mortality rate curve is a natural way of formalizing prior beliefs about the shape of this curve. With such a model constraint, the specification of a noninformative and scale neutral prior distribution is not so straightforward. Also, it may be difficult to verify for a particular distribution that "the Bayesian analysis would remain roughly constant" under changes of scale.
We cannot expect to find a conjugate prior for which the posterior expectations can be analytically computed, as in the beta-binomial case, nor is there necessarily an obvious target such as $y/N$ to aim for. If, for a particular prior, we are nevertheless able to analyze the effect a change in scale would have on the inference, such an analytic examination can save us from performing numerical computations with an unreasonable prior.
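The beta-binomial update described above can be illustrated with a minimal sketch. The function name `beta_posterior_mean` and the value chosen for the small prior parameter are illustrative assumptions; the counts are those quoted later in the text for the highest age class of the LIN data.

```python
def beta_posterior_mean(y, N, alpha, beta):
    """Posterior mean of p under a beta(alpha, beta) prior
    after observing y successes among N binomial trials."""
    return (alpha + y) / (alpha + beta + N)

eps = 1e-6            # assumed small value for alpha = beta
y, N = 10, 594        # deaths and number at risk in the highest LIN age class
near_mle = beta_posterior_mean(y, N, eps, eps)
mle = y / N           # maximum likelihood estimate y/N
print(near_mle, mle)
```

With both prior parameters close to 0, the two printed values agree to many decimal places, which is the sense in which this prior lets the data speak.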
It is natural to consider the uniform distribution as a candidate for a noninformative and scale neutral prior for the convex and increasing mortality rate model, since in general, the uniform distribution is a common choice for a prior distribution intended to be noninformative. A major disadvantage of this choice is that the prior changes under reparameterization, but at least it has the virtue that maximization of the posterior density is equivalent to maximizing the likelihood. Accordingly, the maximum likelihood analysis in Gelman et al. (1996) is followed up by a Bayesian analysis based on a uniform prior on the restricted parameter space. Comparison of draws from the posterior distribution of the mortality rates with the data, especially concerning the higher age classes, suggests that this may not be a good choice. In Theorem 1 of the present paper, we provide a mathematical explanation of why the mortality rate of the highest age class, in particular, seems to be overestimated by using the uniform distribution as a prior. An explanation involving the whole vector of mortality rates is given in Gelman (1996), Section 3.3, which attributes the apparent lack of fit to the prior expectation of the mortality rate curve climbing too steeply for higher age classes, as well as to the marginal variances of the prior distribution converging to 0 as the scale of discretization becomes finer, which implies that a fixed set of data is eventually unable to move the posterior distribution away from the prior expectation. Such behavior of the marginal prior variances is certainly unsatisfactory. Since we find the argument to be somewhat inaccurate, we provide a more detailed derivation of the limiting behavior under the uniform prior distribution of the expected shape of the mortality rate curve and the marginal mortality rate variances under scale refinement; see Proposition 2 and Corollary 1, respectively, in Section 5 of the present paper.
A new framework developed in Section 4 of this paper provides the basis for the analysis, and gives rise to a class of prior distributions that according to Theorem 5, with a suitable choice of parameters, contains distributions for which the marginal variances stay bounded away from 0 regardless of the scale of discretization. Hence, these distributions should have a potential for overcoming the difficulties reported in Gelman (1996) and therefore accomplishing the desired goal of approximate insensitivity to change in the scale of discretization. The good properties of these distributions are demonstrated numerically on the insurance dataset in Section 6.
The rest of the paper is organized as follows: Section 2 provides the insurance data and presents formally the model for convex, increasing mortality rates and the uniform prior on this set. In Section 3, we provide new explanations why the uniform prior is problematic, based on establishing the distributions of the initial and final mortality rates, that is, the mortality rates for the lowest and highest age class, respectively, which essentially determine the location of the mortality rate curve. In order to examine whether uniformity can nevertheless be part of a sensible noninformative modeling for this problem, we introduce distributions involving such uniform modeling in Section 4. Motivated by alternative characterizations of these distributions given in Theorems 3 and 4, we suggest in Definition 1 and Definition 2 new classes of prior distributions for convex and increasing mortality rate functions. In Section 5, we analyze the limiting behavior under scale refinement of the prior expectation and variance of the mortality rates under some of these priors. With the uniform prior being a special case, this provides an additional explanation why the uniform prior is unsatisfactory, not only for the location, but also for the shape of the mortality rate curve. This analysis is also the basis for suggesting distributions that have a potential for dealing more satisfactorily with the requirement for scale neutrality. Numerical experiments presented in Section 6 confirm the good behavior of these distributions. Some conclusions are given in Section 7.

THE MODEL AND THE UNIFORM PRIOR FOR CONVEX, INCREASING MORTALITY RATES
In a Bayesian model evaluation context, Gelman et al. (1996) reanalyze a dataset from life insurance originally studied in Broffitt (1988), containing the number $N_t$ of insured and the number $y_t$ of deaths ($t = 1, 2, \dots, 30$) in 30 consecutive age classes, ranging from 35 to 64, under a certain life insurance policy. The data are displayed in Table 1 and are henceforth referred to as the LIN dataset. Bayesian models for analyzing these data were also discussed in Carlin (1993). For this dataset, the observed mortality rates $y_t/N_t$ show a generally increasing trend, but jump quite irregularly with occasional downward jumps from one age class to the next. Attributing this irregular pattern to random fluctuations, Gelman et al. (1996) want to build into their prior for the underlying mortality rates $\theta_t$ the constraint that these rates are increasing and convex. Having also performed a maximum likelihood analysis, it is stated in Gelman (1996) that "since we were willing to use the maximum likelihood estimate, it seems reasonable to use a uniform prior distribution on $\theta$." Being uniform, this prior distribution might naively be perceived as noninformative, although it is not portrayed as such in Gelman et al. (1996) or Gelman (1996). In the next section, we investigate the potential for the uniform distribution on the set of an arbitrary number $n$ of increasing and convex mortality rates for being noninformative, by analyzing the consequences of this choice for the marginal prior distribution of $\theta_1$ and $\theta_n$, that is, the initial (first) and final (last) mortality rates, respectively, and hence for the location of the mortality rate curve under the uniform prior. Some possible consequences for the posterior distribution of these mortality rates in general and in particular with the LIN data are discussed.
In Section 6, a numerical investigation of the posterior distribution for these data under the uniform and other priors is provided.
We phrase the problem in somewhat more general terms. The data are assumed to arise from recording failures of objects or deaths of individuals belonging to a certain homogeneous population, occurring at ages contained in an interval $(T_0, T_n]$. In a laboratory experiment studying, for example, the effect of exposure to some chemical substance, "age" may refer to the time since onset of exposure rather than physical age. Also, age could be replaced by some relevant measure of accumulated load such as, for example, the total mileage of a car. The data are discretized by counting the number of deaths occurring at ages in the intervals $(T_{t-1}, T_t]$, $t = 1, 2, \dots, n$, with an arbitrary number $n$ (to be chosen by researchers or decision-makers) of equidistant measuring points $T_1, \dots, T_n$ in the interval $(T_0, T_n]$; the common distance $T_t - T_{t-1} = (T_n - T_0)/n$ between consecutive measuring points is hence also arbitrary. We denote by $N_t$ the number of individuals at risk at age $T_{t-1}$ and by $Y_t$ the number of deaths at ages in $(T_{t-1}, T_t]$. Some adjustments of $N_t$ may be made due to individuals joining the population or leaving the population without dying during the age interval $(T_{t-1}, T_t]$. In the LIN dataset (Table 1), such individuals are counted as 0.5. These adjustments are assumed not to depend on the underlying mortality rates. An arbitrary degree of right censoring may also take place due to individuals surviving the age $T_n$ or not reaching this age before the collection of data is completed.
The probability that a random individual having survived age $T_{t-1}$ dies before $T_t$ is denoted by $\theta_t$, $t = 1, 2, \dots, n$. Hence, $Y_t$ is binomially distributed with parameters $N_t$ and $\theta_t$. These mortality rates are collected in the vector $\theta$. The mortality rates $\theta_1$ and $\theta_n$ are referred to as the initial and the final mortality rates, respectively. Since under the assumption of increasing mortality rates all the rates must lie between $\theta_1$ and $\theta_n$, the values of these parameters essentially determine the location of the mortality rate curve, and they are the focus of the next section. The remaining rates describe the shape of the curve. This aspect of the model will be discussed in Section 5. Assuming that $Y_1, \dots, Y_n$ are independent and binomially distributed given $N_1, \dots, N_n$ and $\theta$, the likelihood for a dataset with observed number of deaths given by $y = (y_1, \dots, y_n)$ is proportional to
$$\prod_{t=1}^{n} \theta_t^{y_t} (1 - \theta_t)^{N_t - y_t}.$$
For any integer $p$ we denote by $m_p$ the Lebesgue measure on $\mathbb{R}^p$. Let $J_n$ be the set of convex, increasing $\theta$, that is,
$$J_n = \{\theta \in \mathbb{R}^n : 0 \le \theta_1 \le \dots \le \theta_n \le 1 \text{ and } \theta_{t+1} + \theta_{t-1} \ge 2\theta_t \text{ for } t = 2, \dots, n-1\}. \tag{1}$$
Denoting by $I$ the indicator function, the uniform prior distribution used in Gelman et al. (1996) is hence
$$\pi^{GMS}(\theta) = I_{J_n}(\theta)/m_n(J_n). \tag{2}$$
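As a small illustration of the model just described, the following sketch checks membership in the constraint set of (1) and evaluates the binomial log-likelihood up to an additive constant. The function names, the tolerance, and the toy counts are hypothetical and chosen only for illustration; they are not part of the LIN data.

```python
import math

def in_J(theta, tol=1e-12):
    """True if theta is increasing and convex with all entries in [0, 1]."""
    n = len(theta)
    if theta[0] < -tol or theta[-1] > 1 + tol:
        return False
    increasing = all(theta[t] - theta[t - 1] >= -tol for t in range(1, n))
    convex = all(theta[t + 1] + theta[t - 1] - 2 * theta[t] >= -tol
                 for t in range(1, n - 1))
    return increasing and convex

def log_likelihood(theta, y, N):
    """Binomial log-likelihood, dropping the binomial coefficients.
    Requires every rate to lie strictly between 0 and 1."""
    return sum(yt * math.log(th) + (Nt - yt) * math.log(1 - th)
               for th, yt, Nt in zip(theta, y, N))

# Hypothetical toy data with three age classes
theta = [0.01, 0.02, 0.04]
y, N = [1, 2, 5], [100, 100, 100]
print(in_J(theta), log_likelihood(theta, y, N))
```

A vector such as `[0.01, 0.03, 0.04]` is increasing but fails the convexity condition, so `in_J` rejects it.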

3.1 The marginal prior for the final mortality rate $\theta_n$
Using $\pi^{GMS}(\theta)$ as a prior, based on simulation experiments it was observed in Gelman et al. (1996) that "posterior predictive data sets were mostly higher than the observed data for the later ages." One possible explanation mentioned by the authors is that this could be a selection effect destroying the convexity, due to the insurance company having "screened out some high risk older people." As also indicated by Gelman (1996), it is reasonable to ask if this could instead be related to the behavior of the prior for higher age classes. It is therefore natural to examine, in particular, the marginal distribution for the final mortality rate $\theta_n$ under the uniform prior on $J_n$. The following theorem suggests that the problem is indeed due to the inherent mathematical properties of this distribution linked to the geometry of the set $J_n$. Essentially, as reflected in the proof, the problem is that the $(n-1)$-dimensional volume of the set of $\theta \in J_n$ such that $\theta_n = x$ is proportional to $x^{n-1}$, resulting in increased probability weight being put on higher values of $\theta_n$ as $n$ increases.
Theorem 1. Under $\pi^{GMS}$ the rate $\theta_n$ has the marginal density beta($\theta_n; n, 1$), that is, $\theta_n$ is beta-distributed with parameters $n$ and 1.

Remark:
According to Gelman (1996), Section 3.1, the beta distribution also characterizes the marginal distributions for the parameters of a discrete approximation to an increasing but not necessarily convex function under a uniform prior.
An immediate consequence of Theorem 1 is that the expected mortality rate for the highest age class under $\pi^{GMS}$ is $n/(n+1)$, and hence increases with $n$. This is quite unsatisfactory, since all the mortality rates ought to decrease rather than increase if the interval length $T_t - T_{t-1}$ decreases, and as a result $n$ increases. Moreover, since we want a prior distribution suitable also with an arbitrary degree of right censoring at the final measuring time $T_n$, the highest mortality rate $\theta_n$ may be small and hence the prior expectation $n/(n+1)$ unreasonably large even for small values of $n$. This is relevant for the special case analyzed in Gelman et al. (1996), with $n = 30$ representing age 64. According to $\pi^{GMS}(\theta)$, a person being lucky enough to have survived to the age of 64 will have only a 3% chance of surviving another year. We think this is a rather disturbing consequence, considering that we believe the data to come from a 20th century western human population which is not known to have been exposed to any kind of exceptionally high-risk factors. This is in accordance with Hjort et al. (2006), section 9, which mentions some other gloomy facts facing persons whose survival distributions follow the uniform prior. However, despite its peculiar properties when taken literally as a joint distribution for the vector of mortality rates, Hjort et al. (2006) consider it a reasonable choice of a vague prior leading to an "objective analysis of the data," and discard it only in the context of prior predictive model evaluation. We note, however, that the marginal variance of $\theta_n$ is $n/((n+1)^2(n+2))$. In the context of the LIN data, with $n = 30$, the resulting small SD of 0.03 could hardly be considered as an indication of noninformativity. Since the marginal variance is of order $1/n^2$, a change in scale and a corresponding increase of $n$, for instance, considering monthly mortality rates, would aggravate this problem.
At the same time, due to the reduced size of the basic time intervals, the true value of $\theta_n$ would decrease with such a change in scale, and hence become more and more out of reach for a posterior distribution based on a fixed dataset. This behavior under a change of time scale indicates that the desired scale neutrality is not achieved by the prior (2).
It is worth noting that with the binomial likelihood $\theta_n^{y_n}(1-\theta_n)^{N_n-y_n}$ for $\theta_n$, the marginal prior for $\theta_n$ would be updated to a beta distribution with parameters $n + y_n$ and $N_n - y_n + 1$, if we were to use only data from age class $n$. For the LIN data with $n = 30$, $y_n = 10$ and $N_n = 594$, this would yield the posterior expectation $E(\theta_n \mid y_n) = (y_n + n)/(N_n + n + 1) = (10 + 30)/625$. This is almost four times higher than the posterior expectation when using a standard noninformative beta prior, which is approximately $10/594$. The numerical results given in Section 6 indicate that the consequences may not be quite that bad when taking data for all age groups into consideration, but confirm the predicted overestimation of the mortality rate for the highest age class. Hence it is reasonable to conclude from Theorem 1 that prior (2) is not as neutral as might naively be thought, and is a bad choice of a noninformative prior.
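The beta($n$, 1) marginal and its moments can be checked by simulation. The sketch below rests on an assumption taken from later sections of the paper: the set of increasing convex rate vectors is a simplex whose vertices are the zero vector and $n$ extreme points that each have final entry 1, so a uniform draw corresponds to Dirichlet(1, ..., 1) weights on these $n + 1$ vertices, and the final rate equals the total weight on the non-zero vertices. The function name, seed, and sample size are illustrative choices.

```python
import math
import random
import statistics

def sample_final_rate(n, rng):
    """Draw the final rate under the uniform prior: normalized unit
    exponentials give Dirichlet(1, ..., 1) weights on n + 1 vertices,
    and the final rate is the total weight on the n non-zero vertices."""
    e = [rng.expovariate(1.0) for _ in range(n + 1)]
    return sum(e[1:]) / sum(e)

rng = random.Random(0)
n = 30
draws = [sample_final_rate(n, rng) for _ in range(5000)]

mean_theory = n / (n + 1)                            # mean of beta(n, 1)
sd_theory = math.sqrt(n / ((n + 1) ** 2 * (n + 2)))  # SD of beta(n, 1)
print(statistics.mean(draws), mean_theory)
print(statistics.stdev(draws), sd_theory)
```

The simulated mean and standard deviation should land close to the theoretical values, showing how tightly this prior concentrates the final rate near 1 when $n = 30$.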
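The arithmetic behind this comparison can be reproduced in a few lines, using the standard fact that a beta distribution with parameters $a$ and $b$ has mean $a/(a+b)$. Variable names are illustrative.

```python
n, y_n, N_n = 30, 10, 594          # LIN values quoted in the text

# Beta posterior for the final rate using data from age class n only
a_post = n + y_n                   # first beta parameter
b_post = N_n - y_n + 1             # second beta parameter
post_mean_uniform = a_post / (a_post + b_post)   # 40 / 625

reference = y_n / N_n              # approximate mean under a near-zero beta prior
ratio = post_mean_uniform / reference
print(post_mean_uniform, reference, ratio)
```

The ratio comes out slightly below four, matching the "almost four times higher" statement.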

Can the problem be remedied by restricting the support of $\theta_n$?
The mortality rate model of Gelman et al. (1996) is also considered in Johnson (2007) in the context of model evaluation. In order to obtain a sensible prior predictive sample, Johnson (2007) suggests limiting the support of the mortality rates to the interval $[0, 0.05]$. A similar suggestion appears in Hjort et al. (2006). Forcing the posterior distribution of the mortality rates downward by restricting the support in this way could conceivably remedy the problems with the uniform prior demonstrated above. We can, however, show that the pressure toward higher values for $\theta_n$ remains. Updating the marginal prior density, which under the restricted support is proportional to $I_{[0,0.05]}(\theta_n)\theta_n^{n-1}$, only with data $y_n$ from the $n$th age group, we can show that the posterior expectation of $\theta_n$ given $y_n$ is close to the support limit of 0.05. To see this, define integers $k$, $r$ such that $k - 1 + r = n - 1 + y_n$ and $r$ is minimal with respect to satisfying the inequality $r \ge 0.05(N_n - y_n + r)$. Then the posterior density for $\theta_n$ can be written as proportional to $\theta_n^{k-1} g(\theta_n)$ on $[0, 0.05]$, where $g(\theta_n) \propto \theta_n^r (1 - \theta_n)^{N_n - y_n}$ is increasing on $[0, 0.05]$, and by the proposition given below it follows that the posterior expectation is at least $0.05k/(k+1)$. For the LIN data $n = 30$, $y_n = 10$ and $N_n = 594$, so we can take $r = 31$, giving $k = 9$. Consequently, the posterior expectation is at least 0.045, which is well above the observed mortality rates for the highest age classes.
Proof. Using first integration by parts, then the inequality $\theta^{k+1} \le a\theta^k$, and finally integration by parts again, we obtain the lower bound $ak/(k+1)$ for the expectation.
The support $[0, 1]$ for $\theta_n$ implied by the prior distribution $\pi^{GMS}(\theta)$ is needed to cover any possible application of the convex and increasing mortality rate model, although it is unreasonably wide in the insurance application. But even if the support is drastically shortened, this proposition indicates that the uniform distribution with the restricted support $[0, a]$ for $\theta_n$ is still not a satisfactory choice of a noninformative prior distribution. An increase in the number $n$ of measuring points would aggravate the effect, provided that the number of deaths $y_n$ in the highest age class decreases more slowly than the increase in $n$. Indeed, writing $n - 1 + y_n$ as $k - 1 + r$ and $g(\theta_n) \propto \theta_n^r (1 - \theta_n)^{N_n - y_n}$ as above, we see that with a nonincreasing population size $N_n$, the conditions of the proposition are satisfied with an increasing value of $k$ when $n$ increases. This pushes the lower bound $ak/(k+1)$ for the posterior expectation of $\theta_n$ toward the support limit $a$ when $n$ increases, whereas in order to be scale neutral it ought to decrease when $n$ increases.
The suggestion to restrict the support of $\theta_n$ is also not entirely satisfactory from a purist Bayesian point of view, since it seems to be based on an initial inspection of the data revealing that the maximum observed mortality rate is less than 0.02. In fact, Carlin (1993)
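The choice of $r$ and $k$ and the resulting lower bound can be verified numerically. The sketch below also approximates the posterior mean on the restricted support by a midpoint rule, assuming the posterior density is proportional to the final rate raised to the power $n - 1 + y_n$ times one minus the rate raised to the power $N_n - y_n$, truncated to $[0, a]$. The function names and grid size are illustrative choices.

```python
import math

def minimal_r(N, y, a):
    """Smallest integer r with r >= a * (N - y + r)."""
    r = 0
    while r < a * (N - y + r):
        r += 1
    return r

def truncated_posterior_mean(n, y, N, a, grid=20000):
    """Midpoint-rule mean of the density proportional to
    x**(n - 1 + y) * (1 - x)**(N - y) on the interval [0, a]."""
    h = a / grid
    xs = [(j + 0.5) * h for j in range(grid)]
    logw = [(n - 1 + y) * math.log(x) + (N - y) * math.log(1 - x) for x in xs]
    top = max(logw)                       # subtract the max for numerical stability
    w = [math.exp(v - top) for v in logw]
    return sum(x * wi for x, wi in zip(xs, w)) / sum(w)

n, y_n, N_n, a = 30, 10, 594, 0.05
r = minimal_r(N_n, y_n, a)
k = n + y_n - r                           # from k - 1 + r = n - 1 + y_n
bound = a * k / (k + 1)
post_mean = truncated_posterior_mean(n, y_n, N_n, a)
print(r, k, bound, post_mean)
```

The numerically computed posterior mean should lie between the lower bound and the support limit, consistent with the claim that truncation does not remove the upward pressure.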

3.3 The marginal prior for the initial mortality rate $\theta_1$
Before turning to the introduction of alternative priors, it is natural to consider the possibility that the prior (2) could lead to biased estimation also for the initial mortality rate $\theta_1$. The following counterpart to Theorem 1 indicates that this could happen in some cases.
Theorem 2. Under $\pi^{GMS}$ the rate $\theta_1$ has the marginal density beta($\theta_1; 1, n$), that is, $\theta_1$ is beta-distributed with parameters 1 and $n$.
The proof is similar to the proof of Theorem 1. Parallel to the set $K_n$ of (3), it builds on the set $L_n$ of (4). The proof is a little more complex and is given in the Appendix. It follows from Theorem 2 that the marginal expectation of $\theta_1$ under the uniform distribution on $J_n$ is $1/(n+1)$, and hence decreases at a pace that matches the decrease in the mortality rates that we would expect as the basic time intervals become shorter. This should indicate a less problematic prior modeling under (2) for the initial than for the final mortality rate. However, if we were to update the prior with data only from age group 1, the posterior expectation would be $E(\theta_1 \mid y_1) = (y_1 + 1)/(N_1 + n + 1)$. If $\theta_1$ is very close to 0 and as a result $y_1$ is very small, and if not only absolute but also relative error is important, the extra 1 added to $y_1$ could lead to an undesirable relative overestimation of $\theta_1$ when combining $\pi^{GMS}(\theta)$ with the likelihood. When $n$ increases, relatively more weight is put on this extra 1 in this posterior conditional expectation, since, for a fixed data set, the value of $y_1$ decreases as the length of the basic time interval decreases. Depending on the true shape of the mortality rate curve, this could to some extent be compensated for by data from the next age classes. Nevertheless, it seems reasonable to conclude that also for the initial mortality rate the desired scale neutrality is not achieved in general.
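The relative effect of the extra 1 in the numerator can be seen with a short calculation. The counts below are hypothetical and chosen only to represent a very small initial rate; they are not taken from the LIN data, and the function name is illustrative.

```python
def initial_rate_posterior_mean(y1, N1, n):
    """Posterior mean of the initial rate when a beta(1, n) prior is
    updated with y1 deaths among N1 individuals in the first age class."""
    return (y1 + 1) / (N1 + n + 1)

# Hypothetical counts: one death among 1000 individuals, n = 30 age classes
y1, N1, n = 1, 1000, 30
post_mean = initial_rate_posterior_mean(y1, N1, n)
mle = y1 / N1
print(post_mean, mle, post_mean / mle)
```

In this hypothetical case the posterior mean is almost twice the observed rate, which illustrates the relative overestimation discussed above.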

ALTERNATIVE NONINFORMATIVE PRIORS
Our desired goal is to create a scale neutral noninformative prior or family of priors, equally well suited for arbitrary populations including nonhuman ones, as well as arbitrary time intervals between measuring points. Consider, for example, a laboratory experiment testing the toxicity of a certain chemical substance to a population of some animal species, measured on a suitable time scale (e.g., hours, days, or weeks). In some such cases, one might very well be in a state of complete ignorance prior to the experiment, and a scale neutral noninformative prior distribution is called for. In view of the problems demonstrated in the previous section, we propose in general to first specify a noninformative marginal prior distribution for the final and/or initial mortality rates, enabling the data to locate the curve in the correct region. Thereafter a joint conditional distribution is specified for the remaining mortality rates, aimed at enabling the data to determine the correct shape of the curve. In particular, we will examine to what extent the uniform prior distribution, found inappropriate when applied to the entire $\theta$-vector, can be more successfully applied for this more restricted task. This examination is performed both theoretically and numerically, and is reported in Sections 5 and 6, respectively. Both the theoretical analysis and the Monte Carlo simulation needed for the numerical investigation build on alternative characterizations of the relevant distributions given in Theorems 3 and 4. More flexible alternative suggestions presented in Definitions 1 and 2 are also motivated by these characterizations.
In the following subsection, we focus specifically on the problems discussed in Section 3.1 concerning the final mortality rate. By choosing a noninformative prior for $\theta_n$, the magnitude of this parameter is determined within a Bayesian framework, and we avoid the non-Bayesian, data-based specification of an upper bound for the support used by Carlin (1993).

4.1 Modifying the marginal prior for the final mortality rate
As a marginal prior for $\theta_n$, it is natural to choose a beta distribution, which is a conjugate prior for the factor $\theta_n^{y_n}(1-\theta_n)^{N_n-y_n}$ of the binomial likelihood. As a start, we combine it with a uniform conditional distribution for the rest of the parameters. Recalling the definition of $K_n$ in (3), $\theta \in J_n$ if and only if $0 \le \theta_n \le 1$ and $\theta_{1:n-1} \in \theta_n K_n$. We thus introduce the prior (the superscript F indicating "final"):
$$\pi^F_{\alpha,\beta}(\theta) = \mathrm{beta}(\theta_n; \alpha, \beta)\, \frac{I_{\theta_n K_n}(\theta_{1:n-1})}{\theta_n^{n-1}\, m_{n-1}(K_n)}. \tag{5}$$
Various degrees of weaker or stronger informativity based on subjective knowledge can be accommodated by different choices of the parameters $\alpha$ and $\beta$. The uniform distribution on $J_n$ is obtained as the special case $\alpha = n$, $\beta = 1$, since then (5) is proportional to
$$\theta_n^{n-1} I_{[0,1]}(\theta_n) \cdot \theta_n^{-(n-1)} I_{\theta_n K_n}(\theta_{1:n-1}) = I_{J_n}(\theta).$$
From the point of view of scale neutrality and noninformativity, as indicated in the introduction, the choice $\alpha = \beta = \epsilon$ close to 0 is preferable. If the sample size $N_n$ is not too small, and the time scale is sufficiently coarse, data $y_n$ from the $n$th age group alone should then suffice to locate the marginal posterior distribution of $\theta_n$ adequately. If the number $n$ of measuring points is drastically increased, and as a result, the true value of $\theta_n$ and the observed $y_n$ become very small, data from lower age classes should still contribute to a sensible location of this distribution.
However, we also want the prior to be noninformative with respect to the shape of the mortality rate curve, which is determined by the other parameters. In Section 5, we will discuss whether the conditional uniform distribution on these parameters is a satisfactory prior choice from this point of view. The basis for this discussion is Theorem 3 below, expressing the prior distribution (5) in terms of familiar random variables. This builds on the following lemma:
Lemma 1. Denote by $\bar{K}_n$ the set of convex and increasing functions $f$ on $\{1, 2, \dots, n\}$ satisfying $0 \le f(1) \le f(n) = 1$. Then $\bar{K}_n$ is a convex set spanned by $n$ extreme points. Identifying $f$ with the vector $(f(1), \dots, f(n))^T$, the extreme points are
$$w^1 = 1_n = (1, \dots, 1)^T, \qquad w^i_t = \max\{0, (t - i + 1)/(n - i + 1)\}, \quad t = 1, \dots, n, \quad i = 2, \dots, n.$$
Proof. It is straightforward to verify that the set $\bar{K}_n$ is convex. Clearly $w^1 = 1_n$ is an extreme point, since, for all $t$, $u_t \le 1$ for every $u \in \bar{K}_n$. Consider $w^i$ for $i \ge 2$. If $w^i = au^1 + (1-a)u^2$ with $u^1$ and $u^2 \in \bar{K}_n$ and $0 < a < 1$, then clearly $u^j_t = 0$ for $1 \le t \le i-1$, $j = 1, 2$. Since the graph of the function represented by $w^i$ is a straight line from $(i-1, 0)$ to $(n, 1)$, the convexity assumption implies that $u^1$ and $u^2$ have to follow this straight line as well. Hence, $u^1 = u^2 = w^i$, and it follows that $w^i$ is an extreme point. Now let $u \in \bar{K}_n$ be arbitrary. Note that the $n \times n$ matrix $W^n$ formed by the column vectors $w^1, \dots, w^n$ is lower triangular with non-zero values on the diagonal. Hence, its determinant is positive, and it follows that $w^1, \dots, w^n$ are linearly independent. We may therefore write $u$ uniquely as a linear combination $u = a_1 w^1 + \dots + a_n w^n$. Since $u_n = 1$ and $w^i_n = 1$ for $i = 1, \dots, n$, it follows that $a_1 + \dots + a_n = 1$. It remains to prove that $a_i \ge 0$ for $i = 1, \dots, n$. Clearly, $a_1 = u_1 \ge 0$. Since $u_2 - u_1 = a_2/(n-1)$ and $u$ is increasing, it follows that $a_2 \ge 0$. It is not too hard to see more generally that for $i \ge 2$ we have $u_i - u_{i-1} = a_2/(n-1) + \dots + a_i/(n-i+1)$.
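The extreme-point representation in Lemma 1 can be checked numerically. The sketch below assumes that the first extreme point is the all-ones vector and that, for $i \ge 2$, the $i$th extreme point is zero up to index $i - 1$ and then rises linearly to 1 at index $n$, as described in the proof. It recovers the weights of a given convex increasing vector by forward substitution in the lower triangular system. The function names and the example vector are illustrative.

```python
def extreme_points(n):
    """Extreme points as lists: the all-ones vector, then for i = 2..n the
    vector that is 0 up to index i - 1 and linear up to 1 at index n."""
    pts = [[1.0] * n]
    for i in range(2, n + 1):
        pts.append([max(0.0, (t - i + 1) / (n - i + 1)) for t in range(1, n + 1)])
    return pts

def weights(u):
    """Solve u = a_1 w^1 + ... + a_n w^n by forward substitution."""
    n = len(u)
    w = extreme_points(n)
    a = []
    for t in range(n):
        partial = sum(a[i] * w[i][t] for i in range(t))
        a.append((u[t] - partial) / w[t][t])   # diagonal entries are non-zero
    return a

# Example: a convex increasing vector with final entry 1
n = 5
u = [(t / n) ** 2 for t in range(1, n + 1)]
a = weights(u)
print(a, sum(a))
```

For this example the weights are nonnegative and sum to 1, as the lemma asserts for any member of the set.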
Proof. Note that the Dir($n$; 1, 1, ..., 1) distributed vector $(X_1, \dots, X_{n-1})^T$ is uniformly distributed on the set
$$S_n = \{(x_1, \dots, x_{n-1})^T : x_i \ge 0 \text{ for } i = 1, \dots, n-1, \ x_1 + \dots + x_{n-1} \le 1\}.$$
Recall that $W^n$ is the matrix formed by the column vectors $w^1, \dots, w^n$ of Lemma 1. Let $W^n_{-n;-n}$ be the matrix obtained from $W^n$ by deleting the last row, consisting of 1s, and the last column, consisting of 0s except for the last entry. Define the $(n-1)$-dimensional vector variable
$$\xi = W^n_{-n;-n}(X_1, \dots, X_{n-1})^T = X_1 w^1_{1:n-1} + \dots + X_{n-1} w^{n-1}_{1:n-1}. \tag{6}$$
Since the first $n-1$ entries of the last column of $W^n$ are 0, it follows that $\xi$ consists of the first $n-1$ entries of $W^n(X_1, \dots, X_n)^T = X_1 w^1 + \dots + X_n w^n$. The last entry of this vector is 1, and consequently $X_1 w^1 + \dots + X_n w^n = (\xi^T, 1)^T$. By Lemma 1, the set $\bar{K}_n$ is spanned by such vectors, and accordingly the set $K_n$ of (3) is spanned by the vectors $\xi$. By (6), the variable $\xi$ is the image of $(X_1, \dots, X_{n-1})^T$ under the linear transformation defined by the matrix $W^n_{-n;-n}$. Since this transformation has a constant Jacobian determinant, and since $(X_1, \dots, X_{n-1})^T$ is uniformly distributed on $S_n$, it follows that the distribution of $\xi$ on $K_n$ is uniform as well. Scaling down $\xi$ by multiplying with $\theta_n$ distributed according to beta($\cdot; \alpha, \beta$) results in a uniform distribution for $\theta_{1:n-1}$ given $\theta_n$ on $\theta_n K_n$, in accordance with (5).
The posterior distribution resulting from updating this prior with the likelihood is proportional to the product of the prior density and the likelihood. The resulting posterior does not belong to the same class of distributions as the prior, but can be obtained using MCMC. In Section 6, we use Stan (Carpenter et al., 2017) for the analysis of the LIN data. The results show that using Definition 1 allows significant improvements over the uniform distribution on $J_n$ and even over (5) with $\alpha = \beta$ close to 0. The choice of parameters is motivated by the theoretical analysis of Section 5, where it is demonstrated that it is possible to choose Dirichlet parameters for $\pi^F_{\alpha,\beta,\gamma}(\theta)$ in such a way that the marginal prior variances for the mortality rate parameters are bounded away from 0, regardless of the number $n$ of measuring points. This should allow data to adjust the prior, which indicates that this generalization of (5) is useful from the point of view of scale neutrality. However, for completeness, we first consider parallels to (5) where either $\theta_1$ or both $\theta_1$ and $\theta_n$ follow noninformative beta priors, and define variants of these distributions corresponding to the prior distributions of Definition 1.
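The construction used in this proof suggests a direct way to simulate from the prior: draw Dirichlet weights, form the corresponding convex combination of the extreme points, and scale by a beta-distributed final rate. The sketch below is an illustration under stated assumptions, not the authors' implementation. The extreme points follow the description in the proof of Lemma 1, the function names are hypothetical, and the common Dirichlet parameter `gamma` is my reading of the generalization mentioned in connection with Definition 1 (with `gamma = 1.0` corresponding to the uniform conditional distribution). The beta parameters 0.5 are arbitrary example values.

```python
import random

def extreme_points(n):
    """All-ones vector, then for i = 2..n the vector that is 0 up to
    index i - 1 and rises linearly to 1 at index n (assumed from Lemma 1)."""
    pts = [[1.0] * n]
    for i in range(2, n + 1):
        pts.append([max(0.0, (t - i + 1) / (n - i + 1)) for t in range(1, n + 1)])
    return pts

def sample_prior(n, alpha, beta, gamma, rng):
    """One draw: Dirichlet(gamma, ..., gamma) weights on the extreme points,
    scaled by a beta(alpha, beta) distributed final rate."""
    g = [rng.gammavariate(gamma, 1.0) for _ in range(n)]
    total = sum(g)
    x = [gi / total for gi in g]              # Dirichlet weights via normalized gammas
    theta_n = rng.betavariate(alpha, beta)    # final mortality rate
    w = extreme_points(n)
    return [theta_n * sum(x[i] * w[i][t] for i in range(n)) for t in range(n)]

rng = random.Random(1)
theta = sample_prior(10, 0.5, 0.5, 1.0, rng)
print(theta)
```

Every draw is increasing and convex by construction, so no rejection step is needed to stay inside the constraint set.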

Modified priors involving the initial mortality rate
The discussion at the end of Section 3.3 suggests that the possibility of overestimation of a very small $\theta_1$ could be a matter of concern. This could be handled by an approach parallel to that used in the previous subsection. Recalling Equation (4), which defines the set $L_n$, the idea leading to (5) suggests the prior (7) (the superscript I indicating "initial"). To avoid the possible disadvantage of the $\pi^{GMS}(\theta)$ prior discussed at the end of Section 3.3, we suggest choosing $\alpha$, $\beta$ close to 0. The concerns raised with respect to the marginal distributions for $\theta_1$ and $\theta_n$ under the $\pi^{GMS}(\theta)$ prior can be handled simultaneously by combining (5) and (7), specifying a joint prior for these parameters and a uniform conditional distribution for $\theta_{2:n-1}$ given $\theta_1$ and $\theta_n$. Define $M_n = \{\theta_{2:n-1} \in \mathbb{R}^{n-2} : (0, \theta_{2:n-1}^T, 1)^T \in J_n\}$. Similarly to the proofs of Theorems 1 and 2, it can be seen that $\theta \in J_n$ if and only if $0 \le \theta_1 \le \theta_n \le 1$ and $\theta_{2:n-1} \in \theta_1 1_{n-2} + (\theta_n - \theta_1)M_n$. This leads to the prior (8) (the superscript IF indicating "initial and final"). The following lemma describes the sets of extreme points needed to characterize the distributions (7) and (8) analogously to the characterization of (5) in Theorem 3.

Lemma 2.
1 is a convex set spanned by the n − 1 extreme points w 2 , w 3 , … , w n .
The proof is given in the Appendix.
We can now give the following characterization of the distributions (7) and (8): Theorem 4.
The proof is given in the Appendix. Based on Theorem 4, and in analogy with Definition 1, it is natural to introduce the following generalizations of (7) and (8):

MARGINAL PRIOR EXPECTATIONS AND VARIANCES FOR MORTALITY RATE PARAMETERS
We think it is fair to claim that the wish for scale neutrality for the initial and final mortality rates is met by choosing the respective beta parameters of Definitions 1 and 2 close to 0. In this section, we discuss how the Dirichlet parameters of these definitions, which describe the conditional distributions for the remaining mortality rates, should be chosen in order to obtain a noninformative and approximately scale-neutral model also for the shape of the mortality rate curve. For simplicity, we limit attention to the priors discussed in Section 4.1. As a byproduct, we also derive the limiting behavior of the uniform distribution on $J_n$; see Proposition 2 and Corollary 1. These results confirm that this distribution is far from scale neutral. We start by considering the expected shape of the mortality rate curve.

Prior expectation of the mortality rate curve
For convenience, we first assume that the prior distribution (5) is chosen. By Theorem 3, this means that each Dirichlet parameter of Definition 1 equals 1. We are interested in how the expected shape of the mortality rate curve is affected by changing the scale, that is, by increasing $n$. A standardized representation of the mortality rate curve is given by the vector defined in (6), shown in Theorem 3 to follow the required uniform distribution on $K_n$ when the vector $(X_1, \dots, X_{n-1})^T$ follows the Dir($n$; 1, …, 1) distribution. This standardization simplifies the comparison of the curves for different values of $n$.
To emphasize the dependence on the number $n$ of measuring points, we denote the components of the vector defined in (6) by $\eta^n_t$, $t = 1, \dots, n-1$. In addition, we define $\eta^n_n = 1$. Now, for any positive real number $r$, let $[r]$ be the smallest integer for which $[r] \ge r$. Then $\eta^n_{[n\tau]}$ is the standardized mortality rate $\eta^n_t$ for an index $t$ which approximately satisfies $t/n = \tau$. For any integer $n$ define the step function $f_n(\tau) = E(\eta^n_{[n\tau]})$, $\tau \in (0, 1]$. Note that by Lemma 1 the $t$th row of the matrix $W_n$ is given by

$$(w_{t,1}, \dots, w_{t,n}) = \left(1, \frac{t-1}{n-1}, \frac{t-2}{n-2}, \dots, \frac{1}{n-t+1}, 0, \dots, 0\right). \qquad (9)$$

Hence, it follows from (6), (9) and the fact that $E(X_i) = 1/n$ for every $i$, that

$$f_n(\tau) = \frac{1}{n}\sum_{i=1}^{n} w_{[n\tau],i} = \frac{1}{n} + \frac{1}{n}\sum_{i=2}^{[n\tau]} \frac{[n\tau]-i+1}{n-i+1}. \qquad (10)$$

The last term in (10) is a Riemann-sum approximation to the integral

$$\int_0^{\tau} \frac{\tau - s}{1 - s}\,ds = \tau + (1-\tau)\log(1-\tau).$$

Hence, as $n$ increases the step functions $f_n(\tau)$ concentrate around the curve

$$f(\tau) = \tau + (1-\tau)\log(1-\tau). \qquad (11)$$

We may define the random vector $\eta^n$ through (6) with $(X_1, \dots, X_{n-1})$ Dirichlet distributed with any Dirichlet parameter vector used in Definition 1. Since $E(X_i) = 1/n$ also under the Dir($n$; $\gamma, \dots, \gamma$) distribution for any value of $\gamma > 0$, the argument leading to (11) also applies to the prior distributions of Definition 1 if all components of the Dirichlet parameter vector are identical. Hence, rescaling has essentially no influence on this part of the prior modeling. The result is also valid for the original uniform prior on $J_n$ of Gelman et al. (1996). This can be proved by a slightly modified argument involving the extra extreme point vector $\mathbf{0}$, as well as a Dirichlet distribution with marginal expectations $1/(n+1)$ rather than $1/n$. Hence, we have the following proposition:

Proposition 2. Under the uniform prior distribution on J n , it holds that for any
Interestingly, Gelman (1996) claims that, in our notation, the limiting curve is instead the quadratic $\tau^2$. The argument is that under convexity, the differences $\eta^n_t - \eta^n_{t-1}$ form a positive increasing sequence, and under the uniformity assumption such sequences approach linearity as $n$ increases. Hence, the sequence $\eta^n_t$ itself should approach a quadratic. This argument is not valid because the restriction $(\eta^n_2 - \eta^n_1) + \cdots + (\eta^n_n - \eta^n_{n-1}) = 1 - \eta^n_1 \le 1$ is not taken into consideration. The vector $(\eta^n_1, (\eta^n_2 - \eta^n_1), \dots, (\eta^n_n - \eta^n_{n-1}))$ is a one-to-one linear transformation of $(\eta^n_1, \dots, \eta^n_n)$ and is hence uniformly distributed if this latter vector is assumed to have a uniform distribution. But since the sum of the components is bounded by 1, this distribution is restricted to a proper subset of the set of increasing functions. Hence, the models for this vector and a general uniformly distributed increasing vector are not identical, as asserted in Gelman (1996). In particular, the curve does not necessarily become concentrated around a straight line in the limit.
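The contrast between the logarithmic limit and a quadratic can be checked numerically. The sketch below rests on two assumptions that are not spelled out in this section: that the extreme points of Lemma 1 are the constant vector of ones together with the piecewise linear vectors that are 0 up to index $i-1$ and rise linearly to 1 at index $n$, and that the limiting curve is the one whose derivative is the $-\log(1-\cdot)$ quoted in this section. The names `extreme_weight`, `expected_curve` and `limit_curve` are hypothetical.

```python
import math


def extreme_weight(t, i, n):
    """Assumed entry w_{t,i} of W_n (1-based t and i)."""
    if i == 1:
        return 1.0  # w_1 is taken to be the vector of ones
    return max(t - i + 1, 0) / (n - i + 1)


def expected_curve(tau, n):
    """f_n(tau): mean of the standardized rate when E(X_i) = 1/n for all i."""
    t = math.ceil(n * tau)  # smallest integer >= n * tau
    return sum(extreme_weight(t, i, n) for i in range(1, n + 1)) / n


def limit_curve(tau):
    """Curve with value 0 at 0 and derivative -log(1 - tau); needs tau < 1."""
    return tau + (1.0 - tau) * math.log(1.0 - tau)


if __name__ == "__main__":
    for tau in (0.25, 0.5, 0.75):
        # step function, assumed limit, and the quadratic for comparison
        print(tau, expected_curve(tau, 2000), limit_curve(tau), tau ** 2)
```

Under these assumptions the step function at a fine grid stays close to the logarithmic curve and clearly below the quadratic in the interior of the interval.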
The perceived quadratic form of the mortality rate curve in Gelman (1996) is taken as a partial explanation for the apparent overestimation of the mortality rates for higher age classes, the form of the curve forcing these mortality rates upwards. To the extent that such an effect is important, it is even stronger because of the logarithmic shape of $f(\tau)$, which has the derivative $-\log(1-\tau)$ and the second derivative $1/(1-\tau)$. However, we doubt that there exists a "neutral" shape, ideal for being updated by data toward a true mortality rate curve irrespective of the true shape of this curve. But if such a shape is believed to exist, one could obtain an a priori expected curve of this shape by changing the parameters of the Dirichlet distribution for the $X_i$. The parameters should be chosen such that the weights $E(X_i)$ are equal to (or proportional to) the components of the vector obtained by applying the inverse of $W_n$ to a vector of discrete approximations to the desired curve. This strategy could also be followed if it is desirable to build into the prior an informative preference for a certain shape, such as the shape of the failure rate of a Gompertz-Makeham distribution.
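The strategy of applying the inverse of $W_n$ to a discretized target curve can be sketched as follows. This is an illustration under the assumption that $W_n$ has the lower triangular form implied by piecewise linear extreme points (a constant column of ones, and columns that are 0 up to index $i-1$ and rise linearly to 1 at index $n$); the function names and the quadratic target are hypothetical choices made for the example.

```python
def extreme_weight(t, i, n):
    """Assumed entry w_{t,i} of W_n (1-based t and i)."""
    if i == 1:
        return 1.0
    return max(t - i + 1, 0) / (n - i + 1)


def weights_for_curve(target):
    """Solve W_n x = target by forward substitution (W_n is lower triangular)."""
    n = len(target)
    x = []
    for t in range(1, n + 1):
        known = sum(extreme_weight(t, i, n) * x[i - 1] for i in range(1, t))
        x.append((target[t - 1] - known) / extreme_weight(t, t, n))
    return x


def curve_from_weights(x):
    """Recompute the curve W_n x, used here only as a check."""
    n = len(x)
    return [sum(extreme_weight(t, i, n) * x[i - 1] for i in range(1, n + 1))
            for t in range(1, n + 1)]


if __name__ == "__main__":
    n = 30
    target = [(t / n) ** 2 for t in range(1, n + 1)]  # desired expected shape
    x = weights_for_curve(target)
    print(sum(x), min(x))  # weights form a probability vector for this target
```

For an increasing, convex target ending at 1, the resulting weights are nonnegative and sum to 1, so they can serve as the means $E(X_i)$ of a Dirichlet distribution; the overall size of the Dirichlet parameters remains a separate choice.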
The overestimation for higher age classes is also partially attributed by Gelman (1996) to small marginal prior variances for the mortality rates when the number of measuring points is rather large ($n = 30$). Undoubtedly, this is a very significant point. We think that choosing Dirichlet parameters giving variances for each standardized mortality rate that are large enough to allow the data to determine the shape of the posterior expected mortality rate curve, irrespective of the shape of the prior expectation, is likely to be more important than this shape. The limiting behavior of the marginal variances with increasing $n$ is analyzed in the next subsection. How to choose such parameters is discussed in Section 5.3.

Prior variances of the standardized mortality rates
In an example discussed in Sections 3.1 and 3.2 of Gelman (1996), it is shown that under a uniform prior the marginal variances of parameters representing a gridded approximation to a continuous increasing curve converge to 0 as the distance between the grid points converges to 0. This is undesirable behavior, since it may imply that a fixed data set is unable to pull the curve, which has a linear expectation under the uniform prior, toward the true shape if too fine a grid is chosen. In the words of Gelman (1996): "The strength of the prior distribution thus depends on the discretization, with potentially grave consequences." Based on considering the increasing and convex mortality rate model "as a slight variant of the example of sections 3.1-3.2," Gelman (1996) argues in Section 3.3 that this model is subject to the same problem. Part of the argument is based on the incorrect expected linear shape of the curve of mortality rate differences mentioned in the previous subsection, but the result is confirmed in Corollary 1 below. The corollary follows from a special case of Theorem 5 below. Under the prior distributions $F_{\alpha,\beta,\gamma,\dots,\gamma}$ of Definition 1, the standardized mortality rate variances $v_n(\tau) = \mathrm{var}(\eta^n_{[n\tau]})$, $\tau \in (0, 1]$, can be computed using (6) and (9) in a similar way as in the computation of the expectations $f_n(\tau)$ in (10) of the previous subsection. After carrying out this calculation and going to the limit as $n \to \infty$, we arrive at the following theorem, where the main point is the existence of a limiting curve rather than its exact analytic form:

Theorem 5. Under the prior distributions $F_{\alpha,\beta,\gamma,\dots,\gamma}$ of Definition 1, the standardized mortality rate variances satisfy

$$\lim_{n\to\infty} (n\gamma + 1)\, v_n(\tau) = \int_0^{\tau} \left(\frac{\tau - s}{1 - s}\right)^2 ds - f(\tau)^2. \qquad (13)$$

Proof. Noting that $\mathrm{var}(X_i) = (n-1)/(n^2(n\gamma+1))$ and, for $i \ne j$, $\mathrm{cov}(X_i, X_j) = -1/(n^2(n\gamma+1))$, we obtain from (6)

$$v_n(\tau) = \frac{n-1}{n^2(n\gamma+1)} \left\{\sum_{i=1}^{n} w_{[n\tau],i}^2\right\} + \frac{-1}{n^2(n\gamma+1)} \left\{\sum_{i \ne j} w_{[n\tau],i}\, w_{[n\tau],j}\right\}.$$

For convenience we add the sum $\sum_{i=1}^{n} w_{[n\tau],i}^2/(n^2(n\gamma+1))$ to the first summand and subtract it from the second. Using (9) and then (10), this yields

$$v_n(\tau) = \frac{1}{n(n\gamma+1)} \left\{\sum_{i=1}^{n} w_{[n\tau],i}^2 - n f_n(\tau)^2\right\}, \qquad (12)$$

where we have used (10) in the last step. The sum in (12), divided by $n$, is an approximation to the integral

$$\int_0^{\tau} \left(\frac{\tau - s}{1 - s}\right)^2 ds.$$

Since the limit of $f_n(\tau)$ as $n \to \infty$ is $f(\tau)$ given by (11), it follows that (13) holds. ▪

It follows from Theorem 5 that for a fixed $\gamma > 0$ the standardized mortality rate variance $v_n(\tau)$ converges to 0 as $n \to \infty$. Hence, using a common fixed $\gamma$ regardless of $n$ is subject to the above-mentioned concern expressed in Gelman (1996). This applies also to the marginal variances for the mortality rate vector $\theta$ when distributed according to the uniform distribution on $J_n$:

Corollary 1. Let $\tau \in (0, 1)$ be arbitrary. Then under the uniform prior on $J_n$ we have $\mathrm{var}(\theta_{[n\tau]}) = O(1/n)$.
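Since by Theorem 3 $\theta_n$ and $\eta^n_{[n\tau]}$ are independent, we obtain by conditioning on $\theta_n$

$$\mathrm{var}\left(\theta_n \eta^n_{[n\tau]}\right) = E(\theta_n^2)\,\mathrm{var}\left(\eta^n_{[n\tau]}\right) + E\left(\eta^n_{[n\tau]}\right)^2 \mathrm{var}(\theta_n) \le \mathrm{var}\left(\eta^n_{[n\tau]}\right) + \mathrm{var}(\theta_n).$$

The first term on the right-hand side converges to 0 by Theorem 5, and the second because the marginal variance of $\theta_n$ under the uniform prior vanishes as $n$ increases.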
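The variance computation in the proof of Theorem 5 can be checked by simulation. The sketch below compares a closed form obtained from the Dirichlet variances and covariances quoted in the proof with a Monte Carlo estimate. It assumes equal Dirichlet parameters `gamma` and the piecewise linear form of the extreme points (neither the symbol name nor the extreme point formula is shown in this section), and all function names are hypothetical.

```python
import random


def extreme_weight(t, i, n):
    """Assumed entry w_{t,i} of W_n (1-based t and i)."""
    if i == 1:
        return 1.0
    return max(t - i + 1, 0) / (n - i + 1)


def closed_form_variance(t, n, gamma):
    """Variance of sum_i w_{t,i} X_i for X ~ Dirichlet(gamma, ..., gamma)."""
    w = [extreme_weight(t, i, n) for i in range(1, n + 1)]
    m1 = sum(w) / n
    m2 = sum(v * v for v in w) / n
    return (m2 - m1 * m1) / (n * gamma + 1.0)


def sample_dirichlet(n, gamma, rng):
    """Draw from Dirichlet(gamma, ..., gamma) via normalized gamma variates."""
    draws = [rng.gammavariate(gamma, 1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


def simulated_variance(t, n, gamma, draws=20000, seed=0):
    """Monte Carlo estimate of the same variance."""
    rng = random.Random(seed)
    w = [extreme_weight(t, i, n) for i in range(1, n + 1)]
    vals = []
    for _ in range(draws):
        x = sample_dirichlet(n, gamma, rng)
        vals.append(sum(wi * xi for wi, xi in zip(w, x)))
    mean = sum(vals) / draws
    return sum((v - mean) ** 2 for v in vals) / (draws - 1)


if __name__ == "__main__":
    print(closed_form_variance(5, 10, 1.0), simulated_variance(5, 10, 1.0))
```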

Consequences for the choice of Dirichlet parameters
We knew already from Theorem 1 and the subsequent discussion that the uniform distribution is not satisfactory as a noninformative prior on $J_n$. The main role of Theorem 5 is therefore to guide us in choosing Dirichlet parameters in Definition 1. Consider for instance $F_{\alpha,\beta}$ defined in (5), which is a special case of Definition 1 with $(\gamma_1, \dots, \gamma_n) = (1, \dots, 1)$. The conclusion of Corollary 1 does not hold when choosing $\alpha = \beta = \epsilon$ close to 0, since by (10), (11) and the above proof we then have $\mathrm{var}(\theta_{[n\tau]}) \approx (1/4) f(\tau)^2$ for large $n$. But for the shape of the curve, the distribution (5) becomes increasingly problematic when $n$ increases, regardless of the choice of $\alpha$ and $\beta$, due to Theorem 5. Hence, it seems unlikely that uniformity can play a useful role in noninformative prior modeling of convex, increasing mortality rates. By Theorem 5, the only choice of $\gamma$ for which $v_n(\tau)$ stays bounded away from 0 for every value of $n$ is $\gamma = 0$, corresponding to the improper Dirichlet density proportional to $x_1^{-1} x_2^{-1} \cdots x_n^{-1}$. But for a given value of $n$ we may approximate this arbitrarily well by choosing $\gamma$ sufficiently small. It would seem natural from the point of view of scale neutrality to choose $\gamma$ inversely proportional to $n$. The standardized variance $v_n(\tau)$ then approaches a limit proportional to (13), and is of the same order of magnitude as this limit. Hence, the desired stability of the prior variances of the standardized mortality rates under rescaling can be achieved.
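The effect of the two scaling choices can be illustrated with the closed-form variance used in the proof of Theorem 5. The sketch below assumes equal Dirichlet parameters, here called `gamma`, and the piecewise linear extreme points; the function names and the constant `c` in `gamma = c / n` are hypothetical illustration choices.

```python
import math


def extreme_weight(t, i, n):
    """Assumed entry w_{t,i} of W_n (1-based t and i)."""
    if i == 1:
        return 1.0
    return max(t - i + 1, 0) / (n - i + 1)


def standardized_variance(tau, n, gamma):
    """v_n(tau) under Dirichlet(gamma, ..., gamma) weights."""
    t = math.ceil(n * tau)
    w = [extreme_weight(t, i, n) for i in range(1, n + 1)]
    m1 = sum(w) / n
    m2 = sum(v * v for v in w) / n
    return (m2 - m1 * m1) / (n * gamma + 1.0)


if __name__ == "__main__":
    c = 1.0  # hypothetical total mass of the Dirichlet parameters
    for n in (100, 400, 1600):
        fixed = standardized_variance(0.5, n, 1.0)     # gamma kept at 1
        scaled = standardized_variance(0.5, n, c / n)  # gamma = c / n
        print(n, fixed, scaled)
```

With `gamma` fixed the variance shrinks roughly like $1/n$, whereas with `gamma = c / n` it settles near a positive value, which is the stability under rescaling discussed above.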
Because the limit of (13) as $\tau \to 0$ is 0, implying that, under the conditions of Theorem 5 with a fixed value of $\gamma$, we have $\lim_{n\to\infty} \mathrm{var}(\eta^n_1) = 0$, it might be a good idea to choose a parameter vector for the Dirichlet distribution of $(X_1, \dots, X_{n-1})^T$ of the form $(c, c/n, \dots, c/n)$ instead of $(\gamma, \gamma, \dots, \gamma)$. Since by (6) $\eta^n_1 = X_1$, this choice would give a beta distribution with parameters $c$ and $(n-1)c/n$ for $\eta^n_1$. This will ensure $\mathrm{var}(\eta^n_1) \approx 1/4$ if $c$ is small enough, allowing data to adjust the distribution of $\eta^n_1$.

If another limiting shape for the prior expectation of the mortality rate curve than (11) is preferred, this can be achieved in our framework by choosing a non-constant vector of Dirichlet parameters, as explained in Section 5.1. Also in this case sensible scaling is important. Fortunately, the argument leading to (13) can be generalized to the situation when the Dirichlet parameters $\gamma_1, \gamma_2, \dots, \gamma_n$ are allowed to be unequal. We conclude this section with a sketch of the proof. Assume that these parameters under different scalings are derived from a common continuous bounded function $g(\tau)$ on the interval $[0, 1]$. For a given value of $n$, define accordingly scaled parameters $\gamma_i = (1/n) g(i/n)$, $i = 1, \dots, n$. If we require $\int_0^1 g(\tau)\,d\tau = 1$, we obtain that $\Gamma \stackrel{\mathrm{def}}{=} \sum_{i=1}^{n} \gamma_i \approx 1$. The case treated in Theorem 5, combined with the requirement that the Dirichlet parameters add up to 1, independently of $n$, corresponds to $g(\tau) = 1$. We now have $\mathrm{var}(X_i) = \gamma_i(\Gamma - \gamma_i)/(\Gamma^2(\Gamma + 1))$ and $\mathrm{cov}(X_i, X_j) = -\gamma_i\gamma_j/(\Gamma^2(\Gamma + 1))$. These quantities can be incorporated in the expansion of $\mathrm{var}(\eta^n_{[n\tau]})$ as in (12). When adapting the proof to this situation, the sums corresponding to the sums in (12) converge to integrals involving $g(\tau)$, and the sum $\Gamma$ corresponds to the quantity $n\gamma$ appearing in Theorem 5. Since requiring $g$ to have unit integral implies that $\Gamma$ is close to 1, the required stability under rescaling of the marginal prior variances of the standardized mortality rates can be obtained also in this case.
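The scaling with unequal Dirichlet parameters can be sketched in the same way. The code below builds parameters of the form $(1/n) g(i/n)$ from a function `g` with unit integral and evaluates the variance of a standardized rate from the Dirichlet variances and covariances given above. The extreme point formula, the example choice `g(s) = 2s` and all function names are assumptions made for illustration.

```python
import math


def extreme_weight(t, i, n):
    """Assumed entry w_{t,i} of W_n (1-based t and i)."""
    if i == 1:
        return 1.0
    return max(t - i + 1, 0) / (n - i + 1)


def dirichlet_params(g, n):
    """Scaled parameters (1/n) * g(i/n), i = 1, ..., n."""
    return [g(i / n) / n for i in range(1, n + 1)]


def standardized_variance(tau, params):
    """Variance of sum_i w_{t,i} X_i for X ~ Dirichlet(params), t = ceil(n*tau)."""
    n = len(params)
    t = math.ceil(n * tau)
    total = sum(params)  # plays the role of the sum of Dirichlet parameters
    w = [extreme_weight(t, i, n) for i in range(1, n + 1)]
    m1 = sum(p * wi for p, wi in zip(params, w)) / total
    m2 = sum(p * wi * wi for p, wi in zip(params, w)) / total
    return (m2 - m1 * m1) / (total + 1.0)


if __name__ == "__main__":
    g = lambda s: 2.0 * s  # hypothetical weight function with unit integral
    for n in (400, 1600):
        params = dirichlet_params(g, n)
        print(n, sum(params), standardized_variance(0.5, params))
```

The parameter sum stays close to 1 for every grid size, and the variance at a fixed fraction of the age range changes little between grids, in line with the sketch of the proof.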

APPLICATION TO THE LIFE INSURANCE DATASET
We first address the LIN dataset (Broffitt, 1988) on the graduation of mortalities, shown in Table 1. Additionally, this dataset was extended to monthly data (LINM) through a uniform upsampling procedure covering ages 35 to 64, resulting in 360 observed time points in LINM instead of 30 in LIN. The LINM dataset is available in Table 2 in the Appendix. For the LIN and LINM data, we follow Gelman (1996) in assuming that the observed deaths $y_t$ at each age $t$ follow independent binomial distributions with rates equal to the unknown mortality rates $\theta_t$ and known population sizes $N_t$. Further, again following Gelman (1996), because of the large population sizes for each age group and the low empirical death rates, we use the Poisson approximation for mathematical convenience. Thus, the following model is assumed:

$$y_t \sim \mathrm{Poisson}(N_t \theta_t), \quad t = 1, \dots, n, \text{ independently}.$$

We then performed inference with the uniform GMS($\theta$) prior (Gelman, 1996) specified in Equation (2), as well as with the priors proposed in this paper in Equations (5), (7), and (8).
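The Poisson approximation described above leads to a simple log-likelihood. The following is a minimal sketch, assuming independent counts with means equal to population size times mortality rate; the function name `poisson_loglik` and the numbers in the usage example are hypothetical and are not taken from Table 1.

```python
import math


def poisson_loglik(theta, deaths, exposed):
    """Log-likelihood of independent Poisson counts with means exposed * theta."""
    total = 0.0
    for rate, y, size in zip(theta, deaths, exposed):
        mean = size * rate
        total += y * math.log(mean) - mean - math.lgamma(y + 1)
    return total


if __name__ == "__main__":
    deaths = [3, 5, 9]             # made-up counts, for illustration only
    exposed = [1000, 1200, 1500]   # made-up population sizes
    crude = [y / m for y, m in zip(deaths, exposed)]  # unconstrained estimates
    print(poisson_loglik(crude, deaths, exposed))
```

In the constrained model the rates are additionally required to be increasing and convex, so the crude rates used in the example serve only as a reference point.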
We used Stan (Carpenter et al., 2017) for Bayesian inference. In all of the experiments, 20 parallel chains were run for 100,000 iterations each with a warm-up of 10,000 iterations. Otherwise, the default tuning parameters from Stan (Carpenter et al., 2017) were used.
In Figure 1, we present extensions of Figure 2 from Gelman et al. (1996) and Figures 2 and 3 from Gelman (1996). Figure 1 presents posterior means under the studied priors as well as 95% credible intervals obtained within the same settings. These figures show that replacing the marginal prior beta($\theta_{30}$; 30, 1) with a noninformative beta prior while retaining the uniform prior for the rest of the parameters results in a drastic improvement of the posterior distribution. On the other hand, a corresponding change in the marginal prior for $\theta_1$ seems to have a negligible effect. The same conclusions, with an even stronger improvement, are valid for the LINM data, as shown in Figure 2.
Further, for both datasets, we performed inference with the initial-and-final prior of Definition 2 and different values of the parameters $\gamma_2, \dots, \gamma_n$ of the Dirichlet distribution of $(X_2, \dots, X_{n-1})^T$. More specifically, we set all the $\gamma$'s to a common value taken from the sequence $\gamma^{(i)} = (1 - \exp(-1))^i$. For LIN, we allowed $i \in \{1, \dots, 20\}$. For LINM, for computational reasons we only used $i \in \{1, 4, 7, 11, 14, 17, 20\}$. The results for different values of $\gamma^{(i)}$ are shown in the left and right panels of Figure 3 for the LIN and LINM datasets, respectively. For the LIN dataset from Broffitt (1988), we clearly see three clusters for three ranges of $\gamma^{(i)}$. The general tendency, as expected, is that smaller values of $\gamma^{(i)}$ shift the mortality rates closer to zero. The same conclusions are overall valid for the LINM data. Thus, for both datasets the curve becomes flatter and seems to fit the observed data better with smaller values of $\gamma^{(i)}$.

CONCLUSIONS
Results from the analysis of a specific life insurance dataset in Gelman et al. (1996) indicate that a uniform prior on the set of increasing, convex mortality rates is not a satisfactory choice of noninformative prior distribution on this set. Theorems 1 and 2 of the present paper provide theoretical support for this empirical evidence and, together with Proposition 2 and Corollary 1, indicate an underlying mathematical explanation. By establishing the marginal distributions for the final and the initial mortality rates, and the marginal expectations and variances for parameters characterizing the shape of the mortality rate curve, these results also show that the uniform prior falls short of a requirement for scale neutrality, which is an attractive goal to pursue for noninformative priors. Our results are in line with the explanation given in Section 3.3 of Gelman (1996), which is that the "prior distribution becomes ever stronger as the scale of the time intervals becomes smaller." When noninformative marginal distributions for the initial and/or final mortality rates are defined directly, uniformity could still potentially be used in the conditional distribution for the rest of the mortality rates. Alternative characterizations of such distributions, given in Theorems 3 and 4, motivate a whole new class of priors, described in Definitions 1 and 2 through the choice of parameters for Dirichlet distributions. The uniform distribution corresponds to all the Dirichlet parameters being 1, and as a consequence their sum equals the number $n$ of measuring points. Theorem 5, as well as the numerical results from Section 6, indicates that such a uniform conditional distribution on $\theta_{1:n-1}$ given $\theta_n$ is still not satisfactory from the point of view of scale neutrality. The subsequent discussion suggests that the desired scale neutrality may be obtained by instead requiring the sum of the Dirichlet parameters to stay constant, regardless of the number of measuring points.
Hence, the prior distribution has to be changed according to the number of measuring points. This is in accordance with the goal expressed in Gelman (1996), Section 5: "Instead of requiring that a single probability model be invariant under scaling … we demand a family of models, indexed by scale, that are mutually consistent." Alternatively, the improper Dirichlet distribution with all parameters equal to 0 might serve as a single default option suitable for any number of measuring points.

ACKNOWLEDGMENTS
We are very grateful to Professor Ida Scheel and two referees for very valuable advice and many helpful suggestions.

APPENDIX
Using that $\theta_1 \le \theta_t \le 1$ implies that $0 \le \theta_t - \theta_1 \le 1 - \theta_1$, $t = 2, \dots, n$, it is not too hard to see that $\theta \in J_n$ if and only if $\theta = \mathbf{1}_n$ or $0 \le \theta_1 < 1$ and $(1/(1 - \theta_1))(\theta_{2:n} - \theta_1 \mathbf{1}_{n-1}) \in L_n$, or equivalently, $0 \le \theta_1 \le 1$ and $\theta_{2:n} \in (\theta_1 \mathbf{1}_{n-1}) + (1 - \theta_1) L_n$. Hence, calculating the volume of the set of $\theta \in J_n$ such that $\theta_1 \le x$ by integrating along the $\theta_1$-axis, we obtain

$$\mathrm{Vol}\left(\{\theta \in J_n : \theta_1 \le x\}\right) = \int_0^x (1 - \theta_1)^{n-1}\, \mathrm{Vol}(L_n)\, d\theta_1.$$

By the uniformity assumption, the probability under (2) that $\theta_1 \le x$ is proportional to this volume, and it follows that the density for $\theta_1$ under (2) is proportional to $(1 - \theta_1)^{n-1}$, proving the theorem. ▪

Proof of Lemma 2. For both sets, the fact that the given vectors are extreme points is proved by the same argument as in the proof of Lemma 1. Let $u = (u_1, \dots, u_n)^T$ be an arbitrary element of $L_n$. By adding $1 - u_n$ to each coordinate of $u$ we obtain a vector belonging to the set $K_n$ of Lemma 1. Formally, $u + (1 - u_n)\mathbf{1}_n \in K_n$. This vector can be written uniquely as a convex combination of the extreme points given in Lemma 1, that is, $u + (1 - u_n)\mathbf{1}_n = a_1 w_1 + \cdots + a_n w_n$.
Since w 1 = 1 n , this implies u = (a 1 − 1 + u n )1 n + a 2 w 2 + · · · + a n w n . Since u 1 = 0, we must have a 1 = 1 − u n , and hence u = a 2 w 2 + · · · + a n w n = a 0 w 0 + a 2 w 2 + · · · + a n w n , where a 0 = 1 − (a 2 + · · · + a n ) ≥ 0. This proves the first part of Lemma 2. Now suppose instead that u ∈ M n . Then u is also a member of K n , and has a unique expression as a convex combination u = a 1 w 1 + · · · + a n w n by Lemma 1. Since u is also a member of L n , and since w 1 = 1 n , we must have a 1 = 0, and the assertion follows.