Statistical Inference for k (k ≥ 3) Lognormal Means from Left-Censored Data

Censored data arising from measurements below a detection limit are a common problem in environmental monitoring of the quality and quantity of water, soil, and air samples. The log-normal distribution is one of the most common distributions for modeling skewed, positive data, and it is the model considered here for measured environmental and biomedical data. Over the past decades, various methods have been proposed for comparing the parameters of two lognormal distributions in the presence of censored data; they differ mainly in how the test statistic leads to acceptance or rejection of the null hypothesis. Lognormal means can be compared either by confidence intervals or by hypothesis testing procedures. In this article, a new test procedure for comparing the means of k (k ≥ 3) lognormal distributions in the presence of left-censored data is introduced and evaluated. The proposed procedure uses an asymptotic chi-square test. A simulation study, carried out with a computer program written in the R language, was performed to examine the size and power of the proposed test. The simulation results indicate that the test performs well with respect to both size and power.


Introduction
Left-censored data commonly arise in environmental contexts. Left-censored observations (observations reported as less than a detection limit, DL) can occur when the substance or attribute being measured is either absent or present at a concentration too low to be detected above the DL. Data sets containing such observations are referred to as left-censored data. In many environmental applications, variables such as concentration, inhalation, digestion, and consumption rates are positive and skewed to the right, so censored observations lie between zero and the DL. In some instances a log transformation provides a more natural scale on which to analyze such measurements. Many environmental data sets are characterized by a small number of high concentrations and a large number of low concentrations and are often right-skewed (Shumway et al., 1989). The log-normal distribution is positively skewed and can therefore accommodate the few unusually high measurements of such data in its long right-hand tail. For this reason the log-normal distribution is often applied to environmental data (Gilbert, 1987). In the analysis of environmental and exposure data, a very common phenomenon is the occurrence of non-detects, i.e., observations below an analytical detection limit, which results in Type I singly left-censored samples. The detection limit is the lowest concentration level that can be determined to be statistically different from a blank. The presence of observations below the DL significantly complicates the data analysis, and several strategies have been recommended for such data. One approach replaces the values below the DL with a constant such as DL/2 and then applies methods available for complete samples.
It is easy to demonstrate that the conclusions resulting from this routine practice can be seriously flawed; in fact, the conclusions may depend on the value substituted for the sample values below the DL. In general, censoring means that observations at one or both tails are not available. A sample is multiply censored if there are several detection limits. When more than two distinct detection limits $DL_1, DL_2, \ldots, DL_{m_c}$ ($m_c \ge 3$) are reported, the data are said to be multiply left-censored. The samples considered in this paper are Type I multiply left-censored. Suppose that a sample of $n$ data points is given, of which $m$ are non-censored (fully measured) and the remaining $m_c = n - m$ are left-censored with multiple detection limits $DL_1, DL_2, \ldots, DL_{m_c}$. In such Type I censored samples the detection limits are fixed, whereas $m$ and $m_c$ are random. Environmental data frequently contain several detection limits, because detection limits can change over time (e.g., because of analytical improvements) or can depend on the type of sample or the background matrix. Millard and Deverel (1988) give three possible causes for multiple censoring on the left when measuring the concentration of zinc in shallow groundwater. First, more than one method may be available, and each method may be optimal in a different range of zinc concentration. A second cause involves the amount of dilution that a lab technician may use; the detection limit depends on the number of dilutions. A third cause may be decreasing detection limits over time as the measurement technique improves.
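The sensitivity of substitution methods to the chosen constant can be illustrated with a small sketch. The sample, the detection limit, and the helper name `substituted_mean` below are hypothetical choices made for illustration only; they are not taken from the paper's data.

```python
import random

def substituted_mean(values, dl, fraction):
    """Mean after replacing every value below dl with fraction * dl."""
    filled = [v if v >= dl else fraction * dl for v in values]
    return sum(filled) / len(filled)

random.seed(1)
# Hypothetical lognormal sample (log-scale mean 0, log-scale sd 1).
sample = [random.lognormvariate(0.0, 1.0) for _ in range(200)]
dl = 1.0  # hypothetical detection limit near the median of this distribution

# The estimated mean moves with the substituted constant (0, DL/2, DL).
for fraction in (0.0, 0.5, 1.0):
    print(fraction, round(substituted_mean(sample, dl, fraction), 3))
print("mean of the uncensored sample:", round(sum(sample) / len(sample), 3))
```

The three substituted means are ordered by the substitution fraction, which is the dependence on the substitution value described above.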
Nondetect values cause an especially difficult problem when the goal is to compare k (k ≥ 3) different populations. There is a large literature on statistical inference for the parameters of normal and log-normal populations from both fully measured and censored data. Gupta and Li (2006) developed a score test for the equality of the means of two independent log-normal populations from fully measured data. Zhou et al. (1997) considered two methods for comparing the means of two independent log-normal non-censored samples. Harris (1991) considered two parametric and two non-parametric methods for testing the equality of the medians of two independent log-normal distributions when some data are left-censored. Paul and Gary (2007) compared the performance of several methods for analyzing censored data sets when estimating the 95th percentile and the mean of right-skewed occupational exposure data. Tests and confidence intervals for the ratio of the means of two lognormal distributions, based on pivotal quantities involving the maximum likelihood estimators, have also been proposed. Other methods for comparing the means of two log-normal distributions are discussed in Krishnamoorthy et al. (2003, 2006, 2007, 2011). Some of these methods are based on the generalized p-value and generalized confidence intervals, and others are based on the generalized test variable. Aboueissa (2015) introduced a test procedure for comparing the means of two independent lognormal populations when the data are singly censored. Abdollahnezhad et al. (2012) introduced a test for comparing the means of two log-normal populations based on a generalized measure of evidence against the null hypothesis. Prentice (1978) developed linear rank tests for right-censored data.
Millard and Deverel (1988) adapted several existing non-parametric procedures for right-censored data so that they can be used in environmental settings with left-censored data. Methods for estimating the log-normal parameters in the one-sample case with left-censored data are discussed by El-Shaarawi (1989). Stoline (1993) extended results first suggested by Harris (1991) and proposed a procedure for comparing the medians of two independent log-normal distributions where some data may be left-censored. Stoline (1993) used the Expectation Maximization (EM) algorithm introduced by Dempster et al. (1977) to calculate the maximum likelihood estimates of the population parameters µ and σ. Other methods for estimating population parameters from censored samples are discussed in Marco (2005), Jin et al. (2011), Gibbons (1994), Gleit (1985), El-Shaarawi and Esterby (1992), El-Shaarawi and Dolan (1989), Gilbert (1987), Stavros (2004), and Schneider (1986).
The purpose of this paper is to develop a parametric procedure for testing the equality of k (k ≥ 3) lognormal distribution means when the data are multiply left-censored. This procedure may be used, for example, to compare the concentration of a pollutant in shallow groundwater among k ≥ 3 geological zones found in different geographical areas; the pollutant may be copper at low concentrations (micrograms per liter of water) in areas that have different types of soil. The EM algorithm is used to obtain the maximum likelihood estimates of the population parameters under the different hypotheses. A simulation study was performed to examine the size and the power of the proposed test procedure. To facilitate the application of this procedure, a computer program was written in the R language; it calculates the maximum likelihood estimates, the asymptotic chi-square test statistics, and their p-values.

Assumptions and Notations
Assume that there are $k$ random samples of $n_i$ data values, $y_{i1}, y_{i2}, \ldots, y_{im_i}, y_{im_i+1}, \ldots, y_{in_i}$, taken from $k$ independent log-normal populations $LN(\mu_i, \sigma_i)$, $i = 1, 2, \ldots, k$, where $LN(\mu, \sigma)$ denotes a log-normally distributed variable $y$ with probability density function

$$f(y; \mu, \sigma) = \frac{1}{y\,\sigma\sqrt{2\pi}} \exp\left\{-\frac{(\log y - \mu)^2}{2\sigma^2}\right\}, \qquad y > 0,$$

where $-\infty < \mu < \infty$ and $\sigma > 0$. For convenience, assume that for each sample $i$ the first $m_i$ observations $y_{i1}, y_{i2}, \ldots, y_{im_i}$ are non-censored (fully measured) and the remaining $m_{c_i} = n_i - m_i$ observations are left-censored, $i = 1, 2, \ldots, k$. For the left-censored observations it is only known that $y_{ij} < LDL_{ij}$ for $j = m_i + 1, \ldots, n_i$ (or $j = 1, 2, \ldots, m_{c_i}$) and $i = 1, 2, \ldots, k$, where $LDL_{ij}$ are the detection limits in the $i$th log-normal sample and $m_i + m_{c_i} = n_i$. The parameters of the $i$th log-normal population can be expressed as functions of $\mu_i$ and $\sigma_i$; in particular, its mean is $\exp(\mu_i + \sigma_i^2/2)$ and its variance is $\left(e^{\sigma_i^2} - 1\right)\exp(2\mu_i + \sigma_i^2)$, for $i = 1, 2, \ldots, k$. On the log scale the detection limits become $DL_{ij} = \log(LDL_{ij})$ for $i = 1, 2, \ldots, k$ and $j = m_i + 1, m_i + 2, \ldots, n_i$.
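As a brief illustration of how the log-scale parameters relate to the original scale, the following sketch evaluates the standard log-normal mean and variance formulas and moves a sample to the log scale. The function names and the small data set are hypothetical and are not part of the paper's R program.

```python
import math

def lognormal_mean(mu, sigma):
    """Mean of LN(mu, sigma): exp(mu + sigma^2 / 2)."""
    return math.exp(mu + 0.5 * sigma ** 2)

def lognormal_variance(mu, sigma):
    """Variance of LN(mu, sigma): (exp(sigma^2) - 1) * exp(2 mu + sigma^2)."""
    return (math.exp(sigma ** 2) - 1.0) * math.exp(2.0 * mu + sigma ** 2)

def to_log_scale(measured, limits):
    """Return (x, dl): logs of the measured values and of the detection limits.

    measured: fully measured y values; limits: one LDL per non-detect.
    """
    return [math.log(y) for y in measured], [math.log(l) for l in limits]

# Hypothetical sample: five measured values and three non-detects.
x, dl = to_log_scale([12.0, 18.5, 25.1, 40.2, 9.7], [4.0, 4.0, 6.0])
print(x, dl)
print(lognormal_mean(3.0, 1.0), lognormal_variance(3.0, 1.0))
```

Because the log-normal mean depends on both µ and σ, equality of the log-scale means alone does not imply equality of the means on the original scale.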
To simplify the presentation, the analysis is described and illustrated for normally distributed data, although normality occurs infrequently in typical environmental data. Real environmental data usually must be transformed before analysis; typically the logarithmic transformation $x_{ij} = \log(y_{ij})$ is used, although other transformations are possible. When a logarithmic or other transformation is applied before a censored data set is analyzed, the results must be transformed back to the original scale of measurement after parameter estimation. For each sample $i$, let $\bar{x}_i$ and $s_i^2$ be the sample mean and sample variance of the $m_i$ non-censored observations $x_{i1}, x_{i2}, \ldots, x_{im_i}$, for $i = 1, 2, \ldots, k$. Let the functions $\phi(\cdot)$ and $\Phi(\cdot)$ be the pdf and cdf of the standard normal distribution. Define [...] We also define [...] for $i = 1, 2, \ldots, k$ and $j = 1, 2, \ldots, m_i$.
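The likelihood function of the samples under consideration is given by

$$L = \prod_{i=1}^{k}\left[\prod_{j=1}^{m_i}\frac{1}{\sigma_i}\,\phi\left(\frac{x_{ij}-\mu_i}{\sigma_i}\right)\prod_{j=m_i+1}^{n_i}\Phi\left(\frac{DL_{ij}-\mu_i}{\sigma_i}\right)\right],$$

which can be written as

$$L = \prod_{i=1}^{k}\left[\left(2\pi\sigma_i^2\right)^{-m_i/2}\exp\left\{-\frac{1}{2\sigma_i^2}\sum_{j=1}^{m_i}(x_{ij}-\mu_i)^2\right\}\prod_{j=m_i+1}^{n_i}\Phi\left(\frac{DL_{ij}-\mu_i}{\sigma_i}\right)\right],$$

where $DL_{ij}$ denotes the detection limit on the transformed scale. Four hypotheses are possible; they include $H_{0N}$ (overall homogeneity: $\mu_1 = \cdots = \mu_k$ and $\sigma_1 = \cdots = \sigma_k$), $H_{A1N}$ (overall heterogeneity), and $H_{A2N}$ (mean heterogeneity with variance homogeneity). The $k$ log-normal population means are regarded as equal whenever there is no evidence to reject the null hypothesis $H_{0LN}$.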
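The structure of this likelihood can be sketched in a few lines. The code below assumes the usual Type I left-censored normal form, in which each measured value contributes a normal log-density and each non-detect contributes log Φ((DL − µ)/σ); the function names, data, and parameter values are hypothetical.

```python
import math

def norm_logpdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def norm_cdf(t):
    """Standard normal distribution function Phi(t)."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def censored_loglik(samples, params):
    """Log-likelihood of k left-censored normal samples.

    samples: list of (x_obs, dls) pairs on the log scale, where dls holds
             one detection limit per censored observation.
    params:  list of (mu_i, sigma_i) pairs, one per sample.
    """
    total = 0.0
    for (x_obs, dls), (mu, sigma) in zip(samples, params):
        total += sum(norm_logpdf(x, mu, sigma) for x in x_obs)
        total += sum(math.log(norm_cdf((dl - mu) / sigma)) for dl in dls)
    return total

# Hypothetical log-scale data for k = 3 samples.
samples = [([1.2, 2.0, 2.7], [0.5]),
           ([0.9, 1.8], [0.4, 0.7]),
           ([2.1, 3.0, 3.3], [1.0])]
common = censored_loglik(samples, [(1.8, 1.0)] * 3)       # one (mu, sigma) for all
separate = censored_loglik(samples, [(1.7, 0.9), (1.0, 0.8), (2.5, 0.9)])
print(common, separate)
```

Passing the same (µ, σ) pair for every sample evaluates the likelihood under overall homogeneity; passing sample-specific pairs evaluates it under overall heterogeneity.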

Maximum Likelihood Estimates of Population Parameters
In this section the maximum likelihood estimates of the population parameters $\mu_i$ and $\sigma_i$, for $i = 1, 2, \ldots, k$, are derived under each of the hypotheses $H_{0N}$, $H_{A1N}$ and $H_{A2N}$. The derivations of these estimates are now described.

Maximum Likelihood Estimates under $H_{0N}$
Under the hypothesis $H_{0N}$, the $x_{ij}$, for $i = 1, 2, \ldots, k$ and $j = 1, 2, \ldots, n_i$, are assumed to be normally distributed with common mean $\mu$ and standard deviation $\sigma$. That is, it is assumed that there is a random sample of $n = n_1 + n_2 + \cdots + n_k$ data values taken from a normal population with mean $\mu$ and standard deviation $\sigma$. For convenience, assume that for each sample $i$ the first $m_i$ observations $x_{i1}, x_{i2}, \ldots, x_{im_i}$ are non-censored (fully measured) and the remaining $m_{c_i} = n_i - m_i$ observations are left-censored, $i = 1, 2, \ldots, k$. For the left-censored observations it is only known that $x_{ij} < DL_{ij}$ for $j = m_i + 1, \ldots, n_i$ (or $j = 1, 2, \ldots, m_{c_i}$) and $i = 1, 2, \ldots, k$.
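Hence, the corresponding log-likelihood function $\ell_{H_{0N}}(\mu, \sigma) = \log L_{H_{0N}}(\mu, \sigma)$ of (3.1) is given, up to an additive constant, by

$$\ell_{H_{0N}}(\mu, \sigma) = -m\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{k}\sum_{j=1}^{m_i}(x_{ij}-\mu)^2 + \sum_{i=1}^{k}\sum_{j=m_i+1}^{n_i}\log\Phi\left(\frac{DL_{ij}-\mu}{\sigma}\right),$$

where $m = \sum_{i=1}^{k} m_i$ is the total number of non-censored observations. Let $\bar{x}$ and $s^2$ be the sample mean and sample variance of these $m$ non-censored observations, respectively.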
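The maximum likelihood estimates $\hat{\mu}$ and $\hat{\sigma}$ of $\mu$ and $\sigma$ are the solutions of equations (3.3) and (3.4), obtained by setting the partial derivatives of the log-likelihood with respect to $\mu$ and $\sigma$ equal to zero. Writing $\xi_{ij} = (DL_{ij} - \mu)/\sigma$ and $m = \sum_{i=1}^{k} m_i$, these equations are

$$\frac{\partial \ell_{H_{0N}}}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{k}\sum_{j=1}^{m_i}(x_{ij}-\mu) - \frac{1}{\sigma}\sum_{i=1}^{k}\sum_{j=m_i+1}^{n_i}\frac{\phi(\xi_{ij})}{\Phi(\xi_{ij})} = 0 \qquad (3.3)$$

$$\frac{\partial \ell_{H_{0N}}}{\partial \sigma} = -\frac{m}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^{k}\sum_{j=1}^{m_i}(x_{ij}-\mu)^2 - \frac{1}{\sigma}\sum_{i=1}^{k}\sum_{j=m_i+1}^{n_i}\xi_{ij}\,\frac{\phi(\xi_{ij})}{\Phi(\xi_{ij})} = 0 \qquad (3.4)$$

The expectation maximization (EM) algorithm is used iteratively to obtain the solutions $\hat{\mu}$ and $\hat{\sigma}$ of the maximum likelihood equations (3.3) and (3.4). The EM algorithm was proposed by Dempster et al. (1977) for calculating maximum likelihood estimates from censored samples. The procedure alternates between estimating the censored observations from the current parameter estimates and estimating the parameters from the actual and estimated observations. The EM algorithm can be used to calculate the maximum likelihood estimates of the mean $\mu$ and standard deviation $\sigma$ of a normal distribution from both singly and multiply censored samples. A brief description of the EM algorithm is given here.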
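At step 0 of the EM algorithm all non-censored observations are used to calculate the initial estimates of $\mu$ and $\sigma$: $\hat{\mu}_0$ and $\hat{\sigma}_0$ are the sample mean and sample standard deviation of the $m = \sum_{i=1}^{k} m_i$ non-censored observations. Let $\hat{\mu}_s$ and $\hat{\sigma}_s$ be the estimates of $\mu$ and $\sigma$ at step $s$ of this procedure. At step $s + 1$, each censored observation $x_{ij}$ (where $i = 1, 2, \ldots, k$ and $j = m_i + 1, \ldots, n_i$) is replaced by the estimate $\hat{\mu}_s - \hat{\sigma}_s W(\xi_{ij})$, where $\xi_{ij} = (DL_{ij} - \hat{\mu}_s)/\hat{\sigma}_s$ and $W(t) = \phi(t)/\Phi(t)$. Let the values $u_{ij}$ be calculated at step $s + 1$ as follows:

$$u_{ij} = \begin{cases} x_{ij}, & j = 1, \ldots, m_i, \\ \hat{\mu}_s - \hat{\sigma}_s W(\xi_{ij}), & j = m_i + 1, \ldots, n_i, \end{cases}$$

$$\hat{\mu}_{s+1} = \frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_i} u_{ij}, \qquad \hat{\sigma}_{s+1}^2 = \frac{\sum_{i=1}^{k}\sum_{j=1}^{n_i}(u_{ij} - \hat{\mu}_{s+1})^2}{m + \sum_{i=1}^{k}\sum_{j=m_i+1}^{n_i}\gamma(\xi_{ij})},$$

where the function $\gamma(t)$ is defined as

$$\gamma(t) = W(t)\,[W(t) + t].$$

More details about the EM algorithm procedure can be found in Wolynetz (1979). Convergence is achieved when both $|\hat{\mu}_s - \hat{\mu}_{s+1}| < 0.00001$ and $|\hat{\sigma}_s - \hat{\sigma}_{s+1}| < 0.00001$. When these convergence criteria are met, the maximum likelihood estimates of $\mu$ and $\sigma$ are given by $\hat{\mu} = \hat{\mu}_s$ and $\hat{\sigma} = \hat{\sigma}_s$, respectively.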
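The iteration can be sketched as follows. This is a minimal illustration rather than the paper's R program: it assumes the left-censoring forms W(t) = φ(t)/Φ(t) and γ(t) = W(t)(W(t) + t), divides the initial sum of squares by m, and uses hypothetical function names and data.

```python
import math

def norm_pdf(t):
    """Standard normal density phi(t)."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def norm_cdf(t):
    """Standard normal distribution function Phi(t)."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def w(t):
    """Assumed W(t) = phi(t) / Phi(t) for a left-censored observation."""
    return norm_pdf(t) / norm_cdf(t)

def gamma(t):
    """Assumed gamma(t) = W(t) * (W(t) + t)."""
    return w(t) * (w(t) + t)

def em_left_censored(x_obs, dls, tol=1e-5, max_iter=10_000):
    """Return (mu_hat, sigma_hat) for one normal sample on the log scale.

    x_obs: non-censored values; dls: one detection limit per censored value.
    """
    m, n = len(x_obs), len(x_obs) + len(dls)
    # Step 0: initial estimates from the non-censored observations only.
    mu = sum(x_obs) / m
    sigma = math.sqrt(sum((x - mu) ** 2 for x in x_obs) / m)
    for _ in range(max_iter):
        xi = [(dl - mu) / sigma for dl in dls]
        # Replace each censored value by its conditional mean below the limit.
        u = list(x_obs) + [mu - sigma * w(t) for t in xi]
        new_mu = sum(u) / n
        denom = m + sum(gamma(t) for t in xi)
        new_sigma = math.sqrt(sum((v - new_mu) ** 2 for v in u) / denom)
        done = abs(new_mu - mu) < tol and abs(new_sigma - sigma) < tol
        mu, sigma = new_mu, new_sigma
        if done:
            break
    return mu, sigma

# Hypothetical log-scale sample with two distinct detection limits.
x_obs = [1.0, 1.5, 2.0, 2.5, 3.0]
dls = [0.8, 0.8, 1.2]
mu_hat, sigma_hat = em_left_censored(x_obs, dls)
print(round(mu_hat, 4), round(sigma_hat, 4))
```

Pooling all k samples into one call gives estimates under overall homogeneity, while calling the function once per sample gives sample-specific estimates. With no censored values the function returns the ordinary mean and the divisor-m standard deviation after a single pass.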
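The single-sample EM algorithm can be used to obtain the maximum likelihood estimates $\hat{\mu}_i$ and $\hat{\sigma}_i$ of $\mu_i$ and $\sigma_i$ for $i = 1, 2, \ldots, k$ as follows. At step 0 of the EM algorithm the non-censored observations of sample $i$ are used to calculate the initial estimates: $\hat{\mu}_{i0}$ and $\hat{\sigma}_{i0}$ are the sample mean and sample standard deviation of the $m_i$ non-censored observations, for $i = 1, 2, \ldots, k$. For $i = 1, 2, \ldots, k$ let $\hat{\mu}_{is}$ and $\hat{\sigma}_{is}$ be the estimates of $\mu_i$ and $\sigma_i$ at step $s$ of this procedure. At step $s + 1$, each censored observation $x_{ij}$ (where $i = 1, 2, \ldots, k$ and $j = m_i + 1, \ldots, n_i$) is replaced by the estimate $\hat{\mu}_{is} - \hat{\sigma}_{is} W(\xi_{ij})$, where $\xi_{ij} = (DL_{ij} - \hat{\mu}_{is})/\hat{\sigma}_{is}$ and $W(t) = \phi(t)/\Phi(t)$. Let the values $t_{ij}$ be calculated at step $s + 1$ as follows:

$$t_{ij} = \begin{cases} x_{ij}, & j = 1, \ldots, m_i, \\ \hat{\mu}_{is} - \hat{\sigma}_{is} W(\xi_{ij}), & j = m_i + 1, \ldots, n_i, \end{cases}$$

for $i = 1, 2, \ldots, k$. The updated estimates $\hat{\mu}_{i,s+1}$ and $\hat{\sigma}_{i,s+1}$ of $\mu_i$ and $\sigma_i$ are then given by

$$\hat{\mu}_{i,s+1} = \frac{1}{n_i}\sum_{j=1}^{n_i} t_{ij}, \qquad \hat{\sigma}_{i,s+1}^2 = \frac{\sum_{j=1}^{n_i}(t_{ij} - \hat{\mu}_{i,s+1})^2}{m_i + \sum_{j=m_i+1}^{n_i}\gamma(\xi_{ij})},$$

with $\gamma(t) = W(t)\,[W(t) + t]$. For $i = 1, 2, \ldots, k$, convergence is achieved when both $|\hat{\mu}_{is} - \hat{\mu}_{i,s+1}| < 0.00001$ and $|\hat{\sigma}_{is} - \hat{\sigma}_{i,s+1}| < 0.00001$. When these convergence criteria are met, the maximum likelihood estimates of $\mu_i$ and $\sigma_i$ are given by $\hat{\mu}_i = \hat{\mu}_{is}$ and $\hat{\sigma}_i = \hat{\sigma}_{is}$, respectively.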

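In general, the asymptotic $\alpha$-level chi-square test of the null hypothesis $H_0: \theta = 0$ versus the alternative hypothesis $H_a: \theta \neq 0$ is defined by: reject $H_0$ if

$$\chi_0^2 = -2\left[\ell(H_0) - \ell(H_a)\right] \ge \chi^2_{(\alpha, df)},$$

where $\ell(H_0)$ and $\ell(H_a)$ denote the log-likelihood functions evaluated at the maximum likelihood estimates under $H_0$ and $H_a$. Under $H_0$, $\chi_0^2$ has asymptotically a chi-square distribution with $df$ degrees of freedom, where $df$ is the number of free parameters under the alternative hypothesis $H_a$ minus the number of free parameters under the null hypothesis $H_0$, and $\chi^2_{(\alpha, df)}$ is the upper $\alpha$-point of the chi-square distribution with $df$ degrees of freedom.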
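The decision rule can be sketched numerically. The code below assumes a likelihood ratio form of the statistic and uses the closed-form upper tail of the chi-square distribution, which is available when the degrees of freedom are even (for a comparison of overall homogeneity with overall heterogeneity, df = 2k − 2). The function names and the log-likelihood values are hypothetical.

```python
import math

def chi2_sf_even(x, df):
    """Upper-tail probability P(X >= x) for a chi-square with even df.

    Uses the closed form exp(-x/2) * sum_{j < df/2} (x/2)^j / j!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total

def chi_square_test(loglik_null, loglik_alt, n_par_null, n_par_alt, alpha=0.05):
    """Return (statistic, df, p_value, reject) for nested hypotheses."""
    stat = max(-2.0 * (loglik_null - loglik_alt), 0.0)
    df = n_par_alt - n_par_null
    p_value = chi2_sf_even(stat, df)
    return stat, df, p_value, p_value < alpha

# Hypothetical maximized log-likelihoods for k = 3 samples:
# 2 free parameters under the null, 2k = 6 under overall heterogeneity.
stat, df, p_value, reject = chi_square_test(-100.0, -99.0, 2, 6)
print(stat, df, round(p_value, 4), reject)
```

Rejecting when the p-value falls below α is equivalent to comparing the statistic with the upper α-point of the chi-square distribution.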
The asymptotic $\alpha$-level chi-square tests used in Test 1 ($H_{0N}$ versus $H_{A1N}$: overall homogeneity versus overall heterogeneity) and Test 2 ($H_{0N}$ versus $H_{A2N}$: overall homogeneity versus mean heterogeneity with variance homogeneity) are now described.
Computer Programs: To facilitate the application of the test procedure and the parameter estimation method described in this article, a computer program called "K.Lognormal.Estimation" was written in the R language. It automates parameter estimation from multiply left-censored data sets that are normally or log-normally distributed and computes the values of the log-likelihood functions under the null and the alternative hypotheses. It also computes the asymptotic $\alpha$-level chi-square test statistic and its p-value. A copy of the source code is given in the Appendix and is available upon request.
For the sake of simplicity, only Test 1 ($H_{0N}$ versus $H_{A1N}$) is considered in the remainder of this article. Test 2 can be programmed and computed in the same way.

Example:
The following data sets were simulated from a lognormal distribution with log-scale mean µ = 3 and log-scale standard deviation σ = 1. The data are given in Table 1. Each data set was artificially censored at the 10th, 20th and 30th percentiles. Table 2 contains the censored data sets and the censoring indicators (0 = non-censored, 1 = censored). The first data set contains three distinct detection limits, 4, 10 and 13, and has a censoring level of 24%. The second data set contains three distinct detection limits, 6, 9 and 13, and has a censoring level of 30%. The third data set contains three distinct detection limits, 7, 9 and 13, and has a censoring level of 30%. Accordingly, the pooled data contain six distinct detection limits, 4, 6, 7, 9, 10 and 13, and have a censoring level of 28%. Table 3 contains the estimates of the normal and log-normal population parameters, together with the p-value obtained by applying the recommended asymptotic chi-square test to the simulated censored data. The p-value for testing the null hypothesis $H_{0N}$: $\mu_1 = \mu_2 = \mu_3$ and $\sigma_1 = \sigma_2 = \sigma_3$ against the alternative hypothesis $H_{A1N}$ of overall heterogeneity (the $\mu_i$ and $\sigma_i$ are not all equal) is 0.8834. Therefore the hypothesis of equal normal (lognormal) parameters is not rejected at the significance level α = 0.05.

Simulation Study
In this simulation study, the Type I error rates and the power of the test procedure introduced in this article are investigated. A computer program was written in the R language for this purpose. For each combination of the population parameters $\mu_1, \mu_2, \mu_3, \sigma_1, \sigma_2$ and $\sigma_3$ described below, two sample size cases were considered: $n_1 = n_2 = n_3 = 25$ (the small sample size case) and $n_1 = n_2 = n_3 = 75$ (the large sample size case). Censoring at three different detection limits was used in each simulated sample. The simulation study was performed with 10,000 replications (N = 10,000) of normal samples for each combination of $n$, $\mu_1, \mu_2, \mu_3, \sigma_1, \sigma_2, \sigma_3$, and censoring levels. The simulated data were artificially censored under two schemes: at the 10th, 20th, and 30th percentiles, and at the 30th, 40th, and 50th percentiles, as shown in Tables 4 and 5. To check the Type I error, the population parameters were specified as $\mu_1 = \mu_2 = \mu_3 = 0$ and $\sigma_1 = \sigma_2 = \sigma_3 = 1$, as shown in Table 4. To check the power, the population parameters were specified as $\mu_1 = -1.0(0.1)-0.1$ (that is, from −1.0 to −0.1 in steps of 0.1), $\mu_2 = 0$, $\mu_3 = 0.1(0.1)1.0$, $\sigma_1 = 1$, $\sigma_2 = 1.1(0.1)2.0$, and $\sigma_3 = 1.2(0.2)3.0$, as shown in Table 5.
The following observations and conclusions are made from an examination of the simulation results reported in Tables 4 and 5.
From Table 4, one can see that the estimated simulated Type I error rates are slightly higher than 0.05 (0.0534, 0.0516) for the small sample size case, and slightly less than 0.05 (0.0482, 0.0469) for the large sample size case. The censoring levels do not seem to affect the value of Type I error rate, α.
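A quick way to judge whether these estimated rates are compatible with the nominal level is to compare them with the Monte Carlo standard error. The sketch below assumes that the number of rejections in N = 10,000 independent replications is binomial with success probability α = 0.05; the function name is hypothetical.

```python
import math

def mc_standard_error(alpha, reps):
    """Standard error of an estimated rejection rate under a binomial model."""
    return math.sqrt(alpha * (1.0 - alpha) / reps)

alpha, reps = 0.05, 10_000
se = mc_standard_error(alpha, reps)

# Estimated Type I error rates quoted in the text for Table 4.
observed = [0.0534, 0.0516, 0.0482, 0.0469]
for rate in observed:
    z = (rate - alpha) / se
    print(f"rate = {rate:.4f}, z = {z:+.2f}")
```

Under this assumption the standard error is about 0.0022, and all four reported rates lie within 1.96 standard errors of 0.05.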
In summary, the test procedure introduced in this article maintains its stated significance level, has considerably more power with the larger sample size, and has slightly less power at the higher censoring levels. Specifically, the power decreases when the censoring levels move from 0.10, 0.20, and 0.30 to 0.30, 0.40, and 0.50, and it increases greatly when the sample size moves from the order of 25 to the order of 75.

Conclusions and Remarks
The k-sample lognormal model provides an alternative to nonparametric models for testing the equality of the parameters of k (k ≥ 3) independent log-normal populations in environmental settings. The lognormal model provides additional information on the homogeneity ($\sigma_1 = \sigma_2 = \cdots = \sigma_k$) or heterogeneity (the $\sigma_i$ are not all equal) of the k lognormal populations, which is important in interpreting differences among medians. The log-normal distribution is widely used in modelling censored environmental and biomedical data. This article has dealt with the problem of comparing the parameters of k (k ≥ 3) independent log-normal populations in the presence of left-censored data. The EM algorithm is employed to obtain the maximum likelihood estimates of the population parameters under the different hypotheses. A parametric procedure for testing the equality of the parameters of k (k ≥ 3) independent log-normal populations in the presence of censored data is presented, and its performance is evaluated by means of simulation studies. The simulation results indicate that the test performs well with respect to both size and power. To facilitate the application of the new test procedure, a computer program was written in the R language. We hope that this paper will be useful to researchers who use the log-normal distribution in their analysis of left-censored data.