A REGRESSIVE METHODOLOGY FOR ESTIMATING MISSING DATA IN RAINFALL DAILY TIME SERIES
BARCA E; PASSARELLA G
2009
Abstract
The presence of gaps in environmental time series is a very common and critical problem, since it can produce biased results (Rubin, 1976). Missing data plague almost all surveys. The problem is how to deal with missing data once it has been deemed impossible to recover the actual missing values. Apart from the amount of missing data, another issue that plays an important role in the choice of any recovery approach is the evaluation of the "missingness" mechanism. When missingness is conditioned by some other variable observed in the data set (Schafer, 1997), the mechanism is called MAR (Missing at Random). Otherwise, when the missingness mechanism depends on the actual value of the missing data, it is called NMAR (Not Missing at Random). The latter is the most difficult condition to model. In the last decade, interest arose in estimating missing data by regression (single imputation). More recently, multiple imputation has also become available, which returns a distribution of estimated values (Scheffer, 2002). In this paper, an automatic methodology for estimating missing data is presented. In practice, given a gauging station affected by missing data (the target station), the methodology checks the randomness of the missing data and classifies the "similarity" between the target station and the other gauging stations spread over the study area. Among the different methods for defining the degree of similarity, whose effectiveness strongly depends on the data distribution, the Spearman correlation coefficient was chosen. Once the similarity matrix was defined, a suitable nonparametric, univariate regression method, the Theil method (Theil, 1950), was applied to estimate the missing data at the target station. Even though the methodology proved rather reliable, the estimation of missing data can be improved by generalizing it.
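The two steps described above (ranking candidate stations by Spearman correlation, then regressing the target on the most similar one with the Theil method) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of `None` to mark a gap, the median-based intercept, and the clipping of negative estimates to zero are all assumptions made for illustration, and the station values are invented toy numbers.

```python
from itertools import combinations
from statistics import mean, median


def average_ranks(values):
    """Rank values 1..n, giving tied values (e.g. many dry days) their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman coefficient: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    denom = (sxx * syy) ** 0.5
    return sxy / denom if denom else 0.0  # constant series: treat as uncorrelated


def paired(target, donor):
    """Keep only the days observed at both stations; returns (donor, target) lists."""
    pairs = [(d, t) for t, d in zip(target, donor) if t is not None and d is not None]
    return [p[0] for p in pairs], [p[1] for p in pairs]


def theil_line(x, y):
    """Theil slope = median of pairwise slopes; the intercept form is an assumption.

    Raises statistics.StatisticsError if every x value is identical.
    """
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i, j in combinations(range(len(x)), 2) if x[i] != x[j]]
    slope = median(slopes)
    return slope, median(y) - slope * median(x)


def impute_from_most_similar(target, donors):
    """Fill gaps in `target` from the donor with the highest Spearman coefficient."""
    scores = {name: spearman(*paired(target, s)) for name, s in donors.items()}
    best = max(scores, key=scores.get)
    slope, intercept = theil_line(*paired(target, donors[best]))
    filled = []
    for t, d in zip(target, donors[best]):
        if t is None and d is not None:
            t = max(0.0, intercept + slope * d)  # rainfall cannot be negative (assumption)
        filled.append(t)
    return filled, best


# Toy daily rainfall (mm); None marks a missing day at the target station.
target = [0.0, 2.0, None, 5.0, 1.0, None, 8.0]
donors = {
    "A": [0.0, 1.0, 3.0, 2.5, 0.5, 1.5, 4.0],
    "B": [3.0, 0.0, 1.0, 0.0, 2.0, 5.0, 1.0],
}
filled, best = impute_from_most_similar(target, donors)
```

Ranks are averaged over ties because daily rainfall series contain many identical zero values. A gap stays unfilled when the chosen donor is also missing on that day.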
A first possible improvement consists in extending the univariate technique to a multivariate approach. Another follows the paradigm of "multiple imputation" (Rubin, 1987; Rubin, 1988), which consists in using a set of "similar stations" rather than only the most similar one. In this way, an estimation range can be determined, allowing uncertainty to be introduced. Finally, time series can be grouped on the basis of monthly rainfall rates into classes of wetness (i.e., dry, moderately rainy, and rainy), so that the estimation is carried out on homogeneous data subsets. We expect that integrating these enhancements into the methodology will improve its reliability. The methodology was applied to the daily rainfall time series recorded in the Candelaro River Basin (Apulia, Southern Italy) from 1970 to 2001.
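The idea of using a set of similar stations to obtain an estimation range could look roughly like the sketch below. It is only an illustration under stated assumptions: the donors are taken to be already selected as "similar", each donor produces one Theil-type estimate, and the range is summarized as (minimum, median, maximum). The function names, the summary statistics, the intercept form, and the toy data are hypothetical and are not taken from the paper.

```python
from itertools import combinations
from statistics import median


def theil_fit(x, y):
    """Median of pairwise slopes; the median-based intercept is an assumed form."""
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i, j in combinations(range(len(x)), 2) if x[i] != x[j]]
    slope = median(slopes)
    return slope, median(y) - slope * median(x)


def estimate_range(target, donors, day):
    """Return (low, central, high) estimates for one missing day, or None."""
    estimates = []
    for series in donors.values():
        if series[day] is None:
            continue  # this donor cannot help on that day
        pairs = [(d, t) for t, d in zip(target, series)
                 if t is not None and d is not None]
        x = [p[0] for p in pairs]
        y = [p[1] for p in pairs]
        if len(set(x)) < 2:
            continue  # no pair of distinct donor values, so no slope is defined
        slope, intercept = theil_fit(x, y)
        # Clip at zero because rainfall cannot be negative (assumption).
        estimates.append(max(0.0, intercept + slope * series[day]))
    if not estimates:
        return None
    return min(estimates), median(estimates), max(estimates)


# Toy daily rainfall (mm); None marks the gap at the target station.
target = [0.0, 2.0, None, 4.0, 6.0]
donors = {
    "A": [0.0, 1.0, 1.5, 2.0, 3.0],
    "B": [0.0, 4.0, 8.0, 8.0, 12.0],
    "C": [1.0, 3.0, 4.5, 5.0, 7.0],
}
low, central, high = estimate_range(target, donors, day=2)
```

The spread between the lowest and highest estimate gives a rough indication of the uncertainty attached to the filled value. A full multiple-imputation procedure in the sense of Rubin would go further than this simple range.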