The development and evaluation of new methods for editing and imputation
With financial support from the IST Programme of the European Union
Key action CPA4: New indicators and statistical methods
Total cost 3 627 844 Euro, Commission funding 2 100 000 Euro
Contract no IST-1999-10226
Duration: 1 March 2000 - 28 February 2003
Project main goals
Develop a methodological evaluation framework and develop evaluation criteria for data editing and imputation.
Produce a standard collection of data sets.
Establish a baseline by evaluating currently used methods.
Develop and evaluate a selected range of new techniques.
Compare and evaluate the different methods and establish best methods for different types of data.
Disseminate the best methods via a single computer package and publications.
Participating organisations
Participating organisation | Country | Contact person |
Office for National Statistics (coordinator) | UK | John Charlton |
Royal Holloway College Univ. London | UK | Alex Gammerman |
University of Southampton | UK | Ray Chambers |
University of York | UK | Jim Austin |
The Numerical Algorithms Group Ltd | UK | Geoff Morgan |
Centraal Bureau voor de Statistiek | Netherlands | Ton de Waal |
Tilastokeskus (Statistics Finland) | Finland | Seppo Laaksonen |
University of Jyvaeskylae | Finland | Pasi Koikkalainen |
Swiss Federal Statistical Office | Switzerland | Beat Hulliger |
Qantaris GmbH | Germany | Phil Kokic |
Istituto Nazionale Di Statistica | Italy | Giulio Barcaroli |
Statistics Denmark | Denmark | Peter Linde |
Key issues
Imputation-based methods for dealing with incomplete or inconsistent data are used in virtually all National Statistics Institutes (NSIs), and in academic and business research. Currently, these methods are typically based on simple statistical ideas (e.g. nearest neighbours). Little is known about how the methods compare in performance across the wide variety of data sources in use.
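To make the "nearest neighbours" idea concrete, the following is a minimal sketch of donor imputation, not a method taken from the project. The function name `nearest_neighbour_impute`, the field names and the toy records are all hypothetical. Each record with a missing value borrows that value from the complete record closest to it on the chosen predictor variables.

```python
import math


def nearest_neighbour_impute(records, target, predictors):
    """Fill missing `target` values (None) from the closest complete record.

    Hypothetical helper for illustration only: distance is plain
    Euclidean distance on the unscaled predictor variables.
    """
    donors = [r for r in records if r[target] is not None]
    completed = []
    for rec in records:
        if rec[target] is not None:
            completed.append(dict(rec))
            continue
        # Pick the donor whose predictor values are nearest to this record.
        donor = min(
            donors,
            key=lambda d: math.dist(
                [rec[p] for p in predictors],
                [d[p] for p in predictors],
            ),
        )
        completed.append({**rec, target: donor[target]})
    return completed


# Toy business-survey records (invented values).
records = [
    {"employees": 10, "turnover": 120.0},
    {"employees": 50, "turnover": 640.0},
    {"employees": 12, "turnover": None},
]
imputed = nearest_neighbour_impute(records, "turnover", ["employees"])
print(imputed[2]["turnover"])
```

In practice the predictors would usually be standardised first, and the choice of distance measure is itself one of the design decisions a comparative evaluation would need to cover.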
Recent advances in computing capabilities have made it possible to apply more complex statistical modelling techniques. The EUREDIT project will combine recent developments in statistics and computer science to develop and evaluate novel edit and imputation methodologies, focusing on the use of new statistical, neural network and related methods for edit and imputation in large-scale statistical data sets.
Eurostat recognises the importance of editing and imputation in the document "EPROS: European Plan for Research in Official Statistics" (7 April 1999). The Euredit project will address, in varying degrees, a number of the data editing and imputation research issues mentioned in that document, especially the following:
Techniques for determining how to limit selectively the edits that really affect data quality.
Generalisation and extension of algorithms, such as the Fellegi-Holt methodology, for automatic error detection and correction in a way that is suited for both continuous and categorical variables.
The development of general-purpose, application-independent macro-editing software.
Identification of the relative merits of different imputation methodologies.
Assessment of the robustness of different imputation methodologies and their effect on the outcomes of multivariate analyses of the imputed data.
The development of simultaneous imputation techniques that ensure consistency with all edit rules specified.
Use of knowledge discovery methods to better understand datasets in relation to background information.
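The Fellegi-Holt methodology named in the list above rests on the principle of changing as few fields as possible so that a record satisfies every edit rule. The sketch below illustrates that principle by brute force on a tiny categorical record. It is not the Fellegi-Holt algorithm itself, which works with implied edits and set covering, and the function `localise_errors`, the variable names and the two edit rules are hypothetical.

```python
from itertools import combinations, product


def localise_errors(record, domains, edits):
    """Find a smallest set of fields that can be changed to pass all edits.

    Brute-force illustration only: tries subsets of fields in order of
    increasing size and returns the first subset, with replacement
    values, for which every edit rule holds.
    """
    fields = list(record)
    for size in range(len(fields) + 1):
        for subset in combinations(fields, size):
            for values in product(*(domains[f] for f in subset)):
                candidate = {**record, **dict(zip(subset, values))}
                if all(edit(candidate) for edit in edits):
                    return set(subset), candidate
    return None  # no assignment of values satisfies the edits


# Hypothetical categorical domains and edit rules.
domains = {
    "age_group": ["child", "adult"],
    "marital_status": ["single", "married"],
    "employment": ["employed", "not_employed"],
}
edits = [
    # A child cannot be married.
    lambda r: not (r["age_group"] == "child" and r["marital_status"] == "married"),
    # A child cannot be employed.
    lambda r: not (r["age_group"] == "child" and r["employment"] == "employed"),
]
record = {"age_group": "child", "marital_status": "married", "employment": "employed"}

changed, fixed = localise_errors(record, domains, edits)
print(changed, fixed)
```

Here changing the single field `age_group` repairs both failed edits, whereas the alternative would require changing two fields. The exhaustive search grows exponentially with the number of fields, which is one reason more efficient algorithms are needed for real survey data.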
Technical approach
The fundamental approach adopted in Euredit is to identify sound scientific, technical and user-oriented criteria that enable a meaningful comparison of current and promising new methods for data editing and imputation. (Note: In Euredit, the term editing is taken to have the narrower meaning of error localisation, i.e. identifying doubtful or erroneous data values.)
Representative data sets arising in household surveys, business surveys, censuses, panel surveys, time series and business registers will be selected to provide sufficiently broad coverage of the range of error attributes listed below.
Attribute | Some possible instances of attribute |
Type of error | Inconsistencies, missingness and amount of missingness, "outlyingness" and amount of "outlyingness" |
Nature of error | Systematic, stochastic |
Type of variable | Nominal, ordinal and continuous variables |
Degree of non-response | Item non-response, unit non-response |
Type of data set | Social surveys, business surveys, censuses, panel data, administrative registers |
The integrity and validity of experimental work in Euredit will be achieved through the development of a methodological framework early in the project. This framework will prescribe a set of common experimental procedures to be followed in the EUREDIT project.
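As an illustration of what one such common experimental procedure could look like, the sketch below starts from "clean" values, deletes some at random, imputes them, and scores the imputations against the known truth. This is an assumption about the general shape of an evaluation, not the framework the project defines. The function names, the mean-imputation baseline, the missingness rate and the error measure are all hypothetical choices.

```python
import random
import statistics


def delete_at_random(values, rate, seed=0):
    """Replace each value with None with probability `rate` (seeded)."""
    rng = random.Random(seed)
    return [None if rng.random() < rate else v for v in values]


def mean_impute(values):
    """Baseline imputation: fill every gap with the mean of observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]


def mean_absolute_error(truth, perturbed, imputed):
    """Average absolute error over the positions that were deleted."""
    errors = [
        abs(t - i)
        for t, p, i in zip(truth, perturbed, imputed)
        if p is None
    ]
    return statistics.mean(errors) if errors else 0.0


# Invented "clean" data: the values 1.0 to 20.0.
clean = [float(x) for x in range(1, 21)]
perturbed = delete_at_random(clean, rate=0.3, seed=1)
imputed = mean_impute(perturbed)
score = mean_absolute_error(clean, perturbed, imputed)
print(f"imputed {perturbed.count(None)} values, mean absolute error {score:.2f}")
```

Keeping the perturbation step, the imputation method and the scoring function separate means that any method can be swapped in and scored on exactly the same perturbed data, which is what a like-for-like comparison requires.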
Expected achievements/impact
The provision of a standard collection of data-sets, presented both as "clean" data, and data with a broad range of error types. This will provide a single source for comparative studies of different edit and imputation techniques. No such compilation is currently available.
The definition of quality and evaluation criteria by which each technique may be judged, and the provision of a methodological framework within which the evaluation may take place. Measurement of edit and imputation quality is currently an open research question.
The adaptation and application of a diverse range of new methods (multi-layer perceptron, correlation matrix memory, self-organising maps, support vector machines) to data editing and imputation. These powerful techniques have been applied successfully in many other areas.
The development of new statistical techniques for multivariate edit and imputation based on application of outlier robust methodology to detection and modification of representative outliers in survey data.
The investigation of editing techniques that can handle mixed data types. Current techniques based on the Fellegi-Holt procedure are restricted to either qualitative or quantitative data, but not mixtures of both.
Development of fuzzy logic and non-parametric regression techniques for edit and imputation, particularly in the context of temporal (panel) data series.
An overall comparison of all methods evaluated in Euredit, identifying the weaknesses and strengths of each, with particular reference to error attributes.
The development of an overall framework which identifies recommended strategies for data editing and imputation, according to known or expected error attributes of the data set in question.
Coordinator Contact
John Charlton, Office for National Statistics, 1 Drummond Gate, London SW1V 2QQ.
Email: John.Charlton@ons.gov.uk
Acknowledgements
The Euredit project takes place with financial support from the IST Programme of the European Union.