The development and evaluation of new methods for editing and imputation

 

With financial support from the IST Programme of the European Union

Key action CPA4: New indicators and statistical methods

Total cost 3 627 844 Euro, Commission funding 2 100 000 Euro

Contract no IST-1999-10226

Duration: 1 March 2000 - 28 February 2003

 

Project main goals

Develop a methodological evaluation framework and evaluation criteria for data editing and imputation.
Produce a standard collection of data sets.
Establish a baseline by evaluating currently used methods.
Develop and evaluate a selected range of new techniques.
Compare and evaluate the different methods and establish best methods for different types of data.
Disseminate the best methods via a single computer package and publications.

 

Participating organisations

Participating organisation | Country | Contact person
Office for National Statistics (coordinator) | UK | John Charlton
Royal Holloway College, Univ. London | UK | Alex Gammerman
University of Southampton | UK | Ray Chambers
University of York | UK | Jim Austin
The Numerical Algorithms Group Ltd | UK | Geoff Morgan
Centraal Bureau voor de Statistiek | Netherlands | Ton de Waal
Tilastokeskus (Statistics Finland) | Finland | Seppo Laaksonen
University of Jyvaeskylae | Finland | Pasi Koikkalainen
Swiss Federal Statistical Office | Switzerland | Beat Hulliger
Qantaris GmbH | Germany | Phil Kokic
Istituto Nazionale Di Statistica | Italy | Giulio Barcaroli
Statistics Denmark | Denmark | Peter Linde

Key issues

Imputation-based methods for dealing with incomplete or inconsistent data are used in virtually all National Statistics Institutes (NSIs), and in academic and business research. Currently, these methods are typically based on simple statistical ideas (e.g. nearest neighbours). Also, little is known about the comparative performance of each method, across the wide variety of data sources being used.
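As an illustration of the kind of simple method referred to above, the following minimal Python sketch shows nearest-neighbour donor imputation. The record layout, the field names and the function `nearest_neighbour_impute` are assumptions made for this example only; they are not taken from any Euredit software or from an NSI production system.

```python
import math


def nearest_neighbour_impute(records, target, predictors):
    """Fill missing `target` values (None) by copying the value of the
    closest complete record, using Euclidean distance on `predictors`."""
    donors = [r for r in records if r[target] is not None]
    if not donors:
        raise ValueError("no complete records available as donors")

    imputed = []
    for rec in records:
        if rec[target] is not None:
            imputed.append(dict(rec))  # keep observed values unchanged
            continue
        point = [rec[p] for p in predictors]
        donor = min(
            donors,
            key=lambda d: math.dist(point, [d[p] for p in predictors]),
        )
        imputed.append({**rec, target: donor[target]})
    return imputed


# Hypothetical toy survey records; None marks item non-response.
survey = [
    {"age": 25, "hours": 40, "income": 21000},
    {"age": 52, "hours": 38, "income": 34000},
    {"age": 27, "hours": 41, "income": None},
]
completed = nearest_neighbour_impute(survey, "income", ["age", "hours"])
print(completed[2]["income"])  # value copied from the closest donor
```

In practice the predictors would normally be scaled before distances are computed, and the choice of distance is itself one of the design decisions a comparative evaluation would need to examine.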

Recent advances in computing capabilities have made it possible to apply more complex statistical modelling techniques. The EUREDIT project will combine recent developments in statistics and computer science to develop and evaluate novel edit and imputation methodologies, focusing on the use of new statistical, neural network and related methods for edit and imputation in large-scale statistical data sets.

Eurostat recognises the importance of editing and imputation in the document "EPROS: European Plan for Research in Official Statistics" (7 April 1999). The Euredit project will address, in varying degrees, a number of the data editing and imputation research issues mentioned in that document, especially the following:

Techniques for determining how to limit selectively the edits that really affect data quality.
Generalisation and extension of algorithms, such as the Fellegi-Holt methodology, for automatic error detection and correction in a way that is suited for both continuous and categorical variables.
The development of general-purpose, application-independent macro-editing software.
Identification of the relative merits of different imputation methodologies.
Assessment of the robustness of different imputation methodologies and their effect on the outcomes of multivariate analyses of the imputed data.
The development of simultaneous imputation techniques that ensure consistency with all edit rules specified.
Use of knowledge discovery methods to better understand data sets in relation to background information.
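The Fellegi-Holt methodology mentioned in the list above rests on the principle of changing as few fields as possible so that a record satisfies all edit rules. The Python sketch below illustrates that principle by brute-force search over small categorical domains. It is not the Fellegi-Holt algorithm itself, which works with implied edits and set covering rather than enumeration. The function `localise_errors`, the variable names and the two edit rules are hypothetical.

```python
from itertools import combinations, product


def localise_errors(record, domains, edits):
    """Find a smallest set of fields whose values can be changed so that
    the record passes every edit. Returns (fields, corrected_record),
    or None if no combination of domain values satisfies the edits."""
    fields = list(domains)
    for size in range(len(fields) + 1):           # try fewest changes first
        for subset in combinations(fields, size):
            for values in product(*(domains[f] for f in subset)):
                candidate = {**record, **dict(zip(subset, values))}
                if all(edit(candidate) for edit in edits):
                    return set(subset), candidate
    return None


# Hypothetical categorical domains and edit rules (True means "passes").
domains = {
    "age_group": ["child", "adult"],
    "marital": ["single", "married"],
    "employed": ["no", "yes"],
}
edits = [
    lambda r: not (r["age_group"] == "child" and r["marital"] == "married"),
    lambda r: not (r["age_group"] == "child" and r["employed"] == "yes"),
]

suspect = {"age_group": "child", "marital": "married", "employed": "yes"}
changed_fields, corrected = localise_errors(suspect, domains, edits)
print(changed_fields, corrected)
```

Here a single change to `age_group` resolves both failed edits, whereas repairing `marital` and `employed` separately would need two changes. Enumeration of this kind grows exponentially with the number of fields, which is one reason generalised and more efficient algorithms are a research topic.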

 

Technical approach

The fundamental approach adopted in Euredit is to identify scientifically and technically sound, user-oriented criteria that enable a meaningful comparison of current methods and promising new methods for data editing and imputation. (Note: In Euredit, the term editing is taken to have the narrower meaning of error localisation, i.e. identifying doubtful or erroneous data values.)

Representative data sets arising in household surveys, business surveys, censuses, panel surveys, time series and business registers will be selected to provide sufficiently broad coverage of the range of error attributes, based on those listed below.

Attribute | Some possible instances of attribute
Type of error | Inconsistencies, missingness and amount of missingness, "outlyingness" and amount of "outlyingness"
Nature of error | Systematic, stochastic
Type of variable | Nominal, ordinal and continuous variables
Degree of non-response | Item non-response, unit non-response
Type of data set | Social surveys, business surveys, censuses, panel data, administrative registers

The integrity and validity of experimental work in Euredit will be achieved through the development of a methodological framework early in the project. This framework will prescribe a set of common experimental procedures to be followed in the EUREDIT project.

 

Expected achievements/impact

The provision of a standard collection of data-sets, presented both as "clean" data, and data with a broad range of error types. This will provide a single source for comparative studies of different edit and imputation techniques. 

No such compilation is currently available.

The definition of quality and evaluation criteria by which each technique may be judged, and the provision of a methodological framework within which the evaluation may take place.

Measurement of edit and imputation quality is currently an open research question. 

The adaptation and application of a diverse range of new methods (multi-layer perceptron, correlation matrix memory, self-organising maps, support vector machines) to data editing and imputation. These powerful techniques have been applied successfully in many other areas.
The development of new statistical techniques for multivariate edit and imputation based on application of outlier robust methodology to detection and modification of representative outliers in survey data.
The investigation of editing techniques that can handle mixed data types. Current techniques based on the Fellegi-Holt procedure are restricted to either qualitative or quantitative data, but not mixtures of both.
Development of fuzzy logic and non-parametric regression techniques for edit and imputation, particularly in the context of temporal (panel) data series.
An overall comparison of all methods evaluated in Euredit, identifying the weaknesses and strengths of each, with particular reference to error attributes.
The development of an overall framework which identifies recommended strategies for data editing and imputation, according to known or expected error attributes of the data set in question.
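The first two achievements above (data sets available both "clean" and with errors, and criteria for judging each technique) suggest a simple experimental loop: perturb clean data, apply a method, and compare the result with the known true values. The Python sketch below illustrates that loop under stated assumptions. Mean imputation stands in for the method under test, and mean absolute error is only one illustrative criterion. Neither is presented as a Euredit evaluation criterion, and all function names are hypothetical.

```python
import random
from statistics import mean


def inject_missingness(values, rate, seed=0):
    """Replace each value with None with probability `rate` (seeded)."""
    rng = random.Random(seed)
    return [None if rng.random() < rate else v for v in values]


def mean_impute(values):
    """Stand-in method under test: fill gaps with the observed mean."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def mean_absolute_error(truth, perturbed, imputed):
    """Average |true - imputed| over the positions that were missing."""
    errors = [
        abs(t - i)
        for t, p, i in zip(truth, perturbed, imputed)
        if p is None
    ]
    return mean(errors) if errors else 0.0


# Hypothetical "clean" variable and a fixed perturbation for illustration.
truth = [10.0, 12.0, 11.0, 15.0]
perturbed = [10.0, None, 11.0, None]
imputed = mean_impute(perturbed)
score = mean_absolute_error(truth, perturbed, imputed)
print(imputed, score)
```

Because the true values are known, the same perturbed data can be passed to several competing methods and the scores compared directly. `inject_missingness` shows how perturbed versions could be generated reproducibly from a clean data set.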

 

Coordinator Contact

John Charlton, Office for National Statistics, 1 Drummond Gate, London SW1V 2QQ.

Email: John.Charlton@ons.gov.uk

Acknowledgements

The Euredit project takes place with financial support from the IST Programme of the European Union.