The development and evaluation of new methods for editing and imputation

With financial support from the IST Programme of the European Union

Key action CPA4: New indicators and statistical methods

Total cost 3 627 844 Euro, Commission funding 2 100 000 Euro

Contract no IST-1999-10226

Duration: 1 March 2000 - 28 February 2003

Project main goals

	Develop a methodological evaluation framework and evaluation criteria for data editing and imputation.
	Produce a standard collection of data sets.
	Establish a baseline by evaluating currently used methods.
	Develop and evaluate a selected range of new techniques.
	Compare and evaluate the different methods and establish best methods for different types of data.
	Disseminate the best methods via a single computer package and publications.

Participating organisations

Participating organisation	Country	Contact-Person
Office for National Statistics (coordinator)	UK	John Charlton
Royal Holloway College Univ. London	UK	Alex Gammerman
University of Southampton	UK	Ray Chambers
University of York	UK	Jim Austin
The Numerical Algorithms Group Ltd	UK	Geoff Morgan
Centraal Bureau voor de Statistiek	Netherlands	Ton de Waal
Tilastokeskus (Statistics Finland)	Finland	Seppo Laaksonen
University of Jyvaeskylae	Finland	Pasi Koikkalainen
Swiss Federal Statistical Office	Switzerland	Beat Hulliger
Qantaris GmbH	Germany	Phil Kokic
Istituto Nazionale Di Statistica	Italy	Giulio Barcaroli
Statistics Denmark	Denmark	Peter Linde

Key issues

Imputation-based methods for dealing with incomplete or inconsistent data are used in virtually all National Statistics Institutes (NSIs), and in academic and business research. Currently, these methods are typically based on simple statistical ideas (e.g. nearest neighbours). Moreover, little is known about how the methods compare in performance across the wide variety of data sources in use.
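To make the "nearest neighbours" idea concrete, the following is a minimal sketch of nearest-neighbour donor imputation. It is an illustration only, not a method specified by the project: the function name, the record layout (a list of dictionaries with None marking a missing value) and the use of Euclidean distance on fully observed predictor variables are all assumptions made for this example.

```python
import math

def nearest_neighbour_impute(records, target, predictors):
    """Fill missing `target` values (None) by copying the value of the
    closest complete record (the "donor"), where closeness is the
    Euclidean distance on the `predictors` variables.

    Hypothetical helper for illustration; assumes the predictors are
    numeric and fully observed, and that at least one donor exists.
    """
    donors = [r for r in records if r[target] is not None]
    result = []
    for r in records:
        filled = dict(r)  # copy, so the input records are not modified
        if r[target] is None:
            point = [r[p] for p in predictors]
            donor = min(
                donors,
                key=lambda d: math.dist(point, [d[p] for p in predictors]),
            )
            filled[target] = donor[target]
        result.append(filled)
    return result

# Usage: the third record has no income; its nearest donor by age is the first.
survey = [
    {"age": 30, "income": 100},
    {"age": 50, "income": 200},
    {"age": 32, "income": None},
]
completed = nearest_neighbour_impute(survey, target="income", predictors=["age"])
```

In practice the choice of distance function and of the variables used for matching matters a great deal, which is one reason the comparative performance of such methods is hard to judge without a common test bed.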

Recent advances in computing capabilities have made it possible to apply more complex statistical modelling techniques. The EUREDIT project will combine recent developments in statistics and computer science to develop and evaluate novel edit and imputation methodologies, focusing on the use of new statistical, neural network and related methods for edit and imputation in large-scale statistical data sets.

Eurostat recognises the importance of editing and imputation in the document "EPROS: European Plan for Research in Official Statistics" (7 April 1999). The Euredit project will address, in varying degrees, a number of the data editing and imputation research issues mentioned in that document, especially the following:

	Techniques for determining how to limit selectively the edits that really affect data quality.
	Generalisation and extension of algorithms, such as the Fellegi-Holt methodology, for automatic error detection and correction in a way that is suited for both continuous and categorical variables.
	The development of general-purpose, application-independent macro-editing software.
	Identification of the relative merits of different imputation methodologies.
	Assessment of the robustness of different imputation methodologies and their effect on the outcomes of multivariate analyses of the imputed data.
	The development of simultaneous imputation techniques that ensure consistency with all edit rules specified.
	Use of knowledge discovery methods to better understand data sets in relation to background information.
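Several of the issues above refer to edit rules and to error localisation consistent with those rules. The sketch below illustrates the minimum-change idea usually associated with the Fellegi-Holt approach: find the smallest set of fields that can be changed so that a record passes every edit. It is a brute-force illustration for small categorical records, not the Fellegi-Holt algorithm itself, and the variable names, domains and edit rules are hypothetical.

```python
from itertools import combinations, product

def localise_errors(record, domains, edits):
    """Return (fields_to_change, repaired_record) for the smallest set of
    fields whose values can be altered so that all edits are satisfied.

    Brute-force search over field subsets of increasing size; only
    practical for a handful of categorical fields. Returns None if no
    assignment of values passes every edit.
    """
    fields = list(record)
    for k in range(len(fields) + 1):
        for subset in combinations(fields, k):
            for values in product(*(domains[f] for f in subset)):
                candidate = dict(record)
                candidate.update(zip(subset, values))
                if all(edit(candidate) for edit in edits):
                    return set(subset), candidate
    return None

# Hypothetical domains and edit rules (each edit returns True when satisfied).
domains = {
    "age_group": ["child", "adult"],
    "marital": ["single", "married"],
    "employed": ["no", "yes"],
}
edits = [
    lambda r: not (r["age_group"] == "child" and r["marital"] == "married"),
    lambda r: not (r["age_group"] == "child" and r["employed"] == "yes"),
]

record = {"age_group": "child", "marital": "married", "employed": "yes"}
changed, repaired = localise_errors(record, domains, edits)
```

Here both edits fail, yet changing the single field age_group is enough to satisfy them, so that field is the one flagged as most likely in error. Exhaustive search of this kind does not scale, and it does not handle continuous variables, which is part of the motivation for the research issues listed above.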

Technical approach

The fundamental approach adopted in Euredit is to identify scientifically and technically sound, user-oriented criteria that enable a meaningful comparison of current and promising new methods for data editing and imputation. (Note: In Euredit, the term editing is taken to have the narrower meaning of error localisation, i.e. identifying doubtful or erroneous data values.)

Representative data sets arising from household surveys, business surveys, censuses, panel surveys, time series and business registers will be selected to provide sufficiently broad coverage of the range of error attributes, based on those listed below.

Attribute	Some possible instances of attribute
Type of error	Inconsistencies, missingness and amount of missingness, "outlyingness" and amount of "outlyingness"
Nature of error	Systematic, stochastic
Type of variable	nominal, ordinal and continuous variables
Degree of non-response	item non-response, unit non-response
Type of data set	social surveys, business surveys, censuses, panel data, administrative registers

The integrity and validity of experimental work in Euredit will be achieved through the development of a methodological framework early in the project. This framework will prescribe a set of common experimental procedures to be followed in the EUREDIT project.

Expected achievements/impact

The provision of a standard collection of data sets, presented both as "clean" data and as data with a broad range of error types. This will provide a single source for comparative studies of different edit and imputation techniques.

No such compilation is currently available.

The definition of quality and evaluation criteria by which each technique may be judged, and the provision of a methodological framework within which the evaluation may take place.

Measurement of edit and imputation quality is currently an open research question.
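As an illustration of how a "clean" data set and its error-laden counterpart could be used together, the sketch below scores one edit and imputation run on a single numeric variable. The two measures shown (share of true errors that were flagged, and mean absolute difference between imputed and true values) are assumptions chosen for simplicity; they are not the evaluation criteria defined by the project, and the function and argument names are hypothetical.

```python
def evaluate_run(clean, perturbed, flagged, imputed):
    """Score one edit/imputation run against the known clean values.

    clean     -- true values
    perturbed -- values after errors or missingness (None) were introduced
    flagged   -- indices the editing step marked as doubtful
    imputed   -- final values after imputation
    """
    flagged = set(flagged)
    # A position is a true error if the perturbed value differs from the truth.
    true_errors = {i for i, (c, p) in enumerate(zip(clean, perturbed)) if c != p}
    detected = true_errors & flagged
    return {
        "detection_rate": len(detected) / len(true_errors) if true_errors else 1.0,
        "false_alarms": len(flagged - true_errors),
        "mean_abs_error": sum(abs(c - m) for c, m in zip(clean, imputed)) / len(clean),
    }

# Usage with made-up values: two true errors, one of which was flagged.
scores = evaluate_run(
    clean=[10, 20, 30, 40],
    perturbed=[10, 200, 30, None],
    flagged=[1, 2],
    imputed=[10, 22, 30, 36],
)
```

Such simple measures say nothing about, for example, whether distributions or multivariate relationships are preserved after imputation, which is why the choice of criteria needs a proper methodological framework.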

	The adaptation and application of a diverse range of new methods (multi-layer perceptron, correlation matrix memory, self-organising maps, support vector machines) to data editing and imputation. These powerful techniques have been applied successfully in many other areas.
	The development of new statistical techniques for multivariate edit and imputation based on application of outlier robust methodology to detection and modification of representative outliers in survey data.
	The investigation of editing techniques that can handle mixed data types. Current techniques based on the Fellegi-Holt procedure are restricted to either qualitative or quantitative data, but not mixtures of both.
	Development of fuzzy logic and non-parametric regression techniques for edit and imputation, particularly in the context of temporal (panel) data series.
	An overall comparison of all methods evaluated in Euredit, identifying the weaknesses and strengths of each, with particular reference to error attributes.
	The development of an overall framework which identifies recommended strategies for data editing and imputation, according to known or expected error attributes of the data set in question.

Coordinator Contact

John Charlton, Office for National Statistics, 1 Drummond Gate, London SW1V 2QQ.

Email: John.Charlton@ons.gov.uk

Acknowledgements

The Euredit project takes place with financial support from the IST Programme of the European Union.