Error and Bias in Geocoding School and Students' Home Addresses
Environ Health Perspect. doi:10.1289/ehp.11542 available via http://dx.doi.org [Online 28 July 2008]
Referencing: Error and Bias in Determining Exposure Potential of Children at School Locations Using Proximity-Based GIS Techniques
Zandbergen and Green (2007) recently described the effect of positional error on the distance between geocoded addresses and major roads, an often-used proxy for traffic-related exposures. They found a 200–500 m range of mean positional errors in their study of 126 Orange County, Florida, public school addresses, a somewhat higher range than that associated with geocodes assigned by four commercial vendors to a larger variety and number of street addresses in the 48 contiguous U.S. states (Whitsel et al. 2006). In both studies, however, the ranges exceeded commonly used thresholds for identifying those at greatest potential risk of traffic-related exposures, raising due cause for concern.
Zandbergen (2007) found that the use of such low thresholds to define traffic-related exposure surrogates leads to the consistent overestimation of the number of Orange County school children at risk. In this recent study (Zandbergen and Green 2007), the finding has been extended to the schools the children attend. To explain the overestimates, Zandbergen and Green illustrated the idiosyncratic positioning of schools and homes—both within land parcels and along street segments—and the uniformly higher percentage of false positive versus negative determinations of whether the geocoded locations were inside or outside the 50–1,000-m buffer radii examined in their studies.
The collective findings of Zandbergen and Green (2007) nonetheless differ from those based on a previously described 5% random sample of 2,608 street addresses from the Environmental Epidemiology of Arrhythmogenesis in WHI (EEAWHI) (Whitsel et al. 2006). In that study, we found that the fraction of participants' addresses determined to be < 100 m from the nearest highway was relatively constant across mean positional errors of 150–600 m, a finding driven by the counterbalance of approximately equal false positive and negative rates over the same range. The sensitivity and specificity of the 100-m threshold tested in EEAWHI—one-fifth the minimum distance to schools deemed acceptable by Zandbergen and Green—were also around 90% at positional errors of 250–300 m. Moreover, even when the sensitivity and specificity of the 100-m threshold exceeded 90%, its strength of association with coronary heart disease was still underestimated, albeit in the absence of confounding and under the assumption of nondifferential misclassification.
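The attenuation described above can be illustrated with a short calculation. The sketch below is not the EEAWHI analysis: the true odds ratio (2.0) and the true exposure prevalence among unaffected participants (20%) are hypothetical inputs chosen only for illustration, and the function names are ours. It assumes nondifferential misclassification, that is, the same sensitivity and specificity in affected and unaffected groups, and no confounding.

```python
def observed_prevalence(true_prev, sensitivity, specificity):
    """Apparent exposure prevalence after imperfect classification."""
    return sensitivity * true_prev + (1.0 - specificity) * (1.0 - true_prev)


def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)


def observed_odds_ratio(true_or, prev_unaffected, sensitivity, specificity):
    """Odds ratio that would be observed under nondifferential misclassification."""
    # True exposure prevalence among the affected, implied by the true odds ratio.
    odds_affected = true_or * odds(prev_unaffected)
    prev_affected = odds_affected / (1.0 + odds_affected)
    # The same sensitivity and specificity apply to both groups (nondifferential).
    obs_affected = observed_prevalence(prev_affected, sensitivity, specificity)
    obs_unaffected = observed_prevalence(prev_unaffected, sensitivity, specificity)
    return odds(obs_affected) / odds(obs_unaffected)


# Hypothetical inputs: true OR of 2.0, 20% of unaffected participants truly exposed,
# and a threshold with 90% sensitivity and 90% specificity.
or_obs = observed_odds_ratio(2.0, 0.20, 0.90, 0.90)
print(round(or_obs, 2))  # 1.65
```

Under these assumed inputs, a true odds ratio of 2.0 would be observed as roughly 1.65, even though the threshold classifies 90% of exposed and 90% of unexposed addresses correctly.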
It is tempting to generalize the magnitude of error and direction of bias observed by Zandbergen and Green (2007) to students' school and home addresses outside Orange County, or more generally to epidemiologic measures of environmental exposure–health outcome association, but the most prudent course of action may be to wait until the external validity of their potentially important findings is established.
The author declares he has no competing financial interests.
Eric A. Whitsel
Department of Epidemiology
School of Public Health
University of North Carolina, Chapel Hill
Chapel Hill, North Carolina
E-mail: ewhitsel@email.unc.edu
References
Whitsel EA, Quibrera PM, Smith RL, Catellier DJ, Liao D, Henley AC, et al. 2006. Accuracy of commercial geocoding: assessment and implications. Epidemiol Perspect Innov 3:8; doi:10.1186/1742-5573-3-8 [Online 20 July 2006].
Zandbergen PA. 2007. Influence of geocoding quality on environmental exposure assessment of children living near high traffic roads. BMC Public Health 7:37; doi:10.1186/1471-2458-7-37 [Online 16 March 2007].
Zandbergen PA, Green JW. 2007. Error and bias in determining exposure potential of children at school locations using proximity-based GIS techniques. Environ Health Perspect 115:1363–1370.
Geocoding School and Students' Home Addresses: Zandbergen Responds
Environ Health Perspect. doi:10.1289/ehp.11542R available via http://dx.doi.org [Online 28 July 2008]
Our research (Zandbergen and Green 2007) strongly suggests that the positional error in street geocoding is not random in direction and that the displacement along the street segment often occurs toward one side of the street because of incorrect address ranges in the street reference data. This "squeeze" effect is a common observation in geocoding using many different street data sets. The extent to which this occurs will vary among locales because of the varying quality of street reference data. The extent to which this introduces any bias into exposure assessments will vary with the specific pollution source being considered. Proximity to major roads with high traffic counts represents a particular case that is very much influenced by this effect, because many residential streets are perpendicular to major roads and address ranges often start at major roads. For other exposure scenarios, such as air pollution from industrial facilities, the "squeeze" effect will contribute to the overall positional error in geocoding and therefore to any misclassification, but it is much less likely to introduce bias.
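A minimal sketch may help to show how an incorrect address range produces the "squeeze" effect. Everything in it is hypothetical and chosen for illustration: a 400-m residential street that starts at a major road, 16 houses numbered 100 to 130 and assumed to be evenly spaced over the whole segment, and reference data that list the address range as 100–198. The function name and the 150-m buffer are likewise assumptions, not values taken from our studies.

```python
def interpolate_along_segment(house_number, range_low, range_high, seg_start, seg_end):
    """Place a house number on a street segment by linear interpolation."""
    fraction = (house_number - range_low) / (range_high - range_low)
    x = seg_start[0] + fraction * (seg_end[0] - seg_start[0])
    y = seg_start[1] + fraction * (seg_end[1] - seg_start[1])
    return (x, y)


# Hypothetical 400 m residential street that starts at a major road (x = 0).
SEG_START, SEG_END = (0.0, 0.0), (400.0, 0.0)
house_numbers = list(range(100, 132, 2))  # 16 houses: 100, 102, ..., 130

# Assumed true positions: houses evenly spaced over the whole segment.
n_houses = len(house_numbers)
true_x = [400.0 * i / (n_houses - 1) for i in range(n_houses)]

# The reference data claim a range of 100-198 although numbering stops at 130,
# so every house is interpolated into the first third of the segment.
geocoded_x = [
    interpolate_along_segment(n, 100, 198, SEG_START, SEG_END)[0]
    for n in house_numbers
]

BUFFER_M = 150.0  # hypothetical buffer around the major road
true_inside = sum(x <= BUFFER_M for x in true_x)
geocoded_inside = sum(x <= BUFFER_M for x in geocoded_x)
print(true_inside, geocoded_inside)  # 6 16
```

In this constructed case, 6 of the 16 houses are truly within the buffer, but all 16 geocoded locations fall inside it, because the overstated address range compresses every location toward the start of the segment.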
Whitsel et al. (2006) determined positional accuracy of geocoding by four commercial vendors through an empirical comparison of criterion locations and vendor-assigned coordinates. In the analysis of the effects of positional error on exposure classification, however, Whitsel et al. (2006) displaced address locations at random over a uniform distribution of the angle of displacement. This assumes there is no direction in the positional error and ignores the "squeeze" effect. Our studies show that the displacement of a street-geocoded location relative to the actual location of the residence is frequently along the street segment, and definitely not random in direction. For a large sample, the distribution of the direction of positional error may appear to be uniform because the directions of street segments often approximate a uniform distribution, unless the street segments follow a very strong grid pattern (e.g., Zimmerman et al. 2007). I therefore argue that the error propagation modeling used by Whitsel et al. (2006) substantially underestimates the effects of positional errors in geocoding on exposure classification for the particular scenario where exposure potential is determined on the basis of distance to major roads. Given the relatively complex nature of the spatial pattern in geocoding errors, we feel that determining misclassification based on actual geocoded locations is more reliable than employing simulated displacements.
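The contrast between the two error models can be sketched with a small simulation. This is a deliberately stylized illustration, not a reanalysis of either study: the major road is taken to be the line x = 0, every home is assumed to lie on a street perpendicular to it at a true distance of 0 to 1,000 m, the error magnitude is fixed at 150 m, and in the "squeezed" case every displacement is assumed to point along the street toward the major road, which is an extreme version of the effect. The function name and all parameter values are hypothetical.

```python
import math
import random


def simulate(n_homes, error_m, threshold_m, squeezed, seed=0):
    """Return (false_positives, false_negatives) for a distance-to-road threshold.

    Stylized geometry (an assumption): the major road is the line x = 0 and
    every home sits on a street perpendicular to it, 0 to 1000 m away.
    """
    rng = random.Random(seed)
    false_pos = 0
    false_neg = 0
    for _ in range(n_homes):
        true_dist = rng.uniform(0.0, 1000.0)
        if squeezed:
            # Displacement along the street, always toward the major road.
            geocoded_dist = max(0.0, true_dist - error_m)
        else:
            # Displacement in a uniformly random direction; only the component
            # perpendicular to the road changes the distance to it.
            angle = rng.uniform(0.0, 2.0 * math.pi)
            geocoded_dist = abs(true_dist + error_m * math.cos(angle))
        truly_near = true_dist < threshold_m
        coded_near = geocoded_dist < threshold_m
        if coded_near and not truly_near:
            false_pos += 1
        if truly_near and not coded_near:
            false_neg += 1
    return false_pos, false_neg


isotropic = simulate(20000, 150.0, 100.0, squeezed=False)
squeeze = simulate(20000, 150.0, 100.0, squeezed=True)
print("random direction (FP, FN):", isotropic)
print("along street (FP, FN):   ", squeeze)
```

Under these assumptions, the random-direction model yields both false positives and false negatives, whereas displacement toward the road yields only false positives and so inflates the number of homes classified as near the road.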
I agree, however, that care should be taken in generalizing the results from our studies, and we do not think the 250–500-m range is the lower limit of spatial epidemiologic analysis in general. Nonetheless, I challenge the commonly held assumption that positional errors in geocoding are relatively small, random in terms of their direction, and without positional bias.
Contrary to other forms of digital spatial data (e.g., land use, roads, census boundaries), geocoding results do not have an implicit scale, and hence the spatial resolution is not known without testing. Certainly, the scale of geocoded locations is not the same as the scale of the street reference data employed. The studies by Whitsel et al. (2006) and my own research represent the few attempts at determining the effective resolution of geocoding; that is, how reliable is spatial analysis of geocoding results at small distances? This effective resolution will depend on several factors, not the least of which is the variation across urban–rural gradients. For Orange County, Florida (Zandbergen 2007), I found that street geocoding of residential addresses using local street centerlines (1:5,000) resulted in a 90th percentile of the error distribution of 100 m. This corresponds very closely to the results of Cayo and Talbot (2003), who found a value of 96 m for urban areas and much larger values for suburban and rural areas. Based on this 90th percentile, typical street geocoding of residential addresses does not meet the positional accuracy standards for a 1:100,000 scale map based on the National Map Accuracy Standards (U.S. Bureau of the Budget 1947).
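The comparison with the National Map Accuracy Standards can be checked with simple arithmetic. The sketch below assumes the horizontal accuracy rule of the standards as commonly stated: no more than 10% of tested points may be in error by more than 1/30 inch at publication scale for maps at scales larger than 1:20,000, and by more than 1/50 inch for maps at 1:20,000 or smaller. The function and variable names are ours.

```python
INCH_M = 0.0254  # metres per inch


def nmas_horizontal_tolerance_m(scale_denominator):
    """Ground distance within which 90% of tested points must fall under NMAS."""
    # 1/30 inch for scales larger than 1:20,000, otherwise 1/50 inch.
    map_inches = 1.0 / 30.0 if scale_denominator < 20000 else 1.0 / 50.0
    return map_inches * scale_denominator * INCH_M


p90_error_m = 100.0  # 90th percentile of geocoding error reported for Orange County
tolerance_m = nmas_horizontal_tolerance_m(100000)
print(round(tolerance_m, 1), p90_error_m <= tolerance_m)  # 50.8 False
```

At 1:100,000 the tolerance is about 50.8 m on the ground, so a 90th percentile error of 100 m exceeds it by roughly a factor of two.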
Higher-quality street reference data are expected to improve the positional accuracy of geocoding results, primarily through improved address ranges. However, I argue that the linear interpolation algorithm used in street geocoding presents inherent limitations, resulting in data that are insufficient for many large-scale applications. Higher-accuracy alternatives will need to be considered, including address points. In the address-point data model, residences and other buildings are represented as single points, with a much greater positional accuracy than is achievable using street geocoding. For a review and comparison of methods, see Zandbergen (2008). Several other jurisdictions, including Australia, Canada, and the United Kingdom, have already developed national address-point databases. In the United States, address-point databases are currently limited to selected areas, but this is expected to change. Epidemiologic researchers who employ geocoding would benefit greatly from being aware of alternatives to traditional street geocoding, in particular when analysis at fine spatial scales is required.
The author declares he has no competing financial interests.
Paul A. Zandbergen
Department of Geography
University of New Mexico
Albuquerque, New Mexico
E-mail: zandberg@unm.edu
References
Cayo MR, Talbot TO. 2003. Positional error in automated geocoding of residential addresses. Int J Health Geogr 2:10; doi:10.1186/1476-072X-2-10 [Online 19 December 2003].
U.S. Bureau of the Budget. 1947. United States National Map Accuracy Standards. Washington, DC:U.S. Bureau of the Budget.
Whitsel EA, Quibrera PM, Smith RL, Catellier DJ, Liao D, Henley AC, et al. 2006. Accuracy of commercial geocoding: assessment and implications. Epidemiol Perspect Innov 3:8; doi:10.1186/1742-5573-3-8 [Online 20 July 2006].
Zandbergen PA. 2007. Influence of geocoding quality on environmental exposure assessment of children living near high traffic roads. BMC Public Health 7:37; doi:10.1186/1471-2458-7-37 [Online 16 March 2007].
Zandbergen PA. 2008. A comparison of address point, parcel and street geocoding techniques. Comput Environ Urban Syst 32(3):214–232.
Zandbergen PA, Green JW. 2007. Error and bias in determining exposure potential of children at school locations using proximity-based GIS techniques. Environ Health Perspect 115:1363–1370.
Zimmerman DL, Fang X, Mazumdar S, Rushton G. 2007. Modeling the probability distribution of positional errors incurred by residential address geocoding. Int J Health Geogr 6:1; doi:10.1186/1476-072X-6-1 [Online 10 January 2007].