This manuscript was prepared as part of the Environmental
Epidemiology Planning Project of the Health Effects Institute, September
1990 - September 1992.
This work was supported partially by grant CA-53996 from
the National Institutes of Health. The authors wish to thank Professor
John Tukey, the members of the Environmental Working Group, and a reviewer
for helpful suggestions.
Introduction
Nearly all study of the health consequences of environmental and lifestyle
exposures in human populations is purely observational. This means that
the validity of the comparison of disease rates between more exposed and
less or nonexposed persons is dependent on the assumption that disease rates
in the two groups are comparable in the absence of such exposure. This comparability
assumption can be weakened somewhat by the measurement and accommodation
of other factors that are associated with disease risk and that have a different
distribution in the compared exposure groups. If such confounding factors
are accurately measured and adequately acknowledged in the data analysis,
it is then sufficient that in the absence of the exposure of interest, the
groups being compared have common disease rates conditional on the values
of the confounding factors. Lack of validity (i.e., bias) in testing or
estimation can be expected if there are unidentified confounding factors,
if the recorded confounding factors are measured with error, or if the treatment
of individual confounding factors is inadequate (e.g., linear allowance
for confounders having effects that are substantially nonlinear). Bias also
can be introduced if the exposure variables of interest or the health effects
under study are not measured accurately. In practice these sources of bias
can be reduced, but it is unlikely that they will be completely eliminated.
The sources of bias mentioned here are the principal reasons why epidemiologic
cohort studies, among others, may yield inaccurate and conflicting results.
Concern about residual, uncontrolled confounding can never be completely
eliminated in any nonexperimental study. Hence, such studies are most reliable
for the detection of moderate to large health effects (e.g., increase in
disease incidence by a factor of two or more among highly exposed persons)
that are unlikely to be qualitatively affected by modest confounding. There
is also a strong role for the replication of results in diverse populations
that are presumed to have different potentials for severe confounding. It
is worth noting that experimental studies also have important practical
limitations in the context of environmental epidemiology. Data analysis
methods for cohort studies with accurate and complete assessment of exposure
variables, confounding factors, and potential health consequences are well
developed, as summarized in "Exposure-Response Estimation in Cohort Studies,"
below.
Case-control studies in which exposure and confounding factors are assessed
retrospectively are subject to all of the biases noted above, as well as
to recall bias, which occurs when diseased individuals (cases) and disease-free
individuals (controls) differentially recall their exposures, their confounders,
or their health outcome. Aggregate data studies, referred to later in this
paper as ecologic studies, attempt to relate the exposure and confounding
factor experience of groups to their corresponding disease rates. Such studies
may be subject to additional biases if the statistical model for the group
disease rate does not equal the average of valid disease rate models for
the individuals being aggregated.
Apparent disagreement between environmental epidemiologic studies can
also arise, not from bias, but from lack of power combined with attention
to point estimates rather than confidence limits. The ability to detect
an association between the levels of an exposure variable or exposure history
and the risk of a disease depends primarily on the observed number of disease
events in the study sample, on the range of exposures in the sample, and
on the strength of association between exposure and disease. The distribution
of exposures in the study cohort or in the cohort from which cases and controls
are selected for a case-control study also has important influences on study
power. While random measurement error in (univariate) exposure assessment
will not invalidate, under weak conditions, a test for the hypothesis that
no association between exposure and disease exists, test power may be reduced
considerably by such measurement errors. Also, estimates of dose-response
parameters may be substantially distorted (usually biased downward), including
the possibility of a loss of monotonicity of dose-response trends (1).
Thus, the proper analysis and interpretation of environmental epidemiologic
studies rely heavily on the investigator's assessment of the magnitude of
both potential biases and study power in the absence of such biases. For
practical reasons, the power of specific studies will often be rather low,
and knowledge of disease mechanisms and measurement properties will be too
limited to place useful bounds on potential biases. Hence, there are important
uses for formal tests of the equality of exposure-disease associations from
two or more studies in differing populations and for techniques used in
combining the results of several studies. This topic will be discussed in
the section titled "Comparing and Combining the Results of Several
Studies."
The following section describes statistical- and biological-based models
that can serve as the basis for exposure-disease analyses.
Models for Disease Occurrence
The simplest cohort studies occur when exposure takes place in one instant,
as in Japanese atomic bomb survivors, or is constant over the individual's
lifetime, as in some animal inhalation experiments. However, most exposures,
and most confounders, are complex functions of time and demand a more complicated
mathematical description. Our discussion of descriptive disease occurrence
models begins with the over-simplified case.
Let $\lambda_0(t)$ denote the instantaneous rate of occurrence of
a study disease or other health-related event for subjects of age t
who have not received the exposure of interest. This means that if N
such persons, all at age t, were observed for a short time dt,
the expected number of disease occurrences would be $N\lambda_0(t)\,dt$.
If a person of age t received an exposure Z, the instantaneous
occurrence rate would be altered from $\lambda_0(t)$ to $\lambda(t|Z)$, and
the (instantaneous) relative risk is $\lambda(t|Z)/\lambda_0(t)$.
These rates are nonnegative and, provided neither is zero, one can take
the logarithm of this relative risk. It is often convenient and useful to
assume the logarithm of the relative risk to be a linear function of exposure
and confounding factor measurements. This is equivalent to modeling the
relative risk as an exponential function, $\exp(x\beta)$, where the
vector $x = (x_1,\ldots,x_p)$, which replaces the more
general Z, consists of carefully chosen (and usually incomplete)
measures of exposure or confounding factors, with $x = (0,\ldots,0)$
corresponding to no exposure and standard values for confounders. The coefficients
$(\beta_1,\ldots,\beta_p)$, regression coefficients
that comprise the vector $\beta$ (or, more precisely, its transpose $\beta^T$),
then tell us about the impact of each $x_i$ on relative
risk when the other $x$'s are held fixed.
The result is a simple proportional hazards (or Cox) model
$\lambda(t|Z) = \lambda_0(t)\exp(x\beta)$ [1]
which is used widely in the analysis of failure time data (2).
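As a minimal numerical sketch of model [1], the following Python fragment evaluates the relative risk exp(xβ) and the corresponding disease rate for a covariate vector x. The function names and the coefficient values are hypothetical, chosen only for illustration, and are not taken from any study discussed here.

```python
import math

def relative_risk(x, beta):
    """Relative risk exp(x . beta) for covariate vector x and coefficients beta."""
    if len(x) != len(beta):
        raise ValueError("x and beta must have the same length")
    return math.exp(sum(xi * bi for xi, bi in zip(x, beta)))

def disease_rate(baseline_rate, x, beta):
    """Disease rate under model [1]: baseline rate times relative risk."""
    return baseline_rate * relative_risk(x, beta)

# Hypothetical coefficients: one exposure measure and one binary confounder.
beta = [math.log(2.0), 0.5]
print(relative_risk([0.0, 0.0], beta))       # reference subject
print(relative_risk([1.0, 0.0], beta))       # one unit of exposure, confounder at standard value
print(disease_rate(0.001, [1.0, 1.0], beta)) # rate per unit time for an exposed subject
```

With these illustrative values, a subject with x = (0, 0) has relative risk 1 by construction, and each unit of exposure multiplies the rate by about 2 whatever the value of the confounder, which is the sense in which the coefficients describe effects with the other covariates held fixed.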
In order to deal with complications inherent in most environmental epidemiologic
studies, one must generalize this discussion and complicate the appearance
of some formulae, but be careful not to change the essentials of the approach.
Such generalization follows in the next subsection.
Descriptive Relative-Risk Models
As above, let $\lambda_0(t)$ denote the instantaneous rate
of a study disease (or other health-related problem) at age t in
the absence of the exposure of interest. A person of age t may have
received exposures z(u) at certain ages u < t.
One can refer to $Z(t) = \{z(u), u < t\}$ as the person's exposure history up to age t. Furthermore,
one can allow the vector z(u) to include the values of confounding
factors at age u, so that Z(t) includes both exposure
and confounding factor histories up to age t. The disease rate at
age t is $\lambda\{t|Z(t)\}$, a function of this exposure
and confounder history. The relative risk associated with history Z(t)
is then the ratio $\lambda\{t|Z(t)\}/\lambda_0(t)$.
Because this ratio is nonnegative, it can be, and often is, modeled using
an exponential function $\exp\{x(t)\beta\}$, where
$x(t) = \{x_1(t),\ldots,x_p(t)\}$.
This function consists of data-analyst-defined functions of Z(t)
and t, with $x(t) \equiv (0,\ldots,0)$ again corresponding to
no exposure and standard confounder histories, while $\beta^T = (\beta_1,\ldots,\beta_p)$ is a corresponding
vector of relative-risk parameters to be estimated.
This relative risk (RR) regression model
$\lambda\{t|Z(t)\} = \lambda_0(t)\exp\{x(t)\beta\}$, [2]
also called the Cox regression model or (inaccurately) the proportional
hazards model (2-4), or an approximation to these models, forms
the basis for most descriptive analyses of environmental epidemiologic studies.
As a simple example to illustrate the notation, consider the relationship
between exposure to ionizing radiation and the rate of a certain cancer
in the atomic bomb-exposed populations in Hiroshima and Nagasaki. One could
define
$z(u_0)^T = \{z_1(u_0), z_2(u_0)\}$ [3]
as the gamma and neutron exposures for a person at age $u_0$
in 1945 when the exposure occurred and as $z(u) \equiv 0$ otherwise.
A specification $x(t) \equiv z(u_0)$ then
assumes a log-relative risk function that is linear in gamma and neutron
exposure levels. The regression model can be relaxed to allow, for example,
the relative risk to depend on age at exposure and time since exposure and
to allow for nonlinear dependencies of the log-relative risk on gamma and
neutron exposure.
As noted above, the histories of potential confounding factors can also
be included in Z(t), in which case $x(t)$ will include
functions of both the exposures of interest and other factors, while product
terms between the two will allow the relative risk associated with a given
exposure history to depend on the value of other variables. This allowance
is termed effect modification in epidemiologic parlance. Confounding factors
may also be controlled by means of stratification rather than, or in addition
to, regression modeling using the descriptive model
$\lambda\{t|Z(t)\} = \lambda_{0s}(t)\exp\{x(t)\beta\}$ [4]
where the baseline rate $\lambda_{0s}(\cdot)$ is allowed to vary
across a number of strata defined as functions of age (t) and confounding
factor values.
Relative-risk forms other than exponential also may be considered in
the above models. In particular, the linear form $1 + x(t)\beta$
often is felt to be theoretically and empirically more appropriate for certain
carcinogenic exposures and has been used widely in radiation literature,
sometimes with the addition of quadratic terms. Absolute rather than relative-risk
models, such as
$\lambda\{t|Z(t)\} = \lambda_{0s}(t) + x(t)\beta$, [5]
also have been used in modeling radiation effects, although there is
a consensus that such models generally do not fit well without the addition of
terms for the modifying effect of age at exposure and latency. Absolute-risk models may also
be useful for modeling certain rare diseases such as mesothelioma, for which
the baseline rate in the absence of asbestos exposure is virtually zero.
In all of these alternatives to the standard exponential relative-risk model,
estimates of the relative-risk parameters and baseline rates are often found
to have poor statistical properties. However, quite general programs that
use likelihood-based methods to obtain appropriate confidence limits (5)
are now available to fit a broad class of relative- and absolute-risk models
with combinations of linear and exponential terms.
Suppose that the regression vector $x(t)$ in the above unstratified
model consists only of functions of the exposure variable under study, and
let $\mathrm{pr}\{x(t)\}$ denote the probability density for value
$x(t)$ in the source population of the modeled regression vector.
In addition to estimating the relative-risk function, one may be interested
in the fraction of the disease incidence at age t that may be attributed
to exposure. If the disease rate for all study subjects was reduced to the
baseline rate $\lambda_0(t)$, then the overall incidence at
age t would be reduced by the attributable proportion
$AR(t) = \dfrac{\int \lambda_0(t)[\exp\{x(t)\beta\} - 1]\,\mathrm{pr}\{x(t)\}\,dx(t)}{\int \lambda_0(t)\exp\{x(t)\beta\}\,\mathrm{pr}\{x(t)\}\,dx(t)}$. [6]
A similar expression can be written for the attributable proportion under
the stratified relative-risk model.
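Because the baseline rate is common to the numerator and denominator of the attributable proportion, it cancels, and the expression reduces to (m - 1)/m, where m is the population average of the relative risk. The following Python sketch evaluates this for a scalar regression variable that takes finitely many values, so that the integrals become sums. The discreteness, the function name, and the numerical values are assumptions made for illustration.

```python
import math

def attributable_proportion(x_values, probs, beta):
    """Attributable proportion for a scalar regression variable with a discrete distribution.

    x_values: possible values of the regression variable
    probs:    their probabilities in the source population (must sum to 1)
    beta:     log relative risk per unit of the regression variable
    """
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Population average of the relative risk exp(x * beta).
    mean_rr = sum(p * math.exp(x * beta) for x, p in zip(x_values, probs))
    return (mean_rr - 1.0) / mean_rr

# Hypothetical population: 70% unexposed (x = 0), 30% exposed (x = 1), relative risk 3.
ar = attributable_proportion([0.0, 1.0], [0.7, 0.3], math.log(3.0))
print(round(ar, 3))  # 0.375
```

In this illustration the average relative risk is 0.7 + 0.3 x 3 = 1.6, so 0.6/1.6 = 37.5% of incidence would be attributed to exposure even though only 30% of the population is exposed.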
In some applications of these relative-risk models, it is convenient
to define the basic time variable t to be chronological time or time
from entry into a certain cohort rather than age, which is accommodated
through stratification or regression modeling. For example, in a cohort
study with covariate information collected at specified points in chronological
time, such a definition can help ensure comparability of the covariate (i.e.,
exposure and confounding) information on all study subjects at a given value
of t.
There are distinct advantages in using hazard rates or instantaneous
disease rates, $\lambda\{t|Z(t)\}$, in our formulae rather
than disease rates over some specified age or time period, in part because
the interpretation of these latter rates will depend on the duration of
the age period or time period in question, which will vary inevitably from
study to study. Nevertheless, in some studies one observes only whether
disease occurs in a certain time period rather than the actual times or
ages of disease occurrence. Let D = 1 denote disease occurrence during
a prescribed disease ascertainment period for a study and D = 0 denote
lack of occurrence. Ignoring issues such as competing risks and losses to
follow-up, one may choose to model the disease probabilities $\mathrm{pr}\{D = 1|Z(t_0)\}$
by an exponential-form odds-ratio model
in which
$\dfrac{\mathrm{pr}\{D = 1|Z(t_0)\} \,/\, \mathrm{pr}\{D = 1|Z(t_0) = Z_0\}}{\mathrm{pr}\{D = 0|Z(t_0)\} \,/\, \mathrm{pr}\{D = 0|Z(t_0) = Z_0\}} = \exp\{x(t_0)\beta\}$, [7]
where $Z(t_0)$ denotes a subject's exposure and
confounding factor history at age $t_0$ at the beginning
of the ascertainment period, and $Z_0$ denotes the standard,
or base, covariate history. This odds-ratio model can be rewritten as a
logistic regression model
$\mathrm{pr}\{D = 1|Z(t_0)\} = \dfrac{\exp\{\alpha(Z_0) + x(t_0)\beta\}}{1 + \exp\{\alpha(Z_0) + x(t_0)\beta\}}$, [8]
where the function $\alpha(Z_0)$ may, for example, be
defined to take value $\alpha_s$ whenever the study subject falls
in stratum s, which is defined as a function of potential confounding
factor values at $t_0$.
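The equivalence of the odds-ratio form and the logistic form can be checked numerically. The sketch below computes disease probabilities from a logistic model with a stratum-specific intercept and confirms that the odds ratio relative to the base covariate value equals the exponential of the regression term. The intercept, coefficient, and function names are hypothetical.

```python
import math

def disease_probability(alpha, x, beta):
    """Logistic model: pr{D = 1} for stratum intercept alpha and covariate vector x."""
    eta = alpha + sum(xi * bi for xi, bi in zip(x, beta))
    return 1.0 / (1.0 + math.exp(-eta))

def odds(p):
    """Odds corresponding to probability p."""
    return p / (1.0 - p)

alpha, beta = -3.0, [0.8]  # hypothetical stratum intercept and log odds ratio
p_exposed = disease_probability(alpha, [1.0], beta)
p_base = disease_probability(alpha, [0.0], beta)

# The two printed numbers agree up to floating-point error.
print(odds(p_exposed) / odds(p_base), math.exp(0.8))
```

The intercept cancels from the odds ratio, which is why the stratum parameters act only as location parameters and carry no information about the exposure effect.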
The above relative-risk and odds-ratio models are purely descriptive
models. Their application is intended as an aid for summarizing and displaying
aspects of large, complex data sets. In some situations, such as a regulatory
decision concerning the safe level of a certain exposure, it will be essential
to bring to bear any available biologic or mechanistic knowledge on the
inference problem. Such knowledge could be used, for example, to specify
a form for the relative risk at age t as certain elements of $x(t_0)$
approach zero, where these elements capture the dosage, duration, or other
aspects of the exposure in question. Similarly, knowledge or assumption
about the pertinent biological mechanisms could be used to derive models
for $\lambda\{t|Z(t)\}$ of forms other than those mentioned above.
The next subsection overviews two classes of carcinogenesis models that
have been proposed on mechanistic or biological grounds.
Mechanistic and Biologically Based Models
Efforts to describe a disease process in terms of deterministic or stochastic
models have focused mostly on models for the spread of infectious diseases
in a population and models for carcinogenesis. Some of the work on carcinogenesis
models, as outlined below, may be pertinent to other diseases.
Much of the early work on mathematical models for cancer was reviewed
in a classic paper by Armitage and Doll (6). Whittemore and Keller
(7) also provide a comprehensive review. A major contribution of
the Armitage and Doll paper is the use of the multistage model of carcinogenesis.
This model is based on the assumptions that cancer results from a single
cell line undergoing a series of discrete, heritable changes (e.g., point
mutations, chromosomal breaks or translocations, or other types of copying
errors) in a particular sequence, and the rates of such transitions do not
depend explicitly on age, although they may be affected by exposure to carcinogens
or by factors that modify the rate of cell division. As a consequence of
these and some additional assumptions, it can be shown that the age-specific
incidence rate is predicted to vary approximately as the (k-1)st
power of age, where k is the number of transitions required
(usually estimated to be about 5 to 7 for adult tumors). If a carcinogenic
exposure occurs at a constant rate over time, the incidence will vary approximately
as a polynomial function of dose rate of order equal to the number of dose-dependent
transitions. If exposure is instantaneous or varies over time, the incidence
rate will be modified by age at exposure and/or time since exposure, depending
upon which stage(s) is dose-dependent.
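The power-law prediction can be illustrated with a short calculation: if incidence is proportional to age raised to the power k - 1, then a plot of log incidence against log age is a straight line with slope k - 1, so the number of stages can be read off from the slope. The sketch below uses a hypothetical scale constant and an illustrative k = 6; it is not fitted to any data.

```python
import math

def multistage_incidence(age, k, scale=1e-12):
    """Approximate multistage incidence: scale * age**(k - 1). The scale is hypothetical."""
    return scale * age ** (k - 1)

def loglog_slope(age1, rate1, age2, rate2):
    """Slope of log(rate) against log(age) between two ages."""
    return (math.log(rate2) - math.log(rate1)) / (math.log(age2) - math.log(age1))

k = 6  # illustrative number of transitions
slope = loglog_slope(40.0, multistage_incidence(40.0, k),
                     70.0, multistage_incidence(70.0, k))
print(slope, slope + 1.0)  # slope is close to k - 1, so slope + 1 recovers k
```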
Until recently, most of the empirical tests of these predictions have
been done by fitting the model to aggregate data on population age-specific
rates, or to broadly grouped data on cohorts, stratified by dose, age at
exposure, or time since exposure. A problem with this approach is the difficulty
of separating the effects of dose rate, age at first exposure, duration
of exposure, time since last exposure, and attained age, all of which influence
the predictions of the model. Simple comparisons of one factor without controlling
the other factors can be misleading. This is less of a problem when animal
bioassay data are used, as these are usually limited to constant, lifetime,
dose regimens. However, such data are not informative about whether the
carcinogen acts at an early or late stage. Nevertheless, the approach has
been used for risk-assessment purposes by many regulatory agencies. The
default approach advocated by the U.S. Environmental Protection Agency (EPA)
and others involves fitting the multistage model to available epidemiologic
or toxicologic data and using an upper confidence limit on the estimated
slope coefficient (scaled for species differences in weight and life span)
to compute the lifetime excess risk in humans. The scientific and statistical
validity of this approach is controversial (8,9).
With the development of general relative-risk models ("Descriptive
Relative-Risk Models" above), it has become possible to test the multistage
and other models by fitting them directly to data on individuals. This offers
great advantages for dealing with time-dependent exposures, which are the
most informative about the stage at which a carcinogen acts. This approach
has been applied to data on occupational exposures to asbestos (10),
arsenic (11), and benzene (12); on the atomic bomb survivors
(13); and on smoking (14), with varying results. The three
occupational applications all were consistent with a single stage of action
(relatively late for asbestos and arsenic, early for benzene), while the
radiation and smoking data both showed signs of two stages being affected.
The multistage model has several important limitations, including its
inability to account for leukemia and childhood cancers, the genetics of
cancer, and the distinction between mechanisms of initiation and promotion.
It also has been criticized for its need for as many as 5 to 7 stages to
account for the steep age dependence, when only two or three have been established
in experimental systems. Moolgavkar and Knudson (15) have proposed
an alternative model that addresses these issues. This model assumes that
two mutational events are required and the cell lines that have experienced
the first event may be at a competitive advantage (proliferation) or disadvantage
(repair) relative to normal cells. Carcinogens might act by affecting either
mutation rates or proliferation rates. Major gene effects are accounted
for by assuming that individuals who inherit the gene begin life with all
cells in the intermediate stage. This model has been successful in fitting
epidemiologic data on smoking (15,16), breast cancer (17),
and radon (18). In the latter example, data from an experimental
study of rats exposed to radon were fitted to the model and radon was found
to have an effect on both the mutation and proliferation rates. However,
the interpretation of this result is complicated by the authors' use, for
both of these dependencies, of a power function dose-response relationship
with a very low exponent rather than a simple linear dose-response. Thomas
(19) has proposed a variant of this model that adds an additional
stage to the process to try to explain the difference in the modifying effect
of dose rate and the duration of exposure for different types of radiation;
so far, no attempt has been made to test this model.
With the rapid growth in our understanding of the fundamental biology
of cancer, further development of methods to validate these mechanistic
ideas and, where appropriate, to incorporate them into the analysis of epidemiologic
data would be worthwhile. Most of the models that have been considered seriously
are sufficiently general that some parameter values can be found to provide
an adequate fit to epidemiologic data sets. Thus, these models are not easily
falsified as a class, and it is unlikely that one could choose among them
on purely statistical grounds. Instead, their utility lies in the types
of comparisons that can be made within the context of a particular model--whether
a carcinogen acts at an early or a late stage in the multistage model or
as an initiator or a promoter in the two-stage model, for example. Their
real value, therefore, lies in their ability to organize a complex set of
hypotheses into a unified framework and to suggest empirical tests, in populations
of humans or animals, of mechanistic ideas suggested by observations at
the cellular level. Research efforts to identify and measure the assumed
biological entities on the pathway to cancer cell formation seem particularly
well motivated.
Exposure-Response Estimation in Cohort Studies
Relative-Risk and Odds-Ratio Estimation
Consider the unstratified relative-risk regression model of "Descriptive
Relative-Risk Models." A cohort study involves the selection of a sample of individuals
from the population under study, followed by follow-up to observe disease
occurrence. The relative-risk parameter $\beta$ can be estimated by maximizing
a partial likelihood function $L(\beta)$ that is a product, over all
disease occurrence times (ages) that appear in the sample, of the ratio of
the relative risk for the subject developing disease to the sum of the relative
risks for all subjects at risk at that time (20). The corresponding
likelihood function under the above stratified relative-risk model is simply
the product over strata of the stratum-specific likelihood functions. Note
that this estimation procedure is quite general in that exposure variables,
confounding variables, and stratum assignments each can vary with follow-up
time. The principal assumption underlying this estimation procedure requires
the set of subjects at risk for disease at any follow-up time to be representative
of the base population, conditional on the covariate history and stratum
assignment. This assumption will be satisfied, for example, if study subjects
are sampled randomly and independently from the study population, and if
rates of censoring (e.g., losses to follow-up) at a given follow-up time
depend at most on the covariate histories and stratum assignments at that time.
Also, under weak conditions, $L(\beta)$ can be manipulated as if
it were an ordinary likelihood function for asymptotic inference on $\beta$
(21,22). Various computer programs are available now for the
estimation of $\beta$, and, therefore, also of the relative-risk process
$\exp\{x(t)\beta\}$.
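The partial likelihood described above can be written out directly for a small data set. The following Python sketch assumes a single time-fixed covariate and no tied event times, and it replaces the Newton-type iterations used by standard software with a crude grid search. The four-subject cohort and all names are hypothetical.

```python
import math

def log_partial_likelihood(beta, times, events, x):
    """Log partial likelihood for one time-fixed covariate, assuming no tied event times."""
    total = 0.0
    for t_i, d_i, x_i in zip(times, events, x):
        if not d_i:
            continue  # censored subjects enter only through the risk sets
        # Sum of relative risks over subjects still at risk at time t_i.
        risk_sum = sum(math.exp(beta * x_j) for t_j, x_j in zip(times, x) if t_j >= t_i)
        total += beta * x_i - math.log(risk_sum)
    return total

# Hypothetical cohort: follow-up time, event indicator (1 = disease), exposure.
times = [1.0, 2.0, 3.0, 4.0]
events = [1, 1, 0, 1]
x = [1.0, 0.0, 1.0, 0.0]

# Crude grid search over beta in steps of 0.01.
grid = [i / 100.0 for i in range(-300, 301)]
beta_hat = max(grid, key=lambda b: log_partial_likelihood(b, times, events, x))
print(beta_hat)
```

For these particular data the log partial likelihood can be maximized by hand, giving a maximizer of (log 2)/2, or about 0.347, so the grid search should land on a neighboring grid point. Note that the baseline rate never appears in the calculation.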
The score statistic $U(\beta_0)$, defined as the value
at $\beta = \beta_0$ of the derivative with respect to $\beta$
of $\log L(\beta)$, can be used to test $\beta = \beta_0$.
If $\beta_0 = 0$ and $x(t)$ consists only of indicator
variables to distinguish exposure groups, then $U(\beta_0) = U(0)$
is known as the log rank statistic. Other choices of $x(t)$
yield other familiar censored data test statistics, including generalizations
of the Wilcoxon statistic.
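For a single exposure indicator, the score at zero has a simple form: at each disease occurrence time it adds the difference between the case's exposure indicator and the fraction of the risk set that is exposed, that is, observed minus expected exposed cases. The sketch below computes this quantity, assuming no tied event times; the data and function name are hypothetical, and the variance needed to standardize the statistic is omitted.

```python
def logrank_score(times, events, exposed):
    """Score U(0) for one exposure indicator: observed minus expected exposed cases."""
    u = 0.0
    for t_i, d_i, x_i in zip(times, events, exposed):
        if not d_i:
            continue  # censored subjects contribute only to risk sets
        # Exposure indicators of everyone still at risk at time t_i.
        at_risk = [x_j for t_j, x_j in zip(times, exposed) if t_j >= t_i]
        u += x_i - sum(at_risk) / len(at_risk)
    return u

# Hypothetical cohort: follow-up time, event indicator, exposure indicator.
times = [1.0, 2.0, 3.0, 4.0]
events = [1, 1, 0, 1]
exposed = [1, 0, 1, 0]
print(logrank_score(times, events, exposed))
```

Here the three event times contribute 1 - 2/4, 0 - 1/3, and 0 - 0/1, for a total of 1/6; a positive value indicates more disease among the exposed than expected under no association.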
Suppose now that there is no possibility of early censorship in the cohort
study throughout the follow-up. The odds-ratio parameter $\beta$ in the
logistic regression model of "Descriptive Relative-Risk Models," along with
the location parameters $\alpha_s = \alpha(Z_0)$,
can be estimated by maximizing a likelihood function $L(\beta)$ that is simply
the product over all study subjects of the logistic regression probabilities
$\mathrm{pr}\{D = 1|Z(t_0)\}$ for subjects developing
disease, and one minus such probabilities for other study subjects. Computer
programs are widely available for inference on $\beta$ from this likelihood
function. If there are few disease events in stratum s, it is preferable
to eliminate $\alpha_s$ by conditioning on the number of such events prior
to applying standard likelihood procedures for the estimation of $\beta$
(23).
The likelihood functions just described may seem esoteric to readers
not having a statistical background. The main point to note, however, is
that estimation of relative-risk and odds-ratio parameters in the very flexible
models of "Descriptive Relative-Risk Models" is now routine, and suitable
software is available. Of course, the odds-ratio parameter will approach
the relative-risk parameter if the disease acquisition period dt
becomes short. This occurs because the odds of disease,
$\mathrm{pr}\{D = 1|Z(t)\} / [1 - \mathrm{pr}\{D = 1|Z(t)\}]$, [9]
then typically approaches $\lambda\{t|Z(t_0)\}\,dt$,
from which the exponential-form odds ratio approaches a corresponding exponential-form
relative risk with identical regression parameter.
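This convergence can be seen numerically under the simplifying assumption, made here only for illustration, that the disease rate is constant over the ascertainment period. The probability of disease is then 1 - exp(-rate x dt) and the odds are exp(rate x dt) - 1. The baseline rate and relative risk below are hypothetical.

```python
import math

def odds_ratio(baseline_rate, rr, dt):
    """Odds ratio over a period of length dt, assuming constant rates within the period."""
    # Odds of disease are expm1(rate * dt) = exp(rate * dt) - 1 under constant rates.
    return math.expm1(rr * baseline_rate * dt) / math.expm1(baseline_rate * dt)

# Hypothetical baseline rate of 0.05 per year and relative risk of 2.
for dt in (10.0, 1.0, 0.1, 0.01):
    print(dt, odds_ratio(0.05, 2.0, dt))
```

With a relative risk of 2 the odds ratio equals exp(baseline rate x dt) + 1, which exceeds 2 for long periods and approaches 2 as the period shrinks.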
Estimation of the relative-risk regression parameter $\beta$ may be computationally
demanding if there are many distinct disease incidence times and if the
regression vector and stratum assignment depend on time. However, if each
$\lambda_{0s}(t)$ is defined to be constant within each element of a partition
of the time axis and $x(t)$ is restricted to be constant within
the elements of this partition, then $\beta$ can be estimated in a computationally
simple fashion using Poisson regression methods. See Preston et al. (24)
for application of such methods to radiation dose-response estimation from
the Hiroshima and Nagasaki cohorts.
Particular care is required if these estimation procedures are applied
to cohorts having few cases or if most cases occur within a small portion
of the overall range of exposures. Asymptotic formulae for interval estimation
on $\beta$ may then be inaccurate and more specialized procedures (e.g.,
resampling methods) may be required. In fact, there has been little study
of cohort data configurations under which such asymptotic formulae will
provide adequate approximations.
Kalbfleisch and Prentice (3) and Cox and Oakes (4) provide
detailed accounts of the theory and application of relative-risk regression
models.
Disease Rate Estimation and Graphical Models
Denote by
$\Lambda_{0s}(t) = \int_{t_{0s}}^{t} \lambda_{0s}(u)\,du$, [10]
the cumulative baseline disease rate in stratum s in the stratified
model of "Descriptive Relative-Risk Models" over the range of ages $t \ge t_{0s}$
represented in the cohort. A simple nonparametric
estimator of $\Lambda_{0s}(t)$ can then be defined as
the sum, over all disease occurrence times in stratum s up to age t, of the ratio
of the number of stratum s failures at that time to the sum of the relative risks
for all subjects at risk in stratum s at that time, with all relative
risks evaluated at that $\beta$ which maximizes $L(\beta)$.
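A minimal sketch of this cumulative baseline estimator for a single stratum and a single time-fixed covariate follows. The data and names are hypothetical, and the coefficient is supplied as an argument rather than estimated.

```python
import math

def cumulative_baseline(times, events, x, beta, t):
    """Nonparametric estimate of the cumulative baseline rate up to time t (one stratum)."""
    total = 0.0
    # Distinct disease occurrence times at or before t.
    event_times = sorted({t_i for t_i, d_i in zip(times, events) if d_i and t_i <= t})
    for t_i in event_times:
        n_fail = sum(1 for t_j, d_j in zip(times, events) if d_j and t_j == t_i)
        # Sum of relative risks over subjects at risk at t_i.
        risk_sum = sum(math.exp(beta * x_j) for t_j, x_j in zip(times, x) if t_j >= t_i)
        total += n_fail / risk_sum
    return total

# Hypothetical cohort: follow-up time, event indicator, exposure.
times = [1.0, 2.0, 3.0, 4.0]
events = [1, 1, 0, 1]
x = [1.0, 0.0, 1.0, 0.0]
print(cumulative_baseline(times, events, x, 0.0, 2.0))  # 1/4 + 1/3 when beta = 0
print(cumulative_baseline(times, events, x, 0.5, 2.0))  # smaller: exposed risk counts for more
```

When the coefficient is zero, every subject has relative risk 1 and the estimate is just the running sum of failures divided by numbers at risk.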
As with ordinary regression methods, model-checking procedures are important
to the application of relative-risk and odds-ratio models. Such procedures
naturally focus on the assumed relative-risk process, $\exp\{x(t)\beta\}$,
because other aspects of the model essentially are nonparametric. For example,
the postulated relative-risk function can be generalized by adding well-selected
additional elements to $x(t)$ and testing the hypothesis that
corresponding coefficients equal zero. Computationally feasible methods
also have been developed for approximating the influence of each study subject
or each age group on $\beta$-estimation, in order to highlight questionable
data points and to highlight vulnerabilities of the inference to model assumptions
(25). Graphical procedures are particularly useful. In addition to
the usual types of plots of influence (i.e., sensitivity) values and residuals,
plots of separate estimates of $\Lambda_{0s}(t)$ for subsets
of the cohort can provide useful visual checks on proportionality and other
relative-risk assumptions (3).
The fact that the baseline rates $\lambda_{0s}(\cdot)$ are unrestricted
is an important source of robustness with respect to $\beta$-estimation. Specifically,
relative-risk estimation is unlikely to be affected much if the intensity
of ascertainment of disease events in the cohort varies somewhat across
time or among strata. Similarly, location shifts in the modeled regression
vector $x(t)$ across different values of t
would not affect $\beta$-estimation in the exponential-form relative-risk
model. However, more general measurement error in the ascertainment of $x(t)$
may have a profound effect on relative-risk estimation.
Measurement Error in Exposure Variables and Confounding Factors
Epidemiologists have long recognized that errors in the measurement of
the study variables, including misclassification in the case of categorical
variables, can lead to biased tests and estimates of the associations under
study. Measurement error in the exposure histories or confounding factor
histories may be of particular importance in environmental epidemiologic
applications. Unfortunately, the methodology for avoiding bias due to measurement
error is still at a rudimentary stage of development.
Consider the unstratified relative-risk regression model of "Descriptive Relative-Risk
Models" and suppose that rather than the covariate history Z(t)
one observes an estimate W(t). The disease rate function at
age (or chronological time) t, given the observed covariate history
W(t), can then be written (26)
$\lambda\{t;W(t)\} = \lambda_0(t)\,E[\exp\{x(t)\beta\}|W(t)]$, [11]
where the expectation also is conditional on lack of disease occurrence
or censorship prior to t. In fact, this induced relative-risk model
also requires
$\lambda\{t;Z(t),W(t)\} = \lambda\{t;Z(t)\}$ [12]
so that W(t) is unrelated to disease risk, given the
true covariate history Z(t). Unfortunately, the expectation
in $\lambda\{t;W(t)\}$ generally depends on the baseline rates
$\lambda_0(u)$, $u \le t$, which complicates
the estimation. However, in cohort studies in which the cumulative probability
of disease occurrence is small, this dependence usually can be ignored and
estimation of $\beta$ can be based on a likelihood function in the form
described above upon specifying a measurement error distribution for $x(t)$
given W(t), from which $\lambda\{t;W(t)\}$ can
be calculated.
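To make the induced relative risk concrete, consider one special case that is assumed here purely for illustration and is not taken from the cited work: a scalar true exposure that is normally distributed in the population, an observed value equal to the true value plus independent normal error, and a disease rare enough that conditioning on disease-free survival can be ignored. The true exposure given the observed value is then normal, and the expectation of the exponential has a closed form. All names and parameter values in the sketch are hypothetical.

```python
import math

def induced_relative_risk(w, beta, mu_x, var_x, var_u):
    """E[exp(x * beta) | W = w] when x ~ N(mu_x, var_x) and W = x + N(0, var_u) error."""
    rho = var_x / (var_x + var_u)      # reliability of the observed exposure
    cond_mean = mu_x + rho * (w - mu_x)  # mean of true exposure given W = w
    cond_var = rho * var_u               # variance of true exposure given W = w
    # Moment generating function of a normal variable evaluated at beta.
    return math.exp(beta * cond_mean + 0.5 * beta ** 2 * cond_var)

beta, mu_x, var_x, var_u = 1.0, 0.0, 1.0, 1.0  # hypothetical values
rr_per_unit_w = (induced_relative_risk(1.0, beta, mu_x, var_x, var_u)
                 / induced_relative_risk(0.0, beta, mu_x, var_x, var_u))
print(math.log(rr_per_unit_w))  # close to beta * rho = 0.5
```

Under these assumptions the log relative risk per unit of observed exposure is the true coefficient multiplied by the reliability, so equal error and exposure variances halve the apparent effect.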
Specification of the distribution of $x(t)$ given W(t)
would seem to be a hazardous undertaking unless there is a subsample in
which both Z(t) and W(t) are available. In the
presence of such a validation sample, simultaneous inference on relative-risk
parameters and measurement error distribution parameters is possible (27),
though further development is necessary before such estimation can be viewed
as routine. More difficult issues arise if a true validation sample is not
available. A reliability sample, in which separate estimates W1(t)
and W2(t) of Z(t) are obtained on
a cohort subsample at two (or more) points in time, permits insight into
some aspects of measurement error distribution, but additional strong assumptions
are required for the estimation of ß.
Even if the exposures under study are precisely estimated and pertinent
confounding factors are identified, severe confounding may occur if confounding
factor histories are measured with error (28), as is obvious if one
considers an extreme situation in which measurement error produces a totally
useless confounding factor estimate. This bias is likely to be more acute
if the exposure and confounding factor values appearing in X(t)
are highly correlated.
A hypothetical cohort study of prenatal exposure to passive smoking in
relation to the risk of lower respiratory disease during the first 3 years
of life is used as an illustration by Morgenstern and Thomas (this volume). Any
elevation in the odds of lower respiratory disease among more heavily exposed
neonates may be severely attenuated by inaccuracies in exposure assessment
in such a study. An analysis that controls for passive smoke exposure during
the first 3 years of life, an exposure that would often be highly correlated
with prenatal exposure, may be dominated by measurement error and be totally
unreliable. A more practical illustration of the impact of measurement error
is seen in the analysis of the mortality rates of various cancers in relation
to gamma and neutron exposures in the Hiroshima and Nagasaki cohorts. Individual
exposure estimates were constructed based on each study subject's location
and shielding information as early as 1960. These estimates have continued
to be refined in succeeding years through the use of improved models for
the yields of the two bombs and more sophisticated models for the formation,
transmission, and attenuation of gamma and neutron radiation. Many of the
analyses of these cohorts simply combine gamma and neutron exposures into
a single total dose estimate. The corresponding cancer mortality analyses
have been affected somewhat by the changes in total dose estimates from
one dosimetry system to the next (e.g., in the magnitude of elevated relative
risks and the apparent shape of the dose-response curves), whereas analyses
that attempt to estimate simultaneously the effect of gamma and neutron
exposures on relative risk have been completely changed by dose estimate
modifications. This illustrates the difficulty of reliably estimating exposure-disease
associations when there are two or more exposure variables that are each
measured with error (random or systematic) or, analogously, when there are
exposure and confounding variables each measured with error. Very similar
issues arise in epidemiologic studies of nonenvironmental factors; for example,
they arise in nutritional epidemiology in attempts to separate the effects of
fat and calories on cancer risk, or to separate the effects of types of
fat by degree of saturation on cancer risk (29).
Some recent work has concentrated on developing methods to adjust associations
for the effects of measurement errors when their distributions are known.
A very general framework for attacking this problem has been outlined by
Clayton (30), who specifies the problem in terms of component models:
the disease model describes the dependence of disease risk on true exposures
and other factors; the measurement error model describes the relationship
between true and measured exposures and any modifying factors; and the exposure
model describes the population distribution of true exposures. These three
models are combined in a maximum likelihood framework, and approaches to
estimating the parameters of the disease model are described. Unfortunately,
the approach is mathematically intractable in its general form, but useful
progress has been made in some special cases. For categorical variables,
Greenland and Kleinbaum (31) described a method based on applying
the inverse of a matrix of known misclassification rates to the subject
counts by measured exposure and disease classifications. Hui and Walter
(32) have considered the case in which replicate measurements of
exposure are available, and they use a form of log-linear model for the
resulting four-way contingency table (counting true exposure as an unobserved
dimension). For continuous variables, Prentice (26), Pierce et al.
(33), Whittemore (34), Sposto et al. (35), and others
have discussed approaches that replace the measured doses with empirical
Bayes estimates of the true dose and use these in standard analyses. For
a general review of these approaches, see Armstrong (36) and Thomas
et al. (37). Another recent development involves combining nonparametric
density estimation techniques with a computational device known as Gibbs
sampling to overcome the tractability problems in the Clayton approach and
avoid the need for parametric assumptions about the distribution of true
doses. This method has been applied to data on studies of leukemia and thyroid
disease in Utah residents downwind of the Nevada Test Site (38,39).
These approaches are in an early stage of development, but they offer the
prospect of removing the bias due to misclassification, correcting the shapes
of dose-response curves, adjusting for covariates, and examining interaction
effects, all while allowing for the additional variability due to uncertainties
in exposure estimates. Further developments along these lines are highly
desirable.
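The matrix approach for categorical exposures mentioned above can be sketched for the simplest case of a binary exposure. This is a sketch in the spirit of that approach, not necessarily the exact formulation of the cited authors. It assumes nondifferential misclassification with known sensitivity and specificity. The function names and all counts are hypothetical.

```python
def correct_counts(obs_exposed, obs_unexposed, sensitivity, specificity):
    """Recover expected true exposure counts from misclassified counts.

    Observed counts are modeled as M applied to the true counts, with
        M = [[sensitivity,     1 - specificity],
             [1 - sensitivity, specificity    ]],
    so the true counts are obtained by applying the inverse of M.
    """
    det = sensitivity + specificity - 1.0  # determinant of M
    if abs(det) < 1e-12:
        raise ValueError("misclassification matrix is singular")
    true_exposed = (specificity * obs_exposed
                    - (1.0 - specificity) * obs_unexposed) / det
    true_unexposed = (sensitivity * obs_unexposed
                      - (1.0 - sensitivity) * obs_exposed) / det
    return true_exposed, true_unexposed


def corrected_odds_ratio(case_counts, control_counts, sensitivity, specificity):
    """Odds ratio after correcting cases and controls with the same error rates."""
    a, b = correct_counts(case_counts[0], case_counts[1], sensitivity, specificity)
    c, d = correct_counts(control_counts[0], control_counts[1], sensitivity, specificity)
    return (a * d) / (b * c)


if __name__ == "__main__":
    # Invented counts: (measured exposed, measured unexposed).
    cases, controls = (90, 110), (55, 145)
    observed_or = (cases[0] * controls[1]) / (cases[1] * controls[0])
    adjusted_or = corrected_odds_ratio(cases, controls, sensitivity=0.8, specificity=0.9)
    print(observed_or, adjusted_or)
```

The invented counts were built from a true table with odds ratio 3 (100 and 100 cases, 50 and 150 controls); the observed odds ratio is attenuated to about 2.16, and the correction recovers 3. The sketch treats the error rates as known, which is exactly the limitation taken up in the following paragraph.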
Most of the literature on correcting for measurement errors has assumed
that the misclassification rates were known and were constant across subjects.
In practice, only estimates of these error distributions are available,
either from earlier validation studies, from replicate measurements, from
gold standard measurements on a subset of the subjects, or from theoretical
uncertainty analysis. Methods need to be developed to account for uncertainties
in the estimates of these misclassification rates (40). As a design
issue, the optimal allocation of resources between high-quality measurements
on a subset and larger numbers of approximate measurements should be considered
(41,42). A unique aspect of the Utah fallout studies is the
availability of individual-specific uncertainty estimates based on elaborate
sensitivity analyses of the exposure pathways. This has allowed subjects
with more precise exposure estimates to be given heavier weight in the analysis.
Whether such efforts are warranted in terms of improved precision needs
to be considered.
In summary, covariate measurement errors can severely bias the results
of environmental epidemiologic studies. Improved analytic methods for accommodating
random, nondifferential covariate measurement errors are required. Such
methodologic developments might naturally focus on the potential for obtaining
a true validation sample, on validation study design, and on the incorporation
of validation study data in the overall estimation procedure (27).
Exposure-Response Estimation Under Case-Control and Other Sampling Procedures
Relative-risk and odds-ratio estimation often can be carried out more
economically by sampling only subjects developing the study disease (the
cases) or a random sample thereof, along with a suitably matched sample
of subjects without disease (the controls). Typically covariate histories
Z(t), where t is the age (time) of case or control
ascertainment, then have to be obtained retrospectively.
Consider the stratified relative-risk model of "Relative-Risk Models"
and suppose that each case has one or more randomly selected controls that
are matched on age at ascertainment (t) and stratum (s). Given
the covariate histories {Z1(t), ..., Zm(t)}
for a case and its (m-1) age- and stratum-matched controls, the probability
that exposure history Z1(t) corresponds to the
case is simply the relative risk at t for the case divided by the
sum of such relative risks for the m matched subjects (including
the case). Hence, the relative-risk parameter $\beta$ can be estimated by
maximizing the likelihood function $L(\beta)$, which is formed by
multiplying these ratios for all matched case-control sets (43).
To avoid strict matching on (t,s), relaxations of this sampling
scheme are possible.
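The matched-set likelihood just described can be sketched as follows, assuming an exponential-form relative risk and a single scalar covariate per subject. The function names, search interval, and data are hypothetical, and the one-dimensional golden-section search merely stands in for whatever optimizer would be used in practice.

```python
import math


def conditional_log_likelihood(beta, matched_sets):
    """Log of the product, over matched sets, of the case's relative risk
    divided by the sum of relative risks in the set.

    matched_sets: list of (case_x, [control_x, ...]) with scalar covariates.
    """
    total = 0.0
    for case_x, control_xs in matched_sets:
        etas = [beta * case_x] + [beta * x for x in control_xs]
        top = max(etas)  # subtract the maximum for numerical stability
        log_denominator = top + math.log(sum(math.exp(e - top) for e in etas))
        total += beta * case_x - log_denominator
    return total


def maximize_beta(matched_sets, lo=-10.0, hi=10.0, tol=1e-8):
    """Golden-section search for the maximizer; the log-likelihood is concave in beta."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc = conditional_log_likelihood(c, matched_sets)
    fd = conditional_log_likelihood(d, matched_sets)
    while b - a > tol:
        if fc > fd:  # the maximum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = conditional_log_likelihood(c, matched_sets)
        else:  # the maximum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = conditional_log_likelihood(d, matched_sets)
    return 0.5 * (a + b)


if __name__ == "__main__":
    # Invented 1:1 matched pairs with a binary exposure (1 = exposed).
    pairs = ([(1.0, [0.0])] * 30 + [(0.0, [1.0])] * 10
             + [(1.0, [1.0])] * 20 + [(0.0, [0.0])] * 15)
    print(maximize_beta(pairs), math.log(30 / 10))
```

For 1:1 matching with a binary exposure, only the discordant pairs contribute, and the maximizer is the log of their ratio, here log(30/10); the search reproduces that value. If the case always has the largest covariate value in its set, no finite maximizer exists and the search simply returns a value near the upper bound.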
Similarly, the odds-ratio parameter $\beta$ in the logistic regression
model of "Relative-Risk Models" can be estimated under case-control
sampling by maximizing the resulting logistic regression likelihood function
by acting as though a prospective study had been conducted, though the estimates
of *s no longer reflect disease incidence probabilities (23).
In fact, the baseline rates
$\lambda_{0s}(\cdot)$ and *s
in the relative-risk and odds-ratio regression models of "Relative-Risk
Models" cannot be identified from case-control data in the absence
of additional information on case and control sampling fractions.
In general, relative-risk and odds-ratio parameter estimates from case-control
studies will be subject to the same biases as cohort studies. They also
may be subject to recall bias if exposures or other covariate histories
are differentially recalled by cases and controls or if they involve measurements
that are affected by disease occurrence or its sequelae. There are often
various practical steps that can be taken to minimize bias in ascertaining
the covariate histories Z(t) (e.g., interviewers blinded to
case or control status), but usually it is not possible to identify residual
recall bias because the requirement to obtain prediagnosis and postdiagnosis
covariate histories on a sufficient sample of cases would often eliminate
much of the efficiency of the case-control design.
As with the cohort study design, nondifferential measurement errors lead
to the expectation
$$E[\exp\{X(t)\beta\} \mid W(t)], \qquad [13]$$
where $X(t)$ is the true and W(t) is the measured regression
vector at age t, as the identifiable relative-risk function under
age- and stratum-matched case-control sampling. To the extent that a representative
validation sample can be ascertained retrospectively, there will be a potential
to conduct valid relative-risk estimation from this type of study without
making further assumptions.
A case-cohort (case-base) sampling procedure can also be considered as
a means of reducing the cost or simplifying the logistics of a cohort study.
With this design, covariate histories Z(t) are assembled only
for cases and a (stratified) random sample of the study cohort. This sampling
procedure has advantages if several end points (diseases) are to be studied
in relation to an exposure. Also, the subcohort may be used to monitor exposures
and other variables during the study's follow-up. However, estimation may
be less efficient than estimation based on a case-control study with a comparable
number of study subjects if cases and subcohort members are not well matched
(44,45), and recall bias typically will be an issue. Prentice
(46) has developed a procedure for estimating the relative-risk and
odds-ratio parameters from case-cohort samples, and, in contrast to case-control
sampling, baseline rates also can be estimated without external information.
Comparisons and refinements of these sampling procedures are worthwhile
research activities. Note also that the use of so-called two-stage designs
(47,48) can lead to further valuable efficiency gains in some
case-control study applications.
Exposure-Response Estimation in Aggregate Data (Ecologic) Studies
As discussed previously, sometimes it will be economical and convenient
to examine an exposure-disease association by relating the disease rates
among several groups of individuals to aspects of the exposure experience
of each group. Such studies can be referred to as aggregate data studies
since they involve the disease rates and exposures for the aggregate, rather
than for individuals. These studies also are commonly referred to as ecological
studies since groups having differing exposure histories are sometimes defined
on an ecologic or geographic basis.
Denote by $\lambda_{ki}(t)$ the age- and sex-specific
disease rate in the kth group during (chronological) time
period t. A multiple group study involves the analysis of estimates
of $\lambda_{ki}(t)$, $k = 1, \ldots, K$ during
a fixed time period; a time trend study involves estimates of $\lambda_{ki}(t)$,
$t = 1, \ldots, T$ in a single population, while a mixed study
involves estimates of $\lambda_{ki}(t)$ at several values
of both k and t. An exponential-form relative-risk model for
$\lambda_{ki}(t)$ can be written, in the notation of "Relative-Risk
Models," as
$$\lambda_{ki}(t) = \lambda_{k0}(t)\exp\{X_{ki}(t)\beta\}, \qquad [14]$$
from which the average disease rate $\lambda_k(t)$
for the $n_k(t)$ individuals in group k during
time period t is
$$\lambda_k(t) = \lambda_{k0}(t)\left[\sum_{i=1}^{n_k(t)} \exp\{X_{ki}(t)\beta\} \Big/ n_k(t)\right]
= \lambda_{k0}(t)\exp\{\bar{X}_k(t)\beta\}\left[\sum_{i=1}^{n_k(t)} \exp\{d_{ki}(t)\beta\} \Big/ n_k(t)\right], \qquad [15]$$
where
$$\bar{X}_k(t) = n_k^{-1}(t) \sum_{i=1}^{n_k(t)} X_{ki}(t) \qquad [16]$$
and
$$d_{ki}(t) = X_{ki}(t) - \bar{X}_k(t). \qquad [17]$$
Let yk(t) denote the observed age- and sex-specific
disease incidence rate in group k during time period t, as
may be available from a disease register or other administrative source.
From the above expression for $\lambda_k(t)$, one expects
a regression of log yk(t) on $\bar{X}_k(t)$
for various values of k or t (or both) to yield biased estimates
of the relative-risk parameter $\beta$, because of the influences of the
residuals dki(t), even if the logarithms of the
baseline rates $\lambda_{k0}(t)$ can be regarded as independent
random variables with a common mean. This specification bias will be small
if the dki(t) values are small, that is, if the
exposure and other regression variables have little variation within groups.
Such bias presumably can be reduced by extending the regression equation
to include averages of squares and of higher powers of the dki(t)
terms, though there does not appear to have been specific study of this
approach. A closely related approach would replace the exponential-form
relative-risk model by a linear-form model, so that
$$\lambda_{ki}(t) = \lambda_{k0}(t)\{1 + X_{ki}(t)\beta\} \qquad [18]$$
and
$$\lambda_k(t) = \lambda_{k0}(t)\{1 + \bar{X}_k(t)\beta\}, \qquad [19]$$
from which the regression of yk(t) on $\bar{X}_k(t)$,
under certain random-effects assumptions on the baseline rates {$\lambda_{k0}(t)$},
will yield valid estimates of the linear relative-risk parameters (49).
Note, however, that an exponential-form relative-risk model often might
be more parsimonious than a linear-form model in environmental epidemiologic
applications so that the regression vector in a linear relative-risk model
may need to be quite lengthy and involve, for example, the average of product
terms between exposure and potential confounding factors in order to adequately
describe the data. In a multigroup study, it may be sensible to assume the
$\lambda_{k0}(t)$ terms are independent random variables with a
common mean for $k = 1, \ldots, K$, though it often may be useful
to allow for the possibility of correlation among groups in a similar geographic
area. In time-trend and mixed studies, however, it will typically be essential
to model, or otherwise accommodate, the correlation structure among $\lambda_{k0}(t)$,
$t = 1, \ldots, T$ at any fixed k. Inadequate modeling
of the {$\lambda_{k0}(t)$} may lead to aggregation bias. These
types of data analysis methods have received very little attention in the
scientific literature and constitute an important gap in the collection
of methods pertinent to environmental epidemiologic applications.
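The factorization in expression [15], and the specification bias it implies, can be illustrated numerically. The sketch below assumes a scalar exposure, a common baseline rate across groups, and error-free group rates. The group means, within-group spreads, and function names are all invented for illustration. In this constructed example the bias arises because the within-group spread increases with the group mean, so the bracketed correction factor in [15] is correlated with the group average.

```python
import math


def group_average_rate(baseline, xs, beta):
    """First form in [15]: baseline rate times the mean of exp(x * beta)."""
    return baseline * sum(math.exp(x * beta) for x in xs) / len(xs)


def factored_rate(baseline, xs, beta):
    """Second form in [15]: baseline * exp(xbar * beta) * mean of exp(d * beta)."""
    xbar = sum(xs) / len(xs)
    correction = sum(math.exp((x - xbar) * beta) for x in xs) / len(xs)
    return baseline * math.exp(xbar * beta) * correction


def ols_slope(u, v):
    """Least-squares slope from regressing v on u."""
    ubar = sum(u) / len(u)
    vbar = sum(v) / len(v)
    sxy = sum((a - ubar) * (b - vbar) for a, b in zip(u, v))
    sxx = sum((a - ubar) ** 2 for a in u)
    return sxy / sxx


def ecologic_slope(beta, group_means, group_sds, baseline=0.01):
    """Slope from regressing log group rates on group mean exposures."""
    z_grid = [-1.5, -0.5, 0.5, 1.5]  # fixed, mean-zero within-group deviations
    log_rates = []
    for mean, sd in zip(group_means, group_sds):
        xs = [mean + sd * z for z in z_grid]
        log_rates.append(math.log(group_average_rate(baseline, xs, beta)))
    return ols_slope(group_means, log_rates)


if __name__ == "__main__":
    means = [0.0, 1.0, 2.0, 3.0, 4.0]
    no_spread = ecologic_slope(1.0, means, [0.0] * 5)
    growing_spread = ecologic_slope(1.0, means, [0.0, 0.2, 0.4, 0.6, 0.8])
    print(no_spread, growing_spread)
```

With no within-group variation the aggregate regression returns the individual-level parameter; with spread that grows with the mean, the slope exceeds it. If the spread were the same in every group, the correction factor would be constant and would be absorbed into the intercept in this noise-free setting.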
Aggregate data studies involving the simple linear regression of disease
rates or the logarithm of disease rates on average exposures and average
values of potential confounding factors can often be conducted quickly and
cheaply and can play a useful role in hypothesis generation. It is obvious,
however, that more comprehensive data sources and more sophisticated data
analyses typically will be required if aggregate data studies are to contribute
reliably to the identification and estimation of exposure-disease associations.
Better data could come from randomly sampling each of the compared groups
in order to obtain estimates of $\bar{X}_k(t)$ of acceptable
precision for use in a linear relative-risk model or to obtain estimates
of the average of $\exp\{X_{ki}(t)\beta\}$, $i = 1, \ldots, n_k(t)$, for use in an exponential relative-risk
model. Random measurement error in the ascertainment of individual exposure
and confounding factors could substantially affect survey design. Better
data analyses may arise from the application of so-called marginal methods
(50,51) to mean and covariance models for the set of yk(t)
or log yk(t) values being analyzed.
Most effort to date concerning aggregate data studies has been directed
to identifying the biases that may arise from aggregation, confounding,
and other sources (52,53). It seems timely to direct a major
effort to the development of procedures to prevent (or greatly reduce) such
biases and, hence, to evaluate whether aggregate data studies can play a
more fundamental and useful role in environmental epidemiologic studies
and in epidemiologic research more generally.
Comparing and Combining the Results of Several Studies
Studies of a certain exposure-disease association may, for a variety
of practical reasons, be lacking in power, and they may be subject to biases
that can differ according to the population under study, the type of study
design, and the rigor of the investigation. It follows that tests of agreement
among the results of various studies and the formal combining of results
from pertinent studies can play an important role in an overall exposure-disease
association assessment.
Under ideal conditions, each of the types of studies described above
can yield a valid estimate $\hat\beta$ of the logarithm of the relative
risk associated with a specified exposure history, as well as an estimate
$\hat\sigma^2$ of its variance. The logarithm is used here, because its
estimate is likely to adhere more closely to a normal distribution (with
mean $\beta$) than the estimate $e^{\hat\beta}$ of
the relative risk itself. Suppose m independent studies yield (scalar)
log-relative risk estimates $\hat\beta_1, \ldots, \hat\beta_m$ with
corresponding variance estimates $\hat\sigma_1^2, \ldots, \hat\sigma_m^2$. Then
$$\tilde\beta = \sum \hat\sigma_i^{-2}\hat\beta_i \Big/ \sum \hat\sigma_i^{-2} \qquad [20]$$
estimates a weighted mean of the $\beta_i$'s, which reduces
to a common $\beta$ if all $\beta_i$'s are identical. To
obtain the most stable estimate of this common mean, one can follow developments
arising from Cochran's (54) introduction of partial weighting, thereby
avoiding weights $\hat\sigma_i^{-2}$ that, relative to $\sigma_i^{-2}$, may be too small.
If all the $\beta_i$'s are the same and the $\hat\beta_i$
are independent and normally distributed, then
$$X^2 = \sum_{i=1}^{m} \hat\sigma_i^{-2}(\hat\beta_i - \tilde\beta)^2 \qquad [21]$$
will have a chi-square distribution with m-1 degrees of freedom,
thereby giving a simple test of "all $\beta_i = \beta$"
(assuming each $\hat\beta_i$ is distributed normally).
If the $\beta_i$'s are not identical, then a t-procedure
can be used to set confidence limits for the weighted mean
$$\bar\beta = \sum \hat\sigma_i^{-2}\beta_i \Big/ \sum \hat\sigma_i^{-2}. \qquad [22]$$
Confidence limits on $\bar\beta$ are approximately
$$\tilde\beta \pm t_\nu \left(\sum \hat\sigma_i^{-2}\right)^{-1/2}, \qquad [23]$$
where $t_\nu$
is a critical value of t on $\nu$ (somewhat
less than m) degrees of freedom. These limits are often conservative,
particularly when the $\hat\beta_i$ follow longer-tailed
distributions.
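The inverse-variance combination in [20], the heterogeneity statistic in [21], and the limits in [23] can be sketched as follows. The study estimates and variances are invented, the function names are hypothetical, and the critical value is passed in by the caller because the text does not specify how the degrees of freedom are to be chosen.

```python
import math


def pooled_log_relative_risk(estimates, variances):
    """Inverse-variance weighted mean of log relative-risk estimates, as in [20].

    Returns the pooled estimate and the sum of the weights.
    """
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total_weight
    return pooled, total_weight


def heterogeneity_statistic(estimates, variances):
    """Weighted sum of squared deviations from the pooled estimate, as in [21]."""
    pooled, _ = pooled_log_relative_risk(estimates, variances)
    return sum((b - pooled) ** 2 / v for b, v in zip(estimates, variances))


def confidence_limits(estimates, variances, critical_value):
    """Limits of the form in [23]: pooled estimate plus or minus
    critical_value * (sum of weights) ** -0.5."""
    pooled, total_weight = pooled_log_relative_risk(estimates, variances)
    half_width = critical_value / math.sqrt(total_weight)
    return pooled - half_width, pooled + half_width


if __name__ == "__main__":
    log_rr = [0.2, 0.4, 0.3]       # invented log relative-risk estimates
    var = [0.04, 0.01, 0.02]       # invented variance estimates
    pooled, _ = pooled_log_relative_risk(log_rr, var)
    x2 = heterogeneity_statistic(log_rr, var)   # compare with chi-square on m-1 = 2 df
    lo, hi = confidence_limits(log_rr, var, critical_value=2.0)  # placeholder critical value
    print(pooled, math.exp(pooled), x2, (lo, hi))
```

For these invented inputs the weights are 25, 100, and 50, so the pooled log relative risk is 60/175 and the heterogeneity statistic is 13/14, well below what would suggest heterogeneity on 2 degrees of freedom.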
There are various reasons why the chi-square test described may provide
evidence of heterogeneity of the relative-risk estimates from the m
studies. For example, studies of the same type (e.g., m cohort studies)
may have differentially controlled for confounding or may have defined and
measured exposure differently. Studies of different types (e.g., m cohort,
case-control, and aggregate studies) have different sources of potential
bias, for example, recall bias for case-control studies and aggregation
bias in ecologic studies. Hence, it may be useful first to contrast and
combine studies of the same type and then to examine whether the summary
estimates of $\beta$ from each study type are heterogeneous. In respect to
studies of the same type, the overview, or meta-analysis, may be strengthened
by analyzing the raw data from each study in a uniform format, which would
maximize their comparability in terms of confounding control and exposure
modeling. A fundamental principle of such analyses is that the parameter
estimate $\hat\beta$ is based only on the combination of within-study
information, as is the case for the heterogeneity test and the log-relative
risk estimate described above.
Measurement error in exposure and in covariate assessment may be a particularly
important source of heterogeneity among relative-risk estimates. For example,
random measurement error may attenuate severely or otherwise distort relative-risk
estimates in a cohort or case-control study if, for example, exposure assessment
is based on data provided by individual interviews (e.g., location and shielding
information in the Hiroshima and Nagasaki cohorts), but such attenuation
may not be an issue in an aggregate data study if the desired averages (see
"Exposure-Response Estimation in Aggregate Data Studies") can
be estimated precisely. In this circumstance, some effort to deattenuate
the analytic study relative-risk estimates, or to attenuate equally the
aggregate data relative-risk estimates, is essential prior to the comparison
of these estimates. See Prentice and Sheppard (55) for a recent attempt
to study the consistency of international disease rate, time-trend, case-control
and cohort studies in the dietary carcinogenesis area. Note also that the
combined estimate $\tilde\beta$ of expression [20]
will be biased as an estimator of $\beta$ if the available log-relative
risk estimates $\hat\beta_1, \ldots, \hat\beta_m$
are a biased sample of estimates from existing studies, which may arise
if there is so-called publication bias in which relative-risk estimates
that are significantly different from unity are more likely to be reported
in the scientific literature. See Yusuf et al. (56) for a discussion
of some issues in the conduct of such meta-analyses.
Other Data Analysis Topics
The above presentation emphasized time to disease endpoints and corresponding
relative-risk and odds-ratio models. In some areas of environmental epidemiologic
research (e.g., respiratory epidemiology or neuroepidemiology), important
endpoints are continuous. Much of the corresponding data analysis methodology
is well established and does not need to be discussed here. However, methods
for handling measurement error with continuous data (57) also require
much additional development. Recent advances in the methods for analysis
of longitudinal data (50) for discrete or continuous data are also
quite relevant to the analysis of certain types of environmental epidemiologic
data.
Preceding sections also have not addressed the simultaneous analysis
of two or more endpoints. For example, in respiratory epidemiology, there
may be several measures of lung function, and a data analysis goal may be
to summarize exposure effects over several correlated measures of change
in lung function. The estimating equation approaches mentioned above (50,51)
provide an approach to such problems with discrete or continuous outcomes,
but work could be done to compare these methods to univariate methods based
on some summary endpoint. Methods for the analysis of correlated failure
time data currently are not well established, though much statistical research
is under way. See, for example, Clayton and Cuzick (58),
Wei et al. (59), and Prentice and Cai (60) for recent contributions.
Correlated failure-time methods also are required for the investigation
of genetic factors or gene-environment interactions under certain types
of study designs. For example, in a pedigree cohort study, it typically
will be essential to allow for dependence between the disease occurrence
times of family members when studying environmental exposure effects in
relation to genetic indicators of susceptibility.
Morgenstern and Thomas, in this volume, mention certain designs other
than those discussed thus far in this article, as well as the use of biomarker
endpoints. Corresponding data analysis issues and methods will be mentioned
only briefly here.
It was noted that experimental designs are practical occasionally in
environmental epidemiologic research. The relative-risk and odds-ratio regression
methods described above apply equally well for the comparison of disease
incidence (or mortality) rates between randomization groups in individually
randomized designs. However, a group-randomized design (e.g., with community
as the unit of randomization) is more likely to be feasible, in which case
it is essential to acknowledge the possibility of correlation among the
responses (e.g., disease incidence times) of subjects in the same randomization
group, which requires the use of the type of correlated failure-time methods
mentioned above.
In the discussion of ecologic designs it was noted that descriptive studies
of the clustering of disease (e.g., in space or time) can play a useful
role in the generation of environmental health hypotheses. These types of
studies also have specialized data-analytic issues and methods. Statistical
analysis has little to offer in the event of an isolated cluster discovered
by ad hoc methods. Clusters within which the disease counts substantially
exceed expected counts perhaps are best addressed by direct fieldwork to
identify a putative cause. On the other hand, hypotheses of a general tendency
to cluster can be addressed statistically by using methods that compare
the number of cases in certain neighborhoods of each case to the expected
number of cases, while also taking account of population density. Local
neighborhood tests also are available with case-control sampling. See Rothman
(61) and other papers in this volume for discussions of disease-clustering
methods.
The design chapter (in this issue) also emphasizes cross-sectional studies
for the estimation of prevalence rates. The logistic regression methods
outlined in "Relative-Risk and Odds-Ratio Estimation" may be used
to relate prevalence probabilities to retrospectively obtained exposure
and confounding factor histories. Of course, such prevalence probabilities
reflect aspects of both disease incidence and disease duration, and therefore,
may be difficult to interpret. Keiding (62) provides a comprehensive
discussion of the relationships between prevalence probabilities, incidence
rates, and disease durations and of the possibility of deriving estimates
of age-specific incidence from cross-sectional studies.
As discussed previously, biomarkers may serve usefully as exposure indicators
or as early indicators of disease (see Hatch and Thomas, this volume). An
example of a biomarker as an intermediate endpoint is seen in chromosomal
abnormalities in the radiation-exposed cohorts of Hiroshima and Nagasaki.
The rates of such abnormalities among long-lived lymphocytes (usually 100
cells examined for each subject) have played a useful role in assessing
the health effects of radiation exposure in these populations. The correlation
among the chromosomal events in cells from the same study subject has a
strong influence on dose-response analyses in this application (35,63).
Recent advances in the ability to study the cellular and molecular mechanisms
involved in the response to exposure and in disease pathogenesis will lead
inevitably to greater use of biomarkers and biological measurement in environmental
epidemiologic studies. Hence, data analysis methods that incorporate such
measurements in a biologically meaningful fashion are required. Suitable
methods for dose-response analysis with biomarker endpoints will vary according
to the type of endpoint(s) involved. Recent estimating equation approaches
(50,51) often may be useful for such analyses. Circumstances
under which a biomarker endpoint can substitute for disease occurrence and
yield valid dose-response tests and estimates are also of considerable interest.
See Prentice (64) for the introduction and discussion of such criteria.
Finally, it seems worth noting that the interpretation of relative-risk
estimates from a study may depend on prior knowledge and on study goals.
For example, if such estimation takes place in the context of a study specifically
designed to confirm a particular association, the corresponding tests and
confidence intervals are more appropriately taken at face value than if
the relative risk is estimated in a purely exploratory context wherein various
other exposures also are examined in relation to disease risk. In this latter
situation, formal methods may be used to acknowledge the multiple hypotheses
being examined, but precise statistical methods for doing so in a general
way are not available. (So-called Bonferroni methods are available widely
and may be precise enough.) Also, one is often neither in a purely exploratory
nor a purely confirmatory mode in data analysis.
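The Bonferroni adjustment mentioned parenthetically above can be sketched in a few lines: when several exposures are examined, each p-value is multiplied by the number of hypotheses and capped at 1, or equivalently each test is carried out at the nominal level divided by that number. The p-values and function names below are invented for illustration.

```python
def bonferroni_adjust(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the overall level at or below alpha."""
    return alpha / n_tests


if __name__ == "__main__":
    exploratory_p = [0.01, 0.04, 0.5]  # invented p-values for three exposures
    print(bonferroni_adjust(exploratory_p), bonferroni_threshold(0.05, 3))
```

The adjustment is conservative when the tests are positively correlated, which is one reason it may be only roughly adequate in the exploratory settings described above.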
Summary Recommendations
Perhaps the single most important data analysis research need in environmental
epidemiology concerns the development of improved methods to accommodate
measurement errors in exposure assessment. Efforts aimed at the design and
use of validation studies would be particularly useful, as would studies
to document the scope and magnitude of measurement error influences.
A second important need concerns improved methods for the conduct and
analysis of aggregate data (ecologic) studies. The development of strategies
for controlling potential confounding, particularly by using individual
surveys in multigroup studies, along with corresponding innovative data
analysis methods, will be important. Empirical studies that illustrate various
analytic and aggregate data analyses of real data sets also would be valuable.
Other pertinent topics for data analysis research include the development
of improved methods for meta-analyses when studies of different types with
differing potential for measurement error biases are available, the development
of flexible data analysis methods, and the study of properties of analyses
based on biomarker indicators of exposure or biomarker end points. Studies
that evaluate and compare strategies for the control of confounding also
merit continuing attention in environmental epidemiology as in other observational
research areas. Further work on biologically based mathematical models for
cancer and for other diseases also would be well motivated.