Introduction
In the field of structure-activity relationship (SAR) studies, the software
programs CASE (computer-automated structure evaluation) and MULTICASE, created
by Klopman and Rosenkranz (1), represent an original approach for
elucidating mechanisms of interaction between biological systems and exogenous
compounds to predict the biological activities of chemicals. The strategy
adopted is based on the hypothesis that molecular connectivity identifies
the tridimensional structure: fragments of connected atoms and their interatomic
bonds determine to a significant extent angles between pairs of contiguous
atoms and their interatomic distance. The program should be able to detect,
with the help of a statistical procedure, the submolecular structures that
could interact with biological sites (i.e., receptors) involved in the biological
process analyzed. The structure can be responsible for the biological activity
of the compound (biophore) or its inhibition (biophobe). This view partially
agrees with the work of Ashby and Paton (2), who singled out specific
molecular fragments associated with genotoxicity.
The analytical capabilities of CASE increase with the amount of data
input. CASE minimizes the possibility of bias because it identifies
parameters objectively, independent of human judgment. The only human operations
are the choice of the data to be submitted to analysis and the interpretation
of data in output. The selection of the descriptors (molecular fragments)
that are used to predict biological activity is completely automated. The
choice of descriptors is based on statistically significant prevalence in
active or inactive molecules.
Since 1984, many studies have been published by Klopman and Rosenkranz
(3-11) on this subject: sets of congeneric and noncongeneric compounds
have been tested for several biological endpoints (mutagenicity, carcinogenicity,
etc.). We have selected for discussion in this report some papers among
the most pertinent to our work. Concerning predictivity, the results obtained
by Klopman and Rosenkranz change for different endpoints and for different
chemical classes analyzed and overall show a high level of accuracy; often,
however, predictivity has been tested only in the training set or in arbitrarily
built test sets.
The general strategy of CASE is known, but the detailed structure of
the software is not available because it is protected by copyright. Up to
now, all reports on predictivity using CASE have been published solely by
the program creators or by authors using the CASE program by license or
permission. Due to these restrictions, we saw the need to develop a new,
completely independent program to confirm (or disprove) the validity of
the type of SAR approach used by CASE.
Our software uses graph theory to reproduce basic operations characterizing
the CASE program. The program associates a graph with a molecule to represent
its topological properties. The program searches for subgraphs (molecular
fragments) characteristic of groups of carcinogenic or noncarcinogenic compounds.
To test the performance of the software, we chose the induction of tumors
in rodents as a biological end point. Tumors are the end point of carcinogenesis,
a complex multistage event, in which genetic alterations are only one part
of the story. We used the Carcinogenic Potency Database (CPDB; 12-15)
and the National Toxicology Program (NTP; 16-18) data to obtain information
on rodent carcinogenicity. We divided the data into two subsets: a randomly
selected learning set including 80% of the chemicals, and a nonoverlapping
test set including 20% of the chemicals. An additional control analysis
tested an artificially paired set of data where carcinogenicity is attributed
randomly to the molecules of the training set but not to the molecules of
the test set.
Methods
Software features
To analyze the possible relationships between the structure of molecular
fragments and carcinogenicity, our software analyzes the topological properties
of molecular fragments using graph theory. For a detailed introduction to
graph theory, see Christofides (19).
Graph theory is used to relate the topological properties of molecules
to their possible carcinogenicity. A graph is a pair $(V, E)$, where
$V$ is the set $\{v_i,\ i = 1, \ldots, n\}$ of
vertices, and $E$ is the set $\{e_{ij} = (v_i, v_j),\ v_i, v_j \in V\}$
of edges that express existing relations between vertices; both vertices
and edges may be labeled (i.e., they may have an associated name or value).
Any compound can be represented as a graph by associating the atoms with
the vertices and the bonds with the edges. This kind of representation is
frequently adopted in literature because it allows easy handling of the
topological properties of compounds. In fact, graph theory has many applications,
such as in nomenclature, coding, and information processing, storage, and
retrieval (20).
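As a minimal illustration of this representation (a sketch, not the authors' C implementation), a molecule can be stored as a graph with labeled vertices and edges. The `Molecule` class, its method names, and the choice of acetamide as the example are assumptions introduced here for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Molecule:
    """Hydrogen-suppressed molecular graph with labeled vertices and edges."""
    atoms: dict = field(default_factory=dict)  # vertex index -> element symbol
    bonds: dict = field(default_factory=dict)  # frozenset({i, j}) -> bond order

    def add_atom(self, idx, element):
        self.atoms[idx] = element

    def add_bond(self, i, j, order=1):
        self.bonds[frozenset((i, j))] = order

    def neighbors(self, i):
        """Vertices that share an edge with vertex i."""
        return sorted(j for bond in self.bonds if i in bond for j in bond if j != i)

    def connectivity_matrix(self):
        """Square matrix of bond orders (0 = no bond), rows ordered by vertex index."""
        order = sorted(self.atoms)
        return [[self.bonds.get(frozenset((a, b)), 0) if a != b else 0 for b in order]
                for a in order]


# Hypothetical example: acetamide, CH3-C(=O)-NH2, heavy atoms only.
acetamide = Molecule()
for idx, element in enumerate(["C", "C", "O", "N"]):
    acetamide.add_atom(idx, element)
acetamide.add_bond(0, 1)
acetamide.add_bond(1, 2, order=2)
acetamide.add_bond(1, 3)
```

Here the carbonyl carbon (vertex 1) has three neighbors, and the matrix returned by `connectivity_matrix()` plays the role of the connectivity matrix that the program reads as input.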
Our software system uses a fragmentation approach to determine whether
subfamilies of compounds with carcinogenic activity, or lack thereof, are
characterized by the presence of some common structural features (molecular
fragments). A similar approach has already been applied in earlier computer-aided
methods (21-23) for predicting different biological activities (antiarthritic-immunoregulatory
effects and antineoplastic effects). In these earlier works, not all the
possible fragments within a given range of non-H atoms were generated, but
only a limited subset of fragments, such as augmented atoms, heteropaths,
and ring fragments. A definition of these substructural units is given by
Chu et al. (22). Our work is mainly based on the works of Rosenkranz
and Klopman (3,4) and on the studies of Ashby (24,25), who
has defined indicators that can be thought of as subgraphs usually present
in genotoxic compounds (genotoxicity is an important component of carcinogenicity).
Essentially, the system searches all the fragments (i.e., subgraphs)
of the compounds present in the training set whose activity is known, in
an attempt to determine a reliable set of fragments whose presence in compounds
of unknown carcinogenicity (test set) may be an indicator of their activity.
In particular, the main procedure of the program that executes the fragmentation
works as follows: all the fragments within a given size of each compound
of the training set are produced; a unique code is associated with each fragment
yielded, and, if this code is not already present in a fragment dictionary,
it is inserted in the dictionary. A list of the compounds to which the fragment
belongs is linked to the fragment code and it is initially filled with the
code of the compound currently examined. Otherwise, if the fragment code
is already present in the dictionary, only the corresponding compound list
is updated. Once all the compounds of the training set have been fragmented,
the system scans the dictionary by searching for the fragments that satisfy
the statistical conditions (described later under Statistical Methods).
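The fragmentation procedure described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: fragments are taken to be connected sets of atoms together with all bonds among them, and `canonical_code` is a brute-force stand-in for the unique fragment code, whose actual scheme is not described in the text. The function names and the two toy compounds are hypothetical.

```python
from itertools import permutations


def connected_subsets(adj, min_size, max_size):
    """All connected atom subsets containing min_size..max_size atoms."""
    seen = {frozenset([a]) for a in adj}
    frontier = set(seen)
    for _ in range(max_size - 1):
        # Grow every subset by one neighboring atom.
        frontier = {s | {n} for s in frontier for a in s for n in adj[a] if n not in s}
        frontier -= seen
        seen |= frontier
    return [s for s in seen if min_size <= len(s) <= max_size]


def canonical_code(atoms, bonds, subset):
    """Smallest (labels, edges) encoding over all atom orderings.

    Brute force: only practical for the small fragments used in this sketch.
    """
    best = None
    for perm in permutations(sorted(subset)):
        pos = {a: k for k, a in enumerate(perm)}
        labels = tuple(atoms[a] for a in perm)
        edges = tuple(sorted(
            (min(pos[i], pos[j]), max(pos[i], pos[j]), order)
            for (i, j), order in bonds.items() if i in subset and j in subset))
        code = (labels, edges)
        if best is None or code < best:
            best = code
    return best


def fragment_dictionary(training_set, min_size=2, max_size=4):
    """Map each fragment code to the set of compounds that contain it."""
    dictionary = {}
    for compound_id, (atoms, bonds) in training_set.items():
        adj = {a: set() for a in atoms}
        for i, j in bonds:
            adj[i].add(j)
            adj[j].add(i)
        for subset in connected_subsets(adj, min_size, max_size):
            code = canonical_code(atoms, bonds, subset)
            # New code: create the entry. Known code: only update its compound list.
            dictionary.setdefault(code, set()).add(compound_id)
    return dictionary


# Hypothetical two-compound training set (heavy atoms only).
training_set = {
    "ethanol": ({0: "C", 1: "C", 2: "O"}, {(0, 1): 1, (1, 2): 1}),
    "ethylamine": ({0: "C", 1: "C", 2: "N"}, {(0, 1): 1, (1, 2): 1}),
}
fragments = fragment_dictionary(training_set)
```

In this toy case the C-C fragment is linked to both compounds, while C-O, C-N, C-C-O, and C-C-N are each linked to one. The default `max_size=4` is chosen only to keep the brute-force coding cheap; the program described here works with fragments of up to eight heavy atoms.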
The program was developed in standard C language, and it can be compiled
on both MS-DOS and Unix architecture. The version used for the experiments
described here can run on any machine with a 3.0 or later version of MS-DOS
operating system, and it requires at least 4 MB of memory and 100 MB of
hard disk. A typical experiment (a single run of a standard training set
of 661 molecules) takes about 4 hr of computation time on a 486 machine
to develop the database of significant fragments. Two additional hours are
required for the statistical analysis that selects the significant fragments.
The amount of time needed to determine if a new compound of a test set contains
one or more of such fragments depends mainly on the compound structure;
for example, the analysis of a 40-atom (nonhydrogen) compound, normally
connected, takes about 5 min, whereas a 10-atom (nonhydrogen) compound takes
no more than 30 sec.
The program accepts as input an ASCII file describing the structure of
the compounds that will be analyzed by a connectivity matrix. A separate
interface program has been developed to graphically input such structures,
storing them in that ASCII file. In general, the analysis system yields
synoptic report files, but it also stores information in ASCII files in
which data are organized in tables; in this way such information can be
easily accessed by the most popular database software.
Statistical Methods
After the software has considered all the molecular subunits with size
between two and eight "heavy" atoms, a statistical analysis is
performed to select only significant fragments. The first selection is based
on the distribution of the fragments between positive and nonpositive molecules.
The training set initially generates a global number of about 278,000 fragments.
Of these, about 103,000 are different fragments. For the successive stages
of the analysis, the software keeps only those fragments that have a probability
of random association with carcinogenicity (or lack thereof) lower than
0.125 (one tailed) according to binomial distribution. We computed our statistical
estimate for the tail in the direction of biological prevalence; however,
statistical fluctuations can make a fragment significant in both directions
(carcinogenicity or lack thereof). Therefore, conceptually, the real confidence
limits have to be considered two tailed, and about twice the one-tailed
level of confidence. We have calculated the probability for the entire tail
of the distribution to estimate statistical significance. For each monomial
of the distribution we have used the classical formula:

$$\Pr(X) = \frac{N!}{X!\,(N - X)!}\, p^{X}\, q^{N - X}$$

where N is the number of times in which a given fragment has been
generated in different molecules (trials);
X is the number of times in which the fragment has been generated
by positive molecules (successes);
p is the probability that one fragment has been generated by a
positive molecule [probability of success; its value is determined by the
ratio of the fragments generated by positive molecules to all the fragments
generated, about 0.57 in our training sets (see Results)];
q is the probability that the fragment has been generated by a
nonpositive molecule (probability of failure = 1 - p); and Pr(X)
is the probability of X successes (single monomial).
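The one-tailed binomial selection described above can be sketched in a few lines. This is an illustration only: the function names are hypothetical, the 0.125 threshold is the one stated in the text, and the value p = 0.57 used in the examples is taken from the fragment proportions reported in Results.

```python
from math import comb


def binomial_term(n, x, p):
    """Pr(X = x): probability of exactly x successes in n trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)


def upper_tail(n, x, p):
    """Pr(X >= x): one-tailed probability of at least x successes."""
    return sum(binomial_term(n, k, p) for k in range(x, n + 1))


def classify_fragment(n, x, p, alpha=0.125):
    """Label a fragment found in n molecules, x of which are positive."""
    if upper_tail(n, x, p) < alpha:
        return "activating"
    if upper_tail(n, n - x, 1 - p) < alpha:  # tail toward nonpositive molecules
        return "inactivating"
    return "not significant"


# A fragment seen in three molecules, none of them positive.
label_three_negatives = classify_fragment(3, 0, 0.57)
# A fragment seen in three molecules, all of them positive.
label_three_positives = classify_fragment(3, 3, 0.57)
# A fragment seen in four molecules, all of them positive.
label_four_positives = classify_fragment(4, 4, 0.57)
```

With these inputs, three nonpositive occurrences are enough for an "inactivating" label, whereas an "activating" label needs four positive occurrences, in line with the counts discussed in Results.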
The fragments selected in this way are labeled "activating"
if their occurrence in carcinogenic chemicals is higher than the statistical
limit that we established. Similarly, the fragments are labeled "inactivating"
if their occurrence in nonpositive compounds is higher than the established
statistical limit. In a second stage, the program removes the fragments
that are redundant because they are "imbedded" in larger fragments
and have identical behavior (only the subunit with smaller size is kept).
At this stage the number of fragments is reduced more than 300-fold relative
to the initial set of fragments generated (generally from 103,000 to 315
fragments).
A test set, a random sample of the overall data set, is then examined: each
chemical is searched for the significant fragments selected in the
training stage. On the basis of fragment distribution for the chemicals
in the test set, a prediction of their carcinogenicity is made.
A molecule of the test set can have one or more fragments that are present
in molecules of the training set. Combining the statistical significance
of these fragments, we calculate an empirical index, PI (probability index),
for the molecules of the test set. An example of the calculation of this
simple index follows.
A molecule, $X_V$, of the test set contains three fragments among
those selected as statistically significant in the training set ($F_1$
and $F_2$ "activating," $F_3$ "inactivating").
The fragment $F_1$ has been selected because it is present, in the
training set, in five active molecules ($A_T$, $B_T$, $C_T$,
$D_T$, $E_T$) and in one inactive molecule ($G_T$).
Similarly, $F_2$ is contained in four active molecules ($A_T$,
$B_T$, $C_T$, $H_T$), whereas the selection of fragment
$F_3$ originates from the presence of this subunit in four inactive
molecules ($G_T$, $Q_T$, $S_T$, $T_T$).
The fragments $F_1$ and $F_2$ are probably related because
they were generated by a similar set of molecules. To remove the redundancies,
the two fragments are treated as one fragment that originates from seven chemicals
($A_T$, $B_T$, $C_T$, $D_T$, $E_T$,
$G_T$, $H_T$). In a similar way, the information obtained
from fragment $F_3$ is added to create a single aggregate ($A_T$,
$B_T$, $C_T$, $D_T$, $E_T$, $G_T$,
$H_T$, $Q_T$, $S_T$, $T_T$), in which the
ratio between molecules with carcinogenic properties and all the molecules
contributing to the evaluation is 6/10 = 0.6. This value is used as the PI.
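The worked example above can be reproduced with a short sketch. The function name `probability_index` and the single-letter molecule identifiers are conveniences introduced here. The key point is that the training molecules are pooled as a set, so each one counts only once regardless of how many fragments it contributed.

```python
def probability_index(fragment_compounds, activity):
    """Fraction of carcinogens among the distinct training molecules that
    contain any significant fragment found in the test molecule."""
    aggregate = set().union(*fragment_compounds.values())
    if not aggregate:
        return None  # no significant fragment: no prediction is possible
    return sum(activity[m] for m in aggregate) / len(aggregate)


# Training molecules behind each significant fragment of the test molecule.
fragments_in_test_molecule = {
    "F1": {"A", "B", "C", "D", "E", "G"},
    "F2": {"A", "B", "C", "H"},
    "F3": {"G", "Q", "S", "T"},
}
# True = carcinogen, False = noncarcinogen.
activity = {m: True for m in "ABCDEH"}
activity.update({m: False for m in "GQST"})

pi = probability_index(fragments_in_test_molecule, activity)
```

The pooled set holds ten molecules, six of them carcinogens, so `pi` evaluates to 0.6 as in the text.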
The next step is the calculation of the PI value that is used as
a cut-off value to define two categories (positives and negatives) of predicted
activity for the test set. This cut-off index is the value that maximizes
the accuracy of the 2 × 2 contingency table (carcinogenicity or lack thereof
versus predicted activity) in the training set. Accuracy in the training
set as a function of the PI is illustrated in Figure 1. Levels of accuracy
higher than 0.75 are obtained in the training set in a range of PI values
between 0.35 and 0.8. This is because the majority of molecules have a probability
index higher than 0.8 or lower than 0.35 (Fig. 2). A cut-off within this
range only slightly affects the attribution to the carcinogenic or noncarcinogenic
class. The average optimal cut-off value for eight runs was 0.41.
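A sketch of this cut-off search is given below. It assumes, hypothetically, that a molecule is predicted positive when its PI is greater than or equal to the cut-off and that candidate cut-offs are scanned in steps of 0.01; the text does not specify either detail, and the PI values in the example are invented.

```python
def accuracy_at_cutoff(pi_values, labels, cutoff):
    """Fraction of molecules classified correctly when PI >= cutoff means positive."""
    correct = sum((pi >= cutoff) == label for pi, label in zip(pi_values, labels))
    return correct / len(labels)


def best_cutoff(pi_values, labels, candidates=None):
    """Cut-off that maximizes training-set accuracy (first maximum wins on ties)."""
    if candidates is None:
        candidates = [k / 100 for k in range(101)]
    return max(candidates, key=lambda c: accuracy_at_cutoff(pi_values, labels, c))


# Invented training-set PI values and observed carcinogenicity.
pi_values = [0.90, 0.85, 0.50, 0.20, 0.10]
labels = [True, True, True, False, False]
cutoff = best_cutoff(pi_values, labels)
```

Because most PI values lie far from the middle of the scale, many cut-offs give the same accuracy, which is the plateau behavior described for Figure 1.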


Figure 1. Behavior of the accuracy value for different probability index
cutoff values in the average training set.
Figure 2. Distribution of probability index values for the chemicals in
the average training set.
Preliminary runs of our program showed, for partial subsets of carcinogenicity
data, statistical fluctuations in terms of predictivity indices. For this
reason, we performed eight runs using our final database (826 compounds,
515 carcinogens and 311 noncarcinogens). For each run we randomly drew 80%
of compounds for the training set and used the remaining 20% as the test
set. We also performed eight paired runs using the same chemicals, but,
in this case, the property of carcinogenicity in the training set was randomly
attributed (pseudo-training set). The procedure for randomly sorting the
chemicals for the training set and the test set imposed the condition that
in both sets, 62.3% of the chemicals must be positive carcinogens. This
simple procedure uses a routine of BASIC language (RANDOMIZE TIMER) as a
random-number generator to assign the chemicals for the training sets and
to assign the carcinogenic property in the pseudo-training sets.
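The sampling scheme can be sketched as follows. Python's `random` module is used here in place of the BASIC routine mentioned above, and the sketch assumes that the pseudo-training sets keep the same number of positives as the real ones; both choices, like the function names, are assumptions made for illustration.

```python
import random


def stratified_split(labels, train_fraction=0.8, rng=random):
    """Split compound ids into training and test sets while keeping the
    share of positives approximately equal in both."""
    train, test = [], []
    for value in (True, False):
        group = sorted(c for c, y in labels.items() if y == value)
        rng.shuffle(group)
        cut = round(train_fraction * len(group))
        train += group[:cut]
        test += group[cut:]
    return train, test


def pseudo_training_labels(labels, train, rng=random):
    """Randomly reassign carcinogenicity within the training set,
    preserving the number of positives."""
    shuffled = [labels[c] for c in train]
    rng.shuffle(shuffled)
    return dict(zip(train, shuffled))


# Database of the size used here: 515 carcinogens and 311 noncarcinogens.
labels = {f"c{i}": i < 515 for i in range(826)}
rng = random.Random(0)
train, test = stratified_split(labels, rng=rng)
pseudo = pseudo_training_labels(labels, train, rng=rng)
```

For 826 compounds this yields 661 training and 165 test chemicals, the set sizes quoted elsewhere in the text.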
To evaluate the predictivity level of our methodology, we
adopted some indices that are conventionally used for diagnostic tests:
Sensitivity (SE) = [TP/(TP+FN)] × 100
Specificity (SP) = [TN/(TN+FP)] × 100
Positive predictive value (PPV) = [TP/(TP+FP)] × 100
Negative predictive value (NPV) = [TN/(TN+FN)] × 100
Observed correct predictions (OCP) = [(TP+TN)/N] × 100
where TP = true positive, FP = false positive, TN
= true negative, FN = false negative, and N = (TP + FP + TN + FN) = number
of molecules in the data set.
In addition, according to Klopman and Kolossvary (26),
we evaluated the following two parameters:
Expected correct predictions
(ECP) = (1 + 2XY - X - Y) × 100
where X is the fraction of active molecules
in the data set, and Y is the fraction of molecules predicted as
active.
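The indices defined above can be computed directly from the four cells of a 2 × 2 contingency table, as in the sketch below. The function name and the counts in the example are hypothetical. Note that the ECP expression is algebraically equal to XY + (1 - X)(1 - Y), the agreement expected by chance when observed and predicted activity are independent.

```python
def predictivity_indices(tp, fp, tn, fn):
    """Diagnostic-test indices (in percent) from a 2 x 2 contingency table."""
    n = tp + fp + tn + fn
    x = (tp + fn) / n  # fraction of active molecules in the data set
    y = (tp + fp) / n  # fraction of molecules predicted as active
    return {
        "SE": 100 * tp / (tp + fn),
        "SP": 100 * tn / (tn + fp),
        "PPV": 100 * tp / (tp + fp),
        "NPV": 100 * tn / (tn + fn),
        "OCP": 100 * (tp + tn) / n,
        "ECP": 100 * (1 + 2 * x * y - x - y),
    }


# Hypothetical counts, not taken from the tables of this study.
indices = predictivity_indices(tp=40, fp=10, tn=30, fn=20)
```

Comparing OCP with ECP shows how much of the observed accuracy exceeds what the marginal proportions alone would produce.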

Sources of Data
We gathered the carcinogenicity data analyzed here from two of the main
databases: CPDB (12-15), in which more than 4000 experimental values
are reported (1053 chemicals are considered in the database), and the NTP
database (16-18), in which 301 chemicals have been tested
with standardized protocols in mice and rats. The two databases provide
qualitative and quantitative data for each experiment. We considered only
qualitative results because our software can process only categorical outcomes
at this time. To simplify the situation, in our first analysis we used only
binary data: we classified the experimental results for each chemical as
"positive" or "nonpositive." To this end, we arbitrarily
fixed criteria to make a binary outcome. Table 1 shows the rules adopted
for CPDB data, and Table 2 describes the rules used for NTP data. The two
databases overlap extensively due to the fact that NTP data (except for
most recent experiments) are already contained in CPDB. For only a few chemicals
was there incomplete agreement between the two sources: Table 3 considers
all the possible combinations of matched results.


A large portion of the compounds for which there are data available in
the two databases is included in our database. No intentional selection
was performed. We discarded 50 (4.4%) chemicals with uncertain carcinogenicity
status (not classified according to Tables 1-3); 263 (23.1%) chemicals were
excluded for one or more of the following reasons: 1) administered in mixture;
2) fewer than three "heavy" atoms; 3) molecules too large for the
input interface (more than 50 heavy atoms); 4) contained unusual atoms (chemicals
containing only H, C, S, N, Cl, O, Na, F, Br, P were included in the database);
5) difficulty finding the structural formula. Our program can currently
analyze 826 chemicals. The CAS numbers of these chemicals are given in Appendix
A.
Results
The fragmentation stage of the process produces about 278,000 fragments
(average of 8 runs), adding up all the fragments produced for each molecule;
of these, about 103,000 are different fragments. From the analysis of their
occurrence and after removal of redundant fragments, on the average, 315
fragments significantly associated with carcinogenicity or lack thereof
(p<0.125 according to binomial distribution) are kept for the
successive steps of the analysis. The number of fragments is significantly
lower for the paired training sets with a random attribution of carcinogenicity:
on average, 174 fragments are selected. Detailed features of the data analyzed
are summarized in Table 4. We also counted the fragments generated with
a threshold of statistical significance at p<0.01. In this case,
the training set of all the 826 chemicals in our database generated 50 fragments,
whereas 6 pseudo-training sets (see Methods) of 826 chemicals generated
an average of only 11.8 fragments. Examining the distribution of the
fragments shown in Appendix B, we observe that the most common size is 4
"heavy" atoms (15 fragments), although sizes between 3 and 7 are
also relatively common (5-10 fragments). Only two significant fragments
of eight "heavy" atoms and only one fragment of two "heavy"
atoms are present.

The 315 fragments obtained from the training stage are prevalently "inactivating"
(60.6%), and only 39.4% are "activating." This fact may be due
to the ratio between fragments generated from carcinogens and noncarcinogens
in the database studied. In our global database we have more carcinogens
(62.3%) than noncarcinogens (37.7%). However, noncarcinogens have an average
size larger than carcinogens (15.1 "heavy" atoms versus 13.0 "heavy"
atoms). Most likely for this reason, out of the total number of generated
fragments (redundant fragments included), 57.0% come from carcinogens and
43.0% from noncarcinogens. Figure 3 shows the distribution of the occurrences
of 103,000 fragments of the average training set. In the case of negative
fragments, those present in three noncarcinogens reach our established limit
of statistical significance ($0.43^3 < 0.125$). This is not the
case for positive fragments ($0.57^3 > 0.125$). For a positive
fragment to become significant, it has to be present in at least four carcinogens
($0.57^4 < 0.125$). As shown in Figure 3, many more fragments
are present at least three times than are present at least four times.
Statistically significant negative fragments can be sorted from a larger
set than statistically significant positive ones. As a consequence, even
if we start with more positive (57%) than negative fragments (43%), we end
up with 60.6% statistically significant negative fragments and 39.4% statistically
significant positive ones (in the final set of 315 statistically significant
different and nonredundant fragments).
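The asymmetry between positive and negative fragments follows from simple arithmetic, which the sketch below makes explicit for a fragment found in molecules of one class only. The function name is hypothetical; the proportions 0.57 and 0.43 and the 0.125 threshold are those given in the text.

```python
def min_occurrences(prior, alpha=0.125):
    """Smallest n with prior**n < alpha, for a fragment found only in
    molecules of one class (requires 0 < prior < 1)."""
    n = 1
    while prior ** n >= alpha:
        n += 1
    return n


needed_in_noncarcinogens = min_occurrences(0.43)  # negative fragments
needed_in_carcinogens = min_occurrences(0.57)     # positive fragments
```

Three occurrences suffice for a negative fragment, but four are needed for a positive one. Because fewer fragments recur four times than three times, the pool of candidate positive fragments is smaller.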
Among the 315 significant and nonredundant fragments, similar (not identical),
related fragments are still present, but the possible bias that they could
introduce in terms of predictivity is lessened by the statistical treatment
described in the previous section. These fragments generate the predictions
of carcinogenicity or lack thereof for the test sets. For each run, a
2 × 2 contingency table is created and all the most important indices of qualitative
predictivity are calculated.

Figure 3. Fragment occurrences for 661 chemicals (average training set).
Table 5 shows the contingency table obtained from the average data of
eight runs for the compounds in the training sets where real experimental
carcinogenicity data have been used. All the indices calculated seem to
show a high level of predictivity. However, even the indices obtained with
the eight training sets where carcinogenicity was randomly attributed (Table
6) show a high predictivity performance. It is clear that the results obtained
are not due to the predictive capability of the program but mainly to the
many degrees of freedom existing in the system. These degrees of freedom
allow for an a posteriori adaptation of the program to the pattern
of positive and negative data in the training sets. In conclusion, the training
sets cannot be used for an assessment of predictivity. It must be noted
that the pseudo-training sets generate fewer "significant" fragments
than the real training sets. As a consequence, fewer chemicals are
associated with a positive or negative prediction (376.9) than in
the real training sets (521.6).


Table 7 shows the contingency table obtained for an average of eight
test sets. The level of accuracy (67.5%) is significantly higher (p~0.0006)
than the expected level, based on the hypothesis of no association between
connectivity and carcinogenicity (53.2%). The results obtained when the
training sets with carcinogenicity randomly attributed are used to predict
the same test sets (Table 8) do not show any association. These results
and the previous observation that a random attribution of carcinogenicity
generates only about 55% as many apparently significant fragments as
a real training set (174 versus 315) strongly suggest that connectivity is
associated only with a real biological property and not with a randomly
distributed simulated property.


Among the 165 chemicals of the test sets: 1) 32.4% (average of eight
runs) contained only statistically significant positive fragments and were
predicted with an accuracy of 78.7%; 2) 24.4% of the chemicals contained
only statistically significant negative fragments and were predicted with
an accuracy of 60%; 3) 19.8% of the chemicals contained both statistically
significant positive and negative fragments and were predicted with an accuracy
of 59.3%; 4) 23.3% of the chemicals contained no statistically significant
fragments (70.8% of these chemicals were carcinogens and 29.2% were noncarcinogens),
thus preventing a prediction of carcinogenicity.
Of those chemicals without statistically significant fragments, the ratio
between carcinogens and noncarcinogens (70.8/29.2) is higher than the ratio
present in the global database (62.3/37.7). This result can be explained
by the fact that among the 315 statistically significant fragments selected
by the program, more negative fragments (60.6%) than positive fragments
(39.4%) are detected. For this reason, perhaps, we more often detected noncarcinogens
than carcinogens. This could explain the enrichment in carcinogens among
the molecules not associated with significant fragments.
Discussion
The major drawback to this type of automated analysis is the number of
elementary operations performed and the quantity of memory needed. Determining
the largest common subgraph between two graphs is an NP-hard problem, for
which no polynomial-time algorithm is known. Fortunately, some characteristics
of the chemical compounds partially simplify this otherwise formidable task:
1) the maximum number of edges converging at a node is usually small (around
four); 2) the number of atoms in the compounds of our database is relatively
small: the average number of heavy atoms (nonhydrogen) per compound is 13.8,
and the largest compound contains 48 heavy atoms (see Fig. 4); 3) the maximum
size of the searched fragments was limited to eight heavy atoms. As can
be observed in Figures 5 and 6, fragments of greater size tend to appear
in large numbers, but each of them tends to be present in too few compounds
to be statistically significant. We have also observed that in our database,
the information (associated with carcinogenicity or lack thereof) related
to fragments of size 9 is redundant with respect to the information of smaller
sizes in 100% of the cases (data not reported).

Figure 4. Size of the molecules present in the database.

Figure 5. Global number of different fragments, according to their size.
Results are for a set of 661 randomly selected chemicals.
Figure 6. Number of different fragments present in at least five molecules,
according to their size. Results are for a set of 661 randomly selected chemicals.
Finally, thus far, the adopted technique of representation of molecular
fragments does not make a distinction among steric isomers; such cases will
be dealt with in a future improvement to the system.
We have described the method for calculating our PI value in Methods.
We used the PI value as a discriminant for deciding if a molecule of the
test set will be predicted to be a carcinogen or a noncarcinogen. The strategy
adopted prevents strongly related fragments from contributing to the analysis
as independent fragments. In this way the informative content of a single
chemical in the training set can have only one unit weight: we thus avoid
the introduction of a bias of redundancy resulting from the multiplication
of information related to a single molecule.
This strategy can introduce a different potential bias for a subset of
molecules with different active substructures all common to the same molecules:
in this case the index calculated can be underestimated. However, in our
opinion, adding up the contributions of highly correlated fragments would
cause more distortion than discarding multiple contributions present in
the same molecule.
As a general result, we have confirmed what has been suggested by Klopman
and Rosenkranz (4): an approach based on molecular connectivity can
predict carcinogenicity. The results obtained in our test sets are statistically
significant (p~0.0006). We believe that the observed levels of predictivity
are not only statistically significant but also biologically relevant and
potentially useful as one component of a spectrum of information that can
contribute to hazard evaluations. Our initial work is promising, but we
must test the software in additional experiments to develop it as a predictive
toxicology system. For instance, we have to investigate in detail the performance
of our program for different thresholds of statistical significance when
we are selecting significant fragments from the training set to be used
for predictions in the test set.
We can logically presume that with a smaller (and/or less diversified)
training set, a fragment potentially associated with carcinogenicity or
lack thereof might fail to reach statistical significance (or might reach only
equivocal statistical significance). Therefore, we would expect that the percentage
of nonassessable chemicals should decrease for a larger training set, and
we should obtain better predictivity in general.
We plan to test our software program using smaller training sets (i.e.,
from 200 to 400 chemicals randomly selected) to verify if our assumption
is correct. Klopman and Rosenkranz (11) have already verified this
assumption. However, for the moment, we do not know if the similarities
between the CASE program and our program are sufficient to allow extrapolation
of their results to the results of our program.
We also have to look in detail at the fragments selected as significant
to comment about their biological plausibility and compare them with the
alert structures of Ashby (2,16,17,18,24,25) and also with fragments
identified by the CASE and MULTICASE programs. We plan to coordinate with
the authors of CASE and MULTICASE to test our respective programs with identical
training sets and identical test sets so that we can compare the results
obtained.
We used a database much larger than those used previously by other authors.
We have obtained an average (eight runs) level of accuracy of 67.5% (standard
error, ±1.3). As shown in Table 7, we predicted 82.1 chemicals as positive
and 44.4 as negative. If these predictions (with the same proportions of
predicted positives and negatives) had been based only on chance, the level
of accuracy would have been 53.2% (ECP value). In our database, the prevalence
of positive carcinogens is 62.3%. If we had predicted all the chemicals
of the test sets as carcinogens, we would have obtained an accuracy of 62.3%.
When all chemicals are predicted to be potential carcinogens, the sensitivity
is 100% and the specificity is 0%, and the prediction is not very useful.
An accuracy of 62.3% is apparently not very different from 67.5%, but we
would anticipate from our software program levels of accuracy in the range
of 65-70% also for a database with a carcinogen/noncarcinogen ratio of 50/50, or even 38/62.
We plan to perform these experiments in a future study.
Different levels of predictivity were observed for different subclasses
of chemicals. For instance, the confidence of the prediction for a chemical
of the test sets, characterized only by positive fragments, is significantly
higher (78.7%) than the confidence of the prediction for a chemical characterized
only by negative fragments or contradictory fragments (60.7% and 59.3%,
respectively).
We have met some difficulties in performing a direct comparison of our
results with the results obtained by CASE. At the level of the training
set, accuracy was higher (~ 95%) for CASE (8,9) than for our program.
This difference is probably related to differences in the decisional-statistical
procedures used for the information obtained from different molecular fragments.
In addition, the carcinogenicity database used by Klopman and Rosenkranz
was different from ours. We have clearly demonstrated that accuracy at the
level of the training sets is not correlated to the real predictivity of
the software program (compare Tables 6 and 8).
A test set concerning carcinogenicity is present in two different reports
by Klopman and Rosenkranz (8,9). The training set contained 189 chemicals
of the NTP study (50.2% active, 22.2% marginally active, and 27.5% noncarcinogens).
The rodent carcinogens (or noncarcinogens) considered in the test sets of
the two papers are the same chemicals. They had been evaluated for carcinogenicity
in the GeneTox program. In this test set, 23 out of 24 chemicals were rodent
carcinogens. The expected correct predictivity was 92%, and the observed
predictivity (accuracy) was 100%. Obviously, it is not possible to directly
compare this extremely unbalanced database with ours.
In 1990, an analysis of the capability of CASE to predict carcinogenicity
for a group of polycyclic aromatic hydrocarbons was reported by Richard
and Woo (27). Thirty-one active and 25 inactive PAHs were used in
the training set ("LEARN"), and 9 active and 15 inactive PAHs
were used in the test set ("VALIDATE"). The authors reported an
accuracy of 75% (SE, 89%; SP, 67%). In a recent publication (28),
results concerning the predictive capabilities of CASE were reported for
a group of chemicals for which carcinogenicity data recently became available
(NTP studies). Out of 25 chemicals predicted by CASE, 17 were carcinogens
and 8 were noncarcinogens (6 equivocals omitted). The degree of accuracy
was 64% (SE, 59%; SP, 75%). Obviously, these results are from a small test
set, not directly comparable with ours.
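The accuracy values quoted above follow from the reported sensitivity (SE),
specificity (SP), and test-set composition. The sketch below is only a
consistency check on the published figures; the function name is hypothetical,
and the rounded SE and SP values are taken as given.

```python
def accuracy_from_se_sp(sensitivity: float, specificity: float,
                        n_active: int, n_inactive: int) -> float:
    """Overall accuracy as the prevalence-weighted mean of SE and SP."""
    correct = sensitivity * n_active + specificity * n_inactive
    return correct / (n_active + n_inactive)


# PAH test set (27): 9 active, 15 inactive; SE 89%, SP 67%.
pah = accuracy_from_se_sp(0.89, 0.67, 9, 15)

# NTP chemicals (28): 17 carcinogens, 8 noncarcinogens; SE 59%, SP 75%.
ntp = accuracy_from_se_sp(0.59, 0.75, 17, 8)

print(f"PAH test set: {pah:.0%}")  # PAH test set: 75%
print(f"NTP test set: {ntp:.0%}")  # NTP test set: 64%
```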
Among the works published by Klopman and Rosenkranz, a larger database
(more similar to our database) was used to predict mutagenicity in Salmonella.
In a recent study (1), Klopman and Rosenkranz used mutagenicity data
from the GeneTox program and NTP studies to perform the analysis. The training
set was built using GeneTox mutagenicity data, and the test set was built
using NTP mutagenicity data. Chemicals present in both the databases were
not submitted to CASE and MULTICASE analysis. In this way, the training
set contained 450 mutagens, 253 marginally active mutagens, and 123 nonmutagens,
whereas the test set contained 63 mutagens, 21 marginally active mutagens,
and 61 nonmutagens. The highest level of predictivity obtained using the
MULTICASE program was about 80%, as opposed to an expected correct prediction
of about 50%. According to Ashby and Tennant (29), mainly electrophiles
(directly or after metabolic activation) are involved in Salmonella
mutagenicity. It is reasonable to think that mutagenicity in Salmonella
should be easier to predict than the complex endpoint of carcinogenicity:
phenomena such as promotion, clonal expansion, remodeling, tissue necrosis
and regeneration, and modulation of proliferation, apoptosis, and differentiation
are clearly involved in the carcinogenic process, but not in mutagenicity
in Salmonella or in other short-term tests of genotoxicity. We would
expect a wider and more heterogeneous spectrum of molecular fragments to
be involved in carcinogenicity than in genotoxicity. In the future, we will
have to apply our software program not only to carcinogenicity but also
to mutagenicity in Salmonella to test our hypothesis that genotoxicity is,
in general, easier to predict than carcinogenicity.
After analyzing recent studies evaluating the qualitative correlation
between short-term tests for genotoxicity and carcinogenicity (30,31),
we conclude that the accuracy of these tests in predicting carcinogenicity
is in the range of 56-62%. It seems reasonable
that short-term genotoxicity tests can reflect irreversible alterations
in the genome during carcinogenesis. On the other hand, short-term tests
should not be able to monitor nongenotoxic events (for instance, those events
linked to promotion and clonal expansion of preneoplastic cells). The fact
that the predictivity of molecular connectivity is better than the predictivity
of short-term genotoxicity tests suggests that molecular connectivity can
detect not only electrophilic fragments, like the ones described by Ashby
et al. (2,16-18,24,25), but also fragments linked to nongenotoxic
effects (promotion, modulation of differentiation, etc.). An alternative
explanation of this difference in accuracy could be related to the fact
that nongenotoxic carcinogens may be more abundant in the databases used
to assess the predictivity of short-term tests (30,31) than in our
larger database. In the future we will investigate the predictivity of molecular
connectivity for genotoxic and nongenotoxic carcinogens.
We have discussed the predictive capability of short-term genotoxicity
tests. How much higher would this predictivity be with a test biologically
closer to carcinogenicity in rodents? We can partially answer this question.
The endpoint of carcinogenicity in a single species of small rodents is
not very different in the evolutionary scale from the endpoint of carcinogenicity
in at least one of two closely related species. If the endpoint is restricted
to carcinogenicity in mice only or in rats only, carcinogenicity in one species
can be used to predict carcinogenicity in the other. For the database of Gold
et al. (12-15), a concordance
of 75% between rat and mouse studies has been reported (32), and
for the chemicals of the NTP studies, a concordance of 74% has been reported
(33): the predictivity of molecular connectivity is only moderately
lower than the values reported above. This can be considered an additional
indication of the good behavior of our parameter. We will have to confirm
this impression in future experiments using only mouse data or rat data.
Within the framework of hazard evaluation, we believe that the computerized
SAR approach should be given a weight similar to that of a standard short-term
test in a multifactorial analysis of the carcinogenic potential of a given
chemical. With regard to genotoxicity and carcinogenicity, Ashby (34)
has pointed out that some fragments detected as significant by Klopman and
Rosenkranz (and likewise by us) could not stand an in-depth analysis performed
by a human expert, considering both biological and chemical specific arguments.
We agree with this observation. Because the number of apparently significant
fragments found in the pseudo-training sets was equal to about 55% of the number
of statistically significant fragments found in the real training sets, we
suspect that (as a first approximation) about half of the fragments defined as
significant according to our statistical threshold (p<0.125, one-tailed) are
spurious. According to our analysis, only about 50% of apparently significant
fragments emerging from a training set can be fragments of real biological
significance. The remaining 50% is probably generated by chance and can
also be present in a pseudo-training set in which carcinogenicity is assigned
randomly. The level of predictivity reached in our experiments is probably
due to a mixture of approximately 50% predictive fragments and approximately
50% noise fragments. We think that fragments suggested as significant
by our software program should be considered only as candidates for biological
significance; they are by no means foolproof biological indicators of carcinogenicity.
Their probability of being significant is higher, as expected, when we select
a more severe statistical threshold. As a consequence of these considerations,
a new potentially significant fragment detected by our software program
is only submitted to the attention of investigators as a possible fragment
characterizing a subfamily of molecules potentially responsible for their
common carcinogenic activity. Additional biological and chemical considerations
could lead to the acceptance or rejection of the fragment as biologically
significant. For instance, if the chemicals considered are similar procarcinogens,
similar metabolism should generate similar proximate carcinogens and perhaps
also similar DNA adducts.
There are also cases in which it is impossible to reach a definite conclusion.
Statistical significance is only one factor; however, when the statistical
threshold is much more severe (p<0.01 instead of p<0.125),
the number of significant fragments generated in a real training set is
four to five times larger than the number of significant fragments generated
in a pseudo-training set (against a ratio of 2/1 for the threshold, p<0.125).
Fragments with a higher statistical significance deserve priority in subsequent
biological investigations with the aim of confirming or disproving the existence
of a new molecular structure relevant for carcinogenicity or genotoxicity.
On the other hand, the information obtained with the threshold p<0.125,
while less significant than the information obtained with the threshold
p<0.01, still allowed us to make predictions about a much larger
fraction of chemicals. For this reason, the threshold p<0.125
was selected for the general predictivity study presented here.
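The trade-off between the two thresholds can be made explicit with a rough
estimate of the share of chance fragments. The sketch below assumes, as a first
approximation, that a real training set contains as many chance fragments as a
pseudo-training set of the same size; the function name is hypothetical, and
only the ratios quoted above are used, not actual fragment counts.

```python
def spurious_fraction(real_to_pseudo_ratio: float) -> float:
    """Estimated share of significant fragments attributable to chance.

    Assumption: the number of chance fragments in a real training set equals
    the number of significant fragments found in a pseudo-training set.
    """
    return 1.0 / real_to_pseudo_ratio


# Real/pseudo ratios of significant fragments quoted in the text.
ratios = {
    "p<0.125": 2.0,               # ratio of about 2/1
    "p<0.01 (lower bound)": 4.0,  # four times larger
    "p<0.01 (upper bound)": 5.0,  # five times larger
}

for threshold, ratio in ratios.items():
    print(f"{threshold}: about {spurious_fraction(ratio):.0%} spurious")
```

Under this assumption, the more severe threshold reduces the estimated share of
spurious fragments from about one half to about one fifth or one quarter, at the
cost of covering fewer chemicals.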
We have used the overall evidence of carcinogenicity in at least one species,
one sex, and one tissue, without any consideration of carcinogenic potency,
to determine whether or not a chemical is a carcinogen (yes or no). In the
future we plan to stratify our database according to spectrum of carcinogenicity
(large spectrum, narrow spectrum), as suggested by Tennant (35), and
perhaps take into consideration different ranges of potency. A subfamily
of chemicals sharing a common chemical fragment could also display a relatively
homogeneous behavior with respect to a different subfamily sharing a different
fragment.
In conclusion, we have confirmed that with a large database,
using an independent software program, SAR approaches based on the computer-automated
detection of molecular fragments statistically associated with a given biological
property can be used to predict carcinogenicity in rodents. We are not aware
of other independent validations of this type of SAR approach.



