Abstract
In order to model relationships between chemical structures and biological effects in quantitative structure–activity relationship (QSAR) data, an alternative technique of artificial intelligence computing—genetic programming (GP)—was investigated and compared to the traditional method—statistical. GP, with the primary advantage of generating mathematical equations, was employed to model QSAR data and to define the most important molecular descriptions in QSAR data. The models predicted by GP agreed with the statistical results, and the most predictive models of GP were significantly improved when compared to the statistical models using ANOVA. Recently, artificial intelligence techniques have been applied widely to analyse QSAR data. With the capability of generating mathematical equations, GP can be considered as an effective and efficient method for modelling QSAR data.
Export citation and abstract BibTeX RIS
Content from this work may be used under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
1. Introduction
QSAR data are known to be highly complex with nonlinear relationships. Modelling the relationships between chemical structures and biological effects is the target of structure–activity studies [1]. Therefore, with a large number of molecular descriptors, finding an efficient tool to glean knowledge from the relationships of structure–activity in QSAR data is necessary. There are many methods applied routinely for QSAR. In addition to statistical methods [2–4], artificial intelligence techniques have been employed to deal with such data. Recently, several research papers have discussed whether artificial intelligence can deal with the problem of QSAR modelling. Examples include the application of neural networks for discriminating antibacterial activity [5] and for 3D QSAR [6], or an application of neural networks and genetic algorithms for modelling QSAR data [7].
In the previous studies, an alternative artificial intelligence method that was developed based on the Darwinian theory called genetic programming (GP) was successfully employed to modelling pharmaceutical formulation data [8, 9]. In this study, GP was used to analyse QSAR data and a comparison of the ability of GP and statistical methods to model QSAR data was examined for a QSAR data set from a study on the effect of structural QSAR parameters on skin penetration. For the comparison, an assumption was made that in all cases there was no interaction between the molecular descriptors (i.e. all independent variables).
2. Materials and methods
2.1. Materials
The data set used in this work was taken from a study of the effect of structural QSAR parameters on skin penetration. In [3], Ghafourian and Fooladi analysed 20 variables of 17 inputs and 3 outputs—permeability (log K p ), skin/water partition (log K m ) and diffusion coefficient (log D/h), as listed in table 1. Ghafourian and Fooladi analysed 39 compounds. However, in this study, for ease of comparison with the statistical results in the literature, only 22 compounds (compound 1–14 and 26–33) were used for evaluating the predictability of GP for all responses. From a total of 22 compounds used for observation, 20 records are used as training data and 2 records are used as test data selected by the Smart Selection function in GP.
Table 1. QSAR parameters.
Independent variables | |||
Octanol/water partition coefficient | MW | Molecular weight | |
Logarithm of total solvent accessible | E HOM | Energy of the highest occupied | |
surface area | molecular orbital | ||
OT | Number of hydrogen bonding heteroatoms | HT | Number of hydrogen bonding hydrogens |
qs +a | Sum of atomic charges on the hydrogen | δ | Solubility parameter |
bonding hydrogen atoms | |||
q−a | The most negative atomic charge on the | V | Molar volume |
hydrogen bond accepting heteroatoms | |||
qs −a | Sum of atomic charges on the hydrogen | 2κ | Second-order molecular shape index |
bonding heteroatoms | |||
qs −b | Sum of atomic charges on the hydrogen | 0κ α | Alpha-modified zero-order molecular |
bonding heteroatoms | shape index | ||
q−b | The most negative atomic charge on the | 4χ pc v | Fourth-order valence-corrected path- |
hydrogen bond accepting heteroatoms | cluster molecular connectivity index | ||
ESP − | The most negative electrostatic potential on | ||
the solvent accessible surface of molecules | |||
Dependent variables | |||
Permeability coefficient | |||
Skin/water partition coefficient | |||
– |
2.2. Methods
GP has been known primarily as a learning technique producing mathematical equations [10–12]. It is founded on the basic principle of Darwin's theory of evolution. In nature, biological structures that are more successful at operating in their given environment have a higher probability of surviving and reproducing. These structures are considered to be the result of Darwinian natural selection operating in an environment over a period of time. With a similar principle of natural selection, Koza introduced GP that attempts and succeeds in applying this evolutionary theory in order to find the best (fittest) or the most appropriate equation (solution) for a problem domain [10].
2.2.1. Representation scheme
GP modifies individuals of the population after a number of runs in every cycle, called a generation, to find the best solution. The structure of individuals in GP is a tree (see figure 1(a)). Possible structures in genetic programming are the set of all compositions of functions and the set of terminals.
The elements in the function set may include:
- arithmetic operations (+, -, *, etc),
- mathematical functions (such as sin, cos, exp and log),
- Boolean operations (such as AND, OR and NOT),
- conditional operators (such as If-Then-Else),
- functions causing iteration (such as Do-Until),
- functions causing recursion,
- any other domain-specific functions that may be defined.
The terminals are typically either a variable or a constant value.
2.2.2. Genetic operations
In the Darwinian principle of reproduction and survival of the fittest, a population always has many generations and the individuals of new generations are the result of the combination of the individuals in the previous generations. In other words, this is a transfer of a set of individuals into a new generation of the population using genetic operations. Crossover, reproduction and mutation are genetic operations for creating the new individuals in each generation (figures 1(b) and (c)). The crossover operation creates a new individual from two parental individuals selected from the population based on fitness. Within each parental tree, a random node is selected and the sub-trees under the selected node are swapped to create two new trees. The simplest operation of genetic operations selects the individuals from a population based on their fitness and then copies them into the new population. The mutation operation creates a new individual from an existing tree in a population by deleting and replacing one node of that tree with another node from the same set. Note that a function replaces only a function and a terminal replaces only a terminal.
2.2.3. Fitness function
To select individuals for crossover, reproduction and to determine how good the individuals are at solving a given problem, fitness functions are employed. The fitness function assesses how an individual is fitted to the environment of a domain problem, after calculating the fitness for all individuals. Every individual in a population is assigned a fitness value and then, based on the adopted selection method, specific individuals are identified for crossing over, mutating or reproducing. The considered fitness functions used in many systems for solving research and optimization problems are MSE (mean square error), SRM (structural risk minimization), AIC (Akaike's information criterion), MDL (minimum description length) and FPE (final prediction error) [13].
2.2.4. Implementation scheme
The implementation of GP is performed by the following steps:
Step 1: One or more initial populations of individuals are randomly generated with functions and terminals related to the problem domain.
Step 2: The implementation of GP iteratively performs the following steps until the termination criterion has been satisfied:
- The fitness value of every individual is estimated according to a selected fitness measure.
- All individuals in the population are sorted based on their fitness values.
- The next generation is produced using the genetic operations
- The termination criterion is checked. If it is not satisfied, the next iteration is performed; if it is satisfied, go to step 3.
Step 3: The result may be a solution to the problem domain.
In particular, the quality of the final model is controlled by either the selection of training data or the values of the control parameters (crossover, mutation, reproduction and fitness function).
2.3. Software tool
The software used was a modified form of that described previously [14], but with additional exponential and other mathematical functions as well as the fitness functions listed. Table 2 lists the values of the control parameters used in the present study.
Table 2. The value of control parameters of GP used in this study.
Population size(s) | 10 000 | Crossover | 0.2–0.5 |
Generations | 100–500–1000 | Mutation | 0.2–0.5 |
Max. no. nodes | 10–15–20 | Constants mutation | 0.5 |
Max. depth | 10 | Regeneration | 0.7 |
Addition functions | Exp | Fitness type | AIC, FPE, MDL, MSE and SRM |
reproduction | 0.05–0.2 | Test formulations | 2, 10 |
In order to evaluate the quality of a model generated by GP, the correlation coefficient R-squared (R2) was computed, with higher values of R2 indicating improved quality of the model,
where is the mean value of the dependent variable y, is the predicted value from the model and n is the number of records.
3. Results and discussion
Validation of the predictive power of GP for mathematical and formulation data has been evaluated and presented elsewhere [8, 9].
3.1. The predicted results of permeability coefficient (log K p )
The permeability coefficient has the following expression:
From equation (
The values of the training parameters, which GP used to produce equation (
As analysed in the literature using MINITAB statistical software [3], equation (
Table 3. Values of training parameters of GP producing equation (
Population size(s) | 10 000 |
Max. no. generations | 500 |
Max. no. nodes | 10 |
Max. depth | 10 |
Addition functions | |
reproduction | 0.05 |
Crossover | 0.5 |
Mutation | 0.5 |
Constants mutation | 0.5 |
Regeneration | 0.7 |
Fitness type | AIC |
Test formulations | 2, 10 |
When comparing the predictive models between GP and statistical methods, the results of regression analysis for both methods using whole data of 22 structures in figure 2 indicated that there are only minor differences between these techniques. While producing a slightly lower linear R2 value, GP generated a better model with more satisfactory values of slope and intercept coefficients for this response.
3.2. The predicted results of skin/water partition coefficient (log K m )
The skin/water partition coefficient is determined by the following formula:
For this response from GP given in equation (
As seen in table 4, the AIC fitness function also gave a high quality model with a population size of 10 000 500 generations, as well as the same values of crossover and mutation as in the case for the first response. However, in order to achieve a satisfactory result for (log K m ), a larger number of tree nodes equal to 15 are required, and the value of the reproduction parameter also increased from 0.05 to 0.2.
Table 4. Values of training parameters of GP producing (
Population size(s) | 10 000 |
Max. no. generations | 500 |
Max. no. nodes | 15 |
Max. depth | 15 |
Addition functions | |
reproduction | 0.2 |
Crossover | 0.5 |
Mutation | 0.5 |
Constants mutation | 0.5 |
Regeneration | 0.7 |
Fitness type | AIC |
Test formulations | 14, 18 |
For the skin/water partition coefficient log K m response, the models predicted by the statistical method are
As reported in the literature, the [E
HOMO
] parameter is a hydrogen bonding acceptor ability parameter, and according to equation (
When equations (
In order to examine the predictability of GP for the skin/water partition coefficient (log K m ), regression analysis was applied, and the results of both methods for this response were compared. From figure 3, the results of regression analysis for both methods using whole data of 22 structures show that there are no major differences between the two methods in this case. The linear R2 values of three equations and regression coefficients observed are also similar.
3.3. The predicted results of diffusion coefficient (log D/h)
The diffusion coefficient log D/h is predicted by GP as follows:
It can be seen from equation (
GP required the same population size and generations as in the predictions for the previous responses of 10 000 and 500, respectively (see table 5). Apart from the training model using the MDL fitness function and a larger number of nodes set to 15, all values of the remaining parameters in the training model are the same for the GP prediction for the first response of this data set.
Table 5. Values of training parameters of GP producing (equation (
Population size(s) | 10 000 |
Max. no. generations | 500 |
Max. no. Nodes | 15 |
Max. depth | 15 |
Addition functions | |
reproduction | 0.05 |
Crossover | 0.5 |
Mutation | 0.5 |
Constants mutation | 0.5 |
Regeneration | 0.7 |
Fitness type | MDL |
Test formulations | 10, 18 |
In the original paper, Ghafourian and Fooladi [3] showed that this response could also be given by
Based on equation (
As reported in the literature [3], the response of the diffusion coefficient log D/h was modelled successfully using the statistical method with very high R2 values, as shown below:
The model predicted by GP in equation (
When the predictive model of GP (
From figure 4 comparing equation (
From figure 4, it can also be seen that equation (
In general, for the diffusion coefficient (log D/h) response, while not generating the same influential parameters as reported in the literature, the quality of prediction of GP for this response is high in both observations on the R2 value as well as on the regression analysis.
4. Conclusions
For a highly complex problem with nonlinear relationships between chemical structures and biological effects in quantitative structure–activity relationship (QSAR) data, genetic programming, with its advantage of generating mathematical equations, has been shown to be an efficient method. In contrast to statistical approaches, GP requires no assumption of the functional form (e.g. linear or quadratic) to be used to describe the cause-and-effect relationships.