The following article is Open access

Modelling the effect of structural QSAR parameters on skin penetration using genetic programming

and

Published 22 October 2010 2010 Vietnam Academy of Science & Technology
, , Citation K K Chung and D Q Do 2010 Adv. Nat. Sci: Nanosci. Nanotechnol. 1 035003 DOI 10.1088/2043-6254/1/3/035003

2043-6262/1/3/035003

Abstract

In order to model relationships between chemical structures and biological effects in quantitative structure–activity relationship (QSAR) data, an alternative technique of artificial intelligence computing—genetic programming (GP)—was investigated and compared to the traditional method—statistical. GP, with the primary advantage of generating mathematical equations, was employed to model QSAR data and to define the most important molecular descriptions in QSAR data. The models predicted by GP agreed with the statistical results, and the most predictive models of GP were significantly improved when compared to the statistical models using ANOVA. Recently, artificial intelligence techniques have been applied widely to analyse QSAR data. With the capability of generating mathematical equations, GP can be considered as an effective and efficient method for modelling QSAR data.

Export citation and abstract BibTeX RIS

Content from this work may be used under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

QSAR data are known to be highly complex with nonlinear relationships. Modelling the relationships between chemical structures and biological effects is the target of structure–activity studies [1]. Therefore, with a large number of molecular descriptors, finding an efficient tool to glean knowledge from the relationships of structure–activity in QSAR data is necessary. There are many methods applied routinely for QSAR. In addition to statistical methods [2–4], artificial intelligence techniques have been employed to deal with such data. Recently, several research papers have discussed whether artificial intelligence can deal with the problem of QSAR modelling. Examples include the application of neural networks for discriminating antibacterial activity [5] and for 3D QSAR [6], or an application of neural networks and genetic algorithms for modelling QSAR data [7].

In the previous studies, an alternative artificial intelligence method that was developed based on the Darwinian theory called genetic programming (GP) was successfully employed to modelling pharmaceutical formulation data [89]. In this study, GP was used to analyse QSAR data and a comparison of the ability of GP and statistical methods to model QSAR data was examined for a QSAR data set from a study on the effect of structural QSAR parameters on skin penetration. For the comparison, an assumption was made that in all cases there was no interaction between the molecular descriptors (i.e. all independent variables).

2. Materials and methods

2.1. Materials

The data set used in this work was taken from a study of the effect of structural QSAR parameters on skin penetration. In [3], Ghafourian and Fooladi analysed 20 variables of 17 inputs and 3 outputs—permeability (logK p ), skin/water partition (logK m ) and diffusion coefficient (logD/h), as listed in table 1. Ghafourian and Fooladi analysed 39 compounds. However, in this study, for ease of comparison with the statistical results in the literature, only 22 compounds (compound 1–14 and 26–33) were used for evaluating the predictability of GP for all responses. From a total of 22 compounds used for observation, 20 records are used as training data and 2 records are used as test data selected by the Smart Selection function in GP.

Table 1. QSAR parameters.

Independent variables   
Octanol/water partition coefficientMWMolecular weight
Logarithm of total solvent accessible E HOM Energy of the highest occupied
 surface area molecular orbital
OTNumber of hydrogen bonding heteroatomsHTNumber of hydrogen bonding hydrogens
qs +a Sum of atomic charges on the hydrogenδSolubility parameter
 bonding hydrogen atoms  
qa The most negative atomic charge on theVMolar volume
 hydrogen bond accepting heteroatoms  
qs a Sum of atomic charges on the hydrogen 2κ Second-order molecular shape index
 bonding heteroatoms  
qs b Sum of atomic charges on the hydrogen 0κ α Alpha-modified zero-order molecular
 bonding heteroatoms shape index
qb The most negative atomic charge on the 4χ pc v Fourth-order valence-corrected path-
 hydrogen bond accepting heteroatoms cluster molecular connectivity index
ESP The most negative electrostatic potential on  
 the solvent accessible surface of molecules  
Dependent variables   
Permeability coefficient  
Skin/water partition coefficient  
  

2.2. Methods

GP has been known primarily as a learning technique producing mathematical equations [10–12]. It is founded on the basic principle of Darwin's theory of evolution. In nature, biological structures that are more successful at operating in their given environment have a higher probability of surviving and reproducing. These structures are considered to be the result of Darwinian natural selection operating in an environment over a period of time. With a similar principle of natural selection, Koza introduced GP that attempts and succeeds in applying this evolutionary theory in order to find the best (fittest) or the most appropriate equation (solution) for a problem domain [10].

2.2.1. Representation scheme

GP modifies individuals of the population after a number of runs in every cycle, called a generation, to find the best solution. The structure of individuals in GP is a tree (see figure 1(a)). Possible structures in genetic programming are the set of all compositions of functions and the set of terminals.

Figure 1

Figure 1 (a) An example of a GP structure, (b) crossover operation and (c) process of mutation.

The elements in the function set may include:

  • arithmetic operations (+, -, *, etc),
  • mathematical functions (such as sin, cos, exp and log),
  • Boolean operations (such as AND, OR and NOT),
  • conditional operators (such as If-Then-Else),
  • functions causing iteration (such as Do-Until),
  • functions causing recursion,
  • any other domain-specific functions that may be defined.

The terminals are typically either a variable or a constant value.

2.2.2. Genetic operations

In the Darwinian principle of reproduction and survival of the fittest, a population always has many generations and the individuals of new generations are the result of the combination of the individuals in the previous generations. In other words, this is a transfer of a set of individuals into a new generation of the population using genetic operations. Crossover, reproduction and mutation are genetic operations for creating the new individuals in each generation (figures 1(b) and (c)). The crossover operation creates a new individual from two parental individuals selected from the population based on fitness. Within each parental tree, a random node is selected and the sub-trees under the selected node are swapped to create two new trees. The simplest operation of genetic operations selects the individuals from a population based on their fitness and then copies them into the new population. The mutation operation creates a new individual from an existing tree in a population by deleting and replacing one node of that tree with another node from the same set. Note that a function replaces only a function and a terminal replaces only a terminal.

2.2.3. Fitness function

To select individuals for crossover, reproduction and to determine how good the individuals are at solving a given problem, fitness functions are employed. The fitness function assesses how an individual is fitted to the environment of a domain problem, after calculating the fitness for all individuals. Every individual in a population is assigned a fitness value and then, based on the adopted selection method, specific individuals are identified for crossing over, mutating or reproducing. The considered fitness functions used in many systems for solving research and optimization problems are MSE (mean square error), SRM (structural risk minimization), AIC (Akaike's information criterion), MDL (minimum description length) and FPE (final prediction error) [13].

2.2.4. Implementation scheme

The implementation of GP is performed by the following steps:

Step 1: One or more initial populations of individuals are randomly generated with functions and terminals related to the problem domain.

Step 2: The implementation of GP iteratively performs the following steps until the termination criterion has been satisfied:

  • The fitness value of every individual is estimated according to a selected fitness measure.
  • All individuals in the population are sorted based on their fitness values.
  • The next generation is produced using the genetic operations
  • The termination criterion is checked. If it is not satisfied, the next iteration is performed; if it is satisfied, go to step 3.

Step 3: The result may be a solution to the problem domain.

In particular, the quality of the final model is controlled by either the selection of training data or the values of the control parameters (crossover, mutation, reproduction and fitness function).

2.3. Software tool

The software used was a modified form of that described previously [14], but with additional exponential and other mathematical functions as well as the fitness functions listed. Table 2 lists the values of the control parameters used in the present study.

Table 2. The value of control parameters of GP used in this study.

Training parameterValueTraining parameterValue
Population size(s)10 000Crossover0.2–0.5
Generations100–500–1000Mutation0.2–0.5
Max. no. nodes10–15–20Constants mutation0.5
Max. depth10Regeneration0.7
Addition functionsExpFitness typeAIC, FPE, MDL, MSE and SRM
reproduction0.05–0.2Test formulations2, 10

In order to evaluate the quality of a model generated by GP, the correlation coefficient R-squared (R2) was computed, with higher values of R2 indicating improved quality of the model,

where is the mean value of the dependent variable y, is the predicted value from the model and n is the number of records.

3. Results and discussion

Validation of the predictive power of GP for mathematical and formulation data has been evaluated and presented elsewhere [89].

3.1. The predicted results of permeability coefficient (logK p )

The permeability coefficient has the following expression:

From equation (1), the R2 value of the predictive model is considered as a satisfactory result and moreover the predictive equation is extremely simple for the permeability coefficient (logK p ) response. In equation (1), only three parameters—the number of hydrogen bonding heteroatoms [OT], the sum of atomic charges on the hydrogen bonding heteroatoms [q s b ], and the second-order molecular shape index [2κ ]—from a total of 17 input variables influence the permeability coefficient logK p .

The values of the training parameters, which GP used to produce equation (1), are shown in table 3. For this response, the simplicity of the predictive equation resulted from a maximum of 10 nodes in the training trees. However, because of this small number of tree nodes, the population size required was 10 000 with 500 generations. In addition, maximum values of crossover and mutation in the training models were needed to obtain an acceptable model. For this response, the AIC fitness function was used to generate a high-quality model. The details of the equations from the statistical method are shown below:

As analysed in the literature using MINITAB statistical software [3], equation (2) shows that is correlated with the logarithm of total solvent accessible surface area calculated using a probe of 1.4 Å radius (log [TA]), the sum of atomic charge on the hydrogen bonding heteroatoms [q s b ]and the number of hydrogen bonding heteroatoms [OT] in equation (2). In equation (3), this response is also correlated with the number of hydrogen bonding heteroatoms [OT], the second-order molecular shape index [2κ] and the sum of atomic charge on the hydrogen bonding heteroatoms calculated by the AM1 method [q s +a ].

Table 3. Values of training parameters of GP producing equation (1).

Training parameterValue
Population size(s)10 000
Max. no. generations500
Max. no. nodes10
Max. depth10
Addition functions 
reproduction0.05
Crossover0.5
Mutation0.5
Constants mutation0.5
Regeneration0.7
Fitness typeAIC
Test formulations2, 10

When comparing the predictive models between GP and statistical methods, the results of regression analysis for both methods using whole data of 22 structures in figure 2 indicated that there are only minor differences between these techniques. While producing a slightly lower linear R2 value, GP generated a better model with more satisfactory values of slope and intercept coefficients for this response.

Figure 2

Figure 2 Scatter plots, linear equations and linear R2 values of logK p generated from GP equation (1) (▴) and statistical equations (2) () and (3) (▪).

3.2. The predicted results of skin/water partition coefficient (logK m )

The skin/water partition coefficient is determined by the following formula:

For this response from GP given in equation (4), the parameters of the octanol/water partition coefficient molecular weight [logP oct ], the second-order molecular shape index [2κ] and the energy of the highest occupied molecular orbital E HOMO correlate with the skin/water partition coefficient (logK m ) response with a high R2 value of 0.96. This result is similar to the prediction for the first response of these QSAR data in that the equation predicted is simple with only three parameters.

As seen in table 4, the AIC fitness function also gave a high quality model with a population size of 10 000 500 generations, as well as the same values of crossover and mutation as in the case for the first response. However, in order to achieve a satisfactory result for (logK m ), a larger number of tree nodes equal to 15 are required, and the value of the reproduction parameter also increased from 0.05 to 0.2.

Table 4. Values of training parameters of GP producing (4).

Training parameterValue
Population size(s)10 000
Max. no. generations500
Max. no. nodes15
Max. depth15
Addition functions 
reproduction0.2
Crossover0.5
Mutation0.5
Constants mutation0.5
Regeneration0.7
Fitness typeAIC
Test formulations14, 18

For the skin/water partition coefficient logK m response, the models predicted by the statistical method are

As reported in the literature, the [E HOMO ] parameter is a hydrogen bonding acceptor ability parameter, and according to equation (5) it has a positive effect on (logK m ). Also with regard to the literature reported correlation of [logP oct ] with the two parameters [OT] and [log TA] in equation (7) [3], it can be seen that the parameters of [OT] and [log TA] influence the octanol/water partition coefficient [logP oct ]. The GP (4) for this response shows that the predictive model also displays the correlation between (logK m ) and two parameters of [E HOMO ] and [logP oct ].

When equations (6) and (4) are considered in more detail, the results from the two methods are consistent in that the skin/water partition coefficient is correlated with the size of the compounds, as presented by [log TA] (in equation (6)) and [2κ] (in equation (4)). Thus GP also produced a high- quality model for this response data with similar influential parameters as reported in the literature [6].

In order to examine the predictability of GP for the skin/water partition coefficient (logK m ), regression analysis was applied, and the results of both methods for this response were compared. From figure 3, the results of regression analysis for both methods using whole data of 22 structures show that there are no major differences between the two methods in this case. The linear R2 values of three equations and regression coefficients observed are also similar.

Figure 3

Figure 3 Scatter plots, linear equations and linear R2 values of logK m generated from GP equation (4) (▴) and statistical equations (5) () and (6) (▪).

3.3. The predicted results of diffusion coefficient (logD/h)

The diffusion coefficient logD/h is predicted by GP as follows:

It can be seen from equation (8) that the predictive model for this response has extremely high train and test R2 values with four input variables influencing the final results of this response. The diffusion coefficient (logD/h) response is correlated with the number of hydrogen bonding hydrogens [HT], the sum of atomic charges on the hydrogen bonding heteroatoms calculated by both methods of the AM1 method [q s a ]and the classical method [q s -b ] [3], and the most negative electrostatic potential on the solvent accessible surface of molecule [ESP] parameters.

GP required the same population size and generations as in the predictions for the previous responses of 10 000 and 500, respectively (see table 5). Apart from the training model using the MDL fitness function and a larger number of nodes set to 15, all values of the remaining parameters in the training model are the same for the GP prediction for the first response of this data set.

Table 5. Values of training parameters of GP producing (equation (8)).

Training parameterValue
Population size(s)10 000
Max. no. generations500
Max. no. Nodes15
Max. depth15
Addition functions 
reproduction0.05
Crossover0.5
Mutation0.5
Constants mutation0.5
Regeneration0.7
Fitness typeMDL
Test formulations10, 18

In the original paper, Ghafourian and Fooladi [3] showed that this response could also be given by

Based on equation (9), an alternative equation for this response predicted by GP using the predictions from the logK m and logK p responses was

As reported in the literature [3], the response of the diffusion coefficient logD/h was modelled successfully using the statistical method with very high R2 values, as shown below:

The model predicted by GP in equation (8) demonstrates that there are four parameters from a total of 17 influencing the diffusion coefficient (logD/h).

When the predictive model of GP (8) is compared with the statistical results (equations (11) and (12)), both methods indicate that the diffusion coefficient (logD/h) is negatively correlated with the number of hydrogen bonding hydrogens [HT] (those connected to the oxygen or nitrogen atoms), and positively correlated with the most negative electrostatic potential on the solvent accessible surface of molecules [ESP], which can be considered as the hydrogen bonding basicity parameter [3]. However, in the statistical result, the molecular weight [MW] influences (logD/h), while in the GP result, logD/h is correlated with the sum of atomic charges on the hydrogen bonding heteroatoms [q s a ] and [q s b ]parameters.

From figure 4 comparing equation (8) from GP to equations (11) and (12), it is apparent that GP delivers a slightly higher quality model for parameter (logD/h) in comparison with the statistical results based on the regression analysis for whole data of 22 records. Although the linear R2 values of the three models are equal to 0.98, the slope and intercept coefficients in the linear equations of GP are closer to 1.00, while those from the statistical equations are 0.00.

Figure 4

Figure 4 Scatter plots, linear equations and linear R2 values of logD/h generated from GP equations (8) () and (10) (▪), and statistical equations (11) (▴) and (12) (x).

From figure 4, it can also be seen that equation (10) is an improved model with more satisfactory values of slope and intercept coefficients in comparison with the statistical results.

In general, for the diffusion coefficient (logD/h) response, while not generating the same influential parameters as reported in the literature, the quality of prediction of GP for this response is high in both observations on the R2 value as well as on the regression analysis.

4. Conclusions

For a highly complex problem with nonlinear relationships between chemical structures and biological effects in quantitative structure–activity relationship (QSAR) data, genetic programming, with its advantage of generating mathematical equations, has been shown to be an efficient method. In contrast to statistical approaches, GP requires no assumption of the functional form (e.g. linear or quadratic) to be used to describe the cause-and-effect relationships.

Please wait… references are loading.
10.1088/2043-6254/1/3/035003