Modelling the effect of structural QSAR parameters on skin penetration using genetic programming

K K Chung; D Q Do

doi:10.1088/2043-6254/1/3/035003

1. Introduction

QSAR data are known to be highly complex with nonlinear relationships. Modelling the relationships between chemical structures and biological effects is the target of structure–activity studies [1]. Therefore, with a large number of molecular descriptors, finding an efficient tool to glean knowledge from the relationships of structure–activity in QSAR data is necessary. There are many methods applied routinely for QSAR. In addition to statistical methods [2–4], artificial intelligence techniques have been employed to deal with such data. Recently, several research papers have discussed whether artificial intelligence can deal with the problem of QSAR modelling. Examples include the application of neural networks for discriminating antibacterial activity [5] and for 3D QSAR [6], or an application of neural networks and genetic algorithms for modelling QSAR data [7].

In the previous studies, an alternative artificial intelligence method that was developed based on the Darwinian theory called genetic programming (GP) was successfully employed to modelling pharmaceutical formulation data [8, 9]. In this study, GP was used to analyse QSAR data and a comparison of the ability of GP and statistical methods to model QSAR data was examined for a QSAR data set from a study on the effect of structural QSAR parameters on skin penetration. For the comparison, an assumption was made that in all cases there was no interaction between the molecular descriptors (i.e. all independent variables).

2. Materials and methods

2.1. Materials

The data set used in this work was taken from a study of the effect of structural QSAR parameters on skin penetration. In [3], Ghafourian and Fooladi analysed 20 variables of 17 inputs and 3 outputs—permeability (^log K_^p), ^skin/^water partition (^log K_^m) and diffusion coefficient (^log D/h), as listed in table 1. Ghafourian and Fooladi analysed 39 compounds. However, in this study, for ease of comparison with the statistical results in the literature, only 22 compounds (compound 1–14 and 26–33) were used for evaluating the predictability of GP for all responses. From a total of 22 compounds used for observation, 20 records are used as training data and 2 records are used as test data selected by the Smart Selection function in GP.

Table 1. QSAR parameters.

Independent variables
	^Octanol/^water partition coefficient	MW	Molecular weight
	Logarithm of total solvent accessible	E_^HOM	Energy of the highest occupied
	surface area		molecular orbital
OT	Number of hydrogen bonding heteroatoms	HT	Number of hydrogen bonding hydrogens
q_s ^+a	Sum of atomic charges on the hydrogen	δ	Solubility parameter
	bonding hydrogen atoms
q^−a	The most negative atomic charge on the	V	Molar volume
	hydrogen bond accepting heteroatoms
q_s ^−a	Sum of atomic charges on the hydrogen	²κ	Second-order molecular shape index
	bonding heteroatoms
q_s ^−b	Sum of atomic charges on the hydrogen	⁰κ _α	Alpha-modified zero-order molecular
	bonding heteroatoms		shape index
q^−b	The most negative atomic charge on the	⁴χ_^pc ^v	Fourth-order valence-corrected path-
	hydrogen bond accepting heteroatoms		cluster molecular connectivity index
^ESP ⁻	The most negative electrostatic potential on
	the solvent accessible surface of molecules
Dependent variables
	Permeability coefficient
	^Skin/^water partition coefficient
	–

2.2. Methods

GP has been known primarily as a learning technique producing mathematical equations [10–12]. It is founded on the basic principle of Darwin's theory of evolution. In nature, biological structures that are more successful at operating in their given environment have a higher probability of surviving and reproducing. These structures are considered to be the result of Darwinian natural selection operating in an environment over a period of time. With a similar principle of natural selection, Koza introduced GP that attempts and succeeds in applying this evolutionary theory in order to find the best (fittest) or the most appropriate equation (solution) for a problem domain [10].

2.2.1. Representation scheme

GP modifies individuals of the population after a number of runs in every cycle, called a generation, to find the best solution. The structure of individuals in GP is a tree (see figure 1(a)). Possible structures in genetic programming are the set of all compositions of functions and the set of terminals.

**Figure 1** (a) An example of a GP structure, (b) crossover operation and (c) process of mutation.

The elements in the function set may include:

arithmetic operations (+, -, *, etc),
mathematical functions (such as sin, cos, exp and log),
Boolean operations (such as AND, OR and NOT),
conditional operators (such as If-Then-Else),
functions causing iteration (such as Do-Until),
functions causing recursion,
any other domain-specific functions that may be defined.

The terminals are typically either a variable or a constant value.

2.2.2. Genetic operations

In the Darwinian principle of reproduction and survival of the fittest, a population always has many generations and the individuals of new generations are the result of the combination of the individuals in the previous generations. In other words, this is a transfer of a set of individuals into a new generation of the population using genetic operations. Crossover, reproduction and mutation are genetic operations for creating the new individuals in each generation (figures 1(b) and (c)). The crossover operation creates a new individual from two parental individuals selected from the population based on fitness. Within each parental tree, a random node is selected and the sub-trees under the selected node are swapped to create two new trees. The simplest operation of genetic operations selects the individuals from a population based on their fitness and then copies them into the new population. The mutation operation creates a new individual from an existing tree in a population by deleting and replacing one node of that tree with another node from the same set. Note that a function replaces only a function and a terminal replaces only a terminal.

2.2.3. Fitness function

To select individuals for crossover, reproduction and to determine how good the individuals are at solving a given problem, fitness functions are employed. The fitness function assesses how an individual is fitted to the environment of a domain problem, after calculating the fitness for all individuals. Every individual in a population is assigned a fitness value and then, based on the adopted selection method, specific individuals are identified for crossing over, mutating or reproducing. The considered fitness functions used in many systems for solving research and optimization problems are MSE (mean square error), SRM (structural risk minimization), AIC (Akaike's information criterion), MDL (minimum description length) and FPE (final prediction error) [13].

2.2.4. Implementation scheme

The implementation of GP is performed by the following steps:

Step 1: One or more initial populations of individuals are randomly generated with functions and terminals related to the problem domain.

Step 2: The implementation of GP iteratively performs the following steps until the termination criterion has been satisfied:

The fitness value of every individual is estimated according to a selected fitness measure.
All individuals in the population are sorted based on their fitness values.
The next generation is produced using the genetic operations
The termination criterion is checked. If it is not satisfied, the next iteration is performed; if it is satisfied, go to step 3.

Step 3: The result may be a solution to the problem domain.

In particular, the quality of the final model is controlled by either the selection of training data or the values of the control parameters (crossover, mutation, reproduction and fitness function).

2.3. Software tool

The software used was a modified form of that described previously [14], but with additional exponential and other mathematical functions as well as the fitness functions listed. Table 2 lists the values of the control parameters used in the present study.

Table 2. The value of control parameters of GP used in this study.

Training parameterValueTraining parameterValue

Population size(s)	10 000	Crossover	0.2–0.5
Generations	100–500–1000	Mutation	0.2–0.5
Max. no. nodes	10–15–20	Constants mutation	0.5
Max. depth	10	Regeneration	0.7
Addition functions	Exp	Fitness type	AIC, FPE, MDL, MSE and SRM
reproduction	0.05–0.2	Test formulations	2, 10

In order to evaluate the quality of a model generated by GP, the correlation coefficient R-squared (R²) was computed, with higher values of R² indicating improved quality of the model,

where is the mean value of the dependent variable y, is the predicted value from the model and n is the number of records.

3. Results and discussion

Validation of the predictive power of GP for mathematical and formulation data has been evaluated and presented elsewhere [8, 9].

3.1. The predicted results of permeability coefficient (^log K_^p)

The permeability coefficient has the following expression:

From equation (1), the R² value of the predictive model is considered as a satisfactory result and moreover the predictive equation is extremely simple for the permeability coefficient (^log K_^p) response. In equation (1), only three parameters—the number of hydrogen bonding heteroatoms [OT], the sum of atomic charges on the hydrogen bonding heteroatoms [q_^s ^−b], and the second-order molecular shape index [²κ ]—from a total of 17 input variables influence the permeability coefficient ^log K_^p.

The values of the training parameters, which GP used to produce equation (1), are shown in table 3. For this response, the simplicity of the predictive equation resulted from a maximum of 10 nodes in the training trees. However, because of this small number of tree nodes, the population size required was 10 000 with 500 generations. In addition, maximum values of crossover and mutation in the training models were needed to obtain an acceptable model. For this response, the AIC fitness function was used to generate a high-quality model. The details of the equations from the statistical method are shown below:

As analysed in the literature using MINITAB statistical software [3], equation (2) shows that is correlated with the logarithm of total solvent accessible surface area calculated using a probe of 1.4 Å radius (log [TA]), the sum of atomic charge on the hydrogen bonding heteroatoms [q_^s ^−b]and the number of hydrogen bonding heteroatoms [OT] in equation (2). In equation (3), this response is also correlated with the number of hydrogen bonding heteroatoms [OT], the second-order molecular shape index [²κ] and the sum of atomic charge on the hydrogen bonding heteroatoms calculated by the AM1 method [q_^s ^+a].

Table 3. Values of training parameters of GP producing equation (1).

Training parameterValue

Population size(s)	10 000
Max. no. generations	500
Max. no. nodes	10
Max. depth	10
Addition functions
reproduction	0.05
Crossover	0.5
Mutation	0.5
Constants mutation	0.5
Regeneration	0.7
Fitness type	AIC
Test formulations	2, 10

When comparing the predictive models between GP and statistical methods, the results of regression analysis for both methods using whole data of 22 structures in figure 2 indicated that there are only minor differences between these techniques. While producing a slightly lower linear R² value, GP generated a better model with more satisfactory values of slope and intercept coefficients for this response.

**Figure 2** Scatter plots, linear equations and linear R² values of ^log K_^p generated from GP equation (1) (▴) and statistical equations (2) () and (3) (▪).

3.2. The predicted results of skin/water partition coefficient (^log K_^m)

The ^skin/^water partition coefficient is determined by the following formula:

For this response from GP given in equation (4), the parameters of the octanol/water partition coefficient molecular weight [^log P_^oct], the second-order molecular shape index [²κ] and the energy of the highest occupied molecular orbital E_^HOMO correlate with the skin/water partition coefficient (^log K_^m) response with a high R² value of 0.96. This result is similar to the prediction for the first response of these QSAR data in that the equation predicted is simple with only three parameters.

As seen in table 4, the AIC fitness function also gave a high quality model with a population size of 10 000 500 generations, as well as the same values of crossover and mutation as in the case for the first response. However, in order to achieve a satisfactory result for (^log K_^m), a larger number of tree nodes equal to 15 are required, and the value of the reproduction parameter also increased from 0.05 to 0.2.

Table 4. Values of training parameters of GP producing (4).

Training parameterValue

Population size(s)	10 000
Max. no. generations	500
Max. no. nodes	15
Max. depth	15
Addition functions
reproduction	0.2
Crossover	0.5
Mutation	0.5
Constants mutation	0.5
Regeneration	0.7
Fitness type	AIC
Test formulations	14, 18

For the skin/water partition coefficient ^log K_^m response, the models predicted by the statistical method are

As reported in the literature, the [E_^HOMO] parameter is a hydrogen bonding acceptor ability parameter, and according to equation (5) it has a positive effect on (^log K_^m). Also with regard to the literature reported correlation of [^log P_^oct] with the two parameters [OT] and [log TA] in equation (7) [3], it can be seen that the parameters of [OT] and [log TA] influence the octanol/water partition coefficient [^log P_^oct]. The GP (4) for this response shows that the predictive model also displays the correlation between (^log K_^m) and two parameters of [E_^HOMO] and [^log P_^oct].

When equations (6) and (4) are considered in more detail, the results from the two methods are consistent in that the skin/water partition coefficient is correlated with the size of the compounds, as presented by [log TA] (in equation (6)) and [²κ] (in equation (4)). Thus GP also produced a high- quality model for this response data with similar influential parameters as reported in the literature [6].

In order to examine the predictability of GP for the skin/water partition coefficient (^log K_^m), regression analysis was applied, and the results of both methods for this response were compared. From figure 3, the results of regression analysis for both methods using whole data of 22 structures show that there are no major differences between the two methods in this case. The linear R² values of three equations and regression coefficients observed are also similar.

**Figure 3** Scatter plots, linear equations and linear R² values of ^log K_^m generated from GP equation (4) (▴) and statistical equations (5) () and (6) (▪).

3.3. The predicted results of diffusion coefficient (log D/h)

The diffusion coefficient ^log D/h is predicted by GP as follows:

It can be seen from equation (8) that the predictive model for this response has extremely high train and test R² values with four input variables influencing the final results of this response. The diffusion coefficient (^log D/h) response is correlated with the number of hydrogen bonding hydrogens [HT], the sum of atomic charges on the hydrogen bonding heteroatoms calculated by both methods of the AM1 method [q_^s ^−a]and the classical method [q_^s ^-b] [3], and the most negative electrostatic potential on the solvent accessible surface of molecule [ESP] parameters.

GP required the same population size and generations as in the predictions for the previous responses of 10 000 and 500, respectively (see table 5). Apart from the training model using the MDL fitness function and a larger number of nodes set to 15, all values of the remaining parameters in the training model are the same for the GP prediction for the first response of this data set.

Table 5. Values of training parameters of GP producing (equation (8)).

Training parameterValue

Population size(s)	10 000
Max. no. generations	500
Max. no. Nodes	15
Max. depth	15
Addition functions
reproduction	0.05
Crossover	0.5
Mutation	0.5
Constants mutation	0.5
Regeneration	0.7
Fitness type	MDL
Test formulations	10, 18

In the original paper, Ghafourian and Fooladi [3] showed that this response could also be given by

Based on equation (9), an alternative equation for this response predicted by GP using the predictions from the ^log K_^m and ^log K_^p responses was

As reported in the literature [3], the response of the diffusion coefficient ^log D/h was modelled successfully using the statistical method with very high R² values, as shown below:

The model predicted by GP in equation (8) demonstrates that there are four parameters from a total of 17 influencing the diffusion coefficient (^log D/h).

When the predictive model of GP (8) is compared with the statistical results (equations (11) and (12)), both methods indicate that the diffusion coefficient (^log D/h) is negatively correlated with the number of hydrogen bonding hydrogens [HT] (those connected to the oxygen or nitrogen atoms), and positively correlated with the most negative electrostatic potential on the solvent accessible surface of molecules [ESP], which can be considered as the hydrogen bonding basicity parameter [3]. However, in the statistical result, the molecular weight [MW] influences (^log D/h), while in the GP result, ^log D/h is correlated with the sum of atomic charges on the hydrogen bonding heteroatoms [q_^s ^−a] and [q_^s ^−b]parameters.

From figure 4 comparing equation (8) from GP to equations (11) and (12), it is apparent that GP delivers a slightly higher quality model for parameter (^log D/h) in comparison with the statistical results based on the regression analysis for whole data of 22 records. Although the linear R² values of the three models are equal to 0.98, the slope and intercept coefficients in the linear equations of GP are closer to 1.00, while those from the statistical equations are 0.00.

**Figure 4** Scatter plots, linear equations and linear R² values of ^log *D/h* generated from GP equations (8) () and (10) (▪), and statistical equations (11) (▴) and (12) (x).

From figure 4, it can also be seen that equation (10) is an improved model with more satisfactory values of slope and intercept coefficients in comparison with the statistical results.

In general, for the diffusion coefficient (^log D/h) response, while not generating the same influential parameters as reported in the literature, the quality of prediction of GP for this response is high in both observations on the R² value as well as on the regression analysis.

4. Conclusions

For a highly complex problem with nonlinear relationships between chemical structures and biological effects in quantitative structure–activity relationship (QSAR) data, genetic programming, with its advantage of generating mathematical equations, has been shown to be an efficient method. In contrast to statistical approaches, GP requires no assumption of the functional form (e.g. linear or quadratic) to be used to describe the cause-and-effect relationships.

Modelling the effect of structural QSAR parameters on skin penetration using genetic programming

Article metrics

Author e-mails

Author affiliations

Dates

Abstract

1. Introduction

2. Materials and methods

2.1. Materials

2.2. Methods

2.2.1. Representation scheme

2.2.2. Genetic operations

2.2.3. Fitness function

2.2.4. Implementation scheme

2.3. Software tool

3. Results and discussion

3.1. The predicted results of permeability coefficient (^log K_^p)

3.2. The predicted results of skin/water partition coefficient (^log K_^m)

3.3. The predicted results of diffusion coefficient (log D/h)

4. Conclusions

Modelling the effect of structural QSAR parameters on skin penetration using genetic programming

Article metrics

Share this article

Author e-mails

Author affiliations

Dates

Abstract

1. Introduction

2. Materials and methods

2.1. Materials

2.2. Methods

2.2.1. Representation scheme

2.2.2. Genetic operations

2.2.3. Fitness function

2.2.4. Implementation scheme

2.3. Software tool

3. Results and discussion

3.1. The predicted results of permeability coefficient (log K p )

3.2. The predicted results of skin/water partition coefficient (log K m )

3.3. The predicted results of diffusion coefficient (log D/h)

4. Conclusions

3.1. The predicted results of permeability coefficient (^log K_^p)

3.2. The predicted results of skin/water partition coefficient (^log K_^m)