Hierarchical Representation Based Constrained Multi-objective Evolutionary Optimisation of Molecular Structures

We propose an efficient algorithm to generate Pareto optimal set of reliable molecular structures represented by group contribution methods. To effectively handle structural constraints we introduce goal oriented genetic operators to the multi-objective Nondominated Sorting Genetic Algorithm-II (NSGA-II). The constraints are defined based on the hierarchical categorisation of the molecular fragments. The efficiency of the approach is tested on several benchmark problems. The proposed approach is highly efficient to solve the molecular design problems, as proven by the presented benchmark and refrigerant design problems.


Introduction
Computational intelligence has opened a new way to the design of chemical technologies from the smaller scale of molecules to the bigger scale of process design, which is consistent with the expectations of Industry 4.0 and with the trends of modern chemical engineering [1][2][3].Nowadays, numerous computational methods are available for the solution of the computer aided molecular design (CAMD) task.Neural networks are mainly used for the revealing of correlations between structural and macroscopic properties [4][5][6].The design task is usually handled as an optimisation problem.Among the wide range of applicable algorithms particle swarm optimization (PSO) is used for drug design [7,8].Linear programming is also applicable [9,10], but usually the design task is formalised as mixed integer nonlinear integer programming (MINLP) problem.Among the wide range of computational intelligence related solvers simulated annealing and outer approximation programming algorithm have been applied [11].As it is illustrated by Mitra [12] and Weber [13] chromosomes of genetic algorithms (GAs) are convenient ways to represent molecular structures, thus most of the successful approaches use different types of genetic algorithms e.g. in [14][15][16][17].
GAs are popular global optimization algorithms developed by Holland [18] based on the mimicking of natural selection and the competitive model of survival of the fittest.The modification of genetic operators to suit the specific problem seems to be a promising way for the solution of large, complex problems, there is a trend towards exploring and developing such algorithms [19,20].
Michalewicz has showed that repair functions and special genetic operators are highly effective for the solution of constrained genetic optimization problems [21].Renner and Ekárt also reported the effectivity of such functions in computer aided design [22].Therefore, numerous publications can be found in literature with special and goal-orientedly modified genetic operators.Deep and Thakur introduced new mutation and crossover operators for real coded genetic algorithms [23,24].Colanzi and Vergilio could improve the previous empirical studies for the optimization of product line architectures with the implementation of a modified crossover operator [25].Strug et al. represented the genotypes of design problems with hierarchical hypergraphs and introduced modified mutation and cross-over operators to adapt to such representation [26].Salimi et al. solved scheduling and load balancing optimization problems with the application of fuzzy adaptive operators [27].
Venkatasubramanian et al. introduced a modified genetic algorithm with string representation of the molecular structures and new, task-oriented genetic operators that facilitate the chemistry of molecular interactions and rearrangements [28].Dyk and Nieuwoudt applied new genetic operators during the design of solvents for distillation processes [29].Glen and Payne described a genetic algorithm with 12 task-oriented mutation operators for the modification of the molecular structure used in the automated generation of molecules within constraints [30].Brown et al. proposed novel genetic operators for the modification of graph-based representation of molecular structures [31].
We believe that the biggest challenge of evolutionary molecular design is that several conflicting objectives must be handled simultaneously.The application of multi-objective genetic algorithms is a promising approach to generate the Pareto set of solutions which represent different design aspects.We demonstrated that the well-established multi-objective Non-dominated Sorting Genetic Algorithm-II (NSGA-II) is ideal tool to handle this problem [32].There is a need to further improve this approach by increasing the efficiency of the search in the huge chemical search space represented by group contribution methods [33].
The main problem with the application of multi-objective evolutionary algorithms for molecular design is the difficulty of handling of a large number of structural constraints.Our key idea is that the efficiency of the algorithm can be significantly increased by the introduction of problem-relevant genetic operators, which always generate feasible solutions.Since in molecular design the building blocks of the optimised structure are not identical, for a parsimonious and systematic design of these constraints it is beneficial to characterise how the molecular fragments can be connected to each other and define the constraints and the related genetic operators based on the hierarchical classification of the identified connection rules.
Our key contribution is therefore that instead of testing the branching and the octet rule related feasibility constraints we introduce special genetic operators that ensure the generation of reliable and connectable molecular fragments.It is important to highlight, that in the present paper we only describe the generation of connectable fragments and not any type of spatial structure of the molecule.The huge variety of molecular segments could result in hundreds of constraints.To handle this problem, we generated the goal oriented operators based on the hierarchical categorisation of the molecular fragments ensuring the effective incorporation of chemical information into the optimisation algorithm.
From now on the formalisation of the design problem is followed by a theoretical overview of the nature of genetic algorithms paying special attention to NSGA-II.After the description of the different algorithms proposed for the solution of the design task, the efficiencies of these approaches are examined through two design problems, and the results are discussed extensively to determine the improvement in the applicability of these algorithms and to compare the effectiveness with other approaches from the literature.
The proposed method was implemented in MATLAB.The results are reproducible since all the functions and numerical experiments are downloadable from the website of the authors: www.abonyilab.com.

Problem formulation 2.1 Hierarchical representation of molecular structure
Our aim is to design molecules that simultaneously meet several requirements, so we deal with models that can estimate m properties P x x x x , , , . Each property is estimated based on a group contribution method realised by a function P , , , , , ,  (here we would like to mention that there are important thermodynamic properties that are state functions and therefore could not be predicted by group contribution methods).Therefore, the molecule is represented by an n dimensional vector of positive integers x , , , , ,  (where , , ,  are the number of group type #1, #2, …, #n, respectively).For simpler handling, the fragments from the Joback method are classified according to Fig. 1.Vector x therefore yields the number of the assigned molecular fragments, e.g.

Formulation of the multi-objective optimisation problem
During a material design task, multiple objectives must be taken into consideration simultaneously.The solution of such problems can be computed by combining them into a single criterion to be optimised, called a utility function [34].Considering the characteristics of design problems, some of the specified features need to be maximised, Fig. 1 The classification of the Joback molecular fragments based on the type and number of bonds minimised or close to the given targets, as described by U q in Eq. ( 1), the utility function of target q (P q  stands for the target value of target q).
Consequently, the target properties are incorporated into a set of utility functions, U , an nq dimensional vector as U , , , . The general formulation of an optimal CAMD design problem supposing a multi-objective optimisation problem with nq functions is presented in Eq. ( 2): max , , , , (2) where Ω is the space of feasible solutions, g s (x) is a set of structural equality constraints, D s (x) is a set of divisibility constraints, and E s (x) is a set of restrictions for existence.
The property constraints are represented by P c l and P c u .
Due to the often conflicting objective functions, the solution of the optimisation problem is not a single solution, but a group of points called the Pareto-optimal set, which is defined as follows: Definition 1. x * ∈ Ω is called a Pareto-optimal (efficient, non-inferior or non-dominated) solution of a multi-objective problem described in Eq. ( 2), if another feasible solution x does not exist such that U U , , ,  [35].Therefore, the Pareto-optimal solutions are those for which improvement in one objective can only take place with the worsening of at least one other objective function.Pareto-ranking is the process of determining the rank of each solution through identifying the number of other solutions that dominate it [36].

Constraints
The solution of practical problems is often restricted by some constraints imposed on the decision variables.Explicit constraints can usually be classified into two major groups [37]: Domain constraints defined for the expression of the definition of the objective function.In the present formalisation structural feasibility constraints are defined to ensure the solutions are restricted to the space of feasible solutions.
Preference constraints which impose further constraints based on further knowledge of the designer.In the case of material design, constraints of design aspects (property constraints and design limitations) are defined, which must be established for each specific design problem.

Structural feasibility constraints
1.The fragments of a feasible molecular structure can be joined to a single connected component, while unconnected bonds are prohibited.This means that neither connections of (-Cl, -Cl, -Cl) nor (>CH-, -Br) are feasible molecules, due to the unconnected molecular fragments and unconnected bonds, respectively.The formulation of a single molecular structure is ensured by the octet rule, as described in Eq. ( 3) [38]: where v i stands for the connections of group i, that is the number of other fragments it can be connected with (for example -X, =X and ≡X fragments have 1, while -X-and =X= have two connections).The connections of each type of fragments are presented in Fig. 1.Further constraints are applied only in the presence of multiple bond types simultaneously, e.g. when the molecular structure contains single and double and/ or triple bonds as well.2. The feasibility of the molecule further requires that each connection should have its appropriate pair, e.g. each =X should have a double bond to connect with within the molecule.This means that the sum of single, double or triple bonds of fragments in a molecule must be even, respectively.For the satisfaction of feasibility, the connectivity of different types of molecular subunits in terms of bond type must be maintained with "bridge" fragments, e.g.=X-, =X<, ≡X-.For example, if a molecule contains double and single bonds simultaneously, the connection between them must be ensured with a "bridge-like" group, marked with the bold character in X=X-X-X.This rule for double bonds is represented in Eq. ( 4).
When the molecule contains groups with double bonds, the group = X= does not influence the connection of fragments in the molecular structure in terms of feasibility, it can be added to the structure or removed from it without restrictions.This existence constraint is described in Eq. ( 5).
3. In the case of triple bonds, the even number of triple bond connections and the appropriate number of "bridge" groups need to maintain the link with the other bond types as described in Eq. ( 6).
One should notice that the number of double and triple bonds is constrained in the connection of molecular fragments, but none of the constraints restricts the number of single bonds.This is because if the octet rule and the criteria for double and triple bonds are fulfilled, then the number of single connections of the groups in a molecule must be even and the links to the subunits with double or triple bonds solved.

Constraints of design aspects
During the formalisation of preference constraints, the designer must form restrictions for the solution according to the knowledge at a higher level.The designer can define the type of available groups, their available numbers, and the constraints for physicochemical properties as well.and upper limits P c u ( ) for property c are given in Eq. ( 7), where c n c = ( ) As part of the solution of the above-defined CAMD task, the "generate-and-test" method can seem to be inefficient as a large number of candidate molecules are created which finally turn out to be infeasible from the view of connectivity.In the present work feasibility constraints are implemented in the algorithm to filter out the resultant molecular structures in terms of feasibility.As in this approach the candidates are still filtered out after property evaluation, the efficiency of the search can still seem to be inefficient.To solve this contradiction problem, specifically modified genetic operators are introduced which improve the individuals of the populations to ensure the estimation of properties is based on only reliable structures, and the property evaluation is carried out on solely feasible molecules.
In the present work, the Joback method used to predict specific properties, while genetic algorithms are applied for the design of candidate molecular structures.Based on the characteristics of CAMD problems the evolutionary algorithm is commonly accepted as an efficient method of finding solutions, thus to solve the problem defined in Section 2 the use of these stochastic minimum searchbased algorithms was a promising approach.To achieve the Pareto-optimal solutions two major methods are common: • single objective-based solutions with multiple applications of the approach conducted to find a set of Pareto-optimal solutions • an evolutionary algorithm with multiple objectives, in which multiple Pareto-optimal solutions are found simultaneously in a single run.
GAs do not provide a single optimal solution for a problem, but several near-optimal solutions.This is the main advantage of evolutionary algorithms in the field of CAMD problems because near-optimal solutions can be further processed later by the designer and the most promising ones can be selected for synthesis.
During the operation of a GA, a population of candidate solutions competes for survival, based on their closeness to the target values.This closeness is described by a normalised distance value between 0 and 1 and called the fitness.The candidate molecules are usually described by the form of strings, and the components of these strings represent the 'genes' of the individual.Therefore, the evaluation of the population is carried out with the calculation of fitness, and the surviving members have the opportunity to reproduce and propagate their genes, thus forming the next generation.This propagation is dependent on the genetic operators applied by the specific algorithm being used; the most common ones are crossover and mutation.The creation of subsequent generations is continued until convergence is obtained (no considerable improvement is observed), or the maximum number of generations set by the user is achieved [15].

Description of the proposed algorithm
The developed algorithm, therefore, needs to solve effectively the CAMD tasks based on the needs of industrial and research work.

The utilised Non-dominated Sorting Genetic Algorithm-II (NSGA-II)
The proposed structural multi-objective optimisation algorithm is based on the modification of the fast non-dominated sorting approach of NSGA-II [39,40].
The randomly created initial parent population (with N members) is sorted based on non-domination.The crossover (Eq.( 8)-( 9)) and mutation (Eq.( 10)-( 11)) operators are applied to create the next generation (with N members) [41].The algorithm applies an intermediate crossover, which creates two children from two parents: parent1 and parent2 (child and parent are vectors ( x ) containing the results of the specific problems): )   where the ratio is a scalar between 0 and 1.The applied Gaussian mutation adds a normally distributed random number to each variable: where scale is a scalar, that determines the standard deviation of the random number generated and shrink is a scalar between 0 and 1.As the optimisation progresses, this shrink parameter decreases the mutation range.currGen and maxGen are the numbers of the current and maximal generations, respectively.Since elitism has been introduced, the creation of the first population differs from the creation of a subsequent one.The algorithm is described in terms of the t th generation.
A combined population (R t ) (with 2N members) is created by the summation of the parent population (Pp t ) and the population obtained by the use of crossover and mutation operators (Q t ).The population R t is sorted according to non-domination, and as all previous and current population members are included, elitism is ensured.Now solutions belonging to the first non-dominated front (F 1 ) are chosen for the next generation (Pp t+1 ) (if the size of F 1 is smaller than N).This selection for the next generation is continued until the number of members from F 1 to F i is larger than N.In these cases F i is sorted based on the crowded-comparison operator, and the best solutions are chosen to fill the empty slots of the new population.
The defined input parameters are forwarded to a problem-specific genetic algorithm.This algorithm was built from the combination of a conventional NSGA-II algorithm with goal-oriented genetic operators with a method of correcting solutions called the constraint correction.The aim of modifications is to find feasible solutions more efficiently, which are difficult to find by conventional evolutionary algorithms due to the generation of a large number of impractical molecular structures.

Constraint Correction
The constraint correction algorithm ensures the feasibility of molecules by the stochastic modification of the originally impractical solution to a similar form of feasible molecular structure.The algorithm is presented in four sections for a better overview.
The algorithm starts with the triple bond correction part, as can be seen in Fig. 2. As was shown in Fig. 1 there are two types of groups containing triple bonds,  9 (≡X) and  10 (≡X-), so the correction of this part of the molecule means the rearrangement of these groups.
First the existence of triple bonds in the molecule is checked to decide if triple bond correction is needed or not.Then if the number of triple bonds is not even, the algorithm randomly adds or removes a group from the ≡X or ≡X-type of molecular fragments, resulting in feasible subunits containing random pairs of ≡X and ≡X-groups at this step of the algorithm.This means that, e.g.multiple X≡X pairs can exist separately in the structure, which is inadequate as a single molecular structure with properly connected fragments is required.The next step of the algorithm therefore replaces as many ≡X fragments with ≡X-fragments as needed for the connection of these subunits with the other parts of the molecule (this means that at least half of the groups containing triple bonds must be from the type of ≡X-fragments if the molecule is not X≡X).If the rearrangement of triple bonds is performed, the algorithm progresses to the double bond correction part presented in Fig. 3.
Analogous to the triple bond correction, the algorithm checks if the molecule contains double bonds for correction.In the case of double bonds molecular fragments of  5-8 (=X, =X-, =X< and =X=) are available, therefore the case of double bond correction is more complicated than triple bond correction.Since =X= can be available as well, a longer chain of groups connected with only double bonds can be formed.The existence of this chain is checked by the octet rule for double bonds as can be seen in Eq. ( 12): where v double,i stands for the number of double bond corrections of the i th molecular fragment (e.g.-X-has none; X=, =X-and =X< have one, while =X= has two).If the equation above is fulfilled, then a single, continuously connected structure (connected via only double bonds) can be formed from the molecular fragments containing double bonds.However, the molecular structure is practical only in terms of the double bonds.If the molecular structure contains other bonds as well the feasibility of the whole structure must be maintained, thus the rearrangement of double bonds may be required.First of all, if besides a =X= group no other groups with double bonds are contained in the structure, then the fragment is removed.As for the completion of this structure two other groups are needed, which is a bigger modification of the molecular structure, than the removal of only one =X= fragment.Then analogously to triple bonds, the number of fragments with double bonds is modified to be even by the addition or removal of one double-bond-fragment randomly (here the number of =X= groups is not considered, as the number of these fragments between two =X, =X-or =X< fragments is not limited).If the molecule contains only multiple =X= type double-bond fragments then the structure is completed with two =X, =X-or =X< fragments (to close the chain of fragments connected with only double bonds).At this step of the algorithm, molecular subunits connected with double bonds whose connections are needed to each other and to the other parts of the molecule through single-bond-units exist.The next step of the algorithm therefore replaces as many =X fragments with =X-or =X< fragments as needed for the connection of these subunits to the other parts of the molecule (this means at least half of the number of =X, =Xor =X< fragments must be of =X-or =X< type).The next section of the algorithm is the single bond correction part as presented in Fig. 4.This part of the algorithm rearranges the structure to be feasible with the use of fragments having only single bonds.At the beginning of the algorithm double-and triple-bond-connected subunits and fragments with only single bonds (X-, -X-, -X<, >X<) can exist, which preferably would connect through fragments having only single bonds to form a continuously connected fragment-chain.To check if the connection of the subunits is fulfilled, the octet rule is applied for connections as presented in Eq. ( 3).If the octet rule for connections differs to the negative direction, the algorithm adds an X-fragment or removes an -X< or >X< fragment randomly, while if it differs to the positive direction, the algorithm removes an X-fragment or adds an -X< or >X< fragment randomly (after determining that the given operation can occur).
Sometimes in the case of complicated problems no fragments with only single bonds might be available as a variable.In this case, the pair correction section starts to change the number of double-or triple-bond-connected subunits in the optimal direction as can be seen in Fig. 5.If the octet rule differs to the negative direction, it adds a possible subunit, while if the octet rule differs to the negative direction it removes one.At this part of the algorithm, the randomly chosen modification is neglected, as this case does not occur in common problems.Stochasticity is ensured by the random modifications carried out by the NSGA-II algorithm.
Before stopping the algorithm, it is determined whether =X= type fragments were the only fragments with double bonds in the molecule due to the pair correction algorithm and =X= fragments are deleted if necessary.

Description of the tested algorithms
Four different versions of the proposed algorithm were implemented in MATLAB.

The 'feasibility constraint' algorithm (Algorithm
No. 1): The type of groups, their minimal and maximal numbers, the target properties and the property Based on eight simple principles 422 mutational rules were defined.These eight principles are: • X-fragments can be replaced by X-groups (and vice versa) • -X-fragments can be added or removed without restrictions • -X=X-fragment pairs can be added or removed without restrictions (the possible preservation of =X= type groups is examined) • >X-X-fragment pairs can be added or removed without restrictions • =X-can be replaced by an =X< and an -X fragment (and vice versa) • X-can be replaced by an X=X=X-fragment triple (and vice versa) (the possible preservation of =X= type groups is examined) • X-, >X<, X-fragment triples can be added or removed without restrictions • X-can be replaced by an X≡X-fragment pair (and vice versa) During the definition of these principles, the most important aspect was to be able to modify the numbers of each fragment in the molecule, thus to include them in at least one principle and to ensure that by the application of any of these rules the structure stays feasible.
During the use of the defined rules, the algorithm determines the possibility of application to avoid infeasible structures or negative variables in case of subtraction.

The 'mutational rules with crossover' algorithm (Algorithm No. 4):
In the case of Algorithm No. 4 the mutational rules of Algorithm No. 3 are used, but in this case the crossover of NSGA-II ("Intermediate crossover") together with constraint correction is used as well to ensure the space of feasible solutions.The feasibility criterion as a hard constraint is still present.
The features of each applied algorithm are summarised in Table 1.

The evaluation of the results
As the purpose of the current work is the development of an effective algorithm for the design of molecular structures to obtain the desired target properties, the comparison of the results is an essential task to determine the improvement and efficiency of the proposed formalisation against the results found in the literature.The efficiency inspection contains graphical and numerical methods as well.During the numerical evaluation, three efficiency indexes were determined: • Feasible solutions: the number of non-dominated solutions of the last generation for each algorithm • Last changed population (N gen,change ): the last generation in which the population of the Pareto-front is changed is also determined • Domination percentage (D p ): the better performance of an algorithm is not just indicated from the larger number of solutions obtained by it, their domination compared to each other is important as well.For easier understanding, imagine the following didactic example.Algorithm a. has ten solutions, while the results of Algorithm b. give just nine solutions, from the ten members of the combined non-dominated fronts only 4 come from the results of Algorithm a. (D p : 40%), and 6 come from Algorithm b. (D p : 60%).This efficiency index is determined for both algorithms and compared to each other.
As the proposed genetic algorithm is a stochastic approach, ten runs of each algorithm were applied to evaluate the average efficiency, and the average of these indexes is presented in the Results section.

Results
To avoid the problem of random number generation of the stochastic method, the results of ten different runs were evaluated.The genetic algorithm worked for 250 generations with 100 population members in each.The effectivity of the different algorithms is compared through the following design tasks for the identification of different chemicals having the desired physical and chemical properties, estimated by the multidimensional property model.The results are downloadable from the website of the authors: www.abonyilab.com

Example I
The problem given in Friedler et al. [9] has been solved using the proposed genetic approach.The input parameters for the design task were as follows: Available groups: -CH 3 , -CH 2 -, >CH-, >C<, -F and -Cl Target properties: Max (T b ), Min (T m ) Property constraints: 330 K < T b < 340 K 130 K < T m < 140 K The proposed algorithms found only one feasible structure having the given properties, as can be seen in Table 2     (the domination percentage is not given as no real Paretofront was obtained).This result is similar to the results of Friedler et al. found only one feasible partition [9], this is the one found by us as well.
A possible structure with the given molecular fragments can be seen in Fig. 6.

Example II
This example is a modification of Example I as described by Friedler et al. [9].The described problem is a good example to test the algorithms in the presence of single, double and triple bonds simultaneously.3 Algorithm No. 4 found the most solutions in average, thus the use of mutational rules and constraint correction to keep the optimisation in the space of feasible solutions and the use of crossover to keep the genetic diversity seems to be an efficient approach for the solution of the design task.
As can be seen in Table 4 Algorithm No. 4 did not just find the most solutions, but these solutions mainly dominate the solutions of other algorithms as well.

Refrigerant design
During the operation of a cooling circuit, the circulated liquid must meet several requirements.In order to formalise the problem, imagine a simple cooling circuit with a condenser, a reciprocating compressor, an evaporator and an expansion valve, similarly as presented in Fig. 7.
The problem with the design of appropriate refrigerants described by Joback in [42] has been reexamined under the following conditions: • P vp (T = -1.1 °C) > 1.4 bar The lowest pressure in the cycle should be greater than atmospheric pressure to reduce the possibility of air and moisture leaking into the system.The maximisation of P vp (T = -1.1 °C) was defined as a target: • P vp (T = 43.3°C) < 14 bar High vapour pressure increases the size, weight and cost of the equipment.The minimisation of P vp (T = 43.3°C) was defined as a target: • Δ H v (T = −1.1 °C) > 18.4 kJ/(g•mol) A larger enthalpy of vaporisation occurs in the smaller amount of used refrigerant.The maximisation of Δ H v (T = −1.1 °C) was defined as a target: • C p, L (T = 21.1 °C) < 32.2 cal/(g•mol•K) Low liquid heat capacity prevents (or reduces the likelihood of) the refrigerant from flashing upon passage through the expansion valve.The heat capacity is evaluated at a constant temperature.The minimisation of C p, L (T = 21.1 °C) was defined as a target.
Besides the described targets, all the conditions were set as constraints.The results of refrigerant design can be seen in Table 5 and Table 6.As the value of N gen,change (last changed population) is quite high (250 is the maximum number of   generations), the algorithms seem to evolve effectively towards the Pareto front.Most of the solutions were found by Algorithms No. 3 and 4, the number of members in the populations was almost 100.These solutions mainly dominate the solutions of Algorithms No. 1 and 2, as is represented by the domination percentage.This shows that the mutational rules alone or together with intermediate crossover (completed with constraint correction), and the restriction of genetic algorithms to the space of feasible molecules is an effective approach for solving the design problem.
The obtained results were compared to the results of Joback (the results were filtered according to the corresponding design conditions).According to Fig. 8 the presented method found significantly more structures (Joback found only 45 non-ring molecular structures) and the obtained results are better in terms of property values as well, however, there are negative values of liquid heat capacities.This result highlights an important remark on material design: all approaches are as good as they are allowed to be according to the used property estimation methods.The false results of negative liquid heat capacity values draw attention to the importance of the target and constraint definition and to the drawback that any proposed approach is highly determined by the uncertainty of the used property estimation methods.
To obtain physically interpretable results, a lower bound of 0 cal/molK was defined for liquid heat capacity, and the optimisation was repeated.The results can be seen in Table 7 and Table 8.Even though fewer structures have been found, the effectivity of the algorithms is still noticeable, as Algorithms No 2-4 found significantly more results than Joback.
Fig. 9 shows the properties of the obtained molecular structures with lower liquid heat capacity constraints

Summary
The design of new molecules with specified properties is of increasing importance in the modern chemical industry and research.To effectively handle structural constraints we introduce goal oriented genetic operators to the multi-objective Non-dominated Sorting Genetic Algorithm-II (NSGA-II).The constraints are defined based on the hierarchical categorisation of the molecular fragments.The presented examples have demonstrated that the proposed genetic approach is capable of solving the formulated problems and moves effectively towards the Pareto-optimal front.With the multiple candidate molecules of the resultant Pareto front, several further targets can taken into consideration: financial aspects, toxicity, availability, taste, aroma, colour, etc., thus the designer can consider personal insights during the design task.
The presented algorithms have a wider application range compared to the previously published material in [32].The defined methods worked with varying degrees of effectivity in the test problems.From the results of Algorithm No. 1 it is obvious that the constraint correction of the resultant structures is far less effective compared to the goal-oriented modification of genetic operators as can be seen in Algorithms No. 2-4.The introduction of constraint correction in the field of the discrete optimisation problems of molecular design is a novel approach, as such methods are mainly applied in the case of continuous optimisation problems.The goal-oriented mutational operator tested in Algorithm No. 3 confines the variable vectors to the space of feasible solutions, thus exhibits an increased efficiency as illustrated by the examples.Algorithm No. 4 exhibits the combined potency of confining the variables to the space of feasibility via the application of goal-oriented mutational rules and the application of constraint correction, but to preserve genetic diversity intermediate crossover is applied as well.The proposed approach is highly efficient to solve the molecular design problems as proven by the presented problems and ready for the design of hypothetical components in flowsheeting simulators.A possible future improvement of the algorithm can be the design of spatial structure of the molecule from the found fragments.
The method was implemented in MATLAB.The results are reproducible since all the functions and numerical experiments are downloadable from the website of the authors: www.abonyilab.com

Annexes Prediction of vapour pressure
The prediction of vapour pressure was carried out by the application of the Riedel-Plank-Miller equation-oriented technique.According to Eq. ( 13) the vapour pressure is a function of T b , T br and P c values, The applied T b , T br and P c values are estimated with the use of the Joback method.The vapour pressure can be calculated using Eq. ( 14)- (17).

Prediction of enthalpy of vaporisation
The prediction of enthalpy of vaporisation was carried out with the use of the Watson relation as presented in Eq. (18).The ΔH vb value is obtained from the Joback estimation method.(18)

Prediction of liquid heat capacity
The prediction of liquid heat capacity is carried out with the use of the Rowlinson equation-oriented technique as a function of an acentric factor, ideal gas heat capacity and critical temperature, as seen in Eq. ( 19), The acentric factor is estimated by the Lee-Kesler equation-oriented technique, as presented in Eq. (20).List of Abbreviations C R (K, N ) selection of N groups from a set of K groups x 1 , x 2 , … , x n the number of group type #1, #2, …, #n, respectively x n dimensional vector of positive integers (x i ) representing the variable vector P k a specified property f k property estimation function P m dimensional vector of properties  1−10 number of elements in the set of fragments (see Fig. 1) " X-" , etc. set of fragments (see Fig. 1) P q  target property value of target q U q utility function of target q U nq dimensional utility vector

Fig. 2
Fig. 2 The triple bond correction (Part I of constraint correction)

Fig. 6 A
Fig. 6 A possible structure of the found fragments

Fig. 8
Fig.8 Comparison between the properties of refrigerant structures found by the proposed methods and by Joback[42]

Fig. 9
Fig.9 Comparison of the results found by the proposed methods (with lower liquid heat capacity constraint) and by Joback[42] T br are measured in K and P c is measured in bars.
T c values are estimated by the Joback method.
T b and T br are measured in K and P c is measured in bars.Their values are obtained from the Joback estimation method.Therefore, the Rowlinson equation-oriented technique uses Eq. (21).

Table 1
The features of the applied algorithms

Table 3
The results of Example II

Table 2
The results of Example I

Table 4
The domination percentage compared between pairs of

Table 5
The results of refrigerant design

Table 7
The results of refrigerant design with lower liquid heat capacity constraint

Table 8
The domination percentage compared between pairs of