1 Introduction

In recent decades, the field of evolutionary computation has seen a surge of newly proposed algorithms, frequently intended to operate on very specific problem domains. While this reflects on the one hand the efficacy of population-based and evolutionary approaches for a wide range of applications, it also reflects deep-rooted issues within the current state of the art, in particular: 1) the lack of a prescriptive theory on how to construct efficient algorithms for a given problem and 2) a lack of understanding of what constitutes and characterizes optimisation problems, and their similarity, in a more generalized way. While theorists cannot give a definite answer to either question at the moment, one may still legitimately ask whether it is possible to approach some of these problems from a pragmatic line of attack. For this reason, two popular trends have emerged within the optimisation community: 1) research on meta-learning frameworks [13, 18, 23] and 2) research on transfer learning approaches [4, 9, 12, 14, 17]. Both try to boost the efficiency of optimisation algorithms by using prior knowledge gained from solving problem instances.

In our work, we progress at the intersection of both lines of research by building a model of a search strategy from individual runs, which may subsequently be transferred to similar problem instances. To this end, we first give in Sect. 2 a brief overview of these two existing lines of research and of the state of the art. Section 3 explains the extensions we introduce to consolidate a search strategy and demonstrates their functionality on an illustrative benchmark function. In Sect. 4.1, we widen the range of considered benchmark problems to a selected variety of multimodal and valley-shaped problems. Subsequently, in Sect. 4.2 we consider the scenario of transferring search strategies across problem instances generated by applying translations, rotations and various non-linear transformations to the benchmark functions. We conclude our study with a summary in Sect. 5 and give an outlook on future work.

2 Knowledge from Problem Solving Exercises

In principle, two approaches have been investigated within the optimisation community in recent decades. The first is related to the construction of meta-learning frameworks for algorithm selection and configuration. The second relates to instance-based transfer learning through candidate solutions. Both fields, while having gained strong traction in recent years, can trace their origins back to much earlier roots: seminal work on algorithm selection was done by Rice et al. [21] in the 1970s, and research on transfer learning emerged from the discourse on lifelong machine learning systems in the 1990s [19]. However, their application to the domain of optimisation has only been considered since the 2000s [17, 23].

Fig. 1. Diagram of the archetypical pipeline for transfer learning, roughly adapted from Pan et al. [19]. Similar setups are frequently encountered in the literature on population-based optimisation (e.g. [4, 5, 14]). From previously solved source problem classes, knowledge is extracted by the algorithm such that it can subsequently improve its performance on a new target problem class.

Meta-learning frameworks attempt to harness high-level knowledge that can subsequently be used to solve related tasks more efficiently. In the classical algorithm selection and algorithm configuration problems, this equates to predicting the best-performing algorithm or configuration for a given problem [13, 18]. However, a key problem in optimisation lies in the extraction and computation of the task-specific features in the first place. This is an especially outstanding problem in continuous optimisation, where, unlike in the combinatorial domain, problem features cannot simply be derived from the problem state or definition. Features thus have to be explicitly computed in a cheap and, ideally, informative manner.

Transfer learning approaches, on the other hand, may be seen as operating under more relaxed conditions. Essentially, transfer learning assumes that beneficial knowledge which helped to solve one problem instance can be transferred, either directly or by means of a transformation, to a new problem instance. Notably, however, this introduces further uncertainties. The bulk of the transfer learning literature in optimisation draws inspiration from instance-based transfer [19] by transferring high-performing candidate solutions between tasks (e.g. [5, 12, 14, 22]). Retrieval of the candidate solutions occurs either directly from the previously solved tasks [5, 14] or through probabilistic sampling from a continuously built repository (e.g. [4, 17]). To determine the probabilistic weights, task similarity measures are often used [17]. In many scenarios, however, solution similarity is simply used as a proxy for task similarity [4, 17]. In general, the lack of satisfying task similarity measures, together with a proneness to uncontrolled ‘negative’ knowledge transfer which degrades algorithm performance [5], are known problems of these approaches.

Interestingly, aside from the works mentioned above, barely any of the recent literature tries to learn across problem instances explicitly by means of internal sampling models. Notably, many popular algorithms rely upon operators that draw random variables from symmetrical distributions and thus have isotropy assumptions built in by default. However, this assumption breaks down for any optimisation problem which does not resemble a flat plane. Quite intuitively, the interplay between algorithm and optimisation problem should enforce characteristic search strategies and behaviors. Modern model-based algorithms [10, 15] acknowledge this by adapting a distribution online during the optimisation run. However, they do not attempt to memorize these distributions in a rougher and more abstract way, such that they can be transferred across problem instances. In many ways, this perspective might also be the only meaningful notion by which to realize transfer learning in continuous single-objective optimisation. In the following, we build upon our previous work [7, 8] and try to tackle the issue in a study using a variant of the popular \((\mu , \,\lambda )\)-Evolution Strategy for continuous optimisation. We explicitly incorporate strategy parameters through a windowing approach and harness systematics from the literature to build benchmarking scenarios.

3 Extending the Evolution Strategy

In the following, we consider continuous single-objective optimisation problems of the form \(\min _{\mathbf {x} \in \chi } f(\mathbf {x})\) with \(\chi \subseteq \mathbb {R}^d\), where \(\chi \) denotes the search space and d its associated dimensionality. As a base we use a variant of the Evolution Strategy with \((\mu , \lambda )\) selection mechanism [1]. We explicitly leave out any recombination operators to reduce the framework to its essentials, that is, sampling mutations from a multivariate distribution and performing selection in an elitist manner. Note that, from an evolutionary perspective, mutation is the principal source of variation [16]. In many ways, this basic outline may resemble continuous variants of Evolutionary Programming. However, the elitist selection mechanism in Evolution Strategies has been implicated in contributing to performance improvements [2].

In the Evolution Strategy, population members \(\mathbf{s} (j)\) are represented by tuples \(\mathbf{s} (j) = [\mathbf{x} (j) , \varvec{\sigma }(j) ]\), where \(\mathbf{x} (j) = (x_1(j),\cdots ,x_d(j))\) is the population member’s representation in the solution space and \(\varvec{\sigma }(j)= (\sigma _1(j),\cdots ,\sigma _d(j))\) are its strategy parameters. The latter can be considered to be a key feature of Evolution Strategy implementations. Strategy parameters essentially control the shape of the normal distribution from which mutations

$$\begin{aligned} \varDelta \mathbf {x}(j) \sim \mathcal {N}(\textit{\textbf{0}}, \text {diag}[\varvec{\sigma }(j)]) \end{aligned}$$
(1)

for the individuals j are drawn, which shift the individuals to \(\mathbf{x} '(j) = \mathbf{x} (j) + \varDelta \mathbf{x} (j)\) in the solution space. Likewise, variation operators can be defined such that they also vary and recombine the strategy parameters of population members. However, we neglect this extension in our study.
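The mutation step of Eq. (1) can be sketched in a few lines of Python. This is only an illustration and not the implementation used in the study: the function name `mutate` is hypothetical, and we assume that each \(\sigma _i\) acts as the standard deviation of component i (if \(\sigma _i\) is read as a variance, its square root would be passed instead).

```python
import random

def mutate(x, sigma, rng=random):
    """Return (x', dx) with x' = x + dx and dx_i ~ N(0, sigma_i), component-wise.

    Assumption: sigma_i is used as the standard deviation of component i.
    """
    dx = [rng.gauss(0.0, s) for s in sigma]      # independent components (diagonal covariance)
    x_new = [xi + di for xi, di in zip(x, dx)]   # shift the individual in the solution space
    return x_new, dx

rng = random.Random(0)                           # fixed seed for reproducibility
x_new, dx = mutate([1.0, -2.0], [0.5, 2.0], rng)
```

Returning the displacement `dx` alongside the mutated point makes it straightforward to record mutations outside of the algorithm.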

3.1 Quality-Based Filtering of Mutations

In the following, we will further filter performed mutations according to their quality. Thus, we will distinguish between beneficial mutations as defined by

$$\begin{aligned} f(\mathbf{x} (j)_{before}^{i}) - f(\mathbf{x} (j)_{after}^{i}) \ge 0 \end{aligned}$$
(2)

and detrimental mutations defined by

$$\begin{aligned} f(\mathbf{x} (j)_{before}^{i}) - f(\mathbf{x} (j)_{after}^{i}) < 0. \end{aligned}$$
(3)
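A minimal sketch of this filtering step, assuming a minimisation problem and that every mutation has been logged as a hypothetical record `(f_before, f_after, dx)`, could look as follows. The record layout and the function name are illustrative assumptions.

```python
def split_mutations(records):
    """Split logged mutations into beneficial and detrimental ones.

    Each record is (f_before, f_after, dx); the criterion follows
    Eqs. (2) and (3) for a minimisation problem.
    """
    beneficial, detrimental = [], []
    for f_before, f_after, dx in records:
        if f_before - f_after >= 0:      # Eq. (2): fitness did not get worse
            beneficial.append(dx)
        else:                            # Eq. (3): fitness got worse
            detrimental.append(dx)
    return beneficial, detrimental

records = [
    (4.0, 1.0, [-1.0, 0.0]),   # improvement
    (1.0, 2.0, [0.5, 0.5]),    # deterioration
    (1.0, 1.0, [0.0, 0.0]),    # neutral, counted as beneficial by Eq. (2)
]
beneficial, detrimental = split_mutations(records)
```

Note that neutral mutations fall into the beneficial set, since Eq. (2) uses a non-strict inequality.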

The idea is that, once we have stored statistics about mutations outside of the algorithm, we can use them to design improved search strategies, specifically by constructing empirical distributions which serve as the basis for model-based mutation operators. These can be seen as reflecting globally averaged characteristics of the fitness landscape. In principle, one would intuitively be interested in enforcing beneficial mutations and suppressing detrimental mutations. However, distributions of detrimental mutations have been found to be strongly normally distributed [8]. From the perspective of algorithm design, it is also questionable whether suppressing mutations would come at the expense of convergence properties, as every point in the search space should remain reachable with a small but finite probability. Thus, we focus in the following only on biasing the algorithm through distributions of beneficial mutations (Fig. 2).

Fig. 2. Left panel: Rastrigin’s benchmark function. Right panel: Search distributions for different pairs of strategy parameters \(\varvec{\sigma }=(\sigma _x,\sigma _y)\) derived from a 100-component mixture model of the distribution of beneficial mutations from 1000 runs under reweighing according to Eqs. (4) and (5).

3.2 Constructing Operators from Empirical Distributions

Choosing a Density Estimator. While mutations in the Evolution Strategy are by default sampled from a multivariate normal distribution as given in Eq. (1), empirical distributions require an explicit modeling technique. In principle, many techniques are available for this purpose. In the following, we use the Gaussian mixture model, as it is a well-studied model which can act as a universal density approximator. Mixture models reduce the input data to a small set of descriptive clusters which are parametrized by multivariate normal distributions, such that the full data distribution can be expressed as \(p(\mathbf {x}) = \sum ^{K}_{k=1} \pi _k \cdot \mathcal {N}(\mathbf {x}|\varvec{\mu }_k,\mathbf {\varSigma }_k)\). The mixture coefficients \(\pi _k\) are normalized such that \(\sum _{k=1}^K \pi _k =1\) and are determined, together with the means \(\varvec{\mu }_k\) and covariances \(\mathbf {\varSigma }_k\), by maximizing the log-likelihood through the expectation-maximization algorithm [3, 20].
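Once such a mixture has been fitted, drawing a mutation from it amounts to picking a component k with probability \(\pi _k\) and sampling from the corresponding normal distribution. The following sketch illustrates this for \(d=2\) in plain Python. It assumes the mixture parameters are already available from an EM fit; the function name `sample_gmm_2d` and the example parameters are hypothetical.

```python
import math
import random

def sample_gmm_2d(weights, means, covs, rng):
    """Draw one 2-D sample from sum_k pi_k N(mu_k, Sigma_k)."""
    # Pick a mixture component k with probability pi_k.
    k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    (a, b), (_, c) = covs[k]             # symmetric covariance [[a, b], [b, c]]
    # Cholesky factor L of Sigma_k, so that mu_k + L z ~ N(mu_k, Sigma_k).
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(c - l21 * l21)
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return [means[k][0] + l11 * z1,
            means[k][1] + l21 * z1 + l22 * z2]

# Hypothetical two-component mixture of mutation vectors.
weights = [0.7, 0.3]
means = [[0.0, 0.0], [1.0, -1.0]]
covs = [[[1.0, 0.2], [0.2, 0.5]], [[0.1, 0.0], [0.0, 0.1]]]
rng = random.Random(42)
samples = [sample_gmm_2d(weights, means, covs, rng) for _ in range(5)]
```

The sampled vector would then take the place of \(\varDelta \mathbf {x}(j)\) in the mutation step.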

Incorporating Strategy Parameters. An outstanding problem still lies in the fact that the Evolution Strategy possesses strategy parameters \(\varvec{\sigma }\) which control the shape of the distribution from which mutations are sampled. Changing the shape of an empirical distribution used as a basis for improved sampling should not break the spatial information it contains. Therefore, we simply window the empirical distribution with the multivariate normal distribution spanned by the strategy parameters as defined by Eq. (1). Effectively, this results in a reweighing of the mixture model where we replace the original mixture coefficients \(\pi _k\) with

$$\begin{aligned} r_k = \frac{\pi _k c_k}{\sum ^{K}_{i=1} \pi _i c_i}, \end{aligned}$$
(4)

where the coefficients \(c_k\) per mixture component quantify the average value of the normal distribution spanned by the strategy parameters over the k-th mixture component. This can be analytically calculated such that

$$\begin{aligned} c_k&:= \int _{\mathbb {R}^d}\mathcal {N}(\mathbf {x}|\varvec{\mu }_k,\varSigma _k) \, \mathcal {N}(\mathbf {x}|\mathbf {0},\text {diag}(\varvec{\sigma })) \, \text {d}^d \mathbf {x}\\&= \int _{\mathbb {R}^d} \frac{\exp \left[ -\frac{1}{2}\,(\mathbf {x}-\varvec{\mu }_k)^T \varSigma _k^{-1}\,(\mathbf {x}-\varvec{\mu }_k)\right] }{\sqrt{(2\pi )^d |\varSigma _k|}} \times \frac{\exp \left[ -\frac{1}{2}\,\mathbf {x}^T \varSigma _{\varvec{\sigma }}^{-1}\,\mathbf {x}\right] }{\sqrt{(2\pi )^d |\varSigma _{\varvec{\sigma }}|}}\, \text {d}^d \mathbf {x}\\&= \frac{\exp \left( -\frac{1}{2}\,\varvec{\mu }_k^T \varSigma _k^{-1}\varvec{\mu }_k + \frac{1}{2}\,\varvec{\mu }_k^T \left[ \varSigma _k^{-1}\,\varSigma _c\,\varSigma _{k}^{-1} \right] \varvec{\mu }_k\right) }{\sqrt{(2\pi )^d |\varSigma _k||\varSigma _c^{-1}||\varSigma _{\varvec{\sigma }}|}} \times \int _{\mathbb {R}^d} \mathcal {N}(\mathbf {x}|\varvec{\mu }_c,\varSigma _c)\, \text {d}^d\mathbf {x}\\&= \frac{\exp \left( -\frac{1}{2}\,\varvec{\mu }_k^T \left[ \varSigma _k^{-1}\,(\varSigma _k^{-1}+\varSigma _{\varvec{\sigma }}^{-1})^{-1}\,\varSigma _{\varvec{\sigma }}^{-1} \right] \varvec{\mu }_k\right) }{\sqrt{(2\pi )^d |\varSigma _k||\varSigma _k^{-1} +\varSigma _{\varvec{\sigma }}^{-1}| |\varSigma _{\varvec{\sigma }}|}}, \end{aligned}$$
(5)

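where we further introduced \(\varSigma _{\varvec{\sigma }}:= \text {diag}(\varvec{\sigma })\) and \(\varSigma _{c}:= (\varSigma _k^{-1} +\varSigma _{\varvec{\sigma }}^{-1})^{-1}\), and where \(\varvec{\mu }_c := \varSigma _c\,\varSigma _k^{-1}\varvec{\mu }_k\) follows from completing the square in the exponent.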
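To make the reweighing of Eqs. (4) and (5) concrete, the following sketch computes the coefficients \(r_k\) for \(d=2\) in plain Python. It relies on the observation that the last line of Eq. (5) is algebraically the density of \(\mathcal {N}(\mathbf {0}, \varSigma _k + \varSigma _{\varvec{\sigma }})\) evaluated at \(\varvec{\mu }_k\), which avoids explicit matrix inversions of the individual terms. The function names and example values are hypothetical, and \(\text {diag}(\varvec{\sigma })\) is used directly as a covariance matrix, as in the definition of \(\varSigma _{\varvec{\sigma }}\).

```python
import math

def gauss2d_at(mu, cov):
    """Density of N(0, cov) evaluated at mu, for a symmetric 2x2 covariance."""
    (a, b), (_, c) = cov
    det = a * c - b * b
    # Quadratic form mu^T cov^{-1} mu via the closed-form 2x2 inverse.
    quad = (c * mu[0] ** 2 - 2.0 * b * mu[0] * mu[1] + a * mu[1] ** 2) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

def reweigh(weights, means, covs, sigma):
    """Return r_k = pi_k c_k / sum_i pi_i c_i, with c_k as in Eq. (5)."""
    c = []
    for mu, cov in zip(means, covs):
        # Sigma_k + Sigma_sigma, with Sigma_sigma = diag(sigma) (assumption
        # taken from the text: sigma enters as the diagonal of a covariance).
        summed = [[cov[0][0] + sigma[0], cov[0][1]],
                  [cov[1][0], cov[1][1] + sigma[1]]]
        c.append(gauss2d_at(mu, summed))
    total = sum(p * ck for p, ck in zip(weights, c))
    return [p * ck / total for p, ck in zip(weights, c)]

# Two components: one at the origin, one displaced along the first axis.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [3.0, 0.0]]
covs = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
r_small = reweigh(weights, means, covs, [0.1, 0.1])      # narrow window
r_large = reweigh(weights, means, covs, [100.0, 100.0])  # wide window
```

With a narrow window, almost all weight moves to the component near the origin, whereas a wide window leaves the original coefficients nearly unchanged. This mirrors the role strategy parameters play for the default normal distribution.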

Table 1. Benchmark functions used in this study, grouped from top to bottom according to landscape structure. 1st–3rd row: Unimodal and valley shaped problems. 4th–6th row: Multimodal problems with single global optimum and strong regularity. 7th–9th row: Difficult multimodal problems with single global optimum and high irregularity.

4 Experimental Study

The following study is based upon the DEAP library for Evolutionary Computation [6] with the extensions elaborated in Sect. 3. We first investigate in Sect. 4.1 whether distributions of beneficial mutations can be harnessed at all to realize performance improvements on a selected range of different continuous optimisation problems. Subsequently, in Sect. 4.2 we investigate different transfer scenarios between problem instances. In particular, we build these scenarios by harnessing existing systematics from the literature.

4.1 On the Efficacy of Distributions of Beneficial Mutations

In the following, we conduct experiments on the 9 optimisation problems listed in Table 1. We group these into unimodal and valley-shaped problems (1st–3rd row), multimodal problems with a single global optimum and high regularity (4th–6th row) and multimodal problems with a single global optimum and high irregularity (7th–9th row). All experiments are conducted with a population size of \(\mu =10\), and at each generation we generate \(\lambda =30\) offspring by randomly selecting individuals and either cloning or mutating them with a \(30\%\) chance. In all experiments, the population is initialized randomly over the entire search space. For the difficult multimodal problems, we additionally apply a penalization by rejecting mutations that cross the search space boundaries. This is necessary, as otherwise lower optima could be reached in the outer areas of these problems. Strategy parameters are initialized such that \(\sigma \in [0.1,4.0]\) for the problems in rows 1–6 of Table 1. For the difficult multimodal functions we re-adjust the upper bound: we use \(\sigma _{\text {max}}\,=\,400\) for Schwefel’s function, \(\sigma _{\text {max}}\,=\,480\) for Eggholder and \(\sigma _{\text {max}}\,=\,150\) for Rana’s function. We elaborate on the necessity of this re-adjustment in the succeeding paragraph. Experiments are conducted over 1000 generations and we accumulate data per experiment from 100 runs. The problem dimension is kept at \(d\,=\,2\) in all experiments, as this still allows the retrieved distributions to be interpreted and avoids the data sparsity problems that arise with more degrees of freedom. The mixture model is constructed with a total of \(K\,=\,50\) components.

Fig. 3. Column 1–3: Fitness curves (light blue) for the unimodal Sphere, Bohachevsky’s and Rosenbrock’s function from 100 runs, as well as median (dark blue) and mean (dark grey) curves. Top row: With default sampling. Bottom row: With improved sampling using quality-based mutations. (Color figure online)

Fig. 4. Column 1–3: Fitness curves (light blue) for the multimodal Rastrigin’s, Ackley’s and Griewank’s function from 100 runs, as well as median (dark blue) and mean (dark grey) curves. Top row: With default sampling. Middle row: With improved sampling considering strategy parameters. Bottom row: With improved sampling using quality-based mutations. (Color figure online)

The resulting minimum fitness curves per generation of the optimisation runs are plotted per problem group in Figs. 3 and 4. The top rows show the runs using default mutation distributions, and the lower rows show runs which use distributions of beneficial mutations with and without considering strategy parameters. Median (dark blue), mean (grey) and individual runs (light blue) are plotted. Notably, across all considered problems the distribution of beneficial mutations significantly improves the search behavior. In particular, it reduces late convergence by acting in a regularizing fashion. However, the inclusion of strategy parameters is only helpful when some regularity along the parameter axis can be harnessed. Otherwise, its effect on the performance is detrimental. The approach can even be shown to work on the difficult multimodal functions of rows 7–9 in Table 1. However, we openly admit that further precautions have to be taken for these experiments to work. In particular, for all three functions we had to re-adjust the upper bound of the strategy parameter to the previously mentioned values in order to achieve good convergence behavior in the runs with default sampling. Without these precautions, we were not able to achieve any improvements using the distribution of beneficial mutations. In fact, for lower values of the strategy parameters we even found that the distributions of beneficial mutations were detrimental to the optimisation and encouraged premature convergence into local optima. We further list performance values of our experiments, as well as results from a Wilcoxon rank sum test under normal approximation, in Table 2. The results indicate that, at a significance level of \(\alpha =0.05\), the null hypothesis can be rejected in all experiments.
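For illustration, a two-tailed Wilcoxon rank sum test under normal approximation can be sketched in plain Python as below. This is not necessarily the implementation used to produce Table 2: the function name is hypothetical, ties receive average ranks, and neither a tie correction of the variance nor a continuity correction is applied.

```python
import math

def rank_sum_test(a, b):
    """Two-tailed Wilcoxon rank sum test with normal approximation.

    Returns the normalized rank statistic z of sample a and the p-value.
    """
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    n1, n2 = len(a), len(b)
    rank_sum_a = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1                                   # [i, j) is a group of tied values
        avg_rank = (i + 1 + j) / 2.0                 # mean of the 1-based ranks i+1 .. j
        rank_sum_a += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    mean = n1 * (n1 + n2 + 1) / 2.0                  # expected rank sum under H0
    std = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)  # its standard deviation under H0
    z = (rank_sum_a - mean) / std
    p = math.erfc(abs(z) / math.sqrt(2.0))           # two-tailed p-value
    return z, p

# Hypothetical final fitness values of two samplers (lower is better).
z, p = rank_sum_test([1.0, 2.0, 3.0, 4.0, 5.0], [6.0, 7.0, 8.0, 9.0, 10.0])
```

A negative z indicates that the first sample tends to have the lower, and hence for minimisation the better, fitness values.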

Table 2. Medians \(\tilde{f}_{\text {min}}\), means \(\overline{f}_{\text {min}}\) and standard deviations \(\sigma _{\text {min}}\) of the minimum fitness after 1000 generations aggregated from 100 runs for default sampling using a normal distribution \((\mathcal {N})\) and improved sampling using a mixture model of quality-filtered mutations \((\mathcal {M})\). Further, normalized ranks z and p-values for a two-tailed Wilcoxon rank sum test have been calculated. For a significance level of \(\alpha =0.05\) the null hypothesis can be considered to be rejected in all experiments.

4.2 Cross-Instance Transfer Scenarios

In this section, we consider cross-instance transfer learning scenarios, meaning that we try to transfer a mutation operator learned on a source optimisation problem to a target problem (cf. Fig. 1) in the hope of realizing performance improvements. To generate variations of the source problem instances, we apply a system of transformations proposed by Hansen et al. [11].

Transformations of the Fitness Landscape. The following base transformations are designed to explicitly break the well-behavedness of our optimisation problems by acting upon the decision variables \(\mathbf {x}\). Ill-conditioning introduces fast-running components by means of a linear rescaling

$$\begin{aligned} T_{ill{\text {-}}c.}: \mathbb {R}^d \rightarrow \mathbb {R}^d,\,\,\, x_i \longmapsto x_i\,\, \alpha ^{\frac{1}{2}\frac{i-1}{d-1}}, \end{aligned}$$
(6)

where we choose \(\alpha =10\) in our experiments. The asymmetrical transformation breaks the symmetry of components \(x_i\) under sign transformations with

$$\begin{aligned} T_{asy}: \mathbb {R}^d&\rightarrow \mathbb {R}^d, x_i \longmapsto {\left\{ \begin{array}{ll} x_i^{1+\beta \frac{i-1}{d-1}\sqrt{x_i}} &{} \text {if}\,\, x_i > 0 \\ x_i &{} \text {otherwise} \end{array}\right. } , \end{aligned}$$
(7)

such that in the positive quadrant the components scale up exponentially. The oscillatory transformation introduces sinusoidal variability of the components by

$$\begin{aligned} T_{osc}: \mathbb {R}^d \rightarrow \mathbb {R}^d, x_i \longmapsto \text {sign}(x_i)\,\,\text {exp}(\hat{x}_i + 0.049 (\text {sin}(c_1 \hat{x}_i) + \text {sin}(c_2 \hat{x}_i))), \end{aligned}$$
(8)

with

$$\begin{aligned} \hat{x}_i = {\left\{ \begin{array}{ll} \text {log}(|x_i|) &{} \text {if}\,\,\, x_i\ne 0\\ 0 &{} \text {otherwise} \end{array}\right. } , \quad c_1 = {\left\{ \begin{array}{ll} 10 &{} \text {if}\,\,\, x_i > 0\\ 5.5 &{} \text {otherwise} \end{array}\right. } , \quad c_2 = {\left\{ \begin{array}{ll} 7.9 &{} \text {if}\,\,\, x_i > 0\\ 3.1 &{} \text {otherwise} \end{array}\right. } . \end{aligned}$$

Further, we also use counter-clockwise rotations \(T_{rot}(\theta )\) by angle \(\theta \) and translations \(T_{trans}\) of the global optimum.
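The transformations above can be sketched in plain Python as follows. This is an illustration rather than the code used in the experiments. The function names are hypothetical, \(d \ge 2\) is assumed, the value of \(\beta \) has to be supplied by the caller, the constants \(c_1, c_2\) are switched on the sign of \(x_i\) following our reading of Hansen et al. [11], and the sign convention of the translation as well as the order of composition in the example are assumptions.

```python
import math

def t_ill(x, alpha=10.0):
    """Ill-conditioning, Eq. (6); index i runs from 0, i.e. (i-1) in the text."""
    d = len(x)
    return [xi * alpha ** (0.5 * i / (d - 1)) for i, xi in enumerate(x)]

def t_asy(x, beta):
    """Asymmetry, Eq. (7): only positive components are changed."""
    d = len(x)
    return [xi ** (1.0 + beta * i / (d - 1) * math.sqrt(xi)) if xi > 0 else xi
            for i, xi in enumerate(x)]

def t_osc(x):
    """Oscillation, Eq. (8)."""
    out = []
    for xi in x:
        if xi == 0:
            out.append(0.0)                      # sign(0) = 0
            continue
        x_hat = math.log(abs(xi))
        c1, c2 = (10.0, 7.9) if xi > 0 else (5.5, 3.1)
        wiggle = 0.049 * (math.sin(c1 * x_hat) + math.sin(c2 * x_hat))
        out.append(math.copysign(1.0, xi) * math.exp(x_hat + wiggle))
    return out

def t_rot(x, theta_deg):
    """Counter-clockwise rotation of a 2-D point by theta (in degrees)."""
    t = math.radians(theta_deg)
    return [math.cos(t) * x[0] - math.sin(t) * x[1],
            math.sin(t) * x[0] + math.cos(t) * x[1]]

def t_trans(x, shift):
    """Translation; f(t_trans(x, shift)) has its optimum moved by +shift."""
    return [xi - si for xi, si in zip(x, shift)]

def sphere(x):
    return sum(v * v for v in x)

def sphere_ill_rot(x):
    """Example target: Sphere with a 45 degree rotation and ill-conditioning."""
    return sphere(t_ill(t_rot(x, 45.0)))
```

A transformed problem instance is obtained by evaluating the original benchmark function on the transformed decision variables, as in `sphere_ill_rot`.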

Fig. 5. From the top left corner clockwise: Altered variants of Rastrigin’s (\(R_1\)), Sphere (\(S_1\)), Griewank’s (\(G_1\)) and Ackley’s (\(A_1\)) function.

Table 3. Medians \(\tilde{f}_{\text {min}}\), means \(\overline{f}_{\text {min}}\) and standard deviations \(\sigma _{\text {min}}\) of the minimum fitness after 1000 generations aggregated from 100 runs for default sampling (upper table) and transfer scenarios (bottom table). Further, normalized ranks z and p-values for a two-tailed Wilcoxon rank sum test are given. For a significance level of \(\alpha =0.05\) the null hypothesis can be considered to be rejected in all experiments.

Experimental Validation. We investigate the utility of the transformations in a set of 9 experiments with 4 transformed standard problems. We consider the transfer from source problem to target problem, \(P_0 \rightarrow P_1\), and likewise the transfer in the reverse direction, \(P_1 \rightarrow P_0\). We use the Sphere function \(S_1\) with ill-conditioning, \(45^\circ \) rotation and a search space extended to \([-100,100]^2\); Ackley’s function \(A_1\) with a translation of \(\mathbf {t}=(-15,20)\) and subsequently added oscillations and asymmetries; Rastrigin’s function \(R_1\) with \(22.5^\circ \) rotation, a small shift \(\mathbf {t}=(3,2)\), a search space extended to \([-100,100]^2\) and added asymmetry; and Griewank’s function \(G_1\) with \(20^\circ \) rotation and added oscillations. Further, we denote the Sphere and Rastrigin’s function with search spaces extended to \([-100,100]^2\) as \(S_0^*\) and \(R_0^*\). Heightmaps of most of the altered benchmark problems are plotted in Fig. 5. We find that performance improvements can be realized in most of the considered transfer scenarios (Table 3). However, finding difficult and interesting scenarios without making them obvious is a hurdle. For example, in our experiments the scenario \(S_0^*\rightarrow G_0\) features negative transfer, as the transferred distribution is simply adapted to a unimodal fitness landscape with a small search space.

5 Conclusions

In this paper, we have investigated an approach which allows us to learn an evolutionary search strategy reflecting rough and globally averaged characteristics of a fitness landscape. We represented this search strategy through flexible mixture-based distributions of beneficial mutations as the basis for improved operators. These distributions can be considered improved in that they enable us to lift the isotropy assumption usually built into mutation operators, and thus to ingrain the problem structure and redistribute probability weight radially to balance exploration and exploitation more appropriately on a given problem instance. The distribution can be further adapted through a Gaussian reweighing approach, thus emulating the role strategy parameters have for sampling with a default normal distribution. However, this only seems to be useful in a limited range of scenarios. We showed that unweighted distributions can indeed lead to performance improvements on a large variety of problems; however, good prior convergence properties of the default sampling approach seem to be an essential prerequisite. Further, we investigated systematically built transfer scenarios and could also realize performance improvements in these. However, we openly acknowledge the difficulty of finding meaningful and difficult transfer scenarios. Part of the problem stems from the fact that it is unclear to which degree one can alter a problem such that it can still be considered an instance of the original. Introducing and investigating systematic transformations should nevertheless be one of the first key steps towards resolving the issue. In future work, we plan to investigate the proposed framework in higher dimensions for improved transfer scenarios, as well as to look into measures of problem similarity, potentially by means of fitness landscape analysis.