
Bayesian Optimization with a Prior for the Optimum

  • Conference paper
Machine Learning and Knowledge Discovery in Databases. Research Track (ECML PKDD 2021)

Abstract

While Bayesian Optimization (BO) is a very popular method for optimizing expensive black-box functions, it fails to leverage the experience of domain experts. This causes BO to waste function evaluations on bad design choices (e.g., machine learning hyperparameters) that the expert already knows work poorly. To address this issue, we introduce Bayesian Optimization with a Prior for the Optimum (BOPrO). BOPrO allows users to inject their knowledge into the optimization process in the form of priors about which parts of the input space will yield the best performance, rather than BO’s standard priors over functions, which are much less intuitive for users. BOPrO then combines these priors with BO’s standard probabilistic model to form a pseudo-posterior used to select which points to evaluate next. We show that BOPrO is around \(6.67\times \) faster than state-of-the-art methods on a common suite of benchmarks, and achieves a new state-of-the-art performance on a real-world hardware design application. We also show that BOPrO converges faster even if the priors for the optimum are not entirely accurate and that it robustly recovers from misleading priors.


Notes

  1. https://github.com/luinardi/hypermapper/wiki/prior-injection.

  2. Technically, the model does not parameterize p(y), since it is computed from the observed data points, which are heavily biased towards low values by the optimization process. Instead, it parameterizes a dynamically changing \(p_t(y)\), which helps to constantly challenge the model to yield better observations.

  3. We note that for continuous spaces, \(P_b(\boldsymbol{x})\) is not a probability distribution, as it does not integrate to 1, and is therefore only a pseudo-prior. For discrete spaces, we normalize \(P_b(\boldsymbol{x})\) so that it sums to 1, making it a proper distribution and prior.

  4. We note that the structural prior p(f) and the optimum prior \(P_g(\boldsymbol{x})\) provide orthogonal ways to input prior knowledge: p(f) specifies our expectations about the structure and smoothness of the function, whereas \(P_g(\boldsymbol{x})\) specifies knowledge about the location of the optimum.

  5. If the optimum for a benchmark is not known, we approximate it using the best value found during previous BO experiments.

  6. https://github.com/HIPS/Spearmint.

  7. https://github.com/hyperopt/hyperopt.

  8. https://github.com/uber-research/TuRBO.

  9. https://github.com/automl/SMAC3.

References

  1. Balandat, M., et al.: BoTorch: a framework for efficient Monte-Carlo Bayesian optimization. In: Advances in Neural Information Processing Systems (2020)

  2. Bergstra, J., Yamins, D., Cox, D.D.: Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures. In: International Conference on Machine Learning (2013)

  3. Bergstra, J.S., Bardenet, R., Bengio, Y., Kégl, B.: Algorithms for hyper-parameter optimization. In: Advances in Neural Information Processing Systems (2011)

  4. Bouthillier, X., Varoquaux, G.: Survey of machine-learning experimental methods at NeurIPS 2019 and ICLR 2020. Research report, Inria Saclay Ile de France (January 2020). https://hal.archives-ouvertes.fr/hal-02447823

  5. Calandra, R., Seyfarth, A., Peters, J., Deisenroth, M.P.: Bayesian optimization for learning gaits under uncertainty. Ann. Math. Artif. Intell. 76(1–2), 5–23 (2016)

  6. Chen, Y., Huang, A., Wang, Z., Antonoglou, I., Schrittwieser, J., Silver, D., de Freitas, N.: Bayesian optimization in AlphaGo. CoRR abs/1812.06855 (2018)

  7. Clarke, A., McMahon, B., Menon, P., Patel, K.: Optimizing hyperparams for image datasets in Fastai (2020). https://www.platform.ai/post/optimizing-hyperparams-for-image-datasets-in-fastai

  8. Dixon, L.C.W.: The global optimization problem: an introduction. In: Towards Global Optimization 2, pp. 1–15 (1978)

  9. Eriksson, D., Pearce, M., Gardner, J.R., Turner, R., Poloczek, M.: Scalable global optimization via local Bayesian optimization. In: Advances in Neural Information Processing Systems (2019)

  10. Falkner, S., Klein, A., Hutter, F.: BOHB: robust and efficient hyperparameter optimization at scale. In: International Conference on Machine Learning (2018)

  11. Feurer, M., Springenberg, J.T., Hutter, F.: Initializing Bayesian hyperparameter optimization via meta-learning. In: AAAI Conference on Artificial Intelligence (2015)

  12. Gardner, J.R., Kusner, M.J., Xu, Z.E., Weinberger, K.Q., Cunningham, J.P.: Bayesian optimization with inequality constraints. In: International Conference on Machine Learning (2014)

  13. Golovin, D., Solnik, B., Moitra, S., Kochanski, G., Karro, J., Sculley, D.: Google Vizier: a service for black-box optimization. In: SIGKDD International Conference on Knowledge Discovery and Data Mining (2017)

  14. GPy: GPy: a Gaussian process framework in Python (since 2012). http://github.com/SheffieldML/GPy

  15. Hansen, N., Ostermeier, A.: Adapting arbitrary normal mutation distributions in evolution strategies: the covariance matrix adaptation. In: Proceedings of IEEE International Conference on Evolutionary Computation (1996)

  16. Hansen, N., Akimoto, Y., Baudis, P.: CMA-ES/pycma on GitHub

  17. Hernández-Lobato, J.M., Hoffman, M.W., Ghahramani, Z.: Predictive entropy search for efficient global optimization of black-box functions. In: Advances in Neural Information Processing Systems (2014)

  18. Hutter, F., Xu, L., Hoos, H., Leyton-Brown, K.: Algorithm runtime prediction: methods & evaluation. Artif. Intell. 206, 79–111 (2014)

  19. Hutter, F., Hoos, H.H., Leyton-Brown, K.: Sequential model-based optimization for general algorithm configuration. In: Learning and Intelligent Optimization Conference (2011)

  20. Hutter, F., Kotthoff, L., Vanschoren, J. (eds.): Automated Machine Learning: Methods, Systems, Challenges. TSSCML, Springer, Cham (2019). https://doi.org/10.1007/978-3-030-05318-5

  21. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: International Conference on Learning Representations (2015)

  22. Klein, A., Dai, Z., Hutter, F., Lawrence, N.D., Gonzalez, J.: Meta-surrogate benchmarking for hyperparameter optimization. In: Advances in Neural Information Processing Systems (2019)

  23. Koeplinger, D., et al.: Spatial: a language and compiler for application accelerators. In: SIGPLAN Conference on Programming Language Design and Implementation (2018)

  24. Kushner, H.J.: A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. J. Basic Eng. 86(1), 97–106 (1964)

  25. Li, C., Gupta, S., Rana, S., Nguyen, V., Robles-Kelly, A., Venkatesh, S.: Incorporating expert prior knowledge into experimental design via posterior sampling. arXiv preprint arXiv:2002.11256 (2020)

  26. Lindauer, M., Eggensperger, K., Feurer, M., Falkner, S., Biedenkapp, A., Hutter, F.: SMAC v3: algorithm configuration in Python (2017). https://github.com/automl/SMAC3

  27. Lindauer, M., Hutter, F.: Warmstarting of model-based algorithm configuration. In: AAAI Conference on Artificial Intelligence (2018)

  28. López-Ibáñez, M., Dubois-Lacoste, J., Pérez Cáceres, L., Stützle, T., Birattari, M.: The irace package: iterated racing for automatic algorithm configuration. Oper. Res. Perspect. 3, 43–58 (2016)

  29. Mockus, J., Tiesis, V., Zilinskas, A.: The application of Bayesian methods for seeking the extremum. In: Towards Global Optimization 2, pp. 117–129 (1978)

  30. Nardi, L., Bodin, B., Saeedi, S., Vespa, E., Davison, A.J., Kelly, P.H.: Algorithmic performance-accuracy trade-off in 3D vision applications using HyperMapper. In: International Parallel and Distributed Processing Symposium Workshops (2017)

  31. Nardi, L., Koeplinger, D., Olukotun, K.: Practical design space exploration. In: International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (2019)

  32. Neal, R.M.: Bayesian Learning for Neural Networks, vol. 118. Springer, New York (1996). https://doi.org/10.1007/978-1-4612-0745-0

  33. Oh, C., Gavves, E., Welling, M.: BOCK: Bayesian optimization with cylindrical kernels. In: International Conference on Machine Learning (2018)

  34. Paleyes, A., Pullin, M., Mahsereci, M., Lawrence, N., González, J.: Emulation of physical processes with Emukit. In: Workshop on Machine Learning and the Physical Sciences, NeurIPS (2019)

  35. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)

  36. Perrone, V., Shen, H., Seeger, M., Archambeau, C., Jenatton, R.: Learning search spaces for Bayesian optimization: another view of hyperparameter transfer learning. In: Advances in Neural Information Processing Systems (2019)

  37. Ramachandran, A., Gupta, S., Rana, S., Li, C., Venkatesh, S.: Incorporating expert prior in Bayesian optimisation via space warping. Knowl. Based Syst. 195, 105663 (2020)

  38. Shahriari, B., Bouchard-Côté, A., de Freitas, N.: Unbounded Bayesian optimization via regularization. In: Artificial Intelligence and Statistics, pp. 1168–1176 (2016)

  39. Shahriari, B., Swersky, K., Wang, Z., Adams, R.P., de Freitas, N.: Taking the human out of the loop: a review of Bayesian optimization. Proc. IEEE 104(1), 148–175 (2015)

  40. Siivola, E., Vehtari, A., Vanhatalo, J., González, J., Andersen, M.R.: Correcting boundary over-exploration deficiencies in Bayesian optimization with virtual derivative sign observations. In: International Workshop on Machine Learning for Signal Processing (2018)

  41. Snoek, J., Larochelle, H., Adams, R.P.: Practical Bayesian optimization of machine learning algorithms. In: Advances in Neural Information Processing Systems (2012)

  42. Srinivas, N., Krause, A., Kakade, S.M., Seeger, M.W.: Gaussian process optimization in the bandit setting: no regret and experimental design. In: International Conference on Machine Learning (2010)

  43. Hutter, F., Kotthoff, L., Vanschoren, J. (eds.): Automated Machine Learning. TSSCML, Springer, Cham (2019). https://doi.org/10.1007/978-3-030-05318-5

  44. Wang, Q., et al.: ATMSeer: increasing transparency and controllability in automated machine learning. In: CHI Conference on Human Factors in Computing Systems (2019)

  45. Wu, J., Poloczek, M., Wilson, A.G., Frazier, P.I.: Bayesian optimization with gradients. In: Advances in Neural Information Processing Systems (2017)

Download references

Acknowledgments

We thank Matthew Feldman for Spatial support. Luigi Nardi and Kunle Olukotun were supported in part by affiliate members and other supporters of the Stanford DAWN project—Ant Financial, Facebook, Google, Intel, Microsoft, NEC, SAP, Teradata, and VMware. Luigi Nardi was also partially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation. Artur Souza and Leonardo B. Oliveira were supported by CAPES, CNPq, and FAPEMIG. Frank Hutter acknowledges support by the European Research Council (ERC) under the European Union Horizon 2020 research and innovation programme through grant no. 716721. The computations were also enabled by resources provided by the Swedish National Infrastructure for Computing (SNIC) at LUNARC partially funded by the Swedish Research Council through grant agreement no. 2018-05973.

Author information

Correspondence to Artur Souza.

Appendices

A Prior Forgetting Supplementary Experiments

Fig. 6. BOPrO on the 1D Branin function with a decay prior. The leftmost column shows the log pseudo-posterior before any samples are evaluated; in this case, the pseudo-posterior equals the decay prior. The other columns show the model and pseudo-posterior after 0 (random samples only), 10, and 20 BO iterations. Two random samples are used to initialize the GP model.

In this section, we provide additional evidence that BOPrO can recover from wrongly defined priors, complementing Sect. 4.1. Figure 6 shows BOPrO on the 1D Branin function, as in Fig. 3, but with a decay prior. Column (a) of Fig. 6 shows the decay prior and the 1D Branin function. This prior encodes the incorrect belief that the optimum is likely located on the left side of the space, around \(\mathrm {x} = -5\), whereas the true optimum lies at the orange dashed line. Columns (b), (c), and (d) of Fig. 6 show BOPrO on the 1D Branin after \(D+1=2\) initial samples and 0, 10, and 20 BO iterations, respectively. At the beginning of BO, as shown in column (b), the pseudo-posterior is nearly identical to the prior and guides BOPrO towards the left region of the space. As more points are sampled, the model becomes more accurate and steers the pseudo-posterior away from the wrong prior (column (c)). Notably, the pseudo-posterior before \(\mathrm {x} = 0\) falls to 0, as the predictive model is certain there will be no improvement from sampling this region. After 20 iterations, BOPrO finds the optimal region despite the poor start (column (d)). The peak in the pseudo-posterior in column (d) shows that BOPrO will continue to exploit the optimal region, as it is not certain whether the exact optimum has been found. The pseudo-posterior is also high in the high-uncertainty region beyond \(x = 4\), showing that BOPrO will explore that region after it finds the optimum.

Fig. 7. BOPrO on the Branin function with exponential priors for both dimensions. (a) shows the log pseudo-posterior before any samples are evaluated; in this case, the pseudo-posterior equals the prior, and the green crosses are the optima. (b) shows the result of optimization after 3 initialization samples drawn at random from the prior and 50 BO iterations. The dots in (b) show the points explored by BOPrO, with greener points denoting later iterations. The colored heatmap shows the log of the pseudo-posterior \(g(\boldsymbol{x})\). (Color figure online)

Figure 7 shows BOPrO on the standard 2D Branin function. We use exponential priors for both dimensions, which guide optimization towards a region containing only poorly performing, high function values. Figure 7a shows the prior, and Fig. 7b shows the optimization results after \(D+1=3\) initialization samples and 50 BO iterations. Note that, once again, optimization begins near the region favored by the prior but moves away from it and towards the optima as BO progresses. After 50 BO iterations, BOPrO finds all three optimal regions of the Branin function.

B Mathematical Derivations

1.1 B.1 EI Derivation

Here, we provide a full derivation of Eq. (7):

$$\begin{aligned} EI_{f_{\gamma }}(\boldsymbol{x})&:=\int _{-\infty }^{\infty } \max (f_{\gamma } - y, 0) p(y|\boldsymbol{x}) dy = \int _{-\infty }^{f_{\gamma }} (f_{\gamma } - y)\frac{p(\boldsymbol{x}|y) p(y)}{p(\boldsymbol{x})} dy. \end{aligned}$$

As defined in Sect. 3.2, \(p(y < f_{\gamma }) = \gamma \), where \(\gamma \) is a quantile of the observed objective values \(\{y^{(i)}\}\). Then \(p(\boldsymbol{x})= \int _{\mathbb {R}}p(\boldsymbol{x}|y)p(y)dy = \gamma g(\boldsymbol{x}) + (1-\gamma ) b(\boldsymbol{x})\), where \(g(\boldsymbol{x})\) and \(b(\boldsymbol{x})\) are the posteriors introduced in Sect. 3.3. Since \(p(\boldsymbol{x}|y) = g(\boldsymbol{x})\) for \(y < f_{\gamma }\), it follows that

$$\begin{aligned} \int _{-\infty }^{f_{\gamma }} (f_{\gamma } - y) p(\boldsymbol{x}|y) p(y) dy&= g(\boldsymbol{x}) \int _{-\infty }^{f_{\gamma }} (f_{\gamma } - y) p(y) dy \nonumber \\&= \gamma f_{\gamma } g(\boldsymbol{x}) - g(\boldsymbol{x}) \int _{-\infty }^{f_{\gamma }} y p(y) dy, \end{aligned}$$
(8)

so that finally

$$\begin{aligned} EI_{f_{\gamma }}(\boldsymbol{x})&=\frac{\gamma f_{\gamma } g(\boldsymbol{x}) - g(\boldsymbol{x}) \int _{-\infty }^{f_{\gamma }} y p(y) dy}{\gamma g(\boldsymbol{x}) + (1-\gamma ) b(\boldsymbol{x})} \propto \left( \gamma + \dfrac{b(\boldsymbol{x})}{g(\boldsymbol{x})}(1 - \gamma ) \right) ^{-1}. \end{aligned}$$
(9)
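For concreteness, Eq. (9) says that maximizing EI only requires the ratio of the two pseudo-posteriors evaluated at each candidate. A minimal sketch of this proxy (the function name and the numerical guard are our additions; g_x and b_x stand for \(g(\boldsymbol{x})\) and \(b(\boldsymbol{x})\) at the candidate points):

```python
import numpy as np

def ei_proxy(g_x, b_x, gamma=0.05):
    """Quantity proportional to EI_{f_gamma}(x), per Eq. (9): maximizing
    this proxy is equivalent to maximizing EI. g_x and b_x are the
    pseudo-posterior densities g(x) and b(x) at the candidate points."""
    g_x = np.maximum(g_x, 1e-12)  # guard against division by zero (our addition)
    return 1.0 / (gamma + (b_x / g_x) * (1.0 - gamma))
```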

1.2 B.2 Proof of Proposition 1

Here, we provide the proof of Proposition 1:

$$\begin{aligned}&\lim _{t\rightarrow \infty } \mathop {\mathrm {arg\,max}}_{\boldsymbol{x}\in \mathcal {X}} EI_{f_{\gamma }}(\boldsymbol{x}) \end{aligned}$$
(10)
$$\begin{aligned}&= \lim _{t\rightarrow \infty } \mathop {\mathrm {arg\,max}}_{\boldsymbol{x}\in \mathcal {X}} \int _{-\infty }^{f_{\gamma }} (f_{\gamma } - y) p(\boldsymbol{x}|y) p(y) dy \end{aligned}$$
(11)
$$\begin{aligned}&= \lim _{t\rightarrow \infty } \mathop {\mathrm {arg\,max}}_{\boldsymbol{x}\in \mathcal {X}} g(\boldsymbol{x}) \int _{-\infty }^{f_{\gamma }} (f_{\gamma } - y) p(y) dy \end{aligned}$$
(12)
$$\begin{aligned}&= \lim _{t\rightarrow \infty } \mathop {\mathrm {arg\,max}}_{\boldsymbol{x}\in \mathcal {X}} \left( \gamma f_{\gamma } g(\boldsymbol{x}) - g(\boldsymbol{x}) \int _{-\infty }^{f_{\gamma }} y p(y) dy\right) \end{aligned}$$
(13)
$$\begin{aligned}&= \lim _{t\rightarrow \infty } \mathop {\mathrm {arg\,max}}_{\boldsymbol{x}\in \mathcal {X}} \frac{\gamma f_{\gamma } g(\boldsymbol{x}) - g(\boldsymbol{x}) \int _{-\infty }^{f_{\gamma }} y p(y) dy}{\gamma g(\boldsymbol{x}) + (1-\gamma ) b(\boldsymbol{x})} \end{aligned}$$
(14)

which, from Eq. (9), is equal to:

$$\begin{aligned}&= \lim _{t\rightarrow \infty } \mathop {\mathrm {arg\,max}}_{\boldsymbol{x}\in \mathcal {X}} \left( \gamma + \dfrac{b(\boldsymbol{x})}{g(\boldsymbol{x})}(1 - \gamma ) \right) ^{-1} \end{aligned}$$
(15)

We can raise Eq. (15) to the power \(\dfrac{1}{t}\) without changing the expression, since taking the power is a monotone transformation and the argument that maximizes EI therefore does not change:

$$\begin{aligned}&= \lim _{t\rightarrow \infty } \mathop {\mathrm {arg\,max}}_{\boldsymbol{x}\in \mathcal {X}} \left( \gamma + \dfrac{b(\boldsymbol{x})}{g(\boldsymbol{x})}(1 - \gamma ) \right) ^{-\frac{1}{t}} \end{aligned}$$
(16)

Substituting \(g(\boldsymbol{x})\) and \(b(\boldsymbol{x})\) by their definitions in Sect. 3.3 gives:

$$\begin{aligned}&= \lim _{t\rightarrow \infty } \mathop {\mathrm {arg\,max}}_{\boldsymbol{x}\in \mathcal {X}} \left( \gamma + \dfrac{P_b(\boldsymbol{x})\mathcal {M}_b(\boldsymbol{x})^{\tfrac{t}{\beta }}}{P_g(\boldsymbol{x})\mathcal {M}_g(\boldsymbol{x})^{\tfrac{t}{\beta }}}(1 - \gamma ) \right) ^{-\frac{1}{t}}\end{aligned}$$
(17)
$$\begin{aligned}&= \lim _{t\rightarrow \infty } \mathop {\mathrm {arg\,max}}_{\boldsymbol{x}\in \mathcal {X}} \left( \dfrac{P_b(\boldsymbol{x})\mathcal {M}_b(\boldsymbol{x})^{\tfrac{t}{\beta }}}{P_g(\boldsymbol{x})\mathcal {M}_g(\boldsymbol{x})^{\tfrac{t}{\beta }}}(1 - \gamma ) \right) ^{-\frac{1}{t}}\end{aligned}$$
(18)
$$\begin{aligned}&= \lim _{t\rightarrow \infty } \mathop {\mathrm {arg\,max}}_{\boldsymbol{x}\in \mathcal {X}} \left( \dfrac{P_b(\boldsymbol{x})}{P_g(\boldsymbol{x})}\right) ^{-\frac{1}{t}} \left( \dfrac{\mathcal {M}_b(\boldsymbol{x})^{\frac{t}{\beta }}}{\mathcal {M}_g(\boldsymbol{x})^{\frac{t}{\beta }}}\right) ^{-\frac{1}{t}} \left( 1- \gamma \right) ^{-\frac{1}{t}}\end{aligned}$$
(19)
$$\begin{aligned}&= \lim _{t\rightarrow \infty } \mathop {\mathrm {arg\,max}}_{\boldsymbol{x}\in \mathcal {X}} \left( \dfrac{P_b(\boldsymbol{x})}{P_g(\boldsymbol{x})}\right) ^{-\frac{1}{t}} \left( \dfrac{\mathcal {M}_b(\boldsymbol{x})}{\mathcal {M}_g(\boldsymbol{x})}\right) ^{-{\frac{1}{\beta }}} \left( 1- \gamma \right) ^{-\frac{1}{t}}\end{aligned}$$
(20)
$$\begin{aligned}&= \mathop {\mathrm {arg\,max}}_{\boldsymbol{x}\in \mathcal {X}} \left( \dfrac{\mathcal {M}_b(\boldsymbol{x})}{\mathcal {M}_g(\boldsymbol{x})} \right) ^{-\frac{1}{\beta }}\end{aligned}$$
(21)
$$\begin{aligned}&= \mathop {\mathrm {arg\,max}}_{\boldsymbol{x}\in \mathcal {X}} \left( \dfrac{1-\mathcal {M}_g(\boldsymbol{x})}{\mathcal {M}_g(\boldsymbol{x})} \right) ^{-\frac{1}{\beta }}\end{aligned}$$
(22)
$$\begin{aligned}&= \mathop {\mathrm {arg\,max}}_{\boldsymbol{x}\in \mathcal {X}} \left( \dfrac{1}{\mathcal {M}_g(\boldsymbol{x})} - 1 \right) ^{-\frac{1}{\beta }}\end{aligned}$$
(23)
$$\begin{aligned}&= \mathop {\mathrm {arg\,max}}_{\boldsymbol{x}\in \mathcal {X}} \left( \mathcal {M}_g(\boldsymbol{x})\right) ^{\frac{1}{\beta }}\end{aligned}$$
(24)
$$\begin{aligned}&= \mathop {\mathrm {arg\,max}}_{\boldsymbol{x}\in \mathcal {X}} \mathcal {M}_g(\boldsymbol{x})\end{aligned}$$
(25)

This shows that, as iterations progress, the model grows more important: if BOPrO is run long enough, the prior washes out and BOPrO trusts only the probabilistic model. Since \(\mathcal {M}_g(\boldsymbol{x})\) is the Probability of Improvement (PI) under the probabilistic model \(p(y|\boldsymbol{x})\), in the limit, maximizing the acquisition function \(EI_{f_{\gamma }}(\boldsymbol{x})\) is equivalent to maximizing the PI acquisition function on \(p(y|\boldsymbol{x})\). In other words, for high values of t, BOPrO converges to standard BO with a PI acquisition function.
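To see the washout numerically, take the log of the unnormalized pseudo-posterior, \(\log g(\boldsymbol{x}) = \log P_g(\boldsymbol{x}) + \frac{t}{\beta }\log \mathcal {M}_g(\boldsymbol{x})\): the model term grows linearly in t while the prior term stays fixed. A small illustration with made-up densities (the values 0.2 and 0.6 are arbitrary placeholders, not numbers from the paper):

```python
import numpy as np

beta = 10.0
log_prior = np.log(0.2)  # hypothetical prior density P_g(x) at some x
log_model = np.log(0.6)  # hypothetical model probability M_g(x) at the same x

# log g(x) = log P_g(x) + (t / beta) * log M_g(x): the model term grows
# linearly in the iteration t while the prior term stays constant.
for t in [1, 10, 100, 1000]:
    model_term = (t / beta) * log_model
    prior_share = abs(log_prior) / (abs(log_prior) + abs(model_term))
    print(f"t = {t:4d}  prior share of log g(x): {prior_share:.3f}")
```

Running this shows the prior's share of the log pseudo-posterior shrinking from roughly 97% at t = 1 to about 3% at t = 1000, matching the limit derived above.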

C Experimental Setup

Table 1. Search spaces for our synthetic benchmarks. For the Profet benchmarks, we report the original ranges and whether or not a log scale was used.

We use a combination of publicly available implementations for our predictive models. For our Gaussian Process (GP) model, we use GPy's [14] GP implementation with the Matérn 5/2 kernel. We use a different length-scale for each input dimension, learned via Automatic Relevance Determination (ARD) [32]. For our Random Forest (RF) model, we use scikit-learn's RF implementation [35]. We set the fraction of features per split to 0.5, the minimum number of samples for a split to 5, and disable bagging. We also adapt our RF implementation to use the same split selection approach as Hutter et al. [18].
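A minimal sketch of how these two surrogates can be instantiated with the cited libraries (the placeholder data is ours, and the custom split-selection adaptation of Hutter et al. [18] is not shown):

```python
import GPy
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X, y = rng.random((20, 2)), rng.random((20, 1))  # placeholder observations

# GP surrogate: Matern 5/2 kernel with one length-scale per dimension (ARD).
kernel = GPy.kern.Matern52(input_dim=X.shape[1], ARD=True)
gp = GPy.models.GPRegression(X, y, kernel)
gp.optimize()  # fit kernel hyperparameters by maximizing the marginal likelihood

# RF surrogate: half the features per split, at least 5 samples per split,
# bagging disabled (each tree sees the full data set).
rf = RandomForestRegressor(max_features=0.5, min_samples_split=5,
                           bootstrap=False)
rf.fit(X, y.ravel())
```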

Table 2. Search space, priors, and expert configuration for the Shallow CNN application. The default value for each parameter is shown in bold.

For our constrained Bayesian Optimization (cBO) approach, we use scikit-learn’s RF classifier, trained on previously explored configurations, to predict the probability of a configuration being feasible. We then weight our EI acquisition function by this probability of feasibility, as proposed by Gardner et al. [12]. We normalize our EI acquisition function before considering the probability of feasibility, to ensure both values are in the same range. This cBO implementation is used in the Spatial use-case as in Nardi et al. [31].
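A sketch of this feasibility weighting, assuming the EI values for the candidates have already been computed (variable names and placeholder data are ours):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_obs = rng.random((50, 7))       # previously explored configurations
feasible = rng.random(50) < 0.5   # placeholder feasibility outcomes

clf = RandomForestClassifier().fit(X_obs, feasible)

def constrained_ei(X_cand, ei_values):
    """Weight the (normalized) EI by the predicted probability of
    feasibility, following Gardner et al. [12]."""
    p_feasible = clf.predict_proba(X_cand)[:, 1]
    ei_norm = ei_values / (ei_values.max() + 1e-12)  # put EI in [0, 1]
    return ei_norm * p_feasible
```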

For all experiments, we set the model weight hyperparameter to \(\beta = 10\) and the model quantile to \(\gamma = 0.05\); see Appendices I and J. Before starting the main BO loop, BOPrO is initialized by randomly sampling \(D+1\) points from the prior, where D is the number of input variables. We use the public implementation of Spearmint (see footnote 6), which by default uses 2 random samples for initialization. We normalize our synthetic priors before computing the pseudo-posterior, to ensure they are in the same range as our model. We also implement interleaving, which randomly samples a point to explore during BO with a \(10\%\) chance.
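For reference, a minimal sketch of the pseudo-posterior that \(\beta \) enters (cf. Eq. (4) and the expression used in Appendix B.2); prior_x and model_x stand for \(P_g(\boldsymbol{x})\) and \(\mathcal {M}_g(\boldsymbol{x})\) at a candidate point:

```python
def pseudo_posterior(prior_x, model_x, t, beta=10.0):
    """Unnormalized pseudo-posterior g(x) = P_g(x) * M_g(x)^(t / beta):
    the prior density is combined with the model's probability, whose
    exponent grows with the iteration t, so the model gradually
    dominates as evidence accumulates (cf. Appendix B.2)."""
    return prior_x * model_x ** (t / beta)
```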

We optimize our EI acquisition function using a combination of a multi-start local search and CMA-ES [15]. Our multi-start local search is similar to the one used in SMAC [19]. Namely, we start local searches on the 10 best points evaluated in previous BO iterations, on the 10 best performing points from a set of 10,000 random samples, on the 10 best performing points from 10,000 random samples drawn from the prior, and on the mode of the prior. To compute the neighbors of each of these 31 total points, we normalize the range of each parameter to [0, 1] and randomly sample four neighbors from a truncated Gaussian centered at the original value and with standard deviation \(\sigma = 0.1\). For CMA-ES, we use the public implementation of pycma [16]. We run pycma with two starting points, one at the incumbent and one at the mode of the prior. For both initializations we set \(\sigma _0 = 0.2\). We only use CMA-ES for our continuous search space benchmarks.
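A sketch of the two building blocks of this acquisition optimizer, assuming parameters normalized to [0, 1] and an acquisition function that has been negated for minimization (the helper names are ours):

```python
import numpy as np
from scipy.stats import truncnorm
import cma

def neighbors(x, sigma=0.1, k=4, seed=None):
    """Draw k neighbors of a [0, 1]-normalized point x from a truncated
    Gaussian centered at x, as in the multi-start local search."""
    a, b = (0.0 - x) / sigma, (1.0 - x) / sigma  # standardized bounds
    return truncnorm.rvs(a, b, loc=x, scale=sigma,
                         size=(k, len(x)), random_state=seed)

def cma_maximize(neg_acquisition, x0, sigma0=0.2):
    """One CMA-ES run via pycma [16]; neg_acquisition is the negated
    acquisition function, since CMA-ES minimizes."""
    xbest, _ = cma.fmin2(neg_acquisition, x0, sigma0,
                         options={"verbose": -9})
    return xbest

# Example: neighbors of a 2D point, and a CMA-ES run from an incumbent.
print(neighbors(np.array([0.3, 0.9]), seed=0))
xbest = cma_maximize(lambda x: float(np.sum((x - 0.5) ** 2)), x0=[0.3, 0.9])
```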

Table 3. Search space, priors, and expert configuration for the Deep CNN application. The default value for each parameter is shown in bold.

We use four synthetic benchmarks in our experiments.

Branin. The Branin function is a well-known synthetic benchmark for optimization problems [8]; it has two input dimensions and three global minima. Its standard form is sketched below.
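For reference, the textbook definition of Branin (this is the standard formula, not code from the paper):

```python
import numpy as np

def branin(x1, x2):
    """Standard Branin function on x1 in [-5, 10], x2 in [0, 15];
    global minimum value ~0.3979 at (-pi, 12.275), (pi, 2.275),
    and (9.42478, 2.475)."""
    a, b, c = 1.0, 5.1 / (4 * np.pi ** 2), 5.0 / np.pi
    r, s, t = 6.0, 10.0, 1.0 / (8 * np.pi)
    return a * (x2 - b * x1 ** 2 + c * x1 - r) ** 2 \
        + s * (1 - t) * np.cos(x1) + s

print(branin(np.pi, 2.275))  # ~0.3979
```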

SVM. A hyperparameter-optimization benchmark in 2D based on Profet [22]. This benchmark is generated by a generative meta-model built using a set of SVM classification models trained on 16 OpenML tasks. The benchmark has two input parameters, corresponding to SVM hyperparameters.

FC-Net. A hyperparameter and architecture optimization benchmark in 6D based on Profet. The FC-Net benchmark is generated by a generative meta-model built using a set of feed-forward neural networks trained on the same 16 OpenML tasks as the SVM benchmark. The benchmark has six input parameters corresponding to network hyperparameters.

XGBoost. A hyperparameter-optimization benchmark in 8D based on Profet. The XGBoost benchmark is generated by a generative meta-model built using a set of XGBoost regression models in 11 UCI datasets. The benchmark has eight input parameters, corresponding to XGBoost hyperparameters.

The search spaces for each benchmark are summarized in Table 1. For the Profet benchmarks, we report the original ranges and whether or not a log scale was used. However, in practice, Profet’s generative model transforms the range of all hyperparameters to a linear [0, 1] range. We use Emukit’s public implementation for these benchmarks [34].

D Spatial Real-World Application

Spatial [23] is a programming language and corresponding compiler for the design of application accelerators on reconfigurable architectures, e.g., field-programmable gate arrays (FPGAs). These reconfigurable architectures are a type of logic chip that can be reconfigured via software to implement different applications. Spatial provides users with a high level of abstraction for hardware design, so that they can easily design their own applications on FPGAs. It allows users to specify parameters that do not change the behavior of the application but impact the runtime and resource usage (e.g., logic units) of the final design. During compilation, the Spatial compiler estimates the ranges of these parameters as well as the resource usage and runtime of the application for different parameter values. These parameters can then be optimized during compilation to achieve the design with the fastest runtime. We fully integrate BOPrO as a pass in Spatial's compiler, so that Spatial can automatically use BOPrO for this optimization during compilation. This enables Spatial to seamlessly call BOPrO during the compilation of any new application to guide the search towards the best design on an application-specific basis.

Table 4. Search space, priors, and expert configuration for the MD Grid application. The default value for each parameter is shown in bold.

In our experiments, we introduce for the first time the automatic optimization of three Spatial real-world applications: 7D shallow and deep CNNs, and a 10D molecular dynamics grid application. Previous work by Nardi et al. [31] applied automatic optimization of Spatial parameters to a set of benchmarks; here, we focus on real-world applications, raising the bar for state-of-the-art automated hardware design optimization. BOPrO is used to optimize the parameters to find a design that leads to the fastest runtime. The search space for these three applications consists of ordinal and categorical parameters; to best handle these discrete parameters, we implement and use a Random Forest surrogate instead of a Gaussian Process, as explained in Appendix C. These parameters are application-specific and control how much of the FPGA's resources we want to use to parallelize each step of the application's computation. The goal is to find which steps are most important to parallelize in the final design, in order to achieve the fastest runtime. Some parameters also control whether to enable pipeline scheduling, which consumes resources but accelerates runtime, and others focus on memory management. We refer to Koeplinger et al. [23] and Nardi et al. [31] for more details on Spatial's parameters.

The three Spatial benchmarks also have feasibility constraints in the search space, meaning that some parameter configurations are infeasible. A configuration is considered infeasible if the final design requires more logic resources than what the FPGA provides, i.e., it is not possible to perform FPGA synthesis because the design does not fit in the FPGA. To handle these constraints, we use our cBO implementation (Appendix C). Our goal is thus to find the design with the fastest runtime under the constraint that the design fits the FPGA resource budget.

The priors for these Spatial applications take the form of a list of probabilities, containing the probability of each ordinal or categorical value being good. Each benchmark also has a default configuration, which ensures all methods start with at least one feasible configuration. The priors and the default configuration for these benchmarks were provided once by an unbiased Spatial developer, who is not an author of this paper, and kept unchanged during the entire project. The search space, priors, and the expert configuration used in our experiments for each application are presented in Tables 2, 3, and 4.

E Multivariate Prior Comparison

Fig. 8. Log regret comparison of BOPrO with multivariate and univariate KDE priors. The lines and shaded regions show the mean and standard deviation of the log simple regret over 5 runs. All methods were initialized with \(D+1\) random samples, where D is the number of input dimensions, indicated by the vertical dashed line. We run the benchmarks for 200 iterations.

In this section, we compare the performance of BOPrO with univariate and multivariate priors. For this, we construct synthetic univariate and multivariate priors using Kernel Density Estimation (KDE) with a Gaussian kernel. We build strong and weak versions of the KDE priors: the strong priors are computed using a KDE on the best \(10D\) out of \(10{,}000{,}000D\) uniformly sampled points, while the weak priors are computed using a KDE on the best \(10D\) out of \(1{,}000D\) uniformly sampled points. We use the same points for both univariate and multivariate priors. We use scipy's Gaussian KDE implementation, but adapt its Scott's-rule bandwidth to \(100n^{-\frac{1}{d}}\), where d is the number of variables in the KDE prior, to make our priors more peaked.
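A minimal sketch of this construction with scipy (the placeholder sample points are ours; the bandwidth factors follow the formula above with d = D and d = 1, respectively):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
D = 2
n = 10 * D                        # e.g. keep the best 10D points
best_points = rng.random((n, D))  # placeholder for the kept points

# Multivariate prior: one KDE over all D dimensions jointly.
multivariate_prior = gaussian_kde(best_points.T,
                                  bw_method=100 * n ** (-1.0 / D))

# Univariate priors: one independent KDE per dimension (d = 1).
univariate_priors = [gaussian_kde(best_points[:, j],
                                  bw_method=100 * n ** (-1.0))
                     for j in range(D)]

density = multivariate_prior(np.array([[0.5], [0.5]]))  # evaluate at a point
```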

Figure 8 shows a log regret comparison of BOPrO with univariate and multivariate KDE priors. We note that in all cases BOPrO achieves similar performance with univariate and multivariate priors. For the Branin and SVM benchmarks, the weak multivariate prior leads to slightly better results than the weak univariate prior. However, the difference is small, on the order of \(10^{-4}\) and \(10^{-6}\), respectively.

Surprisingly, for the XGBoost benchmark, the univariate versions of both the weak and strong priors lead to better results than their respective multivariate counterparts, though, once again, the difference in performance is small: around 0.2 and 0.03 for the weak and strong prior, respectively, whereas the XGBoost benchmark can reach values as high as \(f(\boldsymbol{x}) = 700\). Our hypothesis is that this difference comes from the bandwidth estimator (\(100n^{-\frac{1}{d}}\)), which leads to larger bandwidths, and consequently smoother priors, when a multivariate prior is constructed.

Fig. 9. Log regret comparison of BOPrO with varying prior quality. The lines and shaded regions show the mean and standard deviation of the log simple regret over 5 runs. All methods were initialized with \(D+1\) random samples, where D is the number of input dimensions, indicated by the vertical dashed line. We run the benchmarks for 200 iterations.

F Misleading Prior Comparison

Figure 9 shows the effect of injecting a misleading prior into BOPrO. We compare BOPrO with a misleading prior, no prior, a weak prior, and a strong prior. For our misleading prior, we use a Gaussian centered at the worst point out of \(10{,}000{,}000D\) uniform random samples. Namely, for each parameter, we inject a prior of the form \(\mathcal {N}(x_{w}, \sigma _w^2)\), where \(x_{w}\) is the value of the parameter at the point with the highest function value out of \(10{,}000{,}000D\) uniform random samples and \(\sigma _w = 0.01\). For all benchmarks, the misleading prior slows down convergence, as expected, since it pushes the optimization away from the optima in the initial phase. However, BOPrO is still able to forget the misleading prior and achieve regret similar to BOPrO without a prior.
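A sketch of such a misleading prior for one parameter, assuming a [0, 1]-normalized range (truncating the Gaussian to the range is our choice for the sketch, not a detail stated in the text):

```python
from scipy.stats import truncnorm

def misleading_prior(x_w, sigma_w=0.01, low=0.0, high=1.0):
    """Gaussian N(x_w, sigma_w^2) centered at the worst observed value
    x_w of a parameter, truncated to the parameter range so the sketch
    yields a proper density on the search space."""
    a, b = (low - x_w) / sigma_w, (high - x_w) / sigma_w
    return truncnorm(a, b, loc=x_w, scale=sigma_w)

prior = misleading_prior(x_w=0.87)       # hypothetical worst parameter value
print(prior.pdf(0.87), prior.pdf(0.5))   # high near x_w, ~0 elsewhere
```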

G Comparison to Other Baselines

Fig. 10. Log regret comparison of BOPrO, SMAC, and TPE. The lines and shaded regions show the mean and standard deviation of the log simple regret over 5 runs. BOPrO was initialized with \(D+1\) random samples, where D is the number of input dimensions, indicated by the vertical dashed line. We run the benchmarks for 200 iterations.

Fig. 11. Log regret comparison of TuRBO with different numbers of trust regions. TuRBO-M denotes TuRBO with M trust regions. The lines and shaded regions show the mean and standard deviation of the log simple regret over 5 runs. TuRBO was initialized with \(D+1\) uniform random samples, where D is the number of input dimensions, indicated by the vertical dashed line. We run the benchmarks for 200 iterations.

We compare BOPrO to SMAC [19], TuRBO [9], and TPE [3] on our four synthetic benchmarks. We use Hyperopt's implementation of TPE (see footnote 7), the public implementation of TuRBO (see footnote 8), and the SMAC3 Python implementation of SMAC (see footnote 9). Hyperopt defines priors as one of a list of supported distributions, including Uniform, Normal, and Lognormal, while SMAC and TuRBO do not support priors on the locality of an optimum in the form of probability distributions.

For the three Profet benchmarks (SVM, FCNet, and XGBoost), we inject the strong priors defined in Sect. 4.2 into both Hyperopt and BOPrO. For Branin, we also inject the strong prior defined in Sect. 4.2 into BOPrO; however, we cannot inject this prior into Hyperopt, because our strong prior for Branin takes the form of a Gaussian mixture peaked at all three optima, and Hyperopt does not support Gaussian mixture priors. Instead, for Hyperopt, we arbitrarily choose one of the optima, \((\pi , 2.275)\), and use a Gaussian prior centered near it. We note that since we compare all approaches based on the log simple regret, both priors are comparable in terms of prior strength, since finding one optimum or all three leads to the same log regret. Also, using Hyperopt's Gaussian priors leads to an unbounded search space, which sometimes leads TPE to suggest parameter configurations outside the allowed parameter range. To prevent these values from being evaluated, we clip values outside the parameter range to the upper or lower range limit, depending on which limit was exceeded. We do not inject any priors into SMAC and TuRBO, since these methods do not support priors about the locality of an optimum.
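A sketch of the Hyperopt prior and the clipping just described (the standard deviations of 0.5 are illustrative assumptions, not values from the paper):

```python
import numpy as np
from hyperopt import hp

# Gaussian prior centered near the chosen Branin optimum (pi, 2.275);
# the standard deviations are illustrative, not taken from the paper.
space = {
    "x1": hp.normal("x1", np.pi, 0.5),
    "x2": hp.normal("x2", 2.275, 0.5),
}

def clip_to_range(x, low, high):
    """hp.normal is unbounded, so suggested values are clipped to the
    nearest range limit before the function is evaluated."""
    return float(np.clip(x, low, high))
```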

Figure 10 shows a log regret comparison between BOPrO, SMAC, TuRBO-2, and TPE on our four synthetic benchmarks. We use TuRBO-2 since it led to the best overall performance among the TuRBO variants; see Fig. 11. BOPrO achieves better performance than SMAC and TuRBO on all four benchmarks. Compared to TPE, BOPrO achieves similar or better performance on three of the four synthetic benchmarks, namely Branin, SVM, and FCNet, and slightly worse performance on XGBoost. We note, however, that the good performance of TPE on XGBoost may be an artifact of clipping values to the range limits, as mentioned above. In fact, the clipping nudges TPE towards promising configurations in this case, since XGBoost has low function values near the edges of the search space. Overall, the better performance of BOPrO is expected, since BOPrO is able to combine prior knowledge with more sample-efficient surrogates.

H Prior Baselines Comparison

Fig. 12. Log regret comparison of BOPrO, Spearmint with prior initialization, and Spearmint with default initialization. The lines and shaded regions show the mean and standard deviation of the log simple regret over 5 runs. BOPrO and Spearmint Prior were initialized with \(D+1\) random samples from the prior, where D is the number of input dimensions, indicated by the vertical dashed line. We run the benchmarks for 200 iterations.

Fig. 13. Log regret comparison of BOPrO, HyperMapper with prior initialization, and HyperMapper with default initialization. The lines and shaded regions show the mean and standard deviation of the log simple regret over 5 runs. BOPrO and HyperMapper Prior were initialized with \(D+1\) random samples from the prior, where D is the number of input dimensions, indicated by the vertical dashed line.

We show that simply initializing a BO method in the DoE phase by sampling from a prior on the locality of an optimum does not necessarily lead to better performance. Instead, in BOPrO, it is the pseudo-posterior in Eq. (4), which combines the prior with new observations, that drives its stronger performance. To show this, we compare BOPrO with Spearmint and HyperMapper as in Sects. 4.2 and 4.3, respectively, but initialize all three methods using the same approach, namely with \(D+1\) samples from the prior. Our goal is to show that simply initializing Spearmint and HyperMapper with the prior does not lead to the same performance as BOPrO, because, unlike BOPrO, these baselines do not leverage the prior after the DoE initialization phase. We report results on both our synthetic and real-world benchmarks.

Figure 12 shows the comparison between BOPrO and Spearmint Prior. In most benchmarks, the prior initialization leads to similar final performance; in particular, for XGBoost, the prior leads to improvement in early iterations but to worse final performance. For FCNet, Spearmint Prior achieves better performance; however, the improvement comes almost solely from sampling from the prior, as there is no improvement for Spearmint Prior until around iteration 190. In contrast, in all cases, BOPrO is able to leverage the prior both during initialization and during its Bayesian optimization phase, leading to improved performance. BOPrO still achieves similar or better performance than Spearmint Prior on all benchmarks.

Figure 13 shows similar results for our Spatial benchmarks. The prior does not lead HyperMapper to improved final performance. For the Shallow CNN benchmark, the prior leads HyperMapper to improved performance in early iterations compared to HyperMapper with default initialization, but HyperMapper Prior is still outperformed by BOPrO. Additionally, the prior leads to degraded performance on the Deep CNN benchmark. These results confirm that BOPrO is able to leverage the prior in its pseudo-posterior during optimization, leading to improved performance on almost all benchmarks compared to state-of-the-art BO baselines.

Fig. 14. Comparison of BOPrO with the strong prior and different values of the \(\gamma \) hyperparameter on our four synthetic benchmarks. We run BOPrO with a budget of 10D function evaluations, including \(D+1\) randomly sampled DoE configurations.

I \(\gamma \)-Sensitivity Study

We show the effect of the \(\gamma \) hyperparameter introduced in Sect. 3.2, the quantile identifying the points considered to be good. To do so, we compare the performance of BOPrO with our strong prior and different \(\gamma \) values. For all experiments, we initialize BOPrO with \(D+1\) random samples and then run BOPrO until it reaches 10D function evaluations. For each \(\gamma \) value, we run BOPrO five times and report the mean and standard deviation.
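For reference, a minimal sketch of the quantile split that \(\gamma \) controls (the function name is ours):

```python
import numpy as np

def gamma_threshold(y_observed, gamma=0.05):
    """f_gamma is the gamma-quantile of the observed objective values
    (Sect. 3.2), so that p(y < f_gamma) = gamma; points below the
    threshold are the ones considered good."""
    f_gamma = np.quantile(y_observed, gamma)
    return f_gamma, y_observed < f_gamma
```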

Figure 14 shows the results of our comparison. We first note that values near the lower and higher extremes lead to degraded performance; this is expected, since such values lead to an excess of either exploitation or exploration. Further, while BOPrO achieves similar performance for all values of \(\gamma \), values around \(\gamma = 0.05\) consistently lead to better performance.

J \(\beta \)-Sensitivity Study

Fig. 15. Comparison of BOPrO with the strong prior and different values of the \(\beta \) hyperparameter on our four synthetic benchmarks. We run BOPrO with a budget of 10D function evaluations, including \(D+1\) randomly sampled DoE configurations.

We show the effect of the \(\beta \) hyperparameter introduced in Sect. 3.3, which controls the influence of the prior over time. To do so, we compare the performance of BOPrO with our strong prior and different \(\beta \) values on our four synthetic benchmarks. For all experiments, we initialize BOPrO with \(D+1\) random samples and then run BOPrO until it reaches 10D function evaluations. For each \(\beta \) value, we run BOPrO five times and report the mean and standard deviation.

Figure 15 shows the results of our comparison. We note that values of \(\beta \) that are too low (near 0.01) or too high (near 1000) often lead to lower performance, showing that putting too much emphasis on either the model or the prior degrades performance, as expected. Further, \(\beta = 10\) leads to the best performance in three of our four benchmarks. This result is reasonable: \(\beta = 10\) means that BOPrO puts more emphasis on the prior in early iterations, when the model is still inaccurate, and slowly shifts emphasis towards the model as it sees more data and becomes more accurate.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Souza, A., Nardi, L., Oliveira, L.B., Olukotun, K., Lindauer, M., Hutter, F. (2021). Bayesian Optimization with a Prior for the Optimum. In: Oliver, N., Pérez-Cruz, F., Kramer, S., Read, J., Lozano, J.A. (eds) Machine Learning and Knowledge Discovery in Databases. Research Track. ECML PKDD 2021. Lecture Notes in Computer Science, vol. 12977. Springer, Cham. https://doi.org/10.1007/978-3-030-86523-8_17


  • DOI: https://doi.org/10.1007/978-3-030-86523-8_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86522-1

  • Online ISBN: 978-3-030-86523-8

