
Bayesian Optimization with a Prior for the Optimum

  • Conference paper
Machine Learning and Knowledge Discovery in Databases. Research Track (ECML PKDD 2021)

Abstract

While Bayesian Optimization (BO) is a very popular method for optimizing expensive black-box functions, it fails to leverage the experience of domain experts. This causes BO to waste function evaluations on bad design choices (e.g., machine learning hyperparameters) that the expert already knows work poorly. To address this issue, we introduce Bayesian Optimization with a Prior for the Optimum (BOPrO). BOPrO allows users to inject their knowledge into the optimization process in the form of priors about which parts of the input space will yield the best performance, rather than BO’s standard priors over functions, which are much less intuitive for users. BOPrO then combines these priors with BO’s standard probabilistic model to form a pseudo-posterior used to select which points to evaluate next. We show that BOPrO is around \(6.67\times \) faster than state-of-the-art methods on a common suite of benchmarks, and achieves a new state-of-the-art performance on a real-world hardware design application. We also show that BOPrO converges faster even if the priors for the optimum are not entirely accurate and that it robustly recovers from misleading priors.


Notes

  1. https://github.com/luinardi/hypermapper/wiki/prior-injection.

  2. Technically, the model does not parameterize p(y), since it is computed from the observed data points, which are heavily biased towards low values by the optimization process. Instead, it parameterizes a dynamically changing \(p_t(y)\), which helps to constantly challenge the model to yield better observations.

  3. We note that for continuous spaces, \(P_b(\boldsymbol{x})\) is not a probability distribution, as it does not integrate to 1, and is therefore only a pseudo-prior. For discrete spaces, we normalize \(P_b(\boldsymbol{x})\) so that it sums to 1, making it a proper distribution and prior.

  4. We note that the structural prior p(f) and the optimum prior \(P_g(\boldsymbol{x})\) provide orthogonal ways to input prior knowledge: p(f) specifies our expectations about the structure and smoothness of the function, whereas \(P_g(\boldsymbol{x})\) specifies knowledge about the location of the optimum.

  5. If the optimum for a benchmark is not known, we approximate it using the best value found during previous BO experiments.

  6. https://github.com/HIPS/Spearmint.

  7. https://github.com/hyperopt/hyperopt.

  8. https://github.com/uber-research/TuRBO.

  9. https://github.com/automl/SMAC3.

References

  1. Balandat, M., et al.: BoTorch: a framework for efficient Monte-Carlo Bayesian optimization. In: Advances in Neural Information Processing Systems (2020)

  2. Bergstra, J., Yamins, D., Cox, D.D.: Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures. In: International Conference on Machine Learning (2013)

  3. Bergstra, J.S., Bardenet, R., Bengio, Y., Kégl, B.: Algorithms for hyper-parameter optimization. In: Advances in Neural Information Processing Systems (2011)

  4. Bouthillier, X., Varoquaux, G.: Survey of machine-learning experimental methods at NeurIPS 2019 and ICLR 2020. Research report, Inria Saclay Ile de France (January 2020). https://hal.archives-ouvertes.fr/hal-02447823

  5. Calandra, R., Seyfarth, A., Peters, J., Deisenroth, M.P.: Bayesian optimization for learning gaits under uncertainty. Ann. Math. Artif. Intell. 76(1–2), 5–23 (2016)

  6. Chen, Y., Huang, A., Wang, Z., Antonoglou, I., Schrittwieser, J., Silver, D., de Freitas, N.: Bayesian optimization in AlphaGo. CoRR abs/1812.06855 (2018)

  7. Clarke, A., McMahon, B., Menon, P., Patel, K.: Optimizing hyperparams for image datasets in Fastai (2020). https://www.platform.ai/post/optimizing-hyperparams-for-image-datasets-in-fastai

  8. Dixon, L.C.W.: The global optimization problem: an introduction. In: Towards Global Optimization 2, pp. 1–15 (1978)

  9. Eriksson, D., Pearce, M., Gardner, J.R., Turner, R., Poloczek, M.: Scalable global optimization via local Bayesian optimization. In: Advances in Neural Information Processing Systems (2019)

  10. Falkner, S., Klein, A., Hutter, F.: BOHB: robust and efficient hyperparameter optimization at scale. In: International Conference on Machine Learning (2018)

  11. Feurer, M., Springenberg, J.T., Hutter, F.: Initializing Bayesian hyperparameter optimization via meta-learning. In: AAAI Conference on Artificial Intelligence (2015)

  12. Gardner, J.R., Kusner, M.J., Xu, Z.E., Weinberger, K.Q., Cunningham, J.P.: Bayesian optimization with inequality constraints. In: International Conference on Machine Learning (2014)

  13. Golovin, D., Solnik, B., Moitra, S., Kochanski, G., Karro, J., Sculley, D.: Google Vizier: a service for black-box optimization. In: SIGKDD International Conference on Knowledge Discovery and Data Mining (2017)

  14. GPy: GPy: a Gaussian process framework in Python (since 2012). http://github.com/SheffieldML/GPy

  15. Hansen, N., Ostermeier, A.: Adapting arbitrary normal mutation distributions in evolution strategies: the covariance matrix adaptation. In: Proceedings of IEEE International Conference on Evolutionary Computation (1996)

  16. Hansen, N., Akimoto, Y., Baudis, P.: CMA-ES/pycma on GitHub

  17. Hernández-Lobato, J.M., Hoffman, M.W., Ghahramani, Z.: Predictive entropy search for efficient global optimization of black-box functions. In: Advances in Neural Information Processing Systems (2014)

  18. Hutter, F., Xu, L., Hoos, H., Leyton-Brown, K.: Algorithm runtime prediction: methods & evaluation. Artif. Intell. 206, 79–111 (2014)

  19. Hutter, F., Hoos, H.H., Leyton-Brown, K.: Sequential model-based optimization for general algorithm configuration. In: Learning and Intelligent Optimization Conference (2011)

  20. Hutter, F., Kotthoff, L., Vanschoren, J. (eds.): Automated Machine Learning: Methods, Systems, Challenges. TSSCML, Springer, Cham (2019). https://doi.org/10.1007/978-3-030-05318-5

  21. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: International Conference on Learning Representations (2015)

  22. Klein, A., Dai, Z., Hutter, F., Lawrence, N.D., Gonzalez, J.: Meta-surrogate benchmarking for hyperparameter optimization. In: Advances in Neural Information Processing Systems (2019)

  23. Koeplinger, D., et al.: Spatial: a language and compiler for application accelerators. In: SIGPLAN Conference on Programming Language Design and Implementation (2018)

  24. Kushner, H.J.: A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. J. Basic Eng. 86(1), 97–106 (1964)

  25. Li, C., Gupta, S., Rana, S., Nguyen, V., Robles-Kelly, A., Venkatesh, S.: Incorporating expert prior knowledge into experimental design via posterior sampling. arXiv preprint arXiv:2002.11256 (2020)

  26. Lindauer, M., Eggensperger, K., Feurer, M., Falkner, S., Biedenkapp, A., Hutter, F.: SMAC v3: algorithm configuration in Python (2017). https://github.com/automl/SMAC3

  27. Lindauer, M., Hutter, F.: Warmstarting of model-based algorithm configuration. In: AAAI Conference on Artificial Intelligence (2018)

  28. López-Ibáñez, M., Dubois-Lacoste, J., Pérez Cáceres, L., Stützle, T., Birattari, M.: The irace package: iterated racing for automatic algorithm configuration. Oper. Res. Perspect. 3, 43–58 (2016)

  29. Mockus, J., Tiesis, V., Zilinskas, A.: The application of Bayesian methods for seeking the extremum. In: Towards Global Optimization 2, pp. 117–129 (1978)

  30. Nardi, L., Bodin, B., Saeedi, S., Vespa, E., Davison, A.J., Kelly, P.H.: Algorithmic performance-accuracy trade-off in 3D vision applications using HyperMapper. In: International Parallel and Distributed Processing Symposium Workshops (2017)

  31. Nardi, L., Koeplinger, D., Olukotun, K.: Practical design space exploration. In: International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (2019)

  32. Neal, R.M.: Bayesian Learning for Neural Networks, vol. 118. Springer, New York (1996). https://doi.org/10.1007/978-1-4612-0745-0

  33. Oh, C., Gavves, E., Welling, M.: BOCK: Bayesian optimization with cylindrical kernels. In: International Conference on Machine Learning (2018)

  34. Paleyes, A., Pullin, M., Mahsereci, M., Lawrence, N., González, J.: Emulation of physical processes with Emukit. In: Workshop on Machine Learning and the Physical Sciences, NeurIPS (2019)

  35. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)

  36. Perrone, V., Shen, H., Seeger, M., Archambeau, C., Jenatton, R.: Learning search spaces for Bayesian optimization: another view of hyperparameter transfer learning. In: Advances in Neural Information Processing Systems (2019)

  37. Ramachandran, A., Gupta, S., Rana, S., Li, C., Venkatesh, S.: Incorporating expert prior in Bayesian optimisation via space warping. Knowl. Based Syst. 195, 105663 (2020)

  38. Shahriari, B., Bouchard-Côté, A., de Freitas, N.: Unbounded Bayesian optimization via regularization. In: Artificial Intelligence and Statistics, pp. 1168–1176 (2016)

  39. Shahriari, B., Swersky, K., Wang, Z., Adams, R.P., de Freitas, N.: Taking the human out of the loop: a review of Bayesian optimization. Proc. IEEE 104(1), 148–175 (2015)

  40. Siivola, E., Vehtari, A., Vanhatalo, J., González, J., Andersen, M.R.: Correcting boundary over-exploration deficiencies in Bayesian optimization with virtual derivative sign observations. In: International Workshop on Machine Learning for Signal Processing (2018)

  41. Snoek, J., Larochelle, H., Adams, R.P.: Practical Bayesian optimization of machine learning algorithms. In: Advances in Neural Information Processing Systems (2012)

  42. Srinivas, N., Krause, A., Kakade, S.M., Seeger, M.W.: Gaussian process optimization in the bandit setting: no regret and experimental design. In: International Conference on Machine Learning (2010)

  43. Hutter, F., Kotthoff, L., Vanschoren, J. (eds.): Automated Machine Learning. TSSCML, Springer, Cham (2019). https://doi.org/10.1007/978-3-030-05318-5

  44. Wang, Q., et al.: ATMSeer: increasing transparency and controllability in automated machine learning. In: CHI Conference on Human Factors in Computing Systems (2019)

  45. Wu, J., Poloczek, M., Wilson, A.G., Frazier, P.I.: Bayesian optimization with gradients. In: Advances in Neural Information Processing Systems (2017)

Download references

Acknowledgments

We thank Matthew Feldman for Spatial support. Luigi Nardi and Kunle Olukotun were supported in part by affiliate members and other supporters of the Stanford DAWN project—Ant Financial, Facebook, Google, Intel, Microsoft, NEC, SAP, Teradata, and VMware. Luigi Nardi was also partially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation. Artur Souza and Leonardo B. Oliveira were supported by CAPES, CNPq, and FAPEMIG. Frank Hutter acknowledges support by the European Research Council (ERC) under the European Union Horizon 2020 research and innovation programme through grant no. 716721. The computations were also enabled by resources provided by the Swedish National Infrastructure for Computing (SNIC) at LUNARC partially funded by the Swedish Research Council through grant agreement no. 2018-05973.

Author information

Correspondence to Artur Souza.

Appendices

A Prior Forgetting Supplementary Experiments

Fig. 6. BOPrO on the 1D Branin function with a decay prior. The leftmost column shows the log pseudo-posterior before any samples are evaluated; in this case, the pseudo-posterior equals the decay prior. The other columns show the model and pseudo-posterior after 0 (random samples only), 10, and 20 BO iterations. Two random samples are used to initialize the GP model.

In this section, we provide additional evidence that BOPrO can recover from wrongly defined priors, complementing Sect. 4.1. Figure 6 shows BOPrO on the 1D Branin function, as in Fig. 3, but with a decay prior. Column (a) of Fig. 6 shows the decay prior and the 1D Branin function. This prior encodes the incorrect belief that the optimum is likely located on the left side of the space, around \(\mathrm {x} = -5\), whereas the true optimum lies at the orange dashed line. Columns (b), (c), and (d) of Fig. 6 show BOPrO on the 1D Branin after \(D+1=2\) initial samples and 0, 10, and 20 BO iterations, respectively. At the beginning of BO, as shown in column (b), the pseudo-posterior is nearly identical to the prior and guides BOPrO towards the left region of the space. As more points are sampled, the model becomes more accurate and steers the pseudo-posterior away from the wrong prior (column (c)). Notably, the pseudo-posterior before \(\mathrm {x} = 0\) falls to 0, as the predictive model is certain there will be no improvement from sampling this region. After 20 iterations, BOPrO finds the optimal region despite the poor start (column (d)). The peak in the pseudo-posterior in column (d) shows that BOPrO will continue to exploit the optimal region, as it is not certain whether the exact optimum has been found. The pseudo-posterior is also high in the high-uncertainty region beyond \(x = 4\), showing that BOPrO will explore that region after it finds the optimum.

Fig. 7. BOPrO on the Branin function with exponential priors for both dimensions. (a) shows the log pseudo-posterior before any samples are evaluated; in this case, the pseudo-posterior equals the prior, and the green crosses are the optima. (b) shows the result of optimization after 3 initialization samples drawn at random from the prior and 50 BO iterations. The dots in (b) show the points explored by BOPrO, with greener points denoting later iterations. The colored heatmap shows the log of the pseudo-posterior \(g(\boldsymbol{x})\). (Color figure online)

Figure 7 shows BOPrO on the standard 2D Branin function. We use exponential priors for both dimensions, which guide optimization towards a region containing only poorly performing, high function values. Figure 7a shows the prior, and Fig. 7b shows the optimization results after \(D+1=3\) initialization samples and 50 BO iterations. Note that, once again, optimization begins near the region favored by the prior but moves away from it and towards the optima as BO progresses. After 50 BO iterations, BOPrO finds all three optimal regions of the Branin function.

B Mathematical Derivations

1.1 B.1 EI Derivation

Here, we provide a full derivation of Eq. (7):

$$\begin{aligned} EI_{f_{\gamma }}(\boldsymbol{x})&:=\int _{-\infty }^{\infty } \max (f_{\gamma } - y, 0) p(y|\boldsymbol{x}) dy = \int _{-\infty }^{f_{\gamma }} (f_{\gamma } - y)\frac{p(\boldsymbol{x}|y) p(y)}{p(\boldsymbol{x})} dy. \end{aligned}$$

As defined in Sect. 3.2, \(p(y < f_{\gamma }) = \gamma \), where \(\gamma \) is a quantile of the observed objective values \(\{y^{(i)}\}\). Then \(p(\boldsymbol{x})= \int _{\mathbb {R}}p(\boldsymbol{x}|y)p(y)dy = \gamma g(\boldsymbol{x}) + (1-\gamma ) b(\boldsymbol{x})\), where \(g(\boldsymbol{x})\) and \(b(\boldsymbol{x})\) are the posteriors introduced in Sect. 3.3. Since \(p(\boldsymbol{x}|y) = g(\boldsymbol{x})\) for \(y < f_{\gamma }\), it follows that

$$\begin{aligned} \int _{-\infty }^{f_{\gamma }} (f_{\gamma } - y) p(\boldsymbol{x}|y) p(y) dy&= g(\boldsymbol{x}) \int _{-\infty }^{f_{\gamma }} (f_{\gamma } - y) p(y) dy \nonumber \\&= \gamma f_{\gamma } g(\boldsymbol{x}) - g(\boldsymbol{x}) \int _{-\infty }^{f_{\gamma }} y p(y) dy, \end{aligned}$$
(8)

so that finally

$$\begin{aligned} EI_{f_{\gamma }}(\boldsymbol{x})&=\frac{\gamma f_{\gamma } g(\boldsymbol{x}) - g(\boldsymbol{x}) \int _{-\infty }^{f_{\gamma }} y p(y) dy}{\gamma g(\boldsymbol{x}) + (1-\gamma ) b(\boldsymbol{x})} \propto \left( \gamma + \dfrac{b(\boldsymbol{x})}{g(\boldsymbol{x})}(1 - \gamma ) \right) ^{-1}. \end{aligned}$$
(9)
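For concreteness, Eq. (9) says that maximizing EI only requires the ratio of the two pseudo-posteriors evaluated at each candidate. A minimal sketch of this proxy (the function name and the numerical guard are our additions; g_x and b_x stand for \(g(\boldsymbol{x})\) and \(b(\boldsymbol{x})\) at the candidate points):

```python
import numpy as np

def ei_proxy(g_x, b_x, gamma=0.05):
    """Quantity proportional to EI_{f_gamma}(x), per Eq. (9): maximizing
    this proxy is equivalent to maximizing EI. g_x and b_x are the
    pseudo-posterior densities g(x) and b(x) at the candidate points."""
    g_x = np.maximum(g_x, 1e-12)  # guard against division by zero (our addition)
    return 1.0 / (gamma + (b_x / g_x) * (1.0 - gamma))
```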

1.2 B.2 Proof of Proposition 1

Here, we provide the proof of Proposition 1:

$$\begin{aligned}&\lim _{t\rightarrow \infty } \mathop {\mathrm {arg\,max}}_{\boldsymbol{x}\in \mathcal {X}} EI_{f_{\gamma }}(\boldsymbol{x}) \end{aligned}$$
(10)
$$\begin{aligned}&= \lim _{t\rightarrow \infty } \mathop {\mathrm {arg\,max}}_{\boldsymbol{x}\in \mathcal {X}} \int _{-\infty }^{f_{\gamma }} (f_{\gamma } - y) p(\boldsymbol{x}|y) p(y) dy \end{aligned}$$
(11)
$$\begin{aligned}&= \lim _{t\rightarrow \infty } \mathop {\mathrm {arg\,max}}_{\boldsymbol{x}\in \mathcal {X}} g(\boldsymbol{x}) \int _{-\infty }^{f_{\gamma }} (f_{\gamma } - y) p(y) dy \end{aligned}$$
(12)
$$\begin{aligned}&= \lim _{t\rightarrow \infty } \mathop {\mathrm {arg\,max}}_{\boldsymbol{x}\in \mathcal {X}} \left( \gamma f_{\gamma } g(\boldsymbol{x}) - g(\boldsymbol{x}) \int _{-\infty }^{f_{\gamma }} y p(y) dy\right) \end{aligned}$$
(13)
$$\begin{aligned}&= \lim _{t\rightarrow \infty } \mathop {\mathrm {arg\,max}}_{\boldsymbol{x}\in \mathcal {X}} \frac{\gamma f_{\gamma } g(\boldsymbol{x}) - g(\boldsymbol{x}) \int _{-\infty }^{f_{\gamma }} y p(y) dy}{\gamma g(\boldsymbol{x}) + (1-\gamma ) b(\boldsymbol{x})} \end{aligned}$$
(14)

which, from Eq. (9), is equal to:

$$\begin{aligned}&= \lim _{t\rightarrow \infty } \mathop {\mathrm {arg\,max}}_{\boldsymbol{x}\in \mathcal {X}} \left( \gamma + \dfrac{b(\boldsymbol{x})}{g(\boldsymbol{x})}(1 - \gamma ) \right) ^{-1} \end{aligned}$$
(15)

We can raise Eq. (15) to the power \(\dfrac{1}{t}\) without changing the expression, since taking the power is a monotone transformation and the argument that maximizes EI therefore does not change:

$$\begin{aligned}&= \lim _{t\rightarrow \infty } \mathop {\mathrm {arg\,max}}_{\boldsymbol{x}\in \mathcal {X}} \left( \gamma + \dfrac{b(\boldsymbol{x})}{g(\boldsymbol{x})}(1 - \gamma ) \right) ^{-\frac{1}{t}} \end{aligned}$$
(16)

Substituting \(g(\boldsymbol{x})\) and \(b(\boldsymbol{x})\) by their definitions in Sect. 3.3 gives:

$$\begin{aligned}&= \lim _{t\rightarrow \infty } \mathop {\mathrm {arg\,max}}_{\boldsymbol{x}\in \mathcal {X}} \left( \gamma + \dfrac{P_b(\boldsymbol{x})\mathcal {M}_b(\boldsymbol{x})^{\tfrac{t}{\beta }}}{P_g(\boldsymbol{x})\mathcal {M}_g(\boldsymbol{x})^{\tfrac{t}{\beta }}}(1 - \gamma ) \right) ^{-\frac{1}{t}}\end{aligned}$$
(17)
$$\begin{aligned}&= \lim _{t\rightarrow \infty } \mathop {\mathrm {arg\,max}}_{\boldsymbol{x}\in \mathcal {X}} \left( \dfrac{P_b(\boldsymbol{x})\mathcal {M}_b(\boldsymbol{x})^{\tfrac{t}{\beta }}}{P_g(\boldsymbol{x})\mathcal {M}_g(\boldsymbol{x})^{\tfrac{t}{\beta }}}(1 - \gamma ) \right) ^{-\frac{1}{t}}\end{aligned}$$
(18)
$$\begin{aligned}&= \lim _{t\rightarrow \infty } \mathop {\mathrm {arg\,max}}_{\boldsymbol{x}\in \mathcal {X}} \left( \dfrac{P_b(\boldsymbol{x})}{P_g(\boldsymbol{x})}\right) ^{-\frac{1}{t}} \left( \dfrac{\mathcal {M}_b(\boldsymbol{x})^{\frac{t}{\beta }}}{\mathcal {M}_g(\boldsymbol{x})^{\frac{t}{\beta }}}\right) ^{-\frac{1}{t}} \left( 1- \gamma \right) ^{-\frac{1}{t}}\end{aligned}$$
(19)
$$\begin{aligned}&= \lim _{t\rightarrow \infty } \mathop {\mathrm {arg\,max}}_{\boldsymbol{x}\in \mathcal {X}} \left( \dfrac{P_b(\boldsymbol{x})}{P_g(\boldsymbol{x})}\right) ^{-\frac{1}{t}} \left( \dfrac{\mathcal {M}_b(\boldsymbol{x})}{\mathcal {M}_g(\boldsymbol{x})}\right) ^{-{\frac{1}{\beta }}} \left( 1- \gamma \right) ^{-\frac{1}{t}}\end{aligned}$$
(20)
$$\begin{aligned}&= \mathop {\mathrm {arg\,max}}_{\boldsymbol{x}\in \mathcal {X}} \left( \dfrac{\mathcal {M}_b(\boldsymbol{x})}{\mathcal {M}_g(\boldsymbol{x})} \right) ^{-\frac{1}{\beta }}\end{aligned}$$
(21)
$$\begin{aligned}&= \mathop {\mathrm {arg\,max}}_{\boldsymbol{x}\in \mathcal {X}} \left( \dfrac{1-\mathcal {M}_g(\boldsymbol{x})}{\mathcal {M}_g(\boldsymbol{x})} \right) ^{-\frac{1}{\beta }}\end{aligned}$$
(22)
$$\begin{aligned}&= \mathop {\mathrm {arg\,max}}_{\boldsymbol{x}\in \mathcal {X}} \left( \dfrac{1}{\mathcal {M}_g(\boldsymbol{x})} - 1 \right) ^{-\frac{1}{\beta }}\end{aligned}$$
(23)
$$\begin{aligned}&= \mathop {\mathrm {arg\,max}}_{\boldsymbol{x}\in \mathcal {X}} \left( \mathcal {M}_g(\boldsymbol{x})\right) ^{\frac{1}{\beta }}\end{aligned}$$
(24)
$$\begin{aligned}&= \mathop {\mathrm {arg\,max}}_{\boldsymbol{x}\in \mathcal {X}} \mathcal {M}_g(\boldsymbol{x})\end{aligned}$$
(25)

This shows that, as iterations progress, the model grows more important: if BOPrO is run long enough, the prior washes out and BOPrO trusts only the probabilistic model. Since \(\mathcal {M}_g(\boldsymbol{x})\) is the Probability of Improvement (PI) under the probabilistic model \(p(y|\boldsymbol{x})\), in the limit, maximizing the acquisition function \(EI_{f_{\gamma }}(\boldsymbol{x})\) is equivalent to maximizing the PI acquisition function on \(p(y|\boldsymbol{x})\). In other words, for high values of t, BOPrO converges to standard BO with a PI acquisition function.
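To see the washout numerically, take the log of the unnormalized pseudo-posterior, \(\log g(\boldsymbol{x}) = \log P_g(\boldsymbol{x}) + \frac{t}{\beta }\log \mathcal {M}_g(\boldsymbol{x})\): the model term grows linearly in t while the prior term stays fixed. A small illustration with made-up densities (the values 0.2 and 0.6 are arbitrary placeholders, not numbers from the paper):

```python
import numpy as np

beta = 10.0
log_prior = np.log(0.2)  # hypothetical prior density P_g(x) at some x
log_model = np.log(0.6)  # hypothetical model probability M_g(x) at the same x

# log g(x) = log P_g(x) + (t / beta) * log M_g(x): the model term grows
# linearly in the iteration t while the prior term stays constant.
for t in [1, 10, 100, 1000]:
    model_term = (t / beta) * log_model
    prior_share = abs(log_prior) / (abs(log_prior) + abs(model_term))
    print(f"t = {t:4d}  prior share of log g(x): {prior_share:.3f}")
```

Running this shows the prior's share of the log pseudo-posterior shrinking from roughly 97% at t = 1 to about 3% at t = 1000, matching the limit derived above.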

C Experimental Setup

Table 1. Search spaces for our synthetic benchmarks. For the Profet benchmarks, we report the original ranges and whether or not a log scale was used.

We use a combination of publicly available implementations for our predictive models. For our Gaussian Process (GP) model, we use GPy's [14] GP implementation with the Matérn 5/2 kernel. We use a different length-scale for each input dimension, learned via Automatic Relevance Determination (ARD) [32]. For our Random Forest (RF) model, we use scikit-learn's RF implementation [35]. We set the fraction of features per split to 0.5, the minimum number of samples for a split to 5, and disable bagging. We also adapt our RF implementation to use the same split selection approach as Hutter et al. [18].
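A minimal sketch of how these two surrogates can be instantiated with the cited libraries (the placeholder data is ours, and the custom split-selection adaptation of Hutter et al. [18] is not shown):

```python
import GPy
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X, y = rng.random((20, 2)), rng.random((20, 1))  # placeholder observations

# GP surrogate: Matern 5/2 kernel with one length-scale per dimension (ARD).
kernel = GPy.kern.Matern52(input_dim=X.shape[1], ARD=True)
gp = GPy.models.GPRegression(X, y, kernel)
gp.optimize()  # fit kernel hyperparameters by maximizing the marginal likelihood

# RF surrogate: half the features per split, at least 5 samples per split,
# bagging disabled (each tree sees the full data set).
rf = RandomForestRegressor(max_features=0.5, min_samples_split=5,
                           bootstrap=False)
rf.fit(X, y.ravel())
```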

Table 2. Search space, priors, and expert configuration for the Shallow CNN application. The default value for each parameter is shown in bold.

For our constrained Bayesian Optimization (cBO) approach, we use scikit-learn’s RF classifier, trained on previously explored configurations, to predict the probability of a configuration being feasible. We then weight our EI acquisition function by this probability of feasibility, as proposed by Gardner et al. [12]. We normalize our EI acquisition function before considering the probability of feasibility, to ensure both values are in the same range. This cBO implementation is used in the Spatial use-case as in Nardi et al. [31].
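A sketch of this feasibility weighting, assuming the EI values for the candidates have already been computed (variable names and placeholder data are ours):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_obs = rng.random((50, 7))       # previously explored configurations
feasible = rng.random(50) < 0.5   # placeholder feasibility outcomes

clf = RandomForestClassifier().fit(X_obs, feasible)

def constrained_ei(X_cand, ei_values):
    """Weight the (normalized) EI by the predicted probability of
    feasibility, following Gardner et al. [12]."""
    p_feasible = clf.predict_proba(X_cand)[:, 1]
    ei_norm = ei_values / (ei_values.max() + 1e-12)  # put EI in [0, 1]
    return ei_norm * p_feasible
```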

For all experiments, we set the model weight hyperparameter to \(\beta = 10\) and the model quantile to \(\gamma = 0.05\); see Appendices I and J. Before starting the main BO loop, BOPrO is initialized by randomly sampling \(D+1\) points from the prior, where D is the number of input variables. We use the public implementation of Spearmint (see footnote 6), which by default uses 2 random samples for initialization. We normalize our synthetic priors before computing the pseudo-posterior, to ensure they are in the same range as our model. We also implement interleaving, which randomly samples a point to explore during BO with a \(10\%\) chance.
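For reference, a minimal sketch of the pseudo-posterior that \(\beta \) enters (cf. Eq. (4) and the expression used in Appendix B.2); prior_x and model_x stand for \(P_g(\boldsymbol{x})\) and \(\mathcal {M}_g(\boldsymbol{x})\) at a candidate point:

```python
def pseudo_posterior(prior_x, model_x, t, beta=10.0):
    """Unnormalized pseudo-posterior g(x) = P_g(x) * M_g(x)^(t / beta):
    the prior density is combined with the model's probability, whose
    exponent grows with the iteration t, so the model gradually
    dominates as evidence accumulates (cf. Appendix B.2)."""
    return prior_x * model_x ** (t / beta)
```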

We optimize our EI acquisition function using a combination of a multi-start local search and CMA-ES [15]. Our multi-start local search is similar to the one used in SMAC [19]. Namely, we start local searches on the 10 best points evaluated in previous BO iterations, on the 10 best performing points from a set of 10,000 random samples, on the 10 best performing points from 10,000 random samples drawn from the prior, and on the mode of the prior. To compute the neighbors of each of these 31 total points, we normalize the range of each parameter to [0, 1] and randomly sample four neighbors from a truncated Gaussian centered at the original value and with standard deviation \(\sigma = 0.1\). For CMA-ES, we use the public implementation of pycma [16]. We run pycma with two starting points, one at the incumbent and one at the mode of the prior. For both initializations we set \(\sigma _0 = 0.2\). We only use CMA-ES for our continuous search space benchmarks.
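A sketch of the two building blocks of this acquisition optimizer, assuming parameters normalized to [0, 1] and an acquisition function that has been negated for minimization (the helper names are ours):

```python
import numpy as np
from scipy.stats import truncnorm
import cma

def neighbors(x, sigma=0.1, k=4, seed=None):
    """Draw k neighbors of a [0, 1]-normalized point x from a truncated
    Gaussian centered at x, as in the multi-start local search."""
    a, b = (0.0 - x) / sigma, (1.0 - x) / sigma  # standardized bounds
    return truncnorm.rvs(a, b, loc=x, scale=sigma,
                         size=(k, len(x)), random_state=seed)

def cma_maximize(neg_acquisition, x0, sigma0=0.2):
    """One CMA-ES run via pycma [16]; neg_acquisition is the negated
    acquisition function, since CMA-ES minimizes."""
    xbest, _ = cma.fmin2(neg_acquisition, x0, sigma0,
                         options={"verbose": -9})
    return xbest

# Example: neighbors of a 2D point, and a CMA-ES run from an incumbent.
print(neighbors(np.array([0.3, 0.9]), seed=0))
xbest = cma_maximize(lambda x: float(np.sum((x - 0.5) ** 2)), x0=[0.3, 0.9])
```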

Table 3. Search space, priors, and expert configuration for the Deep CNN application. The default value for each parameter is shown in bold.

We use four synthetic benchmarks in our experiments.

Branin. The Branin function is a well-known synthetic benchmark for optimization problems [8]; it has two input dimensions and three global minima. Its standard form is sketched below.
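For reference, the textbook definition of Branin (this is the standard formula, not code from the paper):

```python
import numpy as np

def branin(x1, x2):
    """Standard Branin function on x1 in [-5, 10], x2 in [0, 15];
    global minimum value ~0.3979 at (-pi, 12.275), (pi, 2.275),
    and (9.42478, 2.475)."""
    a, b, c = 1.0, 5.1 / (4 * np.pi ** 2), 5.0 / np.pi
    r, s, t = 6.0, 10.0, 1.0 / (8 * np.pi)
    return a * (x2 - b * x1 ** 2 + c * x1 - r) ** 2 \
        + s * (1 - t) * np.cos(x1) + s

print(branin(np.pi, 2.275))  # ~0.3979
```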

SVM. A hyperparameter-optimization benchmark in 2D based on Profet [22]. This benchmark is generated by a generative meta-model built using a set of SVM classification models trained on 16 OpenML tasks. The benchmark has two input parameters, corresponding to SVM hyperparameters.

FC-Net. A hyperparameter and architecture optimization benchmark in 6D based on Profet. The FC-Net benchmark is generated by a generative meta-model built using a set of feed-forward neural networks trained on the same 16 OpenML tasks as the SVM benchmark. The benchmark has six input parameters corresponding to network hyperparameters.

XGBoost. A hyperparameter-optimization benchmark in 8D based on Profet. The XGBoost benchmark is generated by a generative meta-model built using a set of XGBoost regression models in 11 UCI datasets. The benchmark has eight input parameters, corresponding to XGBoost hyperparameters.

The search spaces for each benchmark are summarized in Table 1. For the Profet benchmarks, we report the original ranges and whether or not a log scale was used. However, in practice, Profet’s generative model transforms the range of all hyperparameters to a linear [0, 1] range. We use Emukit’s public implementation for these benchmarks [34].

D Spatial Real-World Application

Spatial [23] is a programming language and corresponding compiler for the design of application accelerators on reconfigurable architectures, e.g., field-programmable gate arrays (FPGAs). These reconfigurable architectures are a type of logic chip that can be reconfigured via software to implement different applications. Spatial provides users with a high level of abstraction for hardware design, so that they can easily design their own applications on FPGAs. It allows users to specify parameters that do not change the behavior of the application but impact the runtime and resource usage (e.g., logic units) of the final design. During compilation, the Spatial compiler estimates the ranges of these parameters as well as the resource usage and runtime of the application for different parameter values. These parameters can then be optimized during compilation to achieve the design with the fastest runtime. We fully integrate BOPrO as a pass in Spatial's compiler, so that Spatial can automatically use BOPrO for this optimization during compilation. This enables Spatial to seamlessly call BOPrO during the compilation of any new application to guide the search towards the best design on an application-specific basis.

Table 4. Search space, priors, and expert configuration for the MD Grid application. The default value for each parameter is shown in bold.

In our experiments, we introduce for the first time the automatic optimization of three Spatial real-world applications: 7D shallow and deep CNNs, and a 10D molecular dynamics grid application. Previous work by Nardi et al. [31] applied automatic optimization of Spatial parameters to a set of benchmarks; here, we focus on real-world applications, raising the bar for state-of-the-art automated hardware design optimization. BOPrO is used to optimize the parameters to find a design that leads to the fastest runtime. The search space for these three applications consists of ordinal and categorical parameters; to best handle these discrete parameters, we implement and use a Random Forest surrogate instead of a Gaussian Process, as explained in Appendix C. These parameters are application-specific and control how much of the FPGA's resources we want to use to parallelize each step of the application's computation. The goal is to find which steps are most important to parallelize in the final design, in order to achieve the fastest runtime. Some parameters also control whether to enable pipeline scheduling, which consumes resources but accelerates runtime, and others focus on memory management. We refer to Koeplinger et al. [23] and Nardi et al. [31] for more details on Spatial's parameters.

The three Spatial benchmarks also have feasibility constraints in the search space, meaning that some parameter configurations are infeasible. A configuration is considered infeasible if the final design requires more logic resources than what the FPGA provides, i.e., it is not possible to perform FPGA synthesis because the design does not fit in the FPGA. To handle these constraints, we use our cBO implementation (Appendix C). Our goal is thus to find the design with the fastest runtime under the constraint that the design fits the FPGA resource budget.

The priors for these Spatial applications take the form of a list of probabilities, containing the probability of each ordinal or categorical value being good. Each benchmark also has a default configuration, which ensures all methods start with at least one feasible configuration. The priors and the default configuration for these benchmarks were provided once by an unbiased Spatial developer, who is not an author of this paper, and kept unchanged during the entire project. The search space, priors, and the expert configuration used in our experiments for each application are presented in Tables 2, 3, and 4.

E Multivariate Prior Comparison

Fig. 8. Log regret comparison of BOPrO with multivariate and univariate KDE priors. The lines and shaded regions show the mean and standard deviation of the log simple regret over 5 runs. All methods were initialized with \(D+1\) random samples, where D is the number of input dimensions, indicated by the vertical dashed line. We run the benchmarks for 200 iterations.

In this section, we compare the performance of BOPrO with univariate and multivariate priors. For this, we construct synthetic univariate and multivariate priors using Kernel Density Estimation (KDE) with a Gaussian kernel. We build strong and weak versions of the KDE priors: the strong priors are computed using a KDE on the best \(10D\) out of \(10{,}000{,}000D\) uniformly sampled points, while the weak priors are computed using a KDE on the best \(10D\) out of \(1{,}000D\) uniformly sampled points. We use the same points for both univariate and multivariate priors. We use scipy's Gaussian KDE implementation, but adapt its Scott's-rule bandwidth to \(100n^{-\frac{1}{d}}\), where d is the number of variables in the KDE prior, to make our priors more peaked.
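A minimal sketch of this construction with scipy (the placeholder sample points are ours; the bandwidth factors follow the formula above with d = D and d = 1, respectively):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
D = 2
n = 10 * D                        # e.g. keep the best 10D points
best_points = rng.random((n, D))  # placeholder for the kept points

# Multivariate prior: one KDE over all D dimensions jointly.
multivariate_prior = gaussian_kde(best_points.T,
                                  bw_method=100 * n ** (-1.0 / D))

# Univariate priors: one independent KDE per dimension (d = 1).
univariate_priors = [gaussian_kde(best_points[:, j],
                                  bw_method=100 * n ** (-1.0))
                     for j in range(D)]

density = multivariate_prior(np.array([[0.5], [0.5]]))  # evaluate at a point
```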

Figure 8 shows a log regret comparison of BOPrO with univariate and multivariate KDE priors. We note that in all cases BOPrO achieves similar performance with univariate and multivariate priors. For the Branin and SVM benchmarks, the weak multivariate prior leads to slightly better results than the weak univariate prior. However, the difference is small, on the order of \(10^{-4}\) and \(10^{-6}\), respectively.

Surprisingly, for the XGBoost benchmark, the univariate versions of both the weak and strong priors lead to better results than their respective multivariate counterparts, though, once again, the difference in performance is small: around 0.2 and 0.03 for the weak and strong prior, respectively, whereas the XGBoost benchmark can reach values as high as \(f(\boldsymbol{x}) = 700\). Our hypothesis is that this difference comes from the bandwidth estimator (\(100n^{-\frac{1}{d}}\)), which leads to larger bandwidths, and consequently smoother priors, when a multivariate prior is constructed.

Fig. 9. Log regret comparison of BOPrO with varying prior quality. The lines and shaded regions show the mean and standard deviation of the log simple regret over 5 runs. All methods were initialized with \(D+1\) random samples, where D is the number of input dimensions, indicated by the vertical dashed line. We run the benchmarks for 200 iterations.

F Misleading Prior Comparison

Figure 9 shows the effect of injecting a misleading prior into BOPrO. We compare BOPrO with a misleading prior, no prior, a weak prior, and a strong prior. For our misleading prior, we use a Gaussian centered at the worst point out of \(10{,}000{,}000D\) uniform random samples. Namely, for each parameter, we inject a prior of the form \(\mathcal {N}(x_{w}, \sigma _w^2)\), where \(x_{w}\) is the value of the parameter at the point with the highest function value out of \(10{,}000{,}000D\) uniform random samples and \(\sigma _w = 0.01\). For all benchmarks, the misleading prior slows down convergence, as expected, since it pushes the optimization away from the optima in the initial phase. However, BOPrO is still able to forget the misleading prior and achieve regret similar to BOPrO without a prior.
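A sketch of such a misleading prior for one parameter, assuming a [0, 1]-normalized range (truncating the Gaussian to the range is our choice for the sketch, not a detail stated in the text):

```python
from scipy.stats import truncnorm

def misleading_prior(x_w, sigma_w=0.01, low=0.0, high=1.0):
    """Gaussian N(x_w, sigma_w^2) centered at the worst observed value
    x_w of a parameter, truncated to the parameter range so the sketch
    yields a proper density on the search space."""
    a, b = (low - x_w) / sigma_w, (high - x_w) / sigma_w
    return truncnorm(a, b, loc=x_w, scale=sigma_w)

prior = misleading_prior(x_w=0.87)       # hypothetical worst parameter value
print(prior.pdf(0.87), prior.pdf(0.5))   # high near x_w, ~0 elsewhere
```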

G Comparison to Other Baselines

Fig. 10. Log regret comparison of BOPrO, SMAC, and TPE. The lines and shaded regions show the mean and standard deviation of the log simple regret over 5 runs. BOPrO was initialized with \(D+1\) random samples, where D is the number of input dimensions, indicated by the vertical dashed line. We run the benchmarks for 200 iterations.

Fig. 11. Log regret comparison of TuRBO with different numbers of trust regions. TuRBO-M denotes TuRBO with M trust regions. The lines and shaded regions show the mean and standard deviation of the log simple regret over 5 runs. TuRBO was initialized with \(D+1\) uniform random samples, where D is the number of input dimensions, indicated by the vertical dashed line. We run the benchmarks for 200 iterations.

We compare BOPrO to SMAC [19], TuRBO [9], and TPE [3] on our four synthetic benchmarks. We use Hyperopt's implementation of TPE (see footnote 7), the public implementation of TuRBO (see footnote 8), and the SMAC3 Python implementation of SMAC (see footnote 9). Hyperopt defines priors as one of a list of supported distributions, including Uniform, Normal, and Lognormal, while SMAC and TuRBO do not support priors on the locality of an optimum in the form of probability distributions.

For the three Profet benchmarks (SVM, FCNet, and XGBoost), we inject the strong priors defined in Sect. 4.2 into both Hyperopt and BOPrO. For Branin, we also inject the strong prior defined in Sect. 4.2 into BOPrO; however, we cannot inject this prior into Hyperopt, because our strong prior for Branin takes the form of a Gaussian mixture peaked at all three optima, and Hyperopt does not support Gaussian mixture priors. Instead, for Hyperopt, we arbitrarily choose one of the optima, \((\pi , 2.275)\), and use a Gaussian prior centered near it. We note that since we compare all approaches based on the log simple regret, both priors are comparable in terms of prior strength, since finding one optimum or all three leads to the same log regret. Also, using Hyperopt's Gaussian priors leads to an unbounded search space, which sometimes leads TPE to suggest parameter configurations outside the allowed parameter range. To prevent these values from being evaluated, we clip values outside the parameter range to the upper or lower range limit, depending on which limit was exceeded. We do not inject any priors into SMAC and TuRBO, since these methods do not support priors about the locality of an optimum.
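A sketch of the Hyperopt prior and the clipping just described (the standard deviations of 0.5 are illustrative assumptions, not values from the paper):

```python
import numpy as np
from hyperopt import hp

# Gaussian prior centered near the chosen Branin optimum (pi, 2.275);
# the standard deviations are illustrative, not taken from the paper.
space = {
    "x1": hp.normal("x1", np.pi, 0.5),
    "x2": hp.normal("x2", 2.275, 0.5),
}

def clip_to_range(x, low, high):
    """hp.normal is unbounded, so suggested values are clipped to the
    nearest range limit before the function is evaluated."""
    return float(np.clip(x, low, high))
```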

Figure 10 shows a log regret comparison between BOPrO, SMAC, TuRBO-2, and TPE on our four synthetic benchmarks. We use TuRBO-2 since it led to the best overall performance among the TuRBO variants; see Fig. 11. BOPrO achieves better performance than SMAC and TuRBO on all four benchmarks. Compared to TPE, BOPrO achieves similar or better performance on three of the four synthetic benchmarks, namely Branin, SVM, and FCNet, and slightly worse performance on XGBoost. We note, however, that the good performance of TPE on XGBoost may be an artifact of clipping values to the range limits, as mentioned above. In fact, the clipping nudges TPE towards promising configurations in this case, since XGBoost has low function values near the edges of the search space. Overall, the better performance of BOPrO is expected, since BOPrO is able to combine prior knowledge with more sample-efficient surrogates.

H Prior Baselines Comparison

Fig. 12. Log regret comparison of BOPrO, Spearmint with prior initialization, and Spearmint with default initialization. The lines and shaded regions show the mean and standard deviation of the log simple regret over 5 runs. BOPrO and Spearmint Prior were initialized with \(D+1\) random samples from the prior, where D is the number of input dimensions, indicated by the vertical dashed line. We run the benchmarks for 200 iterations.

Fig. 13. Log regret comparison of BOPrO, HyperMapper with prior initialization, and HyperMapper with default initialization. The lines and shaded regions show the mean and standard deviation of the log simple regret over 5 runs. BOPrO and HyperMapper Prior were initialized with \(D+1\) random samples from the prior, where D is the number of input dimensions, indicated by the vertical dashed line.

We show that simply initializing a BO method in the DoE phase by sampling from a prior on the locality of an optimum does not necessarily lead to better performance. Instead, in BOPrO, it is the pseudo-posterior in Eq. (4), which combines the prior with new observations, that drives its stronger performance. To show this, we compare BOPrO with Spearmint and HyperMapper as in Sects. 4.2 and 4.3, respectively, but initialize all three methods using the same approach, namely with \(D+1\) samples from the prior. Our goal is to show that simply initializing Spearmint and HyperMapper with the prior does not lead to the same performance as BOPrO, because, unlike BOPrO, these baselines do not leverage the prior after the DoE initialization phase. We report results on both our synthetic and real-world benchmarks.

Figure 12 shows the comparison between BOPrO and Spearmint Prior. In most benchmarks, the prior initialization leads to similar final performance; in particular, for XGBoost, the prior leads to improvement in early iterations but to worse final performance. For FCNet, Spearmint Prior achieves better performance; however, the improvement comes almost solely from sampling from the prior, as there is no improvement for Spearmint Prior until around iteration 190. In contrast, in all cases, BOPrO is able to leverage the prior both during initialization and during its Bayesian optimization phase, leading to improved performance. BOPrO still achieves similar or better performance than Spearmint Prior on all benchmarks.

Figure 13 shows similar results for our Spatial benchmarks. The prior does not lead HyperMapper to improved final performance. For the Shallow CNN benchmark, the prior leads HyperMapper to improved performance in early iterations compared to HyperMapper with default initialization, but HyperMapper Prior is still outperformed by BOPrO. Additionally, the prior leads to degraded performance on the Deep CNN benchmark. These results confirm that BOPrO is able to leverage the prior in its pseudo-posterior during optimization, leading to improved performance on almost all benchmarks compared to state-of-the-art BO baselines.

Fig. 14. Comparison of BOPrO with the strong prior and different values of the \(\gamma \) hyperparameter on our four synthetic benchmarks. We run BOPrO with a budget of 10D function evaluations, including \(D+1\) randomly sampled DoE configurations.

I \(\gamma \)-Sensitivity Study

We show the effect of the \(\gamma \) hyperparameter introduced in Sect. 3.2, the quantile identifying the points considered to be good. To do so, we compare the performance of BOPrO with our strong prior and different \(\gamma \) values. For all experiments, we initialize BOPrO with \(D+1\) random samples and then run BOPrO until it reaches 10D function evaluations. For each \(\gamma \) value, we run BOPrO five times and report the mean and standard deviation.
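For reference, a minimal sketch of the quantile split that \(\gamma \) controls (the function name is ours):

```python
import numpy as np

def gamma_threshold(y_observed, gamma=0.05):
    """f_gamma is the gamma-quantile of the observed objective values
    (Sect. 3.2), so that p(y < f_gamma) = gamma; points below the
    threshold are the ones considered good."""
    f_gamma = np.quantile(y_observed, gamma)
    return f_gamma, y_observed < f_gamma
```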

Figure 14 shows the results of our comparison. We first note that values near the lower and higher extremes lead to degraded performance; this is expected, since such values lead to an excess of either exploitation or exploration. Further, while BOPrO achieves similar performance for all values of \(\gamma \), values around \(\gamma = 0.05\) consistently lead to better performance.

J \(\beta \)-Sensitivity Study

Fig. 15. Comparison of BOPrO with the strong prior and different values of the \(\beta \) hyperparameter on our four synthetic benchmarks. We run BOPrO with a budget of 10D function evaluations, including \(D+1\) randomly sampled DoE configurations.

We show the effect of the \(\beta \) hyperparameter introduced in Sect. 3.3, which controls the influence of the prior over time. To do so, we compare the performance of BOPrO with our strong prior and different \(\beta \) values on our four synthetic benchmarks. For all experiments, we initialize BOPrO with \(D+1\) random samples and then run BOPrO until it reaches 10D function evaluations. For each \(\beta \) value, we run BOPrO five times and report the mean and standard deviation.

Figure 15 shows the results of our comparison. We note that values of \(\beta \) that are too low (near 0.01) or too high (near 1000) often lead to lower performance, showing that putting too much emphasis on either the model or the prior degrades performance, as expected. Further, \(\beta = 10\) leads to the best performance in three of our four benchmarks. This result is reasonable: \(\beta = 10\) means that BOPrO puts more emphasis on the prior in early iterations, when the model is still inaccurate, and slowly shifts emphasis towards the model as it sees more data and becomes more accurate.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Souza, A., Nardi, L., Oliveira, L.B., Olukotun, K., Lindauer, M., Hutter, F. (2021). Bayesian Optimization with a Prior for the Optimum. In: Oliver, N., Pérez-Cruz, F., Kramer, S., Read, J., Lozano, J.A. (eds) Machine Learning and Knowledge Discovery in Databases. Research Track. ECML PKDD 2021. Lecture Notes in Computer Science, vol. 12977. Springer, Cham. https://doi.org/10.1007/978-3-030-86523-8_17


  • DOI: https://doi.org/10.1007/978-3-030-86523-8_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86522-1

  • Online ISBN: 978-3-030-86523-8

