There's a whole class of deterministic global optimization solvers on COIN-OR [0] that don't seem to get noticed outside the Operations Research/applied math community (many are free and open source, e.g. Couenne).
These solvers (BARON [1], ANTIGONE [2], Couenne [3] etc.) are capable of solving nonconvex global optimization problems that are structurally more complicated (i.e. non-ML applications) than the ones presented in the article linked above.
They employ a bunch of mathematically-rigorous methods such as interval analysis, branch-and-cut and McCormick relaxations. There's a treasure trove of little ideas from decades of global optimization research that I feel the Deep Learning community could benefit from taking a peek at.
I've been looking at open source solvers for a while now to solve an MIP, and this one seems to be the best.
There is this annoying thing about the whole ecosystem though, aside from the proprietary bits.
I started by looking at GLPK and wrote my MIP in GMPL. For some reason, a lot of the tools I looked at don't "just work" with it; I have to export the model to some other format first (e.g. `glpsol --model model.mod --data data.dat --wlp model.lp` to get a CPLEX LP file). It works, but it feels like a workaround.
I wish there were a lingua franca in this domain, one supported by all the major solver implementations, but each one seems to have its own way, or just provides an API (ahem, google/or-tools).
I am happy I can make it work, but it may not be as easy for others.
Pyomo is also pretty useful if you'd rather use Python than Julia, though it's not as well-documented and doesn't make installation of the open-source solvers quite as easy for you.
Thanks, but it's not really what I was looking for. I was looking for something that feels closer to a DSL than an API. In OR, a language like AMPL (co-designed by Brian Kernighan, along with Robert Fourer and David Gay) or GAMS is typically used, where you define sets, variables, etc. It's something I feel I can more easily share with co-workers from different backgrounds, since it's closer to the math and follows the whole objective-function-and-constraints organization of OR problems.
If you want to get a feel for it, here's a small knapsack model I sketched in GMPL, which is a subset of AMPL used in GLPK:
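    # A tiny illustrative model (mine, not from any real project);
    # solve with: glpsol --model knapsack.mod
    set ITEMS;
    param value{ITEMS};
    param weight{ITEMS};
    param capacity;

    var take{ITEMS}, binary;

    maximize total_value: sum{i in ITEMS} value[i] * take[i];
    s.t. cap: sum{i in ITEMS} weight[i] * take[i] <= capacity;

    solve;
    display take;

    data;
    set ITEMS := a b c;
    param value := a 10, b 13, c 7;
    param weight := a 4, b 6, c 3;
    param capacity := 8;
    end;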
JuMP and Pyomo are both exactly that. Pyomo is about as near as you can get to an AMPL clone with Python syntax. JuMP is an optimization DSL built on Julia macros.
Maybe I was confusing my terms or something. I think of these as like an API as well, although packaged into a nice library that automates some stuff. So I guess I'll try to describe my ideal scenario instead.
What I was thinking of was a common language that can be used by all solvers. Similar to how C can be compiled by gcc or clang, this language could be used directly with the solver binary. Something like `export CC=my_solver` and `$CC -d data.dat -m model.mod`, if you will.
What a lot of these projects do feels like calling a solver by translating the model into some format that's different for everyone, whereas I wanted the solvers to all agree on some language.
Or at the very least, some standard similar to how languages have libraries for something like XML. It would be nice if JuMP and Pyomo could read these standardized formats.
Then again, it's probably a pain to implement and even C compilers implement different extensions. But one can dream.
See the COIN-OR Optimization Services (OS) project for a solver-independent XML format called OSiL. It's functionally equivalent to the AMPL .nl format but more human-readable. Unfortunately, the OSiL readers for solvers have more performance problems and bugs than .nl readers, since the AMPL format has been around longer and is commercially supported.
The proprietary part of AMPL is the program that turns human-readable .mod and .dat files into low-level .nl encodings of the problem representation. The .nl format is documented, and there's an MIT licensed library on netlib (and now github too) that reads it and evaluates functions, constraints, gradients, Jacobians, Hessians, etc. (AD is super useful and has been around way longer than the deep learning popularization of it.)
Julia has a low-level solver-independent API called MathProgBase that sits between solver wrappers and modeling languages like JuMP / Convex.jl. It's in the process of being rewritten as MathOptInterface, but either way it lets people write solvers in pure Julia that are usable right away with JuMP.
> whereas I wanted the solvers to all agree on some language.
Umm, they (mostly) do. Pyomo supports the AMPL .nl format, which means you can call any AMPL solver from Pyomo (most COIN-OR solvers have AMPL interfaces). The AMPL .nl format is the de facto standard for open-source optimization solvers because it's the only open standard, I believe [0].
So you would define your mathematical program in Pyomo (which isn't a terribly challenging syntax compared to AMPL). When you invoke solve, Pyomo will spit out an .nl file and call a solver (which reads the .nl file, solves the problem, and outputs a .sol file which can be read back into Pyomo).
Incidentally this is exactly how it works in AMPL as well -- all the interop is done through .nl and .sol files. I recommend reading [0] to understand how this works.
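To make this concrete, here's a minimal sketch of the round trip in Pyomo (my own toy LP, not from the docs; it assumes a cbc binary with AMPL/ASL support on your PATH, and solver_io='nl' forces the .nl/.sol route explicitly):

    # Pyomo writes an .nl file, invokes the solver binary, and reads the
    # resulting .sol file back into the model.
    from pyomo.environ import (ConcreteModel, Var, Objective, Constraint,
                               NonNegativeReals, SolverFactory, maximize, value)

    m = ConcreteModel()
    m.x = Var(domain=NonNegativeReals)
    m.y = Var(domain=NonNegativeReals)
    m.obj = Objective(expr=3 * m.x + 2 * m.y, sense=maximize)
    m.c1 = Constraint(expr=m.x + m.y <= 4)
    m.c2 = Constraint(expr=m.x + 3 * m.y <= 6)

    SolverFactory('cbc', solver_io='nl').solve(m)
    print(value(m.x), value(m.y))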
Oh, wow. I couldn't really figure out which parts of AMPL were proprietary and I guess I immediately dismissed it after that without understanding how it worked.
As for Pyomo, while at first glance I dislike the syntax (it's not as easy to share with others, at least compared to something like GNU MathProg, where the receiver doesn't need to concern themselves with Python or Julia), I like Python enough to maybe give it a try. I'll have to learn more about it though. Thanks to you and the others!
I guess it doesn't matter that much, but if there were a common language, it would be a nice feature that would allow you to skip the middleman altogether and have simpler dependencies if you know which solvers you want.
It's the common language that matters more, though. I used GAMS in my undergrad, so I'm under the impression it's had at least some adoption, but it's proprietary AFAIK. I know some people who use AMPL too, which also seems popular, but I'm not sure about its licensing.
You might also be interested in looking at MiniZinc (http://minizinc.org/). It is mostly used in the constraint programming community, but the language is solver-technology neutral and there are back-ends that translate to MIP systems.
Personally I enjoy using MiniZinc as a language for prototyping models in, and enabling me to experiment and compare various different solver technologies easily.
Speaking of optimization libraries, I've been checking out https://developers.google.com/optimization/ for scheduling and assignment problems. The documentation is not... wonderful, but it has a lot of examples and a decent Python API.
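For a flavor of the API, here's a minimal LP sketch along the lines of their examples (my own toy numbers; GLOP is Google's LP solver, and it assumes `pip install ortools`):

    from ortools.linear_solver import pywraplp

    solver = pywraplp.Solver.CreateSolver("GLOP")
    x = solver.NumVar(0, solver.infinity(), "x")
    y = solver.NumVar(0, solver.infinity(), "y")
    solver.Add(x + y <= 4)
    solver.Add(x + 3 * y <= 6)
    solver.Maximize(3 * x + 2 * y)

    if solver.Solve() == pywraplp.Solver.OPTIMAL:
        print(x.solution_value(), y.solution_value())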
I recently did an "Intro to OR" class and most of the tools we used were proprietary. I've been looking for more open source tools (and more info on OR in general, I think it's fascinating!).
I highly recommend taking a look at the COIN-OR (Computational Infrastructure for Operations Research) project page. All the code is open-source. There are some amazing solvers in that list.
https://www.coin-or.org/projects/
The only issue is that much of the code on that site is very cutting-edge and academic. You may need to read a few benchmarks/publications to figure out which solvers are production quality and which are not.
That said, a handful of these free solvers are so well-written that they exceed the performance of many commercial solvers for certain types of optimization problems.
(A notable exception is MIPs. Commercial MIP solvers like Gurobi or CPLEX are orders of magnitude better than the open-source CBC or GLPK. CBC is still pretty ok though, and I bundle it when I need to include a free solver in my code to solve smaller problems.)
I've used the linear programming bit of COIN-OR - Clp and Cbc. They may not be as good as the commercial solvers, but they're still way better than the other open-source options. That is particularly impressive given that, AIUI, they are largely the work of one man, John Forrest.
For general optimisation, I've used NLopt. It was very easy to use, and has a pretty sensible set of algorithms.
One thing I've learnt is that comparing optimisation algorithms is really hard. One might be better at one problem, one at another. One pretty general axis of variation is problem size: some algorithms scale to orders of magnitude more variables (or constraints etc) than others.
In particular, I get the impression that a lot of the cutting-edge research is about being able to solve huge problems at all, or in hours instead of days. Perhaps those algorithms will be more useful for ML hyperparameter optimisation, but unfortunately, my needs are about solving much simpler problems in a fraction of a second. I get a lot more excited about Michael Powell's algorithms, which are cunning but simple, and work well on my problems.
> They may not be as good as the commercial solvers, but they're still way better than the other open-source options.
Agreed. CBC/Clp are the best open-source LP/MIP solvers out there. SCIP is better, but it's not fully free.
> For general optimisation, I've used NLopt. It was very easy to use, and has a pretty sensible set of algorithms.
I noticed the omission of IPOPT support in NLopt. IPOPT is one of the best large-scale nonconvex NLP solvers out there (if you have exact 2nd-order derivative information, typically from AD).
> One thing I've learnt is that comparing optimisation algorithms is really hard. One might be better at one problem, one at another. One pretty general axis of variation is problem size: some algorithms scale to orders of magnitude more variables (or constraints etc) than others.
Very true. The academic world has standardized on Dolan-Moré performance profiles for comparing optimizers on a standard corpus of problems (for each solver, you plot the fraction of problems it solves within a factor τ of the best solver's time), and those are the closest we have to quantifying general performance. But the no-free-lunch theorem applies. There's also a lot of randomness and chance involved, e.g. initialization points can drastically affect nonconvex solver performance on the same problem. So can little things like the ordering of constraints, or the addition of redundant non-binding constraints (!) for MIPs (this was demonstrated by Fischetti). That's why modern MIP solvers include a random perturbation parameter.
> ML hyperparameter optimisation
Most types of hyperparameter optimization are black-box optimization problems with no analytic derivative information, which much of the research in deterministic optimization does not address. Black-box problems typically rely on DFO (Derivative-Free Optimization) methods, which are even harder to quantify in terms of performance. The Nelder-Mead simplex method is as good a method as any.
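(If anyone wants to try it on their own black box, here's a minimal sketch using SciPy's implementation; the quadratic objective is just a stand-in for a real black-box function:)

    import numpy as np
    from scipy.optimize import minimize

    def f(x):  # stand-in black box; the method uses no derivatives
        return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2

    res = minimize(f, x0=np.array([0.0, 0.0]), method="Nelder-Mead")
    print(res.x, res.fun)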
Interesting! My problems are derivative-free, and my gut feeling is that a lot of real-world engineering-type problems are, because at some point they involve some gnarly nonlinear maths. In my case, my variables are the positions of some knots from which I construct a monotonic cubic spline, which I then use as a parameter to another model; my objective function is how well that model reproduces some real measurements. I get a headache just writing it all down.
I've had good results from simplex; I'm a bit sad to hear there isn't some magical algorithm that will give me a massive speedup over it.
I haven't tried automatic differentiation. That seems extremely daunting given the volume of code in my problem. Someone has done some groundwork on AD of the main library I use in my model, but it would still be a serious bit of work.
Optimal knot placement, though seemingly simple, can be a gnarly problem due to degeneracy in many cases. Anecdotally, I've found quadratic splines to still be somewhat tractable with exact methods, but splines of cubic or higher order can be tricky, especially at large scale.
For derivative-free optimization, you can sometimes get speedups just by adopting an embarrassingly parallel algorithm like Particle Swarm. These types of stochastic algorithms aren't necessarily algorithmically better than Nelder-Mead simplex, but they can explore a larger terrain in parallel, which may help. If you can find some parallel stochastic code, you should be able to spin up a VM on the cloud with lots of cores to help you search the space.
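A minimal sketch of what I mean (my own toy code, not from any library): within each swarm iteration the particles are evaluated independently, so the objective calls can be farmed out to a process pool.

    import numpy as np
    from multiprocessing import Pool

    def objective(x):                    # stand-in; plug in your own black box
        return float(np.sum(x ** 2))

    def pso(f, dim=10, n=32, iters=200, lo=-5.0, hi=5.0,
            w=0.7, c1=1.5, c2=1.5, seed=0):
        rng = np.random.default_rng(seed)
        x = rng.uniform(lo, hi, (n, dim))              # particle positions
        v = np.zeros_like(x)                           # particle velocities
        with Pool() as pool:
            fx = np.array(pool.map(f, x))              # parallel evaluations
            pbest, pbest_f = x.copy(), fx.copy()       # per-particle bests
            g = pbest[np.argmin(pbest_f)].copy()       # global best
            for _ in range(iters):
                r1, r2 = rng.random(x.shape), rng.random(x.shape)
                v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
                x = np.clip(x + v, lo, hi)
                fx = np.array(pool.map(f, x))          # parallel evaluations
                better = fx < pbest_f
                pbest[better], pbest_f[better] = x[better], fx[better]
                g = pbest[np.argmin(pbest_f)].copy()
        return g, float(pbest_f.min())

    if __name__ == "__main__":
        best_x, best_f = pso(objective)
        print(best_f)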
No, but I was a big-time user and really pushed it to its limits with my large-scale nonconvex problems. I never actually tinkered with its internals but I did feed it some very unorthodox formulations. :)
John Forrest is a hell of a character too if you ever talk to him in person. He's retired to a villa in Italy but still works on Clp, Cbc, and various consulting projects for fun: http://fastercoin.com/f2_files/about.html
I'm not working on linear programming stuff at the moment, but I am seriously thinking that if I ever get back to it, I'll ask my boss if we can rustle up some budget to hire him!
>One thing I've learnt is that comparing optimisation algorithms is really hard
It still boggles my mind how the simplex method, something that theoretically runs worse than interior-point methods, is competitive (at least based on what I've read online). I guess with these problems, the particular instance has a huge effect on how long each approach takes.
>In particular, I get the impression that a lot of the cutting-edge research is about being able to solve huge problems at all
This is also the impression I get. When I look at benchmarks (btw, does anyone know of an up-to-date publication for these? A lot of what I find seems to be from years ago. I'm not sure how fast these develop, but it would be nice to get updates once in a while), they always have some measure of time along with the number of instances solved.
>Perhaps those algorithms will be more useful for ML hyperparameter optimisation
I thought about this, and maybe the reason they largely don't use these has to do with getting results that are good enough (generalize well). A global optimum might not be worth the effort, so they stick to some variant of gradient descent. The measure they look at, after all, is performance on the test set.
Aside from that, there may be something specific about a well-defined problem that they can use to speed computations up, and a more general approach probably can't assume those properties for other instances of NLP, for example.
> It still boggles my mind how the simplex method, something that theoretically runs worse than interior-point methods, is competitive (at least based on what I've read online). I guess with these problems, the particular instance has a huge effect on how long each approach takes.
Well, theoretical worst cases are just that: worst-case bounds. People think NP-hard problems are intractable, but this is a misunderstanding. Many NP-hard problems can actually be solved with fairly good average-case performance.
Case in point: the simplex method is worst-case exponential time, but its average-case performance is actually pretty good.
When Khachiyan first came up with his ellipsoid method, it was polynomial time, but in practice it was too slow. Karmarkar's interior-point algorithm was polynomial time too, but performs much more efficiently in practice.
These days though, solve times are predominantly affected by solver heuristics and randomness. The choice of Simplex vs Interior Point does not make a significant difference in many cases.
I do not have anything to contribute regarding this specific optimization algorithm, but I want to recommend the dlib [0] library it comes from, which is a very high-quality library with machine learning algorithms, numerical analysis routines, image processing, networking, threading, a GUI toolkit, and a few other things.
The (sole, afaik) library author and maintainer Davis King is also extremely responsive and helpful. dlib and Davis deserve a lot more recognition.
Of course, global optimization using the Lipschitz constant has been around for decades. As far as I understand, this paper also estimates the constant. Is that the novelty, or is it the convergence proofs?
Another method for global optimization is interval arithmetic, but it is not a black-box method.
Yes indeed! (to this stuff being around a while) And then, what it can do is a subset of the interval toolkit (also around a while). Somehow I get the idea that interval methods are underappreciated and thus lesser known - anecdotally, it seems people hear of the outward-rounding bounds-growth problem and go no further (maybe evaluation efficiency compared to floats is the other gripe).
"Interval algorithms for constrained and unconstrained optimization are based on adaptive, exhaustive search of the domain. Their overall structure is virtually identical to Lipschitz optimization as in [4], since interval evaluations of an objective function ϕ over an interval vector x correspond to estimation of the range of ϕ over x with Lipschitz constants. However, there are additional opportunities for acceleration of the process with interval algorithms, and use of outwardly rounded interval arithmetic gives the computations the rigor of a mathematical proof."
I'm also curious what this adds that's new. I'll add one other bit: if anyone ever claims they have an efficient global optimization algorithm for continuous optimization, ask whether or not P=NP. We can encode the integer constraint x \in {0,1} as x(x-1)=0, so if we can find the global optimum of a continuous problem efficiently, then we can do the same for a discrete one. That's probably unlikely, so I'm always a bit skeptical.
By the way, the parent mentioned interval arithmetic. If anyone is looking for an optimization algorithm, look up "interval Newton". It's a cool algorithm.
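If anyone wants a taste, here's a bare-bones 1-D sketch of the idea (my own toy code; real interval codes use outwardly rounded arithmetic, which is what makes the result a proof, and they split the interval when the derivative enclosure contains zero):

    def interval_newton(f, dF, lo, hi, tol=1e-12):
        # dF(lo, hi) must return (dlo, dhi), an interval enclosing f'(x)
        # over all of [lo, hi]; here we assume 0 is outside that enclosure.
        for _ in range(100):  # cap iterations; convergence is quadratic
            if hi - lo <= tol:
                break
            m = 0.5 * (lo + hi)
            dlo, dhi = dF(lo, hi)
            q = sorted((f(m) / dlo, f(m) / dhi))
            # Newton operator N(X) = m - f(m)/F'(X), intersected with X:
            nlo, nhi = max(lo, m - q[1]), min(hi, m - q[0])
            if nlo > nhi:
                raise ValueError("interval proven to contain no root")
            lo, hi = nlo, nhi
        return lo, hi

    # Example: enclose sqrt(2) as the root of x^2 - 2 on [1, 2];
    # f'(x) = 2x is enclosed by [2*lo, 2*hi] on a positive interval.
    print(interval_newton(lambda x: x * x - 2.0,
                          lambda lo, hi: (2.0 * lo, 2.0 * hi), 1.0, 2.0))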
Not all continuous optimization algorithms can handle arbitrary equality constraints, especially if they effectively make the optimization domain discrete. You might be able to do an encoding of NP-complete problems that is purely continuous, but if I read the LIPO paper correctly, there is a known exponential lower bound for their problem, so that would just make the NP-complete problem harder.
Instead, the algorithm promises to actually match that lower bound, while always being at least as good as random search, and offering exponentially large improvements in certain cases. Not exactly the kind of claim that has implications for P =? NP.
Interval Newton is probably my favorite algorithm. Intervals in concert with the fixed point theorems - amazing stuff. (and all the rest! - contractors, inclusion/exclusion tests exploiting derivative info, etc.) Good tools for optimization, indeed.
> If anyone ever claims they have an efficient global optimization algorithm for continuous optimization, ask whether or not P=NP
Why? Continuous global optimization on convex problems can be solved in polynomial time. Did you mean on nonconvex problems?
Also, x*(x-1)=0 is a complementarity constraint, and it is mathematically degenerate (it violates constraint qualifications) when solved as a continuous optimization problem. There are many ways to work around it [0, 1, 2], and even solvers built for this class of problems (MPECs = Mathematical Programs with Equilibrium Constraints), which I suspect only work well for narrow classes of problems like Linear Complementarity Problems (LCPs).
I have found that, ultimately, for nonlinear nonconvex problems, solving complementarity-constrained problems with continuous optimization techniques is empirically unreliable, or the solution is often low quality (even if it can be shown that the complementarity penalty method does not violate constraint qualifications).
Bilinear constraints generally introduce nonconvexities and create a lot of difficult terrain for a continuous optimizer.
For general nonconvex problems, it's much more reliable to cast the problem as an NP-hard MINLP.
And if you're solving an LCP, you might as well recast it as an MIP rather than mucking around with an LCP solver. MIP solvers like Gurobi and CPLEX are so amazingly performant these days that it's just a no-brainer.
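For concreteness, here's roughly what that recasting looks like (a sketch of mine in Pyomo syntax): the complementarity condition 0 <= x ⊥ y >= 0 (i.e. x*y = 0) becomes linear constraints with one big-M binary, which any MIP solver handles natively. M is an assumed valid a-priori bound on x and y.

    from pyomo.environ import ConcreteModel, Var, Constraint, Binary, NonNegativeReals

    M = 100.0                      # assumed upper bound on both x and y
    m = ConcreteModel()
    m.x = Var(domain=NonNegativeReals)
    m.y = Var(domain=NonNegativeReals)
    m.z = Var(domain=Binary)       # z = 0 forces x = 0; z = 1 forces y = 0
    m.cx = Constraint(expr=m.x <= M * m.z)
    m.cy = Constraint(expr=m.y <= M * (1 - m.z))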
Oh I very much agree that good MIP solvers work well for complementarity problems. And, I suppose my point about P=NP was that if we had a general scalable global optimization solver, then it would have all sorts of implications outside of continuous optimization. I didn't qualify convex vs nonconvex and, of course, many convex problems can be solved in polynomial time. Though, funny enough, not all. De Klerk and Pasechnik have a paper, "Approximation of the stability number of a graph via copositive programming," where they prove that NP-hard problems can be encoded as convex problems. Pasechnik has a nice explanation here:
Anyone considering using this for continuous hyper-parameter optimization in machine learning should be warned that the automatic Lipschitz constant optimization (the novelty of the method) will very likely not work well with a noisy objective function (any stochastic learning algorithm).
edit: Any grad students looking for a paper idea might consider extending this result to allow stochastic objective functions satisfying a stochastic Lipschitz condition.
> However, if you want to use LIPO in practice there are some issues that need to be addressed. The rest of this blog post discusses these issues and how the dlib implementation addresses them. First, if f(x) is noisy or discontinuous even a little it's not going to work reliably since k will be infinity.
> ... Now each sample from f(x) has its own noise term, σi, which should be 0 most of the time unless xi is really close to a discontinuity or there is some stochasticity.
The strategy that the dlib implementation uses of combining a global optimization method (LIPO) with one that works better locally (TR) reminds me of another recent paper I ran across, which takes a black-box global optimization algorithm (StoSOO) and improves local convergence behavior using CMA-ES [1]. A main difference seems to be that StoSOO focuses on robustness to noisy functions more than LIPO does.
I'd like to share another global optimization algorithm 'worth using' that I've recently discovered for myself: Hyperband [0]. It's extremely simple to implement (a dozen lines of code), is based on the successive-halving principle, and is particularly suitable for iterative machine learning algorithms where early stopping is applicable (e.g. NNs, GBMs).
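Its core is the successive-halving loop; here's a sketch of my understanding (`sample_config` and `train(config, budget)` are hypothetical stand-ins, with `train` returning a validation loss after training for `budget` units):

    def successive_halving(sample_config, train, n=27, r=1, eta=3):
        configs = [sample_config() for _ in range(n)]
        budget = r
        while len(configs) > 1:
            # Evaluate every surviving config at the current budget,
            # keep the best 1/eta fraction, then grow the budget by eta.
            scored = sorted(((train(c, budget), c) for c in configs),
                            key=lambda t: t[0])
            configs = [c for _, c in scored[: max(1, len(configs) // eta)]]
            budget *= eta
        return configs[0]

(Hyperband proper runs several such brackets with different n/r trade-offs.)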
I don't speak fluent maths, but I'm extremely interested. Is anyone in a position to explain what the initial equation for U(x) means?
I'll be in your debt forever!
A Lipschitz function is one that doesn't change too quickly: the difference between two values |f(x) - f(y)| is at most as large as the distance between x and y times a constant k, i.e. |f(x) - f(y)| <= k|x-y|. In particular this implies that:
f(x) - f(y) <= k|x-y|
hence
f(x) <= f(y) + k|x-y|.
Therefore the function:
u(x) = f(y) + k|x-y|
is an upper bound for the function f(x). Repeating this for multiple points x1, x2, x3, ..., xt, you can construct a stricter upper bound by taking the minimum:

U(x) = min over i = 1..t of ( f(x_i) + k|x - x_i| )
U(x) doesn't require evaluating f at every x; it only uses the values at certain points x1, x2, x3, ..., xt. It then gives an upper bound on what f(x) might be for any x between the ones you actually calculated: for each xi where you know f(xi), take the distance between x and xi, multiply it by the maximum possible slope k, and add the result to f(xi). The smallest of these sums is the bound on how high f(x) can be.
Consider the case where f is an elementary function of one variable and you know that its slope between any two points is at most k and at least -k. And suppose you want an upper bound on f(0) but you know the values of f(-2) and f(3). Then f(0) <= f(-2) + 2 * k and f(0) <= f(3) + 3 * k. Taking the minimum of these two right-hand sides leads to a formula analogous to the one for U(x) that you're asking about (with the norm || x - x_i || taking the place of |0 - (-2)| and |0 - 3| in our example).
If you can program, you can read maths (understanding its purpose is a different issue).
This formula defines a function U(x). That means it gives a recipe that takes a number x and produces a number U(x). The recipe depends on an already-prepared set of ingredients: a set of numbers x1, x2, ..., xt, a constant k, and another function f. Now, given a number x, to compute U(x) you first compute all the numbers f(xi) + k*|x-xi|, and then you pick the smallest one. That smallest value is U(x).
The graph below the formula displays its interpretation. The graph of the function f is the red curve, the points (xi, f(xi)) are the black dots, and U(x) is in green.
The doc uses x to refer to the new point, and x_i for the i-th previously evaluated point. To reduce ambiguity, I'll use a capital X to refer to the new point.
The equation says that U(X) is equal to the minimum value of the expression f(x_i) + k * (Euclidean distance from x to x_i), where i ranges over integers 1 to t.
In Go pseudo-code:
// U computes min over the sampled points x_i of f(x_i) + k*||X - x_i||,
// i.e. the Lipschitz upper bound at the new point X.
func U(X []float64, x [][]float64, f func([]float64) float64, k float64) float64 {
    min := math.Inf(1)
    for _, x_i := range x {
        e := f(x_i) + k*EuclideanDistance(X, x_i)
        if e < min {
            min = e
        }
    }
    return min
}
Can someone explain the intuition behind how the Lipschitz constant k is chosen?
It seems like even in his demo video, k starts off underestimated, which makes the upper bound not an upper bound (f is sometimes above the green line U), and it's only slowly corrected towards the end. Why does it still work?
Would it be possible to make the hyperparameters optimizable? Maybe it's the biologist in me, but I can't really see a reason why you couldn't just let the algorithm decide for itself which values of those parameters would work best.
Maybe I wasn't clear, or misunderstood some of the article, but I thought the article was about optimising the hyperparameters before the machine learning part happens. I meant optimising the parameters during the machine learning iterations.
[0] https://neos-server.org/neos/solvers/index.html
[1] http://archimedes.cheme.cmu.edu/?q=baron
[2] http://helios.princeton.edu/ANTIGONE/index.html
[3] https://www.coin-or.org/Couenne/