A Minimizer Far, Far Away

A few recent arXiv papers and some conversations during my lectures made me realize that some optimization people might not be fully aware of important details about SGD when it is used on functions where the minimizer can be arbitrarily far from the initialization, or even does not exist at all. So, let’s talk about it.

First of all, when does this happen? Well, in machine learning it is very common. For example, it happens if you run logistic regression on a separable dataset, if you run SGD with universal kernels and no repeated data points, or even in deep learning when we assume the so-called interpolation assumption and use cross-entropy + softmax. In all these cases, the minimizer does not exist because we can make the training error arbitrarily close to {\inf_{{\boldsymbol x}} f({\boldsymbol x})} by increasing the norm of {{\boldsymbol x}} and keeping its direction fixed. In more intuitive terms, we can say that the minimizer is “at infinity”.

So, is this a weak or a strong assumption? I’ll let you decide, but for sure if you believe that the interpolation assumption is reasonable, then you also have to believe that assuming the existence of minimizers might not be reasonable in machine learning…

What happens to Stochastic Gradient Descent (SGD) on convex functions in these cases? Consider for simplicity the case in which the norms of the stochastic subgradients are bounded by {L}, and use SGD with learning rate {\frac{\alpha}{L\sqrt{T}}} starting from {{\boldsymbol x}_1}. (Of course, you can use weaker assumptions, but the core idea will stay the same.) Then, if we take a look at the rates people normally state, we have

\displaystyle \mathop{\mathbb E}[f(\bar{{\boldsymbol x}}_T)]-f({\boldsymbol x}^\star) \leq \frac{L}{2\sqrt{T}}\left(\frac{\|{\boldsymbol x}_1-{\boldsymbol x}^\star\|^2}{\alpha}+ \alpha\right),

where {\bar{{\boldsymbol x}}_T} is the average of the iterates, {{\boldsymbol x}^\star} is the optimal solution, and {\alpha} is the usual hyperparameter in the learning rate that you have to choose. From this rate, we might be tempted to think that if {{\boldsymbol x}^\star} is really far (“in a galaxy far, far away…”), the rate becomes arbitrarily bad. Even worse, we might be tempted to think that if {{\boldsymbol x}^\star} does not exist, that is, it is at infinity, the bound becomes vacuous and we cannot guarantee anything. Now, suppose that your algorithm requires an upper bound on {\|{\boldsymbol x}_1-{\boldsymbol x}^\star\|} to function properly and its rate depends on it: what would happen? In the best case, the convergence guarantee would be vacuous; in the worst case, you would not even be able to run the algorithm! Yet, one might not be worried, thinking that SGD suffers from the same problem.
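As a side note, here is a minimal sketch of the averaged SGD variant the bound above refers to, just to fix ideas. This is my own illustration, not code from any paper: `stochastic_subgrad` is a placeholder for whatever unbiased subgradient oracle your problem provides, and the other names are arbitrary.

```python
import numpy as np

def averaged_sgd(stochastic_subgrad, x1, L, alpha, T, seed=0):
    """SGD with constant learning rate alpha / (L * sqrt(T)), returning the
    average of the query points x_1, ..., x_T."""
    rng = np.random.default_rng(seed)
    x = np.array(x1, dtype=float)
    x_bar = np.zeros_like(x)
    eta = alpha / (L * np.sqrt(T))       # constant step size tuned for the horizon T
    for _ in range(T):
        x_bar += x / T                   # running average of the iterates
        g = stochastic_subgrad(x, rng)   # unbiased stochastic subgradient at x
        x = x - eta * g
    return x_bar
```

For the logistic regression example discussed below, for instance, `stochastic_subgrad` would sample one training point and return the gradient of its loss.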

If you believe any of the above, it might be that you missed some very important details of the theory of SGD. So, let’s try to clarify this point.

It turns out that the above rate is not the best we can prove. What we can actually prove is

\displaystyle \mathop{\mathbb E}[f(\bar{{\boldsymbol x}}_T)]- f({\boldsymbol u}) \leq \frac{L}{2\sqrt{T}}\left(\frac{\|{\boldsymbol x}_1-{\boldsymbol u}\|^2}{\alpha}+ \alpha\right), \forall {\boldsymbol u} \in {\mathbb R}^d~.

The reason is simple: nowhere in the proof do we use any property of the minimizer, so we can state the convergence guarantee with respect to any arbitrary point. The critical difference is that this bound holds for any {{\boldsymbol u}}, rather than only for {{\boldsymbol x}^\star}. In turn, this implies a non-vacuous guarantee even when {{\boldsymbol x}^\star} does not exist. In fact, let me rewrite it as

\displaystyle \mathop{\mathbb E}[f(\bar{{\boldsymbol x}}_T)] \leq \frac{\alpha L}{2\sqrt{T}} + \min_{{\boldsymbol u}} \ f({\boldsymbol u})+\frac{L\|{\boldsymbol x}_1-{\boldsymbol u}\|^2}{2\alpha \sqrt{T}} ~.

What does it mean? Well, it means that the rate SGD guarantees is better than what one might have thought. Indeed, we can guarantee that the expected value of the objective function will be at most {\frac{\alpha L}{2\sqrt{T}}} plus the minimum of a regularized version of the objective, where the strength of the regularization depends on the number of iterations and on {\alpha}. Take a moment to digest this, because it is deeper than it looks at first sight. It says that it might not matter how far {{\boldsymbol x}^\star} is: indeed, it might even be at infinity! Of course, a similar reasoning also holds when the minimizer does exist but is far, far away. So, this is not a corner case; rather, it covers a big chunk of the interesting cases we encounter. Indeed, if the optimal solution is not far and the Lipschitz constant is relatively small, then the initial point would already be a good solution to the optimization problem. Let me also add that this is exactly the reason why the learning rate acts as a regularizer when you use SGD in a single pass over your data. But this is another story.
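To make this concrete with numbers, here is a tiny sketch, entirely my own addition with arbitrary values of {\alpha}, {L}, and {T}, that computes the regularized minimum appearing in the bound for the 1-dimensional convex function {f(u)=\ln(1+\exp(-u))}, whose infimum 0 is approached only as {u \rightarrow \infty}. The regularized minimum is always finite and shrinks as {T} grows, even though {f} has no minimizer.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def f(u):
    # A convex function with no minimizer: f(u) -> 0 only as u -> +infinity
    return np.log1p(np.exp(-u))

x1, L, alpha = 0.0, 1.0, 1.0
for T in [10, 100, 10_000, 1_000_000]:
    lam = L / (2 * alpha * np.sqrt(T))          # regularization strength in the bound
    res = minimize_scalar(lambda u: f(u) + lam * (u - x1) ** 2)
    print(f"T={T:>9}:  regularized min = {res.fun:.5f}  attained at u = {res.x:.2f}")
```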

Note that I wrote {\min_{{\boldsymbol u}}} and not {\inf_{{\boldsymbol u}}} because {f({\boldsymbol u})+\frac{L\|{\boldsymbol x}_1-{\boldsymbol u}\|^2}{2\alpha \sqrt{T}}} always has a minimizer for any convex {f}: {\frac{L\|{\boldsymbol x}_1-{\boldsymbol u}\|^2}{2\alpha \sqrt{T}}} is strongly convex in {{\boldsymbol u}}, convex + strongly convex = strongly convex, and a strongly convex function always attains its minimum.

So, we have a non-vacuous bound in all cases, but how good is it? We have to work out the minimum over {{\boldsymbol u}} in each specific case. To make things more concrete, let’s consider a simple example: logistic regression on a separable dataset. So,

\displaystyle f({\boldsymbol x}) = \frac{1}{N}\sum_{i=1}^N \ln(1+ \exp(-y_i \langle {\boldsymbol x}, {\boldsymbol z}_i\rangle)),

where {{\boldsymbol z}_i \in {\mathbb R}^d} are your input data and {y_i \in \{-1,1\}} the corresponding labels. Let’s also consider the simplest stochastic oracle for this problem: in each iteration, we take one training sample {({\boldsymbol z}_i, y_i)} at random. Now, if the problem is linearly separable, it means that there exists {{\boldsymbol u}^\star} such that {y_i \langle {\boldsymbol u}^\star, {\boldsymbol z}_i\rangle\geq 1} for all the samples in your training set. I want to stress that {{\boldsymbol u}^\star} is not the optimal solution. Instead, this means that {f(\beta {\boldsymbol u}^\star)} goes to 0 when {\beta \rightarrow \infty}. In other words, {\frac{{\boldsymbol u}^\star}{\|{\boldsymbol u}^\star\|}} is the direction of the “optimal solution at infinity”. In this case, we also have {\inf_{{\boldsymbol x}} \ f({\boldsymbol x})=0}.
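Just to see this setup in action, here is a small self-contained sketch, again my own illustration with an arbitrary toy dataset and arbitrary choices of {\alpha} and {T}, of SGD with exactly this oracle on a separable problem: the averaged iterate drives the loss toward 0 while its norm keeps growing, since the “optimal solution” is at infinity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny linearly separable toy dataset: z_i in R^2, labels y_i in {-1, +1}
Z = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def full_loss(x):
    # f(x) = (1/N) * sum_i ln(1 + exp(-y_i <x, z_i>))
    return np.mean(np.log1p(np.exp(-y * (Z @ x))))

def stochastic_subgrad(x):
    # Oracle: pick one training sample at random and return the gradient of its loss
    i = rng.integers(len(y))
    margin = y[i] * (Z[i] @ x)
    return -y[i] * Z[i] / (1.0 + np.exp(margin))

T = 100_000
L = np.max(np.linalg.norm(Z, axis=1))   # bound on the norms of the stochastic subgradients
alpha = 10.0
eta = alpha / (L * np.sqrt(T))
x, x_bar = np.zeros(2), np.zeros(2)
for _ in range(T):
    x_bar += x / T                       # average of the query points
    x = x - eta * stochastic_subgrad(x)

print(f"f(x_bar) = {full_loss(x_bar):.4f},  ||x_bar|| = {np.linalg.norm(x_bar):.2f}")
```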

What is the guarantee of SGD in this case? For simplicity, let’s set {{\boldsymbol x}_1=\boldsymbol{0}}; the general case is equally simple. The first guarantee, the one that assumes the existence of {{\boldsymbol x}^\star}, would clearly be vacuous. However, the second one is still valid, even if it is not immediately clear what it guarantees. So, consider the minimization on the right-hand side and let’s massage it a little bit: for any {\beta \geq 0}, we can upper bound it as

\displaystyle \begin{aligned} \min_{{\boldsymbol u}} \ f({\boldsymbol u})+\frac{L \|{\boldsymbol x}_1-{\boldsymbol u}\|^2}{2\alpha \sqrt{T}} &\leq f(\beta {\boldsymbol u}^\star)+\frac{L\|\beta{\boldsymbol u}^\star\|^2}{2\alpha \sqrt{T}}\\ &= \frac{1}{N}\sum_{i=1}^N \ln(1+ \exp(-y_i \beta \langle {\boldsymbol u}^\star, {\boldsymbol z}_i\rangle)) +\frac{L\beta^2\|{\boldsymbol u}^\star\|^2}{2\alpha \sqrt{T}} \\ &\leq \ln(1+\exp(-\beta))+ \frac{L\beta^2\|{\boldsymbol u}^\star\|^2}{2\alpha \sqrt{T}}\\ &\leq \exp(-\beta)+ \frac{L\beta^2\|{\boldsymbol u}^\star\|^2}{2\alpha \sqrt{T}}~. \end{aligned}

This bound holds for any {\beta \geq 0}, so we can minimize it with respect to {\beta}. The exact minimizer involves the Lambert W function, but I am lazy, so let’s just use an approximate expression, because any choice of {\beta \geq 0} will give us a valid convergence rate. Let’s pick {\beta=\max\left(\ln\frac{2\alpha \sqrt{T}}{L\|{\boldsymbol u}^\star\|^2},0\right)}, then we have

\displaystyle \min_{{\boldsymbol u}} \ f({\boldsymbol u})+\frac{L\|{\boldsymbol x}_1-{\boldsymbol u}\|^2}{2\alpha \sqrt{T}} \leq \frac{L\|{\boldsymbol u}^\star\|^2}{2\alpha \sqrt{T}} \left(1+\max\left(\ln\frac{2\alpha \sqrt{T}}{L\|{\boldsymbol u}^\star\|^2},0\right)^2\right)~.
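Before putting everything together, here is a quick numerical sanity check, my own addition with arbitrary values of {\alpha}, {L}, {\|{\boldsymbol u}^\star\|^2}, and {T}, that the chosen {\beta} indeed satisfies the inequality above (when the regularization is small enough the two sides actually coincide) and that the bound decays as {T} grows.

```python
import numpy as np

alpha, L, u_norm2 = 1.0, 1.0, 4.0   # arbitrary illustrative values; u_norm2 = ||u*||^2

for T in [2, 100, 10_000, 1_000_000]:
    c = L * u_norm2 / (2 * alpha * np.sqrt(T))      # regularization coefficient
    beta = max(np.log(1 / c), 0.0)                  # suggested (approximate) minimizer
    value = np.exp(-beta) + c * beta ** 2           # expression being bounded
    bound = c * (1 + max(np.log(1 / c), 0.0) ** 2)  # claimed upper bound
    print(f"T={T:>9}:  value = {value:.5f}  <=  bound = {bound:.5f}")
```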

Hence, putting everything together, SGD on this problem will guarantee

\displaystyle \mathop{\mathbb E}[f(\bar{{\boldsymbol x}}_T)] - \inf_{{\boldsymbol x}} f({\boldsymbol x}) = O\left(\frac{\ln^2 T}{\sqrt{T}}\right), \text{ as } T \rightarrow \infty~.

Nice, right? SGD still works on this problem and you only lose some log factors!

Now, can your algorithm guarantee the same? If not, your algorithm might be worse than plain good old SGD!

I hope this issue is clearer now, and I hope this will make you more cautious when assuming an upper bound on {\|{\boldsymbol x}_1-{\boldsymbol x}^\star\|} in machine learning applications.

1. History Bits

The analysis of SGD for any {{\boldsymbol u}} rather than for {{\boldsymbol x}^\star} was already present in the very early papers on this topic; see, for example, Zhang (2004). Then, as usually happens, with time people started simplifying theorems and assumptions, and these results were forgotten. Introducing a generic {{\boldsymbol u}} is also one of the classic ways to obtain rates of convergence for SGD in Reproducing Kernel Hilbert Spaces (RKHS) with universal kernels, where in the interesting cases the minimizer does not exist. I also used it in Orabona (2014), but these ideas go back at least to Smale and Yao (2006). Indeed, one can also work out a similar example for SGD in RKHSs, but the logistic regression case is much more digestible.

Let me add that this should also be the standard way to express the regret in online learning, that is, as a function of an arbitrary competitor {{\boldsymbol u}}. For the same reasons explained above, writing the regret with respect to the minimizer of the losses in the feasible set might be meaningless, for example with linear losses and unbounded domains.

The specific example above for logistic regression is from Ji and Telgarsky (2019) and their follow-up papers, where they also study the rate of convergence of the direction of the solution. This line of work then evolved into what people now call “implicit bias”.

Comments

  1. Very intriguing post! However, I have two questions that I cannot figure out. I hope you could kindly answer them for me.

    (1) Will we have any benefit if we use a parameter-free algorithm over SGD when the minimizer is far, far away?

    (2) How could SGD have a near {O(1/\sqrt{t})} convergence rate when there is a comparator-norm lower bound of {\Omega\left(\frac{\|{\boldsymbol u}\|}{\sqrt{t}}\sqrt{\log \|{\boldsymbol u}\|}\right)}?

    1. (1) Of course! One example is the regression with kernels I mention at the end, where a parameter-free algorithm will give you an optimal rate, as if you had chosen the optimal regularization weight.
      (2) Not sure I understand this one; I need more details.

  2. As of now, what’s the current state-of-the-art in parameter-free stochastic optimization?

    When can we expect to eliminate stepsize sweeps from our ML experiments, in practice? Lots of people, myself included, would love that. 😉

    1. I am confident in saying that parameter-free algorithms are already the state-of-the-art for convex non-smooth stochastic optimization! For example, COCOB, DoG, and the variant I coded in VW (--coin) all work very well on convex stochastic problems.

      But, I don’t think they are ready to be used in non-convex optimization of deep networks, for the simple reason that none of them is designed to work for non-convex functions. Some of them might still work, but it is unclear why and when.

      More generally, despite what some “experts” might say, we still don’t have a good model of what kind of non-convex functions we encounter in deep learning optimization, so we cannot design optimization algorithms for them. So, I would be very suspicious of any optimization algorithm designed to work in the convex setting, and even more suspicious of one designed without taking into account the stochasticity of the gradients.
