Stochastic mirror descent: Convergence analysis and adaptive variants

Stochastic mirror descent

Stochastic optimization is a fundamental technique in machine learning and optimization used to solve complex problems where the objective function is subject to randomness. One of the key challenges in stochastic optimization is designing algorithms that can converge to a solution efficiently, even when the problem is non-smooth, non-convex, or constrained.

In recent years, stochastic mirror descent (SMD) has emerged as a powerful approach for addressing these challenges, offering improved convergence guarantees and the ability to handle non-Euclidean geometries.

In this blog post, we delve into the paper titled "Stochastic Mirror Descent: Convergence Analysis and Adaptive Variants via the Mirror Stochastic Polyak Stepsize," which was recently accepted at Transactions on Machine Learning Research (TMLR). You can find the code for the experiments and implementations on GitHub.

The paper investigates the convergence properties of SMD under different scenarios and introduces novel adaptive variants for improved performance, making significant contributions to the field of stochastic optimization.

Stochastic optimization methods

Traditional stochastic optimization methods, such as stochastic gradient descent (SGD), often rely on Euclidean geometry and require bounded gradient or variance assumptions to guarantee convergence. However, these methods may not perform well in non-Euclidean settings or when the problem involves constraints. This is where SMD comes into play.

SMD is a generalization of SGD that can match the geometry of the problem, leading to better convergence guarantees. It leverages non-Euclidean projections, allowing for more flexible choices of constraints and geometries. This flexibility makes SMD suitable for a wide range of optimization tasks.
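To make the update concrete, here is a minimal sketch (our illustration, not the paper's code) of one SMD step. With the negative-entropy mirror map on the probability simplex, the step becomes a multiplicative, exponentiated-gradient update that stays on the simplex without an explicit Euclidean projection; with the squared-Euclidean mirror map, SMD reduces to plain SGD:

```python
import numpy as np

def smd_step_entropy(x, grad, stepsize):
    """One SMD step with the negative-entropy mirror map on the
    probability simplex: map to the dual, take a gradient step, and
    map back. The result is the exponentiated-gradient update
    x * exp(-stepsize * grad), re-normalized to sum to one."""
    y = x * np.exp(-stepsize * grad)
    return y / y.sum()

def smd_step_euclidean(x, grad, stepsize):
    """With the squared-Euclidean mirror map, SMD is exactly SGD."""
    return x - stepsize * grad

# Hypothetical example: a point on the 3-simplex and a stochastic gradient.
x = np.array([0.5, 0.3, 0.2])
g = np.array([1.0, -1.0, 0.0])
x_next = smd_step_entropy(x, g, stepsize=0.1)
print(x_next, x_next.sum())  # still a probability vector
```

Note how the entropy update multiplies coordinates by positive factors, so iterates remain strictly inside the simplex; this is the sense in which the mirror map "matches the geometry" of a simplex-constrained problem.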

Contributions of the paper

The paper makes several significant contributions to the field of stochastic optimization:

1. Weak assumptions on noise

Unlike most previous work on SMD, this research does not rely on bounded gradient or variance assumptions. Instead, it introduces a measure called the "finite optimal objective difference," denoted σ². Roughly, σ² measures the expected gap between the value each stochastic function attains at a solution and that function's own minimum value; it vanishes under interpolation. The paper demonstrates that, assuming only that σ² is finite, SMD admits convergence guarantees without strong assumptions on the noise.
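Concretely, writing the objective as $f(x) = \mathbb{E}_i[f_i(x)]$ with a minimizer $x^*$, the finite optimal objective difference can be stated (following the notation used in the Polyak-stepsize literature; the paper's exact formulation may differ in details) as:

```latex
\sigma^2 \;:=\; \mathbb{E}_i\!\left[ f_i(x^*) - f_i^* \right],
\qquad f_i^* := \inf_{x \in \mathcal{X}} f_i(x).
```

Under interpolation every $f_i$ is minimized at $x^*$, so $\sigma^2 = 0$; requiring $\sigma^2 < \infty$ is far weaker than requiring bounded gradients or bounded variance.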

2. Adaptive step sizes

The paper introduces the "mirror stochastic Polyak stepsize" (mSPS) as an adaptive stepsize scheme for SMD. This stepsize adaptation is crucial for achieving efficient convergence, especially in problems with varying degrees of smoothness. The mSPS adapts the stepsize during optimization, allowing for better convergence performance compared to fixed step sizes.
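As a hedged sketch (function names and defaults are ours, not the paper's exact implementation), the mSPS rule sets the stepsize from the current minibatch loss, a lower bound on that loss, and the squared dual norm of the stochastic gradient; capping the result corresponds to the bounded variant:

```python
import numpy as np

def msps(loss_i, dual_norm_sq, c=0.5, f_i_star=0.0, max_step=None):
    """Sketch of the mirror stochastic Polyak stepsize:
    (f_i(x) - f_i^*) / (c * ||grad f_i(x)||_*^2), where ||.||_* is the
    dual norm induced by the mirror map and f_i^* (`f_i_star`) is a
    lower bound on the minibatch loss, e.g. 0 for non-negative losses.
    Optionally capped at `max_step` for a bounded variant."""
    step = (loss_i - f_i_star) / (c * dual_norm_sq + 1e-12)
    return step if max_step is None else min(step, max_step)

# Hypothetical example with the entropy mirror map, whose dual norm is
# the infinity norm: ||g||_* = max_j |g_j|.
g = np.array([0.5, -2.0, 1.0])
gamma = msps(loss_i=0.8, dual_norm_sq=np.max(np.abs(g)) ** 2,
             c=0.5, max_step=1.0)
print(gamma)  # 0.8 / (0.5 * 4) = 0.4
```

The key property is that the stepsize shrinks automatically as the minibatch loss approaches its lower bound, so no decay schedule has to be tuned by hand.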

3. Exact convergence with interpolation

In modern machine learning, interpolation refers to the regime where overparameterized models fit the training data exactly, driving every individual training loss to zero. The paper shows that SMD inherits exact convergence guarantees under interpolation, a property previously associated with SGD. This is a significant result, as it provides the first such convergence guarantees for the exponentiated gradient algorithm under both relative smoothness and classic smoothness settings.
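The interpolation regime can be illustrated with a toy problem of our own construction (not an experiment from the paper): every stochastic loss shares the same simplex minimizer, and exponentiated gradient with an mSPS-style stepsize drives the training loss toward zero:

```python
import numpy as np

# Hypothetical interpolation setting: each f_i(x) = 0.5 * (a_i @ x - b_i)^2
# is minimized at the same simplex point x_star, so training loss can hit 0.
rng = np.random.default_rng(0)
x_star = np.array([0.2, 0.3, 0.5])
A = rng.normal(size=(10, 3))
b = A @ x_star  # labels are realizable: interpolation holds

def full_loss(x):
    return 0.5 * np.mean((A @ x - b) ** 2)

x = np.ones(3) / 3  # start at the uniform distribution
initial = full_loss(x)
for _ in range(500):
    i = rng.integers(10)
    r = A[i] @ x - b[i]
    grad = A[i] * r                # stochastic gradient of f_i
    loss_i = 0.5 * r ** 2          # f_i^* = 0 under interpolation
    # mSPS with the infinity (dual) norm for the entropy mirror map,
    # c = 1, capped for numerical safety.
    gamma = min(loss_i / (np.max(np.abs(grad)) ** 2 + 1e-12), 10.0)
    x = x * np.exp(-gamma * grad)  # exponentiated gradient step
    x /= x.sum()

final = full_loss(x)
print(initial, final)  # the training loss shrinks under interpolation
```

Because the stepsize is proportional to the remaining minibatch loss, it vanishes as the iterates approach the interpolating solution, which is the mechanism behind exact convergence without a manually decayed schedule.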

4. Extensive experimental validation

To demonstrate the effectiveness of the proposed adaptive stepsize and convergence results, the paper conducts extensive numerical experiments across various supervised learning tasks and different instances of SMD. These experiments showcase the practical applicability and efficiency of the introduced methods.

The paper positions its contributions in the context of related work in stochastic optimization. It highlights the differences and advantages of its approach compared to existing methods, emphasizing the weak assumptions on noise, adaptability, and the ability to handle constrained optimization problems.

Experiments: Non-convex deep learning

In this section, we focus on non-convex deep learning experiments to assess the performance of mSPS. Deep learning models often involve non-convex objective functions, making optimization challenging. We demonstrate that mSPS competes effectively in these scenarios without requiring extensive hyperparameter tuning.

Below, you'll find plots and figures showcasing the performance of mSPS in non-convex deep learning experiments.

Figure 1: Comparison between mSPS and constant stepsizes on non-convex multiclass classification with deep networks. The leftmost plot shows the stepsize evolution of mSPS with smoothing and c = 0.2 for different p values.

Conclusion

In summary, the paper titled "Stochastic Mirror Descent: Convergence Analysis and Adaptive Variants via the Mirror Stochastic Polyak Stepsize" introduces innovative techniques and convergence guarantees for SMD in constrained optimization problems.

By leveraging weak assumptions on noise and introducing adaptive step sizes, this research opens up new avenues for efficient and effective optimization in non-Euclidean settings. The exact convergence under interpolation further strengthens the appeal of SMD as a powerful tool in modern machine learning and optimization.

This paper represents a significant advancement in the field of stochastic optimization, offering practical and robust solutions to complex optimization problems. Researchers and practitioners in machine learning and optimization will find the results and techniques presented in this paper invaluable for addressing real-world challenges.

Find out more about the latest ServiceNow Research.