Infeasible methods for smooth objectives

On this page, several infeasible algorithms are presented for solving the following optimization problem on the Stiefel manifold.

\[ \begin{aligned} \min\limits_{X \in \mathbb{R}^{n \times p}} \hspace{2mm} & f (X) \\ \mathrm{s.t.} \hspace{3.5mm} & X \in \mathcal{S}_{n, p}. \end{aligned} \qquad \qquad \mathrm{(OCP)} \]

Here \(f(X)\) is a locally Lipschitz smooth function over \(\mathbb{R}^{n\times p}\). The word ‘‘infeasible’’ refers to the fact that the iterate sequences generated by these algorithms are not necessarily restricted to the Stiefel manifold.


As mentioned in [1], any feasible approach for solving (OCP) involves an explicit or implicit orthonormalization process, which lacks concurrency and can become a bottleneck in solving (OCP) in certain scenarios.

To this end, several algorithms have been proposed in which the iterates are not necessarily restricted to \(\mathcal{S}_{n,p}\). By avoiding the orthonormalization process, these algorithms achieve high scalability in practice.

The rest of this page gives a brief introduction to several highly efficient infeasible algorithms for (OCP).

PCAL

The essence of PCAL is to utilize the closed-form expression

\[ \Lambda = \Phi(X^\top \nabla f(X)) \mbox{ with } \Phi(M) := \frac{1}{2}(M+M^\top) \]

of the multipliers associated with the orthogonality constraints at any first-order stationary point, which can be derived from the following first-order optimality conditions of (OCP).

\[ \left\{ \begin{aligned} & \nabla f(X) = X \Lambda, \\ & X^\top X = I_p, \\ & \Lambda = \Lambda^\top. \end{aligned} \right. \]
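As a quick illustration, the following minimal NumPy sketch evaluates the closed-form multiplier \(\Lambda(X) = \Phi(X^\top \nabla f(X))\) and the residuals of the optimality system above, given a user-supplied gradient oracle. The function names (`sym`, `multiplier`, `stationarity_residuals`) are illustrative and not taken from [1].

```python
import numpy as np

def sym(M):
    # Phi(M) = (M + M^T) / 2
    return 0.5 * (M + M.T)

def multiplier(X, grad_f):
    # Closed-form multiplier Lambda(X) = Phi(X^T grad f(X))
    return sym(X.T @ grad_f(X))

def stationarity_residuals(X, grad_f):
    # Residual norms of the optimality system:
    # the gradient equation and the orthogonality constraint
    Lam = multiplier(X, grad_f)
    kkt = np.linalg.norm(grad_f(X) - X @ Lam, 'fro')            # ||grad f(X) - X Lambda||_F
    feas = np.linalg.norm(X.T @ X - np.eye(X.shape[1]), 'fro')  # ||X^T X - I_p||_F
    return kkt, feas
```

Note that \(\Lambda(X)\) is symmetric by construction, so the third condition of the system holds automatically.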

Then, inspired by the augmented Lagrangian method, the authors of [1] present the following algorithm, named the Proximal Linearized Augmented Lagrangian Algorithm (PLAM):

\[ \begin{align*} \Lambda_k & := \Phi(X_k^\top \nabla f(X_k)), \\ X_{k+1} & := X_k - \eta_k\left[ \nabla f(X_k) - X_k\Lambda_k + \beta X_k(X_k^\top X_k - I_p) \right], \end{align*} \]

where \(\eta_k > 0\) is a step size. As reported in [1], empirical observations show that the performance of PLAM is very sensitive to the choice of \(\beta\), although it can achieve high efficiency with a fine-tuned \(\beta\). Therefore, the authors further present an improved version of PLAM, named Parallelizable Column-wise Block Minimization for PLAM (PCAL):

\[ \begin{align*} \Lambda_k & := \Phi(X_k^\top \nabla f(X_k)), \\ Y_{k+1} & := X_k - \eta_k\left[ \nabla f(X_k) - X_k\Lambda_k + \beta X_k(X_k^\top X_k - I_p) \right], \\ X_{k+1} & := Y_{k+1} \mathrm{Diag}(Y_{k+1}^\top Y_{k+1})^{-\frac{1}{2}}, \end{align*} \]

where \(\eta_k > 0\) is also a step size. In other words, \(X_{k+1}\) is computed by normalizing each column of \(Y_{k+1}\), which is the major difference between PLAM and PCAL in terms of their iteration schemes.
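The following NumPy sketch illustrates one PLAM step and one PCAL step following these schemes. The fixed step size \(\eta\), the value of \(\beta\), and the toy objective \(f(X) = -\tfrac{1}{2}\mathrm{tr}(X^\top A X)\) are illustrative choices only; in particular, [1] employs carefully chosen step sizes rather than a fixed one.

```python
import numpy as np

def sym(M):
    # Phi(M) = (M + M^T) / 2
    return 0.5 * (M + M.T)

def plam_step(X, grad_f, eta, beta):
    # One PLAM update:
    # X_{k+1} = X_k - eta * [grad f(X_k) - X_k Lambda_k + beta X_k (X_k^T X_k - I_p)]
    G = grad_f(X)
    Lam = sym(X.T @ G)                          # Lambda_k
    R = X.T @ X - np.eye(X.shape[1])            # X_k^T X_k - I_p
    return X - eta * (G - X @ Lam + beta * X @ R)

def pcal_step(X, grad_f, eta, beta):
    # One PCAL update: a PLAM-type step followed by column-wise normalization
    Y = plam_step(X, grad_f, eta, beta)
    col_norms = np.sqrt(np.sum(Y * Y, axis=0))  # Diag(Y^T Y)^{1/2}
    return Y / col_norms

# Toy example: f(X) = -0.5 * tr(X^T A X), whose minimizers span the leading
# eigenspace of the symmetric matrix A.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, p = 50, 5
    A = rng.standard_normal((n, n))
    A = 0.5 * (A + A.T)
    grad_f = lambda X: -A @ X
    X = np.linalg.qr(rng.standard_normal((n, p)))[0]
    for _ in range(500):
        X = pcal_step(X, grad_f, eta=1e-2, beta=1.0)
    # Deviation of the final iterate from orthogonality
    print(np.linalg.norm(X.T @ X - np.eye(p), 'fro'))
```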

Code

You can find a description of the code here.

PenCF

The motivation of PenCF comes from the function

\[ h(X) := f(X) - \frac{1}{2} \langle X^\top X - I_p, \Lambda(X) \rangle + \frac{\beta}{4} ||X^\top X - I_p||_\mathrm{F}^2, \]

where \(\Lambda(X):= \Phi(X^\top \nabla f(X))\). The authors in [2] propose the following exact penalty model (PenC) for (OCP) by restricting \(h(X)\) to a convex compact set \(\mathcal{M}\) that contains the Stiefel manifold.

\[ \min_{X \in \mathcal{M}} \hspace{2mm} h(X). \qquad \qquad \mathrm{(PenC)} \]
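For concreteness, the penalty function \(h(X)\) can be evaluated as in the following sketch; the function name `penalty_h` and the oracles `f` and `grad_f` are illustrative and not part of [2].

```python
import numpy as np

def sym(M):
    # Phi(M) = (M + M^T) / 2
    return 0.5 * (M + M.T)

def penalty_h(X, f, grad_f, beta):
    # h(X) = f(X) - 0.5 <X^T X - I_p, Lambda(X)> + (beta/4) ||X^T X - I_p||_F^2
    p = X.shape[1]
    R = X.T @ X - np.eye(p)                     # constraint violation X^T X - I_p
    Lam = sym(X.T @ grad_f(X))                  # Lambda(X) = Phi(X^T grad f(X))
    return f(X) - 0.5 * np.sum(R * Lam) + 0.25 * beta * np.linalg.norm(R, 'fro')**2
```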

However, since \(h(X)\) involves \(\nabla f(X)\) in its function value, the gradient of \(h(X)\) is usually difficult to compute in practice. To this end, the authors of [2] show that the direction

\[ D(X):= \nabla f(X) - X\Lambda(X) + \beta X(X^\top X - I_p) \]

is an approximation to \(\nabla h(X)\), and \(-D(X)\) is a sufficient descent direction for \(h(X)\) when \(||X^\top X - I_p||_\mathrm{F} \leq \frac{1}{6}\) and \(\beta\) is sufficiently large. Based on this direction, the authors propose a first-order method for PenC (PenCF), whose main iteration loop can be summarized as

\[ \begin{align*} Y_k & := X_k - \eta_k D(X_k), \\ X_{k+1} & := \mathcal{P}_{\mathcal{B}_K} (Y_k). \end{align*} \]

Here \(\mathcal{B}_K := \{X\in \mathbb{R}^{n\times p}: ||X||_\mathrm{F} \leq K\}\) for a pre-fixed \(K> \sqrt{p}\), and \(\mathcal{P}_{\mathcal{B}_K}\) refers to the projection onto \(\mathcal{B}_K\). Note that \(\mathcal{P}_{\mathcal{B}_K}(Y_k)\) can be computed by simply rescaling \(Y_k\), and when \(Y_k\) is sufficiently close to the Stiefel manifold, this projection can be skipped altogether. Therefore, PenCF achieves high efficiency while remaining insensitive to \(\beta\).
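A minimal sketch of this iteration, assuming a user-supplied gradient oracle `grad_f` and a fixed step size `eta` (the actual step-size rule in [2] may differ), could look as follows.

```python
import numpy as np

def sym(M):
    # Phi(M) = (M + M^T) / 2
    return 0.5 * (M + M.T)

def pencf_direction(X, grad_f, beta):
    # D(X) = grad f(X) - X Lambda(X) + beta X (X^T X - I_p)
    G = grad_f(X)
    p = X.shape[1]
    return G - X @ sym(X.T @ G) + beta * X @ (X.T @ X - np.eye(p))

def pencf_step(X, grad_f, eta, beta, K):
    # Gradient-type step along -D(X), then projection onto the ball {||X||_F <= K},
    # which amounts to a simple rescaling when the norm exceeds K
    Y = X - eta * pencf_direction(X, grad_f, beta)
    nrm = np.linalg.norm(Y, 'fro')
    return Y if nrm <= K else (K / nrm) * Y
```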

Code

You can find a description of the code here.

PenCS

When \(\nabla^2 f(X)\) is accessible, we can compute the exact gradient of \(h(X)\) by

\[ \nabla h(X) = \nabla f(X) - X\Lambda(X) - \frac{1}{2} \nabla f(X) (X^\top X - I_p) - \frac{1}{2} \nabla^2 f(X)[X(X^\top X - I_p)] + \beta X(X^\top X - I_p). \]
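Assuming a Hessian-vector product oracle `hess_f(X, D)` that returns \(\nabla^2 f(X)[D]\), the exact gradient above can be assembled as in this sketch (the function names are illustrative, not from [2]).

```python
import numpy as np

def sym(M):
    # Phi(M) = (M + M^T) / 2
    return 0.5 * (M + M.T)

def grad_h(X, grad_f, hess_f, beta):
    # Exact gradient of h(X); hess_f(X, D) returns nabla^2 f(X)[D]
    p = X.shape[1]
    G = grad_f(X)
    R = X.T @ X - np.eye(p)        # X^T X - I_p
    Lam = sym(X.T @ G)             # Lambda(X)
    return G - X @ Lam - 0.5 * G @ R - 0.5 * hess_f(X, X @ R) + beta * X @ R
```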

Then we can construct an approximation \(W(X)\) to the Hessian \(\nabla^2 h(X)\), which can be expressed as

\[ \begin{eqnarray*} W(X)[D] & := & \nabla^2 f(X)[D] - D\Lambda(X) - X \left[\Phi(D^\top \nabla f(X)) +\Phi( X^\top \nabla^2 f(X)[D] ) + 2\beta \Phi(X^\top D)\right]\\ & & - \left[\nabla f(X)\Phi(D^\top X)+ \nabla^2 f(X)[X\Phi(X^\top D)]\right]. \end{eqnarray*} \]
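The operator \(W(X)[\cdot]\) can likewise be applied to a direction \(D\) using only Hessian-vector products, as in the following sketch; `hess_f` and `W_apply` are again illustrative names under the same assumptions as above.

```python
import numpy as np

def sym(M):
    # Phi(M) = (M + M^T) / 2
    return 0.5 * (M + M.T)

def W_apply(X, D, grad_f, hess_f, beta):
    # Approximate Hessian operator W(X)[D] of the penalty function h,
    # where hess_f(X, D) returns nabla^2 f(X)[D]
    G = grad_f(X)
    Lam = sym(X.T @ G)             # Lambda(X)
    HD = hess_f(X, D)              # nabla^2 f(X)[D]
    S = sym(X.T @ D)               # Phi(X^T D) = Phi(D^T X)
    bracket = sym(D.T @ G) + sym(X.T @ HD) + 2.0 * beta * S
    return HD - D @ Lam - X @ bracket - (G @ S + hess_f(X, X @ S))
```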

With \(W(X)\), the authors of [2] propose a second-order method for PenC (PenCS) by employing a trust-region-type framework. In each step of PenCS, the descent direction \(D_k\) is calculated by solving the following trust-region subproblem:

\[ \begin{aligned} \min_{D \in \mathbb{R}^{n\times p}} \hspace{2mm} & \langle D, \nabla h(X_k) \rangle + \frac{1}{2} \langle D, W(X_k)[D] \rangle \\ \text{s.t.} \hspace{4mm} & \| X_k + D \|_\mathrm{F} \leq K. \end{aligned} \]
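As a rough illustration only, this quadratic subproblem could be solved approximately by projected-gradient steps over the Frobenius-norm ball, treating \(W(X_k)\) as a symmetric operator. This simple solver is merely a sketch under these assumptions and is not the subproblem solver used in [2].

```python
import numpy as np

def solve_tr_subproblem(X, grad_h_X, W_apply_X, K, step=1e-2, iters=100):
    # Approximately minimize  <D, grad_h_X> + 0.5 <D, W_apply_X(D)>
    # subject to ||X + D||_F <= K, via projected gradient steps.
    # grad_h_X is nabla h(X_k); W_apply_X(D) applies W(X_k) to D.
    D = np.zeros_like(X)
    for _ in range(iters):
        g = grad_h_X + W_apply_X(D)   # gradient of the quadratic model (W assumed symmetric)
        Z = X + (D - step * g)        # candidate point X + D
        nrm = np.linalg.norm(Z, 'fro')
        if nrm > K:
            Z = (K / nrm) * Z         # project onto the Frobenius-norm ball of radius K
        D = Z - X
    return D
```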

Code

You can find a description of the code here.

References

  1. B. Gao, X. Liu, and Y. Yuan, Parallelizable Algorithms for Optimization Problems with Orthogonality Constraints, SIAM Journal on Scientific Computing, 41-3(2019), A1949–A1983.

  2. N. Xiao, X. Liu, and Y. Yuan, A Class of Smooth Exact Penalty Function Methods for Optimization Problems with Orthogonality Constraints, Optimization Methods and Software.