Infeasible methods for smooth objectives

On this page, several infeasible algorithms are presented for solving the following optimization problem on the Stiefel manifold.

\[ \begin{aligned} \min\limits_{X \in \mathbb{R}^{n \times p}} \hspace{2mm} & f (X) \\ \mathrm{s.t.} \hspace{3.5mm} & X \in \mathcal{S}_{n, p}. \end{aligned} \qquad \qquad \mathrm{(OCP)} \]

Here \(f(X)\) is a locally Lipschitz smooth function over \(\mathbb{R}^{n\times p}\). The word ‘‘infeasible’’ refers to the fact that the iterate sequences generated by these algorithms are not necessarily restricted to the Stiefel manifold.


As mentioned in [1], any feasible approach for solving (OCP) involves an explicit or implicit orthonormalization process, which lacks concurrency and can become a bottleneck in solving (OCP) in certain scenarios.

To this end, several algorithms have been proposed in which the iterates are not necessarily restricted to \(\mathcal{S}_{n,p}\). By avoiding the orthonormalization process, these algorithms achieve high scalability in practice.

The rest of this page gives a brief introduction to several highly efficient infeasible algorithms for (OCP).

PCAL

The essence of PCAL is to utilize the closed-form expression

\[ \Lambda = \Phi(X^\top \nabla f(X)) \mbox{ with } \Phi(M) := \frac{1}{2}(M+M^\top) \]

of the multipliers associated with the orthogonality constraints at any first-order stationary point, which can be derived from the following first-order optimality conditions of (OCP).

\[ \left\{ \begin{aligned} & \nabla f(X) = X \Lambda, \\ & X^\top X = I_p, \\ & \Lambda = \Lambda^\top. \end{aligned} \right. \]
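As a quick illustration, the following minimal NumPy sketch evaluates the closed-form multiplier \(\Lambda(X) = \Phi(X^\top \nabla f(X))\) and the residuals of the optimality system above, given a user-supplied gradient oracle. The function names (`sym`, `multiplier`, `stationarity_residuals`) are illustrative and not taken from [1].

```python
import numpy as np

def sym(M):
    # Phi(M) = (M + M^T) / 2
    return 0.5 * (M + M.T)

def multiplier(X, grad_f):
    # Closed-form multiplier Lambda(X) = Phi(X^T grad f(X))
    return sym(X.T @ grad_f(X))

def stationarity_residuals(X, grad_f):
    # Residual norms of the optimality system:
    # the gradient equation and the orthogonality constraint
    Lam = multiplier(X, grad_f)
    kkt = np.linalg.norm(grad_f(X) - X @ Lam, 'fro')            # ||grad f(X) - X Lambda||_F
    feas = np.linalg.norm(X.T @ X - np.eye(X.shape[1]), 'fro')  # ||X^T X - I_p||_F
    return kkt, feas
```

Note that \(\Lambda(X)\) is symmetric by construction, so the third condition of the system holds automatically.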

Then, inspired by the augmented Lagrangian method, the authors of [1] present the following algorithm, named the Proximal Linearized Augmented Lagrangian Algorithm (PLAM):

\[ \begin{align*} \Lambda_k & := \Phi(X_k^\top \nabla f(X_k)), \\ X_{k+1} & := X_k - \eta_k\left[ \nabla f(X_k) - X_k\Lambda_k + \beta X_k(X_k^\top X_k - I_p) \right], \end{align*} \]

where \(\eta_k > 0\) is a step size. As reported in [1], empirical observations show that the performance of PLAM is very sensitive to the choice of \(\beta\), although it can achieve high efficiency with a fine-tuned \(\beta\). Therefore, the authors further present an improved version of PLAM, named Parallelizable Column-wise Block Minimization for PLAM (PCAL):

\[ \begin{align*} \Lambda_k & := \Phi(X_k^\top \nabla f(X_k)), \\ Y_{k+1} & := X_k - \eta_k\left[ \nabla f(X_k) - X_k\Lambda_k + \beta X_k(X_k^\top X_k - I_p) \right], \\ X_{k+1} & := Y_{k+1} \mathrm{Diag}(Y_{k+1}^\top Y_{k+1})^{-\frac{1}{2}}, \end{align*} \]

where \(\eta_k > 0\) is also a step size. In other words, \(X_{k+1}\) is computed by normalizing each column of \(Y_{k+1}\), which is the major difference between PLAM and PCAL in terms of their iteration schemes.
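The following NumPy sketch illustrates one PLAM step and one PCAL step following these schemes. The fixed step size \(\eta\), the value of \(\beta\), and the toy objective \(f(X) = -\tfrac{1}{2}\mathrm{tr}(X^\top A X)\) are illustrative choices only; in particular, [1] employs carefully chosen step sizes rather than a fixed one.

```python
import numpy as np

def sym(M):
    # Phi(M) = (M + M^T) / 2
    return 0.5 * (M + M.T)

def plam_step(X, grad_f, eta, beta):
    # One PLAM update:
    # X_{k+1} = X_k - eta * [grad f(X_k) - X_k Lambda_k + beta X_k (X_k^T X_k - I_p)]
    G = grad_f(X)
    Lam = sym(X.T @ G)                          # Lambda_k
    R = X.T @ X - np.eye(X.shape[1])            # X_k^T X_k - I_p
    return X - eta * (G - X @ Lam + beta * X @ R)

def pcal_step(X, grad_f, eta, beta):
    # One PCAL update: a PLAM-type step followed by column-wise normalization
    Y = plam_step(X, grad_f, eta, beta)
    col_norms = np.sqrt(np.sum(Y * Y, axis=0))  # Diag(Y^T Y)^{1/2}
    return Y / col_norms

# Toy example: f(X) = -0.5 * tr(X^T A X), whose minimizers span the leading
# eigenspace of the symmetric matrix A.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, p = 50, 5
    A = rng.standard_normal((n, n))
    A = 0.5 * (A + A.T)
    grad_f = lambda X: -A @ X
    X = np.linalg.qr(rng.standard_normal((n, p)))[0]
    for _ in range(500):
        X = pcal_step(X, grad_f, eta=1e-2, beta=1.0)
    # Deviation of the final iterate from orthogonality
    print(np.linalg.norm(X.T @ X - np.eye(p), 'fro'))
```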

Code

You can find a description of the code here.

PenCF

The motivation of PenCF comes from the function

\[ h(X) := f(X) - \frac{1}{2} \langle X^\top X - I_p, \Lambda(X) \rangle + \frac{\beta}{4} ||X^\top X - I_p||_\mathrm{F}^2, \]

where \(\Lambda(X):= \Phi(X^\top \nabla f(X))\). The authors in [2] propose the following exact penalty model (PenC) for (OCP) by restricting \(h(X)\) to a convex compact set \(\mathcal{M}\) that contains the Stiefel manifold.

\[ \min_{X \in \mathcal{M}} \hspace{2mm} h(X). \qquad \qquad \mathrm{(PenC)} \]
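For concreteness, the penalty function \(h(X)\) can be evaluated as in the following sketch; the function name `penalty_h` and the oracles `f` and `grad_f` are illustrative and not part of [2].

```python
import numpy as np

def sym(M):
    # Phi(M) = (M + M^T) / 2
    return 0.5 * (M + M.T)

def penalty_h(X, f, grad_f, beta):
    # h(X) = f(X) - 0.5 <X^T X - I_p, Lambda(X)> + (beta/4) ||X^T X - I_p||_F^2
    p = X.shape[1]
    R = X.T @ X - np.eye(p)                     # constraint violation X^T X - I_p
    Lam = sym(X.T @ grad_f(X))                  # Lambda(X) = Phi(X^T grad f(X))
    return f(X) - 0.5 * np.sum(R * Lam) + 0.25 * beta * np.linalg.norm(R, 'fro')**2
```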

However, since \(h(X)\) involves \(\nabla f(X)\) in its function value, the gradient of \(h(X)\) is usually difficult to compute in practice. To this end, the authors of [2] show that the direction

\[ D(X):= \nabla f(X) - X\Lambda(X) + \beta X(X^\top X - I_p) \]

is an approximation to \(\nabla h(X)\), and \(-D(X)\) is a sufficient descent direction for \(h(X)\) when \(||X^\top X - I_p||_\mathrm{F} \leq \frac{1}{6}\) and \(\beta\) is sufficiently large. Based on this direction, the authors propose a first-order method for PenC (PenCF), whose main iteration loop can be summarized as

\[ \begin{align*} Y_k & := X_k - \eta_k D(X_k), \\ X_{k+1} & := \mathcal{P}_{\mathcal{B}_K} (Y_k). \end{align*} \]

Here \(\mathcal{B}_K := \{X\in \mathbb{R}^{n\times p}: ||X||_\mathrm{F} \leq K\}\) for a pre-fixed \(K> \sqrt{p}\), and \(\mathcal{P}_{\mathcal{B}_K}\) refers to the projection onto \(\mathcal{B}_K\). Note that \(\mathcal{P}_{\mathcal{B}_K}(Y_k)\) can be computed by simply rescaling \(Y_k\), and when \(Y_k\) is sufficiently close to the Stiefel manifold, this projection can be skipped altogether. Therefore, PenCF achieves high efficiency while remaining insensitive to \(\beta\).
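A minimal sketch of this iteration, assuming a user-supplied gradient oracle `grad_f` and a fixed step size `eta` (the actual step-size rule in [2] may differ), could look as follows.

```python
import numpy as np

def sym(M):
    # Phi(M) = (M + M^T) / 2
    return 0.5 * (M + M.T)

def pencf_direction(X, grad_f, beta):
    # D(X) = grad f(X) - X Lambda(X) + beta X (X^T X - I_p)
    G = grad_f(X)
    p = X.shape[1]
    return G - X @ sym(X.T @ G) + beta * X @ (X.T @ X - np.eye(p))

def pencf_step(X, grad_f, eta, beta, K):
    # Gradient-type step along -D(X), then projection onto the ball {||X||_F <= K},
    # which amounts to a simple rescaling when the norm exceeds K
    Y = X - eta * pencf_direction(X, grad_f, beta)
    nrm = np.linalg.norm(Y, 'fro')
    return Y if nrm <= K else (K / nrm) * Y
```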

Code

You can find a description of the code here.

PenCS

When \(\nabla^2 f(X)\) is accessible, we can compute the exact gradient of \(h(X)\) by

\[ \nabla h(X) = \nabla f(X) - X\Lambda(X) - \frac{1}{2} \nabla f(X) (X^\top X - I_p) - \frac{1}{2} \nabla^2 f(X)[X(X^\top X - I_p)] + \beta X(X^\top X - I_p). \]
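Assuming a Hessian-vector product oracle `hess_f(X, D)` that returns \(\nabla^2 f(X)[D]\), the exact gradient above can be assembled as in this sketch (the function names are illustrative, not from [2]).

```python
import numpy as np

def sym(M):
    # Phi(M) = (M + M^T) / 2
    return 0.5 * (M + M.T)

def grad_h(X, grad_f, hess_f, beta):
    # Exact gradient of h(X); hess_f(X, D) returns nabla^2 f(X)[D]
    p = X.shape[1]
    G = grad_f(X)
    R = X.T @ X - np.eye(p)        # X^T X - I_p
    Lam = sym(X.T @ G)             # Lambda(X)
    return G - X @ Lam - 0.5 * G @ R - 0.5 * hess_f(X, X @ R) + beta * X @ R
```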

Then we can construct an approximation \(W(X)\) to the Hessian \(\nabla^2 h(X)\), which can be expressed as

\[ \begin{eqnarray*} W(X)[D] & := & \nabla^2 f(X)[D] - D\Lambda(X) - X \left[\Phi(D^\top \nabla f(X)) +\Phi( X^\top \nabla^2 f(X)[D] ) + 2\beta \Phi(X^\top D)\right]\\ & & - \left[\nabla f(X)\Phi(D^\top X)+ \nabla^2 f(X)[X\Phi(X^\top D)]\right]. \end{eqnarray*} \]
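The operator \(W(X)[\cdot]\) can likewise be applied to a direction \(D\) using only Hessian-vector products, as in the following sketch; `hess_f` and `W_apply` are again illustrative names under the same assumptions as above.

```python
import numpy as np

def sym(M):
    # Phi(M) = (M + M^T) / 2
    return 0.5 * (M + M.T)

def W_apply(X, D, grad_f, hess_f, beta):
    # Approximate Hessian operator W(X)[D] of the penalty function h,
    # where hess_f(X, D) returns nabla^2 f(X)[D]
    G = grad_f(X)
    Lam = sym(X.T @ G)             # Lambda(X)
    HD = hess_f(X, D)              # nabla^2 f(X)[D]
    S = sym(X.T @ D)               # Phi(X^T D) = Phi(D^T X)
    bracket = sym(D.T @ G) + sym(X.T @ HD) + 2.0 * beta * S
    return HD - D @ Lam - X @ bracket - (G @ S + hess_f(X, X @ S))
```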

With \(W(X)\), the authors of [2] propose a second-order method for PenC (PenCS) by employing a trust-region-type framework. In each step of PenCS, the descent direction \(D_k\) is calculated by solving the following trust-region subproblem:

\[ \begin{aligned} \min_{D \in \mathbb{R}^{n\times p}} \hspace{2mm} & \langle D, \nabla h(X_k) \rangle + \frac{1}{2} \langle D, W(X_k)[D] \rangle \\ \text{s.t.} \hspace{4mm} & \| X_k + D \|_\mathrm{F} \leq K. \end{aligned} \]
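As a rough illustration only, this quadratic subproblem could be solved approximately by projected-gradient steps over the Frobenius-norm ball, treating \(W(X_k)\) as a symmetric operator. This simple solver is merely a sketch under these assumptions and is not the subproblem solver used in [2].

```python
import numpy as np

def solve_tr_subproblem(X, grad_h_X, W_apply_X, K, step=1e-2, iters=100):
    # Approximately minimize  <D, grad_h_X> + 0.5 <D, W_apply_X(D)>
    # subject to ||X + D||_F <= K, via projected gradient steps.
    # grad_h_X is nabla h(X_k); W_apply_X(D) applies W(X_k) to D.
    D = np.zeros_like(X)
    for _ in range(iters):
        g = grad_h_X + W_apply_X(D)   # gradient of the quadratic model (W assumed symmetric)
        Z = X + (D - step * g)        # candidate point X + D
        nrm = np.linalg.norm(Z, 'fro')
        if nrm > K:
            Z = (K / nrm) * Z         # project onto the Frobenius-norm ball of radius K
        D = Z - X
    return D
```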

Code

You can find a description of the code here.

References

  1. B. Gao, X. Liu, and Y. Yuan, Parallelizable Algorithms for Optimization Problems with Orthogonality Constraints, SIAM Journal on Scientific Computing, 41-3(2019), A1949–A1983.

  2. N. Xiao, X. Liu, and Y. Yuan, A Class of Smooth Exact Penalty Function Methods for Optimization Problems with Orthogonality Constraints, Optimization Methods and Software.