Generalized Soft Impute for Matrix Completion

Max Turgeon

University of Manitoba

Motivation

  • Missing data is common in every data science domain

  • Often, methods assume the data is complete or (silently) perform a complete-case analysis.

    • One alternative: data imputation
  • Can we improve performance by leveraging structure in the data?

Common methods

  • Complete-case analysis
  • Mean imputation
    • Also median/mode imputation
  • Often leads to imprecise and/or biased results

Example

Removing missing data can be misleading

Outline

  1. Matrix factorization and matrix completion
  2. Generalized SVD
  3. Generalized Soft Impute
  4. Simulations

Matrix factorization and matrix completion

Non-Negative Matrix Factorization

  • Let \(A\) be an \(n\times p\) matrix
    • For NNMF, all entries are assumed to be non-negative.
  • Want: \(G, H\) non-negative such that \(A \approx GH\).
    • Distance? Frobenius norm of difference
    • Can take \(G,H\) of rank \(r < \min(n,p)\).
  • Lee and Seung (2001) gave an iterative algorithm with entrywise multiplicative updates.
    • We can skip over missing data! (a sketch follows below)
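To make the "skip over missing data" point concrete, here is a minimal numpy sketch of mask-aware multiplicative updates for \(A \approx GH\). The function name nnmf_missing, the defaults, and the binary observation mask are illustrative; the updates follow the standard weighted variant of the Lee and Seung rules rather than any particular implementation.

```python
import numpy as np

def nnmf_missing(A, mask, r, n_iter=500, eps=1e-9):
    """Multiplicative updates for A ~= G @ H restricted to observed entries.
    mask is a boolean array, True where A is observed."""
    n, p = A.shape
    rng = np.random.default_rng(0)
    G = rng.uniform(size=(n, r))
    H = rng.uniform(size=(r, p))
    A_obs = np.where(mask, A, 0.0)            # zero out the missing cells
    for _ in range(n_iter):
        GH = np.where(mask, G @ H, 0.0)       # compare only on observed cells
        H *= (G.T @ A_obs) / (G.T @ GH + eps)
        GH = np.where(mask, G @ H, 0.0)
        G *= (A_obs @ H.T) / (GH @ H.T + eps)
    return G, H
```

Missing cells can then be imputed by reading off the reconstruction `G @ H` at those positions.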

Iterative PCA

  • PCA gives us a decomposition \(A = Z V^T\), with \(V\) containing the eigenvectors of \(\mathrm{Cov}(A)\).
  • Missing data?
    • Randomly impute.
    • Compute PCA and use decomposition to update imputation.
    • Repeat until convergence (the procedure is sketched after this list).
  • Control overfitting via 1) Choosing a small number \(r\) of eigenvectors/PCs and 2) \(L_2\) regularization
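A minimal sketch of the loop above, assuming a boolean mask of observed entries. It initializes with column means rather than random values and omits the \(L_2\) regularization, so it illustrates the plain (unregularized) variant only.

```python
import numpy as np

def iterative_pca_impute(A, mask, r, n_iter=100):
    """Alternate between a rank-r PCA reconstruction and refilling the
    missing entries (mask == False) with that reconstruction."""
    X = np.where(mask, A, np.nan)
    col_means = np.nanmean(X, axis=0)
    X = np.where(mask, A, col_means)               # initial fill with column means
    for _ in range(n_iter):
        mu = X.mean(axis=0)
        U, d, Vt = np.linalg.svd(X - mu, full_matrices=False)
        X_hat = mu + (U[:, :r] * d[:r]) @ Vt[:r]   # rank-r reconstruction
        X = np.where(mask, A, X_hat)               # only update the imputed cells
    return X
```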

Soft Impute

  • Define a projection operator:

\[ [P_\Omega(X)]_{ij} = \begin{cases} X_{ij} & (i,j)\in \Omega\\ 0 & (i,j) \notin \Omega\end{cases}.\]

  • \(\Omega\): indices of the observed (i.e. non-missing) entries.
  • Define objective function:

\[ f(X) = \frac{1}{2} \| P_\Omega(A) - P_\Omega(X) \|_F^2 + \lambda \|X\|_\ast.\]

  • We have:
    • \(\| \cdot \|_F\): Frobenius norm
    • \(\| \cdot \|_\ast\): Nuclear norm (i.e. sum of singular values)
    • \(\lambda\): Tuning parameter
  • Convex optimization! But nuclear norm is not differentiable…
  • Mazumder et al. (2010) proposed an efficient algorithm based on proximal gradient descent (the iteration is sketched below).
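For intuition, here is a bare-bones sketch of the Soft Impute iteration: refill the missing cells from the current estimate, then soft-threshold the singular values. It omits the warm starts and low-rank SVD tricks that make the algorithm of Mazumder et al. (2010) efficient on large matrices.

```python
import numpy as np

def soft_impute(A, mask, lam, n_iter=200):
    """Soft Impute iteration: Z = P_Omega(A) + P_Omega^perp(X), then
    soft-threshold the singular values of Z.
    mask is True on the observed entries (Omega)."""
    X = np.zeros_like(A, dtype=float)
    for _ in range(n_iter):
        Z = np.where(mask, A, X)                   # observed data + current fill-in
        U, d, Vt = np.linalg.svd(Z, full_matrices=False)
        d = np.maximum(d - lam, 0.0)               # S_lambda: soft-threshold
        X = (U * d) @ Vt
    return X
```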

Summary

  • Matrix factorization = structure
  • Learn structure = make good guesses
  • Avoid overfitting with small rank and regularization

Generalized Singular Value Decomposition

Generalized SVD

  • Generalized SVD differs from SVD by introducing positive-definite matrices representing row and column constraints.
  • \(A\) is \(n\times p\) matrix, \(M\) represents row constraints, \(W\) represents column constraints.
  • We want matrices \(U, D, V\), with \(D\) diagonal, such that

\[ A = UDV^T, \qquad U^TMU = V^TWV = I.\]

  • Define \(\tilde{A} = M^{1/2}A W^{1/2}\).
  • Compute SVD: \(\tilde{A} = P D Q^T\).
  • Rescale: \(U=M^{-1/2}P\) and \(V = W^{-1/2}Q\).

We can get a rank \(r\) approximation by taking the first \(r\) columns of \(U,V\), and the first \(r\) diagonal entries of \(D\) (this recipe is sketched below):

\[A \approx U_{[r]} D_{[r]} V_{[r]}^T.\]
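The three-step recipe translates directly into code. The sketch below assumes \(M\) and \(W\) are diagonal and passed as 1-D arrays of weights (as in MCA below); for general positive-definite constraint matrices one would use matrix square roots instead.

```python
import numpy as np

def generalized_svd(A, M, W, r=None):
    """GSVD of A under diagonal metrics M (rows) and W (columns), given as
    1-D arrays: A = U D V^T with U^T diag(M) U = V^T diag(W) V = I."""
    m_sqrt, w_sqrt = np.sqrt(M), np.sqrt(W)
    A_tilde = m_sqrt[:, None] * A * w_sqrt[None, :]    # M^{1/2} A W^{1/2}
    P, d, Qt = np.linalg.svd(A_tilde, full_matrices=False)
    U = P / m_sqrt[:, None]                            # M^{-1/2} P
    V = Qt.T / w_sqrt[:, None]                         # W^{-1/2} Q
    if r is not None:                                  # optional rank-r truncation
        U, d, V = U[:, :r], d[:r], V[:, :r]
    return U, d, V
```

The rank-\(r\) approximation is then recovered as `(U * d) @ V.T`.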

Multiple Correspondence Analysis

  • Think PCA for categorical data.
  • Use one-hot encoding to get a 0/1 matrix (observed counts).
  • Compute the matrix of expected counts under independence (row frequencies times column frequencies).
  • Run the generalized SVD on the difference matrix (the setup is sketched after this list).
    • \(M\) is a diagonal matrix of inverse row frequencies
    • \(W\) is a diagonal matrix of inverse column frequencies
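As an illustration, the MCA bookkeeping might look as follows, assuming the 0/1 indicator matrix Z has already been built (e.g. by one-hot encoding the categorical variables). Scaling conventions differ slightly across MCA implementations, so this is only a sketch.

```python
import numpy as np

def mca_setup(Z):
    """Given a 0/1 one-hot indicator matrix Z, build the inputs for the
    generalized SVD used in MCA."""
    P = Z / Z.sum()                       # observed relative frequencies
    rmass = P.sum(axis=1)                 # row frequencies
    cmass = P.sum(axis=0)                 # column frequencies
    S = P - np.outer(rmass, cmass)        # observed minus expected (independence)
    M = 1.0 / rmass                       # diagonal of inverse row frequencies
    W = 1.0 / cmass                       # diagonal of inverse column frequencies
    return S, M, W
```

Passing S, M, W to a generalized SVD routine like the one sketched earlier yields the MCA solution.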

Generalized Soft Impute

Our method

  • The rank \(r\) matrix \(U_{[r]} D_{[r]} V_{[r]}^T\) (for fixed \(r> 0\)) minimizes the following:

\[f(X) = \frac{1}{2}\mathrm{trace}\left(M(A-X) W (A-X)^T\right),\quad \mathrm{rank}(X) \leq r.\]

  • Key idea: Compute \(f(X)\) for non-missing values of \(A\), and penalize the nuclear norm (i.e. sum of singular values).

\[X_\mathrm{compl} = \mathrm{argmin}_X\, \left\{f(X) + \lambda\|X\|_*\right\}.\]

Algorithm

  • Iterate until convergence:
    • Fill missing entries of \(A\) using previous iteration.
    • Compute generalized SVD for \(A = UDV^T\).
    • Soft-threshold the singular values: \(X = US_\lambda(D)V^T\) (a sketch of the full iteration follows below).

 

Recall: \(S_\lambda(\sigma) = \max(\sigma - \lambda, 0)\).
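Putting the steps together, one possible (illustrative) implementation of the iteration is sketched below. It is not the GenSoftImpute package: it assumes diagonal \(M\) and \(W\) given as weight vectors, runs a fixed number of iterations instead of checking convergence, and recomputes a full SVD at every step.

```python
import numpy as np

def generalized_soft_impute(A, mask, M, W, lam, n_iter=200):
    """Illustrative sketch of the iteration above.
    mask is True on observed entries; M and W are diagonal metrics
    given as 1-D arrays of row and column weights."""
    m_sqrt, w_sqrt = np.sqrt(M), np.sqrt(W)
    scale = m_sqrt[:, None] * w_sqrt[None, :]
    X = np.where(mask, A, 0.0)
    for _ in range(n_iter):
        Z = np.where(mask, A, X)                   # fill missing cells from previous X
        # generalized SVD via the SVD of M^{1/2} Z W^{1/2}
        P, d, Qt = np.linalg.svd(scale * Z, full_matrices=False)
        d = np.maximum(d - lam, 0.0)               # S_lambda(D)
        X = ((P * d) @ Qt) / scale                 # U S_lambda(D) V^T
    return X
```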

Aside: Proximal gradient descent

  • Convex and differentiable objective function \(g(X)\):
    • Gradient descent: \(X^{t+1} = X^t - \gamma \nabla g(X^t)\)
  • What can we do if the objective is non-differentiable? It depends.

  • Assume we can write \(g = f + h\) in two parts:
    • \(f\) is convex and differentiable
    • \(h\) is only convex
  • Define the proximal map:

\[ \mathrm{prox}_h(X) = \mathrm{argmin}_{\Theta} \left\{\frac{1}{2}\|X - \Theta \|_2^2 + h(\Theta)\right\}. \]

  • Sometimes, the proximal map is easy to compute:
    • \(h\) = indicator function of convex set \(\to\) \(\mathrm{prox}_h\) = projection operator
    • \(h\) = L1 norm \(\to\) \(\mathrm{prox}_h\) = soft-thresholding (e.g. Lasso)
    • \(h\) = nuclear norm \(\to\) \(\mathrm{prox}_h\) = matrix soft-thresholding
  • Proximal gradient descent
    • \(X^{t+1} = \mathrm{prox}_{\gamma h}\left(X^t - \gamma \nabla f(X^t)\right)\) (a worked example follows)
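To make the update concrete, here is a standard lasso example (ISTA), where the smooth part is a least-squares loss and the prox of the \(L_1\) penalty is soft-thresholding. Here A and b denote a design matrix and response vector, unrelated to the data matrix \(A\) elsewhere in the talk; the step size is set to \(1/L\), with \(L\) the Lipschitz constant of \(\nabla f\).

```python
import numpy as np

def ista(A, b, lam, n_iter=500):
    """Proximal gradient descent for the lasso:
    f(x) = 0.5 * ||A x - b||^2 (smooth), h(x) = lam * ||x||_1 (non-smooth)."""
    gamma = 1.0 / np.linalg.norm(A, 2) ** 2         # step size 1/L, L = ||A||_2^2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)                    # gradient of the smooth part f
        z = x - gamma * grad                        # plain gradient step
        x = np.sign(z) * np.maximum(np.abs(z) - gamma * lam, 0.0)  # prox of gamma*h
    return x
```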

Summary

  • Proximal gradient descent is useful when objective is non-differentiable, with simple non-differentiable part.
  • Convergence guarantees hold when \(\nabla f\) is Lipschitz continuous
  • Can be accelerated just like GD

Simulations

Simulation scenario

  • We use data from ONDRI (the Ontario Neurodegenerative Disease Research Initiative).
    • Sex, NIH stroke scale score, APOE genotype, MAPT diplotype
  • We randomly remove a fixed proportion of data (\(\pi_{miss}\)).
  • We compare imputed data to original data.
  • We compare four methods:
    • Generalized Soft Impute, Soft Impute, Iterative PCA, and Regularized Iterative PCA.

Results

Discrepancy between imputed and original data (lower is better):

\(\pi_{miss}\)  GSI     SI      iPCA    riPCA
0.05            0.0974  0.1053  0.1032  0.1032
0.10            0.2040  0.2284  0.2152  0.2152
0.15            0.3037  0.3688  0.3115  0.3115
0.20            0.4318  0.4925  0.4085  0.4085
0.25            0.5796  0.6251  0.5145  0.5145

Discussion

  • Our method outperforms Soft Impute.
  • For small \(\pi_{miss}\), GSI is better than iterative PCA.
  • For larger \(\pi_{miss}\), riPCA is better than GSI.
    • BUT requires the right rank!

 

Overall, leveraging structure in data does improve performance.

Discussion (cont’d)

  • We assumed data is MCAR.
    • Reasonable when data missing because of technical failure
  • Need more simulations to understand behaviour with MAR data.

Summary

  • We presented a matrix completion algorithm specifically designed for methods built on generalized SVD (e.g. weighted PCA, MCA).
  • We achieved robustness by penalizing the nuclear norm of the approximating matrix.
  • Proximal gradient descent theory guarantees convergence.
  • Especially useful for compositional data (e.g. ecology, microbiome data).

Next steps

  • Hyperparameter selection (\(\lambda\) and \(r\))
  • Application to microbiome data
    • Missing data a consequence of low read depth
  • Improve computational efficiency by leveraging sparsity
  • Release GenSoftImpute package

Questions?

 

Slides can be found at maxturgeon.ca/talks