Generalized Soft Impute for Matrix Completion

Max Turgeon

University of Manitoba

Motivation

  • Missing data is common in every data science domain

  • Often, methods assume the data is complete or (silently) perform a complete-case analysis.
    • A common remedy: data imputation.
  • Can we improve performance by leveraging structure in the data?

Proof of Concept

Removing missing data can be misleading

Summary

  • We present a matrix completion algorithm specifically designed for methods built on generalized SVD (e.g. weighted PCA, MCA).
  • We achieve robustness by penalizing the nuclear norm of the approximating matrix.
  • Proximal gradient descent theory guarantees convergence.
  • Especially useful for compositional data (e.g. ecology, microbiome data).

Generalized Singular Value Decomposition

  • Generalized SVD differs from SVD by introducing positive-definite matrices representing row and column constraints.
  • \(A\) is an \(n\times p\) matrix, \(M\) represents row constraints, and \(W\) represents column constraints.
  • We want matrices \(U\), \(D\), \(V\), with \(D\) diagonal, such that

\[ A = UDV^T, \qquad U^TMU = V^TWV = I.\]
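
When \(M\) and \(W\) are diagonal, as in the applications below, the generalized SVD reduces to an ordinary SVD of a rescaled matrix. Here is a minimal NumPy sketch of that reduction (function and variable names are ours, purely illustrative):

    import numpy as np

    def generalized_svd(A, m, w):
        # Generalized SVD with M = diag(m) and W = diag(w): returns U, d, V
        # such that A = U diag(d) V^T and U^T M U = V^T W V = I.
        sm, sw = np.sqrt(m), np.sqrt(w)
        # Standard SVD of the rescaled matrix M^{1/2} A W^{1/2}
        Ut, d, Vt = np.linalg.svd(sm[:, None] * A * sw[None, :], full_matrices=False)
        U = Ut / sm[:, None]    # U = M^{-1/2} U-tilde
        V = Vt.T / sw[:, None]  # V = W^{-1/2} V-tilde
        return U, d, V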

Multiple Correspondence Analysis

  • Think PCA for categorical data.
  • Use one-hot encoding to get a 0/1 matrix (observed counts).
  • Compute the matrix of expected counts under independence (row frequencies times column frequencies).
  • Run generalized SVD on the difference matrix.
    • \(M\) is a diagonal matrix of inverse row frequencies
    • \(W\) is a diagonal matrix of inverse column frequencies
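
Concretely, the construction above might look as follows, assuming a pandas data frame of categorical variables (a sketch using one common scaling; exact normalization conventions for MCA vary, and the helper name is ours):

    import numpy as np
    import pandas as pd

    def mca_inputs(df):
        Z = pd.get_dummies(df).to_numpy(dtype=float)  # 0/1 indicator matrix
        P = Z / Z.sum()                               # observed proportions
        r, c = P.sum(axis=1), P.sum(axis=0)           # row and column frequencies
        A = P - np.outer(r, c)                        # observed minus expected
        return A, 1.0 / r, 1.0 / c                    # difference matrix, diag(M), diag(W)

Running the generalized_svd sketch above on these inputs then yields the MCA solution.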

Our method

  • The rank-\(r\) matrix \(U_{[r]} D_{[r]} V_{[r]}^T\), obtained by keeping the first \(r\) components of the generalized SVD (for fixed \(r > 0\)), minimizes the following:

\[f(X) = \frac{1}{2}\mathrm{trace}\left(M(A-X) W (A-X)^T\right),\quad \mathrm{rank}(X) \leq r.\]

  • Key idea: compute \(f(X)\) over the non-missing entries of \(A\) only, and penalize the nuclear norm (i.e. the sum of singular values) of \(X\); see the sketch below.

\[X_\mathrm{compl} = \operatorname*{arg\,min}_X \left\{ f(X) + \lambda\|X\|_* \right\}.\]
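
Since \(M\) and \(W\) are diagonal, the trace expands to a weighted sum of squared residuals, \(f(X) = \frac{1}{2}\sum_{i,j} m_i w_j (a_{ij} - x_{ij})^2\), so restricting to observed entries just means zeroing the residuals at missing positions. A sketch of the observed-entry loss (the mask convention is our assumption):

    import numpy as np

    def f_observed(A, X, m, w, mask):
        # mask: True where A is observed; missing residuals contribute nothing
        R = np.where(mask, A - X, 0.0)
        # trace(M R W R^T) = sum_{i,j} m_i w_j r_ij^2 for diagonal M, W
        return 0.5 * np.sum(m[:, None] * w[None, :] * R**2)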

Algorithm

  • Iterate until convergence:
    • Fill missing entries of \(A\) using previous iteration.
    • Compute generalized SVD for \(A = UDV^T\).
    • Soft-threshold singular values: \(X = US_\lambda(D)V^T\).

 

Recall: \(S_\lambda(\sigma) = \max(\sigma - \lambda, 0)\).
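
Putting the pieces together, a minimal sketch of the iteration (reusing the generalized_svd helper above; the initialization and stopping rule are our assumptions, not necessarily those of the GenSoftImpute package):

    import numpy as np

    def generalized_soft_impute(A, m, w, lam, tol=1e-6, max_iter=500):
        # A: data matrix with np.nan marking missing entries
        mask = ~np.isnan(A)
        X = np.where(mask, A, 0.0)           # initialize missing entries at 0
        for _ in range(max_iter):
            A_filled = np.where(mask, A, X)  # fill missing entries from previous iterate
            U, d, V = generalized_svd(A_filled, m, w)
            d = np.maximum(d - lam, 0.0)     # soft-threshold the singular values
            X_new = (U * d) @ V.T
            if np.linalg.norm(X_new - X) <= tol * max(1.0, np.linalg.norm(X)):
                return X_new
            X = X_new
        return X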

Simulations

  • We use data from the Ontario Neurodegenerative Disease Research Initiative (ONDRI).
    • Sex, NIH stroke scale score, APOE genotype, MAPT diplotype
  • We randomly remove a fixed proportion of data (\(\pi_{miss}\)).
  • We compare the imputed values to the original, held-out values.
  • We compare four methods:
    • Generalized Soft Impute, Soft Impute, Iterative PCA, and Regularized Iterative PCA.
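
For reference, the masking-and-scoring step might look like the sketch below (the RMSE metric is our assumption; the talk does not name the error measure):

    import numpy as np

    rng = np.random.default_rng(1234)

    def mask_at_random(A, pi_miss):
        # Hide a fixed proportion pi_miss of entries
        hidden = rng.random(A.shape) < pi_miss
        return np.where(hidden, np.nan, A), hidden

    def imputation_error(A, X, hidden):
        # Compare imputed entries to the held-out originals
        return np.sqrt(np.mean((A[hidden] - X[hidden]) ** 2))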

Results

Discrepancy between imputed and original data (lower is better):

  \(\pi_{miss}\)   GSI      SI       iPCA     riPCA
  0.05             0.0974   0.1053   0.1032   0.1032
  0.10             0.2040   0.2284   0.2152   0.2152
  0.15             0.3037   0.3688   0.3115   0.3115
  0.20             0.4318   0.4925   0.4085   0.4085
  0.25             0.5796   0.6251   0.5145   0.5145

Discussion

  • Our method outperforms Soft Impute.
  • For small \(\pi_{miss}\), GSI is better than iterative PCA.
  • For larger \(\pi_{miss}\), riPCA is better than GSI.
    • But riPCA requires knowing the right rank in advance!

 

Overall, leveraging structure in data does improve performance.

Next steps

  • Hyperparameter selection (\(\lambda\) and \(r\))
  • Application to microbiome data
    • Missing data is a consequence of low read depth
  • Improve computational efficiency by leveraging sparsity
  • Release GenSoftImpute package

Questions?

 

Slides can be found at maxturgeon.ca/talks