Generalized Soft Impute for Matrix Completion

Max Turgeon

University of Manitoba

Motivation

  • Missing data is common in every data science domain

  • Often, methods assume the data is complete or (silently) perform a complete-case analysis.
    • A common remedy: data imputation.
  • Can we improve performance by leveraging structure in the data?

Proof of Concept

Removing missing data can be misleading

Summary

  • We present a matrix completion algorithm specifically designed for methods built on generalized SVD (e.g. weighted PCA, MCA).
  • We achieve robustness by penalizing the nuclear norm of the approximating matrix.
  • Proximal gradient descent theory guarantees convergence.
  • Especially useful for compositional data (e.g. ecology, microbiome data).

Generalized Singular Value Decomposition

  • Generalized SVD differs from SVD by introducing positive-definite matrices representing row and column constraints.
  • \(A\) is an \(n\times p\) matrix, \(M\) represents row constraints, and \(W\) represents column constraints.
  • We want matrices \(U\), \(D\), \(V\), with \(D\) diagonal, such that

\[ A = UDV^T, \qquad U^TMU = V^TWV = I.\]
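
When \(M\) and \(W\) are diagonal, as in the applications below, the generalized SVD reduces to an ordinary SVD of a rescaled matrix. Here is a minimal NumPy sketch of that reduction (function and variable names are ours, purely illustrative):

    import numpy as np

    def generalized_svd(A, m, w):
        # Generalized SVD with M = diag(m) and W = diag(w): returns U, d, V
        # such that A = U diag(d) V^T and U^T M U = V^T W V = I.
        sm, sw = np.sqrt(m), np.sqrt(w)
        # Standard SVD of the rescaled matrix M^{1/2} A W^{1/2}
        Ut, d, Vt = np.linalg.svd(sm[:, None] * A * sw[None, :], full_matrices=False)
        U = Ut / sm[:, None]    # U = M^{-1/2} U-tilde
        V = Vt.T / sw[:, None]  # V = W^{-1/2} V-tilde
        return U, d, V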

Multiple Correspondence Analysis

  • Think PCA for categorical data.
  • Use one-hot encoding to get a 0/1 matrix (observed counts).
  • Compute the matrix of expected counts under independence (row frequencies times column frequencies).
  • Run generalized SVD on the difference matrix.
    • \(M\) is a diagonal matrix of inverse row frequencies
    • \(W\) is a diagonal matrix of inverse column frequencies
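
Concretely, the construction above might look as follows, assuming a pandas data frame of categorical variables (a sketch using one common scaling; exact normalization conventions for MCA vary, and the helper name is ours):

    import numpy as np
    import pandas as pd

    def mca_inputs(df):
        Z = pd.get_dummies(df).to_numpy(dtype=float)  # 0/1 indicator matrix
        P = Z / Z.sum()                               # observed proportions
        r, c = P.sum(axis=1), P.sum(axis=0)           # row and column frequencies
        A = P - np.outer(r, c)                        # observed minus expected
        return A, 1.0 / r, 1.0 / c                    # difference matrix, diag(M), diag(W)

Running the generalized_svd sketch above on these inputs then yields the MCA solution.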

Our method

  • The rank-\(r\) matrix \(U_{[r]} D_{[r]} V_{[r]}^T\), obtained by keeping the first \(r\) components of the generalized SVD (for fixed \(r > 0\)), minimizes the following:

\[f(X) = \frac{1}{2}\mathrm{trace}\left(M(A-X) W (A-X)^T\right),\quad \mathrm{rank}(X) \leq r.\]

  • Key idea: compute \(f(X)\) over the non-missing entries of \(A\) only, and penalize the nuclear norm (i.e. the sum of singular values) of \(X\); see the sketch below.

\[X_\mathrm{compl} = \operatorname*{arg\,min}_X \left\{ f(X) + \lambda\|X\|_* \right\}.\]
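
Since \(M\) and \(W\) are diagonal, the trace expands to a weighted sum of squared residuals, \(f(X) = \frac{1}{2}\sum_{i,j} m_i w_j (a_{ij} - x_{ij})^2\), so restricting to observed entries just means zeroing the residuals at missing positions. A sketch of the observed-entry loss (the mask convention is our assumption):

    import numpy as np

    def f_observed(A, X, m, w, mask):
        # mask: True where A is observed; missing residuals contribute nothing
        R = np.where(mask, A - X, 0.0)
        # trace(M R W R^T) = sum_{i,j} m_i w_j r_ij^2 for diagonal M, W
        return 0.5 * np.sum(m[:, None] * w[None, :] * R**2)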

Algorithm

  • Iterate until convergence:
    • Fill missing entries of \(A\) using previous iteration.
    • Compute generalized SVD for \(A = UDV^T\).
    • Soft-threshold singular values: \(X = US_\lambda(D)V^T\).

 

Recall: \(S_\lambda(\sigma) = \max(\sigma - \lambda, 0)\).
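
Putting the pieces together, a minimal sketch of the iteration (reusing the generalized_svd helper above; the initialization and stopping rule are our assumptions, not necessarily those of the GenSoftImpute package):

    import numpy as np

    def generalized_soft_impute(A, m, w, lam, tol=1e-6, max_iter=500):
        # A: data matrix with np.nan marking missing entries
        mask = ~np.isnan(A)
        X = np.where(mask, A, 0.0)           # initialize missing entries at 0
        for _ in range(max_iter):
            A_filled = np.where(mask, A, X)  # fill missing entries from previous iterate
            U, d, V = generalized_svd(A_filled, m, w)
            d = np.maximum(d - lam, 0.0)     # soft-threshold the singular values
            X_new = (U * d) @ V.T
            if np.linalg.norm(X_new - X) <= tol * max(1.0, np.linalg.norm(X)):
                return X_new
            X = X_new
        return X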

Simulations

  • We use data from the Ontario Neurodegenerative Disease Research Initiative (ONDRI).
    • Sex, NIH stroke scale score, APOE genotype, MAPT diplotype
  • We randomly remove a fixed proportion of data (\(\pi_{miss}\)).
  • We compare the imputed values to the original, held-out values.
  • We compare four methods:
    • Generalized Soft Impute, Soft Impute, Iterative PCA, and Regularized Iterative PCA.
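
For reference, the masking-and-scoring step might look like the sketch below (the RMSE metric is our assumption; the talk does not name the error measure):

    import numpy as np

    rng = np.random.default_rng(1234)

    def mask_at_random(A, pi_miss):
        # Hide a fixed proportion pi_miss of entries
        hidden = rng.random(A.shape) < pi_miss
        return np.where(hidden, np.nan, A), hidden

    def imputation_error(A, X, hidden):
        # Compare imputed entries to the held-out originals
        return np.sqrt(np.mean((A[hidden] - X[hidden]) ** 2))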

Results

Discrepancy between imputed and original data (lower is better):

  \(\pi_{miss}\)   GSI      SI       iPCA     riPCA
  0.05             0.0974   0.1053   0.1032   0.1032
  0.10             0.2040   0.2284   0.2152   0.2152
  0.15             0.3037   0.3688   0.3115   0.3115
  0.20             0.4318   0.4925   0.4085   0.4085
  0.25             0.5796   0.6251   0.5145   0.5145

Discussion

  • Our method outperforms Soft Impute.
  • For small \(\pi_{miss}\), GSI is better than iterative PCA.
  • For larger \(\pi_{miss}\), riPCA is better than GSI.
    • But riPCA requires knowing the right rank in advance!

 

Overall, leveraging structure in data does improve performance.

Next steps

  • Hyperparameter selection (\(\lambda\) and \(r\))
  • Application to microbiome data
    • Missing data is a consequence of low read depth
  • Improve computational efficiency by leveraging sparsity
  • Release GenSoftImpute package

Questions?

 

Slides can be found at maxturgeon.ca/talks