Bidiagonalization of banded matrices

Bidiagonalization through bulge-chasing was long believed to be a CPU-only algorithm because it is memory bound. Not anymore. Low-level GPU memory has increased, and we present the first GPU algorithm for reducing a banded matrix to bidiagonal form, outperforming the HPC libraries PLASMA and SLATE by orders of magnitude.

Read more

Unified GPU Kernels for the SVD

High-level HPC libraries typically rely on low-level hardware-optimized functions. We show that the performance of hardware-specialized functions can be matched or exceeded using abstract functions, with hyperparameters optimized for each data precision and hardware target.

Read more

Unified Recursive TRMM and TRSM

A single high-level recursive implementation of triangular matrix-matrix multiplication (TRMM) and triangular solve (TRSM) covers all hardware and data types. By using recursion to delegate most of the work to more general matrix-matrix multiplications (GEMM), it achieves performance in line with hardware-optimized functions.

Read more
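
To illustrate the recursive idea behind the TRSM summary above, here is a minimal NumPy sketch of a lower-triangular solve that splits the matrix into blocks and hands the off-diagonal update to a general matrix product. It is not the implementation described in the work: the function names `trsm_lower` and `_forward_substitution`, the `base` cutoff, and the choice of a left-sided, lower-triangular solve are all assumptions made for illustration.

```python
import numpy as np


def _forward_substitution(L, B):
    """Base case: solve L X = B row by row for a small lower-triangular L."""
    X = np.zeros_like(B, dtype=float)
    for i in range(L.shape[0]):
        # Subtract the contribution of the rows already solved, then scale.
        X[i] = (B[i] - L[i, :i] @ X[:i]) / L[i, i]
    return X


def trsm_lower(L, B, base=4):
    """Solve L X = B for X, with L lower triangular, by recursive blocking.

    `base` (hypothetical tuning parameter) is the block size at which the
    recursion stops and plain forward substitution takes over.
    """
    n = L.shape[0]
    if n <= base:
        return _forward_substitution(L, B)

    k = n // 2
    L11, L21, L22 = L[:k, :k], L[k:, :k], L[k:, k:]

    # Solve the top block, then remove its effect from the bottom block.
    X1 = trsm_lower(L11, B[:k], base)
    # The update below is an ordinary matrix product (the GEMM step).
    X2 = trsm_lower(L22, B[k:] - L21 @ X1, base)
    return np.concatenate([X1, X2], axis=0)
```

In this sketch the only triangular-specific work happens in the small base case; as the matrix grows, an increasing share of the arithmetic sits in the `L21 @ X1` product, which is why a recursive formulation can lean on a fast GEMM. A TRMM variant would follow the same blocking pattern with products in place of solves.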