LU Decomposition in C (and under CUDA)
As part of any major project, it occasionally happens that you assume something is a ‘solved problem’ when its really not. In my case it was solving small linear systems, of the form Ax=B, where A is an nxn matrix, B is a n vector. This is a problem that’s been solved in libraries such as LAPACK, LINPACK, BLAS, etc etc. The issue appears when you’re trying to do this stuff within a specific hardware environment (CUDA), and you cannot call host functions from the device, and the cuBLAS libraries cater only to large matrices processed in parallel ...