MKL, BLAS, LAPACK, cuBLAS, cuSPARSE Equivalents

  • Question

  • So Intel has MKL for the CPU; NVIDIA has cuBLAS and cuSPARSE for CUDA; and there are various CPU implementations of BLAS and LAPACK that serve as jumping-off points for developing applications that use linear algebra.

    I'd like to build an application that uses AMP to provide the underlying linear algebra routines my application's functionality requires. The impetus is the expected performance improvement over using the CPU alone (NVIDIA's documentation indicates that cuBLAS provides at least an order-of-magnitude improvement over MKL for a wide variety of operations on matrices of around 4K rows/columns), along with the abstraction of the underlying hardware that AMP provides.

    To do the same with AMP, I apparently need to first construct at least a subset of the aforementioned underlying libraries. For me, this is a non-trivial exercise.

    With a modest understanding of linear algebra, and limiting my research to matrix multiplication alone, it took me a full day to survey the current state of the art (taking into account the underlying architecture, big-O performance expectations, relative numerical stability, and so on) before I had a reasonably good idea of which candidate algorithms to implement and compare for empirical performance.

    Add implementation, optimization, and testing for each of the necessary methods, and this becomes a substantial investment of time.

    Is anyone else already working on this? 

    Is anyone aware of a synopsis of algorithms applicable to parallelization on a GPU? Of special interest is the partitioning of large problems and how algorithms might be modified to better accommodate them (for example, Morton ordering for improved memory locality).
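    As a concrete illustration of the Morton-ordering idea mentioned above, here is a minimal plain-C++ sketch of a Z-order index; the names (`spread_bits`, `morton_index`) are mine, not from any library discussed here, and it assumes 16-bit row/column indices:

```cpp
#include <cstdint>

// Spread the low 16 bits of v so each bit lands in an even position.
static std::uint32_t spread_bits(std::uint32_t v) {
    v &= 0xFFFF;
    v = (v | (v << 8)) & 0x00FF00FF;
    v = (v | (v << 4)) & 0x0F0F0F0F;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

// Interleave row and column bits into a Morton (Z-order) index.
// Neighbouring (row, col) pairs map to nearby linear addresses,
// which keeps 2-D tiles of a matrix contiguous in memory.
std::uint32_t morton_index(std::uint32_t row, std::uint32_t col) {
    return (spread_bits(row) << 1) | spread_bits(col);
}
```

    Storing matrix blocks in this order improves locality for tile-based kernels, since a square tile occupies one contiguous range of the Morton curve.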

    Any other suggestions on getting to the point where I have an AMP BLAS/LAPACK implementation?


    Ken Miller
    Monday, October 24, 2011 11:28 PM


All replies

  • Ken,

    Like you, I would like to see numerical libraries for C++ AMP; IMNSHO they would make those of us using it significantly more productive. Some questions that spring to mind are:

    • Should the team at Microsoft be spending their time on this or on other C++ AMP features?
    • What could be done to create a vibrant open source community?
    • How costly would it be to subcontract out specific needs (such as my need for FFT)?
    • How long would it take me to write and validate a production quality algorithm using C++ AMP compared to straight C++?

    It would be interesting to hear how important such libraries are to others - for me they are very important.


    Wednesday, October 26, 2011 2:41 PM
  • My company has one project currently underway for which we have committed to using AMP and another being prototyped for which we would very much like to use AMP.

    So, we're going to be writing our own (limited) libraries in-house for at least some of this stuff because we simply can't wait for some other solution to arise.

    Where we are now is pondering how much effort to put into those libraries. Is there a commercial market that would justify a more comprehensive approach? Perhaps something more like Thrust, which could then be used to build up basic linear algebra capabilities, with a third level of purpose-specific functionality on top (multivariate statistics, integer programming, etc.).

    I don't think Microsoft can expect wide adoption without something comparable to what's available for CUDA. Our internal decision to use AMP was touch and go for a while, primarily because of the huge (understandable at this stage, but inexcusable later) disparity in building-block resources.

    Unfortunately, more questions than answers at this point.

    Ken Miller
    Wednesday, October 26, 2011 7:55 PM
  • It would make more sense for MS to simply define a set of utility functions (FFT, sort, scan, split, merge, select, GEMM, sparse operations, etc.) and allow the hardware vendors to provide hardware-specific implementations. Of course, they should provide reference implementations of these as well. The same goes for OpenCL and its lack of libraries for data-parallel programming.
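    To make the "reference implementation" idea concrete, here is a minimal sketch of what a naive reference GEMM might look like in plain C++. The name and signature are illustrative only, not any proposed API; a vendor would replace the triple loop with a tuned kernel while matching the same contract:

```cpp
#include <cstddef>
#include <vector>

// Naive row-major reference GEMM: C = alpha * A * B + beta * C.
// A is m x k, B is k x n, C is m x n.
void reference_gemm(std::size_t m, std::size_t n, std::size_t k,
                    float alpha, const std::vector<float>& A,
                    const std::vector<float>& B,
                    float beta, std::vector<float>& C) {
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
    }
}
```

    A slow but obviously-correct version like this also doubles as the oracle for validating the optimized vendor kernels.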

    C++ AMP doesn't have nearly as much low-level power as CUDA, and it's not likely that a portable AMP kernel would perform well compared to a CUDA kernel for a complicated operation.

    Wednesday, October 26, 2011 8:42 PM
  • I'm writing a .NET class library for scientific and industrial image processing applications that uses quite a bit of linear algebra. One of the design issues I struggled with is whether or not to include a GPU branch to the execution model. I've designed the SDK to allow for this in the future should I want to incorporate it, but IMO for applications that aren't real-time there is rarely a true need for this. Properly parallelized code (even managed code) can perform very nicely indeed. Of course, for real-time stuff it's a different story.

    Most industry people I spoke with didn't even list run-time performance as their first priority, it was down in 5th or 6th place behind things like time-to-market, feature set, ease of use, compatibility, ease of distribution, etc.

    The problem I have with the GPGPU model is that it tends to ignore some practical issues of product development. GPGPU advocates will compare an optimized version of some algorithm (one that maps optimally to the GPU, of course) against a single-threaded version running on a CPU. When properly threaded, most linear algebra operations scale very well on multicore machines, and you can get very good performance as a result. Remember: it doesn't have to be the fastest, it only has to be fast enough.
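    For what it's worth, the "properly threaded" point can be sketched with nothing but std::thread in plain C++11; `parallel_dot` is an illustrative name, not a library routine:

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Split a dot product across hardware threads; no GPU involved.
double parallel_dot(const std::vector<double>& x, const std::vector<double>& y,
                    unsigned nthreads = std::thread::hardware_concurrency()) {
    if (nthreads == 0) nthreads = 1;
    std::vector<double> partial(nthreads, 0.0);  // one slot per worker
    std::vector<std::thread> workers;
    const std::size_t chunk = (x.size() + nthreads - 1) / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t lo = t * chunk;
            const std::size_t hi = std::min(x.size(), lo + chunk);
            for (std::size_t i = lo; i < hi; ++i)
                partial[t] += x[i] * y[i];
        });
    }
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

    Per-thread partial sums avoid any locking in the hot loop; the only synchronization is the final join and reduction.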

    The other practical issue is that to get the super-speedups people think they're going to get, you essentially have to do your own caching using shared memory. When you open that can of worms, you have to start worrying about how to shoehorn your algorithm into the shared memory, whether the target machines have the minimum amount of such memory, what to do if they don't, how you are going to test and debug, etc. It becomes an ugly, time-consuming prospect, and most developers I know have just said "Screw it, I don't have time; I'll just do everything in global memory" - and all of a sudden the best their software gets is about 10x, vs. 6x-7x on an i7.
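    To illustrate what that "shoehorning" looks like, here is the CPU analogue of staging tiles in shared memory: a cache-blocked matrix multiply in plain C++. The tile size and function name are illustrative choices, and the same restructuring is what a GPU kernel does with its on-chip memory:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t TILE = 32;  // chosen so three tiles fit in cache

// Cache-blocked multiply of two n x n row-major matrices: C = A * B.
void tiled_matmul(std::size_t n, const std::vector<float>& A,
                  const std::vector<float>& B, std::vector<float>& C) {
    std::fill(C.begin(), C.end(), 0.0f);
    for (std::size_t ii = 0; ii < n; ii += TILE)
        for (std::size_t pp = 0; pp < n; pp += TILE)
            for (std::size_t jj = 0; jj < n; jj += TILE)
                // Work on one TILE x TILE block at a time, so each
                // operand element is reloaded from memory far less often.
                for (std::size_t i = ii; i < std::min(ii + TILE, n); ++i)
                    for (std::size_t p = pp; p < std::min(pp + TILE, n); ++p) {
                        const float a = A[i * n + p];
                        for (std::size_t j = jj; j < std::min(jj + TILE, n); ++j)
                            C[i * n + j] += a * B[p * n + j];
                    }
}
```

    The awkward part the post describes is exactly this: choosing TILE to match the hardware, and handling matrices whose dimensions don't divide evenly (the std::min bounds above).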

    Oh yeah, don't forget you have to consider how to handle calls coming in from OEMs when they get weird results using your library because they're running it on a cheap card with no ECC memory and a lousy driver. And we haven't even considered precision yet and how that increases the likelihood that you'll have to handle some numerically degenerate case.

    However, one area where I think AMP and GPGPU have a bright future is on tablets. Consumer apps on tablets aren't usually high-precision items, and ARM CPUs are a little slow at floating point. In this case a quick-and-dirty single-precision, global-memory implementation can really shine and may make the difference, especially with power consumption.


    • Edited by LKeene Thursday, October 27, 2011 9:46 PM
    Thursday, October 27, 2011 9:43 PM
  • Most video cards handle double-precision just fine. It's standard IEEE-754, and all current-gen cards also support fused multiply-and-add, which has higher precision for those ops than CPUs using SSE.
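    As a small illustration of that fused multiply-add property: because std::fma rounds only once, it can recover the exact rounding error of an ordinary product. The function name here is mine, for illustration:

```cpp
#include <cmath>

// fma(a, b, -p) computes a*b - p with a single rounding, so when
// p is the rounded product a*b, the result is that product's exact
// rounding error. A separate multiply-then-subtract cannot do this.
double product_error(double a, double b) {
    double p = a * b;           // rounded product
    return std::fma(a, b, -p);  // exact residual of a*b
}
```

    For example, (1 + 2^-27)(1 - 2^-27) = 1 - 2^-54 rounds to exactly 1.0 in double precision, and product_error recovers the lost -2^-54; this trick is the basis of error-free transformations in compensated algorithms.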

    For things like image processing and linear algebra, which are often bandwidth-limited, the GPU has a huge advantage. Bandwidth-limited algorithms are typically pretty simple and can be coded for the GPU without too much hassle, and a high-end GPU has about 10x the theoretical bandwidth of a high-end CPU. It's also much easier to realize this high bandwidth on a GPU, as arithmetic and memory operations are overlapped.

    As a transitional step, you can just use a library of helpful GPU functions so you don't have to roll your own kernels every time, like how MATLAB has CUDA backing for many of its vectorized functions. It's not as fast as it could be, but you get ease of use, fast deployment, etc., compared to rolling everything yourself. For arithmetically dense operations (such as matrix multiplication and sorting), the GPU is really dominant, with a speed advantage that leaves the multi-core CPU in the dust.

    Friday, October 28, 2011 1:03 AM
    This question was posted in October 2011, when C++ AMP was at the Developer Preview stage. Since then it reached Beta in February 2012 and RC in May 2012. Also since then, we have made progress on the libraries front, which is the question raised here. In case someone encounters this thread now, please find a list of libraries in our corresponding blog post:



    Sunday, June 24, 2012 5:22 AM