1.1 Background

Matrix and vector operations make up a substantial part of the scientific computing workload, and have been the subject of much work and many optimisations in order to increase performance and efficiency. The introduction of parallel computing and distributed data has complicated the work required to achieve performance gains, and several libraries have been written in an attempt to hide much of the detail and difficulty involved in high-performance parallel programming.

Parallel computing brought many new concepts and challenges to computer science. For example, the gain from using several processors to do the work previously done by one is defined as speedup (Equation 1.1), where $T_p$ is the time the parallel implementation on $p$ processors takes to complete the same task that the serial implementation completes in time $T_s$:

\[
S_p = \frac{T_s}{T_p} \tag{1.1}
\]

Speedup equal to $p$ (the number of processors) is called linear speedup, and implies that doubling the number of processors halves the wall-time required to complete the task; for instance, a task that takes $T_s = 120$ s serially and $T_p = 30$ s on $p = 4$ processors has $S_p = 120/30 = 4 = p$. This is considered good speedup, but it is difficult to achieve because of the communication between the processors, which comes in addition to the computations they had to do in the first place. Sometimes super-linear speedup ($S_p > p$) is observed, because splitting the domain over several processors can make each sub-domain fit in a higher level of cache on its processor.

One industry-standard library for inter-processor communication is the Message Passing Interface (MPI). To minimise the cost of communication, MPI supports several communication methods; which one is best suited depends on the situation.
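As a minimal sketch of two such communication methods (the specific calls and values here are chosen for illustration and are not taken from the rest of this thesis): a blocking send, which returns once the send buffer may be reused, and a non-blocking receive, which is posted early so that independent computation can overlap the communication.

/* Requires at least two processes, e.g. mpirun -np 2 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {
        if (rank == 0) fprintf(stderr, "run with at least 2 processes\n");
        MPI_Finalize();
        return 1;
    }

    double x = 0.0;
    if (rank == 0) {
        x = 3.14;
        /* Blocking point-to-point send to rank 1. */
        MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Request req;
        /* Non-blocking receive: post it, do other work, then wait. */
        MPI_Irecv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);
        /* ... independent computation could run here ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        printf("rank 1 received %f\n", x);
    }

    MPI_Finalize();
    return 0;
}

The non-blocking variant is one way to hide communication cost behind computation, which is precisely the trade-off the choice of communication method is meant to address.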
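To connect Equation 1.1 to practice, the sketch below times a parallel run with MPI_Wtime to obtain $T_p$; the workload (summing an index range, split over the ranks and combined with a collective MPI_Allreduce) and the problem size are placeholders chosen for illustration. $T_s$ would be measured with a separate one-process run, and $S_p$ computed as $T_s / T_p$.

#include <mpi.h>
#include <stdio.h>

#define N 10000000L  /* total problem size, chosen for illustration */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank sums its share of the index range [0, N). */
    long chunk = N / size;
    long lo = rank * chunk;
    long hi = (rank == size - 1) ? N : lo + chunk;

    MPI_Barrier(MPI_COMM_WORLD);   /* align the start of timing */
    double t0 = MPI_Wtime();

    double local = 0.0;
    for (long i = lo; i < hi; i++)
        local += (double)i;

    /* Collective: combine the partial sums on every rank. */
    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    double tp = MPI_Wtime() - t0;  /* this run's T_p */
    if (rank == 0)
        printf("p = %d, T_p = %f s, sum = %f\n", size, tp, global);

    MPI_Finalize();
    return 0;
}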