Historically, the use of graphics cards for scientific computing has yielded great performance, with order-of-magnitude gains over the CPU having been reported (Owens et al. 2008). This is mainly due to their massively parallel architecture. In this thesis, I extend an existing single-node, single-GPU shallow water simulator (Brodtkorb et al. 2012) to utilize a parallel environment composed of multiple nodes and multiple graphics cards, enabling larger and faster simulations. For this purpose, a row domain decomposition technique is implemented, dividing the global domain into several subdomains that are distributed among the processing units. To reduce communication latency between the units, a technique called Ghost Cell Expansion is also implemented. The main work of the thesis investigates load-balancing the workload between the units in order to achieve more efficient multi-GPU/multi-node simulations. Balancing the workload appropriately poses several challenges. For example, one needs to take into account the computational power of each graphics card to determine the correct amount of work to assign to it. One should also account for the underlying water placement, i.e., the wet and dry cells of the domain, which changes throughout the simulation run. For this purpose, dynamic auto-tuning techniques are demonstrated and discussed. The implementation has been applied to the shallow water equations, but is general and will work equally well for any system of conservation laws. Finally, the implementation is thoroughly benchmarked in both multi-node and multi-GPU parallel environments.
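To make the load-balancing idea concrete, the following is a minimal sketch (not the thesis implementation) of a static row partitioning step: rows of the global domain are assigned to each GPU in proportion to its measured throughput. The function name and the throughput figures are hypothetical, chosen only for illustration.

```python
def partition_rows(total_rows, throughputs):
    """Assign rows of a row-decomposed domain to devices in proportion
    to each device's relative throughput; the counts sum to total_rows."""
    total = sum(throughputs)
    # Ideal (fractional) share of rows per device.
    shares = [total_rows * t / total for t in throughputs]
    rows = [int(s) for s in shares]
    # Hand leftover rows to the devices with the largest fractional parts.
    leftover = total_rows - sum(rows)
    order = sorted(range(len(shares)),
                   key=lambda i: shares[i] - rows[i], reverse=True)
    for i in order[:leftover]:
        rows[i] += 1
    return rows

# Example: a 1000-row domain split across three GPUs of unequal speed.
print(partition_rows(1000, [1.0, 2.0, 1.5]))  # → [222, 445, 333]
```

A dynamic auto-tuner of the kind discussed in the thesis would additionally re-measure throughput (and the wet/dry distribution) during the run and repartition when the imbalance exceeds some threshold; the static proportional split above is only the starting point.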