Modern graphics processing units (GPUs) are powerful parallel processors, capable of running thousands of concurrent threads. While originally limited to graphics processing, newer generations can be used for general-purpose computing (GPGPU). Through frameworks such as nVidia's Compute Unified Device Architecture (CUDA) and OpenCL, GPU programs can be written in established programming languages such as C and C++ (with minor extensions). The extent of GPU deployment, the low cost of entry and the high performance make GPUs an attractive target for workloads formerly reserved for supercomputers or special-purpose hardware. While the programming languages are similar, the hardware architecture differs significantly from that of a CPU. In addition, the GPU is connected through a comparably slow interconnect, the PCI Express bus. Hence, it is easy to fall into performance pitfalls if these characteristics are not taken into account.
In this thesis, we have investigated the performance pitfalls of an H.264 encoder written for nVidia GPUs. More specifically, we looked into the interaction between the host CPU and the GPU. We did not focus on optimizing the GPU code, but rather on how execution and communication were handled by the CPU code. As much manual labour is required to optimize GPU code, it is easy to neglect the CPU part of accelerated applications.
Through our experiments, we have looked into multiple issues in the host application that can affect performance. By moving IO operations into separate host threads, we masked the latencies associated with reading input from secondary storage. By analyzing the state shared between the host and the device, we were able to reduce the time spent synchronizing data by only transferring actual changes. Using CUDA streams, we further enhanced our work on input prefetching by transferring input frames to device memory in parallel with the encoding. We also experimented with concurrent kernel execution to perform preprocessing of future frames in parallel with encoding. While we only touched upon the possibilities of concurrent kernel execution, the results were promising.
Our results show that a significant improvement can be achieved by focusing optimization effort on the host part of a GPU application. To reach peak performance, the host code must be designed for low latency in job dispatching and GPU memory management; otherwise, the GPU will idle while waiting for more work. With the rapid advancement of GPU technology, this trend is likely to escalate.