Abstract
Today the demand for large-scale Machine Learning (ML) models is increasing. Training such models requires more and more hardware resources. Distributing ML training across multiple machines is a way to reduce training time; however, its effectiveness depends on efficient inter-machine communication. In this thesis, we have developed a proof-of-concept plugin for the NVIDIA Collective Communication Library (NCCL), a state-of-the-art collective operations library for NVIDIA GPUs. Our plugin is implemented using Dolphin NTB adapters, enabling inter-machine PCIe communication. We show that the choice of network interconnect affects distributed ML training time. With our plugin, the collective operation time becomes insignificant compared to the computation time when training ML models.