Abstract
This thesis proposes deep learning architectures for sound event detection that operate fully end-to-end on raw audio input and can be compared directly against models that use fixed time-frequency representations, such as spectrograms, as input. The primary objective is to assess the effectiveness of raw audio input relative to these conventional fixed representations. To this end, pairs of matched models based on convolutional recurrent neural networks, an architecture commonly used in sound event detection, are trained on either raw audio or fixed time-frequency representations, enabling a direct comparison. The findings show that the proposed architectures operating on raw audio achieve performance comparable to that of models based on fixed time-frequency representations. Moreover, in applications where high temporal resolution is important, the raw audio architectures outperform their fixed-representation counterparts. This highlights the potential of end-to-end deep learning on raw audio for capturing the fine-grained temporal information critical for accurate sound event detection.
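To make the paired setup concrete, below is a minimal sketch of one such model pair, assuming PyTorch and torchaudio; the class names, layer sizes, and front-end parameters (convolution kernel size and stride, mel settings) are illustrative assumptions, not the configurations used in the thesis. The two variants share the same convolutional recurrent back-end and differ only in the front-end: a learned strided 1-D convolution over the waveform versus a fixed mel-spectrogram transform.

import torch
import torch.nn as nn
import torchaudio

class CRNNBackend(nn.Module):
    """Shared convolutional-recurrent back-end over (batch, channels, frames)."""
    def __init__(self, in_channels, n_classes):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels, 128, kernel_size=3, padding=1),
            nn.BatchNorm1d(128),
            nn.ReLU(),
        )
        self.gru = nn.GRU(128, 64, batch_first=True, bidirectional=True)
        self.head = nn.Linear(128, n_classes)

    def forward(self, x):                       # x: (B, C, T)
        x = self.conv(x).transpose(1, 2)        # -> (B, T, 128)
        x, _ = self.gru(x)                      # -> (B, T, 128)
        return torch.sigmoid(self.head(x))      # frame-wise event activities

class RawAudioCRNN(nn.Module):
    """End-to-end variant: a learned strided 1-D convolution replaces the
    fixed time-frequency transform (sizes are illustrative assumptions)."""
    def __init__(self, n_classes):
        super().__init__()
        self.frontend = nn.Conv1d(1, 64, kernel_size=400, stride=160)
        self.backend = CRNNBackend(64, n_classes)

    def forward(self, wav):                     # wav: (B, 1, samples)
        return self.backend(torch.relu(self.frontend(wav)))

class SpectrogramCRNN(nn.Module):
    """Baseline variant: fixed mel-spectrogram front-end."""
    def __init__(self, n_classes, n_mels=64):
        super().__init__()
        self.frontend = torchaudio.transforms.MelSpectrogram(
            sample_rate=16000, n_fft=400, hop_length=160, n_mels=n_mels)
        self.backend = CRNNBackend(n_mels, n_classes)

    def forward(self, wav):                     # wav: (B, 1, samples)
        feats = self.frontend(wav).squeeze(1)   # -> (B, n_mels, frames)
        return self.backend(torch.log(feats + 1e-6))

# Both models map a one-second batch of waveforms to frame-wise
# class probabilities of shape (batch, frames, n_classes):
wav = torch.randn(8, 1, 16000)
raw_out = RawAudioCRNN(n_classes=10)(wav)
spec_out = SpectrogramCRNN(n_classes=10)(wav)

Keeping the back-end identical in structure isolates the effect of the input representation, which mirrors the pairwise comparison described above.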