Advances in technology are making multimedia data easier to generate and use, and such data is becoming a central part of everyday life. Databases form the base of many applications and are now required to manage these new types of data.
Traditional databases are not designed to manage multimedia data and must therefore be redesigned and reimplemented. Among the important issues to address are the large size of the data, the continuous nature of data such as audio and video, and the potentially complex relationships between different data elements.
One of the essential functions a database must provide is the ability to query the stored data. For speech data, Automatic Speech Recognition (ASR) provides a form of access to the content by automatically generating transcripts. The quality of these transcripts is currently more than sufficient for querying by content.
For speech data, however, it is not enough merely to identify a topic within a longer audio stream. Speech data is temporal, and it is not acceptable to present a user with a two-hour radio show in response to a query if only ten minutes are pertinent to that query. A method for segmenting speech data by topic of conversation is therefore needed.
For this thesis, an experiment was performed to test the validity of topic segmentation through ASR-generated transcripts. Streamed audio data was downloaded from the BBC World Service and divided into smaller segments of equal length. These segments were then transcribed by the ASR engine ViaVoice, and the resulting text files were queried for various topics known to be present in the audio data.
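The querying step of this experiment can be illustrated with a minimal sketch. It assumes the transcripts already exist as one text string per equal-length segment, in stream order. The function name locate_topic, the 60-second segment length, and the sample transcripts are hypothetical and are not taken from the thesis. The sketch does not call ViaVoice or any other ASR engine, and plain keyword matching stands in for whatever query mechanism was actually used.

```python
import re


def locate_topic(transcripts, segment_seconds, keywords):
    """Return (start, end) time ranges, in seconds, whose transcripts mention a keyword.

    transcripts: one transcript string per equal-length audio segment, in order.
    segment_seconds: length of each segment (an assumed parameter).
    keywords: words taken to indicate the topic (hypothetical query terms).
    """
    # Match whole words only, ignoring case.
    patterns = [
        re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
        for keyword in keywords
    ]
    ranges = []
    for index, text in enumerate(transcripts):
        if not any(pattern.search(text) for pattern in patterns):
            continue
        start = index * segment_seconds
        end = start + segment_seconds
        if ranges and ranges[-1][1] == start:
            # Adjacent matching segments are merged into one continuous range.
            ranges[-1] = (ranges[-1][0], end)
        else:
            ranges.append((start, end))
    return ranges


# Hypothetical transcripts for four consecutive 60-second segments.
sample_transcripts = [
    "the weather today will bring rain",
    "election results were announced this morning",
    "the election campaign continues in the capital",
    "in sports the football season has started",
]
election_ranges = locate_topic(sample_transcripts, 60, ["election"])
```

Under these assumptions, a query returns only the time ranges of the matching segments rather than the whole stream, and the segment length sets the granularity of the result.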
The results of the experiment indicate that topic segmentation of speech through ASR-generated transcripts is a valid approach. By querying the transcripts, the majority of the topics were correctly identified and located within the audio stream. There were no major retrieval errors that caused non-relevant data to receive undue attention. However, certain topics were lost in the transcripts. These included both topics whose keywords were entirely new words unknown to the engine and topics with well-established keywords.