Large amounts of data, such as nucleotide and amino acid sequences, are regularly added to biological databases. Researchers use these data in various ways depending on their needs. A common task for many of them is sequence alignment, and maintaining high sensitivity makes the problem compute-intensive because of the methods applied.
A common strategy to improve performance for this problem is to parallelize it across multiple CPU cores. However, some jobs are too large for one machine to complete in reasonable time, so they are distributed across a computer cluster.
The ParAlign application is one tool for performing rapid sequence searches and alignments. It can utilize multiple cores in a computer and can also participate in a cluster job. In this thesis we have investigated certain areas that may influence the performance of this or similar applications.
Our work shows that the most promising area for improvement is the distribution of the workload among the nodes in a cluster. In some cases, the throughput for a given job nearly doubles with a dynamic distribution scheme.