The item response theory (IRT) framework is commonly used for scoring and equating in high-stakes tests that classify individuals into performance categories. However, IRT model misspecification, such as assuming a unidimensional IRT model for multidimensional items, can undermine these advantages, making sum score-based equating techniques more desirable. We conducted a Monte Carlo simulation to investigate the classification accuracy of IRT score-based and sum score-based equating methods under model misspecification. The equating methods used in this study were the Haebara method with the expected a posteriori (EAP) ability estimator (IRT score-based equating), and chained log-linear pre-smoothed kernel equating and chained IRT observed-score kernel equating (sum score-based equating). We simulated three realistic test settings based on the Australian Citizenship test, a German university psychology exam, and the Swedish Scholastic Aptitude Test (SweSAT). There were three main findings. First, the IRT score-based equating method had higher classification accuracy than the sum score-based methods in most test situations when there was no model misspecification. Second, a single cutoff score yielded higher classification accuracy in all test settings, whereas classification accuracy decreased substantially when four or 21 proficiency levels were used. Third, model misspecification reduced classification accuracy across all test settings and cutoff score settings, especially for tests with 21 proficiency levels. Test agencies should preferably use IRT equating methods due to their greater capacity to classify participants correctly, while minimizing the number of proficiency levels, especially if test items are multidimensional.