Abstract
Differences in test performance can be related to item format, a controversial issue that poses a problem for the validity of test-score interpretations. Using the mathematics items of PISA 2018, organized in rotated booklets and answered by 3122 Norwegian students (50.54% male), we applied explanatory IRT models (treating a person's response to an item as the outcome measure) to examine whether, and to what extent, people of the same ability had the same probability of answering a multiple-choice item and a constructed-response item correctly; we additionally explored whether this potential difference was the same for males and females. Format accounted for 11.5% of the differences in item difficulty, while gender accounted for <1% of the differences in ability. A constructed-response item had approximately three times lower odds of being answered correctly than a multiple-choice item (-1.09 (.37), p = .003; OR = 0.33) when comparing people of the same ability and gender. The format difference was slightly larger for males than for females. The hardest items were given only in a constructed-response format, with no equivalent multiple-choice counterpart, and the variance in difficulty for constructed-response items was three times larger than for multiple-choice items. We discuss potential explanations and implications for the interpretations educational stakeholders give to these data, and argue that such empirical findings underline the importance of improving item development.
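The reported odds ratio follows from exponentiating the logit coefficient. A minimal sketch of that conversion (the Wald-style 95% interval is our assumption, computed from the reported standard error; exponentiating the rounded coefficient gives ≈0.34, while the abstract's 0.33 presumably reflects the unrounded estimate):

```python
import math

# Values quoted in the abstract
beta = -1.09   # logit coefficient for constructed-response (vs. multiple-choice) format
se = 0.37      # reported standard error

# Odds ratio and a standard Wald 95% confidence interval (our assumption)
odds_ratio = math.exp(beta)
ci_low = math.exp(beta - 1.96 * se)
ci_high = math.exp(beta + 1.96 * se)

print(f"OR = {odds_ratio:.2f}")                      # roughly 1/3: "three times lower odds"
print(f"95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```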