This paper argues that the common practice of benchmarking is inadequate as a scientific evaluation methodology. It further attempts to introduce the empirical tradition of the physical sciences by applying techniques from Statistical Design of Experiments to the example of SPARQL endpoint performance evaluation. It does so by studying full as well as fractional factorial experiments designed to evaluate an assertion that some change introduced in a system has improved performance. This paper does not present a finished experimental design; rather, its main aim is didactic: to shift the focus of the community away from benchmarking and towards higher scientific rigor.
The Semantic Web – ISWC 2013. Lecture Notes in Computer Science, Volume 8219, 2013, pp. 360-375. The final publication is available at Springer.