We live in the information age. The development of tools contributing to more efficient use of the Internet, the world’s largest source of information, is therefore important. This thesis contributes to this development by providing an algorithm capable of dividing web pages into semantically related parts called objects, and then group similar objects into partitions. Intuitively the algorithm tries to grasp what we humans do when we make sense of a web page, by exploiting the structural information contained inside an HTML document.
There are several ways to visually partition a web page due to that it is a question of human perception. The first part of this thesis therefore details visual partitioning and defines a specific partitioning method that I have called object partitioning. The nature of object partitioning is visual. Computers cannot work with abstract visual concepts, so a translation of object partitioning to structural terms is needed. The translation is not trivial however, and I have therefore devoted a fair amount of this thesis to this translation. The main result of this thesis is an algorithm that implicitly solves the visual object partitioning problem by solving the structural counterpart. The core of the object partitioning algorithm can be reduced to two sub challenges: calculation of a concept I have called the maximal repeating weighted substring (MRWS) in a string, and calculation of a top-down distance between trees. To accomplishing the main result, this thesis proposes solutions to both these sub problems by further development of known solutions to similar problems.A central part of the development of the object partitioning algorithm is the creation of a simple term language capable of modeling HTML documents. Due to the need for processing HTML documents, a function that maps HTML documents into this term language is created. This filtering and mapping function makes the creation of the object partitioning algorithm easier due to that it removes all but the information needed in order to solve the object partitioning problem.