Abstract
Data preprocessing is a crucial step in data analysis. Preparing data for analysis may require a range of operations on a given dataset, including formatting, modification, extraction and enrichment. Data scientists and data analysts need to prepare their data before feeding it to Machine Learning (ML) algorithms to find useful insights, and the quality of the data used to train ML algorithms is one of the challenges they face. A tool that facilitates data preparation intelligently can significantly reduce the time needed to create a rich dataset for analysis. Since a large variety of data transformations could potentially be applied to a tabular dataset, it is typically more convenient for a user to work with a system that recommends the subset of transformations most relevant to the given dataset, taking into account possible user preferences. That subset can be derived from data about user interactions, user preferences, and the types of data in the dataset. This thesis explores approaches for suggesting transformations that are of interest to users in the context of tabular data preprocessing. The proposed approach is realized in an application prototype that uses ML algorithms to provide the user with the recommendations best suited to her needs. When predicting transformation suggestions, the prototype takes into account user feedback on the usability of recommendations as well as data collected from actions users have performed in the past. The thesis also discusses common transformation suggestions provided by state-of-the-art Extract-Transform-Load (ETL) tools and evaluates how well the reference application infers transformations based on user interactions.
The application prototype was validated and tested in terms of effectiveness, efficiency and satisfaction to determine how useful intelligently generated data transformation recommendations are to potential users. The validation consisted of a usability study, comprising a test and a survey, to evaluate the prototype and to identify limitations of the proposed approach. Based on this evaluation, we argue that a recommender system for data transformations is useful for making data preprocessing more efficient.