R.ROSETTA: a package for analysis of rule-based classification models

ROSETTA is a rough set-based classification toolkit that aims at identifying semantics from various data types. Here we present the R.ROSETTA package, which is an R wrapper of ROSETTA. The package significantly enhances the accessibility of the existing machine learning environment and the interpretability of the results. The ROSETTA functions have been enriched and improved by the incorporation of novel components targeting bioinformatics applications. Such improvements include: undersampling imbalanced datasets, estimation of the statistical significance of classification rules, retrieval of support sets, prediction of external data and integration with rule visualization frameworks. We tested the performance of R.ROSETTA on a complex dataset involving gene expression measurements for autistic and non-autistic young males. We demonstrated that R.ROSETTA facilitated the detection of novel gene-gene interactions. The results demonstrated the potential of R.ROSETTA classifiers to identify putative biomarkers and novel biological interactions.


Introduction
Rough set classification is a transparent machine learning technique that has been widely applied in various scientific areas (Kumar and Inbarani, 2018; Zhang et al., 2014). The rough set methodology creates rule-based classification models that consist of minimal sets of IF-THEN rules that uncover interactions among variables (Pawlak, 1982). The ROSETTA software is an implementation of rule-based classification modeling (Øhrn and Komorowski, 1997). The framework has been used to address biology-related problems (e.g. Gil-Herrera et al., 2011; Komorowski, 2014; Setiawan et al., 2009). Here we present a more accessible and flexible implementation of ROSETTA in the form of an R package. R.ROSETTA substantially extends the functionality of the existing software towards analyzing complex and ill-defined bioinformatics datasets. Among other features, we have implemented functions (Figure 1) for undersampling, rule p-value estimation, class prediction, support set retrieval and rule visualization. To evaluate R.ROSETTA performance, we explored rule-based models for a complex dataset of gene expression levels measured for autistic and non-autistic samples (Supplementary Table S1).
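To make the notions of an IF-THEN rule and its support set concrete, the following minimal Python sketch evaluates a single rule on a toy table of discretized expression levels. It is an illustration only and not the R.ROSETTA interface: the names `Rule`, `support_set` and `rule_accuracy`, the features `geneA` and `geneB`, and the data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    """Hypothetical IF-THEN rule: IF all conditions hold THEN decision."""
    conditions: dict  # feature name -> required discretized level
    decision: str     # predicted decision class

    def matches(self, sample: dict) -> bool:
        # The IF-part holds when every condition agrees with the sample.
        return all(sample.get(f) == v for f, v in self.conditions.items())


def support_set(rule, samples, labels):
    """Indices of samples that satisfy the IF-part and carry the rule's decision."""
    return [i for i, (s, y) in enumerate(zip(samples, labels))
            if rule.matches(s) and y == rule.decision]


def rule_accuracy(rule, samples, labels):
    """Fraction of samples matching the IF-part that belong to the rule's decision class."""
    matched = [y for s, y in zip(samples, labels) if rule.matches(s)]
    if not matched:
        return 0.0
    return sum(y == rule.decision for y in matched) / len(matched)


# Toy data (invented for illustration).
samples = [
    {"geneA": "up", "geneB": "down"},
    {"geneA": "up", "geneB": "down"},
    {"geneA": "up", "geneB": "up"},
    {"geneA": "down", "geneB": "down"},
    {"geneA": "up", "geneB": "down"},
]
labels = ["autism", "autism", "control", "control", "control"]

rule = Rule({"geneA": "up", "geneB": "down"}, "autism")
print(support_set(rule, samples, labels))    # indices of supporting samples
print(rule_accuracy(rule, samples, labels))  # accuracy of the rule on the toy table
```

Here the rule matches three samples, two of which are labelled with the rule's decision class, so its support set contains two samples and its accuracy is 2/3.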

Implementation
R.ROSETTA was implemented under R (R Core Team, 2018) version 3.5.2, and the open-source R package is available on GitHub at https://github.com/mategarb/R.ROSETTA. The R.ROSETTA package is a wrapper (Supplementary Note - Package architecture) around the command-line version of the ROSETTA system. In contrast to ROSETTA, R.ROSETTA is a cross-platform application with additional functionalities (Figure 1). In the supplementary material, we provide an overview of the main upgrades (Supplementary Note - Main upgrades) introduced in R.ROSETTA.
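One of the added functionalities, undersampling of imbalanced datasets, can be sketched generically as follows. This is a plain random undersampling scheme written in Python under our own assumptions; the strategy actually used by R.ROSETTA is described in the supplementary material and may differ, and the function name `undersample` is hypothetical.

```python
import random
from collections import defaultdict


def undersample(labels, seed=0):
    """Return sorted indices of a class-balanced subset (hypothetical helper).

    Every class is randomly reduced to the size of the smallest class.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    n_min = min(len(idx) for idx in by_class.values())
    keep = []
    for idx in by_class.values():
        # Sample without replacement so that no object is duplicated.
        keep.extend(rng.sample(idx, n_min))
    return sorted(keep)


# Toy imbalanced labels (invented for illustration): 6 vs 3 objects.
labels = ["autism"] * 6 + ["control"] * 3
balanced = undersample(labels)
print([labels[i] for i in balanced])
```

After undersampling, both classes contribute three objects each, so a classifier trained on the subset is not dominated by the majority class.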

Results
We examined gene expression levels (Alter et al., 2011) of autistic and non-autistic (control) male children (Supplementary Table S1). The dataset was preprocessed (Supplementary Note - Data preprocessing) and corrected for the effect of subject age (Supplementary Figure S1). In the next step, we employed the Fast Correlation-Based Filter dimensionality reduction method (Novoselova et al., 2018) (Figure S2).
We constructed models (Supplementary Note - Classification) in R.ROSETTA for the Johnson and Genetic reducers with 82% and 91% accuracy, respectively (Supplementary Table S4, Supplementary Table S5). Even though the overall performance of the Genetic algorithm was better than that of Johnson's, its tendency to generate numerous rules reduced the significance of individual rules after correcting for multiple testing (Supplementary Figure S3; Supplementary Table S4).
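The effect of rule count on significance can be illustrated with a short Python sketch. We assume here, purely for illustration, that a rule p-value is obtained from a one-sided hypergeometric test and that a Bonferroni correction is applied; the text above does not state which test or correction was used, and the function names and numbers below are hypothetical.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) for X ~ Hypergeometric(N objects, K in the decision class, n drawn).

    k: objects matching the rule that belong to its decision class
    n: objects matching the IF-part of the rule
    K: objects in the decision class
    N: all objects
    """
    upper = min(n, K)
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def bonferroni(pvalues):
    """Bonferroni adjustment: multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


# Toy rule: matches 4 of 10 objects, all 4 from a class of size 5.
p = hypergeom_pvalue(k=4, n=4, K=5, N=10)

# The same raw p-value in a small model (10 rules) and a large model (100 rules).
small_model = bonferroni([p] * 10)[0]
large_model = bonferroni([p] * 100)[0]
print(p, small_model, large_model)
```

With the same raw p-value, the adjusted value grows in proportion to the number of rules in the model, which is why a reducer that produces many rules can yield fewer individually significant rules despite higher overall accuracy.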
We selected highly significant rules from the Johnson model to identify the most important genes for each decision class separately. Among the 15 genes occurring in the rules, 10 are likely to be associated with autism (Supplementary Table S6). For example, elevated expression of COX2 has previously been linked to autism (Yoo et al., 2008), and TSPOAP1 has been associated with autism through a deletion in an exonic locus (Bucan et al., 2009). Furthermore, we identified genes related to calcium homeostasis control, such as NCS1 and SCIN. Previous studies (Palmieri et al., 2010) have demonstrated that calcium homeostasis is altered in autism disorders. We detected that expression of the antisense RNA of TMLHE (TMLHE-AS1) is down-regulated (Supplementary Figure S2). The TMLHE gene is a well-known risk factor for autism (Celestino-Soper et al., 2011). We also identified a zinc finger gene, ZFP36L2. The association of zinc fingers with autism was previously described by the dataset authors (Alter et al., 2011).
Finally, the two strongest interactions (Supplementary Table S7) for autism-related rules contained genes previously linked to autism or its symptoms. Our study showed (Supplementary Figure S4) that up-regulated expression of TSPOAP1 was associated with up-regulated expression of PSMG4. The second interaction in the autism class consists of unchanged expression of NCS1 and down-regulated expression of CSTB. Reduced expression of CSTB has been linked to the mechanism of pathogenesis in epilepsy (Lalioti et al., 1997).

Conclusions
R.ROSETTA is a tool that gathers fundamental components of statistics for rule-based modeling. Additionally, the package provides hypotheses about potential interactions between features that discern phenotypic classes.