RuleGO is a web-based application for describing gene groups using decision rules based on Gene Ontology terms. It takes two lists of genes as input: a list of genes to be described and a reference list of genes. The result is a list of rules that describe the input list of genes using conjunctions of Gene Ontology terms. Each rule has the following meaning:
if a gene is described by a conjunction of gene ontology terms appearing in a rule premise, then it belongs to an analyzed group of genes
The rules have a statistical significance level determined by the user and are sorted according to the ranking given by a rule quality measure. The rules also account for co-occurrence of terms in a given gene group, and the method guarantees that the co-occurrence is not trivial (for example, resulting from the hierarchy of the ontology graph).
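The meaning of a rule can be illustrated with a minimal Python sketch. The gene names, term identifiers, and the function `genes_matching` are hypothetical and serve only as an illustration; this is not the RuleGO implementation.

```python
from typing import Dict, FrozenSet, Set

# Hypothetical toy annotations: gene symbol -> set of GO term identifiers.
ANNOTATIONS: Dict[str, Set[str]] = {
    "geneA": {"GO:t1", "GO:t2", "GO:t3"},
    "geneB": {"GO:t1", "GO:t2"},
    "geneC": {"GO:t2", "GO:t3"},
    "geneD": {"GO:t1"},
}


def genes_matching(premise: FrozenSet[str],
                   annotations: Dict[str, Set[str]]) -> Set[str]:
    """Return the genes annotated with every GO term in the rule premise."""
    return {gene for gene, terms in annotations.items() if premise <= terms}


# A rule premise is a conjunction of GO terms.
premise = frozenset({"GO:t1", "GO:t2"})
primary_set = {"geneA", "geneB"}

matched = genes_matching(premise, ANNOTATIONS)
# Genes from the analyzed group that satisfy the premise.
supported = matched & primary_set
print(sorted(matched), len(supported))
```

Here a gene satisfies the premise only if it carries all terms of the conjunction, which is why `geneD` (annotated with one of the two terms) is not matched.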
The rule generation process consists of the following steps:
Each of these steps can be controlled by the user by setting appropriate parameter values. Click on the chart below to see a diagram describing the whole process and the names of the attributes involved in each of the above steps.
This is a list of gene symbols for which rules are to be generated. Genes in the list should be separated by a comma or a colon, or each gene symbol should be placed on a separate line. A list of allowed gene symbols is available here.
This is a list of gene symbols used as a reference set. The user can paste their own list of genes or choose the Rest of genome option, which uses the rest of the genome of the selected organism as the reference set.
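The accepted input format can be sketched as a small parser. This is a minimal illustration assuming that the separators are exactly the ones listed above; the function name `parse_gene_list` and the removal of repeated symbols are assumptions, not documented RuleGO behavior.

```python
import re
from typing import List


def parse_gene_list(text: str) -> List[str]:
    """Split pasted text on commas, colons, or line breaks.

    Blank entries are dropped. Repeated symbols are also dropped
    (an assumption of this sketch), keeping the first occurrence.
    """
    symbols = [s.strip() for s in re.split(r"[,:\r\n]+", text)]
    seen = set()
    result = []
    for symbol in symbols:
        if symbol and symbol not in seen:
            seen.add(symbol)
            result.append(symbol)
    return result


print(parse_gene_list("TP53, BRCA1:MYC\nEGFR\n\nTP53"))
```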
To evaluate the statistical significance of the created rules, a hypergeometric test is used. Assuming that nφ is the number of genes described by the gene ontology terms appearing in a rule premise and nψ is the number of genes belonging to the primary set, we can create the following contingency table:
where:
Using the information from the above contingency table, we calculate the p-value of the hypergeometric test using the following formula:
In the current version of the application, only a hypergeometric test for over-representation of gene ontology terms in the primary set is available. A one-sided (right-sided) test is used because, for descriptive purposes, we are interested only in conjunctions of gene ontology terms that occur more frequently than would result from a random assignment of gene ontology terms to the genes composing the group. For each rule, a corrected p-value is also provided, based on the Benjamini and Hochberg method for false discovery rate control.
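A right-sided hypergeometric test and the Benjamini and Hochberg correction can be sketched in plain Python as follows. The sketch uses the standard textbook forms of both procedures; the function and parameter names are assumptions, and the exact computation performed by RuleGO may differ in detail.

```python
from math import comb
from typing import List


def hypergeom_right_tail(k: int, n_phi: int, n_psi: int, n_total: int) -> float:
    """P(X >= k) for a hypergeometric variable.

    k       -- genes matching the rule premise AND in the primary set
    n_phi   -- genes matching the rule premise (primary + reference)
    n_psi   -- genes in the primary set
    n_total -- all genes (primary + reference)
    """
    upper = min(n_phi, n_psi)
    favourable = sum(
        comb(n_psi, i) * comb(n_total - n_psi, n_phi - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(n_total, n_phi)


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return BH-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


print(hypergeom_right_tail(k=2, n_phi=3, n_psi=4, n_total=10))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Exact integer binomials (`math.comb`) are used here to keep the example free of external dependencies; for large gene sets a statistics library would normally be preferred.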
The Gene Ontology consortium provides a structured and controlled vocabulary that is used to describe genes and their products independently of species. The GO database is organized into three disjoint directed acyclic graphs (DAGs) describing biological process (BP), molecular function (MF) and cellular component (CC). Each node of a graph is called a GO term; it is a single unit that describes some known biological process or function of a gene. The dependencies between GO terms are hierarchical: as a DAG is traversed from the root to its leaves, the terms go from general concepts to more specific ones. The Gene Ontology section allows selecting which ontology should be used to annotate the analyzed group of genes. You can select any combination of the biological process, molecular function and cellular component aspects.
The hierarchical structure of the ontology database allows biological knowledge to be represented at multiple levels of detail. Terms at higher levels (closer to the root) describe more general functions or processes, while terms at lower levels are more specific. To preserve the clarity of the ontology, the annotation files available at the Gene Ontology consortium website include only "original" annotations, that is, annotations that were assigned to particular GO terms by curators. The annotations resulting from the "true path rule" (annotation of a gene to all parent nodes of its term) are not included in the annotation files. If the hierarchical annotations option is checked, the true path rule is applied.
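Applying the true path rule amounts to extending each gene's annotations with all ancestors of its original terms. A minimal sketch, assuming the ontology is given as a hypothetical `parents` mapping (term to its direct parents); the names and the toy graph are illustrative only.

```python
from typing import Dict, Set


def ancestors(term: str, parents: Dict[str, Set[str]],
              cache: Dict[str, Set[str]]) -> Set[str]:
    """Collect all ancestors of a term in a DAG, memoizing results."""
    if term not in cache:
        found: Set[str] = set()
        for parent in parents.get(term, ()):
            found.add(parent)
            found |= ancestors(parent, parents, cache)
        cache[term] = found
    return cache[term]


def apply_true_path_rule(direct: Dict[str, Set[str]],
                         parents: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Extend each gene's direct annotations with all ancestor terms."""
    cache: Dict[str, Set[str]] = {}
    return {
        gene: set(terms).union(*(ancestors(t, parents, cache) for t in terms))
        for gene, terms in direct.items()
    }


# Hypothetical toy ontology: "leaf" has two parents that share one root.
parents = {"leaf": {"mid1", "mid2"}, "mid1": {"root"}, "mid2": {"root"}}
direct = {"geneA": {"leaf"}, "geneB": {"mid1"}}
print(apply_true_path_rule(direct, parents))
```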
This option allows excluding IEA annotations from the analysis.
This option is used to set the minimal and maximal level of the GO terms that are used for description.
This is a list of important GO terms that can be used to build the rules. The provided terms are used as "seed terms", and each rule will include at least one GO term from the provided list.
During the filtration process, rules that include longer combinations of GO terms from the provided list are ranked higher, and the RuleGO method tries not to remove them from the output set of rules.
This option can be particularly useful when analyzing a set of genes from experiments related to particular biological processes or functions. For example, when analyzing a set of genes from tumor samples, one may be interested in annotations related to the so-called hallmarks of cancer. To facilitate such analysis, several predefined lists of GO terms are also provided. These terms are related to important mechanisms altered in cells that develop cancer. If the option Use only above terms to create rules is selected, the RuleGO algorithm will generate rules using only these terms. In this case we suggest providing a longer list of GO terms. If the list is short, it is very likely that no statistically significant combinations of GO terms will be generated.
This option allows the user to provide a list of GO terms which should be excluded from analysis.
This section allows setting the options of the rule generation algorithm.
This option eliminates from the analysis GO terms that describe fewer genes than the given number. The value of this parameter cannot be lower than the minimal support value. This setting can strongly affect computation time, so setting it below 3 is not recommended.
This option is used to set the maximal number of elements included in a rule premise. Increasing the value of this parameter results in more specific rules (rules covering fewer genes). One could expect that increasing the value of this parameter also increases the number of output rules. However, this is true only up to a certain value, because of the restrictions applied to the generated rules (i.e., statistical significance and the minimal number of genes described by a rule).
This option is used to set the minimal number of genes that a generated rule must describe. Only rules describing at least the number of genes defined in the Minimal support option are presented to the user. The value of this parameter cannot be greater than the value of the minimal number of genes described by GO term. This setting can strongly affect computation time, so setting it below 3 is not recommended.
This option is used to set the maximal number of generated rules. Only the N best rules, where N is a number defined by the user, are presented. The quality criterion is defined by the user in the Output rules order section.
Please take into consideration that the algorithm settings may strongly affect computation time. By clicking the chart below you can see how different settings of the rule generation parameters influence computation time. The analyses were performed for a fixed p-value and for several different values of the minimal support parameter.
Filtration is the last step of the analysis. It extracts only the best and most interesting rules from the set of all generated rules.
The filtration algorithm is executed in a loop. Beginning with the best rule in the ranking (the reference rule), all rules covering the same set of genes or a subset of it are candidates for removal from the resulting rule set. However, before any rule is removed, its similarity to the reference rule is verified. If its similarity to the reference rule exceeds a threshold defined by the user, the rule is removed from the set of determined rules; otherwise it remains in the output rule set.
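The loop described above can be sketched in Python as follows. This is an illustration only: the rule similarity used by RuleGO is defined in a later section, and the Jaccard index over premise terms used here is a stand-in assumption, as are all function and variable names.

```python
from typing import Callable, Dict, FrozenSet, List, Set

Rule = FrozenSet[str]  # a rule is represented by its premise terms


def jaccard(a: Rule, b: Rule) -> float:
    """Stand-in similarity (an assumption, NOT the RuleGO formula)."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def filter_rules(ranked: List[Rule],
                 coverage: Dict[Rule, Set[str]],
                 similarity: Callable[[Rule, Rule], float],
                 threshold: float) -> List[Rule]:
    """Keep the best rule, drop covered rules that are too similar, repeat."""
    kept: List[Rule] = []
    remaining = list(ranked)  # best rule first
    while remaining:
        reference = remaining.pop(0)
        kept.append(reference)
        survivors = []
        for rule in remaining:
            # Candidate for removal: covers the same genes or a subset.
            is_candidate = coverage[rule] <= coverage[reference]
            if is_candidate and similarity(rule, reference) > threshold:
                continue  # removed from the output set
            survivors.append(rule)
        remaining = survivors
    return kept


r1, r2, r3 = frozenset({"a", "b"}), frozenset({"a", "b", "c"}), frozenset({"d"})
cov = {r1: {"g1", "g2", "g3"}, r2: {"g1", "g2"}, r3: {"g1"}}
print(filter_rules([r1, r2, r3], cov, jaccard, threshold=0.5))
```

In this toy run, the second rule is dropped because it covers a subset of the reference rule's genes and is similar to it, while the third rule covers a subset as well but is dissimilar and therefore stays.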
The parameters that influence the results of filtration are:
The user can choose between two quality measures: p-value and compound quality measure. P-value is computed based on the hypergeometric test as described in Statistical Test section.
Compound quality measure is computed as the product of the three component measures:
where: mWS(r) - is denoted as rule quality option and is a rule quality computed using the following equation:
where:
where:
According to the user requirements any element of the compound quality measure can be removed from the measure by deselecting its corresponding checkbox. For example, if the user is interested in obtaining the rules which include many GO terms in their premises, he or she can deselect rule quality and ontology level checkboxes in Rules filtration section.
Similarity of two rules (ri and rj) is computed according to the following formula:
where:
A GO term a from the rule ri is considered unique if it does not occur directly in the rule rj and there is no path in the GO graph that includes both the term a and any term b from the premise of rule rj.
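The uniqueness condition can be sketched as an ancestor check in both directions. The sketch assumes that "a path that includes both terms" means one term is an ancestor of the other; the `parents` mapping, the toy graph, and the function names are hypothetical.

```python
from typing import Dict, Set


def ancestors_of(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return all ancestors of a term using an iterative traversal."""
    seen: Set[str] = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def is_unique(term: str, other_premise: Set[str],
              parents: Dict[str, Set[str]]) -> bool:
    """True if `term` is neither in the other premise nor on a common
    root-to-leaf path with any of its terms."""
    if term in other_premise:
        return False
    term_ancestors = ancestors_of(term, parents)
    for other in other_premise:
        if other in term_ancestors or term in ancestors_of(other, parents):
            return False
    return True


# Hypothetical toy graph: child -> parent -> root, and other -> root.
go_parents = {"child": {"parent"}, "parent": {"root"}, "other": {"root"}}
print(is_unique("child", {"parent"}, go_parents))  # related by a path
print(is_unique("child", {"other"}, go_parents))   # different branches
```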
Generated rules are sorted according to one of the selected criteria: the compound quality measure which is described in the above section or by a p-value computed using hypergeometric test as described in Statistical Test section.
We encourage RuleGO users to experiment with the filtration parameter settings. In the Example of different settings of rules filtration parameters section we present several different output sets of rules obtained for the same GO annotation and rule generation parameters.
The results of the analysis (the output set of rules) are presented on the RuleGO website. For each set of generated rules we provide the number of output rules and information about the coverage (the percentage of genes from the Primary Set that are described by the generated rules).
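Coverage, as defined above, can be computed with a few lines of Python. This is a minimal sketch on hypothetical data; the function name and the data layout (gene to set of terms) are assumptions.

```python
from typing import Dict, FrozenSet, Iterable, Set


def coverage_percent(premises: Iterable[FrozenSet[str]],
                     annotations: Dict[str, Set[str]],
                     primary_set: Set[str]) -> float:
    """Percentage of primary-set genes described by at least one rule."""
    if not primary_set:
        return 0.0
    covered: Set[str] = set()
    for premise in premises:
        covered |= {
            gene for gene in primary_set
            if premise <= annotations.get(gene, set())
        }
    return 100.0 * len(covered) / len(primary_set)


# Hypothetical toy data.
toy_annotations = {"g1": {"a", "b"}, "g2": {"b", "c"}, "g3": {"d"}, "g4": {"a"}}
toy_rules = [frozenset({"a", "b"}), frozenset({"b", "c"})]
print(coverage_percent(toy_rules, toy_annotations, {"g1", "g2", "g3", "g4"}))
```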
The text file with output rules can be downloaded by clicking [download rule file] link on the result page. The list of output rules is also presented on the website.
The resulting set of rules is presented as a list, which can also be downloaded as a text file or a PDF file.
For each rule the following information is provided:
The table below includes five different sets of rules. Each set of rules was generated for the same input lists of signature and reference genes, using the same GO annotation and rule generation parameters. The only difference was in the settings of the filtration parameters.
Filtration | Output sort order | Number of output rules | Link to file |
---|---|---|---|
NO | p-value | 28539 | download rules file |
YES | compound quality measure (quality,length,depth) | 23 | download rules file |
YES | compound quality measure (quality only) | 25 | download rules file |
YES | compound quality measure (length only) | 26 | download rules file |
YES | compound quality measure (depth only) | 19 | download rules file |