About This Project

Establishment of weight matrix for TAG domain

The two-dimensional term frequency (2DTF) formula is adapted from the tf–idf weight (term frequency–inverse document frequency), a weight often used in information retrieval and text mining. The tf–idf weight is a statistical measure used to evaluate how important a word is to a document. The importance increases proportionally to the number of times a word appears in the document (i.e., tf) but is offset by how common the word is across all documents in the collection (i.e., df). In our case, we do not use the df offset to discount common terms. Instead, the tf is modified by the relative importance of the given term across multiple documents in the collection. That is, we multiply the term frequencies from two dimensions to calculate a final weight score for the relative importance of a given term in a document.

The term frequency in the given document gives a measure of the importance of the term t_i within the particular document.

In our case, n_i is the number of occurrences of a given protein domain, and the denominator is the number of occurrences of all protein domains in the oncogene or tumor suppressor gene (TSG) group.
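Written out, the standard term frequency that this description refers to has the following form. The summation index k, which runs over all protein domains in the group, is notation introduced here for illustration.

```latex
\mathrm{tf}_i = \frac{n_i}{\sum_k n_k}
```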

Then the two-dimensional term frequency (2DTF) formula is

where N_{i,j} is the number of occurrences of a given protein domain, and the denominator is the number of occurrences of all protein domains in the oncogene or TSG groups.
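One plausible reading of these definitions, offered as an assumption rather than as the project's exact formula, multiplies a within-group frequency by an across-group share. The indices (i for the domain, j for the group, k and j' as summation indices) are notation assumed here.

```latex
\mathrm{2DTF}_{i,j} = \frac{N_{i,j}}{\sum_k N_{k,j}} \times \frac{N_{i,j}}{\sum_{j'} N_{i,j'}}
```

Under this reading, the first factor is high when the domain is frequent within its group, and the second is high when the domain rarely occurs in the other group.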

A high weight in 2DTF is reached by a high term frequency in the given category and a low term frequency across multiple categories; the weights hence tend to enhance the relative importance of terms in one category versus the other. A weight matrix table was generated from the calculated score for each protein domain, and the oncogenic potential of any protein is the sum of the scores of all domains present in that protein.
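The weight matrix and the per-protein score can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual implementation. It assumes that the 2DTF weight is the product of a domain's frequency within one group and its share of occurrences across both groups. The domain names, the counts, and the helper names `two_d_tf` and `protein_score` are all hypothetical.

```python
from collections import Counter

# Hypothetical toy counts: occurrences of each protein domain per gene group.
# These numbers are invented for illustration only.
counts = {
    "oncogene": Counter({"Pkinase": 6, "SH2": 3, "RING": 1}),
    "TSG": Counter({"Pkinase": 1, "SH2": 1, "RING": 8}),
}


def two_d_tf(counts):
    """Return {group: {domain: weight}} under the assumed 2DTF form."""
    # Second dimension: total occurrences of each domain across all groups.
    across = Counter()
    for group_counts in counts.values():
        across.update(group_counts)

    weights = {}
    for group, group_counts in counts.items():
        # First dimension: total occurrences of all domains within this group.
        within_total = sum(group_counts.values())
        weights[group] = {
            domain: (n / within_total) * (n / across[domain])
            for domain, n in group_counts.items()
        }
    return weights


def protein_score(domains, group_weights):
    """Sum the weights of the listed domains; unknown domains contribute 0."""
    return sum(group_weights.get(d, 0.0) for d in domains)


weights = two_d_tf(counts)
# Score a hypothetical protein that carries two domains against each group.
onc = protein_score(["Pkinase", "SH2"], weights["oncogene"])
tsg = protein_score(["Pkinase", "SH2"], weights["TSG"])
print(f"oncogene score: {onc:.4f}, TSG score: {tsg:.4f}")
```

With these toy counts, the protein scores higher against the oncogene weights than against the TSG weights, because both of its domains are frequent in the oncogene group and rare in the TSG group. Whether a domain that occurs several times in one protein should be counted once or once per occurrence is not stated in the text; the sketch simply sums over whatever list is passed in.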