RepeatMasker is a program that screens DNA sequences for interspersed
repeats and low complexity DNA sequences. The output of the program is
a detailed annotation of the repeats that are present in the query
sequence as well as a modified version of the query sequence in which
all the annotated repeats have been masked (default: replaced by
Ns). On average, almost 50% of a human genomic DNA sequence currently
will be masked by the program. Sequence comparisons in RepeatMasker
are performed by the program cross_match, an efficient implementation
of the Smith-Waterman-Gotoh algorithm developed by Phil Green.
Sequences can be pasted in or uploaded as files, both in fasta format.
Multiple fasta format sequences may be pasted in at once or may be contained
within a file. Fasta format looks like this:
>Sequence1
ACGTGCGCGATCGCCTGCTAGGCGTACGTCGCAG
GCGATCGATGTGCTAGATCAGATGACA
>Sequence2
GGGCTAGATTAGCACCACATACATCGCTCA
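For readers who want to check a file before submitting it, the format above can be read with a few lines of Python. This is only a minimal sketch and not part of RepeatMasker; the function name `parse_fasta` is made up for illustration, and it assumes each record starts with a '>' line followed by one or more sequence lines.

```python
def parse_fasta(text):
    """Return a list of (name, sequence) pairs from fasta-formatted text.

    Hypothetical helper for illustration only; not part of RepeatMasker.
    """
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            # Sequence lines belonging to one record are concatenated.
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


example = """>Sequence1
ACGTGCGCGATCGCCTGCTAGGCGTACGTCGCAG
GCGATCGATGTGCTAGATCAGATGACA
>Sequence2
GGGCTAGATTAGCACCACATACATCGCTCA
"""

for name, seq in parse_fasta(example):
    print(name, len(seq))
```

Sequence lines that appear before any '>' line are silently skipped in this sketch; the server itself may treat such input differently.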
The submission form contains a text field for the full pathname of the
file containing the sequence data on the local system (i.e. where the
Netscape browser is running). By pressing the "Browse..."
button, you can use a file selection box to select the file without
having to type the path. When running the browser on a Macintosh, the
browse button works but the file name cannot be typed in. On both the
PC and Mac the sequence file needs to be saved as 'text only'.
For example, the options
-div 20 -inv -GC 45
will cause the program to annotate and mask only repeats less than 20% diverged, return the alignments in the orientation of the repeat consensus sequences, and use matrices optimal for a 45% GC background nucleotide distribution.
                    # of repeats   total bp
primates                 563        664160
rodents                  466        487006
other mammals            347        243730
other vertebrates         52         53994
Drosophila                65        167423
Arabidopsis               98        275516
grasses                   27         67789

Note that the majority of sequences against which rodent and especially other mammalian queries are compared are repeats identified in the human genome and thought to predate the mammalian radiation.
Predicting genes from a masked sequence faces several problems. First,
one should not mask low complexity regions, e.g. to avoid masking
trinucleotide repeats in coding regions. But even with only
interspersed repeats masked, gene prediction programs may fail to
identify exons correctly. As mentioned above, sometimes tail ends of
coding regions may have originated from transposable elements. Even if
no coding regions have been masked, splice sites may be compromised;
e.g. the polypyrimidine region that is part of the acceptor splice
site may be contained within a repeat.
Thus, I generally recommend running a gene prediction program on
unmasked DNA (as well) and comparing the predicted genes and exons with
the RepeatMasker output. Some gene prediction programs allow you to
force certain exons out of the predictions (e.g. often the old ORFs of
LINE1 elements and endogenous retroviruses are included in
genes). Work is also in progress at several sites to incorporate
RepeatMasker into gene prediction programs, in which cases matches to
repeats are weighted in along with the other parameters used.
The annotation file contains the cross_match output lines. It lists
all best matches (above a set minimum score) between the query
sequence and any of the sequences in the repeat database or with low
complexity DNA. The term "best matches" reflects that a
match is not shown if its domain is over 80% contained within the
domain of a higher scoring match, where the "domain" of a
match is the region in the query sequence that is defined by the
alignment start and stop. These domains have been masked in the
returned masked sequence file. In the output, matches are ordered by
query name, and for each query by position of the start of the
alignment.
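The "best matches" rule can be sketched as follows. This is an illustration only, not the actual cross_match or RepeatMasker code: the function names and the example matches are hypothetical, and the sketch assumes that each match is compared against every higher-scoring match in the list.

```python
def overlap_fraction(inner, outer):
    """Fraction of the domain `inner` that lies inside the domain `outer`.

    Domains are (start, end) pairs in query coordinates, both inclusive.
    """
    start = max(inner[0], outer[0])
    end = min(inner[1], outer[1])
    covered = max(0, end - start + 1)
    return covered / (inner[1] - inner[0] + 1)


def best_matches(matches, threshold=0.8):
    """Drop matches whose domain is over `threshold` contained in the
    domain of a higher-scoring match; return the rest ordered by start.

    `matches` is a list of (score, start, end) tuples (hypothetical layout).
    """
    kept = []
    for m in matches:
        shadowed = any(
            other[0] > m[0] and overlap_fraction(m[1:], other[1:]) > threshold
            for other in matches
        )
        if not shadowed:
            kept.append(m)
    return sorted(kept, key=lambda m: m[1])


# Made-up matches: the second lies entirely inside the first and is dropped,
# the third overlaps the first only slightly and is kept.
example_matches = [(1200, 100, 500), (300, 150, 400), (250, 450, 700)]
print(best_matches(example_matches))
```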
Example:
 1306 15.6  6.2  0.0  HSU08988  6563  6781 (22462) C MER7A     DNA/MER2_type    (0)  336  103
12204 10.0  2.4  1.8  HSU08988  6782  7714 (21529) C TIGGER1   DNA/MER2_type    (0) 2418 1493
  279  3.0  0.0  0.0  HSU08988  7719  7751 (21492) + (TTTTA)n  Simple_repeat      1   33  (0)
 1765 13.4  6.5  1.8  HSU08988  7752  8022 (21221) C AluSx     SINE/Alu        (23)  289    1
12204 10.0  2.4  1.8  HSU08988  8023  8694 (20549) C TIGGER1   DNA/MER2_type  (925) 1493  827
 1984 11.1  0.3  0.7  HSU08988  8695  9000 (20243) C AluSg     SINE/Alu         (5)  305    1
12204 10.0  2.4  1.8  HSU08988  9001  9695 (19548) C TIGGER1   DNA/MER2_type (1591)  827    2
  711 21.2  1.4  0.0  HSU08988  9696  9816 (19427) C MER7A     DNA/MER2_type  (224)  122    2
This is a sequence in which a Tigger1 DNA transposon has integrated into a MER7 DNA transposon copy. Subsequently two Alus integrated in the Tigger1 sequence. The simple repeat is derived from the poly A of the Alu element. The first line is interpreted like this:
1306          = Smith-Waterman score of the match, usually complexity adjusted.
                The SW scores are not always directly comparable. Sometimes the
                complexity adjustment has been turned off, and a variety of
                scoring matrices are used.
15.6          = % substitutions in matching region compared to the consensus
6.2           = % of bases opposite a gap in the query sequence (deleted bp)
0.0           = % of bases opposite a gap in the repeat consensus (inserted bp)
HSU08988      = name of query sequence
6563          = starting position of match in query sequence
6781          = ending position of match in query sequence
(22462)       = no. of bases in query sequence past the ending position of match
C             = match is with the Complement of the consensus sequence in the database
MER7A         = name of the matching interspersed repeat
DNA/MER2_type = the class of the repeat, in this case a DNA transposon fossil
                of the MER2 group (see below for list and references)
(0)           = no. of bases in (complement of) the repeat consensus sequence
                prior to beginning of the match (so 0 means that the match
                extended all the way to the end of the repeat consensus sequence)
336           = starting position of match in database sequence (using top-strand numbering)
103           = ending position of match in database sequence
An asterisk (*) in the final column (no example shown) indicates that
there is a higher-scoring match whose domain partly (<80%) includes
the domain of this match.
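For those who process the annotation table with their own scripts, one line can be split into named fields roughly as follows. This is a sketch under assumptions: the field names and the function `parse_hit` are invented here, and the column layout is taken from the example above (14 whitespace-separated columns plus an optional trailing asterisk). For matches on the complement strand the bracketed "bases left" number comes first among the last three columns; for "+" matches it comes last.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One annotation line; field names are illustrative, not official."""
    score: int
    pct_div: float
    pct_del: float
    pct_ins: float
    query: str
    q_start: int
    q_end: int
    q_left: int
    complement: bool
    repeat: str
    repeat_class: str
    r_start: int
    r_end: int
    r_left: int
    overlapped: bool


def _paren(field):
    """Convert a bracketed count such as '(22462)' to an integer."""
    return int(field.strip("()"))


def parse_hit(line):
    f = line.split()
    overlapped = f[-1] == "*"  # asterisk in the final column, if present
    if overlapped:
        f = f[:-1]
    complement = f[8] == "C"
    if complement:
        r_left, r_start, r_end = _paren(f[11]), int(f[12]), int(f[13])
    else:
        r_start, r_end, r_left = int(f[11]), int(f[12]), _paren(f[13])
    return Hit(int(f[0]), float(f[1]), float(f[2]), float(f[3]), f[4],
               int(f[5]), int(f[6]), _paren(f[7]), complement, f[9], f[10],
               r_start, r_end, r_left, overlapped)


first = parse_hit("1306 15.6 6.2 0.0 HSU08988 6563 6781 (22462) "
                  "C MER7A DNA/MER2_type (0) 336 103")
print(first.repeat, first.q_start, first.q_end, first.r_start, first.r_end)
```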
Note that the SW score and divergence numbers for the three Tigger1
lines are identical. This is because the information is derived from a
single alignment (the Alus were deleted from the query before the
alignment with the Tigger element was performed). The program makes
educated guesses about many fragments if they are derived from the
same element (e.g. it knows that the MER7A fragments represent one
insert). In a future version I can identify each element with a unique
ID, if interest exists (this could help to represent repeats more
cleanly in graphic displays).
Alignments are shown in order of appearance in the query sequence.
These alignments may be most generally useful for designing PCR
primers in a region full of repeats. It is possible to get primers
that work in a whole genome when the 3' end of the primer lies in a
region of a repeat (even a common one) that is very different from the
consensus.
Alignments are shown in the orientation of the query sequence unless
the option -inv is typed into the option box.
Here is an example of an alignment of a MIR spanning an Alu element
deleted in an earlier step:
665 28.45 2.93 5.02 g5129s420 7350 7882 (1924) C MIR#SINE/MIR (1) 261 28 3

  g5129s420      7350 ATCATAACAAACATTTAT--GGTGCCTCCTATGGAGCAGGGATTTTGCTT 7397
                        v     v           i i  i v     viv    v i v v  v
C MIR#SINE/MIR    261 ATAATAACCAACATTTATTGAGCGCTTACTATGTGCCAGGCACTGTTCTA 212

  g5129s420      7398 AGGACTCTGAACTATAT---CTTACTT-GTCTTCATTAAAAACCTTATGA 7443
                        vi  i iv   i        i i   i  i    i  v    i
C MIR#SINE/MIR    211 AGCGCTTTACA-TGTATTAACTCATTTAATCCTCA-CAACAACCCTATGA 164

  g5129s420      7444 AAAAGGTACTATTATTAACTGGGGXTGGGTTGTTTAACAGATAAGAAAGC 7787
                      iiv              v i         iii   v      i  i  i
C MIR#SINE/MIR    163 GGTAGGTACTATTATTATCC---------CCATTTTACAGATGAGGAAAC 123

  g5129s420      7788 TTAAGAATTAGAGAGATAAATTATCTTGCTTAAGGTAACACAGTTAACAA 7837
                       v i v  i      i v  v  v     ii     v      i  ii
C MIR#SINE/MIR    122 TGAGGCA-CAGAGAGGTTAAGTAACTTGCCCAAGGTCACACAGCTAGTAA 74

  g5129s420      7838 GCATTAG-GTCAAAGTTTGAACTCGGGCAGTCTGACTACAGAGCCC 7882
                       iivi    i iiii  i    i i         i  v     i
C MIR#SINE/MIR     73 GTGGCAGAGCCGGGATTCGAACCCAGGCAGTCTGGCTCCAGAGTCC 28

Transitions / transversions = 1.96 (45 / 23)
Gap_init rate = 0.03 (8 / 234), avg. gap size = 2.38 (19 / 8)
In cross_match alignments the mismatches are indicated, where
"-" indicates an insertion/deletion, "i" a
transition (G<->A, C<->T) and "v" a transversion
(all other substitutions). The position of the deleted Alu in the
query is indicated with an "X".
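The classification of mismatches can be reproduced with a short script. This is a sketch for illustration, not the cross_match code: the function names are made up, gap columns are marked with "-" as an assumption about the layout, and any character outside A, C, G, T (such as the "X" marking the deleted Alu) is left unmarked. The two strings are the first block of the MIR alignment shown earlier in this section.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify(a, b):
    """Return '-', 'i', 'v' or ' ' for one alignment column."""
    if a == "-" or b == "-":
        return "-"  # insertion/deletion
    if a == b or a not in "ACGT" or b not in "ACGT":
        return " "  # identical bases, or a symbol we do not classify
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "i" if same_class else "v"  # transition vs. transversion


def mismatch_line(query, consensus):
    """Build the marker line for two aligned strings of equal length."""
    return "".join(classify(a, b) for a, b in zip(query, consensus))


query = "ATCATAACAAACATTTAT--GGTGCCTCCTATGGAGCAGGGATTTTGCTT"
consensus = "ATAATAACCAACATTTATTGAGCGCTTACTATGTGCCAGGCACTGTTCTA"

marks = mismatch_line(query, consensus)
print(marks)
print("transitions:", marks.count("i"), "transversions:", marks.count("v"))
```

Summing the "i" and "v" counts over all blocks of an alignment gives the two numbers reported on its "Transitions / transversions" line.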
The lines in the annotation table describing this match appear as:
 665 28.4  2.9  5.0  g5129s420  7350  7467 (533) C MIR    SINE/MIR   (1)  261  149
2222 10.2  2.7  0.0  g5129s420  7468  7762 (238) C AluSg  SINE/Alu   (7)  303    1
 665 28.4  2.9  5.0  g5129s420  7763  7882 (118) C MIR    SINE/MIR (113)  149   28
Most discrepancies between alignments and annotation result from
adjustments made to produce more legible annotation. This annotation
also tends to be closer to the biological reality than the raw
cross_match output. For example, adjustments often are necessary
when a repeat is fragmented through deletions, insertions, or an
inversion. Many subfamilies of repeats closely resemble each other,
and when a repeat is fragmented these fragments can be assigned
different subfamily names in the raw output. The program often can
decide if fragments are derived from the same integrated transposable
element and which subfamily name is appropriate (subsequently given to
all fragments). This can result in discrepancies in the repeat name
and matching positions in the consensus sequence (subfamily consensus
sequences differ in length).
Some other discrepancies are specific to LINE elements. These repeats
do not appear as complete elements in the consensus database. This is
mostly a result of the contrast in conservation over the length of the
LINE1 sequence during its evolution in the mammalian genome; the ~3 kb
ORF2 region of LINE1 has been highly conserved, whereas the untranslated
regions and, to a lesser degree, ORF1 have evolved very fast. Thus the
3' end or 5' end of an ancient LINE1 does not even remotely resemble
that of the currently active LINE1, whereas the coding region for
reverse transcriptase is closely related. Thus, many subfamilies have
been defined for both the 5' and 3' UTRs (25 and 50, resp.) of LINE1
elements in human DNA, whereas only three ORF2 entries are present in
the database. It is not only hard to extend all subfamilies from the
beginning to the end, but it also appears that different 5' ends may
have been associated with the same 3' ends, and vice versa. On top of
that, including 50 full length (6.2-8 kb) LINE1 elements in the
database would make the program very slow. LINE1 elements therefore
are presented in the database in 3 (or more) pieces, and the program
tries to put these pieces together as well as possible. As a result
both the names of the repeats and position numbering in the consensus
sequence are generally different in the alignments than in the output
file. The LINE2 elements are likewise broken up in the databases, in
3' UTRs for different subfamilies and one ORF2 region.
The 3' UTR of LINE1 subfamilies ranges from 500 bp to over 2000 bp (in
L1MC/D3), and the length of the 5' UTR is even more variable, even
between subfamilies that show strong similarity in the 3' UTR. To
allow the LINE1 fragments to be put together, all position numbers in
older LINE1 subfamilies are adjusted to the position of ORF2 (the
conserved part of LINE1) in a complete L1PA2 element. Since some older
elements have much longer 5' UTRs or ORF1-ORF2 linker regions than
L1PA2, this sometimes results in the assignment of negative position
numbers for the 5' end of LINEs.
Finally, you may find large discrepancies in position numbering if an
element includes tandem repeat units. For example, MER109 contains
multiple ~300 bp repeat units; this can lead to overlapping
matches. In the output such matches are fused.
==================================================
file name: A-355G7.fasta
sequences:            1
total length:    139958 bp
GC level:         41.03 %
bases masked      91491 bp ( 65.37 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
SINEs:               46     12182 bp    8.70 %
      ALUs           41     11603 bp    8.29 %
      MIRs            5       579 bp    0.41 %

LINEs:               42     52641 bp   37.61 %
      LINE1          38     52296 bp   37.37 %
      LINE2           4       345 bp    0.25 %

LTR elements:        20     13441 bp    9.60 %
      MaLRs          10      5618 bp    4.01 %
      Retrov.         4      5131 bp    3.67 %
      MER4_group      3      1439 bp    1.03 %

DNA elements:         8      1741 bp    1.24 %
      MER1_type       7      1114 bp    0.80 %
      MER2_type       1       627 bp    0.45 %
      Mariners        0         0 bp    0.00 %

Unclassified:         5      9215 bp    6.58 %

Total interspersed repeats: 89220 bp   63.75 %

Small RNA:            0         0 bp    0.00 %
Satellites:           0         0 bp    0.00 %
Simple repeats:      20      1647 bp    1.18 %
Low complexity:       9       437 bp    0.31 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element

The sequence(s) were assumed to be of primate origin.
RepeatMasker version 11/06/98 default
ProcessRepeats version 06/16/98
The four main classes mentioned in this table are well defined (see my 1996 review in COGD) and form a good basis for a
summary or visual presentation of the repeats in a locus. Among the
subclasses, some uncertainty of classification remains; it is
especially hard to predict if an LTR is derived from an endogenous
retrovirus or a non-autonomous LTR element. Also, not all subclasses
are listed and the total for the classes is often higher than the sum
of the subclasses. Note that the "MER" subclasses and the different
MER interspersed repeats are not necessarily related to each
other. The term MER (MEdium Reiterated repeats) was introduced for
purely administrative purposes to give the beast a name. I named the
MER1, MER2, and MER4 groups after the first member of each group that
was identified as an interspersed repeat.
The program tries very hard to find out which repeat fragments were
derived from the same insertion event of a transposable element. The
estimated number of events still tends to be an overestimate.
The 'bases masked' number is calculated from the total number of Xs in
the masked sequences (before these are changed to Ns or lower case
letters). The other numbers are derived from the annotation (.out)
file. Discrepancies between the 'bases masked' number and the sum of
'total interspersed repeats', small RNA, satellites and low complexity
are generally very small. They are mostly accounted for by unmasked
regions between flanking identical simple repeats, annotated as one
stretch if fewer than 10 bases separate them, and fragments of repeats
shorter than 10 bp which are not annotated but are masked.
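As a small worked example of the 'bases masked' figure, the count and percentage can be computed as below. This is a sketch, not the script that produces the summary table; the function name `masked_fraction` and the short test sequence are made up, and the last line simply redoes the percentage from the summary table shown earlier (91491 of 139958 bp).

```python
def masked_fraction(masked_sequence, mask_char="X"):
    """Return (number of masked bases, percentage of the sequence masked).

    Counts occurrences of `mask_char`, i.e. the Xs present before they are
    changed to Ns or lower case letters.
    """
    masked = masked_sequence.count(mask_char)
    total = len(masked_sequence)
    percent = 100.0 * masked / total if total else 0.0
    return masked, percent


# Made-up 20 bp sequence with 6 masked positions.
print(masked_fraction("ACGTXXXXXXACGTACGTAC"))

# Percentage for the summary table above: 91491 masked bases of 139958 bp.
print(round(100.0 * 91491 / 139958, 2))
```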
OVERVIEW
Smit, A.F.A. (1996) Origin of interspersed repeats
in the human genome. Curr. Opin. Genet. Devel. 6 (6), 743-749.
Smit, A.F.A. (1996) Structure and evolution of mammalian interspersed
repeats. PhD dissertation, USC. (lots of otherwise unpublished
information here, available under order number 9636751 at the UMI web site)
SINE/Alu
Schmid, C. W. (1996). Alu: structure, origin, evolution, significance,
and function of one-tenth of human DNA. Prog Nucleic Acids Res Mol Biol
53, 283-319.
Jurka, J. (1996) Origin and evolution of Alu repetitive elements. In
"The impact of short interspersed elements (SINEs) on the host genome",
Maraia, R.J., editor. Springer Verlag.
Batzer, M. A., Deininger, P. L., Hellmann Blumberg, U., Jurka, J., Labuda,
D., Rubin, C. M., Schmid, C. W., Zietkiewicz, E., and Zuckerkandl, E. (1996).
Standardized nomenclature for Alu repeats. J Mol Evol 42, 3-6.
SINE/MIR & LINE/L2
Smit, A. F. A., and Riggs, A. D. (1995). MIRs are classic, tRNA-derived
SINEs that amplified before the mammalian radiation. Nucleic Acids Res
23, 98-102.
LINE/L1
Smit, A. F. A., Toth, G., Riggs, A. D., and Jurka, J. (1995). Ancestral
mammalian-wide subfamilies of LINE-1 repetitive sequences. J Mol Biol 246, 401-417.
LTR/MaLR
Smit, A. F. A. (1993). Identification of a new, abundant superfamily of
mammalian LTR-transposons. Nucleic Acids Res 21, 1863-72.
LTR/Retroviral
Wilkinson, D. A., Mager, D. L., and Leong, J. C. (1994). Endogenous Human
Retroviruses. In The Retroviridae, J. A. Levy, ed. (New York: Plenum Press),
pp. 465-535.
DNA/all types
Smit, A. F. A., and Riggs, A. D. (1996). Tiggers and other DNA transposon
fossils in the human genome. Proc Natl Acad Sci USA 93, 1443-8.
The database of human/mammalian-wide repeats was expanded 2.5
fold. Among the new additions are the (long) internal sequences of
endogenous retroviruses.
Databases of repeats from species other than primates, rodents
or artiodactyls can now be screened, although the program is not optimized
to do so and the quality of the databases is not at the same level.
Through optimization of the cross_match searches, the program is more
sensitive and selective, especially with regard to detection of low
complexity sequences and old LINE1 elements.
The RepeatMasker output is now processed by a second script to create annotation
ready for database submission. Some of the more obvious improvements in
the output are (i) overlapping matches are generally resolved, (ii) LINE1
fragments are annotated with position numbers as in a full L1 element,
and (iii) when an Alu or LINE1 is fragmented information from both or all
fragments is used to assign a subfamily name.
Alignments are shown without interruption by other cross_match output
and in the order of appearance in the query sequence.
A summary table has been added which shows, among other things, the repeat
composition of the query sequence.
- major expansion of the rodent libraries and significant update
of the human libraries as well, especially in LINE1 elements.
- scripts modified to accommodate new entries in databases
- simple repeats masking optimized by including pentamers and
using a more stringent matrix
- several bugs fixed (e.g. sequences without repeats are now counted)
- table now displays the parameters used
- the program is more robust and accepts most 'almost but not quite
fasta' format files
- large sequences are analyzed in fragments of 100 kb to reduce the
memory requirements of the program. Similarly files with very many
sequence entries are divided up. You shouldn't notice any of this in
the output files.
- matrices are used that are optimal for the divergence level of the
repeats to which the query is compared and the background nucleotide
composition.
- another big update of the human repeat databases.
- the small RNA sequences have been corrected and expanded (all tRNAs
should be there now)
- the summary table now lists the amount of small RNA (pseudo)genes,
simple repeats and low complexity DNA identified
- close to perfect simple repeats, full-length shorter interspersed
repeats and young LINE1 3' ends are temporarily excised from the
sequence (in both human and rodent analysis) to allow better detection
of any underlying repeats.
- the "Skip simple, low complexity region masking" really skips all
simple repeats now
- alignments are shown in the orientation of the query sequence
- among many bugs fixed is one involving sequence names including a
number between parentheses
This version uses the 1998 cross_match release. The difference for
RepeatMasker is mainly in the complexity adjusted length of the
matches that function as kernels for Smith Waterman alignments and the
matrix dependent adjustment of the score for complexity of the
alignment.
The full description ('>') lines are now retained in the masked file.
The .out file table is returned with flexible length columns allowing
the full length of long query sequence names to be displayed.
Optionally, the old fixed width table can still be obtained.
Simple repeat and satellite masking has been improved again; their
annotation has changed a bit; most notably, they are now all listed in
the orientation of the query sequence.
Several new options are available:
- A mRNA/EST option prevents false masking due to inappropriate matrix
choice and low complexity matches to LINE1 elements in short GC rich
regions like coding regions.
- You can limit the masking to Alus when masking primate DNA
- You can limit the masking to younger repeats by setting a maximum
allowed divergence to the consensus sequence
- The sequences identified as repeats can be returned in lower case
(rest in capitals) rather than masked out by Ns or Xs.
- You can set the background GC level (determining which matrices are
used) overriding the program's calculations.
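The difference between masking with Ns and returning repeats in lower case can be illustrated as follows. This is a sketch only, not the RepeatMasker code: the function names are invented, and the repeat positions are assumed to be 1-based, inclusive query coordinates as in the annotation table.

```python
def soft_mask(sequence, repeats):
    """Return the sequence with repeat regions in lower case, rest in capitals.

    `repeats` is an iterable of (start, end) pairs, 1-based and inclusive.
    """
    chars = list(sequence.upper())
    for start, end in repeats:
        for i in range(max(0, start - 1), min(end, len(chars))):
            chars[i] = chars[i].lower()
    return "".join(chars)


def hard_mask(sequence, repeats, mask_char="N"):
    """Return the sequence with repeat regions replaced by `mask_char`."""
    chars = list(sequence.upper())
    for start, end in repeats:
        for i in range(max(0, start - 1), min(end, len(chars))):
            chars[i] = mask_char
    return "".join(chars)


# Made-up 10 bp sequence with one "repeat" at positions 3-6.
print(soft_mask("ACGTACGTAC", [(3, 6)]))
print(hard_mask("ACGTACGTAC", [(3, 6)]))
```

Lower case masking keeps the underlying bases, so the repeat sequence can still be inspected or aligned later, whereas Ns or Xs discard it.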
Among bugs fixed since May 1998 are those responsible for distorted
output for sequences with names ending in .seq and for sequences
without a header line. Also, sequence files from PCs and Mac with
hidden carriage returns are handled appropriately.
For further information and to obtain a local copy contact:
Arian Smit
For information on a commercial license please contact:
Chuck Williams