The fast DB documentation

Exons and splicing events definition
BLAST against EST and mRNA sequence databanks
Alignment of selected transcript sequence with genomic sequence
Primary definition of alternative splicing events
Refined definition of alternative splicing events
Features available from the “Transcripts view” page
Link between transcripts and PubMed
Prediction of Open Reading Frames (ORF)
Prediction of nonsense-mediated mRNA decay (NMD)
Prediction of the microRNA/transcript interaction sites
Association of transcript with tissue
Database filling and creation of PDF and SWF files
Multi-alignment of transcript sequences and in silico PCR
Multi-alignment of transcript sequences
Tissue distribution histogram of all gene transcripts
Tissue distribution histogram of gene transcripts for a specific event
Analysis of alternative first exons
Analysis of alternative terminal exons
Alternative 3’ and 5’ splice sites
All scripts were written in
Perl v5.8.7 (http://www.perl.org/) using the following modules:
Data processed by our scripts were
retrieved from the following public databanks:
All data are stored in a MySQL
relational database (http://www.mysql.com/). MySQL v4.1.12 is used and all
tables were created in MyISAM format.
Fast DB provides three kinds of analysis: human
mRNAs, human mRNAs and ESTs, and mouse mRNAs.
For the analysis of human genes (the first two
analyses), we recovered the genomic sequences of the 22,218 “protein-coding” genes from
the homo_sapiens_core_31_35d EnsEMBL database. Mouse genomic sequences come
from the mus_musculus_core_35_34d EnsEMBL database. We used the
ensembl_compara_36 database to associate each human gene with its orthologous
mouse gene.
All recovered genomic sequences were extended upstream
and downstream of the genomic region defined by EnsEMBL, to make
sure that additional exons at the gene borders are included.
To define the exon/intron structure
and the splicing events of each gene, we aligned transcripts against the genomic
sequence. The selected transcripts come from several sources. “Full-length”
mRNAs and ESTs come from the UCSC website, and “partial” mRNAs were downloaded
from GenBank using the query “((splic*[Text Word] OR (variant*[Text Word]) OR
(isoform*[Text Word])) AND (homo sapiens[Organism]) AND (mRNA[Text Word]) AND
(partial[Text Word]) NOT (DNA[Text Word]) NOT (BAC[Text Word]) NOT (contig[Text
Word]) NOT (cosmid[Text Word])”. Each sequence bank was then
formatted with formatdb so that it could be queried with BLAST (6).
The sequence of each exon defined by EnsEMBL is
blasted against these sequence banks to recover the transcripts to be
analyzed further.
Each recovered transcript is aligned with the
genomic sequence using sim4 (7). Very stringent criteria were
defined to eliminate potentially low-quality transcripts. This selection
comprises two distinct steps. To complete the first step successfully, a
transcript has to meet the following conditions:
* If all selected transcripts have
one exon, this criterion is not used.
Using all pre-selected transcripts,
we defined another criterion: the average of the ratios between the length of all defined
exons and the length of all defined introns. To pass this second step,
the ratio calculated for a given transcript has to be less than three times the
average ratio. Of course, this second step is not applied if all pre-selected
transcripts have only one exon.
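The second selection step can be sketched as follows. This is a minimal illustration in Python, not the fast DB Perl code. It assumes that each transcript is given as a sorted list of (start, end) exon coordinates on the genomic sequence, that consecutive exons are separated by at least one base, and that single-exon transcripts (which have no intron, hence no ratio) are left out of the average and kept. The names `exon_intron_ratio` and `second_step_filter` are hypothetical.

```python
from statistics import mean


def exon_intron_ratio(exons):
    """Ratio between summed exon length and summed intron length.

    exons: list of (start, end) genomic coordinates, 1-based inclusive,
    sorted by start, with at least two exons.
    """
    exon_len = sum(end - start + 1 for start, end in exons)
    intron_len = sum(nxt[0] - cur[1] - 1 for cur, nxt in zip(exons, exons[1:]))
    return exon_len / intron_len


def second_step_filter(transcripts, factor=3.0):
    """Keep transcripts whose ratio is below `factor` times the average ratio.

    transcripts: dict mapping a transcript name to its exon list.
    Assumption: single-exon transcripts are ignored when averaging and kept.
    """
    multi_exon = {name: ex for name, ex in transcripts.items() if len(ex) > 1}
    if not multi_exon:
        # All transcripts have one exon: the criterion is not applied.
        return dict(transcripts)
    ratios = {name: exon_intron_ratio(ex) for name, ex in multi_exon.items()}
    threshold = factor * mean(ratios.values())
    return {
        name: ex
        for name, ex in transcripts.items()
        if name not in ratios or ratios[name] < threshold
    }
```

A transcript with unusually short introns relative to its exons gets a high ratio and is discarded, provided enough other transcripts keep the average low.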
Each selected transcript alignment
is parsed to define the exon positions on the transcript sequence and on the
genomic sequence. We defined a
"genomic exon" as the most frequent exon among all transcript exons
at a given genomic position (Figure 1). For this analysis, all transcript
exons were sorted in ascending order of their first position in
the genomic sequence. They were then gathered into "bags", each
"bag" corresponding to one "genomic exon".
We defined the first and the last positions of a
genomic exon as the most frequent first and last positions occupied by the
transcript exons belonging to this genomic exon “bag” (see Figure 1). However,
the first position of the first exon and the last position of the last exon of
the gene were defined differently. The first position of the first exon was
defined as the lowest position among the transcript exons. The end position of
the last exon was defined as the highest position among the transcript exons. In
the case of an “intron-less” gene, the first and last positions were defined as, respectively,
the lowest and the highest positions of the single exon.
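The definition of genomic exons can be illustrated with the short Python sketch below (not the fast DB Perl code). The text does not say how transcript exons are assigned to a "bag"; the sketch assumes that exons overlapping on the genomic sequence belong to the same bag, which is a simplification (a retained intron would merge two bags under this assumption). The function names are hypothetical.

```python
from collections import Counter


def group_into_bags(transcript_exons):
    """Sort exons by first position and gather overlapping exons into bags.

    transcript_exons: iterable of (start, end) genomic coordinates.
    Assumption: two exons share a bag when they overlap.
    """
    bags = []
    bag_end = None
    for start, end in sorted(transcript_exons):
        if bags and start <= bag_end:
            bags[-1].append((start, end))
            bag_end = max(bag_end, end)
        else:
            bags.append([(start, end)])
            bag_end = end
    return bags


def genomic_exons(transcript_exons):
    """Return one (start, end) genomic exon per bag."""
    bags = group_into_bags(transcript_exons)
    if not bags:
        return []
    result = []
    for bag in bags:
        # Most frequent first and last positions within the bag.
        start = Counter(s for s, _ in bag).most_common(1)[0][0]
        end = Counter(e for _, e in bag).most_common(1)[0][0]
        result.append([start, end])
    # Gene borders: lowest start of the first exon, highest end of the last.
    result[0][0] = min(s for s, _ in bags[0])
    result[-1][1] = max(e for _, e in bags[-1])
    return [tuple(exon) for exon in result]
```

For an intron-less gene there is a single bag, so both border rules apply to it and the genomic exon spans the lowest to the highest position, as described above.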

Figure 1: Genomic exon definition
After genomic exons have been defined,
alternative events were defined by comparing transcript exons with the corresponding
genomic exons. Seven types of alternative events were defined:
To define a transcript exon as an alternative
first exon of a gene, we used several criteria. First, the exon must be the
first exon of one or more transcripts. Second, if the first exon of a transcript
does not co-localize with any genomic exon (case 1, transcript

Figure 2: Alternative first exon definition
It is important to underline that one cannot
exclude the existence of upstream exon(s) that are not present in the available
transcript(s). To gain more confidence in a candidate “first exon”,
users can perform analyses for promoter and transcription factor binding site
prediction and for 5’-UTR features using the candidate “first exon” and
upstream sequences with the fast DB interface (see User’s guide).
Alternative last exons (other than the last
exon) are shown in red in the example in figure 3. Last exons are
defined as being the last exon of at least one transcript and exceeding by at
least 10 bases the last position of the corresponding genomic exon
(transcript

Figure 3: Alternative last exon definition
It is important to underline that one cannot
exclude the existence of downstream exon(s) that are not present in the available
transcript(s). To gain more confidence in a candidate “last exon”, users
can perform several analyses for polyadenylation site prediction and 3’-UTR
features using the candidate “last exon” and downstream sequences with the fast DB
interface (see User’s guide).
A transcript exon that ends downstream of the
first position of the next genomic exon is defined as a retained intron (red
exon of transcript

Figure 4: Retained intron definition
We defined an exon skipping event when two
consecutive exons from the same transcript meet the following criterion: the
genomic exon position (rank) of the second exon is higher than the genomic exon
position of the first exon plus one. For example, figure 5 shows that the second exon of
transcript 2 is at position three, which is higher than two (position of the
first exon plus one).
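The criterion amounts to a comparison of exon ranks, as in this minimal Python sketch (an illustration, not the fast DB code). It assumes that each transcript exon has already been assigned the rank of its genomic exon; `exon_skipping_junctions` is a hypothetical name.

```python
def exon_skipping_junctions(exon_ranks):
    """Return the (first, second) rank pairs that reveal an exon skipping.

    exon_ranks: genomic exon ranks of the consecutive exons of one
    transcript, e.g. [1, 3, 4] for a transcript that lacks genomic exon 2.
    """
    return [
        (first, second)
        for first, second in zip(exon_ranks, exon_ranks[1:])
        if second > first + 1
    ]
```

With the ranks of the example above, a transcript whose first two exons sit at positions one and three yields the pair (1, 3).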

Figure 5: Exon skipping definition
An alternative 3’ splicing site was defined
when a transcript exon starts at least 3 nucleotides downstream or upstream of the
start of the corresponding genomic exon (red exon of transcript

Figure 6: Alternative 3’ splicing site definition
An alternative 5’ splicing site was defined
when a transcript exon ends at least 3 nucleotides downstream or upstream of the end
of the corresponding genomic exon (red exon of transcript

Figure 7: Alternative 5’ splicing site definition
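The two splice site criteria reduce to a distance test on exon boundaries. The Python sketch below is only an illustration under stated assumptions: coordinates are expressed in the direction of transcription (so the exon start is the 3’ splice site and the exon end is the 5’ splice site), and the caller does not pass the start of a first exon or the end of a last exon, which are not splice sites. The function name is hypothetical.

```python
MIN_SHIFT = 3  # minimum boundary shift, in nucleotides, stated in the text


def alternative_splice_sites(transcript_exon, genomic_exon, min_shift=MIN_SHIFT):
    """Compare a transcript exon with its genomic exon.

    Both exons are (start, end) tuples in transcription orientation.
    Returns a dict of booleans for the two event types.
    """
    t_start, t_end = transcript_exon
    g_start, g_end = genomic_exon
    return {
        "alt_3prime": abs(t_start - g_start) >= min_shift,
        "alt_5prime": abs(t_end - g_end) >= min_shift,
    }
```

A boundary shifted by only one or two nucleotides is therefore not reported as an alternative splice site.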
An internal exon deletion was defined if at
least two transcript exons from the same transcript were at the same genomic exon
position and if the length between these exons was equal to or greater than ten
nucleotides (red exons of transcript

Figure 8: Internal exon deletion (IED) definition
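A possible reading of the internal exon deletion criterion is sketched below in Python (an illustration, not the fast DB code). It assumes that "the length between these exons" is the number of genomic bases lying strictly between two consecutive transcript exons, and that each transcript exon carries the rank of its genomic exon. The function name and the tuple layout are hypothetical.

```python
def internal_exon_deletions(exons, min_gap=10):
    """Find gaps of at least `min_gap` bases inside one genomic exon.

    exons: list of (rank, start, end) for one transcript, sorted by start,
    where rank is the genomic exon position of the transcript exon.
    Returns a list of (rank, gap_start, gap_end) tuples.
    """
    events = []
    for (rank1, _, end1), (rank2, start2, _) in zip(exons, exons[1:]):
        gap = start2 - end1 - 1  # bases strictly between the two exons
        if rank1 == rank2 and gap >= min_gap:
            events.append((rank1, end1 + 1, start2 - 1))
    return events
```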
To allow further analyses, we
refined the definition of two alternative splicing event types: exon skipping and intron retention.
Consecutive skipped exons in the same
transcript are considered as a single event (figure 9, item 1). However, in
some cases, exons that seem to be skipped are no longer taken into account in
the definition of alternative splicing events, because these exons are only
associated with a specific promoter or a specific alternative terminal exon, as
shown in figure 9, items 2 and 3. An exon associated with a
specific promoter (or a specific alternative terminal exon) can also be adjacent to a
“real” skipped exon (figure 9, items 4 and 5).
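The first rule of the refined definition, merging consecutive skipped exons into one event, can be sketched as follows in Python (an illustration, not the fast DB code). The sketch covers only this merging step; the exclusion of exons tied to a specific promoter or terminal exon depends on the cases of figure 9 and is not modelled. `merge_consecutive` is a hypothetical name.

```python
def merge_consecutive(skipped_ranks):
    """Group consecutive genomic exon ranks into single events.

    skipped_ranks: ranks of the genomic exons skipped by one transcript.
    Returns a list of events, each event being a list of consecutive ranks.
    """
    events = []
    for rank in sorted(skipped_ranks):
        if events and rank == events[-1][-1] + 1:
            events[-1].append(rank)  # extends the current run
        else:
            events.append([rank])  # starts a new event
    return events
```

The same grouping applies to consecutive retained introns if intron ranks are passed instead of exon ranks.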

Figure 9: refined definition of exon skipping events
Following the same principle as for exon skipping,
consecutive retained introns are considered as a single event (figure 10, item 4).

Figure 10: refined definition of intron retention events
To link each transcript with the
corresponding articles stored in PubMed, we used the HTTP::Lite Perl module
with the following URL for each transcript of fast DB: “http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=search&db=pubmed&term=$t”
where “$t” is the GenBank accession number of the
transcript. When the request succeeded, the corresponding PubMed IDs (PMID) were
stored in our database.
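As an illustration of how the query URL is assembled for each transcript, the Python sketch below builds the URL quoted above from an accession number. It only constructs the string and sends no request; the endpoint is the one given in the text and may no longer be served. The function name is hypothetical, and the accession number in the usage line is just an example value.

```python
from urllib.parse import urlencode

# Endpoint quoted in the text (historical; no request is sent here).
BASE_URL = "http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi"


def pubmed_search_url(accession):
    """Build the PubMed search URL for one GenBank accession number."""
    query = urlencode({"cmd": "search", "db": "pubmed", "term": accession})
    return f"{BASE_URL}?{query}"


url = pubmed_search_url("NM_000546")  # example accession number
```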
Fast DB provides a predicted ORF for most transcripts
(but not for ESTs). In some cases, two ORFs are predicted for the same
transcript (ORFs blue and red on the “Transcripts view” page, see User’s
guide). For each transcript, we used the algorithm described in figure 11. We
used the “getorf” program from the EMBOSS package (8) to find all ORFs with at least 120
nucleotides (“getorf -minsize 120 -reverse No file_in file_out”). Then, we
selected the two longest ORFs. The first selected ORF (ORF1 in figure 11) corresponds
to the ORF covering the highest number of transcript exons. In case of
alternative splicing events in the first exon covered by ORF1, the second
selected ORF (ORF2 in figure 11) is the one whose start position is as far
upstream as possible among ORFs with at least 120 nucleotides.

Figure 11: algorithm of ORF prediction
For each transcript (not for ESTs) and for each
predicted ORF, we calculated the length in nucleotides between the ORF stop position
(on the transcript sequence) and the position of the last exon-exon junction (figure
12). If this length is greater than 50, the corresponding transcript is predicted
to be targeted by NMD (one prediction per ORF for each transcript).
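The rule can be written as a short Python sketch (an illustration, not the fast DB code). It assumes that positions are 1-based transcript coordinates, that the ORF stop position is the last base of the stop codon, that the junction position is the last base of the penultimate exon, and that only a stop located upstream of the last junction can trigger the prediction. Single-exon transcripts have no junction and are never predicted as targets. The function names are hypothetical.

```python
NMD_THRESHOLD = 50  # nucleotides, as stated in the text


def last_junction_position(exon_lengths):
    """Transcript position of the last base before the last exon-exon junction.

    exon_lengths: lengths of the transcript exons, in transcript order.
    Returns None when the transcript has a single exon.
    """
    if len(exon_lengths) < 2:
        return None
    return sum(exon_lengths[:-1])


def predicted_nmd_target(orf_stop, exon_lengths, threshold=NMD_THRESHOLD):
    """True when the stop lies more than `threshold` nt upstream of the junction."""
    junction = last_junction_position(exon_lengths)
    if junction is None:
        return False
    return junction - orf_stop > threshold
```

Since one prediction is made per ORF, a transcript with two predicted ORFs would be tested twice, once with each stop position.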

We downloaded a file with all predicted
microRNA/transcript interaction sites from the miRBase Targets database version
2.0. This file contains the name of each microRNA, the chromosome name, and the chromosome
positions. From these chromosomal positions, we realigned the microRNA sequence
with the corresponding genomic sequence using the miRanda program. Interaction
sites with an energy of -19 or lower were stored in the fast DB database. Each
transcript sequence within the alignment region is aligned with the genomic
sequence using Clustalw (9), and sequence variations between
genomic and transcript sequences are highlighted in red in order to predict whether
such variations potentially affect the miRNA/transcript interaction.
We tried to associate each transcript (cDNAs
and ESTs) with the tissue from which it was cloned. The algorithm used is described in
figure 13. When the information was available, we recovered the name of the transcript
library from CGAP and, if the library is associated with a tissue, we associated
this tissue with the transcript using a keyword search system over a collection of 36 tissues and
groups of tissues. Indeed, all possible tissues have been gathered
into 36 tissues or groups of tissues in order to allow further statistical
analysis (more transcripts per tissue). If a transcript is not associated
with a library, or if its library is not associated with a tissue, we used our
keyword search system on the tissue_type field of the transcript GenBank file
(if this field is available). 87% of the fast DB transcripts are associated
with one of the 36 tissues (figure 14). Of the 1,154,554 transcripts stored
in the fast DB database, 875,479 (76%) were associated with a tissue using their
library information and 132,740 (11%) were associated with a tissue using their
tissue_type information.
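The two-stage lookup described above (library information first, tissue_type field second) can be sketched as follows in Python. This is an illustration only: the keyword table is a hypothetical three-entry excerpt, whereas fast DB uses 36 tissues or groups of tissues whose keywords are not listed in this document, and the function names are invented for the example.

```python
# Hypothetical excerpt of a keyword table (the real one has 36 groups).
TISSUE_KEYWORDS = {
    "brain": ("brain", "cortex", "hippocampus"),
    "liver": ("liver", "hepat"),
    "lung": ("lung",),
}


def find_tissue(text):
    """Return the first tissue group whose keyword occurs in `text`, else None."""
    if not text:
        return None
    lowered = text.lower()
    for tissue, keywords in TISSUE_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return tissue
    return None


def associate_tissue(library_tissue=None, tissue_type=None):
    """Try the library description first, then the GenBank tissue_type field.

    Returns (tissue, source); unassigned transcripts go to the "n/a" group.
    """
    tissue = find_tissue(library_tissue)
    if tissue:
        return tissue, "library"
    tissue = find_tissue(tissue_type)
    if tissue:
        return tissue, "tissue_type"
    return "n/a", None
```

Recording the source alongside the tissue makes it possible to report, as above, how many transcripts were assigned through each route.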

Figure 13: algorithm for association between transcript and tissue

Figure 14: Tissue distribution histogram of transcripts
Once the splicing events are defined, all the results
are stored in a MySQL database through insertion queries.
The program dynamically generates a PDF file
for each gene (Perl module PDF::API2). All pictures in the PDF
files were made using the GD module. These files are stored on the hard disk of
the server in order to speed up downloading for the user.
Interactive graphical representations of each gene
and its transcripts (figure 15) are dynamically generated (module SWF). The legend
of these graphs is described in the User’s guide. It is important to underline that
on the gene graphical representation, the exons (represented by green
rectangles) do not correspond to genomic exons (most frequent exons), but to
the longest exon for a given genomic position. In other words, the beginning
and the end of each exon are defined by the first and last positions occupied
by the transcript exons belonging to this genomic exon. A black bent line connecting
two consecutive genomic exons represents a splicing event between these two genomic
exons. All other splicing events are represented by a red bent line under the
exons. With this in mind, it is important to remember that a “black
connection” does not necessarily correspond to the most frequent splicing event: for
example, even if the skipping of exon 3 is the most frequent splicing event, it
will be represented by a “red connection”. Note that some inconsistencies can
be observed between the graphical representation of alternative splicing and the alternative
splicing events defined by fast DB. This is due to the highly stringent
criteria established by fast DB to define alternative splicing events. One
example is illustrated in figure 15 (orange exon of transcript 2): here a
transcript exon has 2 additional nucleotides at its 3’ end, and even though
fast DB has not defined an alternative 5' splice site (at least 3 nucleotides are needed),
the graphical representation shows a red line under the exons. This system is
used to draw the attention of users to the fact that there can be
additional alternative splicing events in which fast DB does not have enough
confidence.

Figure 15: Gene graphical representation
The fast DB web interface was written in Perl CGI
and runs on an Apache server (http://www.apache.org/).
This interface has several roles:
The user can retrieve a given gene or a list of genes of
interest in several ways: by keyword search, by pasting a sequence, by
uploading a file containing a list of gene IDs, or by finding genes
with common characteristics using the “advanced search” page. The sequence-based
search is performed with BLAST. The user has to paste a sequence of at least
20 nucleotides. For multiple queries, users
can upload a file (rtf, doc, txt…) with the EnsEMBL stable IDs of several genes. Only
one EnsEMBL stable ID per line is allowed.
The fast DB program uses Clustalw to multi-align
all transcript sequences from a given gene. However, to avoid alignment errors
due to the number of sequences to be aligned and to the large differences among
them, the fast DB program “prepares” these sequences before alignment. First, all
transcript exons located at the same genomic position are identified. In case
of a retained intron, the program separates the corresponding transcript exon
into several exons and intron(s) (figure

Figure 16: Multi-alignment including a retained intron
Users can define PCR primers directly on the multi-alignment
(copy sequence) and paste them in the corresponding boxes. Once the “run PCR”
button is clicked, information concerning the primers is available and a table
gives the length of the predicted PCR product for each transcript. Please note
that the primer sequence must be identical to the transcript sequence in order
to get a result: no mismatch is tolerated.
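The exact-match behaviour can be illustrated with the Python sketch below. It is not the fast DB implementation: it assumes that both primers are pasted as they appear on the transcript sequence in the multi-alignment (no reverse complement is taken), that the first exact occurrence of each primer is used, and that the product spans from the start of the first primer to the end of the second one. `pcr_product_length` is a hypothetical name.

```python
def pcr_product_length(transcript, forward, reverse):
    """Length of the predicted PCR product, or None if a primer is not found.

    Assumption: both primers are given in transcript orientation, as copied
    from the multi-alignment. Matching is exact (no mismatch tolerated).
    """
    transcript, forward, reverse = (
        seq.upper() for seq in (transcript, forward, reverse)
    )
    start = transcript.find(forward)
    if start == -1:
        return None
    # The second primer must lie downstream of the first one.
    reverse_start = transcript.find(reverse, start + len(forward))
    if reverse_start == -1:
        return None
    return reverse_start + len(reverse) - start
```

Running the function on every transcript of a gene gives one product length per transcript, which is the kind of table described above; transcripts lacking either primer yield no product.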
The “probe alignment” tool allows users to rapidly
locate any sequence within the gene exon/intron structure (such as probes
used in micro-arrays). Each input sequence is aligned with the genomic
sequence using blat (10). The corresponding interactive
scheme is generated dynamically by Ming’s SWF module. The alignment provided at the
bottom of the page comes from the blat output.
For each human gene, the “tissue specificity”
page provides the tissue distribution of its transcripts in 36 different organs
or tissue types. Furthermore, the tissue distribution for each alternative
splicing event defined by human cDNAs is available.
We used the Perl module GD::Graph to
draw the histogram of the tissue distribution of all gene transcripts. Transcripts not
associated with a tissue are gathered in the “n/a” group. Note
that only tissues in which transcripts are expressed are
represented on the chart.
Tissue distribution analysis is available for 6
of the 7 events defined by fast DB: IED events cannot be analyzed for tissue
specificity in fast DB at this time.
All the different alternative first exons are
represented on the same chart. For each alternative first exon, we gathered the
transcripts that defined this event (transcripts 1 and 2 in figure 17). All
5’-partial transcripts were excluded from the study (transcript 3 in figure 17).

Figure 17: Transcript selection for tissue specificity histogram of alternative first exons
Following the same principle as for alternative first
exons, all the different alternative terminal exons are represented on the same
chart. For each alternative terminal exon, we gathered the transcripts that defined
this event. All 3’-partial transcripts were excluded from the study.
Several distinct groups of transcripts are
represented on the chart. The first one corresponds to the transcripts that define
the splicing event. The other corresponds to the transcripts that include the genomic exon(s)
skipped in the transcripts defining the alternative event. The latter
group can be divided into subgroups according to the splicing events defined
adjacent to the studied event. For this reason, a pair of values, A and B,
was defined (figure 18). For each different pair of values, a group is defined
and represented on the chart, in addition to the group of transcripts defining
the splicing event.

Figure 18: Definition of the (A, B) pair and consequent definition of transcript groups
Following the same principle as for exon skipping, several
distinct groups of transcripts are represented on the chart. The first one
corresponds to the transcripts that define the splicing event. The other corresponds to the
transcripts that splice out the intron(s) (genomic intron(s) or not)
included in the transcripts defining the alternative event. In the case of a single retained
intron, 4 values are defined (figure 19). This number increases in the case of
consecutive retained introns. In the case of 3’- or 5’-partial transcripts, or when the first
or last intron is retained, the number of values can decrease (figure 19, cases 6 and
7). For each different group of values, a transcript group is defined and
represented on the chart, in addition to the group of transcripts defining the
splicing event.

Figure 19: Definition of (A, B, C, D) and consequent definition of transcript groups
Several distinct groups of transcripts are
represented on the chart. Each group corresponds to a different pair of
acceptor/donor splice sites, as shown on figure 20.
Figure 20: Definition of transcript groups for alternative 3’ and 5’ splice sites
1. Birney, E., Andrews, D., Caccamo, M., Chen, Y., Clarke, L., Coates, G., Cox, T., Cunningham, F., Curwen, V., Cutts, T. et al. (2006) Ensembl 2006. Nucleic Acids Res, 34, D556-561.
2. Hinrichs, A.S., Karolchik, D., Baertsch, R., Barber, G.P., Bejerano, G., Clawson, H., Diekhans, M., Furey, T.S., Harte, R.A., Hsu, F. et al. (2006) The UCSC Genome Browser Database: update 2006. Nucleic Acids Res, 34, D590-598.
3. Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J. and Wheeler, D.L. (2006) GenBank. Nucleic Acids Res, 34, D16-20.
4. Eyre, T.A., Ducluzeau, F., Sneddon, T.P., Povey, S., Bruford, E.A. and Lush, M.J. (2006) The HUGO Gene Nomenclature Database, 2006 updates. Nucleic Acids Res, 34, D319-321.
5. Enright, A.J., John, B., Gaul, U., Tuschl, T., Sander, C. and Marks, D.S. (2003) MicroRNA targets in Drosophila. Genome Biol, 5, R1.
6. Ye, J., McGinnis, S. and Madden, T.L. (2006) BLAST: improvements for better sequence analysis. Nucleic Acids Res, 34, W6-9.
7. Florea, L., Hartzell, G., Zhang, Z., Rubin, G.M. and Miller, W. (1998) A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res, 8, 967-974.
8. Rice, P., Longden, I. and Bleasby, A. (2000) EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet, 16, 276-277.
9. Aiyar, A. (2000) The use of CLUSTAL W and CLUSTAL X for multiple sequence alignment. Methods Mol Biol, 132, 221-241.
10. Kent, W.J. (2002) BLAT--the BLAST-like alignment tool. Genome Res, 12, 656-664.