The fast DB documentation

 

 

 

 

 


Table of contents

General information
Exons and splicing events definition
  Genomic sequence
  Transcripts selection
    BLAST against EST and mRNA sequence databanks
    Alignment of selected transcript sequence with genomic sequence
  Genomic exons definition
  Primary definition of alternative splicing events
    Alternative first exon
    Alternative last exon
    Retained intron
    Exon skipping
    Alternative 3’ splicing site
    Alternative 5’ splicing site
    Internal Exon Deletion (IED)
  Refined definition of alternative splicing events
    Exon skipping
    Intron retention
Features available from the “Transcripts view” page
  Link between transcripts and PubMed
  Prediction of Open Reading Frames (ORF)
  Prediction of nonsense-mediated mRNA decay (NMD)
  Prediction of the microRNA/transcript interaction sites
  Association of transcript with tissue
Database filling and creation of PDF and SWF files
  Database filling
  Creation of PDF files
  Creation of SWF files
The fast DB web interface
  Search engine
  Multi-alignment of transcript sequences and in silico PCR
    Multi-alignment of transcript sequences
    In silico PCR
  Probe alignment
  Tissue-specificity analysis
    Tissue distribution histogram of all gene transcripts
    Tissue distribution histogram of gene transcripts for a specific event
      Analysis of alternative first exons
      Analysis of alternative terminal exons
      Exon skipping
      Intron retention
      Alternative 3’ and 5’ splice sites
FIGURE INDEX
References

 


General information

 

All scripts have been written in PERL v5.8.7 (http://www.perl.org/) using the following modules:

 

 

 

Data processed by our scripts have been retrieved from the following public databanks:

 

 

 

All data are stored in a MySQL relational database (http://www.mysql.com/). MySQL v4.1.12 is used and all tables were created in MyISAM format.

 

Exons and splicing events definition

 

Genomic sequence

 

Fast DB provides three kinds of analysis: human mRNAs, human mRNAs and ESTs, and mouse mRNAs.

 

For the analysis of human genes (the first two analyses), we retrieved the genomic sequences of the 22,218 “protein-coding” genes from the homo_sapiens_core_31_35d EnsEMBL database. Mouse genomic sequences come from the mus_musculus_core_35_34d EnsEMBL database. We used the ensembl_compara_36 database to associate each human gene with its orthologous mouse gene.

 

All retrieved genomic sequences were extended upstream and downstream of the genomic region defined by EnsEMBL, to make sure that additional exons at the gene borders are included.

 

 

Transcripts selection

BLAST against EST and mRNA sequence databanks 

To define the exon/intron structure and the splicing events of a gene, we aligned transcripts against its genomic sequence. The selected transcripts come from several sources. “Full-length” mRNAs and ESTs come from the UCSC website, and “partial” mRNAs were downloaded from GenBank using the query “((splic*[Text Word]) OR (variant*[Text Word]) OR (isoform*[Text Word])) AND (homo sapiens[Organism]) AND (mRNA[Text Word]) AND (partial[Text Word]) NOT (DNA[Text Word]) NOT (BAC[Text Word]) NOT (contig[Text Word]) NOT (cosmid[Text Word])”. Each sequence bank was then formatted with formatdb so that it could be queried with BLAST (6).

The sequence of each exon defined by EnsEMBL is blasted against these sequence banks in order to recover the transcripts to be analyzed further.

 

Alignment of selected transcript sequence with genomic sequence

Each recovered transcript is aligned with the genomic sequence using sim4 (7). Very stringent criteria were defined to eliminate potentially low-quality transcripts. The selection comprises two distinct steps. To pass the first step, a transcript has to meet the following conditions:

 

  • At least 95% of the transcript sequence has to be aligned,
  • Number of defined exons greater than one *,
  • From beginning to end of alignment, at least 10% of the genomic region has to be covered,
  • Global percent of identity of alignment has to be at least 98%.

 

* If all selected transcripts have one exon, this criterion is not used.

 

Using all pre-selected transcripts, we defined another criterion: the average, over these transcripts, of the ratio between the length of all defined exons and the length of all defined introns. To pass this second step, the ratio calculated for a given transcript has to be less than three times the average ratio. Of course, this second step is not applied if all pre-selected transcripts have only one exon.
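
The two selection steps described above can be sketched as follows. This is only an illustration in Python, not the original Perl code: the Alignment record and its field names are hypothetical, and the "average ratio" is assumed to be the mean of the per-transcript exon/intron length ratios.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Alignment:
    """Summary of one sim4 alignment (all field names are hypothetical)."""
    name: str
    aligned_fraction: float   # fraction of the transcript that is aligned (0-1)
    identity: float           # global percent identity (0-100)
    genomic_coverage: float   # fraction of the genomic region spanned (0-1)
    exon_lengths: list        # lengths of the exons defined by the alignment
    intron_lengths: list      # lengths of the introns between those exons


def exon_intron_ratio(a):
    # Total exon length divided by total intron length for one transcript.
    return sum(a.exon_lengths) / sum(a.intron_lengths)


def select_transcripts(alignments):
    # The "more than one exon" rule is waived when every candidate has one exon.
    all_single = all(len(a.exon_lengths) == 1 for a in alignments)
    step1 = [
        a for a in alignments
        if a.aligned_fraction >= 0.95
        and a.genomic_coverage >= 0.10
        and a.identity >= 98.0
        and (all_single or len(a.exon_lengths) > 1)
    ]
    # The second step is skipped when only single-exon transcripts remain.
    if all(len(a.exon_lengths) == 1 for a in step1):
        return step1
    average = mean(exon_intron_ratio(a) for a in step1)
    return [a for a in step1 if exon_intron_ratio(a) < 3 * average]
```

With this reading of the rule, a transcript is only rejected in the second step when at least four transcripts were pre-selected, because a single ratio cannot exceed three times a mean computed over three values or fewer.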

 

 

Genomic exons definition

 

Each selected transcript alignment is parsed to define the exon positions on the transcript sequence and on the genomic sequence. We define a "genomic exon" as the most frequent exon among all transcript exons at a given genomic position (Figure 1). For this analysis, all transcript exons were sorted in ascending order of their first position in the genomic sequence. They were then grouped into "bags", each "bag" corresponding to one "genomic exon".

 

We defined the first and the last positions of a genomic exon as the most frequent first and last positions occupied by the transcript exons belonging to this genomic exon “bag” (see Figure 1). However, the first position of the first exon and the last position of the last exon of the gene were defined differently. The first position of the first exon was defined as the lowest position among the transcript exons. The last position of the last exon was defined as the highest position among the transcript exons. For an “intron-less” gene, the first and last positions were defined as the lowest and the highest positions of the single exon, respectively.
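
As an illustration, the boundary rules above can be written as a short Python sketch. The function name and the (start, end) tuple layout are assumptions made for this example, and the way exons are grouped into bags is not covered. When two positions are equally frequent, this sketch simply keeps the one encountered first, which the text does not specify.

```python
from collections import Counter


def genomic_exon_bounds(bag, is_first=False, is_last=False):
    """Return (start, end) of the genomic exon for one bag.

    bag: list of (start, end) genomic coordinates of the transcript exons
    grouped in this bag.
    is_first / is_last: whether the bag is the first / last exon of the gene
    (both True for an intron-less gene).
    """
    starts = [start for start, _ in bag]
    ends = [end for _, end in bag]
    # Most frequent boundary, except at the outer borders of the gene.
    start = min(starts) if is_first else Counter(starts).most_common(1)[0][0]
    end = max(ends) if is_last else Counter(ends).most_common(1)[0][0]
    return start, end
```

For example, for a bag containing (100, 200), (100, 200), (95, 200) and (100, 210), an internal genomic exon is (100, 200), whereas the same bag treated as the first exon of the gene starts at position 95.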

 

 

 

Figure 1: Genomic exon definition

 

 

Primary definition of alternative splicing events

 

After genomic exons have been defined, alternative events were defined by comparing transcript exons with the corresponding genomic exons. Seven types of alternative events were defined:

 

  • Alternative first exon,
  • Alternative last exon,
  • Retained intron,
  • Exon skipping,
  • Alternative 3’ splice site,
  • Alternative 5’ splice site,
  • IED (Internal Exon Deletion): in most cases, this event applies to small introns that are rarely eliminated.

 

 

Alternative first exon

To define a transcript exon as an alternative first exon of a gene, we used several criteria. First, the exon must be the first exon of one or more transcripts. Second, if the first exon of a transcript does not co-localize with any genomic exon (case 1, transcript 2 in figure 2), it is defined as a first exon. If, on the contrary, the first exon of a transcript lies within the genomic sequence corresponding to a genomic exon, it is defined as a first exon only if it starts at least 10 bp upstream of the first position of the corresponding genomic exon (case 2, transcript 2’ in figure 2); this criterion prevents “false” first exons from being defined because of sequencing errors.
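
A minimal Python sketch of these two cases is given below. It assumes plus-strand coordinates (upstream means a smaller position) and interprets "co-localize" as any overlap between the transcript exon and a genomic exon; both points, like the function name, are assumptions of this example.

```python
def is_alternative_first_exon(exon, genomic_exons, min_upstream=10):
    """Decide whether the first exon of a transcript defines a first exon.

    exon: (start, end) of the first exon of a transcript.
    genomic_exons: list of (start, end) genomic exons of the gene.
    """
    start, end = exon
    for g_start, g_end in genomic_exons:
        if start <= g_end and end >= g_start:
            # Case 2: the exon overlaps a genomic exon and must begin
            # at least `min_upstream` bases before it.
            return g_start - start >= min_upstream
    # Case 1: no genomic exon at this location.
    return True
```

With genomic exons at (1000, 1200) and (2000, 2100), a transcript first exon at (985, 1200) is accepted (15 bp upstream) while one at (995, 1200) is not.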

 

 

 

Figure 2: Alternative first exon definition

 

It is important to underline that the existence of upstream exon(s) that are not present in the available transcript(s) cannot be excluded. To gain more confidence in a candidate “first exon”, users can run promoter and transcription factor binding site predictions and 5’-UTR feature analyses on the candidate “first exon” and its upstream sequences with the fast DB interface (see User’s guide).

 

Alternative last exon

Alternative last exons (other than the last exon) are shown in red in the example in figure 3. A last exon is defined as an exon that is the last exon of at least one transcript and that extends at least 10 bases beyond the last position of the corresponding genomic exon (transcript 2’ in figure 3). If there are only last transcript exons at this position, the exon is always considered an alternative last exon (transcript 2 in figure 3). We required this extension of at least 10 nucleotides into the next intron to limit the number of false positives due to sequencing errors.

 

 

 

Figure 3: Alternative last exon definition

 

It is important to underline that the existence of downstream exon(s) that are not present in the available transcript(s) cannot be excluded. To gain more confidence in a candidate “last exon”, users can run polyadenylation site predictions and 3’-UTR feature analyses on the candidate “last exon” and its downstream sequences with the fast DB interface (see User’s guide).

 

 

Retained intron

A transcript exon that ends downstream of the first position of the next genomic exon is defined as a retained intron (red exon of transcript 2 in figure 4). To avoid “false retained introns” in fast DB arising from genomic or pre-mRNA contamination of the transcript databases, we introduced an optimized criterion during data processing, namely the ratio of intron length to exon length (see “Transcripts selection”).

 

 

 

Figure 4: Retained intron definition

 

Exon skipping

We defined an exon skipping event when two consecutive exons of the same transcript meet the following criterion: the genomic exon position (rank) of the second exon is higher than that of the first exon plus one. For example, figure 5 shows that the second exon of transcript 2 is at position three, which is higher than two (the position of the first exon plus one).
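
The criterion can be illustrated with a few lines of Python. The function name and the representation of a transcript as a list of genomic exon ranks are assumptions made for this sketch.

```python
def skipped_exons(ranks):
    """Return the genomic exon ranks skipped by one transcript.

    ranks: genomic exon rank of each consecutive exon of the transcript,
    for example [1, 3, 4] for a transcript that joins exon 1 to exon 3.
    Each returned list holds the ranks skipped between two consecutive exons.
    """
    events = []
    for first, second in zip(ranks, ranks[1:]):
        if second > first + 1:
            events.append(list(range(first + 1, second)))
    return events
```

For a transcript whose exons are at positions 1, 3 and 4, the only event is the skipping of exon 2, as in the example of figure 5.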

 

 

 

Figure 5: Exon skipping definition

 

 

Alternative 3’ splicing site

An alternative 3’ splicing site was defined when a transcript exon starts at least 3 nucleotides downstream or upstream of the start of the corresponding genomic exon (red exon of transcript 2 in figure 6). This threshold was chosen to exclude false positives due to sequencing errors. To exclude false positives due to sim4 alignment problems, a further criterion was that the previous transcript exon (from the same transcript) had no alternative 5’ splicing site whose length difference has the opposite sign to that of the defined alternative 3’ splicing site (orange exons of transcript 3 in figure 6). Such events seem, in many situations, to be due to sim4 alignment problems. Finally, a transcript exon with an alternative 3’ splicing site cannot be the first exon of its transcript, since first exons do not follow introns.
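
The sketch below shows one possible reading of these rules in Python; it is not the original implementation. It assumes plus-strand coordinates and interprets the "difference of length" as the number of bases gained (positive) or lost (negative) by the transcript exon at the splice site concerned. The function name and the parallel-list layout are hypothetical.

```python
def has_alt_3prime_site(index, transcript_exons, genomic_exons, min_shift=3):
    """Check exon `index` of a transcript for an alternative 3' splicing site.

    transcript_exons and genomic_exons are parallel lists of (start, end):
    genomic_exons[i] is the genomic exon matched by transcript_exons[i].
    """
    if index == 0:
        return False  # a first exon does not follow an intron
    # Bases gained (>0) or lost (<0) at the start of this exon.
    acceptor_delta = genomic_exons[index][0] - transcript_exons[index][0]
    if abs(acceptor_delta) < min_shift:
        return False  # too small: possibly a sequencing error
    # Bases gained (>0) or lost (<0) at the end of the previous exon.
    donor_delta = transcript_exons[index - 1][1] - genomic_exons[index - 1][1]
    if abs(donor_delta) >= min_shift and donor_delta * acceptor_delta < 0:
        return False  # opposite signs: probably a shifted sim4 junction
    return True
```

The alternative 5’ splicing site test of the next section would be symmetric, looking at the end of the exon and at the start of the following exon.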

 

 

Figure 6: Alternative 3’ splicing site definition

 

 

Alternative 5’ splicing site

An alternative 5’ splicing site was defined when a transcript exon ends at least 3 nucleotides downstream or upstream of the end of the corresponding genomic exon (red exon of transcript 2 in figure 7). This threshold was chosen to exclude false positives due to sequencing errors. To exclude false positives due to sim4 alignment problems, a further criterion was that the next transcript exon (from the same transcript) had no alternative 3’ splicing site whose length difference has the opposite sign to that of the defined alternative 5’ splicing site (orange exons of transcript 3 in figure 7). Indeed, such events seem, in many situations, to be due to sim4 alignment problems. Finally, a transcript exon with an alternative 5’ splicing site cannot be the last exon of its transcript, since last exons are not followed by introns.

 

Figure 7: Alternative 5’ splicing site definition

 

 

Internal Exon Deletion (IED)

An internal exon deletion was defined if at least two transcript exons from the same transcript were at the same genomic exon position and if the gap between these exons was equal to or greater than ten nucleotides (red exons of transcript 2 in figure 8). The limit was set at 10 nucleotides to avoid calling deletions that are owing to sequencing errors. In most cases, these events seem to correspond to small introns that are rarely spliced. Indeed, these sequences had consensus acceptor/donor splicing sites (data not shown).
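
A minimal Python sketch of this rule follows. The (rank, start, end) layout and the function name are assumptions of the example, and the gap is counted as the number of genomic bases lying strictly between two pieces.

```python
from collections import defaultdict


def internal_exon_deletions(exons, min_gap=10):
    """Find IED events in one transcript.

    exons: list of (rank, start, end), where rank is the genomic exon in
    which the transcript exon lies and start/end are genomic coordinates.
    Returns (rank, deleted_start, deleted_end) for each event.
    """
    by_rank = defaultdict(list)
    for rank, start, end in exons:
        by_rank[rank].append((start, end))
    events = []
    for rank, pieces in sorted(by_rank.items()):
        pieces.sort()
        for (_, end1), (start2, _) in zip(pieces, pieces[1:]):
            gap = start2 - end1 - 1  # bases missing between the two pieces
            if gap >= min_gap:
                events.append((rank, end1 + 1, start2 - 1))
    return events
```

For instance, two pieces (300, 350) and (371, 450) inside genomic exon 2 are separated by 20 bases and yield the event (2, 351, 370), whereas a 5-base gap is ignored.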

 


Figure 8: IED definition

 

 

Refined definition of alternative splicing events

 

To allow further analyses, we refined the definition of two types of alternative splicing events: exon skipping and intron retention.

 

Exon skipping

Consecutive skipped exons in the same transcript are considered a single event (figure 9, item 1). However, in some cases, exons that seem to be skipped are no longer taken into account in the definition of alternative splicing events, because these exons are only associated with a specific promoter or a specific alternative terminal exon, as shown in figure 9, items 2 and 3. An exon associated with a specific promoter (or a specific alternative terminal exon) can also be adjacent to a “real” skipped exon (figure 9, items 4 and 5).

 

 

Figure 9: refined definition of exon skipping events

 

 

Intron retention

Following the same principle as for exon skipping, consecutive retained introns are considered a single event (figure 10, item 4).

 

 

Figure 10: refined definition of intron retention events

 

 

Features available from the “Transcripts view” page

 

Link between transcripts and PubMed

 

To link each transcript with the corresponding articles stored in PubMed, we used the HTTP::Lite Perl module with the following URL for each transcript of fast DB: “http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=search&db=pubmed&term=$t”, where “$t” is the GenBank accession number of the transcript. When a request was successful, the corresponding PubMed IDs (PMID) were stored in our database.
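
For illustration, the query URL can be assembled as in the following Python sketch (the original scripts used Perl and HTTP::Lite). The function name and the example accession number are hypothetical, the code only builds the URL without sending a request, and the endpoint is the one quoted above, which may no longer be served.

```python
from urllib.parse import urlencode

# Endpoint quoted in the text above.
ENTREZ_QUERY = "http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi"


def pubmed_search_url(accession):
    """Build the PubMed search URL for one GenBank accession number."""
    params = {"cmd": "search", "db": "pubmed", "term": accession}
    return ENTREZ_QUERY + "?" + urlencode(params)


# Example with an arbitrary accession number.
print(pubmed_search_url("NM_000546"))
```

Using urlencode rather than plain string concatenation keeps the URL valid if an identifier ever contains characters that need escaping.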

 

 

Prediction of Open Reading Frames (ORF)

 

Fast DB provides a predicted ORF for most transcripts (but not for ESTs). In some cases, two ORFs are predicted for the same transcript (blue and red ORFs on the “Transcripts view” page, see User’s guide). For each transcript, we used the algorithm described in figure 11. We used the “getorf” program from the EMBOSS package (8) to find all ORFs of at least 120 nucleotides (“getorf -minsize 120 -reverse No file_in file_out”). Then, we selected the two longest ORFs. The first selected ORF (ORF1 in figure 11) is the ORF covering the highest number of transcript exons. If there are alternative splicing events in the first exon covered by ORF1, the second selected ORF (ORF2 in figure 11) is the one whose start position is as far upstream as possible among the ORFs of at least 120 nucleotides.

 

 

 

 

Figure 11: algorithm of ORF prediction

 

 

Prediction of nonsense-mediated mRNA decay (NMD)

 

For each transcript (not for ESTs) and for each predicted ORF, we calculated the number of nucleotides between the ORF stop position (on the transcript sequence) and the position of the last exon-exon junction (figure 12). If this distance is greater than 50 nucleotides, the transcript is predicted to be targeted by NMD (one prediction per ORF for each transcript).
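
The rule can be sketched in Python as follows. The function name and arguments are assumptions of this example; the last exon-exon junction is taken as the end of the penultimate exon on the transcript sequence, and a stop located downstream of that junction never triggers a prediction.

```python
def predicts_nmd(orf_stop, exon_lengths, threshold=50):
    """Predict whether a transcript is targeted by NMD for one ORF.

    orf_stop: position of the ORF stop on the transcript sequence.
    exon_lengths: lengths of the transcript exons, in transcript order.
    """
    if len(exon_lengths) < 2:
        return False  # no exon-exon junction in a single-exon transcript
    # Position of the last junction = total length of all exons but the last.
    last_junction = sum(exon_lengths[:-1])
    return last_junction - orf_stop > threshold
```

For a transcript with exons of 100, 200 and 150 nucleotides, the last junction is at position 300: a stop at position 200 gives an NMD prediction, while a stop at position 260 does not.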

 

 

Figure 12: prediction of NMD

 

 

Prediction of the microRNA/transcript interaction sites

 

We downloaded a file with all predicted microRNA/transcript interaction sites from the miRBase Targets database version 2.0. This file contains the name of each microRNA, the chromosome name, and the chromosomal positions. From these chromosomal positions, we realigned the microRNA sequence with the corresponding genomic sequence using the miRanda program. Interaction sites with an energy of at most -19 were stored in the fast DB database. Each transcript sequence within the alignment region is aligned with the genomic sequence using Clustalw (9), and sequence variations between the genomic and transcript sequences are highlighted in red to help predict whether such variations may affect the miRNA/transcript interaction.

 

 

Association of transcript with tissue

 

We tried to associate each transcript (cDNAs and ESTs) with the tissue from which it was cloned. The algorithm used is described in figure 13. When the information was available, we retrieved the name of the transcript library from CGAP and, if the library is associated with a tissue, we associated this tissue with the transcript using a keyword search system over a collection of 36 tissues and groups of tissues. All possible tissues were gathered into 36 tissues or groups of tissues to allow further statistical analysis (more transcripts per tissue). If a transcript is not associated with a library, or if its library is not associated with a tissue, we applied our keyword search system to the tissue_type field of the transcript's GenBank file (if this field is available). In total, 87% of the fast DB transcripts are associated with one of the 36 tissues (figure 14). Of the 1,154,554 transcripts stored in the fast DB database, 875,479 (76%) were associated with a tissue using their library information and 132,740 (11%) using their tissue_type information.
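
The fallback logic can be illustrated with the Python sketch below. The keyword table is a hypothetical two-entry excerpt (the actual 36 groups and their keywords are not listed here), and the function names are assumptions of this example. Transcripts without any match are labelled "n/a".

```python
# Hypothetical excerpt of a keyword table; the real one covers 36 groups.
TISSUE_KEYWORDS = {
    "brain": ("brain", "cortex", "hippocampus"),
    "liver": ("liver", "hepat"),
}


def match_tissue(text):
    """Return the first tissue group whose keyword occurs in `text`."""
    if not text:
        return None
    lowered = text.lower()
    for tissue, keywords in TISSUE_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return tissue
    return None


def associate_tissue(library_tissue, tissue_type):
    """Try the library annotation first, then the GenBank tissue_type field."""
    return match_tissue(library_tissue) or match_tissue(tissue_type) or "n/a"


print(associate_tissue("Fetal brain", None))
print(associate_tissue(None, "hepatocellular carcinoma"))
```

The percentages quoted above follow directly from the counts: 875,479 and 132,740 out of 1,154,554 transcripts round to 76% and 11%, and their sum rounds to 87%.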

 

 

Figure 13: algorithm for association between transcript and tissue

 

 

 

Figure 14: Tissue distribution histogram of transcripts

 

 


Database filling and creation of PDF and SWF files

 

Database filling

 

Once the splicing events have been defined, all results are stored in a MySQL database using INSERT queries.

 

 

Creation of PDF files

 

The program dynamically generates a PDF file for each gene (Perl module PDF::API2). All pictures in the PDF files were made using the GD module. These files are stored on the hard disk of the server to speed up downloading for the user.

 

 

Creation of SWF files

 

Interactive graphical representations of each gene and its transcripts (figure 15) are dynamically generated (module SWF). The legend of these graphs is described in the User’s guide. It is important to underline that, on the gene graphical representation, the exons (represented by green rectangles) do not correspond to genomic exons (most frequent exons), but to the longest exon for a given genomic position. In other words, the beginning and the end of each exon are defined by the first and last positions occupied by the transcript exons belonging to this genomic exon. A black bent line connecting two consecutive genomic exons represents a splicing event between these two genomic exons. All other splicing events are represented by a red bent line under the exons. Taking this into account, it is important to keep in mind that a “black connection” does not necessarily correspond to the most frequent splicing event: for example, even if the skipping of exon 3 is the most frequent splicing event, it will be represented by a “red connection”. Note that some inconsistencies can be observed between the graphical representation of alternative splicing and the alternative splicing events defined by fast DB. This is due to the highly stringent criteria used by fast DB to define alternative splicing events. One example is illustrated in figure 15 (orange exon of transcript 2): here a transcript exon has 2 additional nucleotides at its 3’ end, and even though fast DB has not defined an alternative 5' splice site (which requires at least 3 nucleotides), the graphical representation will show a red line under the exons. This system was used to draw users' attention to the fact that there can be additional alternative splicing events in which fast DB does not have enough confidence.

 

 

Figure 15: Gene graphical representation

 

 

The fast DB web interface

 

 

The fast DB web interface was made in Perl CGI on an APACHE server (http://www.apache.org/). This interface has several roles:

 

  • To allow users to easily find their gene of interest,
  • To present data clearly,
  • To navigate through the website,
  • To provide links to other websites,
  • To provide tools, particularly “in silico PCR”, “probe alignment”, functional protein domain search, tissue-specificity analysis.

 

 

Search engine

 

There are several ways to retrieve a given gene or a list of genes of interest: users can run a keyword search, paste a sequence, upload a file containing a list of gene IDs, or find genes with common characteristics using the “advanced search” page. Sequence-based searches are performed with BLAST. The pasted sequence must contain at least 20 nucleotides. For multiple queries, users can upload a file (rtf, doc, txt…) with the EnsEMBL stable IDs of several genes. Only one EnsEMBL stable ID per line is allowed.

 

 

 

Multi-alignment of transcript sequences and in silico PCR

Multi-alignment of transcript sequences

The fast DB program uses Clustalw to multi-align all transcript sequences from a given gene. However, to avoid alignment mistakes due to the number of sequences to be aligned and to the large differences among them, the fast DB program “prepares” these sequences before aligning them. First, all transcript exons located at the same genomic position are identified. In the case of a retained intron, the program splits the corresponding transcript exon into several exons and intron(s) (figure 16, in red): in this case fast DB uses the longest exon for each genomic position, and not the genomic exon. In the next step, all exons located at the same genomic position are aligned. For a genomic position with an IED, fast DB uses Clustalw with the global alignment option.

 

 

Figure 16: Multi-alignment including a retained intron

 

In silico PCR

Users can define PCR primers directly on the multi-alignment (copy sequence) and paste them in the corresponding boxes. Once the “run PCR” button is clicked, information concerning the primers is displayed and a table gives the length of the predicted PCR product for each transcript. Please note that the primer sequence must be identical to the transcript sequence to obtain a result; no mismatch is tolerated.
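
The exact-match behaviour can be sketched in Python as follows. This is only an illustration, not the fast DB code: it assumes that both primers are given as they appear in the multi-alignment, that is in transcript sense, and the function name is hypothetical.

```python
def pcr_product_length(transcript, forward, reverse):
    """Length of the predicted PCR product on one transcript, or None.

    Both primers are assumed to be copied from the transcript sequence
    (same strand); matching is exact, with no mismatch tolerated.
    """
    sequence = transcript.upper()
    forward, reverse = forward.upper(), reverse.upper()
    start = sequence.find(forward)
    if start == -1:
        return None  # forward primer not found
    # The second primer must lie downstream of the first one.
    end = sequence.find(reverse, start + len(forward))
    if end == -1:
        return None  # reverse primer not found
    return end + len(reverse) - start
```

Running the function on every transcript of a gene gives the kind of table described above: one product length per transcript, and no result where a primer does not match exactly, for example when it falls in a skipped exon.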

 

 

Probe alignment

 

The “probe alignment” tool allows users to rapidly locate any sequence (such as probes used in micro-arrays) within the exon/intron structure of the gene. Each input sequence is aligned with the genomic sequence using blat (10). The corresponding interactive diagram is generated dynamically with Ming’s SWF module. The alignment shown at the bottom of the page comes from the blat output.

 

 

Tissue-specificity analysis

 

For each human gene, the “tissue specificity” page provides the tissue distribution of its transcripts in 36 different organs or tissue types. Furthermore, the tissue distribution for each alternative splicing event defined by human cDNAs is available.

 

Tissue distribution histogram of all gene transcripts

We used the Perl module GD::Graph to draw the histogram of the tissue distribution of all gene transcripts. Transcripts not associated with a tissue are gathered in the “n/a” group. It is important to underline that only the tissues in which transcripts are expressed are represented on the chart.

Tissue distribution histogram of gene transcripts for a specific event

Tissue distribution analysis is available for 6 of the 7 events defined by fast DB: IED events cannot be analyzed for tissue specificity in fast DB at this time.

 

Analysis of alternative first exons

All the different alternative first exons are represented on the same chart. For each alternative first exon, we gathered transcripts that defined this event (transcripts 1 and 2 on figure 17). All 5’-partial transcripts were excluded from study (transcript 3 on figure 17).

 

 

Figure 17: Transcript selection for tissue specificity histogram of alternative first exons

 

Analysis of alternative terminal exons

Following the same principle as for alternative first exons, all the different alternative terminal exons are represented on the same chart. For each alternative terminal exon, we gathered the transcripts that define this event. All 3’-partial transcripts were excluded from the study.

 

Exon skipping

Several distinct groups of transcripts are represented on the chart. The first group corresponds to the transcripts that define the splicing event. The other corresponds to the transcripts that include the genomic exon(s) skipped in the transcripts defining the alternative event. The latter group can be divided into subgroups according to the splicing events defined adjacent to the studied event. For this purpose, a pair of values, A and B, was defined (figure 18). For each distinct pair of values, a group is defined and represented on the chart, in addition to the group of transcripts defining the splicing event.

 

 

Figure 18: Definition of the (A, B) pair and consequent definition of transcript groups

Intron retention

Following the same principle as for exon skipping, several distinct groups of transcripts are represented on the chart. The first group corresponds to the transcripts that define the splicing event. The other corresponds to the transcripts that splice out the intron(s) (genomic intron(s) or not) retained in the transcripts defining the alternative event. For a single retained intron, 4 values are defined (figure 19). This number increases for consecutive retained introns. For 3’- or 5’-partial transcripts, or when the first or last intron is retained, the number of values can decrease (figure 19, cases 6 and 7). For each distinct group of values, a transcript group is defined and represented on the chart, in addition to the group of transcripts defining the splicing event.

 

 

Figure 19: Definition of (A, B, C, D) and consequent definition of transcript groups

 

Alternative 3’ and 5’ splice sites

Several distinct groups of transcripts are represented on the chart. Each group corresponds to a different pair of acceptor/donor splice sites, as shown on figure 20.

 

 

 

Figure 20: Definition of transcript groups for alternative 3’ and 5’ splice sites

 

 

 


FIGURE INDEX

 

Figure 1: Genomic exon definition
Figure 2: Alternative first exon definition
Figure 3: Alternative last exon definition
Figure 4: Retained intron definition
Figure 5: Exon skipping definition
Figure 6: Alternative 3’ splicing site definition
Figure 7: Alternative 5’ splicing site definition
Figure 8: IED definition
Figure 9: refined definition of exon skipping events
Figure 10: refined definition of intron retention events
Figure 11: algorithm of ORF prediction
Figure 12: prediction of NMD
Figure 13: algorithm for association between transcript and tissue
Figure 14: Tissue distribution histogram of transcripts
Figure 15: Gene graphical representation
Figure 16: Multi-alignment including a retained intron
Figure 17: Transcript selection for tissue specificity histogram of alternative first exons
Figure 18: Definition of the (A, B) pair and consequent definition of transcript groups
Figure 19: Definition of (A, B, C, D) and consequent definition of transcript groups
Figure 20: Definition of transcript groups for alternative 3’ and 5’ splice sites

 


References

 

 

1. Birney, E., Andrews, D., Caccamo, M., Chen, Y., Clarke, L., Coates, G., Cox, T., Cunningham, F., Curwen, V., Cutts, T. et al. (2006) Ensembl 2006. Nucleic Acids Res, 34, D556-561.

2. Hinrichs, A.S., Karolchik, D., Baertsch, R., Barber, G.P., Bejerano, G., Clawson, H., Diekhans, M., Furey, T.S., Harte, R.A., Hsu, F. et al. (2006) The UCSC Genome Browser Database: update 2006. Nucleic Acids Res, 34, D590-598.

3. Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J. and Wheeler, D.L. (2006) GenBank. Nucleic Acids Res, 34, D16-20.

4. Eyre, T.A., Ducluzeau, F., Sneddon, T.P., Povey, S., Bruford, E.A. and Lush, M.J. (2006) The HUGO Gene Nomenclature Database, 2006 updates. Nucleic Acids Res, 34, D319-321.

5. Enright, A.J., John, B., Gaul, U., Tuschl, T., Sander, C. and Marks, D.S. (2003) MicroRNA targets in Drosophila. Genome Biol, 5, R1.

6. Ye, J., McGinnis, S. and Madden, T.L. (2006) BLAST: improvements for better sequence analysis. Nucleic Acids Res, 34, W6-9.

7. Florea, L., Hartzell, G., Zhang, Z., Rubin, G.M. and Miller, W. (1998) A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res, 8, 967-974.

8. Rice, P., Longden, I. and Bleasby, A. (2000) EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet, 16, 276-277.

9. Aiyar, A. (2000) The use of CLUSTAL W and CLUSTAL X for multiple sequence alignment. Methods Mol Biol, 132, 221-241.

10. Kent, W.J. (2002) BLAT--the BLAST-like alignment tool. Genome Res, 12, 656-664.