Aplysia RNA-seq Assembly utilities
Recommended approach to searching the Aplysia transcriptome and interpreting the results
This powerful resource was assembled from massive, high-throughput Illumina sequencing of the Aplysia transcriptome (the RNAs) using the Trinity assembler, developed at the Broad Institute and the Hebrew University of Jerusalem.
If you wish to identify an Aplysia ortholog of a mammalian gene or a gene from another species, we recommend beginning with the protein sequence of the gene of interest, as protein sequences are more highly conserved than nucleotide sequences.
From the Current IGS transcriptome assembly page, go to the Web-based BLAST form (2014 assemblies) and BLAST against the nucleotide sequences using the tblastn program. You need to select a database. “A1 CNS (all)” is a good choice for initial BLAST searches, as it combines several CNS assemblies from the same animal (normalization for these libraries is explained below). You might also start with “All of the above combined” (for all tissues) or a genome-guided CNS assembly (which is still in a pilot stage). (“A1 – A10” correspond to tissues from a single animal, and B4 and B6 correspond to tissues from a second animal.)
The BLAST website installed at IGS is slightly different from similar sites at NCBI. To retrieve a sequence of interest, return to the Current IGS transcriptome assembly page (or open it in a second window), go to the Transcript lookup utility (2014 assemblies), and paste in the transcript ID and the sequence length. For example, for ">comp16460_c0_seq1 len=7225" you would use:
comp16460_c0_seq1 and 7225
(Assemblies were done separately for each tissue and for various libraries from CNS, so transcript IDs may not be unique, which is why sequence length is needed.)
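If you are collecting many hits, splitting each header into the two values the lookup utility needs can be scripted. The following is a minimal sketch, assuming every header follows the ">ID len=NUMBER" pattern shown in the example above; the function name parse_header is made up for illustration and is not part of the IGS site.

```python
import re

# Assumption: headers look like ">comp16460_c0_seq1 len=7225",
# i.e. an optional ">", the transcript ID, then "len=<digits>".
HEADER_RE = re.compile(r"^>?(?P<id>\S+)\s+len=(?P<length>\d+)")


def parse_header(header: str) -> tuple[str, int]:
    """Return (transcript_id, length) from a BLAST hit header line."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"unrecognized header: {header!r}")
    return match.group("id"), int(match.group("length"))


transcript_id, length = parse_header(">comp16460_c0_seq1 len=7225")
print(transcript_id, length)
```

Keeping the ID and length together as a pair is deliberate: because IDs may repeat across assemblies, the pair is what identifies a transcript.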
Some sequences are only partially represented in individual tissues because of low expression levels. Thus, transcripts of interest in neurons may be most complete in the data from another tissue (e.g. chemosensory tissue or heart).
There are some confusing things that we have observed in blast results:
- There are multiple contigs for many mRNAs. There are a number of reasons for this, some interesting and others technical.
Interesting reasons:
- Multiple start sites and stop sites for mRNAs are treated as distinct transcripts
- Alternative splicing will lead to distinct transcripts
Technical reasons:
- If there is a gap in the reads for a single transcript, the different regions of the mRNA will be on distinct contigs.
- The reads from different tissues have been assembled separately, so you may get multiple hits when using the “All of the above combined” pooled assembly. You may get information on expression of a specific transcript by searching different tissues, but if you don’t see full-length hits, this may be because the transcript is rare and not fully assembled, or because some sequences shear preferentially at specific spots. Also, multiple assemblies have been done for CNS, with and without Digital Normalization (DigiNorm), and there is sequencing for CNS where the library was prepared with Double-strand-specific DNase normalization (DSN).
- Many contigs are reverse sequences. These have not yet been corrected based on strand-specific reads. You will see this because the nucleotide sequence is reversed compared with the starting protein sequence used for tblastn. For reverse sequences, when you copy and paste the DNA sequence into your program, you will have to use the reverse complement of the sequence to see the alignment.
- Some of the contigs that you examine may be incorrect or incomplete because the results of the assembler are affected by: read coverage, which is a function of the abundance of a message; repeat sequences, which impair assembly; and excessive shearing of some specific sequences. You will need to look carefully for anomalies. You may see occasional frameshifts where the reading frame changes; this is probably a sequencing error.
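For contigs that come back in reverse orientation (see the point about reverse sequences above), the reverse complement can be computed in a few lines if your sequence program does not offer it. This is a minimal sketch, assuming a plain DNA string containing only A, C, G, T and N; the function name reverse_complement is chosen here for illustration.

```python
# Complement table for DNA bases; N (unknown base) maps to itself.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    # Drop whitespace and line breaks left over from copy and paste.
    cleaned = "".join(seq.split())
    # Complement each base, then reverse the whole string.
    return cleaned.translate(_COMPLEMENT)[::-1]


print(reverse_complement("ATGGCCTTAA"))  # TTAAGGCCAT
```

Characters outside the table (for example other ambiguity codes) are passed through unchanged, so check the output if your contig contains them.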
After obtaining the sequence of a contig with high similarity to your original gene, perform a reverse BLAST (blastx) against another species (not the species from which you obtained the query protein sequence) to confirm that the contig is most similar to the gene of interest.
Finally, one can assess whether multiple contigs represent the same transcript or alternatively spliced products of a single gene by performing Clustal alignments or using NCBI blastn to align two or more sequences.
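As a quick first look before running Clustal or blastn, the exact stretches two contigs share can be listed with Python's standard difflib module. This is only a rough sketch and not a substitute for a real alignment: it finds exact matches only, so a single mismatch splits a block, and it assumes both contigs are already in the same orientation. The sequences and the shared_blocks helper are made up for illustration.

```python
from difflib import SequenceMatcher


def shared_blocks(seq_a: str, seq_b: str, min_len: int = 10):
    """List (start_a, start_b, length) for exact shared stretches."""
    # autojunk=False stops difflib from ignoring frequent characters,
    # which would otherwise affect long sequences over a 4-letter alphabet.
    matcher = SequenceMatcher(None, seq_a, seq_b, autojunk=False)
    return [
        (m.a, m.b, m.size)
        for m in matcher.get_matching_blocks()
        if m.size >= min_len
    ]


# Made-up sequences for illustration only (not Aplysia data).
segment1 = "ATGGCCAAGCTTGGA"
segment2 = "CCCCCCCCCC"
segment3 = "TTTACGGATCGTAA"
contig_a = segment1 + segment2 + segment3  # contains the middle segment
contig_b = segment1 + segment3             # lacks the middle segment

for start_a, start_b, length in shared_blocks(contig_a, contig_b):
    print(start_a, start_b, length)
```

Two shared blocks that are adjacent in one contig but separated in the other are the pattern to follow up on with a proper alignment, since it could reflect alternative splicing or an assembly artifact.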
Wayne Sossin
Tom Abrams