De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis

Haas, Brian J; Papanicolaou, Alexie; Yassour, Moran; Grabherr, Manfred; Blood, Philip D; Bowden, Joshua; Couger, Matthew Brian; Eccles, David; Li, Bo; Lieber, Matthias; MacManes, Matthew D; Ott, Michael; Orvis, Joshua; Pochet, Nathalie; Strozzi, Francesco; Weeks, Nathan; Westerman, Rick; William, Thomas; Dewey, Colin N; Henschel, Robert; LeDuc, Richard D; Friedman, Nir; Regev, Aviv

doi:10.1038/nprot.2013.084

Protocol
Published: 11 July 2013

Nature Protocols volume 8, pages 1494–1512 (2013)

Abstract

De novo assembly of RNA-seq data enables researchers to study transcriptomes without the need for a genome sequence; this approach can be usefully applied, for instance, in research on 'non-model organisms' of ecological and evolutionary importance, cancer samples or the microbiome. In this protocol we describe the use of the Trinity platform for de novo transcriptome assembly from RNA-seq data in non-model organisms. We also present Trinity-supported companion utilities for downstream applications, including RSEM for transcript abundance estimation, R/Bioconductor packages for identifying differentially expressed transcripts across samples and approaches to identify protein-coding genes. In the procedure, we provide a workflow for genome-independent transcriptome analysis leveraging the Trinity platform. The software, documentation and demonstrations are freely available from http://trinityrnaseq.sourceforge.net. The run time of this protocol is highly dependent on the size and complexity of data to be analyzed. The example data set analyzed in the procedure detailed herein can be processed in less than 5 h.

**Figure 1: Overview of Trinity assembly and analysis pipeline.**

**Figure 2: Effects of *in silico* fragment normalization of RNA-seq data on Trinity full-length transcript reconstruction.**

**Figure 3: Transcriptome and genome representations of alternatively spliced transcripts.**

**Figure 4: Strand-specific library types.**

**Figure 5: Full-length transcript reconstruction by Trinity in different organisms, sequencing depths and parameters.**

**Figure 6: Evaluating paired-read support via the Jaccard similarity coefficient.**

**Figure 7: *De novo* transcriptome assembly and analysis workflow.**

**Figure 8: Abundance estimation via expectation maximization by RSEM.**

**Figure 9: Pairwise comparisons of transcript abundance.**

**Figure 10: Comparisons of transcriptional profiles across samples.**
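Figure 6 refers to the Jaccard similarity coefficient as a measure of paired-read support. As a minimal sketch of the coefficient itself (this is not Trinity's implementation; the function name `jaccard` and the fragment identifiers are hypothetical and chosen only for illustration), the overlap between the sets of read pairs observed at two positions could be computed as follows:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity coefficient |A ∩ B| / |A ∪ B|.

    Returns 0.0 when both sets are empty (a convention assumed here).
    """
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical fragment IDs whose read pairs cover each of two positions.
pairs_at_x = {"frag1", "frag2", "frag3", "frag4"}
pairs_at_y = {"frag3", "frag4", "frag5"}

# Two shared fragments out of five distinct fragments overall.
score = jaccard(pairs_at_x, pairs_at_y)
print(f"Jaccard coefficient: {score:.2f}")
```

A value near 1 indicates that largely the same read pairs support both positions; a value near 0 indicates that few pairs are shared.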

Acknowledgements

We are grateful to D. Jaffe and S. Young for access to additional computing resources, to Z. Chen for help in R-scripting, to L. Gaffney for help with figure illustrations, to C. Titus Brown for essential discussions and inspiration related to digital normalization strategies, to G. Marcais and C. Kingsford for supporting the use of their Jellyfish software in Trinity and to B. Walenz for supporting our earlier use of Meryl. We are grateful to our users and their feedback, in particular J. Wortman and P. Bain for comments on earlier drafts of the manuscript. This project has been funded in part (B.J.H.) with Federal funds from the National Institute of Allergy and Infectious Diseases (NIAID), US National Institutes of Health (NIH), Department of Health and Human Services (DHHS), under contract no. HHSN272200900018C. Work was supported by Howard Hughes Medical Institute (HHMI), a NIH PIONEER award, a Center for Excellence in Genome Science grant no. 5P50HG006193-02 from the National Human Genome Research Institute (NHGRI) and the Klarman Cell Observatory at the Broad Institute (A.R.). A.P. was supported by the CSIRO Office of the Chief Executive (OCE). M.Y. was supported by the Clore Foundation. P.B. was supported by the National Science Foundation (NSF) grant no. OCI-1053575 for the Extreme Science and Engineering Discovery Environment (XSEDE) project. B.L. and C.D. were partially supported by NIH grant no. 1R01HG005232-01A1. In addition, B.L. was partially funded by J. Thomson's MacArthur Professorship and by the Morgridge Institute for Research support for Computation and Informatics in Biology and Medicine. M.L. was supported by the Bundesministerium für Bildung und Forschung via the project 'NGSgoesHPC'. N.P. was funded by the Fund for Scientific Research, Flanders (Fonds Wetenschappelijk Onderzoek (FWO) Vlaanderen), Belgium. R.H. and R.D.L. were funded by the NSF under grant nos. ABI-1062432 and CNS-0521433 to Indiana University, and by Indiana METACyt Initiative, which is supported in part by Lilly Endowment, Inc. J.B. was supported through a CSIRO eResearch Accelerated Computing Project. Any opinions, findings and conclusions or recommendations expressed in this article are those of the authors and do not necessarily reflect the views of any of the funding bodies and institutions including the National Science Foundation, the National Center for Genome Analysis Support and Indiana University.

Author information

Brian J Haas and Alexie Papanicolaou: These authors contributed equally to this work.

Authors and Affiliations

Broad Institute of Massachusetts Institute of Technology (MIT) and Harvard, Cambridge, Massachusetts, USA
Brian J Haas, Moran Yassour, Nathalie Pochet & Aviv Regev
Commonwealth Scientific and Industrial Research Organisation (CSIRO) Ecosystem Sciences, Black Mountain Laboratories, Canberra, Australian Capital Territory, Australia
Alexie Papanicolaou & Michael Ott
The Selim and Rachel Benin School of Computer Science, The Hebrew University of Jerusalem, Jerusalem, Israel
Moran Yassour & Nir Friedman
Department of Medical Biochemistry and Microbiology, Science for Life Laboratory, Uppsala University, Uppsala, Sweden
Manfred Grabherr
Pittsburgh Supercomputing Center, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
Philip D Blood
CSIRO Information Management & Technology, St. Lucia, Queensland, Australia
Joshua Bowden
Department of Microbiology and Molecular Genetics, Oklahoma State University, Stillwater, Oklahoma, USA
Matthew Brian Couger
Genomics Research Centre, Griffith University, Gold Coast Campus, Gold Coast, Queensland, Australia
David Eccles
Department of Computer Sciences, University of Wisconsin, Madison, Wisconsin, USA
Bo Li & Colin N Dewey
Center for Information Services and High-performance Computing (ZIH), Technische Universität Dresden, Dresden, Germany
Matthias Lieber
California Institute for Quantitative Biosciences, University of California, Berkeley, Berkeley, California, USA
Matthew D MacManes
Institute for Genome Sciences, Baltimore, Maryland, USA
Joshua Orvis
Department of Plant Systems Biology, Department of Plant Biotechnology and Bioinformatics, Vlaams Instituut voor Biotechnologie (VIB), Ghent University, Ghent, Belgium
Nathalie Pochet
Parco Tecnologico Padano, Località Cascina Codazza, Lodi, Italy
Francesco Strozzi
United States Department of Agriculture–Agricultural Research Service, Corn Insects and Crop Genetics Research Unit, Ames, Iowa, USA
Nathan Weeks
Genomics facility, Purdue University, West Lafayette, Indiana, USA
Rick Westerman
GWT-TUD GmbH, Saxony, Germany
Thomas William
Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, Wisconsin, USA
Colin N Dewey
Research Technologies Division, University Information Technology Services, Indiana University, Bloomington, Indiana, USA
Robert Henschel & Richard D LeDuc
Department of Biology, Howard Hughes Medical Institute, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
Aviv Regev

Contributions

B.J.H. is the current lead developer of Trinity and is additionally responsible for the development of the companion in silico normalization and TransDecoder utilities described herein. M.Y. contributed to Butterfly software enhancements, generating figures and to the manuscript text. B.L. and C.N.D. developed RSEM and are responsible for enhancements related to improved Trinity support. B.J.H. and A.P. wrote the initial draft of the manuscript. A.R. is the Principal Investigator. All authors contributed to Trinity development and/or writing of the final manuscript, and all authors approved the final text.

Corresponding authors

Correspondence to Brian J Haas or Aviv Regev.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Note

Supplementary materials for de novo transcript sequence reconstruction from RNA-seq: reference generation and analysis with Trinity. (PDF 699 kb)

Supplementary Figure 1

Defining minimum edge thresholds during initial Butterfly graph pruning. (PDF 554 kb)

Supplementary Figure 2

Butterfly's minimum support requirement for path extension during transcript reconstruction. (PDF 551 kb)

Supplementary Figure 3

Merging of insufficiently different path sequences. (PDF 530 kb)

Supplementary Figure 4

Enforcing path restrictions via triplet locking. (PDF 536 kb)

Supplementary Figure 5

Restrictions on the number of paths to be extended at each node. (PDF 540 kb)

Supplementary Figure 6

Evaluating assembly completeness for the S. pombe transcriptome. (PDF 636 kb)

Supplementary Figure 7

Evaluating assembly completeness for the mouse dendritic cell transcriptome. (PDF 584 kb)

Supplementary Figure 8

Correlation of expression values between reference transcripts and Trinity transcript components according to percent length agreement in S. pombe. (PDF 551 kb)

Supplementary Figure 9

Agreement between expression profiles calculated based on reference transcripts and Trinity components across different S. pombe samples. (PDF 584 kb)

About this article

Cite this article

Haas, B., Papanicolaou, A., Yassour, M. et al. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat Protoc 8, 1494–1512 (2013). https://doi.org/10.1038/nprot.2013.084

Issue Date: August 2013