Data Management Challenges in Next Generation Sequencing

Wandelt, Sebastian; Rheinländer, Astrid; Bux, Marc; Thalheim, Lisa; Haldemann, Berit; Leser, Ulf

doi:10.1007/s13222-012-0098-2

Focus Article (Schwerpunktbeitrag)
Published: 01 August 2012

Volume 12, pages 161–171, (2012)

Datenbank-Spektrum

Sebastian Wandelt¹,
Astrid Rheinländer¹,
Marc Bux¹,
Lisa Thalheim¹,
Berit Haldemann¹ &
Ulf Leser¹

Abstract

Since the early days of the Human Genome Project, data management has been recognized as a key challenge for modern molecular biology research. By the end of the nineties, technologies had been established that adequately supported most ongoing projects, typically built upon relational database management systems. However, recent years have seen a dramatic increase in the amount of data produced by typical projects in this domain. While it took more than ten years, approximately three billion USD, and more than 200 groups worldwide to assemble the first human genome, today’s sequencing machines produce the same amount of raw data within a week, at a cost of approximately 2000 USD, and on a single device. Several national and international projects now deal with (tens of) thousands of genomes, and trends like personalized medicine call for efforts to sequence entire populations. In this paper, we highlight challenges that emerge from this flood of data, such as parallelization of algorithms, compression of genomic sequences, and cloud-based execution of complex scientific workflows. We also point to a number of further challenges that lie ahead due to the increasing demand for translational medicine, i.e., the accelerated transition of biomedical research results into medical practice.

Notes

http://aws.amazon.com/ec2/.
http://hadoop.apache.org/.
SNP calling attempts to predict which of the disagreements between reference and query sequences are due to Single Nucleotide Polymorphisms.
Insertions and deletions.

Acknowledgements

Astrid Rheinländer is funded by the Deutsche Forschungsgemeinschaft through the Stratosphere project. Marc Bux is funded by the Deutsche Forschungsgemeinschaft through the SOAMED research unit. Berit Haldemann is funded by the Bundesministerium für Bildung und Forschung through the project Prositu.

Author information

Authors and Affiliations

Department of Computer Science, Humboldt-Universität zu Berlin, Unter den Linden 6, 10099, Berlin, Germany
Sebastian Wandelt, Astrid Rheinländer, Marc Bux, Lisa Thalheim, Berit Haldemann & Ulf Leser

Corresponding author

Correspondence to Sebastian Wandelt.

About this article

Cite this article

Wandelt, S., Rheinländer, A., Bux, M. et al. Data Management Challenges in Next Generation Sequencing. Datenbank Spektrum 12, 161–171 (2012). https://doi.org/10.1007/s13222-012-0098-2

Received: 07 June 2012
Accepted: 18 July 2012
Published: 01 August 2012
Issue Date: November 2012
DOI: https://doi.org/10.1007/s13222-012-0098-2
