Next-generation sequencing technology produces large amounts of data, on the order of 100 GB per genome. Besides the need for data storage, sequence analysis relies on methods that require considerable computational effort. Studies such as genome-wide association analyses and genome comparisons demand even more storage capacity. Computational processes such as alignment of large datasets can typically be parallelised so that the application runs efficiently on the BiG Grid infrastructure. Smaller projects may also be suitable for cloud computing.
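As a concrete illustration of why such workloads parallelise well: a read set can be split into independent chunks, each handled by a separate grid job. A minimal Python sketch, in which the "alignment" step is only a stand-in for submitting a real aligner (such as BWA) per chunk:

```python
def split_reads(reads, n_chunks):
    """Split a read set into roughly equal chunks, one grid job per chunk."""
    size = -(-len(reads) // n_chunks)  # ceiling division
    return [reads[i:i + size] for i in range(0, len(reads), size)]

def align_chunk(chunk):
    # Stand-in for one grid job: a real pipeline would run an aligner
    # (e.g. BWA) on the chunk; here we simply count the reads.
    return len(chunk)

reads = [f"read_{i}" for i in range(1000)]
chunks = split_reads(reads, 8)
results = [align_chunk(c) for c in chunks]
assert sum(results) == len(reads)  # no reads lost across chunk boundaries
```

Because the chunks share no state, each one can run on a different worker node and the results can be merged afterwards.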
e-BioGrid is open to taking more Next-Generation Sequencing support projects on board. Contact us if you are involved in NGS and need support in software or hardware infrastructure.
Metagenomics analyses are based on next-generation sequence data. The assembly of reads into contigs, and the functional annotation of either contigs or reads, requires significant computing resources. Creating Grid and Cloud computing pipeline solutions for next-generation sequence data analysis would be a beneficial contribution to effective metagenomics research.
Sacha van Hijum, Center for Molecular and Biomolecular Informatics
Developed infrastructure. A grid-enabled protein function annotation pipeline using InterProScan is in development and nearly complete.
Performance improvement. Quality control and assembly of large metagenomics datasets from Unilever were completed in a single day, compared to more than a week on a single high-end local PC. A Unilever metatranscriptomics analysis that could hardly be handled on our local PC was performed on the HPC Cloud using more than 20 cores.
Data analysis. Quality control, assembly and functional annotation of dozens of bacterial strains from NIZO have been performed.
Publications. Bas E. Dutilh, Lennart Backus, Robert A. Edwards, Michiel Wels, Jumamurat R. Bayjanov, Sacha A.F.T. van Hijum: Explaining microbial phenotypes on a genomic scale: GWAS for microbes, Submitted.
Sacha van Hijum (CMBI), Victor de Jager (CMBI), Machiel Jansen, Niek Bosch, Jumamurat Bayjanov (CMBI)
BreeDB is a framework that stores and analyses phenotyping and genotyping data from large-scale plant breeding experiments. BreeDB can be used through a web-based interface, which offers data exploration and analysis tools such as box plots, histograms, PCA, and QTL analysis. R is used as the principal statistical framework to execute these analyses. Graphical genotyping tools are available to show molecular marker data and QTL data in relation to genetic linkage maps. BreeDB is used within national and international consortia, such as EU-SOL and CBSG. This e-BioGrid project will be performed as an integrated part of the Technology Top Institute Green Genetics (TTI-GG) funded project Virtual Lab for contemporary Plant Breeding-I (VLPB-I), in which academic partners primarily develop open-source tools for breeding companies. The aim of this proposal is to improve data management and to enable end-user analysis of large-scale genotyping and phenotyping datasets using grid-based computing. The rationale is that the involved VLPB partners would like to use BreeDB for storage and analysis of their own data, in combination with shared datasets. An example of such a shared dataset is genotyping data (SNP data) extracted from next-generation sequencing data that will be generated within the 150 Tomato Genome Project (mid-2012). Due to the rapid increase in the number of phenotyping and genotyping data points, the currently implemented data analysis tools are no longer powerful enough. Grid-based computing, in which the statistical procedures for genome-wide association studies are executed in parallel over many compute nodes, may be a solution to this problem. All developed tools will become available as open-source software.
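The key property exploited by grid-based computing here is that each marker-trait association test is independent of all others, so millions of tests can be distributed over compute nodes. BreeDB's actual statistics run in R; the correlation-based per-SNP test below is only an illustrative Python stand-in for one such independent unit of work:

```python
import statistics

def marker_trait_assoc(dosages, phenotypes):
    """Pearson correlation between allele dosage and trait value for one SNP.
    Each SNP's test is independent of all others, which is what makes the
    genome-wide scan trivially parallelisable over grid nodes."""
    mg = statistics.fmean(dosages)
    mp = statistics.fmean(phenotypes)
    cov = sum((g - mg) * (p - mp) for g, p in zip(dosages, phenotypes))
    var_g = sum((g - mg) ** 2 for g in dosages)
    var_p = sum((p - mp) ** 2 for p in phenotypes)
    if var_g == 0 or var_p == 0:
        return 0.0  # monomorphic marker or constant trait
    return cov / (var_g * var_p) ** 0.5
```

A scheduler can batch any number of such calls into one grid job, since no call reads another's output.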
Richard Finkers, Wageningen University and Research Centre
Infrastructure developed. The BreeDB software has been further developed within the scope of this combined e-BioGrid / TTI-GG program. BreeDB is actively used by our industrial and research community and is the main application behind database sites such as the CBSG database and the EU-SOL BreeDB database. Within this joint VLPB and e-BioGrid program, we further developed the BreeDB framework to provide a large-scale data analysis infrastructure for sequencing-based or array-based SNP <-> phenotype association analysis, and means to communicate this information back to end-users via custom-made visualisations.
Performance improvement. We have successfully written and deployed several R-based analyses on BiG Grid. We have not yet invested heavily in further parallelisation of the jobs (easily >15,000 independent statistical analyses per job) into smaller chunks of, say, 500 analyses each; additional efficiency gains and reductions in computing time can still be achieved there. The grid-enabled R analysis methodology focused on several aspects:
Calculation of estimated means from multiple field trials.
Calculation of allele dosage from Illumina Infinium genome-wide SNP arrays.
Calculation of population structure (or alternatively via the command line program structure).
Calculation of multiple trait <-> marker associations.
Calculation of a minimal marker model explaining a trait of interest.
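As an illustration of the allele dosage step listed above: in the simplest case a dosage call rounds the ploidy-scaled B-allele frequency to the nearest integer. This is a hypothetical simplification; the actual pipeline fits per-SNP models in R on the Infinium intensity data:

```python
def allele_dosage(b_allele_freq, ploidy=2):
    """Nearest-integer allele dosage from a normalised B-allele frequency.
    Hypothetical simplification: the production pipeline fits per-SNP
    models in R on the raw Infinium intensities instead."""
    return round(ploidy * b_allele_freq)

# Three diploid samples at one SNP: homozygous A, heterozygous, homozygous B.
assert [allele_dosage(x) for x in (0.02, 0.48, 0.97)] == [0, 1, 2]
```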
In general, several of the computational steps were brought back from several days to a time span of 30-60 minutes. We have benchmarked the association analysis methodology and see that runtime increases almost linearly (perhaps slightly superlinearly) with job size. The implemented methodology is therefore suitable for the genome-wide GBS datasets that we expect to analyse in 2013.
Knowledge. The involvement in the e-BioGrid program has been instrumental in learning how to develop and use grid-ready applications for our current and, especially, future e-infrastructure needs in plant breeding and, more generally, breeding research. Other. The e-BioGrid project has been instrumental in enabling grid technology within plant breeding research. The major step made within this e-BioGrid project is that we could incorporate BiG Grid into our strategy of developing BreeDB applications for the analysis of large-scale genome-wide polymorphism <-> trait associations.
In this project we would like to explore a solution for high-throughput processing of next-generation sequencing data on the grid or in the cloud. We have two large next-generation sequencing datasets available or arriving in April or May 2011: the Dutch Genome Project (250 Dutch trios, parents plus child) and the Leiden Longevity Study (222 individuals with a longevity phenotype), with raw data volumes of about 60 TB and 100 TB, respectively. A simplified pipeline has been prepared on a local cluster to process the pilot data of the Dutch Genome Project, and this pipeline will be the starting point for exploring a comprehensive solution to port all necessary tools to a grid environment.
Kai Ye, Leiden University Medical Centre
The analysis of GoNL data using Pindel has been completed. For this project, the GoNL data was first filtered and split into smaller regions, after which the data was collected across all samples and analysed simultaneously. This allowed for a large amount of parallelisation, demonstrating the high-throughput capacity of the Dutch Life Science Grid. The experience accumulated in this project is now being used to run the Unified Genotyper on the same GoNL data. For this project, the data also needs to be split into smaller chunks to make analysis feasible. This part of the project has already been completed, resulting in roughly 478 thousand files. The process of combining this data has started, and the first results are being shared with our GoNL partners. Both analyses were carried out using the PiCaS token pool system and are based on one unified metadata set, allowing users to transparently track the project's progress.
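The token-based bookkeeping behind both analyses can be sketched as follows. PiCaS itself keeps tokens in a shared database so that many grid workers can claim them concurrently; the minimal in-memory stand-in below only illustrates the claim/complete life cycle of a token describing one genomic chunk:

```python
class TokenPool:
    """Minimal in-memory stand-in for a PiCaS-style token pool: each token
    describes one genomic chunk; a worker claims a token, runs the analysis,
    and marks it done. The real system stores tokens in a shared database
    so that many grid workers can claim them concurrently."""

    def __init__(self, tokens):
        self.state = {t: "todo" for t in tokens}

    def claim(self):
        for token, s in self.state.items():
            if s == "todo":
                self.state[token] = "locked"  # no other worker may take it
                return token
        return None  # pool exhausted

    def complete(self, token):
        self.state[token] = "done"

pool = TokenPool([f"chr1:{s}-{s + 10**6}" for s in range(0, 5 * 10**6, 10**6)])
token = pool.claim()
while token is not None:
    pool.complete(token)  # a real worker would run e.g. Pindel on this region
    token = pool.claim()
assert all(s == "done" for s in pool.state.values())
```

Because the pool, not the submitter, tracks which chunks are done, progress stays transparent even when hundreds of workers run in parallel.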
In a genome-wide association (GWA) analysis, genetic variants (single nucleotide polymorphisms: SNPs) across the whole genome are tested for association with a certain trait (such as body weight or a certain disorder). With the data currently available, this means that 1.5 to 4.5 million tests are performed. These tests can be set up using structural equation modelling, in which covariance structures with fixed effects are analysed. Due to the large number of tests, GWA analysis is computationally expensive. Because genomic data are produced at increasing density and rapidly decreasing cost, the need to apply state-of-the-art high-performance computing methods in GWA analyses becomes urgent. Approaches to this problem are to use grid technology, and to use the computer hardware more efficiently, either by making use of GPUs or by optimising the algorithms used.
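The multiple-testing burden of millions of tests can be made concrete with the standard Bonferroni correction, which divides the genome-wide significance level over all tests (an illustration of the scale involved, not necessarily the correction used in these studies):

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests

# With 1.5 to 4.5 million SNP tests, a genome-wide alpha of 0.05 implies
# per-test p-value thresholds of roughly 3.3e-8 down to 1.1e-8.
lo = bonferroni_threshold(0.05, 4_500_000)
hi = bonferroni_threshold(0.05, 1_500_000)
assert lo < hi < 5e-8
```

Reaching such small p-values reliably is part of what drives the computational cost of each individual test.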
Han Rauwerda, University of Amsterdam
A 20-40x reduction in computing time, achieved by an algorithm using symbolic algebra.
Marijn van Eupen, Matthijs Kattenberg, Michel Nivard, Han Rauwerda, Dorret Boomsma
The e-infrastructure for bioscience research e-bioinfra is routinely used by researchers at the AMC to perform analysis of genomics data on the Dutch Grid, in particular for Next Generation Sequencing (NGS). The analysis steps are implemented as workflows that are executed on the grid in an automated fashion. Bioinformaticians at the AMC primarily run these workflows using the VBrowser, which also facilitates data manipulation on the grid storage. Selected applications are also available at the web interface of the e-bioinfra gateway for novice users. The goal of the project is to enable and enhance genomics research via advanced tools for data analysis. This is achieved in close collaboration with bioinformaticians.
Silvia Olabarriaga, on behalf of the VLEMED VO, Amsterdam Medical Centre / University of Amsterdam
Infrastructure developed. The AMC now operates a WS-PGRADE science gateway in addition to the in-house developed gateway; see http://www.ebioscience.amc.nl/liferay-portal-6.1.0/. Support was provided for the installation of the gateway on grid resources alongside the e-infrastructure gateway at AMC, for the construction of the first workflows, and for internal training.
Knowledge. The AMC participated in an international collaboration to develop a concept for dynamically scheduling light-paths based on compute and data location. Acting as alpha-users of the new SURFnet BoD/NSI service, the AMC assisted in debugging the service. The initial results of this work were presented at a conference (see publications). With an interest in data security, we performed a study 'Legal constraints on genetic data processing in European grids' (see publications). In the scope of ER-FLOW a document was produced titled 'ethical issues: policy and code of conduct'. Elements of this document can be re-used for similar projects (the document can be obtained upon request from the EGI document database: https://documents.egi.eu/secure/ShowDocument?docid=1461).
Software. Insights into the co-scheduling of compute and data were gained. A Pilot-Data implementation based on DIANE was developed, capable of running on BiG Grid resources; see http://www.ci.uchicago.edu/escience2012/pdf/P-A_Model_of_Pilot-Abstractions.pdf. The code can be obtained from http://redmine.ebioscience.amc.nl/projects/pilotapi-diane. The AMC represented the Life Science community in the Staged Rollout of EGI/EMI software (SAGA). This has led to the inclusion of SAGA in the next EMI release; see http://repository.egi.eu/2012/11/20/release-umd-2-3-0/.
Other. The AMC has further developed and operated a workflow-based service that automatically tracks provenance of grid workflow executions. This service, and its communication with BiG Grid resource providers, was supported by this e-BioGrid project. The AMC participated in and coordinated a task force in the SCI-BUS project (http://www.sci-bus.eu) to study new data management functionality for the WS-PGRADE science gateway; see http://www.sci-bus.eu/wiki/-/wiki/Public/DataManagement. A new community has been reached within the AMC: the group of Medical Biochemistry, who are now re-using a workflow from the SHIWA repository for a virtual screening project with AutoDock Vina.
Publications.
“P*: A Model of Pilot-Abstractions”, Andre Luckow, Mark Santcroos, Ole Weidner, Andre Merzky, Pradeep Mantha, Shantenu Jha, 8th IEEE International Conference on e-Science 2012, 2012.
“Pilot Abstractions for Compute, Data, and Network”, Mark Santcroos, Silvia Delgado Olabarriaga, Daniel S. Katz, Shantenu Jha, NECS Workshop, 8th IEEE International Conference on e-Science 2012, 2012.
“Exploring Dynamic Enactment of Scientific Workflows using Pilot-Abstractions”, Mark Santcroos, Barbera DC van Schaik, Shayan Shahand, Silvia Delgado Olabarriaga, Andre Luckow, Shantenu Jha, 13th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing (accepted), 2013.
We want to investigate the use of Linux VMs with specific phylogenomics and population genomics software installed to perform whole-genome coalescent and phylogenetic analyses that are currently outside our computational reach. The most important request here is for CPU time (e.g. several weeks of wall time on 24 processors).
Hendrik Jan Megens, Wageningen University, Animal Breeding and Genomics Centre
The goal of the project is to build an annotation pipeline for microbial genomes. A pipeline connecting freely available software is already available as a stand-alone version. Ideally, it should be upgraded to a faster version (currently a step involving BLAST is the bottleneck) and made accessible to other users through a web interface.
We would like to use the HPC Cloud for the analysis of RNA-seq data, using already described protocols and algorithms. For this we will use the Galaxy computing framework, scaling up the number of nodes as needed. In addition, we will try to scale out our own cluster to the HPC Cloud as a test for future use. The results will be a complete pipeline for RNA-seq analysis capable of running in the HPC Cloud, and knowledge of whether an HPC Cloud is useful for scaling our own infrastructure.
The Life Sciences Group (MAC4) has various ongoing and future projects and collaborations concerning next-generation sequencing (NGS) data. One run of an NGS instrument can produce over 1 TB of sequencing reads. In collaboration with Ivan G. Costa Filho (Universidade Federal de Pernambuco, Brazil), Alexander Schliep (Rutgers University, NJ, USA), and Markus Bauer (Illumina Inc., UK), we develop methods to discover structural variations such as insertions, deletions, and inversions from paired-end sequencing reads. Preliminary results show that our methods outperform state-of-the-art algorithms. To thoroughly benchmark our methods, we ask for computational resources on the SARA HPC Cloud. To demonstrate the performance of our algorithms, we will run several simulation studies as well as discover structural variations in real data. To obtain real data, we have set up a collaboration with Illumina Inc. (Cambridge, UK), one of the leading manufacturers of next-generation sequencing machines, and recently received the data sets from Illumina. The requested resources are extrapolated from our preliminary results, where we processed smaller data sets.
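A common signal exploited in paired-end structural variation discovery is a read pair whose mapped insert size deviates strongly from the library's insert-size distribution: too large suggests a deletion between the mates, too small an insertion. The sketch below illustrates only this general idea and is not the authors' actual method:

```python
import statistics

def flag_discordant(insert_sizes, n_sd=3.0):
    """Flag read pairs whose mapped insert size deviates from the library
    mean by more than n_sd standard deviations. Illustrative only: real
    callers model the insert-size distribution far more carefully."""
    mean = statistics.fmean(insert_sizes)
    sd = statistics.stdev(insert_sizes)
    return [abs(x - mean) > n_sd * sd for x in insert_sizes]

# A library with ~300 bp inserts and one pair spanning a candidate deletion.
sizes = [300] * 20 + [900]
flags = flag_discordant(sizes)
assert flags == [False] * 20 + [True]
```

Benchmarking such callers requires scanning every pair in terabyte-scale read sets, which is why cloud resources are requested.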
Tobias Marschall, Centrum Wiskunde & Informatica
will follow soon
Tobias Marschall, Alexander Schönhuth, Gunnar Klau, Stefan Canzar
For the collaboration between the LUMC and Washington University, the entire set of GoNL raw reads needed to be transferred for the shared LUMC / GoNL project.
Kai Ye, Leiden University Medical Centre
For this shared LUMC / GoNL project the entire GoNL raw-read set was transferred to Washington University (Seattle, USA) using a high-speed internet connection provided by SARA. The entire set consisted of roughly 35 TB of data and was transferred in batches, taking about a month to complete. This project shows that, given the right tools and a reliable connection, large data transfers 'over the wire' are a good alternative to transfers by hard disk. The problems encountered in this project have given us insight into the challenges that lie ahead for these types of large data transfers.