Biobanks are collections of biological material, such as DNA, tissue,
cells and blood, and the data that go with each sample (in the form of
medical records, environmental information, lifestyle information and
follow-up data, for example). Biobanking thus requires a secure repository for the long-term storage and
protection of confidential medical, health, and lifestyle data of patients or volunteers. The challenges are to provide access and data sharing within the biobank community, to support sample handling, and to integrate phenotype and genotype data.
The main project involves building a model infrastructure for the Dutch Biobank community, BBMRI-NL, that manages resources for the future of biomedical research. It forms a hub in the European Biobanking and Biomolecular Resources Research Infrastructure.
e-BioGrid is open to taking on dedicated biobanking projects. Contact us if you are involved in biobanking and need support with software or hardware infrastructure.
The Netherlands has 150 biobanks with over 400,000 samples in total. To exploit this material, which is worth billions, the biobanks are now embarking on large-scale genetic profiling. An example is the highly visible Genome of the Netherlands project, which will fully sequence the DNA of 750 Dutch individuals to elucidate the genetic diversity in the Dutch population and to impute this new information onto the existing, sparser genetic data of 100,000 Dutch individuals. However, the data handling and computational needs are enormous, and Dutch institutes are struggling to use the available hardware infrastructure effectively.
This e-BioGrid subproject will overcome this barrier by interfacing the data processing tools used in biobanking with the existing BiG Grid infrastructure and by supporting the Dutch biobanking community in deploying these tools for their large data and processing challenges. Applications include (1) high-throughput genetics studies based on next-generation sequencing of biobank samples (>750), (2) genome-wide imputation and association studies (>100,000) and (3) follow-up BBMRI-NL projects that are currently being drafted by the BBMRI-NL steering committee. We also aim to pilot knowledge discovery across biobanks by connecting this new information with existing information from over 150 Dutch biobanks.
The envisioned short-term results are high-impact scientific publications of these biobank studies. The long-term results are the availability of flexible and scalable e-Science tools for large-scale biobanking and an optimized grid infrastructure for biobanks within BiG Grid, SARA, NBIC and the life sciences community, bringing the Netherlands to the forefront of the next generation of genetics and population research.
Morris Swertz, University Medical Center Groningen
Resources. For the BBMRI-NL imputation pipeline, a total of ~800 parallel jobs have been run, with a maximum of 6 GB per job. About 150 TB of data storage has been used. A complete copy of the GoNL raw sequence data was transferred to BiG Grid (SURFsara SRM) for grid usage and backup.
Knowledge. Following several training sessions, the BBMRI team is now able to submit jobs to the grid.
Infrastructure developed. A BBMRI-NL best-practices framework for pipeline definition, called MOLGENIS/compute, including a pilot job framework for running on BiG Grid, was built within the project (2011-2012). The BBMRI-NL pipelines for imputation, alignment, SNP calling and QC are available using this framework. The software and documentation can be found here. Interested users from other universities (Maastricht, Wageningen, Utrecht, and Nijmegen) have contacted us.
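To illustrate the general idea behind a pilot job framework, the following is a minimal Python sketch of the pattern, not the MOLGENIS/compute implementation. It assumes that the work can be split into independent tasks held in a central queue, and that each pilot, once started on a worker node, keeps pulling tasks until none are left. All names here (run_pilot, run_with_pilots, impute_chunk, the chunk labels) are hypothetical, and threads stand in for grid worker nodes.

```python
import queue
import threading


def run_pilot(pilot_id, tasks, results, execute):
    """A pilot: keep fetching tasks from the central queue until it is empty."""
    while True:
        try:
            task = tasks.get_nowait()
        except queue.Empty:
            return  # no work left, so this pilot exits
        # Record which pilot handled which task, plus the task's result.
        results.put((pilot_id, task, execute(task)))


def run_with_pilots(task_names, n_pilots, execute):
    """Fill the central queue, start n_pilots pilots, and collect all results."""
    tasks = queue.Queue()
    for name in task_names:
        tasks.put(name)
    results = queue.Queue()
    # Threads are an assumption of this sketch; on a grid each pilot would be
    # a submitted job running on its own worker node.
    pilots = [
        threading.Thread(
            target=run_pilot, args=(f"pilot-{i}", tasks, results, execute)
        )
        for i in range(n_pilots)
    ]
    for pilot in pilots:
        pilot.start()
    for pilot in pilots:
        pilot.join()
    collected = []
    while not results.empty():
        collected.append(results.get())
    return collected


def impute_chunk(task):
    """Hypothetical stand-in for one independent unit of pipeline work."""
    return f"{task}: done"


if __name__ == "__main__":
    chunks = [f"chunk-{i}" for i in range(10)]
    for pilot_id, task, outcome in run_with_pilots(chunks, 3, impute_chunk):
        print(pilot_id, outcome)
```

The point of the pattern is that the number of pilots is chosen separately from the number of tasks, so a few long-lived grid jobs can work through many short pipeline steps without each step waiting in the grid scheduler's queue.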
High performance computing. The imputation pipeline was run on the Prevend, LifeLines and Utrecht biobanks. It was tested by independent bioinformaticians and used in production. The NGS pipeline is used in daily production at UMCG. Imputation has been completed for 15,000 LifeLines biobank samples (UMCG), 4,000 Prevend biobank samples (UMCG) and 1,500 Utrecht samples (UMCU), and the results are used in production by the Celiac research group (UMCG). The next-generation sequencing pipeline has been used on ~750 whole-genome samples (Genome of the Netherlands) and ~1,500 exome samples (UMCG).
Publications. MOLGENIS/compute was accepted as a full paper and presented at the International Conference on Bioinformatics Models, Methods and Algorithms (BIOINFORMATICS 2012), Algarve, Portugal, on 3 February 2012. BIOINFORMATICS 2012 received 109 submissions, of which 14% were accepted as full papers.
Access. All recent documentation, presentations and software are freely available for download here. All software has been released as open source under the LGPL via GitHub.
Jan Bot, Pieter Neerincx, Abhishek Narain, George Byelas, Tom Visser