e-BioGrid projects (2010-2012) were selected and categorized according to their technology area (next-generation sequencing, biobanking, mass spectrometry, microarray technology, nanoscopy and imaging, structure determination and prediction). Projects that apply to any of the technology areas and focus on infrastructure technology are listed under generic infrastructure. For a complete list of projects, see below.

We distinguish between 'main' development projects and 'dedicated' on-demand support projects. Read more about the e-BioGrid approach here. Additional proposals for dedicated projects can be submitted, depending on the resources available at the time.

All projects are listed below.

Mapping Protein Sequence Space: a High Performance workflow to compute the world's first protein sequence and structure map
description:The overall aim of this project is to systematically chart the vast space of all possible protein sequences of certain lengths and several structural properties, creating valuable roadmaps that guide protein design and engineering efforts and research into protein sequences and their function. Creating maps of sequence space is a technically straightforward, though data-intensive, three-step process: (1) generating appropriate pseudo-random sequences that systematically transect sequence space using the perfect sampling method; (2) predicting the structural properties associated with the sequences using various algorithms; (3) analyzing these data for relevant trends. The process will be implemented as a VLAM workflow that dynamically submits appropriate numbers of parallel jobs to the grid. The process will be performed for sequences of 12, 40, 80, 160, or 320 amino acid residues. For a specific use case in phage display experiments, all 1.28 billion possible peptides of 7 amino acid residues will be enumerated and stored in the Sequenome database.
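A minimal sketch of step (1) for the phage display use case, assuming the standard 20-letter amino acid alphabet (the project's actual VLAM workflow and sampling method are not reproduced here): enumerating every 7-residue peptide gives 20^7 = 1,280,000,000 sequences, which is why the enumeration must be streamed and parallelised rather than held in memory.

```python
# Hypothetical illustration, not the project's VLAM workflow: lazily enumerate
# all 7-residue peptides over the standard 20-amino-acid alphabet.
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # standard 20 residues (assumption)

def enumerate_peptides(length=7):
    """Yield every peptide of the given length, one at a time."""
    for combo in product(AMINO_ACIDS, repeat=length):
        yield "".join(combo)

if __name__ == "__main__":
    print(20 ** 7)              # 1,280,000,000 possible 7-mers
    peptides = enumerate_peptides(7)
    for _ in range(3):          # show the first few
        print(next(peptides))
```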
applicant:Marco Roos, LUMC Klinische genetica
results:
status:ongoing
team:Marco Roos, eBioGrid support team
type:This is a dedicated project.
other
High-throughput simulations of the metabolism of the human gut microbiota
description:The aim of this pilot project at the core modeling group of the Netherlands Consortium for Systems Biology (NCSB) is to upscale existing simulations of populations of metabolizing bacteria, which we use to study the behavior and evolution of the gut microbiota. Recent advances in high-throughput metagenomics have shown that the gut is home to a very complex microbial community of several hundred different species, the gut microbiota, whose collective genome contains more than 100 times as many genes as our own genome. In close collaboration with experimental and theoretical groups at the NCSB member Top Institute Food and Nutrition (TIFN), we have developed an in silico simulation model of the gut microbiota. The aim is to uncover fundamental principles of intestinal microbial organization and its relation to the health and diet of the host. We will use this VO playground project to set up a WS-VLAM infrastructure supporting our simulation system and to perform some initial simulations with it. As a first goal, we aim to identify parameter regions that support the coexistence of multiple bacterial species.
applicant:Roeland Merks, CWI
results:ongoing
status:ongoing
team:Roeland Merks, Milan van Hoek, Margriet Palm, Bas Teusink, Zhao Zhiming and Adam Belloum
type:This is a dedicated project.
other
Effects of detritivores on litter microbial community
description:This is part of a project entitled 'Traits meet trophic interactions: predicting the effects of climate change on soil carbon cycling using functional traits' (NWO / 819.01.017). It investigates the role of detritivores in driving the species composition of microbial communities. In a final experiment, we endeavour to characterize the bacterial community present in litter exposed to different species of terrestrial isopods under three distinct water regimes. To characterize the bacterial community of the litter samples we used the Illumina next-generation sequencing platform, which returned more than five thousand sequences per sample. The standard computers in our department lack the computing power necessary to manage, align and analyse this data set. Therefore, I would like to have access to the HPC Cloud system to perform the necessary bioinformatics analyses.
applicant:André Tavares Corrêa Dias, Vrije Universiteit Amsterdam
results:will follow soon
status:ongoing
team:André Tavares Corrêa Dias, e-BioGrid team
type:This is a dedicated project.
other
worldwide e-NMR *
description:The objective of WeNMR, an EU-funded project, is to optimize and extend the use of the NMR and SAXS research infrastructures through the implementation of an e-infrastructure, providing the user community with a platform that integrates and streamlines the computational approaches necessary for NMR and SAXS data analysis and structural modelling. Access to the e-NMR infrastructure is provided through a portal integrating commonly used software and grid technology.
applicant:Alexander Bonvin, Bijvoet Centre for Biomolecular Research
results:A portal with NMR and SAXS data analysis and modelling tools is available on the WeNMR website. The project team is currently working with the US Open Science Grid (OSG) to enable support for our enmr.eu VO. We have already successfully deployed software and run test jobs on OSG sites, demonstrating the interoperability of EU and US grids.
status:ongoing
team:Alexander Bonvin, e-NMR team
type:This is a main project.
e-NMR
Cloud computing for NMR structural biology
description:Running a common interface for NMR structure generation on the BiG Grid cloud.
applicant:Jurgen Doreleijers, Geerten Vuister, Centre for Molecular and Biomolecular Informatics
results:development ongoing
status:completed
team:Jurgen Doreleijers, Floris Sluiter
type:This is a dedicated project.
e-NMR
Chemical education in the clouds
description:This is an educational project aiming to introduce cloud computing into chemistry teaching at the bachelor level within the chemistry curriculum of Utrecht University. In the second year of the chemistry curriculum, students can choose a 'molecular modelling and mathematics' course in which the principles of molecular simulation techniques and force fields are introduced in parallel with all the necessary mathematical background. The course contains a practical part in which students learn the basics of molecular dynamics simulations by performing simulations of a protein using Gromacs and studying the effect of mutations on structure and dynamics. In this project, cloud computers are used so that students can perform the entire workflow (setup, production, analysis) on a single, dedicated system. In that way we introduce e-Science into teaching so that students can become acquainted with cloud computing. 'Education in the clouds' should be a novel and attractive concept for students.
applicant:Alexandre Bonvin, Utrecht University
results:application not available
status:ongoing
team:Alexandre Bonvin, Adrien Melquiond, Alain van Hoof
type:This is a dedicated project.
e-NMR
NMR_REDO: recalculate 5,000 biomolecular structures in the Protein Data Bank (PDB)
description:This project, called NMR_REDO, aims to recalculate 5,000 biomolecular structures in the Protein Data Bank (PDB) on the basis of experimental NMR data. For this large-scale project we implemented the setup on virtual machines (VMs) in a cloud environment. The VM contains all necessary software components (about 25 different programs) for the computation and validation that would be very hard, if not impossible, to configure on a traditional grid node.
The VM, called VirtualCing, has recently been reconfigured as a 32-bit Ubuntu 11.04 server edition and measures 11 GB in .tgz format. VC slave threads take their instructions from a SARA-hosted job server called ToPoS. Data is exchanged via scp with a storage unit at the CMBI data center in Nijmegen.
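The token-pool pattern described above can be illustrated with a small hedged sketch: each VM slave thread repeatedly claims one work item (here a PDB entry id), processes it, and removes it, so many VMs can share a single queue without a central scheduler. The in-memory deque below stands in for the SARA-hosted ToPoS service; the real REST calls and the scp staging steps are omitted.

```python
# Hedged sketch of the token-pool worker loop; the deque is a stand-in for
# ToPoS, and the entry ids and recalculation step are placeholders.
from collections import deque

token_pool = deque(["1abc", "2xyz", "3def"])   # placeholder PDB entry ids

def claim_token():
    """Take the next token from the pool; None when the pool is empty."""
    try:
        return token_pool.popleft()
    except IndexError:
        return None

def recalculate(entry_id):
    """Placeholder for the ~25-program recalculation and validation pipeline."""
    print(f"recalculating NMR structure for {entry_id}")

def worker():
    while True:
        token = claim_token()
        if token is None:
            break                # pool exhausted; the slave thread can stop
        recalculate(token)

if __name__ == "__main__":
    worker()
```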
applicant:Jurgen Doreleijers, Radboud University Nijmegen Medical Centre
results:will follow soon
status:completed
team:Jurgen Doreleijers
type:This is a dedicated project.
e-NMR
Grid- and cloud computing for high-throughput assembly and annotation of (meta)genome sequences *
description:Metagenomics analyses are based on next-generation sequence data. The assembly of reads into contigs and the functional annotation of either contigs or reads in next-generation sequencing require significant computing resources. Creating grid and cloud computing pipeline solutions for next-generation sequence data analysis would be a beneficial contribution to effective metagenomics research.
applicant:Sacha van Hijum, Center for Molecular and Biomolecular Informatics
results:Developed infrastructure. A grid-enabled protein function annotation pipeline using InterProScan is in development and almost finished.

Performance improvement. Quality control and assembly of a large metagenomics dataset from Unilever was achieved in a single day, compared to more than a week on a single high-end local PC. A Unilever metatranscriptomics analysis that could hardly be handled on our local PC was performed on the HPC Cloud using more than 20 cores.

Data analysis. Quality control, assembly and functional annotation of dozens of bacterial strains from NIZO have been performed.

Publications. Bas E. Dutilh, Lennart Backus, Robert A. Edwards, Michiel Wels, Jumamurat R. Bayjanov, Sacha A.F.T. van Hijum: Explaining microbial phenotypes on a genomic scale: GWAS for microbes, Submitted.
status:completed
team:Sacha van Hijum (CMBI), Victor de Jager (CMBI), Machiel Jansen, Niek Bosch, Jumamurat Bayjan (CMBI)
type:This is a main project.
e-NGS
Pindel & Unified Genotyper Analysis on Grid
description:In this project we would like to explore a solution that enables high-throughput processing of next-generation sequence data on the grid or in the cloud. We have two large next-generation sequence datasets available or arriving in April or May 2011: the Dutch Genome Project (250 Dutch trios, parents plus child) and the Leiden Longevity Study (222 individuals with a longevity phenotype), with raw data of about 60 TB and 100 TB, respectively. A simplified pipeline has been prepared for a local cluster to process the pilot data of the Dutch Genome Project, and this pipeline will be the starting point for a comprehensive solution that ports all necessary tools to a grid environment.
applicant:Kai Ye, Leiden University Medical Centre
results:The analysis of GoNL data using Pindel has been completed. For this project, the GoNL data was first filtered and split into smaller regions, after which the data was collected across all samples and analysed simultaneously. This allowed for a large amount of parallelization, demonstrating the high-throughput capacity of the Dutch Life Science Grid. The experience accumulated in this project is now being used to run the Unified Genotyper on the same GoNL data. For this analysis, the data also needs to be split into smaller chunks to make it feasible; this part of the project has already been completed, resulting in roughly 478 thousand files. The process of combining this data has started and the first results are being shared with our GoNL partners. Both analyses were carried out using the PiCaS token pool system and are based on one unified metadata set, allowing users to transparently track the project's progress.
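The "split into smaller regions" step that makes such an analysis embarrassingly parallel can be sketched as follows; the chromosome lengths and the 5 Mb window are illustrative only, and the real pipeline's region choices and PiCaS bookkeeping are not reproduced here.

```python
# Illustrative sketch: cut each chromosome into fixed-size regions, each of
# which becomes one grid job (or one PiCaS token) for Pindel / the Unified
# Genotyper. Window size and chromosome lengths are made-up examples.
def make_regions(chrom_lengths, window=5_000_000):
    """Yield region strings like 'chr20:1-5000000' covering each chromosome."""
    for chrom, length in chrom_lengths.items():
        start = 1
        while start <= length:
            end = min(start + window - 1, length)
            yield f"{chrom}:{start}-{end}"
            start = end + 1

if __name__ == "__main__":
    example = {"chr20": 63_025_520, "chr21": 48_129_895}   # illustrative sizes
    regions = list(make_regions(example))
    print(len(regions), "regions, one grid job each")
    print(regions[:3])
```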
status:ongoing
team:Eline Slagboom, Kai Ye, Jan Bot, Evert Lammerts
type:This is a dedicated project.
e-NGS
Accelerating OpenMX genome wide association studies
description:In a genome-wide association (GWA) analysis, genetic variants (single nucleotide polymorphisms, SNPs) across the whole genome are tested for association with a certain trait (such as body weight or a certain disorder). With the data that is currently available, this means that 1.5 to 4.5 million tests are performed. These tests can be set up using structural equation modeling, in which covariance structures with fixed effects are analyzed. Due to the large number of tests, GWA analysis is computationally expensive. Because genomic data are produced at increasing density and rapidly decreasing cost, the need to apply state-of-the-art high-performance computing methods in GWA analyses becomes urgent. Approaches to this problem are to use grid technology and to use computer hardware more efficiently, either by making use of GPUs or by optimizing the algorithms used.
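To make the scale of the problem concrete, the toy sketch below (not OpenMx, and without the family covariance structures the project actually uses) runs one simple regression test per SNP; multiplying this by millions of SNPs and more realistic models is what makes grid, GPU or algorithmic speed-ups necessary.

```python
# Toy illustration of a per-SNP association scan with ordinary least squares.
# Data are simulated; real GWA analyses in this project use structural
# equation models on trio/twin data, which are far more expensive per test.
import numpy as np

rng = np.random.default_rng(0)
n_individuals, n_snps = 1_000, 5_000          # real studies: millions of SNPs
genotypes = rng.integers(0, 3, size=(n_individuals, n_snps)).astype(float)
trait = rng.normal(size=n_individuals)

def snp_effect(g, y):
    """Slope of the regression y ~ genotype dosage for a single SNP."""
    X = np.column_stack([np.ones_like(g), g])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

effects = np.array([snp_effect(genotypes[:, j], trait) for j in range(n_snps)])
print("largest absolute effect in this simulated scan:", np.abs(effects).max())
```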
applicant:Han Rauwerda, University of Amsterdam
results:A 20-40-fold gain in computing time from an algorithm using symbolic algebra.
status:completed
team:Marijn van Eupen, Matthijs Kattenberg, Michel Nivard, Han Rauwerda, Dorret Boomsma
type:This is a dedicated project.
e-NGS
E-Infra for the Virtual lab for contemporary Plant Breeding (VLPB-I) Potato Projects *
description: BreeDB is a framework that stores and analyses phenotyping and genotyping data from large-scale plant breeding experiments. BreeDB can be used through a web-based interface, which offers data exploration & analysis tools such as box plots, histograms, PCA, and QTL analysis. R is used as the principal statistical framework to execute these analyses. Graphical genotyping tools are available to show molecular marker data and QTL data in relation to genetic linkage maps. BreeDB is used within national and international consortia, such as EU-SOL and CBSG.
This e-BioGrid project will be performed as an integrated part of the Technology Top Institute Green Genetics (TTI-GG) funded project Virtual Lab for contemporary Plant Breeding-I (VLPB-I). In VLPB, academic partners primarily develop open-source tools for breeding companies. The aim of this proposal is to improve data management and end-user analysis of large-scale genotyping and phenotyping datasets using grid-based computing. The rationale is that the involved VLPB partners would like to use BreeDB for storage and analysis of their own data, in combination with shared datasets. An example of such a shared dataset is the genotyping data (SNP data) extracted from next-generation sequencing data that will be generated within the 150 Tomato Genome Project (mid-2012). Due to the rapid increase in phenotyping and genotyping data points, the currently implemented data analysis tools are not powerful enough. Grid-based computing, in which the statistical procedures for genome-wide association studies are executed in parallel over many compute nodes, may be a solution to this problem. All developed tools will become available as open-source software.
applicant:Richard Finkers, Wageningen University and Research Centre
results:Infrastructure developed. The BreeDB software has been further developed within the scope of this combined e-BioGrid / TTI-GG program. BreeDB is actively used by our industrial and research community and is the main application behind database sites such as the CBSG database and the EU-SOL BreeDB database. Within this joint VLPB and e-BioGrid program, we further developed the BreeDB framework to provide a large-scale data analysis infrastructure for sequencing-based or array-based SNP-phenotype association analysis and ways to communicate this information back to end-users via custom-made visualisations.

Performance improvement. We have successfully written and deployed several R-based analyses on BiG Grid. We have not yet invested heavily in further parallelization of the jobs (easily more than 15,000 independent statistical analyses per job) into smaller chunks of, say, 500 analyses, so additional gains in efficiency and reductions in computing time can still be achieved. The grid-enabled R analysis methodology focused on several aspects:
  • Calculation of estimated means from multiple field trials.
  • Calculation of allele dosage from Illumina Infinium genome-wide SNP arrays.
  • Calculation of population structure (or alternatively via the command-line program structure).
  • Calculation of multiple trait-marker associations.
  • Calculation of a minimal marker model explaining a trait of interest.
In general, several of the computational steps have been reduced from several days to a time span of 30-60 minutes. We have benchmarked the association analysis methodology and find that the increase in run time with increasing job size is almost linear (perhaps slightly worse than linear). The implemented methodology is therefore suitable for analyzing the genome-wide GBS datasets that we expect to analyse in 2013.
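The chunking mentioned under "Performance improvement" can be sketched in a few lines; the batch size of 500 and the job naming are illustrative, not the project's actual settings.

```python
# Minimal sketch: split a long list of independent statistical analyses
# (easily >15,000 per job) into batches of ~500, each submitted as one grid job.
def chunk(items, size=500):
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

analyses = [f"marker_{i}" for i in range(15_000)]     # placeholder work items
for job_id, batch in enumerate(chunk(analyses)):
    # in the real setup each batch would become one R job on the grid
    print(f"job {job_id:03d}: {len(batch)} analyses ({batch[0]} .. {batch[-1]})")
```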

Knowledge. The involvement in the e-BioGrid program has been instrumental in learning how to develop and use grid-ready applications for our current and, especially, future e-infrastructure needs within plant breeding and, more generally, breeding research. Other. The e-BioGrid project has been instrumental in enabling grid technology within plant breeding research. The major step made within this e-BioGrid project is that we could incorporate BiG Grid into our strategy of developing BreeDB applications for the analysis of large-scale genome-wide polymorphism-trait associations.

Access. Code is available at Github: BreeDB and other tooling.
status:completed
team:Richard Finkers, Theo Borm, Richard Visser, Benoit Carreres, e-BioGrid team
type:This is a main project.
e-NGS
DNA Sequencing on the e-BioInfra platform
description:The e-infrastructure for bioscience research, e-BioInfra, is routinely used by researchers at the AMC to perform analysis of genomics data on the Dutch Grid, in particular for next-generation sequencing (NGS). The analysis steps are implemented as workflows that are executed on the grid in an automated fashion. Bioinformaticians at the AMC primarily run these workflows using the VBrowser, which also facilitates data manipulation on the grid storage. Selected applications are also available for novice users via the web interface of the e-BioInfra gateway. The goal of the project is to enable and enhance genomics research via advanced tools for data analysis. This is achieved in close collaboration with bioinformaticians.
applicant:Silvia Olabarriaga, on behalf of the VLEMED VO, Amsterdam Medical Centre / University of Amsterdam
results:Infrastructure developed. The AMC now operates a WS-PGRADE science gateway in addition to the in-house developed gateway. See http://www.ebioscience.amc.nl/liferay-portal-6.1.0/. Support was provided for the installation of the gateway on grid resources alongside the e-infrastructure gateway at AMC, for the construction of the first workflows, and for internal training.

Knowledge. The AMC participated in an international collaboration to develop a concept for dynamically scheduling light-paths based on compute and data location. Acting as alpha users of the new SURFnet BoD/NSI service, the AMC assisted in debugging the service. The initial results of this work were presented at a conference (see publications).
With an interest in data security, we performed a study 'Legal constraints on genetic data processing in European grids' (see publications). In the scope of ER-FLOW a document was produced titled 'Ethical issues: policy and code of conduct'. Elements of this document can be re-used for similar projects (this document can be obtained upon request from the EGI document database: https://documents.egi.eu/secure/ShowDocument?docid=1461).

Software. Insights into the co-scheduling of compute and data. A Pilot-Data implementation was developed based on DIANE, which is capable of running on BiG Grid resources. See http://www.ci.uchicago.edu/escience2012/pdf/P-A_Model_of_Pilot-Abstractions.pdf The code can be obtained from: http://redmine.ebioscience.amc.nl/projects/pilotapi-diane
AMC represented the Life Science community in the staged rollout of EGI/EMI software (SAGA). This has led to the inclusion of SAGA in the next EMI release. See http://repository.egi.eu/2012/11/20/release-umd-2-3-0/.

Other. The AMC has further developed and operated a workflow-based service that automatically tracks the provenance of grid workflow executions. This service, and its communication with BiG Grid resource providers, was supported by this e-BioGrid project.
The AMC participated and coordinated a task-force in the SCI-BUS project (http://www.sci-bus.eu) to study new data management functionality for the WS-PGRADE science gateway. See http://www.sci-bus.eu/wiki/-/wiki/Public/DataManagement
A new community has been reached within the AMC: the group of Medical Biochemistry. They are now re-using a workflow from the SHIWA repository for a virtual screening project with AutoDock Vina.

Publications. “P*: A Model of Pilot-Abstractions”, Andre Luckow, Mark Santcroos, Ole Weidner, Andre Merzky, Pradeep Mantha, Shantenu Jha, 8th IEEE International Conference on e-Science 2012, 2012
“Pilot Abstractions for Compute, Data, and Network”, Mark Santcroos, Silvia Delgado Olabarriaga, Daniel S. Katz, Shantenu Jha, NECS Workshop, 8th IEEE International Conference on e-Science 2012, 2012
“Exploring Dynamic Enactment of Scientific Workflows using Pilot-Abstractions”, Mark Santcroos, Barbera DC van Schaik, Shayan Shahand, Silvia Delgado Olabarriaga, Andre Luckow, Shantenu Jha ,13th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing (accepted), 2013

Access. For more information please refer to the gateway documentation. Code is available here.
status:completed
team:Barbera van Schaik, Antoine van Kampen, Angela Luyf, Marcel Willemsen, Aldo Jongejan, Silvia Olabarriaga, Mark Santcroos, Jan Just Keijser, Shayan Shahand, Vladimir Korkhov, Souley Madougou
type:This is a dedicated project.
e-NGS
cloud application for animal sciences
description:We want to investigate the use of Linux VMs with specific phylogenomics and population genomics software installed to perform whole-genome coalescent and phylogenetic analyses that are currently outside our computational reach. The most important request here is for CPU time (e.g. several weeks of wall time with 24 processors).
applicant:Hendrik Jan Megens, Wageningen University, Animal Breeding and Genomics Centre
results:
status:ongoing
team:Jan Bot, Hendrik Jan Megens
type:This is a dedicated project.
e-NGS
Annotation pipeline for microbial genomes
description:The goal of the project is to create an annotation pipeline for microbial genomes. A pipeline connecting freely available software already exists as a stand-alone version. Ideally it should be upgraded to a faster version (currently a step involving BLAST is the bottleneck) and made accessible to other users via a web interface.
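A common way to relieve a BLAST bottleneck is to split the query FASTA into chunks and BLAST each chunk as a separate job; the sketch below illustrates that idea only and is not the project's actual pipeline (file names and chunk count are placeholders).

```python
# Hedged sketch: distribute FASTA records round-robin over N chunk files so
# each chunk can be BLASTed in parallel on the grid or in the cloud.
def split_fasta(path, n_chunks=10, prefix="chunk"):
    records, header, seq = [], None, []
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                if header is not None:
                    records.append((header, "".join(seq)))
                header, seq = line.rstrip(), []
            else:
                seq.append(line.strip())
        if header is not None:
            records.append((header, "".join(seq)))
    outfiles = [open(f"{prefix}_{i}.fasta", "w") for i in range(n_chunks)]
    for i, (hdr, s) in enumerate(records):
        outfiles[i % n_chunks].write(f"{hdr}\n{s}\n")
    for fh in outfiles:
        fh.close()

# split_fasta("predicted_proteins.fasta", n_chunks=20)  # then run BLAST per chunk
```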
applicant:Genevieve Girard, IBL, Leiden University
results:none so far; project is starting
status:ongoing
team:Jan Bot, Niek Bosch, Genevieve Girard
type:This is a dedicated project.
e-NGS
Performing RNA-Seq Analysis with Galaxy in the HPC Cloud
description:We would like to use the HPC Cloud for the analysis of RNA-seq data, using already described protocols and algorithms. For that we will try to use the Galaxy computing framework, scaling up the number of nodes as needed. Besides that, we will investigate whether we can scale out our own cluster to the HPC Cloud as a test for future use. The results will be a complete pipeline for RNA-seq analysis capable of running in the HPC Cloud, and knowledge of whether an HPC Cloud is useful for scaling our own infrastructure.
applicant:J. van Haarst, Plant Research International
results:
status:ongoing
team:J. van Haarst
type:This is a dedicated project.
e-NGS
Structural variation discovery in next-generation sequencing data
description:The Life Sciences Group (MAC4) has various ongoing and future projects and collaborations concerning next-generation sequencing (NGS) data. One run of an NGS instrument can produce over 1 TB of sequencing reads. In collaboration with Ivan G. Costa Filho (Universidade Federal de Pernambuco, Brazil), Alexander Schliep (Rutgers University, NJ, USA), and Markus Bauer (Illumina Inc., UK), we develop methods to discover structural variations such as insertions, deletions, and inversions from paired-end sequencing reads. Preliminary results show that our methods outperform state-of-the-art algorithms. To thoroughly benchmark our methods, we ask for computational resources on the SARA HPC Cloud. To demonstrate the performance of our algorithms, we will run several simulation studies as well as discover structural variations in real data. To obtain real data, we have set up a collaboration with Illumina Inc. (Cambridge, UK), one of the leading manufacturers of next-generation sequencing machines, and recently received the data sets from Illumina. The requested resources are extrapolated from our preliminary results, in which we processed smaller data sets.
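The basic paired-end signal behind such methods can be sketched with pysam: read pairs whose insert size deviates strongly from the library's typical value hint at deletions (too large) or insertions (too small). This is only a hedged illustration of the principle, not the group's algorithm; the BAM file name, sample size and z-score cutoff are assumptions.

```python
# Illustrative sketch of discordant-insert-size detection with pysam.
import statistics
import pysam

def insert_size_stats(bam_path, sample_size=100_000):
    """Estimate mean and sd of the insert size from properly paired reads."""
    sizes = []
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(until_eof=True):
            if read.is_proper_pair and read.template_length > 0:
                sizes.append(read.template_length)
                if len(sizes) >= sample_size:
                    break
    return statistics.mean(sizes), statistics.stdev(sizes)

def candidate_sv_pairs(bam_path, z_cutoff=4.0):
    """Yield (chrom, position, insert size) for strongly discordant pairs."""
    mean, sd = insert_size_stats(bam_path)
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(until_eof=True):
            if read.is_paired and not read.is_unmapped and read.template_length > 0:
                if abs((read.template_length - mean) / sd) > z_cutoff:
                    yield read.reference_name, read.reference_start, read.template_length

# for hit in candidate_sv_pairs("sample.bam"):   # "sample.bam" is a placeholder
#     print(hit)
```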
applicant:Tobias Marschall, Centrum Wiskunde & Informatica
results:will follow soon
status:completed
team:Tobias Marschall, Alexander Schönhuth, Gunnar Klau, Stefan Canzar
type:This is a dedicated project.
e-NGS
NGS data transfers
description:As part of the collaboration between the LUMC and Washington University, the entire GoNL raw-read set needed to be transferred for the shared LUMC / GoNL project.
applicant:Kai Ye, Leiden University Medical Centre
results:For this shared LUMC / GoNL project the entire GoNL raw-read set was transferred to Washington University (Seattle, USA) using a high-speed internet connection provided by SARA. The entire set consisted of roughly 35 TB of data and was transferred in batches, taking about a month to complete. This project shows that, given the right tools and a reliable connection, large data transfers 'over the wire' are a good alternative to transfers by hard disk. The problems encountered in this project have given us insight into the challenges that lie ahead for these types of large data transfers.
status:completed
team:Kai Ye, Jan Bot
type:This is a dedicated project.
e-NGS
Nanoscopy: E-Science on the Nano- and ultrastructure scale *
description:With the powerful machines available at the Netherlands Centre for Electron Microscopy, various forms of 3D reconstruction in EM can be realized. Reconstruction methods range from near-atomic resolution to the level of ultrastructure. These methods are computationally intensive, and parallelization and grid computing have only been partially adopted in this field. Particle reconstruction has been implemented on a cluster using the IMAGIC software. X-ray crystallography uses the CCP4 suite, which has partially been made suitable for the grid. Computerized EM tomography invokes computer clusters using the IMOD software. Here, special attention needs to be given to enhancement of the 3D reconstruction. The goal of this project is to bundle 3D reconstruction tools into a grid-enabled e-science problem-solving environment for nanoscopy.
applicant:Fons Verbeek, Leiden Institute of Advanced Computer Science
results:Infrastructure developed. Software for Single Particle Analysis (SPA) as well as eTomography is used on powerful workstations. It turned out that the implementation designed on the small cluster gave a considerable increase in performance. However, as the software could not be ported to the grid, a cloud solution was elaborated. This is a scalable solution in which a cluster is created in the cloud according to the size the user considers fit for the job at hand.
The HPC infrastructure provided by SURFsara was used to test the specifications that were elaborated for this specific project. The implementation of the specifications is based on a Debian Linux OS. A head node is defined from which a command-line interface is provided. The user interface provides facilities for starting batch jobs that are parallelized over the core nodes. The number of core nodes is chosen so as to prevent CPU overload and overhead spill. From the head node the process is started with the parameters the user has chosen. From the command line a graphical X11-based interface can be invoked to start a particular interface.
In addition to the command-line interface, a web interface is provided to manage the data in a comfortable manner; basically, the web interface acts like a Unix shell. Both interfaces require some basic knowledge of the Unix command line. In addition to that knowledge, users do have good insight into the software being ported to the platform.

Knowledge. In this project considerable knowledge has been acquired on creating a flexible and powerful computational environment for electron nanoscopy. In the coming years this knowledge should be consolidated on computational platforms, used on the largest possible scale and expanded with new insights when required. The platform will support researchers in the field of electron nanoscopy in executing the complex tasks that emerge from their research. It takes into account the increasing size of data volumes and can adapt to new situations. In the next few years we hope to extend our knowledge of these types of computational platforms.

Software. The software is fitted into a middle layer on top of the OS, basically starting a head node that distributes work to client nodes. The middle layer was specifically designed and implemented for this platform architecture.

Other. The achieved results are suitable for use in the field of electron nanoscopy. In particular, researchers involved in NeCEN will be future users. Among the testers were researchers involved in NeCEN; they were selected for their knowledge of particular processes and of the specific software that we ported to the cloud.

Publications.
Cloud and Cluster Computing for eNanoscopy, in preparation (FJ Sicking & FJ Verbeek)

status:completed
team:Fons Verbeek, Floris Sicking, N. Pannu, Jan-Pieter Abrahams, Bram Koster
type:This is a main project.
e-NSC
Optimization of automated 3D electron microscopy data analyses
description:Advanced analyses, easier
Electron microscopy is of invaluable importance for the study of the complex organization and architecture of cellular structures. In recent years new and mostly automated electron microscopy techniques have been developed, such as electron tomography (ET) and focused ion beam scanning electron microscopy (FIB-SEM), that provide us with three-dimensional (3D) information about the cell. This added dimension has given us new insights into cellular structures and the interrelatedness of organelles and cellular processes.
Automated 3D analysis methods are still in their infancy. Data extraction still mainly depends on manual segmentation techniques and is therefore time-consuming and subjective.

Structured storage
Modern 3D electron microscopy recording methods and techniques (not only (S)TEM tomography, but also FIB-SEM, ILEM and SEM) will have to cope with ever-growing amounts of data. These data need to be stored in a structured and well-organized manner in order to be and remain accessible to their users. Currently, data storage is done by individual researchers in many different ways and formats. A better-structured and more uniform way of storing data and organizing data management is a precondition for the primary science case, but it also creates the opportunity for long-term use, enabling future reuse of electron microscopy information by research institutes and their distant collaborators.

The primary objective of this BiG Grid project is to improve compute-intensive analysis methods, including 3D template matching of 3D electron microscopy tomography data, and to provide broad access to these methods for the 3D electron microscopy community. The secondary objective is to implement a data storage system for 3D electron microscopy data.
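The kind of computation at the heart of 3D template matching can be illustrated with a small CPU sketch: cross-correlating a toy volume with a toy template via FFTs and reporting the best-matching position. Real electron-tomography pipelines additionally need normalization, masking and missing-wedge handling, and the project targets GPUs rather than this SciPy toy.

```python
# Hedged sketch of FFT-based 3D template matching on synthetic data.
import numpy as np
from scipy.signal import fftconvolve

rng = np.random.default_rng(1)
volume = rng.normal(size=(64, 64, 64))            # stand-in tomogram
template = np.zeros((9, 9, 9))
template[2:7, 2:7, 2:7] = 1.0                     # toy particle template
volume[20:29, 30:39, 40:49] += 5 * template       # hide one copy in the volume

# cross-correlation = convolution with the flipped template
scores = fftconvolve(volume, template[::-1, ::-1, ::-1], mode="same")
peak = np.unravel_index(np.argmax(scores), scores.shape)
print("best match centred near voxel:", peak)     # expect roughly (24, 34, 44)
```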

In order to achieve these objectives, an intensive collaboration of expert partners is needed to establish an infrastructure for the analysis of 3D electron microscopy data.

This project contains five main activities:
1. Reduce the total compute time and improve the compute capacity by implementing existing 3D analysis algorithms on GPUs
2. Improvement of the 3D analysis process by implementation of additional analysis and information management tools, objectifying and improving reliability of 3D data mining of electron tomography volumes
3. Improvement of the accessibility by creating an intuitive user environment
4. Improvement of storage, retrieval, archival and the controlled sharing of 3D data for electron microscopy data-analysis
5. Ensuring continuity and availability of the created solutions during and after the duration of the project

This project stems from a previous collaboration (IOP: IGE03012 / VL-E: CellTom) in an IOP Genomics programme. The SARA e-science support team has assisted in creating the project proposal and has established the project organization.

The foundation of this project is a proposal by the 3D electron microscopy group of Utrecht University. This proposal is supported by the Leiden University Medical Center.

applicant:Michel Lebbink, Leiden University Medical Centre
results:See our Wiki.
status:completed
team:Jan Andries Post, Misjaël Lebbink, Tom Visser
type:This is a dedicated project.
e-NSC
Development of PSEs for micro-array-based experiments. *
description:Nearly all microarray experiments are unique due to differences in experimental design, experimental procedure, level of completeness of the data and cellular responses. In addition, the methods with which array studies are analyzed are the topic of much bioinformatics research. State-of-the-art array analysis is therefore subject to change and must be highly flexible. Another aspect of array analysis is that at some points much computing power is needed. Here we propose to set up an architecture that implements Problem Solving Environments for six problem areas, from array design to downstream expression analysis. We will do this by setting up web services and by configuring dedicated Virtual Laboratory Machine Images that can be instantiated in a High Performance Compute Cloud. These Machine Images can be shared and dynamically scaled to a large virtual computer cluster. The content of these Machine Images will be documented and stored in a public Document Management Server. The resources used, input data and experimental results can be stored in a private Result Representation Server.
applicant:Timo Breit, University of Amsterdam
results:Software.
- a web-based microarray data quality control system (MADQC) has been developed. Based on a data matrix, a design file and a contrast matrix, this environment produces a number of calculations that enable the biologist to quickly assess the quality of the microarray experiment. Until now ~150 experiments have been processed, covering several thousand microarrays;
- image splitter software that makes 20-bit microarray scanner images compatible with 16-bit extraction software. This software has been used to date on 1000+ slide scans, each containing 4-12 arrays.
- starting and stopping cloud clusters on the fly: the Cloud Manager. The central element in this environment is a dedicated machine that runs in the cloud and acts as a manager node; it listens to requests from the outside world and can access the XML-RPC interface that is only accessible from within the cloud. Via this interface a cluster can be started and stopped. The Cloud Manager is used on a daily basis, not only with NGS Designer but also when a bioinformatician needs a cluster from within a local R session. So far we have used approx. of our 112,500 core hours on the HPC Cloud.
- a web tool to design tiling microarrays for a set of (prokaryote) sequences based on a tile step or simply on the required number of probes (Progenius*). It provides functionality to annotate the probes on the basis of the input sequences provided by the user. So far we have used Progenius to design probes for several studies, including those for UMC Utrecht on Staphylococcus aureus strains and for RIVM on Salmonella.
- a web tool (NGS Designer*) that produces microarray designs on the basis of next-generation sequencing reads. Parameters that can be set include the required number of probes, probe length, sequence similarity thresholds, and thermodynamic parameters such as Gibbs free energy and GC content. NGS Designer has been used to design arrays for several species, such as fruits and Chironomus riparius (PhD thesis of Marino Marincovic on gene expression in toxicant-exposed chironomids). The size of the input files requires instantiating a machine in the BiG Grid HPC Cloud for each NGS Designer request; for this the Cloud Manager is used.
- implemented generic tools for transcriptomics data analysis, like ANOVA based analyses on the cloud;
- started the development of a normalization tool based on extensive experimental spike-ins; currently we are evaluating the use of these spike-ins in next-generation-sequencing-based transcriptomics experiments (Ion Proton platform).
- Developed a tool for present/absent calling in microarray-based CGH*;
- The result browser (MARSDB*) allows exploring the results of multi-factorial transcriptomics experiments. MARSDB originates from the many and repetitive requests from biologists. MARSDB allows us to generate a list of significant probes at a certain p-value, to generate a list of probes in each contrast with all statistics, to make selections of probes or of probes in a GO category in each given contrast and to determine overlaps (Venn diagrams) in a maximum of 4 contrasts. These lists are then exported and can be further used in other applications such as applications for set analysis and pathway analysis. Finally the tool can produce heatmaps. This tool has been used in several studies.
- implemented pipelines for interpreting dose & time range-finding experiments in the context of design for experimentation. These pipelines have been of particular interest in the BioRange TASTOE project;
- We have explored ways to store, query and browse genome data using the community effort of GBrowse and GMOD. We have setup repositories for several organisms and added tracks from our array design pipeline.
- The development and support of a PSE for community-based collaboration. Currently we are preparing a prototype of this environment in which we will use the data generated in the context of the BioRange TASTOE project on early zebrafish development. The community-based research questions will specifically be of a biological nature. In this environment we will use many of the tools that have been developed in WP1, such as the result browser and the GBrowse/GMOD environments. Next to that we introduce a Gist-based tool and code repository. This repository contains entries for all our tools, programs, and scripts and is well annotated, searchable and taggable. The GitHub/Gist environment offers a means to share code snippets, but we also wanted to be able to share links to web resources and to access packages. Hence we made a mixed environment that can be put in the website of the community-based collaboration project, in which anybody can upload, annotate and tag new scripts, packages and weblinks. For this we use the Gist web API.
* Publication in preparation

Usage and access. The Progenius array design tool is used by UvA, RIVM, WUR and UMCU. The NGS Designer tool is also used by the UvA-IBED group. Several e-science ideas and concepts are exploited within the Virtual Lab for Plant Breeding consortium. Progenius can be run to design multi-strain probes for non-standard tiling microarrays. NGS Designer is a tool to design probes based on next-generation sequencing reads for non-standard microarrays.

Publications.
1. Chikhovskaya JV, Jonker MJ, Meissner A, Breit TM, Repping S, van Pelt AMM Human testis-derived embryonic stem cell-like cells are not pluripotent, but possess potential of mesenchymal progenitors. Human Reproduction 2012 Jan;27(1):210-21.
2. Schaap MM, Zwart EP, Wackers P, Huijskens I, van de Water B, Breit TM, van Steeg H, Jonker MJ, Luijten M Dissecting Modes of Action of Non-Genotoxic 1 Carcinogens in Primary Mouse Hepatocytes. Archives of Toxicology 2012 Nov;86(11):1717-27.
3. Marinkovic M, de Leeuw WC, de Jong M, Kraak MH, Admiraal W, Breit TM, Jonker MJ. Combining next-generation sequencing and microarray technology into a transcriptomics approach for the non-model organism Chironomus riparius. PLoS One. 2012;7(10):e48096. doi: 10.1371/journal.pone.0048096. Epub 2012 Oct 25.
4. Marinkovic M, de Leeuw WC, Ensink WA, de Jong M, Breit TM, Admiraal W, Kraak MH, Jonker MJ. Gene expression patterns and life cycle responses of toxicant-exposed chironomids. Environ Sci Technol. 2012 Nov 20;46(22):12679-86. doi: 10.1021/es3033617. Epub 2012 Nov 9.
5. Marinkovic M, de Bruijn K, Asselman M, Bogaert M, Jonker MJ, Kraak MH, Admiraal W. Response of the nonbiting midge Chironomus riparius to multigeneration toxicant exposure. Environ Sci Technol. 2012 Nov 6;46(21):12105-11. doi: 10.1021/es300421r. Epub 2012 Oct 19.
6. Schaap MM, Zwart EP, Wackers PF, Huijskens I, van de Water B, Breit TM, van Steeg H, Jonker MJ, Luijten M. Dissecting modes of action of non-genotoxic carcinogens in primary mouse hepatocytes. Arch Toxicol. 2012 Nov;86(11):1717-27. doi: 10.1007/s00204-012-0883-6. Epub 2012 Jun 19.
7. Doroszuk A, Jonker MJ, Pul N, Breit TM, Zwaan BJ. Transcriptome analysis of a long-lived natural Drosophila variant: a prominent role of stress- and reproduction-genes in lifespan extension. BMC Genomics. 2012 May 4;13:167. doi: 10.1186/1471-2164-13-167.
8. Roeschmann K, Luiten S, Jonker M, Breit TM, Fokkens W, Petersen A, van Drunen Timothy Grass pollen extract-induced gene expression and signaling pathways in airway epithelial cells. Clin Exp Allergy. 2011 Jun;41(6):830-41
9. Yuan X, Jonker MJ, de Wilde J, Verhoef A, Wittink FRA, van Benthem J, Bessems JG, Hakkert BC, Kuiper R, van Steeg H, Breit TM, Luijten M Finding maximal transcriptome differences between reprotoxic and non-reprotoxic phthalate responses in rat testis. Journal of Applied Toxicology 2011 31 (5); 421-430
10. de Bekker C, Bruning O, Jonker MJ, Breit TM, Woesten HA. Single cell transcriptomics of neighboring hyphae of Aspergillus niger. Genome Biol. 2011 Aug 4;12(8):R71. doi: 10.1186/gb-2011-12-8-r71.
11. Hakvoort TB, Moerland PD, Frijters R, Sokolović A, Labruyere WT, Vermeulen JL, Ver Loren van Themaat E, Breit TM, Wittink FR, van Kampen AH, Verhoeven AJ, Lamers WH, Sokolović M. Interorgan coordination of the murine adaptive response to fasting. J Biol Chem. 2011 May 6;286(18):16332-43. doi: 10.1074/jbc.M110.216986. Epub 2011 Mar 10.
12. Yuan X, Jonker MJ, de Wilde J, Verhoef A, Wittink FR, van Benthem J, Bessems JG, Hakkert BC, Kuiper RV, van Steeg H, Breit TM, Luijten M. Finding maximal transcriptome differences between reprotoxic and non-reprotoxic phthalate responses in rat testis. J Appl Toxicol. 2011 Jul;31(5):421-30. doi: 10.1002/jat.1601. Epub 2010 Nov 9.
status:completed
team:Linda Bakker, Wim de Leeuw, Han Rauwerda, Mark de Jong, Oscar Bruning, Timo Breit
type:This is a main project.
e-MAT
e-science tool for identifying common haplotypes
description:From SNP data, common haplotypes are identified using a tool prepared by the NBIC BRS team. The tool runs in Galaxy and will be made available on the BiG Grid computing infrastructure.
applicant:Cisca Wijmenga, University of Groningen
results:
status:ongoing
team:Leon Mei, Marcel Kempenaar, Marc van Driel, Gerard te Meerman, Andre de Vries
type:This is a dedicated project.
e-MAT
HPC Cloud Usage for Transcriptomics
description:MAD/IBU is a transcriptomics expert centre. MAD/IBU participates in a number of scientific projects, such as the Concord MRSA FP7 EU project, NBIC BioAssist 8.1, NBIC BioRange 4.1, BiG Grid's e-BioGrid, and the TTI-GG Virtual Lab for Plant Breeding. MAD/IBU is also the microarray facility of the University of Amsterdam, offers bioinformatics support in transcriptomics projects and provides bachelor as well as master education in the area of transcriptomics. In all mentioned projects and activities MAD/IBU needs high-performance computing. In general, initial analyses are performed on machines owned by MAD/IBU; however, these machines are far from sufficient to analyse full (transcriptomics) datasets with state-of-the-art methodology. On the other hand, MAD/IBU does not need continuous use of HPC equipment, and the software MAD/IBU uses is very diverse. For these reasons the HPC Cloud suits our needs very well, and we therefore apply for compute time and storage on the HPC Cloud. The main applications for which we need HPC are mixed-effect model analysis of microarray data, (multi-strain) microarray design, sequence alignment against non-redundant GenBank freezes, transcription factor binding site discovery, next-generation de novo assembly, SNP calling, re-sequencing (read mapping) and RNA-seq tooling.
applicant:Timo Breit, University of Amsterdam
results:ongoing
status:ongoing
team:Timo Breit, Linda Bakker, Oskar Bruning, Martijs Jonker, Mattias Kuzak, Wim de Leeuw, Han Rauwerda
type:This is a dedicated project.
e-MAT
Medical Imaging on the e-BioInfra platform *
description:The e-infrastructure for bioscience research, e-BioInfra, is routinely used by researchers at the AMC to perform medical image analysis on the Dutch Grid. The image analysis pipelines are implemented as workflows that are executed on the grid in an automated fashion. Various neuroimaging applications have been ported to this platform and made available to researchers from the Radiology, Psychiatry and other clinical departments at the AMC. The web interface of the e-BioInfra gateway provides novice users with easy access to applications such as FreeSurfer (brain surface segmentation) and DTI atlas construction. The goal of the project is to enable and enhance medical imaging research via advanced tools for data analysis. This is achieved in close collaboration with medical imaging researchers.
applicant:Silvia Olabarriaga, on behalf of the VLEMED VO, Amsterdam Medical Centre / University of Amsterdam
results:Infrastructure developed. The AMC now operates a WS-PGRADE science gateway in addition to the in-house developed gateway. See http://www.ebioscience.amc.nl/liferay-portal-6.1.0/. Support was provided for the installation of the gateway on grid resources alongside the e-infrastructure gateway at AMC, for the construction of the first workflows, and for internal training.

Knowledge. The AMC participated in an international collaboration to develop a concept for dynamically scheduling light-paths based on compute and data location. Acting as alpha users of the new SURFnet BoD/NSI service, the AMC assisted in debugging the service. The initial results of this work were presented at a conference (see publications).
With an interest in data security, we performed a study "Legal constraints on genetic data processing in European grids" (see publications). In the scope of ER-FLOW a document was produced titled "Ethical issues: policy and code of conduct". Elements of this document can be re-used for similar projects (this document can be obtained upon request from the EGI document database here).

Software. Insights into the co-scheduling of compute and data. A Pilot-Data implementation was developed based on DIANE, which is capable of running on BiG Grid resources. See the presentation. The code can be obtained here.
AMC represented the Life Science community in the staged rollout of EGI/EMI software (SAGA). This has led to the inclusion of SAGA in the next EMI release. See http://repository.egi.eu/2012/11/20/release-umd-2-3-0/.

Other. The AMC has further developed and operated a workflow-based service that automatically tracks the provenance of grid workflow executions. This service, and its communication with BiG Grid resource providers, was supported by this e-BioGrid project.
The AMC participated and coordinated a task-force in the SCI-BUS project to study new data management functionality for the WS-PGRADE science gateway. See the wiki.
A new community has been reached within the AMC: the group of Medical Biochemistry. They are now re-using a workflow from the SHIWA repository for a virtual screening project with AutoDock Vina.

Publications.
- P*: A Model of Pilot-Abstractions, Andre Luckow, Mark Santcroos, Ole Weidner, Andre Merzky, Pradeep Mantha, Shantenu Jha, 8th IEEE International Conference on e-Science 2012, 2012
- Pilot Abstractions for Compute, Data, and Network, Mark Santcroos, Silvia Delgado Olabarriaga, Daniel S. Katz, Shantenu Jha, NECS Workshop, 8th IEEE International Conference on e-Science 2012, 2012
- Exploring Dynamic Enactment of Scientific Workflows using Pilot-Abstractions, Mark Santcroos, Barbera DC van Schaik, Shayan Shahand, Silvia Delgado Olabarriaga, Andre Luckow, Shantenu Jha ,13th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing (accepted), 2013

Access. For more information see the gateway documentation. Code is available here.
status:completed
team:Matthan Caan, Silvia Olabarriaga, Antoine van Kampen, Mark Santcroos, Jan Just Keijser, Shayan Shahand, Vladimir Korkhov, Souley Madougou
type:This is a main project.
e-MDI
Testing the feasibility of Massive MRI data analysis
description:Magnetic resonance imaging is a modern technique for recording brain activity. The analysis of these data, both functional and anatomical, will undoubtedly bring new insights into the neural basis of cognitive functioning. Up to now the complexity of analysis has been limited by the available computational power, and certain types of approaches have been avoided in typical analyses. In this project we want to evaluate whether it is feasible to use computationally heavy approaches to noise reduction and to anatomical and functional connectivity methods for the normal experiments that are being conducted at the Spinoza Center for NeuroImaging.
applicant:Steven Scholte, University of Amsterdam
results:Anticipated results: a Matlab module serving as a standard fMRI preprocessing tool for all brain and cognition researchers, and a DTI analysis tool for a new type of connectivity analysis.
status:ongoing
team:Steven Scholte, Sennay Ghebreab, Lourens Waldorp, Matthan Caan
type:This is a dedicated project.
e-MDI
Translational Research IT - Enhanced Pathology Image Sharing.
description:This project will result in an installed environment to enable Pathology Image sharing for translational research between and by the Academic Medical Centers in the Netherlands.
Pathology is an important domain in the concept of translational research. Since translational research projects very often include multi-center studies, much can be gained by improving the workflow of pathology slides among project participants. Currently, most research facilities have their own stand-alone systems for digital pathology (the term often used when glass slides are captured as whole-slide images). These images come in different, sometimes proprietary, file formats, and there is no infrastructure for collaborating on them. This project addresses a number of key challenges related to the realization of translational research applications with an IT infrastructure for sharing pathology images between multiple participating sites, each with their own local Laboratory Information Systems and multi-vendor digital glass-slide scanners. The realized infrastructure shall include an open interface to access images for image analysis purposes and an industry-standard interface to search for the relevant information based on cross-domain queries from external systems.
Besides image sharing within the tEPIS environment, it should be integrated with the TraIT context so that digital pathology images can be found through cross-domain queries (e.g. search for all patients with a particular Gleason score for whom MRI images are available and a tissue sample is stored in the biobank catalogue).
This environment will consist of connected Image Management Systems (IMS) per participant, where each connected IMS will be coupled to the participant's specific Laboratory Information System (LIS), where relevant to the pathology reporting system (e.g. U-DPS), and to the specific whole-slide image scanner. The objective of this environment is to enable the sharing of digital pathology images in the context of translational research for purposes such as sharing diagnoses, consultation, image analysis (both stand-alone for research purposes and as part of a review workflow), translational (cross-domain) research and miscellaneous purposes like sharing slides for conferences and publications.
applicant:Nikolaos Stathonikus, Universitair Medisch Centrum Utrecht, Pathologie
results:
status:ongoing
team:Nikolaos Stathonikus, eBioGrid support team
type:This is a dedicated project.
e-MDI
Biomarker Boosting pipeline testing using the cloud
description:The Biomarker Boosting project is a collaboration between Radboud University Nijmegen, the Dutch eScience Center, and four Dutch UMCs (VU Amsterdam, Erasmus, Maastricht, Nijmegen). It aims to develop a platform for sharing and jointly analysing imaging data. In this pilot project, the four UMCs contribute a total of 1,500 structural MRI scans. An automated hippocampal volume pipeline will be applied to all of these, and the results will be correlated with age, gender and mental disease state. The goal is to investigate under what circumstances the pooling of large datasets improves the statistical significance of the computed correlations (biomarkers). To test and debug the pipeline, we request an initial test grant of 20,000 core-hours. This test phase should give us a clear estimate of how much additional compute time we will need, and whether the cloud-based solution is the right choice.
applicant:Paul Tiesinga, Radboud Universiteit Nijmegen
results:
status:ongoing
team:Rembrandt Bakker, Piter de Boer, e-BioGrid support team
type:This is a dedicated project.
e-MDI
fMRI and MEG data analysis
description:Brain imaging experiments deliver massive amounts of data that need very intensive analysis. In this NWO Veni project, full datasets are gathered on which new methods of statistical learning analysis are tested. A cloud machine on which to run these analyses, which are all based on open-source components, provides a very homogeneous and dependable environment to work in.
applicant:Tomas Knapen, Universiteit van Amsterdam
results:
status:ongoing
team:Tomas Knapen, e-BioGrid team
type:This is a dedicated project.
e-MDI
Integration of computationally intensive software in a metabolomics data processing tool chain *
description:Metabolomics is a rapidly growing discipline, relying on bioinformatics for data processing. Large amounts of data are being generated in metabolomics studies. The process of extracting biological information can be seen as an integrated workflow, and it is recognized that this workflow can benefit greatly from coordinated (and automated) handling and processing of the data. The need for tools and applications to support the data handling and biological interpretation is huge, but the online availability of metabolomics data and tools is poor. This is hampering progress and standardization in the scientific field. The Netherlands Metabolomics Centre, in collaboration with the Netherlands Bioinformatics Centre, has a dedicated project that supports the development of an infrastructure to share metabolomics data and tools: the NMC Data Support Platform. This project addresses two major bottlenecks for metabolomics research. The first is the sharing of metabolomics studies and data. The second is the accessibility of dedicated processing and biostatistics tools. This goal brings a number of bioinformatics and e-science challenges: some tools require high-performance computing, and the tools also need to be integrated into a data processing tool chain. This project proposes a collaboration between e-BioGrid and the NBIC/NMC data support platform taskforce of programmers, with the aim of tackling the e-science challenges. The output of the project will be an online computing environment where sets of preprocessing, biostatistics and quality control tools are made accessible to all NMC biologists and biostatisticians, and to any interested users from the international community.
applicant:Theo Reijmers, LACDR, Leiden University
results:Developed infrastructure. A logistic regression tool that can be used for biomarker selection in mass spectrometry data was implemented, together with double cross-validation and permutation functionalities. This was done in collaboration with the AMC. The tool was implemented in such a way that other classification and/or feature selection methods can easily be added. Future plans are to extend this tool with Partial Least Squares Discriminant Analysis (PLS-DA) and other classification tools. At the moment the tool runs as a web-based gateway on the e-BioInfra gateway. Hundreds of parallel instances of logistic regression can be submitted at once for different permutations of the input data. The computing-intensive steps used by Structure Generator, a tool used in conjunction with MetiTree, a mass spectra repository for the life sciences, as well as the cross-validation tool, have been parallelised.
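The permutation idea behind this tool can be shown with a toy scikit-learn sketch: cross-validated logistic-regression accuracy on the real labels is compared against many runs with permuted labels, and each permutation can be farmed out as an independent grid job. This is a simplified stand-in (plain cross-validation instead of the tool's double cross-validation, simulated data, arbitrary parameters), not the gateway implementation itself.

```python
# Toy permutation test for a logistic-regression biomarker model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 200))            # 80 samples, 200 MS features (simulated)
y = rng.integers(0, 2, size=80)
X[y == 1, :5] += 1.0                      # plant a weak signal in 5 features

clf = LogisticRegression(max_iter=1000)
real_score = cross_val_score(clf, X, y, cv=5).mean()

perm_scores = []
for _ in range(100):                      # each permutation is an independent job
    perm_scores.append(cross_val_score(clf, X, rng.permutation(y), cv=5).mean())

p_value = (np.sum(np.array(perm_scores) >= real_score) + 1) / (len(perm_scores) + 1)
print(f"real accuracy {real_score:.2f}, permutation p-value {p_value:.3f}")
```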

Performance improvement. To speed up metabolite identification, the in-house developed Open Molecule Generator (OMG) was assessed for running its calculations in parallel. In addition to a parallel version, OMG was extended with new functionality that further narrows down the number of candidate structures that can be connected to metabolomics features of unknown identity. A further improvement to OMG was obtained by implementing a faster algorithm that, even in sequential execution, delivers up to a 10-fold speed-up.

Knowledge. For automating the preprocessing of high-resolution metabolomics mass spectrometry data, the same group made progress on a new generic data integration method. The Open Molecule Generator (OMG) is part of a metabolite identification pipeline; with the extended version of OMG, setting up such a pipeline comes within reach. In the near future we envision implementing an identification pipeline that combines this improved structure generator with several other open-source, in-house developed computational identification tools.

Publications. In preparation: Journal of Metabolomics, Proceedings of ACSD 2013.

Access. Documentation on the developed tool chain can be found here. The tool chain can be run on the ebioinfra gateway. Note that a login is required.
status:completed
team:Theo Reijmers, Margriet Hendriks, Kees van Bochove, M. van Vliet, G. Zwanenburg, J. Bouwman, J. Wesbeek, S. Sikkema, T. Abma, Mahdi Jaghouri
type:This is a main project.
e-MAS
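To make the "double cross-validation with permutations" approach from the results above concrete, here is a minimal scikit-learn sketch on synthetic data. It is not the gateway tool itself, which runs the permutations as parallel grid jobs; the dataset and parameter grid are assumptions.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     cross_val_score, permutation_test_score)

# Synthetic stand-in for a mass spectrometry feature matrix (samples x features).
X, y = make_classification(n_samples=80, n_features=200, n_informative=10, random_state=0)

# Double (nested) cross-validation: the inner loop tunes the regularisation
# strength, the outer loop estimates performance on samples never used for tuning.
inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)
model = GridSearchCV(LogisticRegression(penalty="l1", solver="liblinear"),
                     {"C": [0.01, 0.1, 1, 10]}, cv=inner)
print("double-CV accuracy:", cross_val_score(model, X, y, cv=outer).mean())

# Permutation test: repeat the analysis on label-shuffled data to estimate
# how likely the observed accuracy is under the null hypothesis.
score, _, pvalue = permutation_test_score(model, X, y, cv=outer, n_permutations=100)
print("accuracy:", score, "permutation p-value:", pvalue)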
Implementing a proteomics Taverna workflow onto Grid and Cloud
description:In this project, modular workflows are developed for robust, automated and efficient analysis of LC-MS data. The goal is to develop a suite of efficient programs that eliminate existing bottlenecks in the high-throughput analysis of LC-MS data, i.e. to develop and implement robust parallel software for chromatographic alignment, retention time prediction, calibration of MS data and extraction of quantitative information from LC-MS datasets. A typical analysis workflow contains at least one component for matching tandem mass spectra to predicted peptide fragmentation patterns; examples are X!Tandem and Crux, but we are also developing our own software. We have already integrated the serial version of X!Tandem in a Taverna workflow with PeptideProphet and some of our own tools for alignment and calibration, such as pepAlign and msRecal. All of these tools/algorithms are open source and have been described in recent literature, although only X!Tandem has been parallelized previously (a generic sketch of parallel searching over spectrum chunks is given below this entry).
applicant:Magnus Palmblad, Leiden University Medical Center
results:Different workflows and workflow components for proteomics data analysis have been implemented to run on the cloud. A paper describing the use of a scientific workflow management system in proteomics has been published (de Bruin, Deelder, and Palmblad, Scientific Workflow Management in Proteomics, Mol. Cell. Proteomics, 2012). Two other manuscripts are in preparation.
status:ongoing
team:Magnus Palmblad, Yassene Mohammed, Andre M. Deelder
type:This is a dedicated project.
e-MAS
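The following sketch shows only the generic "split the spectra, run one database search per chunk" pattern referred to above; the command line and parameter files are placeholders, not the project's Taverna workflow or its parallel X!Tandem setup.

import subprocess
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def run_search(param_file: Path) -> int:
    # Placeholder invocation: X!Tandem is normally driven by a single XML
    # parameter file that points at one spectrum chunk and the search database.
    return subprocess.run(["tandem", str(param_file)]).returncode

if __name__ == "__main__":
    # One (assumed) parameter file per spectrum chunk, e.g. chunk_000.xml, chunk_001.xml, ...
    params = sorted(Path("params").glob("chunk_*.xml"))
    with ProcessPoolExecutor(max_workers=8) as pool:
        codes = list(pool.map(run_search, params))
    print(f"{codes.count(0)}/{len(codes)} chunks searched successfully")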
analysis of mass spectrometry tumor imaging
description:application not available
applicant:Liam McDonnell, Leiden University Medical Centre
results:application not available
status:on hold
team:application not available
type:This is a dedicated project.
e-MAS
Grid Based Advanced Data Analysis and Classification of Big Imaging MS Datasets
description:The FOM institute AMOLF is currently part of the COMMIT project for e-biobanking of large mass spectrometric datasets. One aim of this project is the collection, storage and analysis of large mass spectrometric imaging datasets. The highest-performance mass spectrometers, Fourier transform ion cyclotron resonance instruments (FT-ICR MS), offer unrivalled chemical specificity. This high performance comes with large (4 MB-16 MB) data files for each individual mass spectrum. A full MS imaging scan of a biological tissue usually requires ~4,000 individual spectra, yielding complete datasets of 15 GB-100 GB. The data are then processed, which entails zero-filling (doubling the size of each spectrum), application of an apodization function (CPU/time intensive) and a fast Fourier transform (a minimal sketch of these per-spectrum steps is given below this entry).
BigImage will use hardware resources to roll out workflow-based data analysis software onto BiG Grid. The requested core hours will support testing on BiG Grid, as well as analysis of FT-ICR MS imaging datasets of breast cancer tissues. After basic processing on BiG Grid, the core hours will be used to extend the data analysis capabilities of the workflow-based software with multivariate statistical analysis tools.
applicant:Donald Smith, FOM-Instituut voor Atoom-Molecuulfysica
results:The use of BiG Grid for analysis of large FT-ICR MS imaging datasets will yield a dramatic decrease in analysis time. In addition, the ability to apply advanced algorithms will improve the mass spectral performance and will be applied for the first time to FT-ICR MS imaging datasets. Combined, the results will yield a unique capability for unrivalled rapid data analysis of high resolution FT-ICR MS imaging datasets. New statistical analysis modules in Chameleon should result in unique classifiers for diseased tissues based on integrated multi-modal data processing on BiG Grid.
status:ongoing
team:Donald Smith, Ron Heeren, Carl Schultz, Nadine Mascini
type:This is a dedicated project.
e-MAS
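A minimal NumPy sketch of the per-spectrum processing steps named in the BigImage description above (zero filling, apodization, FFT); the window function and transient length are assumptions, not AMOLF's actual processing chain.

import numpy as np

def process_transient(transient: np.ndarray) -> np.ndarray:
    n = transient.size
    apodized = transient * np.hanning(n)               # apodization (window choice assumed)
    padded = np.concatenate([apodized, np.zeros(n)])   # zero-fill: doubles the data size
    return np.abs(np.fft.rfft(padded))                 # magnitude spectrum

# A full imaging run applies this to each of the ~4,000 spectra in a dataset.
example = np.random.randn(2**20).astype(np.float64)    # placeholder transient
print(process_transient(example).shape)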
HPC cloud to scale the NBIC Galaxy
description:The Galaxy server, maintained by NBIC, is used to run generic bioinformatics tools for sequence and proteomics analysis through a standard web user interface. As many of the tools demand substantial CPU and storage, running the Galaxy server on a high-performance computing cloud expands its compute capacity. In collaboration with NBIC, the e-BioGrid team developed NBIC Galaxy on the Cloud. The application scales dynamically with increasing workload. Access to the HPC Cloud enables processing of large data volumes, and the high-speed network connection provides rapid data transfers.
applicant:Marc van Driel, Netherlands Bioinformatics Centre
results:Galaxy images are being built for installation on the cloud.
status:completed
team:Marc van Driel, Leon Mei, Floris Sluiter, Tom Visser, Niek Bosch
type:This is a dedicated project.
e-infra
HPC Cloud beta testing
description:The Microarray Department / Integrative Bioinformatics Unit at the University of Amsterdam will set up the HPC Cloud environment as a flexible and scalable environment for microarray design and analysis. From a local R session we want to be able to initialize an HPC Cloud compute cluster on the fly, use it from the local R session, and shut the cluster down when it is no longer needed.
applicant:Timo Breit, University of Amsterdam
results:Faster submission of computationally intensive jobs to the Cloud. Dynamic up- and down-scaling of a Cloud cluster from a local R session.
status:completed
team:Han Rauwerda, Wim de Leeuw
type:This is a dedicated project.
e-infra
AMC e-infrastructure for Biomedical Research
description:The e-BioInfra platform provides facilities to run large data analysis experiments on the Dutch Grid. The project includes software and system design, development, and deployment as services for the AMC research community. The platform is based on workflow technology and also includes data transfer, monitoring and provenance services. The team additionally supports researchers who wish to perform experiments on the grid infrastructure. The web interface of the e-BioInfra gateway provides easy access for novice users.
BiG Grid funds one member of the e-bioscience team (Mark Santcroos) to improve the link between the e-BioInfra and the Dutch grid resources and services. Activities involve development and integration of new middleware tools, user support, definition of guidelines and best practices, and dissemination of the platform to a larger community of biomedical and life science researchers.
applicant:Silvia Olabarriaga, Antoine van Kampen and Jan Just Keijser, Amsterdam Medical Centre / University of Amsterdam
results:S.D. Olabarriaga, T. Glatard, P.T. de Boer, "A Virtual Laboratory for Medical Image Analysis", IEEE Transactions on Information Technology In Biomedicine (TITB), 2010 Apr 5.
M.W.A. Caan, F.M. Vos, L.J. van Vliet, A.H.C. van Kampen, S.D. Olabarriaga. Gridifying a Diffusion Tensor Imaging Analysis Pipeline. Proceedings of the 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing (CCGrid 2010) Melbourne, VIC, Australia, May 17-May 20. IEEE Computer Society. pp.733-738, 2010.
S. Shahand, M. Santcroos, Y. Mohammed, V. Korkhov, A. Luyf, A. van Kampen and S. Olabarriaga. Front-ends to Biomedical Data Analysis on Grids. Proceedings of HealthGrid 2011 (in press), 2011
status:ongoing
team:Mark Santcroos, Silvia Olabarriaga, Jan Just Keijser, Antoine van Kampen, Shayan Shahand, Vladimir Korkhov, Souley Madougou
type:This is a dedicated project.
e-infra
Data Analysis Framework (DAF)
description:Development of the generic infrastructure program Data Analysis Framework (DAF), which simplifies the processing of data-intensive tasks on the Grid. DAF provides each user with quota-limited disk space to upload input files and store processing results. Data processing tasks, based on the command-line tools integrated in DAF, can then be executed on the data in the user's disk space. All data processing services, including file I/O to and from the user's disk space, can be accessed via web service requests (an illustration of this interaction pattern is given below this entry). DAF uses gLite for job submission and an enhanced ToPoS pilot-job system to reduce job submission errors. We plan to integrate DAF with data management software such as Molgenis or OpenBIS to develop a fully integrated data analysis platform. We also plan to provide an easy-to-use web interface, in which generic web pages are created for the integrated tools and workflows, and a scientific visualization platform that provides visualization support.
See also the application to a proteomics data analysis infrastructure here.

applicant:Ishtiaq Ahmad, University of Groningen, Department of Pharmacy, Analytical Biochemistry
results:DAF in its current state already provides a high-throughput time alignment service [1] based on the Warp2D tool [2] for LC-MS peak lists, accessible at http://www.nbpp.nl/warp2d.html. The msCompare workflow [6] has been integrated in the NBIC Galaxy server. Other results relate to the tools and workflows mentioned above that we intend to integrate in DAF.

1. Ahmad I, Suits F, Hoekman B, Swertz MA, Byelas H, Dijkstra M, Hooft R, Katsubo D, van Breukelen B, Bischoff R, Horvatovich P., A high-throughput processing service for retention time alignment of complex proteomics and metabolomics LC-MS data, Bioinformatics, 2011, 27(8):1176-1178, PMID: 21349866
2. Suits F, Lepre J, Du P, Bischoff R, Horvatovich P., Two-dimensional method for time aligning liquid chromatography-mass spectrometry data, Anal Chem., 2008, 80(9):3095-3104, PMID: 18396914
3. Christin C, Hoefsloot HC, Smilde AK, Suits F, Bischoff R, Horvatovich PL., Time alignment algorithms based on selected mass traces for complex LC-MS data, J Proteome Res., 2010, 9(3):1483-1495, PMID: 20070124
4. Christin C, Smilde AK, Hoefsloot HC, Suits F, Bischoff R, Horvatovich PL., Optimized time alignment algorithm for LC-MS data: correlation optimized warping using component detection algorithm-selected mass chromatograms, Anal Chem., 2008, 80(18):7012-7021, PMID: 18715018
5. Christin, C., Hoefsloot, H. C. J., Smilde, A. K., Hoekman, B., Bischoff, R., Horvatovich, P., A critical assessment of statistical methods for biomarker discovery in clinical proteomics, manuscript submitted to Molecular & Cellular Proteomics.
6. Hoekman, B., Breitling, R., Suits, F., Bischoff, R., Horvatovich, P., msCompare: a framework for quantitative analysis of label-free LC-MS data for comparative candidate biomarker studies.
status:ongoing
team:Ishtiaq Ahmad, Berend Hoekman, Peter Horvatovich, Rainer Bischoff and collaborators from the Gaining Momentum Initiative
type:This is a dedicated project.
e-infra
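An illustration of the "upload a file, run an integrated tool, check the job" interaction pattern described for DAF above. The base URL, endpoints and field names are hypothetical; DAF's actual web service interface may differ.

import requests

BASE = "https://daf.example.org/api"                    # placeholder base URL

# Upload an input file to the user's quota-limited disk space.
with open("peaklist_run1.mzXML", "rb") as fh:
    upload = requests.post(f"{BASE}/files", files={"file": fh}).json()

# Ask the service to run an integrated command-line tool on the uploaded file.
job = requests.post(f"{BASE}/jobs", json={
    "tool": "warp2d",                                   # time alignment tool from the results above
    "inputs": [upload["file_id"]],
}).json()

# Poll the job until the results appear back in the user's disk space.
status = requests.get(f"{BASE}/jobs/{job['job_id']}").json()
print(status["state"])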
SHIWA - SHaring Interoperable Workflows for large-scale scientific simulations on Available DCIs
description:The SHIWA VO was created for the SHIWA project (shiwa-workflow.eu). It will be used for testing the SHIWA Simulation Platform (SSP), which will enable scientists to share and run workflows on DCIs. The project develops solutions for interoperable workflows, including management of credentials across DCIs. The tests to be performed on the SHIWA VO resources are initially intended to show the viability of the adopted solutions on production infrastructures. This VO is already supported by the French NGI, but we need more sites to support it in order to enable testing under more realistic conditions.
applicant:Vladimir Korkhov, Amsterdam Medical Centre
results:will follow shortly
status:completed
team:Vladimir Korkhov, Silvia Olabarriaga
type:This is a dedicated project.
e-infra
web service fail-over
description:We want to test a web service fail-over system for a Danish web service by using a virtual machine that can perform the same task. This virtual machine will be parked at both a computer centre in Munich and one in the UK. At fixed times we will let the web service in Denmark go down, at which point our calling program will launch one of those two virtual machines on your cloud to handle the one-second call (a sketch of this fail-over logic is given below this entry).
applicant:Gert Vriend, Radboud University Nijmegen
results:will follow shortly
status:ongoing
team:Gert Vriend
type:This is a dedicated project.
e-infra
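A sketch of the fail-over logic described above: call the Danish web service first and, if it is unreachable, launch one of the parked virtual machines and call that instead. The URLs are placeholders and launch_backup_vm() stands in for whatever cloud API actually starts the parked image.

import requests

PRIMARY = "https://service.example.dk/call"             # placeholder for the Danish web service
BACKUP_SITES = ["munich", "uk"]

def launch_backup_vm(site: str) -> str:
    # Placeholder: ask the cloud at `site` to start the parked machine image
    # and return the URL of the equivalent web service on that machine.
    raise NotImplementedError

def call_with_failover() -> str:
    try:
        return requests.get(PRIMARY, timeout=2).text    # the normal one-second call
    except requests.RequestException:
        for site in BACKUP_SITES:
            try:
                return requests.get(launch_backup_vm(site), timeout=30).text
            except Exception:
                continue
        raise RuntimeError("primary service and all backups unavailable")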
Managing cloud computing for life sciences research via smart interfaces
description:In a large class of bioinformatics applications the required processing power fluctuates strongly, and it is neither feasible nor necessary to keep the maximum processing capability available locally all the time. The SARA HPC Cloud offers compute power on demand in the form of freely configurable virtual machines: one can configure the number of cores, the amount of memory, the secondary storage and the network of a machine, and freely install the desired software on it. These machines are stored as images, which can be deployed at a later time. We have implemented a system that controls the deployment of machine images in the cloud, bypassing the cloud user interface. Using this system it is easy to set up applications in which cloud resources are used transparently from outside. It consists of a lightweight server that can start and stop machine images, just as is possible through the web interface, and that keeps track of the running machines under its control. Clients can request the starting of machines or request information about running machines, and can be used in applications to access cloud resources with minimal user intervention (a sketch of this client/server pattern is given below this entry). We describe two use cases. The first is creating an R cluster in the cloud: a user starts an R cluster on the cloud from within a local R session and distributes the calculation work over the cloud using the normal R cluster commands. The second use case is the back end of the array designer web application: the user generates a microarray design based on input sequence data and a number of additional parameters. The resources required to generate the array design are not available on the web server, so in this use case the work is done in the cloud; for each array design a machine is instantiated on the cloud and stopped when the design is ready.
applicant:Wim de Leeuw, UvA
results:will follow soon
status:completed
team:Wim de Leeuw, Linda Bakker, Han Rauwerda, Timo Breit
type:This is a dedicated project.
e-infra
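A sketch of the client side of the pattern described above: a lightweight server starts and stops machine images and keeps track of running machines, while clients drive it with simple requests. The HTTP interface shown is hypothetical, not the implemented system's actual protocol.

import time
import requests

CONTROLLER = "http://cloud-controller.example.org:8080"   # placeholder address of the lightweight server

def start_machine(image: str) -> dict:
    return requests.post(f"{CONTROLLER}/machines", json={"image": image}).json()

def wait_until_running(machine_id: str, poll: float = 10.0) -> dict:
    while True:
        info = requests.get(f"{CONTROLLER}/machines/{machine_id}").json()
        if info["state"] == "running":
            return info
        time.sleep(poll)

def stop_machine(machine_id: str) -> None:
    requests.delete(f"{CONTROLLER}/machines/{machine_id}")

# Typical use: start a worker image, hand its address to the application
# (an R cluster node, or the array-designer back end), and stop it when done.
machine = start_machine("r-cluster-node")
print("worker at", wait_until_running(machine["id"])["address"])
stop_machine(machine["id"])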
Using large scale computing facilities for Biobanks, knowledge discovery across biobanks and Enrichment of Biobanks with whole genome information. *
description:The Netherlands has 150 biobanks holding over 400,000 samples in total. To exploit this material, worth billions, the community is now embarking on large-scale genetic profiling. An example is the highly visible Genome of the Netherlands project, which will completely sequence the DNA of 750 Dutch individuals to elucidate the genetic diversity of the Dutch population, and impute this new information onto the existing, sparser genetic information of 100,000 Dutch individuals. However, the data handling and computational needs are enormous, and Dutch institutes are struggling to use the available hardware infrastructures effectively.

This e-BioGrid subproject will overcome this barrier by interfacing the data processing tools used in biobanking with the existing BiG Grid infrastructure and by supporting the Dutch biobanking community in deploying these tools for their large data and processing challenges. Applications include (1) high-throughput genetics studies based on next-generation sequencing of biobank samples (>750), (2) genome-wide imputation and association studies (>100,000; a sketch of how such analyses are typically split into grid jobs is given below this entry) and (3) follow-up BBMRI-NL projects that are currently being drafted by the BBMRI-NL steering committee. We also aim to pilot knowledge discovery across biobanks by connecting this new information with existing information from over 150 existing Dutch biobanks.

Envisioned short-term results are high-impact scientific publications from these biobank studies. Long-term results are the availability of flexible and scalable e-Science tools for large-scale biobanking and an optimized grid infrastructure for biobanks within BiG Grid, SARA, NBIC and the life sciences community, bringing the Netherlands to the forefront of the next generation of genetics and population research.
applicant:Morris Swertz, University Medical Center Groningen
results:Resources. For the BBMRI-NL imputation pipeline a total of ~800 parallel jobs have been run, with a maximum of 6 GB per job. About 150 TB of data storage has been used. A complete copy of the GoNL raw sequence data was transferred to BiG Grid (SURFsara SRM) for grid usage and backup.

Knowledge. After several training sessions, the BBMRI team is now able to submit jobs to the grid.

Infrastructure developed. A BBMRI-NL best-practices framework for pipeline definition, called MOLGENIS/compute, including a pilot-job framework for running on BiG Grid, has been built within the project (2011-2012). The BBMRI-NL pipelines for imputation, alignment, SNP calling and QC are available using this framework. The software and documentation can be found here. Interested users from other universities (Maastricht, Wageningen, Utrecht and Nijmegen) have contacted us.

High performance computing. The imputation pipeline ran on the PREVEND, LifeLines and Utrecht biobanks; it was tested by independent bioinformaticians and is used in production. The NGS pipeline is used in daily production at UMCG. Imputation has been completed for 15,000 LifeLines biobank samples (UMCG), 4,000 PREVEND biobank samples (UMCG) and 1,500 Utrecht samples (UMCU), and the results are used in production by the celiac disease research group (UMCG). The next-generation sequencing pipeline has been used on ~750 whole-genome samples (Genome of the Netherlands) and ~1,500 exome samples (UMCG).

Publications. MOLGENIS/compute was accepted as a full paper and presented at the International Conference on Bioinformatics Models, Methods and Algorithms, Algarve, Portugal, February 3, 2012. (BIOINFORMATICS 2012 received 109 submissions, of which 14% were accepted as full papers.)

Access. All recent documentation, presentations and software are available and free to download here. All software has been made available as LGPL open source via GitHub.
status:completed
team:Jan Bot, Pieter Neerincx, Abhishek Narain, George Byelas, Tom Visser
type:This is a main project.
e-BBC
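A generic sketch of how a genome-wide imputation run is typically split into grid jobs, one per chromosome region; the chunk size, chromosome lengths and command line are illustrative placeholders, not the MOLGENIS/compute pipelines themselves.

# One job per fixed-size chromosome region keeps individual grid jobs small
# enough to fit the per-job memory limit mentioned in the results above.
CHROM_LENGTHS_MB = {"1": 249, "2": 243, "22": 51}       # truncated example
CHUNK_MB = 5

jobs = []
for chrom, length in CHROM_LENGTHS_MB.items():
    for start in range(0, length, CHUNK_MB):
        end = min(start + CHUNK_MB, length)
        jobs.append(f"impute --chr {chrom} --from {start}Mb --to {end}Mb "
                    f"--ref gonl_panel --out chr{chrom}_{start}_{end}")   # placeholder command

print(len(jobs), "jobs, for example:", jobs[0])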


* These are the main projects
