Mass spectrometry in the life sciences involves quantitative and qualitative analysis of proteins and metabolites in biological samples. It produces large amounts of data that require storage and computationally intensive analysis to assign biological identities to the peak patterns. Liquid chromatography-mass spectrometry (LC-MS) is a widely used and rapidly evolving technology in many fields of biomolecular analysis, for instance proteomics and metabolomics.
The e-BioGrid program is open for dedicated mass spectrometry projects. Contact us if you are involved in mass spectrometry research in the life sciences and need support with software or hardware infrastructure.
Metabolomics is a rapidly growing discipline that relies on bioinformatics for data processing. Metabolomics studies generate large amounts of data, and extracting biological information from them can be seen as an integrated workflow. This workflow benefits greatly from coordinated, automated handling and processing of the data. The need for tools and applications that support data handling and biological interpretation is huge, but online availability of metabolomics data and tools is poor, which hampers progress and standardization in the field.
The Netherlands Metabolomics Centre (NMC), in collaboration with the Netherlands Bioinformatics Centre (NBIC), has a dedicated project that supports the development of an infrastructure for sharing metabolomics data and tools: the NMC Data Support Platform. This project addresses two major bottlenecks in metabolomics research. The first is the sharing of metabolomics studies and data. The second is the accessibility of dedicated processing and biostatistics tools. Addressing them poses several bioinformatics and e-science challenges: some tools require high-performance computing, and the tools also need to be integrated into a data processing tool chain. This project proposes a collaboration between e-BioGrid and the NBIC/NMC Data Support Platform taskforce of programmers to tackle these e-science challenges. The output of the project will be an online computing environment in which sets of preprocessing, biostatistics and quality control tools are accessible to all NMC biologists and biostatisticians, and to any interested users from the international community.
Theo Reijmers, LACDR, Leiden University
Developed infrastructure. A logistic regression tool for biomarker selection in mass spectrometry data was implemented, together with double cross-validation and permutation functionality. This was done in collaboration with AMC. The tool was designed so that other classification and feature selection methods can easily be added. Future plans are to extend it with partial least squares discriminant analysis (PLS-DA) and other classification tools. At the moment the tool runs as a web-based service on the ebioinfra gateway. Hundreds of parallel instances of logistic regression can be submitted at once, one for each permutation of the input data. The cross-validation tool has been parallelised, as have the compute-intensive steps of Structure Generator, a tool used in conjunction with Metitree, a mass spectra repository for life science.
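The way double cross-validation and permutation runs break down into independent jobs can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code of the tool described above: the function names, the seed convention and the job dictionary layout are all assumptions made for the example, and the classifier itself is left out.

```python
import random


def kfold_indices(n_samples, n_folds, seed):
    """Shuffle sample indices and deal them into n_folds disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def double_cv_splits(n_samples, outer_folds, inner_folds, seed=0):
    """Yield (outer_test, inner_splits) for double (nested) cross-validation.

    outer_test is held out for the final performance estimate. inner_splits
    is a list of (train, validation) pairs drawn only from the remaining
    samples, to be used for model or feature selection.
    """
    outer = kfold_indices(n_samples, outer_folds, seed)
    for i, outer_test in enumerate(outer):
        outer_train = [s for j, fold in enumerate(outer) if j != i for s in fold]
        inner = kfold_indices(len(outer_train), inner_folds, seed + i + 1)
        inner_splits = []
        for k, fold in enumerate(inner):
            validation = [outer_train[p] for p in fold]
            train = [outer_train[p]
                     for m, other in enumerate(inner) if m != k
                     for p in other]
            inner_splits.append((train, validation))
        yield outer_test, inner_splits


def permuted_labels(labels, seed):
    """Return a shuffled copy of the class labels for one permutation job."""
    shuffled = list(labels)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def permutation_jobs(labels, n_permutations):
    """Describe independent jobs; seed 0 is reserved for the unpermuted data."""
    jobs = [{"seed": 0, "labels": list(labels)}]
    for seed in range(1, n_permutations + 1):
        jobs.append({"seed": seed, "labels": permuted_labels(labels, seed)})
    return jobs
```

Because each job is fully described by its seed, the jobs share no state and can be submitted to separate worker nodes. Comparing the score on the unpermuted labels with the scores from the permuted jobs then indicates whether the classifier performs better than chance.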
Performance improvement. To speed up metabolite identification, the in-house developed Open Molecule Generator (OMG) was examined to determine whether its calculations could run in parallel. In addition to a parallel version, OMG was extended with new functionality that further narrows down the number of candidate structures that can be linked to metabolomics features of unknown identity. A further improvement came from implementing a faster algorithm, which delivers up to a 10-fold speed-up even in sequential execution.
Knowledge. Within the same group, progress was made on a new generic data integration method for automating the preprocessing of high-resolution metabolomics mass spectrometry data. The Open Molecule Generator (OMG) is part of a metabolite identification pipeline, and the extended version of OMG brings such a pipeline within reach. The plan for the near future is to implement an identification pipeline that combines this improved structure generator with several other open source, in-house developed computational identification tools.
Publications. In preparation: Journal of Metabolomics, Proceedings of ACSD 2013.
Access. Documentation on the developed tool chain can be found here. The tool chain can be run on the ebioinfra gateway. Note that a login is required.
Theo Reijmers, Margriet Hendriks, Kees van Bochove, M. van Vliet, G. Zwanenburg, J. Bouwman, J. Wesbeek, S. Sikkema, T. Abma, Mahdi Jaghouri
In this project, modular workflows are developed for robust, automated and efficient analysis of LC-MS data. The goal is a suite of efficient programs that eliminates existing bottlenecks in the high-throughput analysis of LC-MS data, i.e. robust parallel software for chromatographic alignment, retention time prediction, calibration of MS data and extraction of quantitative information from LC-MS datasets. A typical analysis workflow contains at least one component that matches tandem mass spectra to predicted peptide fragmentation patterns. Examples are X!Tandem and Crux, but we are also developing our own software. We have already integrated the serial version of X!Tandem in a Taverna workflow with PeptideProphet and some of our own tools for alignment and calibration, such as pepAlign and msRecal. All these tools and algorithms are open source and have been described in recent literature, although only X!Tandem has been parallelized previously.
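To illustrate what matching tandem mass spectra to predicted peptide fragmentation patterns involves, the sketch below predicts singly charged b- and y-ion m/z values for an unmodified peptide and counts how many fall within a tolerance of the observed peaks. It is a deliberately simplified example and does not reproduce the scoring used by X!Tandem, Crux or any other tool mentioned above. The function names and the 0.02 m/z tolerance are assumptions; the residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses in daltons (standard values, 5 decimals).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # mass of H2O, added to C-terminal (y) fragments
PROTON = 1.007276   # mass of the charge-carrying proton


def fragment_mz(peptide):
    """Return singly charged b- and y-ion m/z lists for an unmodified peptide.

    b_ions runs from b1 upward; y_ions runs from the longest y ion down to y1,
    so the two lists are paired by cleavage site.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for cut in range(1, len(masses)):
        b_ions.append(sum(masses[:cut]) + PROTON)
        y_ions.append(sum(masses[cut:]) + WATER + PROTON)
    return b_ions, y_ions


def count_matches(peptide, observed_mz, tolerance=0.02):
    """Count predicted fragments that have an observed peak within tolerance."""
    b_ions, y_ions = fragment_mz(peptide)
    return sum(
        1
        for predicted in b_ions + y_ions
        if any(abs(predicted - peak) <= tolerance for peak in observed_mz)
    )
```

A search engine would evaluate many candidate peptides per spectrum in this way and turn the matches into a statistical score. Each spectrum can be scored independently of the others, which is what makes this step a natural candidate for parallel execution.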
Magnus Palmblad, Leiden University Medical Center
Different workflows and workflow components for proteomics data analysis have been implemented to run on the cloud. A paper describing the use of scientific workflow management systems in proteomics has also been published (de Bruin, Deelder, and Palmblad, Scientific Workflow Management in Proteomics, Mol. Cell. Proteomics. 2012). Two other manuscripts are in preparation.
Magnus Palmblad, Yassene Mohammed, Andre M. Deelder
The FOM institute AMOLF is currently part of the COMMIT project for e-biobanking of large mass spectrometric datasets. One aim of this project is the collection, storage and analysis of large mass spectrometric imaging datasets. Fourier transform ion cyclotron resonance mass spectrometers (FT-ICR MS), the highest-performance instruments available, offer unrivalled chemical specificity. This performance comes at the cost of large individual data files, 4 MB-16 MB for each mass spectrum. A full MS imaging scan of a biological tissue usually requires ~4,000 individual spectra, yielding complete datasets with a size of 15 GB-100 GB. The data are then processed, which entails zero-filling (doubling the size of each individual data file), application of an apodization function (CPU and time intensive) and a fast Fourier transform. BigImage will use hardware resources to roll out workflow-based data analysis software onto the BiG Grid. The requested core hours will support testing on BiG Grid, as well as analysis of FT-ICR MS imaging datasets of breast cancer tissues. After basic processing on BiG Grid, the core hours will be used to extend the data analysis capabilities of the workflow-based software with multivariate statistical analysis tools.
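The three processing steps named above (apodization, zero-filling and fast Fourier transform) can be sketched for a single transient as follows. This is a minimal pure-Python illustration, not the software used in the project. The Hann window, the power-of-two length restriction and the function names are assumptions, and the sketch applies the window before padding, which is an assumption about the order of the steps.

```python
import cmath
import math


def hann_window(n):
    """Symmetric Hann apodization window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def fft(values):
    """Recursive radix-2 Cooley-Tukey FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return [complex(values[0])]
    even = fft(values[0::2])
    odd = fft(values[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def magnitude_spectrum(transient):
    """Apodize, zero-fill once (2x length) and transform one transient.

    Returns the magnitudes of the first len(transient) frequency bins,
    which cover the non-negative frequencies below the Nyquist limit.
    """
    n = len(transient)
    if n < 2 or n & (n - 1):
        raise ValueError("transient length must be a power of two (at least 2)")
    window = hann_window(n)
    apodized = [sample * w for sample, w in zip(transient, window)]
    zero_filled = apodized + [0.0] * n  # doubles the data size
    spectrum = fft(zero_filled)
    return [abs(value) for value in spectrum[:n]]
```

Every spectrum in an imaging dataset goes through the same steps independently of the others, so the ~4,000 transients of a scan can be distributed over grid nodes without any communication between jobs.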
Donald Smith, FOM-Instituut voor Atoom-Molecuulfysica
The use of BiG Grid for the analysis of large FT-ICR MS imaging datasets will dramatically reduce analysis time. In addition, advanced algorithms that improve mass spectral performance will be applied to FT-ICR MS imaging datasets for the first time. Together, these results will provide a unique capability for rapid analysis of high-resolution FT-ICR MS imaging datasets. New statistical analysis modules in Chameleon should result in unique classifiers for diseased tissues, based on integrated multi-modal data processing on BiG Grid.
Donald Smith, Ron Heeren, Carl Schultz, Nadine Mascini