Publication : USDA ARS

Publication #192028

Title: PROTEIN PANORAMA: PROBABILITY AND PARSIMONY-BASED SOFTWARE FOR ASSESSING PROTEINS ASSEMBLED FROM PEPTIDES INFERRED FROM MS/MS DATA

Author

	FENG, JIAN - JOHNS HOPKINS UNIVERSITY
	NAIMAN, DANIEL - JOHNS HOPKINS UNIVERSITY
	Cooper, Bret

Submitted to: American Society for Mass Spectrometry
Publication Type: Abstract Only
Publication Acceptance Date: 3/1/2006
Publication Date: 3/1/2006
Citation: Feng, J., Naiman, D.Q., Cooper, B. 2006. Protein Panorama: probability and parsimony-based software for assessing proteins assembled from peptides inferred from MS/MS data [abstract]. American Society for Mass Spectrometry. p. 31.

Interpretive Summary:

Technical Abstract: Along with analyzing the quality of the peptide inferences from LC-MS/MS experiments, it is important to reassemble the peptides into the proteins from which they were derived. This can be done by using a protein database as a scaffold. Because sequence homology allows a peptide to be assigned to more than one protein, the probability that the correct database protein was used in the assembly increases when several peptides map to that protein. To improve the interpretation of assembled data, we have designed a software platform that applies a rigorous probability model and assembles, from the deduced peptide sequences, the parsimonious set of proteins that best explains the observed data. Protein Panorama is written in ANSI C under GNU/Linux, but can also be ported to Windows. It provides a web-based interface and is compatible with Mascot 2.1 .dat file output, and it can be extended to SEQUEST output or to other spectral interpretation programs that provide quality scores for the spectral inferences. Panorama factors in spectral inference scores, the number of peptides in the database with similar molecular weights, the charge state of the ions, the number of proteins sharing the same peptide sequence, and the length of the peptide in relation to the size of the search database, and calculates a probability for a protein or for a group of proteins that share the same peptides. Mascot scores describe the chance that the inference for the peptide is a random event, but do not accurately predict the chance that the peptide correctly identifies a protein.
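As a rough illustration of what assembling a parsimonious protein set can involve, the sketch below uses a greedy set-cover heuristic: it repeatedly picks the protein that explains the most still-unexplained peptides. This is a stand-in technique, not Protein Panorama's algorithm, which the abstract does not spell out and which also weighs probabilities. The function name and the peptide and protein labels are hypothetical.

```python
def parsimonious_proteins(peptide_to_proteins):
    """Greedy set cover: choose proteins until every peptide is explained.

    peptide_to_proteins maps each peptide to the proteins that contain it.
    Returns the chosen proteins in the order they were selected.
    """
    # Invert the mapping: protein -> set of peptides it could explain.
    protein_to_peptides = {}
    for peptide, proteins in peptide_to_proteins.items():
        for protein in proteins:
            protein_to_peptides.setdefault(protein, set()).add(peptide)

    unexplained = set(peptide_to_proteins)
    chosen = []
    while unexplained and protein_to_peptides:
        # Pick the protein covering the most unexplained peptides;
        # ties are broken by name so the result is deterministic.
        best = min(
            protein_to_peptides,
            key=lambda p: (-len(protein_to_peptides[p] & unexplained), p),
        )
        covered = protein_to_peptides[best] & unexplained
        if not covered:
            break  # the remaining peptides map to no protein at all
        chosen.append(best)
        unexplained -= covered
    return chosen


# Hypothetical example: P2 is redundant because P1 already explains PEP1.
example_hits = {
    "PEP1": ["P1", "P2"],
    "PEP2": ["P1"],
    "PEP3": ["P1", "P3"],
    "PEP4": ["P3"],
}
print(parsimonious_proteins(example_hits))  # ['P1', 'P3']
```

A greedy cover is only an approximation of the smallest explaining set, and it ignores the score and probability information that the abstract says Panorama takes into account.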
Predicting the probability that a protein is correctly identified from its peptides requires additional factors: the number of peptides assigned (the more the better), the number of proteins that share the same peptides (the fewer the better), and the length of the peptide in relation to the size of the database, which affects false-positive identifications (distraction effects). Protein Panorama is a Probability And NOn-Redundant Assembly Mathematical Algorithm that derives rigorous and accurate probability estimates for proteins by standardizing and filtering the scores from Mascot and applying several probability constraints. The software introduces new concepts of peptide class and protein class to help lower false-positive rates and eliminate the side effects of distraction. Analysis of Arabidopsis thaliana MS/MS data by Mascot against the A. thaliana protein database (TAIR V. 6.0), followed by selection of peptides for which there is less than a 5% chance of the inference being a false positive, resulted in a non-redundant data set of 346 proteins. A search against a reverse A. thaliana protein database produced 39 non-redundant proteins, indicating a false-positive rate of 11%. In contrast, Panorama assembled 265 non-redundant proteins with > 95% probability. The reverse database search produced 10 proteins at the 95% level, indicating a reasonable false-positive rate of 4%. Panorama also offers the unique feature of allowing the user to discard protein assemblies based on known experimental information, after which the probabilities can be recomputed. Novel data structures allow a 1 Gb Mascot .dat file to be processed in 40 seconds, and the algorithm works equally well for redundant and non-redundant databases.
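The false-positive rates quoted above are consistent with dividing the number of proteins found in the reverse-database search by the number found in the forward search. The short sketch below reproduces that arithmetic. The formula is an assumption inferred from the reported numbers, not a definition given in the abstract, and the function name is hypothetical.

```python
def decoy_false_positive_rate(decoy_hits, target_hits):
    """Estimate the false-positive rate as decoy hits / target hits.

    Assumption: this simple ratio is how the abstract's percentages
    were obtained; the abstract does not state the formula.
    """
    if target_hits <= 0:
        raise ValueError("target_hits must be positive")
    return decoy_hits / target_hits


# Protein counts reported in the abstract.
mascot_rate = decoy_false_positive_rate(39, 346)    # Mascot with 5% peptide cutoff
panorama_rate = decoy_false_positive_rate(10, 265)  # Panorama at > 95% probability

print(f"Mascot:   {mascot_rate:.1%}")    # 11.3%, reported as 11%
print(f"Panorama: {panorama_rate:.1%}")  # 3.8%, reported as 4%
```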