To cleave or not to cleave in CXMS? — A comprehensive comparison of DSS and DSSO workflows
Abstract:Chemical cross-linking of proteins coupled with mass spectrometry (CXMS) has enjoyed growing popularity in biomedical research. DSS and DSSO are the most widely used non-cleavable and cleavable cross-linkers in CXMS, respectively. They use the same reactive groups, but appear to differ in performance. Reportedly, DSS outperforms DSSO on less complex samples such as purified protein complexes, whereas on high-complexity samples, such as whole cell lysate, DSSO has yielded the highest number of identified cross-linked peptide pairs. However, no side-by-side comparison had been made to establish whether the much more expensive DSSO CXMS workflow is better than the DSS CXMS workflow, or vice versa. To compare DSS and DSSO properly, I tested DSS in conjunction with pLink 2, the best search engine for non-cleavable cross-linking, and DSSO in conjunction with pLink-DSSO, the best search engine for cleavable cross-linking, on low-, medium- and high-complexity samples. The results show that DSS performs better than DSSO in all samples, but the margin decreases as sample complexity increases. I further analyzed the possible reasons behind their performance differences from three aspects: physical and chemical properties, search process and spectrum quality. The results suggest that the relatively longer spacer arm of DSS gives it a large performance advantage on low-complexity samples. Surprisingly, the smaller search space of DSSO cross-links, obtained by using the masses of the α and β peptides to filter out unlikely candidate sequences, has almost no effect on identification sensitivity. However, DSSO-linked peptides fragment better than DSS-linked peptides in MS/MS, leading to a smaller sensitivity loss for DSSO-linked peptides when the search space increases, especially for high-complexity samples.
Keywords:CXMS, pLink, DSS, DSSO, cleavable cross-linking
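The search-space reduction mentioned in the abstract above, filtering candidate sequences by the masses of the α and β peptides, can be illustrated with a minimal sketch. This is not the pLink-DSSO implementation; the function names, the peptide names, the masses and the 10 ppm tolerance are all hypothetical values chosen for illustration.

```python
def filter_by_mass(peptides, target_mass, tol_ppm=10.0):
    """Return names of peptides whose mass lies within tol_ppm of target_mass."""
    tol = target_mass * tol_ppm * 1e-6  # absolute tolerance in Da
    return [name for name, mass in peptides if abs(mass - target_mass) <= tol]


def candidate_pairs(peptides, alpha_mass, beta_mass, tol_ppm=10.0):
    """Pair only the peptides that match the two measured single-peptide masses.

    Without the alpha/beta masses, every peptide pair would have to be
    considered; with them, each side is filtered independently first.
    """
    alphas = filter_by_mass(peptides, alpha_mass, tol_ppm)
    betas = filter_by_mass(peptides, beta_mass, tol_ppm)
    return [(a, b) for a in alphas for b in betas]


# Hypothetical peptide database: (name, monoisotopic mass in Da).
peptides = [
    ("PEP_A", 1000.500),
    ("PEP_B", 1200.600),
    ("PEP_C", 1000.505),
    ("PEP_D", 1500.700),
]

pairs = candidate_pairs(peptides, alpha_mass=1000.500, beta_mass=1200.600)
all_pairs = len(peptides) ** 2  # unfiltered ordered pairs, for comparison
print(pairs, f"({len(pairs)} of {all_pairs} possible pairs kept)")
```

In this toy example only two of the sixteen possible ordered pairs survive the mass filter, which is the kind of reduction the abstract reports as having almost no effect on sensitivity.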
AI-based large-scale proteomics data analysis
Abstract:In recent years, the rapid development of AI techniques such as large language models has revolutionized many fields, including proteomics research. Here, we present two cases showing how AI helps us perform accurate large-scale proteomics data analysis. The first is how to perform retention time (RT) alignment across multiple runs in large cohort studies. We developed a deep learning-based alignment tool, DeepRTAlign, that is independent of identification (ID-free). Benchmarked on datasets with known fold changes, DeepRTAlign improved the identification sensitivity of MS data without compromising quantitative accuracy. Furthermore, using the MS features aligned by DeepRTAlign in a large cohort, we trained a classifier of 15 features to predict the early recurrence of hepatocellular carcinoma. The features were validated on an independent cohort using targeted proteomics with an AUC of 0.833. Being flexible and robust with four different feature extraction tools, DeepRTAlign provides an advanced solution to RT alignment for large cohort LC-MS data analysis, which is currently one of the bottlenecks in proteomics and metabolomics research, especially for clinical applications. The second is how to raise the performance of de novo peptide sequencing to a level similar to that of database search. We used a novel concept of complementary spectra to enhance ion information and proposed a de novo sequencing model, PandaNovo, based on the Transformer architecture. PandaNovo outperforms other state-of-the-art models and enhances the taxonomic resolution of the gut metaproteome, taking a significant step forward in de novo sequencing. Taken together, we believe that more advanced AI techniques will find wider applications in large-scale proteomics data analysis.
Keywords:AI, large-scale, proteomics data analysis, RT alignment, de novo peptide sequencing.
Development of effective cross-linkers and their application in XL-MS
Abstract:Chemical cross-linking mass spectrometry (XL-MS) is a powerful technology for providing protein structural information and studying protein–protein interactions. Currently, the most important task for XL-MS is to simplify the MS spectra interpretation workflow and consequently identify cross-linked peptides with high accuracy and much less computation time. It is crucial for a cross-linker to carry an affinity tag to make low-abundance cross-linked peptides detectable. An MS-cleavable cross-linking strategy is also an effective way to handle the problem. Recently, we reported that a phosphate group can serve as an affinity tag in a cross-linker for XL-MS. A retro-Michael addition-driven MS-cleavable strategy has been developed to simplify the identification of cross-linked peptides. In addition, a decision tree searching strategy (DTSS) and a retention time-based prediction method (RTBP) have been proposed, which can reduce complexity by orders of magnitude and substantially increase the identification of inter-protein cross-links. Here, I will describe recent advances in the development of cross-linkers and their application in XL-MS.
Keywords:Chemical cross-linking mass spectrometry, cross-linker, MS-cleavable, protein-protein interaction
Spatiotemporal and global profiling of protein-biomolecule interactions via PANAC photoclick chemistry
Abstract:Protein-biomolecule interactions play crucial roles in many biological processes. Precise dissection of protein-biomolecule interactions is essential for the molecular basis of cellular function and phenotype. Recently, we have developed light-induced primary amines and o-nitrobenzyl alcohols cyclization (PANAC) as photoclick chemistry via primary amines as direct click handle, for spatiotemporal and global profiling of protein-biomolecule interactions. With intrinsic advantages of temporal control, reliable chemoselectivity, good biocompatibility, excellent efficiency and operational simplicity, the PANAC photoclick is robust for lysine-specific modifications of native proteins, temporal profiling of endogenous kinases and organelle-targeted labeling in living systems, spatiotemporal and global profiling of DNA-protein interactions enabling discovery of low-affinity transcription factors, as well as direct capturing global substrates of post-translational modification enzymes via lysine crosslinking in living cells. The PANAC photoclick chemistry provides a versatile platform for bioconjugation, protein interactive proteomics, chemical biology and medicinal chemistry.
Keywords:PANAC photoclick chemistry; Crosslinking mass spectrometry; Protein labeling; Protein-DNA interactions; Enzyme-substrate interactions
Francis J. O'Reilly
National Cancer Institute
In-cell crosslinking mass spectrometry combined with AlphaFold-Multimer to discover the structures of novel protein complexes
Abstract:We designed a project to identify novel protein complexes in the model organism Bacillus subtilis and to predict the structures of all of the discovered interactions. Many protein-protein interactions (PPIs) remain to be discovered because they are difficult to maintain when disrupting cells for AP-MS or co-fractionation MS (CoFrac-MS) studies. Fixing the interactions with a crosslinker prior to cell lysis maintains these interactions so that they can be identified by crosslinking MS or CoFrac-MS. Accurately modeling the structures of proteins and their complexes using artificial intelligence could revolutionize PPI screens by adding molecular detail to them. Our experimental data enable a candidate-based approach to systematically model novel protein assemblies. Crosslinking MS data independently validate the AlphaFold-Multimer predictions and scoring. We report and validate novel interactors of central cellular machineries that include the ribosome, RNA polymerase, and pyruvate dehydrogenase, assigning function to several uncharacterized proteins.
Specific pupylation as IDEntity reporter (SPIDER) to decipher Protein-Biomolecule Interactions
Abstract:Protein-biomolecule interactions play crucial roles in nearly all biological processes. Identifying the interacting protein(s) for a biomolecule of interest is vital. Although numerous assays exist, there is always a demand for highly robust and reliable methods. We developed the Specific Pupylation as IDEntity Reporter (SPIDER) method for identifying protein-biomolecule interactions by combining substrate-based proximity labeling activity from the pupylation pathway of Mycobacterium tuberculosis and the streptavidin (SA)-biotin system. With SPIDER, we verified the interactions between known binding proteins of protein, DNA, RNA, and small molecules. We successfully utilized SPIDER to construct the global protein interactome for m6A and mRNA, identifying various uncharacterized m6A binding proteins, and confirming SRSF7 as a potential m6A reader. We also determined the binding proteins for lenalidomide and CobB on a global scale. Furthermore, we pinpointed SARS-CoV-2-specific receptors on the cell membrane. In conclusion, SPIDER is a substrate-based proximity labeling system. Due to its enzymatic catalytic nature, which converts noncovalent interactions to covalent ones, it allows for the efficient and specific identification and validation of biomolecule-protein interactions, especially weak, transient, and membrane-localized interactions, as long as the biomolecule can be biotinylated.
Abstract:The Chinese pharmaceutical industry is moving from generics and me-too drugs to the initial stage of pursuing innovative drugs such as best-in-class and first-in-class therapies. However, in the context of international decoupling, if Chinese companies continue to follow the R&D path of mature Western pharmaceutical companies completely, then, given the capacity of domestic capital to invest in innovative drugs and its tolerance for failure and error, China's development of innovative drugs will be very slow. Only by using new technologies to assist in the development of innovative drugs can this situation be improved. Proteomics technology is an important driving tool with great potential for the development of innovative drugs. However, domestic pharmaceutical companies are still at a very early stage in recognizing the need to use proteomics technology. This report will introduce the current level of acceptance of proteomics among Chinese innovative pharmaceutical companies, their current demands, and the most widely accepted entry points for proteomics in the innovative pharmaceutical industry.
Network Analysis in Cancer Proteomics
Abstract:Networks in biology represent relationships or interactions between molecules, such as proteins, peptides, mutations and chemical compounds. In our early studies, network-based statistical models proved more powerful for identifying differentially expressed pathways and driver proteins in both transcriptome and proteome data analysis. The rise of cancer proteome data has created challenges and opportunities for analyzing networks of different types. To infer regulatory phosphorylation in the cancer proteome, we propose a causal inference model based on Mendelian randomization to infer causal effects between protein phosphorylation, protein expression and DNA mutation. We evaluated the model on simulated data with different causal effect sizes, sample sizes and degrees of data heterogeneity, as well as in real data applications. Our new method exhibited more robust estimates and lower FDR than Pearson and Spearman correlations, with better performance than existing instrument selection methods for Mendelian randomization. On the basis of large-scale cancer proteome data, we also built network-assisted statistical and deep-learning models to predict neoantigens, drug repurposing and drug synergy in cancer. Network analysis can enhance biological insights and interpretation of the cancer proteome at the systems level.
Keywords:cancer proteome, PTM, causal inference, network analysis, statistical model
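As background for the Mendelian randomization approach mentioned above, the textbook single-instrument Wald ratio estimator can be sketched as follows. This is not the model proposed in the abstract. Treating a mutation as the instrument, phosphorylation as the exposure and protein expression as the outcome is an assumption made here for illustration only, and all data values are invented.

```python
def slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def wald_ratio(instrument, exposure, outcome):
    """Causal effect of exposure on outcome: (Z -> Y slope) / (Z -> X slope)."""
    return slope(instrument, outcome) / slope(instrument, exposure)


# Invented example: mutation status (0/1), phosphosite intensity, protein level.
mutation = [0, 0, 1, 1, 0, 1]
phospho = [1.0, 1.2, 3.0, 3.2, 0.8, 2.8]
expression = [0.5 * v for v in phospho]  # outcome depends only on the exposure

effect = wald_ratio(mutation, phospho, expression)
print(f"estimated causal effect: {effect:.3f}")
```

Because the invented outcome is exactly half the exposure, the estimator recovers an effect of 0.5. Real data would require instrument selection and uncertainty estimates, which this sketch omits.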
Computational design and co-translational incorporation of protein PTMs for biological discovery
Abstract:A protein can be post-translationally modified by the covalent addition of various chemical functionalities, through processes such as acetylation, methylation, phosphorylation, glycosylation, lipidation and ubiquitination. These chemical modifications of proteins remarkably increase the functional diversity of the proteome and regulate nearly every aspect of cellular processes. Therefore, the identification of new protein post-translational modifications (PTMs) is very important both chemically and biologically. However, the identification process is very challenging, mainly due to the lack of chemical information on these unknown modifications and of methods for their enrichment. In addition, it is still challenging and laborious to address the biological function of protein PTMs by biochemical and genetic approaches. My lab applies a genetic code expansion strategy to co-translationally incorporate authentic PTMs or analogs into proteins of interest at desired positions, allowing the uncovering of new functions of native modifications, the discovery of novel protein PTMs, and the engineering of new functions with non-native modifications.
Keywords:Genetic code expansion; Protein PTMs; Protein engineering; Computational design; Proteomics
A comprehensive investigation of DDA and DIA workflow for deep coverage of proteome
Abstract:We conducted a comprehensive investigation of DDA and DIA workflows for deep coverage of the proteome based on shotgun liquid chromatography-tandem mass spectrometry (LC-MS/MS). The same HeLa samples were distributed to about 20 Chinese laboratories, which submitted data sets acquired on different liquid chromatography-mass spectrometry (LC-MS) platforms. We evaluated the correlation between MS2 scanning capacity utilization and scan speed, and interpreted how the MS parameters affected the performance of DDA and DIA in order to provide a suite of reasonable settings. Our centralized analysis shows that improving chromatography and maximizing utilization of the MS/MS capacity could significantly improve sampling depth and identifications in proteomics.
Leibniz-Forschungsinstitut für Molekulare Pharmakologie
Developing structural interactomics and its application in cell biology
Abstract:Profiling the human interactome is critical to understanding the molecular basis of nearly all processes of life. Over the years, we’ve advanced crosslinking mass spectrometry by developing experimental methods and software tools to identify tens of thousands of PPIs from whole cells. These data reveal numerous aspects of living systems - for example protein subcellular localizations, virus-host interactions, and architectures of suprabiomolecular machineries. Furthermore, these data offer unprecedented opportunities to profile interactome changes between tissues and disease states, providing invaluable training data for AI-based methods to identify PPI-mediating motifs, inform new protein/antibody designs and screen for drug targets.
Keywords:cross-linking mass spectrometry, structural interactomics, cell biology, structural biology
Proteome Quantitation Methods Based on Chemical Labeling and Fragment Ions
Abstract:Proteome quantitation is a crucial requirement for gaining insight into dynamic life processes involving multiple protein properties. In combination with 2D-LC-MS, stable isotope labeling-based methods enable the in-depth quantification of a large number of proteins. However, the widely used quantitation methods based on reporter ions or precursor ions still suffer from trade-offs in terms of accuracy, multiplexing capacity, coverage and analysis time. Herein, a series of fragment ion-based methods were developed, exploiting the peptide specificity and high signal-to-noise (S/N) ratios of fragment ions. We proposed using a1 ions as new quantification ions to overcome ratio distortion, and developed a corresponding MS acquisition method for in-depth analysis. A set of 8-plex labeling reagents based on pseudo-isobaric dimethyl labeling was further developed to extend the multiplexing capacity. With these methods in hand, several applications suggest that our strategy might become a promising tool for enhancing our understanding of cellular and physiological processes from a system-wide view.
Keywords:proteome quantitation, fragment ion, chemical labeling, a1 ion
Research Institute of Molecular Pathology (IMP)
A mimicked ribosomal protein complex to benchmark XL-MS workflows and the investigation of molecular networks of the dormant ribosome
Abstract:Cross-linking MS (XL-MS) has matured into a frequently used technique for the investigation of protein structures and for interactome studies. The growing community has generated a broad spectrum of applications, linker types, and acquisition methods. Among numerous highly specialized workflows, it is challenging to identify and validate the best strategy for a new project. We therefore present a large synthetic peptide library that contains peptides from 38 proteins of the E. coli ribosomal complex. The library includes 1018 unique and known crosslinks to experimentally validate false discovery rates and benchmark XL-MS workflows.
Our library comes with a tool, IMP-X-FDR, that calculates the experimentally validated false-discovery-rate (FDR), compares results across search engine platforms and analyses cross-link properties in an automated manner. We apply the library to 6 commonly used linker reagents and analyze the data with 6 established search engines. We thereby show that the correct algorithm and search setting choice is highly important to improve identification rate and reliability. We reach identification rates of up to ~70 % of the theoretical maximum while maintaining a low validated FDR. We show which additional empirical score-cutoff values should be used to maintain the validated FDR < 1% for the used tools.
We applied the XL-MS workflow best suited based on results of the synthetic library to find factors keeping the ribosome dormant in the egg. The majority of polypeptide chains of Habp4, Dap1b/Dapl1 and Dap proteins are not visible in cryo-EM structures but could be confirmed by XL-MS using DSSO. More than 1000 cross-linked peptides were obtained for each ribosomal sample and their superior low FDR was confirmed by mapping obtained cross-links to the respective cryo-EM structure with 95% (zebrafish) and 90% (Xenopus) having Cα–Cα distances below the expected maximum of 23 Å. Our analyses show that Dap1b/Dapl1 and Dap N-terminal regions are proximal to the polypeptide exit site, which is consistent with insertion of their C termini into the polypeptide exit tunnel. We found crosslinks between a highly conserved N-terminal region of Dapl1/Dap and Rpl31 and between the central region of Dapl1/Dap and Rpl35, indicating they are in close proximity to the ribosome.
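The structural validation step described above, checking cross-links against a distance cutoff on a known structure, can be sketched as below. This is not the IMP-X-FDR tool; the residue labels and coordinates are hypothetical, and only the 23 Å cutoff is taken from the abstract.

```python
import math

MAX_CA_DISTANCE = 23.0  # Å, the expected maximum Cα–Cα distance quoted above


def fraction_within_cutoff(crosslinks, ca_coords, max_dist=MAX_CA_DISTANCE):
    """Fraction of cross-links whose Cα–Cα distance is at or below max_dist.

    Cross-links with a residue missing from the structure are skipped,
    since their distance cannot be measured.
    """
    measurable = [(a, b) for a, b in crosslinks if a in ca_coords and b in ca_coords]
    if not measurable:
        return 0.0
    satisfied = sum(
        1 for a, b in measurable
        if math.dist(ca_coords[a], ca_coords[b]) <= max_dist
    )
    return satisfied / len(measurable)


# Hypothetical Cα coordinates (Å) and cross-linked residue pairs.
ca_coords = {
    "K10": (0.0, 0.0, 0.0),
    "K25": (10.0, 0.0, 0.0),
    "K80": (30.0, 0.0, 0.0),
}
crosslinks = [("K10", "K25"), ("K10", "K80"), ("K25", "K80"), ("K10", "K999")]

fraction = fraction_within_cutoff(crosslinks, ca_coords)
print(f"{fraction:.0%} of measurable cross-links satisfy the cutoff")
```

In this toy example the distances are 10, 30 and 20 Å, so two of the three measurable cross-links fall within the cutoff, and the pair involving the unresolved residue is ignored.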
Generic diagramming software for biology and medicine
Abstract:The visualization of biological sequences with various functional elements is fundamental for the publication of scientific achievements in the field of molecular and cellular biology. However, due to the limitations of currently used applications, there are still considerable challenges in the preparation of biological schematic diagrams. Here, we present a professional tool called IBS 2.0 (https://ibs.renlab.org) for illustrating the organization of both protein and nucleotide sequences. With the abundant graphical elements provided in IBS 2.0, biological sequences can be easily represented in a concise and clear way. Moreover, we implemented a database visualization module in IBS 2.0, enabling batch visualization of biological sequences from the UniProt and NCBI RefSeq databases. Furthermore, to increase design efficiency, a resource platform that allows uploading, retrieval, and browsing of existing biological sequence diagrams has been integrated into IBS 2.0. In addition, we are developing generic web-based software for drawing all kinds of biology and medicine diagrams.
Keywords:biological sequences, visualization, diagramming software
The SysteMHC Atlas 2.0
Abstract:Comprehensive characterization of major histocompatibility complex (MHC)-bound peptides promises a better understanding of the basic mechanism of our immune system. Mass spectrometry (MS) has emerged as the method of choice to identify MHC-bound peptides. Post-translational modifications (PTMs), such as phosphorylation, cysteinylation or glycosylation, may occur on presented peptides and have been suggested to be a more routine addition to immunopeptidomics analysis for their broad biological and clinical relevance. Here, we describe the SysteMHC Atlas 2.0, an extensive collection of publicly available immunopeptidomics datasets analyzed by an optimized computational pipeline. By analyzing >200 million MS/MS spectra collected from 76 published datasets with >7,000 MS raw files, this release of the atlas covers 194 HLA-I and 142 HLA-II allotypes presenting over 1 million HLA-I and HLA-II peptides, which greatly expands the previous SysteMHC Atlas 1.0 by 8.2 times on average. The atlas also provides all the MS raw files associated with their search results, a catalog of context-specific datasets of class I and class II peptides, and various spectral libraries. These spectral libraries include 272 allele-specific, 52 PTM-specific and 1,775 sample-specific ones, which aim to facilitate further DIA-MS analysis and provide the data required for the next generation of AI-based informatic tools. We anticipate that the SysteMHC atlas 2.0 will serve as an important resource for the immunopeptidomics community, which provides insights into immune-associated questions in the context of cancer immunotherapy.
Keywords:Computational proteomics, Immunopeptidomics, Spectral library
“Mass+Structure+Knowledge”: A Journey to In-depth Interpretation of Tandem Mass Spectra Derived from RNA Oligonucleotides
Abstract:Mass spectrometry (MS) has evolved into one of the indispensable tools for elucidating biomolecule structures, with widespread applications in biomedical research. In particular, the last decade has witnessed increasing efforts to extend tandem mass spectrometry (MS/MS) to the characterization of DNA and RNA oligonucleotides, which includes sequencing RNAs and characterizing their post-transcriptional modifications. However, the MS fragmentation behaviors of RNA oligos are so far insufficiently understood. In this talk, I will report our work characterizing the negative-ion-mode fragmentation behaviors of 30 synthetic RNA oligos containing four to eight nucleotides using multiple fragmentation methods, including CID, HCD, UVPD, and EThcD, on a high-resolution, accurate-mass instrument. We found that MS/MS spectra derived from RNA oligos were much more complicated than those from peptides or proteins. There are more gas-phase dissociation pathways available for RNAs than for peptides, and hence more fragment ions with more dispersed intensities. Moreover, the MS/MS spectra of RNA oligos are greatly affected by their precursor charge states. Among nine types of sequencing ions (a, a-B, b, c, d, w, x, y, z), we found for the first time that the intensity of w ions in CID/HCD spectra is highly correlated with the 5'-side nucleotide at the cleavage site and with the precursor charge state. Additionally, our analysis revealed that high-charge RNA oligos containing a 3'-U tended to produce precursors with NCO- losses in CID/HCD spectra, which presumably corresponded to cyanate anions. All these findings provide valuable insights into the mechanisms behind RNA fragmentation in MS/MS, thereby facilitating more efficient automated identification of RNA oligos from their MS/MS spectra in the future.
Blood Ecosystem approach to dissect systematic diseases
Abstract:Although host responses to the ancestral SARS-CoV-2 strain are well described, those to the new Omicron variants are less resolved. We profiled the clinical phenomes, transcriptomes, proteomes, metabolomes, and immune repertoires of >1,000 blood cell or plasma specimens from SARS-CoV-2 Omicron patients. Using in-depth integrated multi-omics, we dissected the host response dynamics during multiple disease phases to reveal the molecular and cellular landscapes in the blood. Specifically, we detected enhanced interferon-mediated antiviral signatures of platelets in Omicron-infected patients, and platelets preferentially formed widespread aggregates with leukocytes to modulate immune cell functions. In addition, patients who were re-tested positive for viral RNA showed marked reductions in B cell receptor clones, antibody generation, and neutralizing capacity against Omicron. Finally, we developed a machine learning model that accurately predicted the probability of re-positivity in Omicron patients. Our study may inspire a paradigm shift in studying systemic diseases and emerging public health concerns.
Keywords:Blood ecosystem, proteomics, metabolomics, multi-omics, Single-cell multi-omics, SARS-CoV-2 Omicron
A universal method for in-depth measurement of plant phosphoproteome with high quantitative reproducibility
Abstract:Protein phosphorylation regulates a variety of important cellular and physiological processes in plants. In-depth profiling of plant phosphoproteomes has been more technically challenging than that of animal phosphoproteomes. This is largely due to the need to improve protein extraction efficiency from plant cells, which have dense cell walls, and to minimize sample loss resulting from the stringent sample clean-up steps required to remove the large amounts of biomolecules that interfere with phosphopeptide purification and mass spectrometry analysis. To this end, we developed a method with a streamlined workflow for highly efficient purification of phosphopeptides from tissues of various green organisms, including Arabidopsis, rice, tomato, and Chlamydomonas reinhardtii, enabling in-depth identification with high quantitative reproducibility of about 11,000 phosphosites, the greatest depth achieved so far with single liquid chromatography-mass spectrometry (LC-MS) runs operated in data-dependent acquisition (DDA) mode. This method, if combined with single-shot LC-MS, allows in-depth quantitative identification of Arabidopsis phosphoproteomes at multiple time points during the course of salt stress, revealing differential phosphorylation of spliceosomal proteins and a number of kinase substrate motifs. The method is expected to serve as a universal method for the purification of plant phosphopeptides, which, if further fractionated and analyzed by multiple LC-MS runs, could enable measurement of plant phosphoproteomes at an unprecedented depth for a given mass spectrometry technology.
Keywords:Plant phosphoproteomics, phosphopeptides, LC-MS, Arabidopsis, salt stress
Baylor College of Medicine
Targeted peptide searching and its applications in proteomics
Abstract:Peptide identification in proteomics is typically achieved using spectrum-centric search, in which all spectra from an MS/MS dataset are searched against a protein database in an untargeted fashion. This spectrum-centric search generally consists of several steps, including matching each spectrum against a protein database and controlling FDR at a global level. However, in many studies only a small set of novel or known peptides or proteins is of interest. In these studies, the traditional spectrum-centric search has been shown to be inefficient. To improve the efficiency of peptide identification in these studies, we developed PepQuery2, a universal targeted peptide search engine for identifying or validating known and novel peptides of interest in local or publicly available mass spectrometry-based proteomics datasets. The utility of PepQuery2 has been demonstrated in several applications, including identifying novel peptides in proteogenomics, validating novel and known peptides identified by spectrum-centric search, prioritizing tumor-specific antigens by leveraging public proteomics data, identifying missing proteins in large-scale public proteomics datasets, and guiding the selection of proteotypic peptides for targeted proteomics experiments.
Keywords:proteomics, proteogenomics, peptide identification, missing protein, targeted proteomics
Abstract:Mass spectrometry-based chemoproteomics has emerged as a key technology to expand the functional space in complex proteomes for probing fundamental biology and for discovering new small-molecule-based therapies. Nonetheless, the development of an efficient and selective probe for chemoproteomics can still be challenging. It is particularly difficult to assess a probe's chemoselectivity in an unbiased manner at a proteome-wide scale. To address this challenge, we recently developed pChem, a modification-centric computational tool that provides a streamlined pipeline for “probing the secrets of probes”. The pipeline starts with an experimental setting for isotopically coding probe-derived modifications, which pChem can recognize automatically, with masses accurately calculated and sites precisely localized. pChem exports on-demand reports scoring the profiling efficiency, modification homogeneity and proteome-wide residue selectivity, thereby facilitating the development of the probe being tested. In this talk, I will present a showcase illustrating the critical role of pChem in developing next-generation probes specific to cysteine sulfenylation (-SOH), an important post-translational modification.
Keywords:Chemoproteomics; Bioorthogonal probe; Mass spectrometry; pChem; Post-translational modification
Glyco-Decipher 2.0: Towards Comprehensive Interpretation of the Spectra of O-glyco and N-glycopeptides
Abstract:Site-specific glycoproteomics analysis relies on the interpretation of the spectra of intact glycopeptides. Recently, we develop a glycoproteomics tool, Glyco-Decipher, to interpret the spectra of N-linked Glycopeptides. It conducts glycan database-independent peptide matching and exploits the fragmentation pattern of shared peptide backbones in glycopeptides to improve the spectrum interpretation. Our results demonstrate that Glyco-Decipher reported the most peptide-spectrum matches than other existing tools in the number of identified glycopeptide spectra. The database-independent and unbiased profiling of attached glycans enable the discovery of 164 modified glycans in mouse tissues, including those linked with chemical or biological modifications. Different from N-linked glycosylation, the core structures of mucin type O-glycans are much more diverse and the sensitive interpretation of O-glycopeptide spectra remains a challenge. The Y-ion pattern, a series of Y-ions with known mass gaps derived from the penta-saccharide core structure of N-linked glycosylation, is exploited to facilitate N-glycopeptide identification from their spectra by Glyco-Decipher. We found that the Y-ion patterns were also frequently observed in the spectra of O-glycopeptides and a special search approach is presented to identify O-glycopeptides by exploiting the Y-ion patterns. In this strategy, theoretical O-glycan Y-ion patterns are constructed to match the experimental Y-ions in O-glycopeptide spectra, which enables the determination of the mass of some glycans and results in the reduction of searching space. This search mode, the O-Search-pattern, has been implemented into our database search software, MS-Decipher, and is recommended for the searching of O-glycopeptide spectra acquired by sceHCD. Currently, we have two software tools, Glyco-Decipher and O-search (-pattern) mode in MS-Decipher to interpret the spectra of O-glyco and N-glycopeptides, respectively. 
It should be noted that O-glycopeptides and N-glycopeptides can be present in the same sample, and the current tools often fail to distinguish these two types of spectra. We are now developing Glyco-Decipher 2.0, which is designed for high-performance simultaneous interpretation of the spectra of both O-glycopeptides and N-glycopeptides.
Keywords:Spectrum interpretation, intact glycopeptides, site specific glycoform, bioinformatics
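The Y-ion pattern idea described in the abstract above can be illustrated with a minimal sketch. It is not the Glyco-Decipher or MS-Decipher implementation: the function name, the tolerance, and the assumption that fragment peaks have already been converted to neutral masses are all hypothetical choices made for illustration. The only fixed inputs are the standard monoisotopic residue masses of HexNAc and Hex, from which the mass gaps of the N-glycan pentasaccharide core (HexNAc, HexNAc, Hex, Hex, Hex) follow.

```python
HEXNAC = 203.07937  # monoisotopic residue mass of HexNAc, in Da
HEX = 162.05282     # monoisotopic residue mass of Hex, in Da

# Mass gaps between consecutive Y-ions of the N-glycan pentasaccharide core:
# peptide -> +HexNAc -> +HexNAc -> +Hex -> +Hex -> +Hex
N_CORE_GAPS = [HEXNAC, HEXNAC, HEX, HEX, HEX]


def match_y_ion_pattern(masses, gaps=N_CORE_GAPS, tol=0.02, min_matched=3):
    """Find the longest contiguous Y-ion ladder among neutral fragment masses.

    Each peak is tried as the candidate peptide mass (Y0). The ladder is
    extended gap by gap and stops at the first missing Y-ion. Returns
    (candidate_peptide_mass, matched_masses), or None if no start peak
    reaches `min_matched` matched gaps. Hypothetical helper for illustration.
    """
    peaks = sorted(masses)
    best = None
    for start in peaks:
        chain = [start]
        expected = start
        for gap in gaps:
            expected += gap  # theoretical mass of the next Y-ion
            hit = next((m for m in peaks if abs(m - expected) <= tol), None)
            if hit is None:
                break
            chain.append(hit)
        matched = len(chain) - 1
        if matched >= min_matched and (best is None or matched > len(best[1]) - 1):
            best = (start, chain)
    return best


# Invented example peaks: a ladder starting at 1000.5 Da plus two noise peaks.
peaks = [500.0, 1000.5, 1203.57937, 1300.0, 1406.65874, 1568.71156]
result = match_y_ion_pattern(peaks)
```

In this toy input the ladder starting at 1000.5 Da is recovered, which fixes a candidate peptide mass and therefore the glycan mass (precursor mass minus peptide mass). For O-glycopeptides, the `gaps` argument would be replaced by gap lists built from the O-glycan core structures of interest.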
University of Michigan
FragPipe enables the one-stop analysis for DDA and DIA bottom-up proteomics
Abstract:FragPipe represents a robust and comprehensive suite designed for bottom-up proteomics data analysis, with efficient handling of both data-dependent acquisition (DDA) and data-independent acquisition (DIA) data. FragPipe seamlessly incorporates a host of core components including MSFragger, Crystal-C, MSBooster, Percolator, PeptideProphet, PTMProphet, ProteinProphet, Philosopher, PTM-Shepherd, IonQuant, TMT-Integrator, EasyPQP, and DIA-NN. These tools function collaboratively to perform a broad spectrum of data analyses, encompassing closed searches, open searches, glycopeptide searches, label-free quantification, isotopic-labeling quantification, isobaric-labeling quantification, spectral library building, and DIA quantification. The thorough experiments conducted by us and independent research groups conclusively establish that FragPipe outperforms most of the existing tools in key metrics such as sensitivity, precision, accuracy, and speed. Designed with user experience in mind, FragPipe operates on both Windows and Linux operating systems, offering a graphical user interface (GUI) for a straightforward, error-proof experience. The GUI enables users, regardless of their level of expertise, to select from a range of analysis workflows and run them without the need for manual option setting, essentially making complex data analysis tasks a breeze. Additionally, FragPipe features a command-line interface (CLI) to streamline large-scale automatic data analysis in clusters and servers. FragPipe continues to evolve, with new modules and features being developed. As it stands at the cutting edge of proteomics data analysis, FragPipe offers a versatile and powerful solution tailored to meet the requirements of both single-cell and bulk cell data analysis.
Keywords:proteomics, peptide identification, quantification, post-translational modification, data-dependent acquisition, data-independent acquisition
Clinical Proteomics Big data drives the discovery of innovative biomarkers
Abstract:Proteins are the executors of gene function, and changes in proteins are closely related to human life activities and play an important role in the transition from physiological to pathological states in the human body. The large-scale acquisition of protein change information during human life activities is of great significance for understanding the laws governing the evolution of human diseases, identifying new biomarker molecules and drug targets, and enabling patients to choose the right treatment plan at the right time. Since 2016, we have developed protein chip technology that can detect the humoral proteome and antibodies on a large scale, established a number of humoral protein and antibody microarrays and databases, and discovered more than 10 original biomarkers and potential drug targets for hepatocellular carcinoma, lung cancer, myeloma, psoriasis, and other diseases.
Prediction of site markers for kinase activity with expression-based and sequence-based representations
Abstract:Protein phosphorylation is a crucial part of the intricate and precise regulatory mechanisms within cells and is essential for normal cellular function and physiological processes. Phosphorylation is achieved by transferring a phosphate group from ATP to a specific amino acid residue on a protein through the catalysis of kinases. Kinases possess marker phosphorylation sites that, when phosphorylated, result in increased kinase activity. However, few marker sites are known, and those have been obtained through experiments. To systematically identify site markers for kinase activity, we employed a deep neural network-based model to learn representations from phosphorylation mass spectrometry data. The pre-trained protein sequence model ProtT5 was used to obtain representations from protein sequences. By combining these two levels of representation, we constructed a classification model of 5,430 kinase phosphorylation sites, which led to the discovery of novel site markers for kinase activity. The biological characteristics of these site markers were also revealed.
Keywords:Phosphorylation, kinase activity, site marker, deep neural network-based model, representation
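The step of combining expression-based and sequence-based representations can be pictured with a small sketch. The abstract does not say how the two representations are merged or what classifier sits on top, so everything here is an assumption: the vectors are concatenated per site, and a nearest-centroid classifier stands in for the deep neural network. The numbers are invented toy values, not data, and all function names are hypothetical.

```python
def combine(expr_vec, seq_vec):
    """Concatenate the two representations of one site (assumed merge rule)."""
    return list(expr_vec) + list(seq_vec)


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit_centroids(features, labels):
    """Return {label: centroid}; a stand-in for the real classification model."""
    by_label = {}
    for feature, label in zip(features, labels):
        by_label.setdefault(label, []).append(feature)
    return {label: centroid(vs) for label, vs in by_label.items()}


def predict(centroids, feature):
    """Assign the label whose centroid is closest to the feature vector."""
    return min(centroids, key=lambda lab: squared_distance(centroids[lab], feature))


# Toy values only: 2-d "expression" and 3-d "sequence" vectors for four sites.
expr = [[0.9, 0.8], [1.0, 0.7], [0.1, 0.2], [0.0, 0.3]]
seq = [[0.2, 0.9, 0.8], [0.1, 1.0, 0.9], [0.8, 0.1, 0.0], [0.9, 0.2, 0.1]]
labels = [1, 1, 0, 0]  # 1 = marker site, 0 = non-marker site (invented)

features = [combine(e, s) for e, s in zip(expr, seq)]
model = fit_centroids(features, labels)
```

In practice the sequence vector would come from a pre-trained model such as ProtT5 and the expression vector from the model trained on phosphorylation mass spectrometry data; the sketch only shows where the two meet.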
Discovery of Functional Peptides from the Small Open Reading Frame-encoded Peptidome
Abstract:Small open reading frames (sORFs) are novel coding DNA sequences that are shorter than 100 codons. They used to be considered non-coding or even junk. In recent years, accumulating evidence suggests that sORFs can encode microproteins or peptides with important functions. Given that 98% of the human genome was defined as non-coding regions, peptides encoded by small open reading frames are foreseeably an underexplored territory and a gold mine. However, large-scale and confident identification of these peptides has been technically challenging. We have established a systematic approach to discover, quantify, and characterize novel sORF-encoded peptides (SEPs). First, we analyzed samples with ribosome profiling to predict thousands of sORFs hidden in 5'UTRs, 3'UTRs, or lncRNAs, which showed high temporal and spatial specificity. Next, we established a systematic approach for the direct detection of these sORF-encoded peptides by mass spectrometry. Three optimizations greatly improved the number of identifications and data reproducibility: 1) peptide enrichment during sample preparation; 2) mass spectrometry data acquisition and analysis; 3) a customized sORF database. With our approach, hundreds of novel peptides were identified and quantified during mouse embryonic development. We have also discovered novel peptides with important functions in cancer progression, therapy, and cancer immunotherapy. Our work not only provides a mass spectrometry platform to identify sORF-encoded peptides but also reports novel peptides with important functions.
Keywords:small open reading frame (sORF), peptides, mass spectrometry, discovery and function characterization
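The sORF definition used in the abstract above (fewer than 100 codons) can be made concrete with a minimal sketch of a forward-strand scan. This is not the authors' pipeline, which predicts sORFs from ribosome profiling. The function name and parameters are hypothetical, and the sketch assumes an ATG start codon, the three standard stop codons, and a codon count that excludes the stop codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_sorfs(seq, max_codons=100, min_codons=2):
    """Return (start, end, orf) tuples for forward-strand ORFs shorter than
    `max_codons` coding codons (stop codon not counted).

    Assumptions for illustration: ORFs start at ATG, end at the first
    in-frame stop codon, and only the longest ORF per stop codon is kept.
    Coordinates are 0-based, with `end` exclusive and including the stop.
    """
    seq = seq.upper()
    found = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i + 3] != "ATG":
                i += 3
                continue
            # Walk codon by codon until an in-frame stop or the sequence end.
            j = i
            while j + 3 <= len(seq) and seq[j:j + 3] not in STOP_CODONS:
                j += 3
            if j + 3 > len(seq):
                break  # no in-frame stop downstream in this frame
            n_codons = (j - i) // 3
            if min_codons <= n_codons < max_codons:
                found.append((i, j + 3, seq[i:j + 3]))
            i = j + 3  # resume scanning after the stop codon
    return sorted(found)


# Invented example: a 3-codon ORF embedded in a short sequence.
example = find_sorfs("GGATGAAACCCTAAGG")
```

A candidate list produced this way could serve as the starting point for a customized sORF database of the kind the abstract mentions, once translated to peptide sequences.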