NovoRank: Machine Learning-Based Post-Processing Tool for De Novo Peptide Sequencing

NovoRank is a post-processing tool designed to improve the accuracy of de novo peptide sequencing in proteomics. Unlike database-dependent methods, de novo sequencing derives peptide sequences directly from tandem mass spectrometry (MS/MS) data, enabling the discovery of novel peptides. However, reliance on incomplete scoring functions often leads to incorrect identifications. NovoRank addresses this by re-ranking candidate peptides to recover correct identifications, improving both precision and recall. It is compatible with any de novo sequencing software. By reassigning the optimal peptide to each spectrum, NovoRank helps overcome the noise and ambiguities inherent in MS/MS data.

For details on the method behind NovoRank, refer to the NovoRank paper.

Overview

Code structure

NovoRank
  │ 
  ├─── generate_candidates_and_extract_features.py
  │
  ├─── run_novorank.py
  │ 
  ├─── src
  │     │     
  │     ├── features
  │     │     └── featureprocessor.py: Functions for feature calculation
  │     │       
  │     ├── loader
  │     │     └── dataloader.py: Functions for data loading
  │     │
  │     ├── model
  │     │     ├── base_model.py: Base model structure
  │     │     ├── inference.py: Functions for model inference
  │     │     ├── preprocess.py: Functions for data preprocessing in modeling
  │     │     └── train.py: Functions for model training
  │     │ 
  │     └─── utils
  │           ├── config_first.py: Command-line argument parsing and configuration file loading for generate_candidates_and_extract_features.py
  │           ├── config_second.py: Command-line argument parsing and configuration file loading for run_novorank.py
  │           ├── process.py: Functions for performing data processing
  │           └── utils.py: Utility functions providing support
  │
  ├─── models: Trained models (may include models saved at each epoch)
  │
  ├─── pretrained: Pretrained NovoRank models (Casanovo, PEAKS, pNovo3) in .h5 format
  │
  └─── software
        ├── CometX: XCorr calculation software (in-house software)
        └── MSCluster: Spectral clustering software

Datasets

All datasets used in this work are available for download from Zenodo.

Before using NovoRank, users MUST read the README.md in the ./data directory, which also provides sample data.

Configuration

The config.yaml file sets the parameters and initial configuration required to run NovoRank. It contains default values, and each option is described in a comment within the file.

Requirements

⦁ To install the required Python packages:

Clone the repository or download the code.
Create and activate an Anaconda virtual environment:

conda create -n [NAME] python==3.9
conda activate [NAME]

Install the dependencies listed in requirements.txt:

pip install -r requirements.txt

Note:
NovoRank was implemented using Python 3.9 and utilizes the DeepLC package, which is included in the requirements.txt.

⦁ Software

MS-Cluster (Download)
CometX (In-house software modified to calculate XCorr, based on Comet software)

How to Use

For the description of the datasets required to execute Steps 2 and 4, refer to Essential Data for Using NovoRank.

Step 1. Spectral clustering using MS-Cluster

MSCluster.exe --list [PATH] --output-name CLUSTERS --mixture-prob 0.01 --fragment-tolerance 0.02 --assign-charges

Note:
For detailed instructions on using MS-Cluster, refer to the manual.
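The sketch below shows one way to prepare the input for Step 1. It assumes that the --list argument takes a plain-text file with one spectrum file path per line; confirm this against the MS-Cluster manual for your version. The helper name write_spectrum_list and the paths in the usage comment are illustrative and are not part of NovoRank.

```python
from pathlib import Path


def write_spectrum_list(mgf_dir, list_path):
    """Write one absolute MGF path per line and return the paths written.

    Assumption: MS-Cluster's --list option reads a text file of this form.
    """
    mgf_files = sorted(Path(mgf_dir).glob("*.mgf"))
    lines = [str(path.resolve()) for path in mgf_files]
    # End the file with a newline only when there is something to list.
    Path(list_path).write_text("\n".join(lines) + ("\n" if lines else ""))
    return lines


# Usage (hypothetical locations, replace with your own):
# write_spectrum_list("./data/mgf", "spectra_list.txt")
```

The resulting file would then be passed as [PATH] in the MSCluster.exe command above.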

Step 2. Generate two candidates and extract features

Whether this step runs for training or for inference is controlled by a parameter in config.yaml.

python generate_candidates_and_extract_features.py --search_ppm [PRECURSOR_TOLERANCE] --elution_time [ELUTION_TIME_MIN]

Note:
To check the available options and their descriptions, run the command python generate_candidates_and_extract_features.py -h.
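When Step 2 is run repeatedly (for example, once per dataset), it can help to build the command in a small script. The following is a minimal sketch that uses only the two options shown above. The function names are hypothetical, and the numeric values in the usage comment are placeholders rather than recommended settings.

```python
import subprocess
import sys


def build_feature_command(search_ppm, elution_time_min):
    """Return the Step 2 command as an argument list."""
    return [
        sys.executable,
        "generate_candidates_and_extract_features.py",
        "--search_ppm", str(search_ppm),
        "--elution_time", str(elution_time_min),
    ]


def run_step(command):
    """Run one pipeline step and raise CalledProcessError if it fails."""
    subprocess.run(command, check=True)


# Usage (placeholder values, run from the repository root):
# run_step(build_feature_command(10, 120))
```

Passing the arguments as a list avoids shell quoting issues, and check=True stops the script as soon as the step exits with an error.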

Step 3. XCorr calculation using CometX

CometX.exe -X -Pcomet.params [PATH]\*.mgf

Note:
[PATH] is the directory containing the MGF files for XCorr calculation.
*_xcorr.tsv files will be generated.
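A quick check that Step 3 produced output can be sketched as follows. It assumes that the *_xcorr.tsv files are written to the same directory as the MGF files and that each file starts with a single header line; neither detail is stated here, so adjust as needed. The function names are illustrative.

```python
import csv
from pathlib import Path


def find_xcorr_files(mgf_dir):
    """Return the *_xcorr.tsv files in mgf_dir, sorted by name."""
    return sorted(Path(mgf_dir).glob("*_xcorr.tsv"))


def count_rows(tsv_path):
    """Count data rows in a tab-separated file, assuming one header line."""
    with open(tsv_path, newline="") as handle:
        rows = list(csv.reader(handle, delimiter="\t"))
    return max(len(rows) - 1, 0)


# Usage (hypothetical directory):
# for path in find_xcorr_files("./data/mgf"):
#     print(path.name, count_rows(path))
```

An empty list from find_xcorr_files suggests that CometX did not run on the expected directory.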

Step 4. Training and inference with NovoRank

python run_novorank.py

Note:
To check the available options and their descriptions, run the command python run_novorank.py -h.

Training
To use NovoRank, users must train a model tailored to their dataset; a customized model based on the de novo search software used is recommended. The trained model is saved in the ./models/ directory in .h5 format. Checkpoint models from each epoch can also be saved.

Inference
Inference can be performed using a pre-trained model. Pre-trained models for testing, one for each of three de novo search tools (Casanovo, PEAKS, pNovo3), are located in the ./pretrained/ directory.

The deep learning model only handles peptides with a maximum mass of 5000 Da and a length of 40 or less.
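To see in advance which candidates fall outside these limits, a check along the following lines may be useful. This is a sketch under stated assumptions: it treats the limit as applying to the neutral monoisotopic mass of an unmodified peptide, and it does not handle modifications. How NovoRank itself computes mass and length is not described here, so results may differ at the boundaries.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER_MASS = 18.010565  # one H2O added for the peptide termini
MAX_MASS_DA = 5000.0    # limit stated above
MAX_LENGTH = 40         # limit stated above


def peptide_mass(sequence):
    """Neutral monoisotopic mass of an unmodified peptide, in Da."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER_MASS


def within_model_limits(sequence):
    """True if the peptide is within both the length and mass limits."""
    return len(sequence) <= MAX_LENGTH and peptide_mass(sequence) <= MAX_MASS_DA
```

A sequence containing a character outside the 20 standard residues raises KeyError, which makes unsupported input visible instead of silently passing it through.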

Results

- The results_top1.csv file is generated at the ./data/interim location.
  (The save location and result file name can be changed in the config.yaml file)
- NovoRank outputs a single assigned peptide for each spectrum.

Credits

NovoRank was created by Jangho Seo, Seunghyuk Choi, and Eunok Paek at Hanyang University.

Citation

@article{sep2024novorank,
  title = {NovoRank: Refinement for De Novo Peptide Sequencing Based on Spectral Clustering and Deep Learning},
  shorttitle = {NovoRank},
  author = {Seo, Jangho and Choi, Seunghyuk and Paek, Eunok},
  journal={Journal of Proteome Research},
  year={2024}
}

License

- NovoRank © 2024 is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International.
  This license requires that reusers give credit to the creator. It allows reusers to distribute, 
  remix, adapt, and build upon the material in any medium or format, for noncommercial purposes only. 
  If others modify or adapt the material, they must license the modified material under identical terms.

Contact

If you have any questions, feel free to open an issue or contact Jangho Seo or any of the contributors listed above.
