Guide to MaxQuant in Linux

(MaxQuant Version 1.6.5.0 - Home Page)

The popular proteomics search engine MaxQuant has recently gained Linux support, now that its developers have the code base working with the Mono/C# runtime.

The official documentation on getting MaxQuant to run on Linux is sparse, so this repo is dedicated to passing on small tips and helper scripts that help people get started.

All code is hosted in this GitHub repo

Requirements

  • A decent computer
  • Mono Framework >= 5.4.1

Running MaxQuant

>$ mono MaxQuant/bin/MaxQuantCmd.exe mqpar.xml

The default console output says little about the exact task MaxQuant is busy with, especially during the “Preparing for searches” step. You can check the exact status of your search by listing the process file names and grepping for the ones marked as started.

>$ ls $MQ_combined_folder/combined/proc | grep started

MSMS_first_search 16.started.txt
MSMS_first_search 26.started.txt
MSMS_first_search 36.started.txt
MSMS_first_search 46.started.txt
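
The same check can be scripted. The following is a minimal sketch, not part of the repo: the function name `running_jobs` is hypothetical, and it assumes that a finished job is marked by a matching `.finished.txt` file in the same folder. If MaxQuant instead removes the `.started.txt` marker when a job ends, the result is the same.

```python
from pathlib import Path

STARTED = ".started.txt"
FINISHED = ".finished.txt"  # assumption: marker left behind by finished jobs


def running_jobs(proc_dir):
    """Return the names of jobs that have started but not yet finished."""
    proc = Path(proc_dir)
    started = {p.name[: -len(STARTED)] for p in proc.glob("*" + STARTED)}
    finished = {p.name[: -len(FINISHED)] for p in proc.glob("*" + FINISHED)}
    return sorted(started - finished)


if __name__ == "__main__":
    import sys

    # Usage: python running_jobs.py $MQ_combined_folder/combined/proc
    for job in running_jobs(sys.argv[1]):
        print(job)
```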

You can also print the #runningTimes.txt table to see exactly which steps are taking the longest:

>$ cat $MQ_combined_folder/combined/proc/#runningTimes.txt

Job                                   Running time [min]  Start date  Start time  End date    End time
Configuring                           0.0168226333333333  14/11/2018  11:03:03    14/11/2018  11:03:04
Testing fasta files                   0.0334432333333333  14/11/2018  11:03:04    14/11/2018  11:03:06
Testing raw files                     0.11721345          14/11/2018  11:03:06    14/11/2018  11:03:13
Feature detection                     5.0676712           14/11/2018  11:03:13    14/11/2018  11:08:17
Tmp folder cleanup                    0.0172500333333333  14/11/2018  11:08:17    14/11/2018  11:08:18
Calculating peak properties           3.06749265          14/11/2018  11:08:18    14/11/2018  11:11:22
Combining apl files for first search  0.216795733333333   14/11/2018  11:11:22    14/11/2018  11:11:35
Preparing searches                    28.8861353833333    14/11/2018  11:11:35    14/11/2018  11:40:28
MS/MS first search                    68.32271715         14/11/2018  11:40:28    14/11/2018  12:48:47
...
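
To rank the steps by duration instead of reading the table by eye, something like the sketch below could work. It is an illustration only: the function name `slowest_steps` is hypothetical, and it assumes the file is tab-delimited with a single header row and uses a dot as the decimal separator.

```python
import csv


def slowest_steps(path, top=5):
    """Return (job, minutes) pairs for the `top` longest-running steps."""
    rows = []
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")  # assumption: tab-delimited
        next(reader, None)  # skip the header row
        for fields in reader:
            if len(fields) < 2:
                continue  # ignore blank or malformed lines
            try:
                rows.append((fields[0].strip(), float(fields[1])))
            except ValueError:
                continue  # running time was not a plain number
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:top]


if __name__ == "__main__":
    import sys

    # Usage: python slowest_steps.py "$MQ_combined_folder/combined/proc/#runningTimes.txt"
    for job, minutes in slowest_steps(sys.argv[1]):
        print(f"{minutes:10.2f} min  {job}")
```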

Partial Processing

The partial processing feature of the MaxQuant GUI is also available in the command-line interface, but to use it you first need to list the individual steps that MaxQuant plans to take. You can list them with a “dry run”, that is, by appending -n to the normal execution command.

>$ mono MaxQuant/bin/MaxQuantCmd.exe mqpar.xml -n

id  number of threads  job name
0   1                  Configuring
1   1                  Testing fasta files
2   8                  Testing raw files
3   8                  Feature detection

...

39  1                  Prepare writing tables
40  8                  Writing tables
41  1                  Finish writing tables

Then use the --partial-processing=[id] flag to tell MaxQuant to start at that specific step:

>$ mono MaxQuant/bin/MaxQuantCmd.exe mqpar.xml --partial-processing=35

For the best chance of running partial processing successfully, you should have run the previous steps with the same mqpar.xml configuration file. Partial processing may tolerate small configuration changes (for example, the feature detection step may not depend at all on the variable modification settings), but for reproducibility’s sake it is best to keep your configuration files consistent. (I haven’t tested this thoroughly, so if you know more, get back to me!)
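
Looking up the step id by hand gets tedious, so the dry-run table can be parsed instead. This is a rough sketch with hypothetical helper names (`parse_dry_run`, `partial_command`); it assumes the dry-run output has the three whitespace-separated columns shown above (id, number of threads, job name) and only uses the -n and --partial-processing flags already described.

```python
import re


def parse_dry_run(text):
    """Map each job name in the dry-run table to its numeric id."""
    jobs = {}
    for line in text.splitlines():
        # Data rows look like: "<id> <threads> <job name>"; the header is skipped.
        match = re.match(r"\s*(\d+)\s+(\d+)\s+(.+?)\s*$", line)
        if match:
            job_id, _threads, name = match.groups()
            jobs.setdefault(name, int(job_id))  # keep the first id if a name repeats
    return jobs


def partial_command(exe, mqpar, job_id):
    """Build the argument list for resuming a run at the given step."""
    return ["mono", exe, mqpar, f"--partial-processing={job_id}"]


if __name__ == "__main__":
    import subprocess

    exe, mqpar = "MaxQuant/bin/MaxQuantCmd.exe", "mqpar.xml"
    dry_run = subprocess.run(
        ["mono", exe, mqpar, "-n"], capture_output=True, text=True, check=True
    )
    jobs = parse_dry_run(dry_run.stdout)
    print(" ".join(partial_command(exe, mqpar, jobs["Writing tables"])))
```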

Helper Scripts

Nowadays I’m running MaxQuant on Northeastern’s Research Computing cluster, which is hosted at the Massachusetts Green High-Performance Computing Center (MGHPCC) in Holyoke, MA. It uses the Slurm workload manager to run tasks, so I have to package all of my MaxQuant calls into Slurm-compatible shell scripts.

I have one Python script, gen_mqpar.py, which takes as input an mqpar.xml template, one or more folders of raw files, and some other parameters, and generates both a filled-out mqpar.xml file and a shell script that can be run by Slurm.

>$ ./gen_mqpar.py -h
usage: gen_mqpar.py [-h] -o OUTFILE [-mq {1_6_2_3,1_6_2_10,1_6_3_2,1_6_3_3}]
                    [-t THREADS]
                    input_xml raw_file_folders [raw_file_folders ...]

MaxQuant for linux

positional arguments:
  input_xml
  raw_file_folders

optional arguments:
  -h, --help            show this help message and exit
  -o OUTFILE, --outfile OUTFILE
  -mq {1_6_2_3,1_6_2_10,1_6_3_2,1_6_3_3}, --mq-version {1_6_2_3,1_6_2_10,1_6_3_2,1_6_3_3}
                        MaxQuant version
  -t THREADS, --threads THREADS
                        Number of threads to specify in MaxQuant configuration
                        file

Example:

>$ ./gen_mqpar.py templates/SILAC.xml raw_files/FP93 -o fp93_silac.xml -t 6
success!
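
For readers who want a feel for the template-filling step, here is a minimal sketch. It is not the code of gen_mqpar.py (see the repo for that). The function name `fill_template` is hypothetical, and the sketch assumes the template has a top-level `numThreads` element and a `filePaths` element holding one `string` child per raw file. A real mqpar.xml also carries per-file lists (experiments, fractions, and so on) that would need to be kept in step and are left out here.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def fill_template(template_xml, raw_folders, threads):
    """Return the template text with the thread count and raw file paths filled in."""
    root = ET.fromstring(template_xml)

    # Collect raw files from every folder (assumption: lowercase .raw extension).
    raw_files = sorted(
        str(path) for folder in raw_folders for path in Path(folder).glob("*.raw")
    )

    threads_node = root.find("numThreads")  # assumed element name
    if threads_node is not None:
        threads_node.text = str(threads)

    paths_node = root.find("filePaths")  # assumed element name
    if paths_node is not None:
        for child in list(paths_node):
            paths_node.remove(child)  # drop placeholder entries from the template
        for raw in raw_files:
            ET.SubElement(paths_node, "string").text = raw

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    template = Path("templates/SILAC.xml").read_text()
    Path("fp93_silac.xml").write_text(fill_template(template, ["raw_files/FP93"], 6))
```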

I keep a set of templates in a folder, templates/, which I can pull from when I want to change the parameters of a search. For example, I might do a normal TMT search to get quantitation, and also a search where TMT is a variable modification to check my labeling efficiency.

Also note that I keep multiple versions of MaxQuant on hand, in case any of them breaks for some reason or another. I manage these versions with shell variables in my .bashrc:

MQ_1_6_2_3=~/MQ_1.6.2.3/bin/MaxQuantCmd.exe
MQ_1_6_2_10=~/MQ_1.6.2.10/bin/MaxQuantCmd.exe
MQ_1_6_3_2=~/MQ_1.6.3.2/bin/MaxQuantCmd.exe
MQ_1_6_3_3=~/MQ_1.6.3.3/bin/MaxQuantCmd.exe

As I want all of my MaxQuant versions to have the same modifications.xml list, I have a symlink going from bin/conf/modifications.xml to a common modifications.xml file that I maintain.
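
Setting up those symlinks for several installs can be scripted. The sketch below is an illustration with a hypothetical function name (`link_modifications`); it assumes each install keeps its file at bin/conf/modifications.xml, as described above, and it renames any existing regular file to modifications.xml.bak instead of deleting it.

```python
from pathlib import Path


def link_modifications(install_dirs, shared_file):
    """Point each install's bin/conf/modifications.xml at one shared file."""
    shared = Path(shared_file).resolve()
    for install in install_dirs:
        target = Path(install) / "bin" / "conf" / "modifications.xml"
        if target.is_symlink():
            target.unlink()  # replace an older link
        elif target.exists():
            # Keep the bundled file as a backup rather than overwriting it.
            target.rename(target.with_suffix(".xml.bak"))
        target.symlink_to(shared)


if __name__ == "__main__":
    home = Path.home()
    link_modifications(
        [home / "MQ_1.6.2.3", home / "MQ_1.6.3.3"],
        home / "modifications.xml",  # hypothetical location of the shared file
    )
```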

The script outputs an XML file that is ready to be run by MaxQuantCmd.exe, and a shell script that can be queued by the Slurm job manager:

#!/bin/bash
#SBATCH --job-name=fp103f
#SBATCH --output=fp103f.out
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem-per-cpu=1000
#SBATCH --time=24:0:0
#SBATCH --partition=general

source /home/chen.alb/.bashrc
srun mono $MQ_1_6_3_3 /home/chen.alb/fp103f.xml
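
A script like this is easy to generate from a few parameters. The following is a minimal sketch of that idea, not the code in the repo: `render_sbatch` and its parameter names are hypothetical, and the defaults simply mirror the values in the example above. It writes a bash shebang because `source` is a bash builtin.

```python
def render_sbatch(
    job_name,
    mqpar_path,
    mq_var="MQ_1_6_3_3",  # name of the shell variable defined in .bashrc
    cpus=16,
    mem_per_cpu_mb=1000,
    hours=24,
    partition="general",
    bashrc="~/.bashrc",
):
    """Return the text of a Slurm batch script that runs one MaxQuant search."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --output={job_name}.out",
        "#SBATCH --ntasks=1",
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --mem-per-cpu={mem_per_cpu_mb}",
        f"#SBATCH --time={hours}:00:00",
        f"#SBATCH --partition={partition}",
        "",
        f"source {bashrc}",
        f"srun mono ${mq_var} {mqpar_path}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_sbatch("fp103f", "/home/chen.alb/fp103f.xml"), end="")
```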

Check out the Git repo, which has all the code and some examples: https://github.com/blahoink/maxquant_linux_guide.

All in all, this workflow is great for me, as I just upload some raw files, then run:

>$ ./gen_mqpar.py templates/SILAC.xml raw_files/FP93 -o fp93_silac.xml -t 6
>$ sbatch scripts/fp93_silac.sh
...

And I’m done! I have to wait for a long time for the run to finish, of course, but I try to not babysit the task and just check back after a few hours.

Notes:

The gen_mqpar.py script creates separate output, temp, and Andromeda index directories for each search. I did this because I noticed that sharing any of these directories could cause conflicts: two concurrent searches would try to access the same resource at the same time, and one of them would crash. I know this slows down my searches, especially since every search builds its own Andromeda index, but the cluster is fast enough that I’m willing to take the hit.
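
The per-search layout can be sketched as follows. This is an illustration only: `make_search_dirs` and the subfolder names are hypothetical and are not taken from gen_mqpar.py. The returned paths would then be written into whichever mqpar.xml settings control those locations.

```python
from pathlib import Path


def make_search_dirs(base_dir, search_name):
    """Create isolated output, temp, and index folders for one search."""
    root = Path(base_dir) / search_name
    dirs = {kind: root / kind for kind in ("output", "temp", "andromeda_index")}
    for path in dirs.values():
        # exist_ok lets a search be regenerated without failing on old folders.
        path.mkdir(parents=True, exist_ok=True)
    return dirs


if __name__ == "__main__":
    for kind, path in make_search_dirs("searches", "fp93_silac").items():
        print(f"{kind}: {path}")
```

Because every search gets its own root folder, two jobs queued at the same time never share a temp or index path.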