As mentioned in the supplemental material (section 1 and especially 1.1), the coupling scores in the output matrix are the root sum of squares (Frobenius norm) of the coupling coefficients of the Markov random field. This score then undergoes an "APC" renormalization step.
Since the maximum-pseudo-likelihood method tries to learn a generative model for the sequences in the input MSA, the individual epsilon_{i,j}(a, b) scores give a log-odds ratio for the amino acid a in column i to co-occur with amino acid b in column j, compared to what would be expected if the two columns were independent. Calculating the Frobenius norm for each of the 20x20 submatrices is only a heuristic that has proven effective for pseudolikelihood-based contact prediction methods; unfortunately, there is no direct probabilistic interpretation of the resulting scores. We also observe that the magnitude of the coupling scores varies between input alignments with differing numbers of sequences per column, so coupling scores cannot be compared directly across alignments and can only be used for ranking within one input alignment.
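To make the scoring step concrete, here is a minimal sketch in plain Python of a Frobenius norm per column pair followed by an average product correction (APC). It assumes the commonly used APC form, score(i,j) - mean(i) * mean(j) / overall mean, with means taken over off-diagonal entries; the function names and the nested-list layout are illustrative and are not taken from the CCMpred source.

```python
import math


def frobenius_scores(couplings):
    """Collapse each submatrix couplings[i][j] (a q x q nested list) to one score.

    Assumed layout: couplings[i][j][a][b] is the coefficient for state a in
    column i and state b in column j. Diagonal entries (i == j) are left at 0.
    """
    n = len(couplings)
    scores = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                squares = sum(v * v for row in couplings[i][j] for v in row)
                scores[i][j] = math.sqrt(squares)
    return scores


def apc(scores):
    """Average product correction over off-diagonal entries (assumed variant)."""
    n = len(scores)
    row_mean = [
        sum(scores[i][j] for j in range(n) if j != i) / (n - 1) for i in range(n)
    ]
    total_mean = sum(row_mean) / n
    corrected = [[0.0] * n for _ in range(n)]
    if total_mean == 0:
        return corrected  # nothing to correct if every score is zero
    for i in range(n):
        for j in range(n):
            if i != j:
                corrected[i][j] = scores[i][j] - row_mean[i] * row_mean[j] / total_mean
    return corrected
```

Because the correction subtracts a background term derived from the same alignment, the corrected values remain useful only for ranking within that alignment.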
If you need probability values for the correctness of individual contacts: You can generate estimates by creating an evaluation set of similarly-dimensioned alignments and then determining the prediction accuracy for the most-confident score, second-most-confident score, and so on.
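As a rough sketch of that calibration procedure, the function below estimates, for each rank position, how often the prediction at that rank was a true contact across an evaluation set. The input layout (one list of (score, is_true_contact) tuples per alignment) and the name precision_by_rank are assumptions made for illustration; how true contacts are defined is left to the evaluation set.

```python
def precision_by_rank(predictions, max_rank):
    """Return the fraction of true contacts at each rank, over all alignments.

    predictions: list with one entry per evaluation alignment; each entry is a
    list of (score, is_true_contact) tuples for the residue pairs of that
    alignment. Ranks with no observations yield None.
    """
    hits = [0] * max_rank
    counts = [0] * max_rank
    for pairs in predictions:
        # Rank pairs within each alignment separately, highest score first.
        ranked = sorted(pairs, key=lambda p: p[0], reverse=True)
        for rank, (_score, is_contact) in enumerate(ranked[:max_rank]):
            counts[rank] += 1
            if is_contact:
                hits[rank] += 1
    return [hits[r] / counts[r] if counts[r] else None for r in range(max_rank)]
```

The value at index 0 can then be read as an empirical estimate of how often the most-confident prediction is correct for alignments similar to those in the evaluation set.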
If you only care about a qualitative prediction of which residues are most likely in contact: You can extract the N couplings with the highest coupling score, or alternatively plot the coupling matrix as a 2D greyscale image to see if interaction patterns arise.
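Extracting the top N couplings could look like the sketch below, which assumes the coupling matrix has already been loaded as a symmetric nested list of floats. The min_separation parameter is a hypothetical convenience for skipping pairs that are close in sequence; it is not a CCMpred option.

```python
def top_couplings(matrix, n, min_separation=1):
    """Return the n highest-scoring pairs as (i, j, score), with i < j.

    Only the upper triangle is scanned, so each pair is reported once.
    Pairs with j - i < min_separation are skipped. Indices are 0-based.
    """
    length = len(matrix)
    pairs = []
    for i in range(length):
        for j in range(i + min_separation, length):
            pairs.append((matrix[i][j], i, j))
    pairs.sort(key=lambda t: t[0], reverse=True)
    return [(i, j, score) for score, i, j in pairs[:n]]
```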
Please format the alignments in the PSICOV format: one sequence per line, in upper-case single-letter code, with no identifiers. Gaps should be denoted by the - symbol. You can find an example alignment in example/1atzA.aln.
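A quick sanity check of an alignment against these rules might look like the following sketch. The function name check_psicov and the exact checks are illustrative assumptions: it accepts any upper-case letter plus the gap symbol and additionally requires all sequences to have equal length, which may be stricter or looser than what CCMpred itself enforces.

```python
def check_psicov(lines):
    """Return a list of human-readable problems found in a PSICOV-style alignment.

    lines: iterable of strings, one per line of the alignment file.
    An empty result means no problems were detected by these checks.
    """
    seqs = [ln.rstrip("\r\n") for ln in lines if ln.strip()]
    if not seqs:
        return ["alignment is empty"]
    problems = []
    width = len(seqs[0])
    for num, seq in enumerate(seqs, start=1):
        if seq.startswith(">"):
            problems.append(f"sequence {num}: identifier line found")
            continue
        if len(seq) != width:
            problems.append(f"sequence {num}: length {len(seq)}, expected {width}")
        bad = sorted({c for c in seq if c != "-" and not ("A" <= c <= "Z")})
        if bad:
            problems.append(f"sequence {num}: unexpected characters {bad}")
    return problems
```

It can be applied to a file with, for example, check_psicov(open("example/1atzA.aln")).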
We recommend using HHblits to generate alignments with the following parameters (the 100000 values and the -all parameter ensure that more sequences are returned; in addition, we use 3 iterations and an E-value cutoff of 10^-3):
hhblits -maxfilt 100000 -realign_max 100000 -d uniprot20 -all -B 100000 -Z 100000 -n 3 -e 0.001
After that, we recommend using hhfilter to filter the sequences at a 90% identity threshold.
To get alignments into the PSICOV format required by CCMpred, please use reformat.pl to convert the A3M-formatted hhblits result into a FASTA-formatted alignment (be sure to remove inserts with respect to the target sequence using the -r option) and convert_alignment.py to convert from FASTA to PSICOV format.
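For orientation, the FASTA-to-PSICOV step amounts to dropping the identifier lines and joining wrapped sequence lines. The sketch below illustrates that idea only; it is not the convert_alignment.py script shipped with CCMpred, and the function name fasta_to_psicov is made up for this example.

```python
def fasta_to_psicov(fasta_text):
    """Convert FASTA-formatted alignment text to PSICOV-style text.

    Header lines (starting with '>') are dropped, and sequence lines that
    belong to the same record are joined into a single line. Residue
    characters are passed through unchanged.
    """
    sequences = []
    current = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record begins: flush the previous sequence, if any.
            if current:
                sequences.append("".join(current))
                current = []
        else:
            current.append(line)
    if current:
        sequences.append("".join(current))
    return "\n".join(sequences) + "\n" if sequences else ""
```

Since the residues are not modified, inserts must already have been removed in the reformat.pl step for the result to have equal-length rows.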
This means that the optimization of the model has not converged within the given maximum number of iterations. While this generally should not negatively impact prediction accuracy, you might try increasing the number of iterations.
This means that the line search in the optimization could not find a suitable search direction. This indicates either that the model cannot be improved further at the available numerical accuracy (i.e. you are already close to the optimum but the optimization did not converge) or that your input data is misformatted. You can try the following:
- Check if your input alignment is formatted correctly (see above)
- Check if you have enough homologous sequences in the alignment
- Re-compile CCMpred with double-precision floating point arithmetic
Simply compile with the appropriate CMake flag set:
cmake -DCONJUGRAD_FLOAT=64 .
make
You can re-compile CCMpred with padding disabled. This will pack variables more densely in memory at the cost of slower runtime performance:
cmake -DWITH_PADDING=off .
make
CMake should automatically detect which libraries are available on the compiling system. If this detection should go wrong or you want to intentionally disable features, you can do so using the flags defined in the CMakeLists.txt file:
cmake -DWITH_CUDA=off -DWITH_OMP=off .
make
If a library that you would like to compile with could not be found automatically, you can also set locations via the CMake command:
cmake -DMSGPACK_INCLUDE_DIR:STRING=/path/to/msgpack/include -DMSGPACK_LIBRARIES:STRING=/path/to/libmsgpack.so .
For other libraries, you can find the search code and variable names in cmake_lib/ or the system-wide CMake library.