Concerns about the PTM-Mamba training task design and its biological validity #5

@Cai-z-us


Dear authors,

First of all, thank you for sharing the excellent work on PTM-Mamba. I’ve read the paper and reviewed the [token embedding t-SNE figure](https://github.com/programmablebio/ptm-mamba/tree/main/ptm_data_preprocessing), training methods, and the downstream evaluations. I find the overall framework promising. However, I have some technical concerns about the training task design and its ability to reflect biological understanding, and I hope to hear your thoughts.


❗ Concern 1: Masked language modeling may favor shallow context patterns over semantic understanding

The training objective of PTM-Mamba is masked language modeling (MLM), with the following masking strategy:

  • 80% of sequences use the standard 15% random masking,
  • 20% of sequences mask all PTM tokens plus a random 15% of wild-type tokens.

This approach can effectively enhance PTM token learning, but it also risks encouraging the model to memorize the positional or contextual patterns in which PTMs appear rather than to understand their functional roles in protein biology. For frequently occurring PTMs in conserved motifs in particular, the model may perform well through shallow pattern matching instead of semantic generalization.
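For concreteness, here is a minimal sketch of the two-branch masking scheme as I understand it; the token ids, `MASK_ID`, and `PTM_IDS` below are placeholders rather than your actual vocabulary:

```python
import torch

MASK_ID = 0                            # placeholder id for the mask token (assumption)
PTM_IDS = torch.tensor([33, 34, 35])   # placeholder ids for PTM tokens (assumption)

def mask_sequence(tokens: torch.Tensor, p_ptm_branch: float = 0.2) -> torch.Tensor:
    """Two-branch masking as I read it from the paper:
    with probability 0.8, mask 15% of positions uniformly at random;
    with probability 0.2, mask every PTM token plus 15% of wild-type tokens."""
    masked = tokens.clone()
    rand15 = torch.rand(tokens.shape) < 0.15        # random 15% of positions
    if torch.rand(1).item() < p_ptm_branch:
        is_ptm = torch.isin(tokens, PTM_IDS)        # all PTM positions
        masked[is_ptm | (rand15 & ~is_ptm)] = MASK_ID
    else:
        masked[rand15] = MASK_ID
    return masked
```

My worry is that in the PTM branch the surrounding wild-type context is always visible, so the masked PTM can often be recovered from local motif statistics alone.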


❗ Concern 2: Task-objective mismatch between MLM and PTM biological function

Predicting the identity of masked tokens in context (as in MLM) is not inherently aligned with real PTM-related tasks such as:

  • Predicting whether a PTM occurs at a site (site-level classification),
  • Understanding its causal effect on protein function, interaction, or localization,
  • Reasoning over PTM-related disease associations.

Thus, the current objective may train an excellent token-reconstruction model, but not necessarily a model with a functional understanding of PTMs.
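To make the mismatch concrete, a site-level objective would supervise the model directly on the quantity of interest. A minimal sketch of such a head follows; the embedding width and the way per-residue embeddings are obtained from PTM-Mamba are assumptions on my part:

```python
import torch
import torch.nn as nn

class SitePTMHead(nn.Module):
    """Per-residue binary classifier on top of encoder embeddings.
    `hidden_dim` is a placeholder for the model's actual embedding width."""
    def __init__(self, hidden_dim: int = 1024):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, 1)

    def forward(self, residue_embeddings: torch.Tensor) -> torch.Tensor:
        # residue_embeddings: (batch, seq_len, hidden_dim)
        # returns one logit per position: "does a PTM occur at this site?"
        return self.classifier(residue_embeddings).squeeze(-1)

# Training such a head uses per-site labels with BCEWithLogitsLoss, i.e. the loss
# is on site occurrence itself rather than on reconstructing masked tokens.
loss_fn = nn.BCEWithLogitsLoss()
```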


❗ Concern 3: Downstream benchmarks may not sufficiently validate PTM token understanding

Most downstream tasks (e.g., phosphorylation site prediction) use wild-type sequences as inputs even for PTM-Mamba. This limits the ability to test whether the added PTM token embeddings capture meaningful biology. Only the PPI benchmark (PTMint) appears to incorporate PTM-tokenized input explicitly.

It would be more convincing if:

  • PTM tokens were explicitly used in more evaluation tasks,
  • Zero-shot generalization to unseen PTMs were measured,
  • Or contrastive evaluations were added (e.g., comparing wild-type vs PTM-modified embeddings; a sketch follows below).
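On the last point, here is an example of the kind of contrastive check I have in mind; `embed` is a placeholder for however PTM-Mamba exposes per-residue embeddings, not a real API:

```python
import torch
import torch.nn.functional as F

def wt_vs_ptm_shift(embed, wt_tokens: torch.Tensor, ptm_tokens: torch.Tensor) -> float:
    """Cosine similarity between mean-pooled embeddings of a wild-type sequence
    and its PTM-tokenized counterpart.  `embed` is assumed to map a (seq_len,)
    token tensor to a (seq_len, hidden_dim) embedding matrix."""
    wt_repr = embed(wt_tokens).mean(dim=0)
    ptm_repr = embed(ptm_tokens).mean(dim=0)
    return F.cosine_similarity(wt_repr, ptm_repr, dim=0).item()
```

If the PTM tokens encode biology rather than surface co-occurrence, one would expect larger representation shifts for functionally consequential modifications than for inert ones.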

❗ Concern 4: t-SNE embedding visualization may not reflect real semantic structure

While the t-SNE plot shows clean clustering of certain PTMs (e.g., acetylation, phosphorylation), such embedding proximity can arise from shared local context windows or training bias rather than from deeper biological semantics. This is especially a concern under the biased masking scheme and in the absence of independent functional supervision.
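A simple complement to the t-SNE figure would be to quantify the cluster structure in the full embedding space, so that the result cannot be an artifact of the 2-D projection. A minimal sketch, where the embedding matrix and PTM-type labels are placeholders:

```python
import numpy as np
from sklearn.metrics import silhouette_score

def ptm_embedding_separation(embeddings: np.ndarray, ptm_labels: np.ndarray) -> float:
    """Silhouette score of PTM-type clusters in the original embedding space.
    embeddings: (n_tokens, hidden_dim) PTM token embeddings
    ptm_labels: (n_tokens,) integer PTM-type labels (e.g., acetylation = 0, ...)"""
    return silhouette_score(embeddings, ptm_labels, metric="cosine")
```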


Thank you again for your exciting contribution to protein representation learning. I hope these questions spark constructive discussion, and I would love to hear your response.

Best regards,
[涟Ripple]
