Concerns about the PTM-Mamba training task design and its biological validity #5

@Cai-z-us


Dear authors,

First of all, thank you for sharing the excellent work on PTM-Mamba. I’ve read the paper and reviewed the [token embedding t-SNE figure](https://github.com/programmablebio/ptm-mamba/tree/main/ptm_data_preprocessing), training methods, and the downstream evaluations. I find the overall framework promising. However, I have some technical concerns about the training task design and its ability to reflect biological understanding, and I hope to hear your thoughts.


❗ Concern 1: Masked language modeling may favor shallow context patterns over semantic understanding

The training objective of PTM-Mamba is masked language modeling (MLM), with the following masking strategy:

  • 80% of sequences use the standard 15% random masking,
  • 20% of sequences mask all PTM tokens plus a random 15% of wild-type tokens.

This approach can effectively enhance PTM token learning, but it also risks encouraging the model to memorize the positional or contextual patterns in which PTMs appear rather than to understand their functional roles in protein biology. For frequently occurring PTMs in conserved motifs in particular, the model may perform well through shallow pattern matching instead of semantic generalization.
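For concreteness, here is a minimal sketch of the two-branch masking scheme as I understand it; the token ids, `MASK_ID`, and `PTM_IDS` below are placeholders rather than your actual vocabulary:

```python
import torch

MASK_ID = 0                            # placeholder id for the mask token (assumption)
PTM_IDS = torch.tensor([33, 34, 35])   # placeholder ids for PTM tokens (assumption)

def mask_sequence(tokens: torch.Tensor, p_ptm_branch: float = 0.2) -> torch.Tensor:
    """Two-branch masking as I read it from the paper:
    with probability 0.8, mask 15% of positions uniformly at random;
    with probability 0.2, mask every PTM token plus 15% of wild-type tokens."""
    masked = tokens.clone()
    rand15 = torch.rand(tokens.shape) < 0.15        # random 15% of positions
    if torch.rand(1).item() < p_ptm_branch:
        is_ptm = torch.isin(tokens, PTM_IDS)        # all PTM positions
        masked[is_ptm | (rand15 & ~is_ptm)] = MASK_ID
    else:
        masked[rand15] = MASK_ID
    return masked
```

My worry is that in the PTM branch the surrounding wild-type context is always visible, so the masked PTM can often be recovered from local motif statistics alone.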


❗ Concern 2: Task-objective mismatch between MLM and PTM biological function

Predicting the identity of masked tokens in context (as in MLM) is not inherently aligned with real PTM-related tasks such as:

  • Predicting whether a PTM occurs at a site (site-level classification),
  • Understanding its causal effect on protein function, interaction, or localization,
  • Reasoning over PTM-related disease associations.

Thus, the current objective may train an excellent token-reconstruction model, but not necessarily a model with a functional understanding of PTMs.
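To make the mismatch concrete, a site-level objective would supervise the model directly on the quantity of interest. A minimal sketch of such a head follows; the embedding width and the way per-residue embeddings are obtained from PTM-Mamba are assumptions on my part:

```python
import torch
import torch.nn as nn

class SitePTMHead(nn.Module):
    """Per-residue binary classifier on top of encoder embeddings.
    `hidden_dim` is a placeholder for the model's actual embedding width."""
    def __init__(self, hidden_dim: int = 1024):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, 1)

    def forward(self, residue_embeddings: torch.Tensor) -> torch.Tensor:
        # residue_embeddings: (batch, seq_len, hidden_dim)
        # returns one logit per position: "does a PTM occur at this site?"
        return self.classifier(residue_embeddings).squeeze(-1)

# Training such a head uses per-site labels with BCEWithLogitsLoss, i.e. the loss
# is on site occurrence itself rather than on reconstructing masked tokens.
loss_fn = nn.BCEWithLogitsLoss()
```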


❗ Concern 3: Downstream benchmarks may not sufficiently validate PTM token understanding

Most downstream tasks (e.g., phosphorylation site prediction) use wild-type sequences as inputs even for PTM-Mamba. This limits the ability to test whether the added PTM token embeddings capture meaningful biology. Only the PPI benchmark (PTMint) appears to incorporate PTM-tokenized input explicitly.

It would be more convincing if:

  • PTM tokens were explicitly used in more evaluation tasks,
  • Zero-shot generalization to unseen PTMs were measured,
  • Or contrastive evaluations were added (e.g., comparing wild-type vs PTM-modified embeddings; a sketch follows below).
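On the last point, here is an example of the kind of contrastive check I have in mind; `embed` is a placeholder for however PTM-Mamba exposes per-residue embeddings, not a real API:

```python
import torch
import torch.nn.functional as F

def wt_vs_ptm_shift(embed, wt_tokens: torch.Tensor, ptm_tokens: torch.Tensor) -> float:
    """Cosine similarity between mean-pooled embeddings of a wild-type sequence
    and its PTM-tokenized counterpart.  `embed` is assumed to map a (seq_len,)
    token tensor to a (seq_len, hidden_dim) embedding matrix."""
    wt_repr = embed(wt_tokens).mean(dim=0)
    ptm_repr = embed(ptm_tokens).mean(dim=0)
    return F.cosine_similarity(wt_repr, ptm_repr, dim=0).item()
```

If the PTM tokens encode biology rather than surface co-occurrence, one would expect larger representation shifts for functionally consequential modifications than for inert ones.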

❗ Concern 4: t-SNE embedding visualization may not reflect real semantic structure

While the t-SNE plot shows clean clustering of certain PTMs (e.g., acetylation, phosphorylation), such embedding proximity can arise from shared local context windows or training bias rather than from deeper biological semantics. This is especially a concern under the biased masking scheme and in the absence of independent functional supervision.
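A simple complement to the t-SNE figure would be to quantify the cluster structure in the full embedding space, so that the result cannot be an artifact of the 2-D projection. A minimal sketch, where the embedding matrix and PTM-type labels are placeholders:

```python
import numpy as np
from sklearn.metrics import silhouette_score

def ptm_embedding_separation(embeddings: np.ndarray, ptm_labels: np.ndarray) -> float:
    """Silhouette score of PTM-type clusters in the original embedding space.
    embeddings: (n_tokens, hidden_dim) PTM token embeddings
    ptm_labels: (n_tokens,) integer PTM-type labels (e.g., acetylation = 0, ...)"""
    return silhouette_score(embeddings, ptm_labels, metric="cosine")
```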


Thank you again for your exciting contribution to protein representation learning. I hope these questions spark constructive discussion, and I would love to hear your response.

Best regards,
[涟Ripple]
