First of all, thank you for this fantastic and forward-looking work! The idea of decoupling memory operations into a self-evolving skill bank is highly inspiring for the community's research on long-term agent memory mechanisms.
I am currently doing a deep dive into the codebase to fully grasp the architectural choices of the PPOController. While reviewing the implementation, I noticed a structural discrepancy between the mathematical formulation in the paper and the actual PyTorch implementation regarding the scoring mechanism.
1. The Formulation in the Paper
In Section 3.3.1, the logits for skill selection are described conceptually as a dot product between the state embedding and the skill embedding:
$$z_{t,i} = h_t^\top u_i$$
This implies a standard dense retrieval paradigm where the skill representation $u_i$ is static and the match is based on geometric similarity.
2. The Implementation in the Code
However, in src/controller.py (PPOController), the architecture appears to be a dual-encoder with a cross-encoder interaction layer:
- Both the state and the operations pass through separate trainable MLPs (state_net and op_net).
- Instead of a dot product, the state and operation representations are concatenated and passed through a third MLP (actor_head) to output the scalar logit:
$$z_{t,i} = \text{MLP}_{\text{actor}}([h_t \parallel o_i])$$
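To make the structural difference concrete, here is a minimal numpy sketch of the two scoring mechanisms side by side. This is not the actual implementation: the dimensions are illustrative, and single linear layers stand in for state_net, op_net, and actor_head.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_ops = 8, 4                      # illustrative embedding dim / op count

h_t = rng.normal(size=d)             # state embedding h_t
U = rng.normal(size=(n_ops, d))      # static skill embeddings u_i

# (a) Paper formulation: z_{t,i} = h_t^T u_i (dense-retrieval dot product).
z_dot = U @ h_t                      # shape (n_ops,)

# (b) Code formulation as I read src/controller.py: encode state and op
# separately, concatenate, then score with a third network. Single
# linear layers are hypothetical stand-ins for the actual MLPs.
W_state = rng.normal(size=(d, d))    # stand-in for state_net
W_op = rng.normal(size=(d, d))       # stand-in for op_net
w_actor = rng.normal(size=2 * d)     # stand-in for actor_head

h_enc = np.maximum(h_t @ W_state, 0.0)         # encoded state (ReLU)
z_mlp = np.array([
    np.concatenate([h_enc, np.maximum(o @ W_op, 0.0)]) @ w_actor
    for o in U
])                                   # shape (n_ops,)

print(z_dot.shape, z_mlp.shape)      # both produce one logit per operation
```

The key contrast: in (a) the interaction between state and skill is fixed to an inner product over static $u_i$, while in (b) the interaction function itself is learned, since actor_head can model non-linear state-operation dependencies.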
My Questions:
Design Philosophy: Was the dot product $h_t^\top u_i$ in the paper intended as a theoretical simplification?
Clarifying this would be incredibly helpful for researchers (like myself) who are looking to build upon or evaluate your architectural design choices.
Thank you so much for your time and for open-sourcing this excellent project!