Questions about details in the paper #3

Open
CIawevy opened this issue Apr 8, 2023 · 2 comments

Comments

@CIawevy

CIawevy commented Apr 8, 2023

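Hello, I enjoyed studying your excellent paper, and the visualization of the whole model is very well done. I have a question from my reading that I hope you can help resolve.
1. In the feature extraction based on the vision transformer block, as shown in the figure below, multiple arrows feed into NL-TEM. The paper does not describe how this step is done. Specifically, how many tokens are passed in, and what is n set to for the transformer block? These two questions are also related to my next question.
image
2. The paper says the FSD uses 12 AMIs in total, going 12-6-3-2-1. It seems the number of tokens initially fed to the decoder is 12, since that is what matches the paper's description. As I understand it, for n input tokens NL-TEM produces n corresponding locality-enhanced images. The conclusion would be that the length of the token sequence fed to NL-TEM should also be 12, which is strange. Once question 1 (what the transformer block outputs) is answered, I would also like to know: between NL-TEM and the FSD, is an MLP or some other method used to reduce the dimension to 12?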
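To make the counting in question 2 concrete, here is a minimal sketch of the arithmetic. It assumes one module per output at each stage of the 12-6-3-2-1 schedule; that is only one possible reading and is not confirmed by the paper or by this thread. The function name `modules_per_stage` is hypothetical.

```python
def modules_per_stage(stage_sizes):
    """Modules used at each stage, ASSUMING one module per stage output.

    stage_sizes lists the number of feature maps at each level,
    starting with the decoder input (e.g. 12) and ending at 1.
    """
    # Every level after the first is produced by that many modules.
    return list(stage_sizes[1:])


stages = [12, 6, 3, 2, 1]          # schedule quoted in the question
per_stage = modules_per_stage(stages)
total = sum(per_stage)

print(per_stage)  # [6, 3, 2, 1]
print(total)      # 12
```

Under that assumption the total of 12 modules is consistent with 12 inputs to the decoder, but it does not say where those 12 inputs come from.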

I hope this gets your attention. Thank you!

@CIawevy
Author

CIawevy commented Apr 8, 2023

As I recall, ViT does not change the number of output tokens, so why is it 12 rather than W*H/(s**2) tokens? Thank you!
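For reference, the W*H/(s**2) count mentioned above can be sketched as follows. This is a minimal illustration of the standard ViT patch arithmetic, not code from the repository; the function name `vit_token_count` and the 384x384 image with patch size 16 are assumptions chosen only for the example.

```python
def vit_token_count(height: int, width: int, patch_size: int) -> int:
    """Number of patch tokens for an H x W image split into s x s patches.

    The class token, if any, is not counted.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by the patch size")
    return (height // patch_size) * (width // patch_size)


# Hypothetical sizes, for illustration only.
print(vit_token_count(384, 384, 16))  # 576
```

A plain ViT encoder keeps this token count unchanged from block to block, which is why a count of 12 cannot be the number of patch tokens for typical image and patch sizes.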

@ZhouHuang23
Owner

Thank you for your interest in our work. The code has been uploaded; please check it for the details.
