training with a subset of the training data #1093

Open
jeswan opened this issue Sep 17, 2020 · 2 comments

@jeswan (Contributor) commented Sep 17, 2020

Issue by lovodkin93
Friday May 22, 2020 at 11:06 GMT
Originally opened as nyu-mll/jiant#1093


Hello,
I would like to probe the BERT module on the coreference task, but I want training to use only examples where the span distance is greater than a certain size.
My question is: should I add this constraint only in the training loop of the train function in /jiant/trainer.py, or is there another part of the code I need to update?
Thanks!

@jeswan (Contributor, Author) commented Sep 17, 2020

Comment by sleepinyourhat
Friday May 22, 2020 at 13:32 GMT


I don't think there's an easy/obvious place to add this.

If you need the tokenized text to determine distance, then it would make sense to add the filter to the training data iterator, but I can't guarantee that it'll be easy to identify the relevant code path.

If you don't need the tokenized text to filter, it might be simplest to create a filtered copy of the training data file and add a new @register_task declaration above the task definition in jiant/tasks, creating a new task with a new name and data directory but the same loading and modeling logic.
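The registration pattern described above might look roughly like this. Note this is a hedged sketch: the decorator below is a minimal self-contained stand-in for jiant's actual `@register_task` (whose real signature lives in jiant's task registry module), and the task name `coref-long-span`, the `rel_path`, and the class are all hypothetical placeholders.

```python
# Minimal stand-in for jiant's task registry, to illustrate the pattern
# of registering a new task that reuses existing task logic but points
# at a different (pre-filtered) data directory. Names are assumptions.

TASK_REGISTRY = {}

def register_task(name, rel_path):
    """Record a task class under `name`, associated with a data directory."""
    def decorator(cls):
        TASK_REGISTRY[name] = (cls, rel_path)
        return cls
    return decorator

@register_task("coref-long-span", rel_path="coref_long_span/")
class LongSpanCorefTask:
    """Hypothetical task: same loading/modeling logic as the original
    coreference task, pointed at a filtered copy of the training data."""
    pass
```

In real jiant code you would subclass (or simply re-register) the existing coreference task class rather than define an empty one; the only change is the task name and the directory holding the filtered data.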

@jeswan (Contributor, Author) commented Sep 17, 2020

Comment by lovodkin93
Friday May 22, 2020 at 15:52 GMT


> I don't think there's an easy/obvious place to add this.
>
> If you need the tokenized text to determine distance, then it would make sense to add the filter to the training data iterator, but I can't guarantee that it'll be easy to identify the relevant code path.
>
> If you don't need the tokenized text to filter, it might be simplest to create a filtered copy of the training data file and add a new @register_task declaration above the task definition in jiant/tasks, creating a new task with a new name and data directory but the same loading and modeling logic.

So what you're saying is that I should filter out sentences from OntoNotes that contain coreferent mentions that are too close together (since my goal is to work only with coreference spans separated by more than a certain distance)?
After looking at the OntoNotes training data file, I saw that it basically consists of tables of words from sentences along with their various attributes (POS, coreference, etc.), so I don't really see how I can filter the data other than removing full sentences.
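Removing full sentences based on mention distance could be sketched as follows. This is an illustrative, simplified parser for a CoNLL-style coreference column (one tag per token, e.g. `(3`, `3)`, `(3)`, `-`); real OntoNotes files have many more columns and can contain nested mentions with the same entity id, which this sketch does not handle. All function names and the distance definition (token gap between consecutive mentions of the same entity) are assumptions, not part of jiant.

```python
import re
from collections import defaultdict

def coref_spans(coref_col):
    """Parse a CoNLL-style coreference column into {entity_id: [(start, end), ...]}.

    Simplified: assumes at most one open span per entity id at a time
    (i.e. no nested mentions sharing an id).
    """
    open_spans = {}           # entity_id -> start index of a currently open span
    spans = defaultdict(list)
    for i, tag in enumerate(coref_col):
        for part in tag.split("|"):
            if part == "-":
                continue
            m = re.match(r"\((\d+)\)$", part)
            if m:             # single-token mention, e.g. "(3)"
                spans[int(m.group(1))].append((i, i))
                continue
            m = re.match(r"\((\d+)$", part)
            if m:             # span opens here, e.g. "(3"
                open_spans[int(m.group(1))] = i
                continue
            m = re.match(r"(\d+)\)$", part)
            if m:             # span closes here, e.g. "3)"
                eid = int(m.group(1))
                spans[eid].append((open_spans.pop(eid), i))
    return spans

def keep_sentence(coref_col, min_distance):
    """True if every pair of consecutive coreferent mentions in the sentence
    is at least `min_distance` tokens apart (gap between span boundaries)."""
    for mentions in coref_spans(coref_col).values():
        mentions.sort()
        for (s1, e1), (s2, e2) in zip(mentions, mentions[1:]):
            if s2 - e1 < min_distance:
                return False
    return True
```

A filtering script would read the OntoNotes file sentence by sentence (sentences are separated by blank lines), extract the coreference column, and write the sentence to the filtered copy only when `keep_sentence` returns True; the filtered copy then becomes the data directory for the newly registered task.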
