training with a subset of the training data #1093

Open
jeswan opened this issue Sep 17, 2020 · 2 comments

@jeswan (Contributor) commented Sep 17, 2020

Issue by lovodkin93
Friday May 22, 2020 at 11:06 GMT
Originally opened as nyu-mll/jiant#1093


Hello,
I would like to probe the BERT module on the coreference task, but I want training to use only examples where the span distance is greater than a certain size.
My question is: should I add this constraint only in the training loop of the train function in /jiant/trainer.py, or is there another part of the code I need to update?
Thanks!

@jeswan (Contributor, Author) commented Sep 17, 2020

Comment by sleepinyourhat
Friday May 22, 2020 at 13:32 GMT


I don't think there's an easy/obvious place to add this.

If you need the tokenized text to determine distance, then it would make sense to add the filter to the training data iterator, but I can't guarantee that it'll be easy to identify the relevant code path.

If you don't need the tokenized text to filter, it might be simplest to create a filtered copy of the training data file and add a new @register_task declaration above the task definition in jiant/tasks, creating a new task with a new name and data directory but the same loading and modeling logic.
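The registration pattern described above might look roughly like this. Note this is a hedged sketch: the decorator below is a minimal self-contained stand-in for jiant's actual `@register_task` (whose real signature lives in jiant's task registry module), and the task name `coref-long-span`, the `rel_path`, and the class are all hypothetical placeholders.

```python
# Minimal stand-in for jiant's task registry, to illustrate the pattern
# of registering a new task that reuses existing task logic but points
# at a different (pre-filtered) data directory. Names are assumptions.

TASK_REGISTRY = {}

def register_task(name, rel_path):
    """Record a task class under `name`, associated with a data directory."""
    def decorator(cls):
        TASK_REGISTRY[name] = (cls, rel_path)
        return cls
    return decorator

@register_task("coref-long-span", rel_path="coref_long_span/")
class LongSpanCorefTask:
    """Hypothetical task: same loading/modeling logic as the original
    coreference task, pointed at a filtered copy of the training data."""
    pass
```

In real jiant code you would subclass (or simply re-register) the existing coreference task class rather than define an empty one; the only change is the task name and the directory holding the filtered data.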

@jeswan (Contributor, Author) commented Sep 17, 2020

Comment by lovodkin93
Friday May 22, 2020 at 15:52 GMT


> I don't think there's an easy/obvious place to add this.
>
> If you need the tokenized text to determine distance, then it would make sense to add the filter to the training data iterator, but I can't guarantee that it'll be easy to identify the relevant code path.
>
> If you don't need the tokenized text to filter, it might be simplest to create a filtered copy of the training data file and add a new @register_task declaration above the task definition in jiant/tasks, creating a new task with a new name and data directory but the same loading and modeling logic.

So what you're saying is that I should filter out sentences from OntoNotes that contain coreferent mentions that are too close together (since my goal is to work only with coreference spans separated by more than a certain distance)?
After looking at the OntoNotes training data file, I saw that it basically consists of tables of words from sentences along with their various attributes (POS, coreference, etc.), so I don't really see how I can filter the data other than removing full sentences.
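Removing full sentences based on mention distance could be sketched as follows. This is an illustrative, simplified parser for a CoNLL-style coreference column (one tag per token, e.g. `(3`, `3)`, `(3)`, `-`); real OntoNotes files have many more columns and can contain nested mentions with the same entity id, which this sketch does not handle. All function names and the distance definition (token gap between consecutive mentions of the same entity) are assumptions, not part of jiant.

```python
import re
from collections import defaultdict

def coref_spans(coref_col):
    """Parse a CoNLL-style coreference column into {entity_id: [(start, end), ...]}.

    Simplified: assumes at most one open span per entity id at a time
    (i.e. no nested mentions sharing an id).
    """
    open_spans = {}           # entity_id -> start index of a currently open span
    spans = defaultdict(list)
    for i, tag in enumerate(coref_col):
        for part in tag.split("|"):
            if part == "-":
                continue
            m = re.match(r"\((\d+)\)$", part)
            if m:             # single-token mention, e.g. "(3)"
                spans[int(m.group(1))].append((i, i))
                continue
            m = re.match(r"\((\d+)$", part)
            if m:             # span opens here, e.g. "(3"
                open_spans[int(m.group(1))] = i
                continue
            m = re.match(r"(\d+)\)$", part)
            if m:             # span closes here, e.g. "3)"
                eid = int(m.group(1))
                spans[eid].append((open_spans.pop(eid), i))
    return spans

def keep_sentence(coref_col, min_distance):
    """True if every pair of consecutive coreferent mentions in the sentence
    is at least `min_distance` tokens apart (gap between span boundaries)."""
    for mentions in coref_spans(coref_col).values():
        mentions.sort()
        for (s1, e1), (s2, e2) in zip(mentions, mentions[1:]):
            if s2 - e1 < min_distance:
                return False
    return True
```

A filtering script would read the OntoNotes file sentence by sentence (sentences are separated by blank lines), extract the coreference column, and write the sentence to the filtered copy only when `keep_sentence` returns True; the filtered copy then becomes the data directory for the newly registered task.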
