Adds parser task using deep biaffine parser #120

Draft
kylebgorman wants to merge 32 commits into CUNY-CL:master from kylebgorman:parser2

@kylebgorman
Contributor

Draft.

Closes #72.

One major issue is that this requires us to use negative indices for
specials, which breaks assumptions in the indexes. Will have to come
back and fix this.

Known issues:

1. I don't think the metrics test is going to work; I will need to shift
   all the head indices by special.OFFSET.
2. I am not passing a parser mask. Do I need to? I think maybe yes.
It has no effect in the model, so let's get rid of it.
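
One way to avoid negative indices for specials, sketched here under assumptions, is to reserve the first few indices for specials and shift every head index up by that amount when encoding, then shift back before writing output or computing metrics. The names `OFFSET`, `PAD_IDX`, `encode_heads`, and `decode_heads` are hypothetical, and the value and exact meaning of `special.OFFSET` in this PR may differ.

```python
from typing import List

# Hypothetical values: two indices reserved for specials. The real
# special.OFFSET in this PR may be different.
PAD_IDX = 0
OFFSET = 2


def encode_heads(heads: List[int], length: int) -> List[int]:
    """Shifts CoNLL-U head indices (0 = root) up by OFFSET, then pads."""
    if len(heads) > length:
        raise ValueError("sentence is longer than the padded length")
    shifted = [head + OFFSET for head in heads]
    return shifted + [PAD_IDX] * (length - len(heads))


def decode_heads(indices: List[int]) -> List[int]:
    """Drops special indices and shifts the rest back down by OFFSET."""
    return [index - OFFSET for index in indices if index >= OFFSET]


encoded = encode_heads([2, 0, 2], length=5)
print(encoded)
print(decode_heads(encoded))
```

With this layout every stored index is non-negative, so gold and predicted heads can be compared after both pass through the same decoding step.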
* Adds logging for vocabularies

Sample output:

INFO: 22-Feb-26 17:56:27 - UPOS vocabulary (21): '[PAD]', '[UNK]', '_', 'ADJ', 'ADP', 'ADV', 'AUX', 'CCONJ', 'DET', 'INTJ', 'NOUN', 'NUM', 'PART', 'PRON', 'PROPN', 'PUNCT', 'SCONJ', 'SYM', 'VERB', 'X', '_'
INFO: 22-Feb-26 17:56:27 - XPOS vocabulary (53): '[PAD]', '[UNK]', '_', '$', "''", ',', '-LRB-', '-RRB-', '.', ':', 'ADD', 'AFX', 'CC', 'CD', 'DT', 'EX', 'FW', 'GW', 'HYPH', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NFP', 'NN', 'NNP', 'NNPS', 'NNS', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB', '_', '``'
INFO: 22-Feb-26 17:56:27 - Lemma vocabulary (533): [omitted]
INFO: 22-Feb-26 17:56:27 - Features vocabulary (235): [omitted]
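
A minimal sketch of how log lines in this shape could be produced with the standard `logging` module. The helper `describe_vocabulary` and its `human_readable` flag are hypothetical names for illustration, not the code in this PR.

```python
import logging
from typing import Sequence

# Matches the "INFO: 22-Feb-26 17:56:27 - ..." shape of the sample output.
logging.basicConfig(
    format="%(levelname)s: %(asctime)s - %(message)s",
    datefmt="%d-%b-%y %H:%M:%S",
    level=logging.INFO,
)


def describe_vocabulary(
    name: str, vocabulary: Sequence[str], human_readable: bool = True
) -> str:
    """Formats a one-line vocabulary summary; entries may be omitted."""
    if human_readable:
        entries = ", ".join(repr(symbol) for symbol in vocabulary)
    else:
        entries = "[omitted]"
    return f"{name} vocabulary ({len(vocabulary)}): {entries}"


upos = ["[PAD]", "[UNK]", "_", "ADJ", "NOUN", "VERB"]
logging.info(describe_vocabulary("UPOS", upos))
lemma_rules = [f"rule{i}" for i in range(533)]  # Placeholder entries.
logging.info(describe_vocabulary("Lemma", lemma_rules, human_readable=False))
```

Using `repr` for each symbol keeps tags such as `''` unambiguous, since Python switches to double quotes when a string contains single quotes.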

Closes CUNY-CL#115.

* black update

* f-string fix

* driveby: silence more warnings
* Fix pooling layer regression in UDTubeEncoder.forward

Special cases pooling_layers=1 to use last_hidden_state directly, avoiding
unnecessary allocation of all hidden states. This seems to save a lot of
GPU memory.
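
The special case can be sketched as follows. This is a pure-Python illustration, not the `UDTubeEncoder.forward` implementation: `fake_encoder` and `EncoderOutput` are hypothetical stand-ins for the real encoder, and mean pooling over the last `pooling_layers` layers is assumed.

```python
from dataclasses import dataclass
from typing import List, Optional

Vector = List[float]


@dataclass
class EncoderOutput:
    last_hidden_state: Vector
    hidden_states: Optional[List[Vector]] = None


def fake_encoder(num_layers: int, output_hidden_states: bool) -> EncoderOutput:
    """Hypothetical encoder: layer i (1-based) produces [i, 2 * i]."""
    layers = [[float(i), 2.0 * i] for i in range(1, num_layers + 1)]
    return EncoderOutput(
        last_hidden_state=layers[-1],
        # Intermediate layers are only kept when explicitly requested.
        hidden_states=layers if output_hidden_states else None,
    )


def pooled_encoding(num_layers: int, pooling_layers: int) -> Vector:
    """Averages the last `pooling_layers` layers of the encoder output."""
    if pooling_layers == 1:
        # Special case: the last layer alone suffices, so the other
        # hidden states are never requested or stored.
        return fake_encoder(num_layers, False).last_hidden_state
    output = fake_encoder(num_layers, True)
    assert output.hidden_states is not None
    selected = output.hidden_states[-pooling_layers:]
    return [sum(column) / len(selected) for column in zip(*selected)]


print(pooled_encoding(num_layers=4, pooling_layers=1))
print(pooled_encoding(num_layers=4, pooling_layers=2))
```

Averaging a single layer returns that layer unchanged, so the shortcut gives the same result as the general path while skipping the extra allocation.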

A few drive-bys:

1. suppress progress bar during test data generation
2. add "not human-readable" to "[omitted]" when logging lemmas
3. actually log features; why not?
4. pass information about which heads to build to the data module too,
   so it logs properly
5. removes _ from "special", since it doesn't require any special
   treatment in actuality; it's just another tag as far as we're
   concerned.
6. Standardizes trailing """: it's on its own line if the comment is
   more than one line.

* regeneration last-minute fix
Development

Successfully merging this pull request may close these issues.

Dependency parsing

1 participant