Draft

32 commits
d53531b
Adds metrics for parsing
kylebgorman Jan 14, 2026
749043b
Beginning integration
kylebgorman Jan 14, 2026
04b0b23
Adds metrics test.
kylebgorman Jan 15, 2026
28e1cdd
Draft of parser and its integration
kylebgorman Jan 15, 2026
8b0d96c
More work.
kylebgorman Jan 15, 2026
f8defcb
Applies shift to metrics test to avoid collisions.
kylebgorman Jan 15, 2026
a499d51
Moves reverse_edits to data, where it belongs.
kylebgorman Jan 15, 2026
c506566
Days' debugging work
kylebgorman Jan 18, 2026
386bed6
More work; still debugging
kylebgorman Jan 18, 2026
5057d25
Optimizes mmap instructions (#116)
kylebgorman Feb 22, 2026
e80df85
Updates Black version
kylebgorman Feb 22, 2026
3bf3d06
Adds logging for vocabularies (#117)
kylebgorman Feb 22, 2026
011cff4
Avoids "Crashed" status in sweeps. (#118)
kylebgorman Mar 18, 2026
032cfa3
Pooling layer efficiency (#119)
kylebgorman Apr 5, 2026
9053124
Update special.py
kylebgorman Apr 6, 2026
b75ba43
fix typo
kylebgorman Apr 6, 2026
7712ab4
Optimizes mmap instructions (#116)
kylebgorman Feb 22, 2026
9af9ac7
Adds logging for vocabularies (#117)
kylebgorman Feb 22, 2026
b5f2fd2
Avoids "Crashed" status in sweeps. (#118)
kylebgorman Mar 18, 2026
c1c5246
Pooling layer efficiency (#119)
kylebgorman Apr 5, 2026
f42e721
Beginning integration
kylebgorman Jan 14, 2026
bedb192
Adds metrics test.
kylebgorman Jan 15, 2026
3093858
Draft of parser and its integration
kylebgorman Jan 15, 2026
63f290a
More work.
kylebgorman Jan 15, 2026
64ff892
Moves reverse_edits to data, where it belongs.
kylebgorman Jan 15, 2026
a133a31
Days' debugging work
kylebgorman Jan 18, 2026
f58654d
More work; still debugging
kylebgorman Jan 18, 2026
b962e48
Optimizes mmap instructions (#116)
kylebgorman Feb 22, 2026
1683013
Pooling layer efficiency (#119)
kylebgorman Apr 5, 2026
df55a73
Manual merge
kylebgorman Apr 7, 2026
e2b916e
README and bibliography
kylebgorman Apr 7, 2026
b892d9e
manual merge of upstream/master
kylebgorman Apr 8, 2026
71 changes: 33 additions & 38 deletions README.md
@@ -60,31 +60,29 @@ Dependencies project](https://universaldependencies.org/).

UDTube can perform up to four morphological tasks simultaneously:

- Lemmatization is performed using the `LEMMA` field and [edit
scripts](https://aclanthology.org/P14-2111/).

- [Universal part-of-speech
tagging](https://universaldependencies.org/u/pos/index.html) is performed
using the `UPOS` field: enable with `data: use_upos: true`.

- Language-specific part-of-speech tagging is performed using the `XPOS`
field: enable with `data: use_xpos: true`.

- Morphological feature tagging is performed using the `FEATS` field: enable
with `data: use_feats: true`.
- Lemmatization is performed using the `LEMMA` field and edit scripts.
- [Universal part-of-speech
tagging](https://universaldependencies.org/u/pos/index.html) is performed
using the `UPOS` field.
- Language-specific part-of-speech tagging is performed using the `XPOS` field.
- Morphological feature tagging is performed using the `FEATS` field.
- Dependency parsing is performed using the `HEAD` and `DEPREL` fields, a deep
biaffine parser, and minimum spanning tree decoding.
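To see why minimum spanning tree decoding is needed rather than simply taking each token's highest-scoring head, consider the toy example below. The score matrix is invented for illustration only; it is not output from the actual biaffine parser.

```python
# Rows are dependents (tokens 1 and 2); columns are candidate heads
# (0 = root, 1 = token 1, 2 = token 2). Scores are made up.
scores = [
    [1.0, 0.0, 5.0],  # token 1's best-scoring head is token 2
    [0.0, 6.0, 2.0],  # token 2's best-scoring head is token 1
]
# Naive per-token argmax decoding:
greedy = [max(range(3), key=lambda h: scores[d][h]) for d in range(2)]
# greedy == [2, 1]: tokens 1 and 2 head each other, forming a cycle,
# so the result is not a valid dependency tree. Spanning-tree decoding
# over the same scores is guaranteed to return a tree rooted at 0.
```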

The following caveats apply:

- Note that many newer Universal Dependencies datasets do not have
language-specific part-of-speech-tags.
- The `FEATS` field is treated as a single unit and is not segmented in any
way.
- One can convert from [Universal Dependencies morphological
features](https://universaldependencies.org/u/feat/index.html) to [UniMorph
features](https://unimorph.github.io/schema/) using
[`scripts/convert_to_um.py`](scripts/convert_to_um.py).
- UDTube does not perform dependency parsing at present, so the `HEAD`,
`DEPREL`, and `DEPS` fields are ignored and should be specified as `_`.
- By default, lemmatization uses reverse-edit scripts. This is appropriate for
predominantly suffixal languages, which are thought to represent the majority
of the world's languages. If working with a predominantly prefixal language,
disable this with `data: reverse_edits: false`.
- Note that many newer Universal Dependencies datasets do not have
language-specific part-of-speech tags, so this task should be disabled
(`data: use_xpos: false`).
- The `FEATS` field is treated as a single unit and is not segmented in any way.
- One can convert from [Universal Dependencies morphological
features](https://universaldependencies.org/u/feat/index.html) to [UniMorph
features](https://unimorph.github.io/schema/) using
[`scripts/convert_to_um.py`](scripts/convert_to_um.py).
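The edit-script idea behind lemmatization can be sketched as follows. This is a simplified, end-anchored variant for illustration only, not UDTube's actual implementation; the function names are invented.

```python
def edit_script(form: str, lemma: str) -> tuple[int, str]:
    """Computes a script rewriting form into lemma.

    Returns (number of characters to strip from the end of the form,
    suffix to append). Anchoring edits at the word end suits suffixal
    morphology, the same intuition behind reverse-edit scripts.
    """
    # Find the longest common prefix of form and lemma.
    i = 0
    while i < min(len(form), len(lemma)) and form[i] == lemma[i]:
        i += 1
    return len(form) - i, lemma[i:]


def apply_script(form: str, script: tuple[int, str]) -> str:
    """Applies an edit script to a (possibly unseen) form."""
    strip, suffix = script
    return form[: len(form) - strip] + suffix


# edit_script("running", "run") == (4, ""); the same script
# generalizes to unseen forms: apply_script("jogging", (4, "")) == "jog".
```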

## Usage

@@ -189,7 +187,7 @@ information](https://github.com/CUNY-CL/yoyodyne/blob/master/README.md#logging).

#### Other options

By default, UDTube attempts to model all four tasks; one can disable the
By default, UDTube attempts to model all five tasks; one can disable the
language-specific tagging task using `model: use_xpos: false`, and so on.

Dropout probability is specified using `model: dropout: ...`.
@@ -198,25 +196,19 @@
The encoder has multiple layers. The input to the classifier consists of just
the last few layers mean-pooled together. The number of layers used for
mean-pooling is specified using `model: pooling_layers: ...`.
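The mean-pooling over the last few layers can be sketched as below. This is an illustrative stand-in, not the library's code: plain lists of floats play the role of per-layer hidden-state vectors.

```python
def mean_pool_layers(hidden_states, pooling_layers):
    """Averages the last `pooling_layers` layer vectors elementwise.

    hidden_states: one vector (list of floats) per encoder layer,
    ordered from first to last layer.
    """
    selected = hidden_states[-pooling_layers:]
    # zip(*selected) walks the vectors dimension by dimension.
    return [sum(column) / len(selected) for column in zip(*selected)]


layers = [[0.0, 2.0], [2.0, 4.0], [4.0, 6.0]]
# Pooling 1 layer returns the last layer; pooling 2 averages the last two.
pooled = mean_pool_layers(layers, 2)  # [3.0, 5.0]
```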

By default, lemmatization uses reverse-edit scripts. This is appropriate for
predominantly suffixal languages, which are thought to represent the majority of
the world's languages. If working with a predominantly prefixal language,
disable this with `model: reverse_edits: false`.

The following YAML snippet shows the default architectural arguments.

...
model:
dropout: 0.5
encoder: google-bert/bert-base-multilingual-cased
pooling_layers: 1
reverse_edits: true
use_upos: true
use_xpos: true
use_lemma: true
use_feats: true
use_parse: true
...


Batch size is specified using `data: batch_size: ...` and defaults to 32.

@@ -268,14 +260,14 @@ written.

Here are some additional details:

- In `predict` mode UDTube loads the file to be labeled incrementally (i.e.,
one sentence at a time) so this can be used with very large files.
- In `predict` mode, if no path for the predictions is specified, stdout will
be used. If using this in conjunction with \> or \|, add
`--trainer.enable_progress_bar false` on the command line.
- The target task fields are overriden if their heads are active.
- Use [`scripts/pretokenize.py`](scripts/pretokenize.py) to convert raw text
files to CoNLL-U input files.
- In `predict` mode UDTube loads the file to be labeled incrementally (i.e., one
sentence at a time) so this can be used with very large files.
- In `predict` mode, if no path for the predictions is specified, stdout will be
used. If using this in conjunction with \> or \|, add
`--trainer.enable_progress_bar false` on the command line.
- The target task fields are overridden if their heads are active.
- Use [`scripts/pretokenize.py`](scripts/pretokenize.py) to convert raw text
files to CoNLL-U input files.

This mode is invoked using the `predict` subcommand, like so:

@@ -322,3 +314,6 @@ following document, which describes the model:
Yakubov, D. 2024. [How do we learn what we cannot
say?](https://academicworks.cuny.edu/gc_etds/5622/) Master's thesis, CUNY
Graduate Center.

(See also [`udtube.bib`](udtube.bib) for more work used during the development
of this library.)
6 changes: 0 additions & 6 deletions configs/ewt_bert.yaml
@@ -22,12 +22,6 @@ trainer:
model:
dropout: 0.4
encoder: google-bert/bert-base-cased
pooling_layers: 4
reverse_edits: true
use_upos: true
use_xpos: true
use_lemma: true
use_feats: true
encoder_optimizer:
class_path: torch.optim.Adam
init_args:
7 changes: 1 addition & 6 deletions configs/ewt_distilbert.yaml
@@ -22,12 +22,6 @@ trainer:
model:
dropout: 0.4
encoder: distilbert/distilbert-base-cased
pooling_layers: 4
reverse_edits: true
use_upos: true
use_xpos: true
use_lemma: true
use_feats: true
encoder_optimizer:
class_path: torch.optim.Adam
init_args:
@@ -52,6 +46,7 @@ data:
test: /Users/Shinji/UD_English-EWT/en_ewt-ud-test.conllu
predict: /Users/Shinji/UD_English-EWT/en_ewt-ud-test.conllu
batch_size: 32
reverse_edits: true
checkpoint:
filename: "model-{epoch:03d}-{val_loss:.4f}"
monitor: val_loss
6 changes: 0 additions & 6 deletions configs/ewt_roberta.yaml
@@ -22,12 +22,6 @@ trainer:
model:
dropout: 0.4
encoder: FacebookAI/roberta-base
pooling_layers: 4
reverse_edits: true
use_upos: true
use_xpos: true
use_lemma: true
use_feats: true
encoder_optimizer:
class_path: torch.optim.Adam
init_args:
5 changes: 0 additions & 5 deletions configs/syntagrus_mbert.yaml
@@ -22,12 +22,7 @@ trainer:
model:
dropout: 0.4
encoder: google-bert/bert-base-multilingual-cased
pooling_layers: 4
reverse_edits: true
use_upos: true
use_xpos: false
use_lemma: true
use_feats: true
encoder_optimizer:
class_path: torch.optim.Adam
init_args:
5 changes: 0 additions & 5 deletions configs/syntagrus_rubert.yaml
@@ -22,12 +22,7 @@ trainer:
model:
dropout: 0.4
encoder: DeepPavlov/rubert
pooling_layers: 4
reverse_edits: true
use_upos: true
use_xpos: false
use_lemma: true
use_feats: true
encoder_optimizer:
class_path: torch.optim.Adam
init_args:
5 changes: 0 additions & 5 deletions configs/syntagrus_xlm-roberta.yaml
@@ -22,12 +22,7 @@ trainer:
model:
dropout: 0.4
encoder: FacebookAI/xlm-roberta-base
pooling_layers: 4
reverse_edits: true
use_upos: true
use_xpos: false
use_lemma: true
use_feats: true
encoder_optimizer:
class_path: torch.optim.Adam
init_args:
8 changes: 5 additions & 3 deletions examples/wandb_sweeps/configs/ewt_grid.yaml
@@ -1,4 +1,4 @@
method: random
method: bayes
metric:
name: val_loss
goal: minimize
@@ -10,6 +10,7 @@ parameters:
min: 0
max: 0.5
model.encoder:
distribution: categorical
values:
- FacebookAI/roberta-base
- distilbert/distilbert-base-cased
@@ -18,7 +19,7 @@ parameters:
distribution: q_uniform
q: 1
min: 1
max: 8
max: 4
model.encoder_optimizer.class_path:
value: torch.optim.Adam
model.encoder_optimizer.init_args.lr:
@@ -31,7 +32,7 @@ parameters:
distribution: q_uniform
q: 1
min: 1
max: 20
max: 40
model.classifier_optimizer.class_path:
value: torch.optim.Adam
model.classifier_optimizer.init_args.lr:
@@ -49,6 +50,7 @@ parameters:
model.classifier_scheduler.init_args.patience:
value: 5
data.batch_size:
distribution: categorical
values:
- 8
- 16
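The switch from `method: random` to `method: bayes` is paired here with declaring `distribution: categorical` explicitly for the discrete choices. A minimal sweep fragment in that shape, using only values visible in this diff (values elided by the diff are omitted), might look like:

```yaml
method: bayes
metric:
  name: val_loss
  goal: minimize
parameters:
  model.pooling_layers:
    distribution: q_uniform
    q: 1
    min: 1
    max: 4
  data.batch_size:
    distribution: categorical
    values:
      - 8
      - 16
```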
8 changes: 5 additions & 3 deletions examples/wandb_sweeps/configs/gdt_grid.yaml
@@ -1,4 +1,4 @@
method: random
method: bayes
metric:
name: val_loss
goal: minimize
@@ -10,14 +10,15 @@ parameters:
min: 0
max: 0.5
model.encoder:
distribution: categorical
values:
- google-bert/bert-base-multilingual-cased
- FacebookAI/xlm-roberta-base
model.pooling_layers:
distribution: q_uniform
q: 1
min: 1
max: 8
max: 4
model.encoder_optimizer.class_path:
value: torch.optim.Adam
model.encoder_optimizer.init_args.lr:
@@ -30,7 +31,7 @@ parameters:
distribution: q_uniform
q: 1
min: 1
max: 20
max: 40
model.classifier_optimizer.class_path:
value: torch.optim.Adam
model.classifier_optimizer.init_args.lr:
@@ -48,6 +49,7 @@ parameters:
model.classifier_scheduler.init_args.patience:
value: 5
data.batch_size:
distribution: categorical
values:
- 8
- 16
8 changes: 5 additions & 3 deletions examples/wandb_sweeps/configs/syntagrus_grid.yaml
@@ -1,4 +1,4 @@
method: random
method: bayes
metric:
name: val_loss
goal: minimize
@@ -10,6 +10,7 @@ parameters:
min: 0
max: 0.5
model.encoder:
distribution: categorical
values:
- google-bert/bert-base-multilingual-cased
- FacebookAI/xlm-roberta-base
@@ -18,7 +19,7 @@ parameters:
distribution: q_uniform
q: 1
min: 1
max: 8
max: 4
model.encoder_optimizer.class_path:
value: torch.optim.Adam
model.encoder_optimizer.init_args.lr:
@@ -31,7 +32,7 @@ parameters:
distribution: q_uniform
q: 1
min: 1
max: 20
max: 40
model.classifier_optimizer.class_path:
value: torch.optim.Adam
model.classifier_optimizer.init_args.lr:
@@ -49,6 +50,7 @@ parameters:
model.classifier_scheduler.init_args.patience:
value: 5
data.batch_size:
distribution: categorical
values:
- 8
- 16
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[project]
name = "udtube"
version = "0.1.12"
version = "0.2.0"
description = "Neural morphological analysis"
license = "Apache-2.0"
readme = "README.md"