forked from acoladgroup/OpenNMT-py
Dynamic refactoring #2
Open · francoishernandez wants to merge 16 commits into master from dynamic_wip
Conversation
* bin/train_dynamicdata.py
* opts: --data_config and --bucket_size
* train_single.main_dynamicdata
* dynamicdata package
* broken
* still broken: not enough to get rid of immediate generator
* build_data_loader in producer
* nfeats 0, not 1
* fixes to mismatching batch structure
* looks weird reporting while filling up queue
* bin/preprocess_dynamicdata.py
* enable PrefixTransform to work with only src
* WIP: translator for dynamicdata
* WIP: opened too early
* dynamicdata runs
* removed obsolete function
* restore mixing weight schedule counter
* pop all expired mixing weights at once
* compressed inputs
* allow missing transforms during sharding
* fail fast on missing files
* bad cleverness in vocab paths
* repr for transforms
* SentencepieceTransform
* send cpu tensors from producer to consumer.
  The data loader cannot access the GPU when CUDA Compute Mode is set to Exclusive_Process; trying to do so results in "CUDA error: all CUDA-capable devices are busy or unavailable". As the training process holds the GPU, it is responsible for calling torch.device to move data from cpu to gpu.
* torch.device moved to wrong place; also missing from validation
* prints to logging
* access subword vocab from transforms
* SwitchOut
* WbNoiseTransform
* multiple transform pipelines with share_inputs
* reverse: flips src and trg for backtranslation
* saving and loading data_loader_step
* pre-de-tokenization
* MorfessorEmTabooPeturbedTransform
* logging (on and off)
* parameter seg_n_samples
* protect from clobbering
* extra_prefix
* comment
* bin/frankenstein.py
* prune matching keys when config doesn't match
* insertion transform
* order of _inputs shouldn't matter
* Sentencepiece sampling params
* need to store in pickle
* started writing dynamicdata documentation
* critique of the current dataloader
* minor
* concepts
* usage
* dyndata.md
* updated fig
* better naming
* pep8
* translate opt
* re.groups should not be changed to tasks
* rename group to task in translator
* replaced readme with dynamicdata version
* dynamicdata requirements as optional
* DeterministicSegmentationTransform
* WIP: bin/debug_dynamicdata.py
* bin/debug_dynamicdata.py
* DeterministicSegmentationTransform
* sanity checks to SentencepieceTransform
* update README.md
* formatting
* template_dynamicdata.py
* cleaner structure of sharded directory: separate shards from vocabs/transforms
* pep8
* completed some TODOs
* automatically determine input corpus sizes
* use line counts to ensure shards of even size
* update README.md
* don't die on blanks
* spm_to_vocab.py, minor bugfixes
* missing subword vocab conversion step from SentencePiece usage
* clarified that meta.train.vocab_path points to subword vocab
* morfessor expected counts to vocab
* use Morfessor EM+Prune models directly as vocab
* pretokenize=True silently ignored
* documentation: pretokenization, exclusive_process
* arxiv link
* documentation: dampening, pretokenizing test set
* mention BART
* minor
* note about BPE-Dropout
* reverse README
* fix multiprocess & pooling
* some cleaning
* merge train_single
* fix gpu_rank for cpu
* add adapted dynamic data iteration mechanism
* some cleaning
* clean train script
* remove deprecated code
* remove deprecated code, patch 2
* fix multiprocessing not pickling generator for dynamic training
* fix some mentioned issues & update train config example file
* more transforms
* drop sharding, iterate directly on full corpus
* transform composite and statistics
* add some documentation
* improve transform performance related to sampling
* add ONMTTokenizer Transform & use new interface of sentencepiece
* MixingStrategy for DatasetIter to make it interchangeable
* fix related to valid_iter
* fix translate with dynamic trained model
* misc fix
* add BPE
* save dynamic sample if verbose
* [WIP] Some adaptations to the dynamic data pipeline (#4)
  * wip simplify shards stuff
  * some cleanup
  * rename shard to corpus
  * some cleanup
  * multiple producers, move transform up in code execution
  * apply transform properly in verbose mode
  * do not use cycle, which causes memory leak
  * make one queue per gpu per producer, to preserve reproducibility
  * enable bpe dropout for pyonmttok, v>1.19, fix inverted dropout value
  * terminate producers properly when training processes are done
  * single queue per producer/consumer couple
  * revert semaphore size change
  * make transforms a folder
  * add bart noise as transform
  * register transform to support extension (acoladgroup#6)
  * fix to BART
  * remove preprocess from dynamic (acoladgroup#8)
  * do not make links, make counters
  * add initial build_vocab script
  * merge preprocess into train
  * compute index with stride and offset
  * remove preprocess & some cleaning
  Co-authored-by: François Hernandez <[email protected]>
* enable transforms behavior change between train/valid
* fix minor issue with bart
* fix reproducibility for transforms (acoladgroup#9)
* update doc & config example
Co-authored-by: Stig-Arne Gronroos <[email protected]>
Co-authored-by: François Hernandez <[email protected]>
* drop old-style vocab support
* drop video captioning
* drop speech2text
* drop image2text
* drop previous experimental BART
* gather iterators in inputter.py into a separate file
* minor fix related to flask
* fix rebuild_test_model error
* update test_model
* add back img, audio, vid documentation in legacy
* pack src & tgt into a dict in transform
* add dynamic_dict support to dynamic iter
* add seq_len_trunc options to dynamic
* drop previous preprocess & training modules
* rewrite pr_check and some tests
* fix doc build exception
* add ggnn check & fix ggnn
* add support for align file
* add missing align data config
* raise error when running align with some transforms
* move dynamic train to bin
* move dynamic build_vocab to bin
* fix train entry
* beautify opts.py & rm unused sphinx doc
* remove src_dir related code
* add bart's __repr__
* use config for data_config
* remove unused dependency pretrainedmodels
* update build_vocab opts
* fix subword_type none for onmt_tok
* clean option validation
* minimum testing data.yaml
* better prefix transform
* rename dynamic.vocab to dynamic.fields
* add entry point for bin/build_vocab.py
* exit training after save sample
* fix bart's poisson_lambda description
* fix data_prepare test error due to previous change
* better api for DynamicDatasetIter
* extend n_sample
* silently default valid corpus weight
* make transform class option check extensible
* handle empty line
* clean train script
* embed pretrained embeddings stuff at beginning of train
* fix some flake8 issues
* move functions to embedding module
* remove functions from misc
* move embedding options check to right place
Co-authored-by: Linxiao ZENG <[email protected]>
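The "register transform to support extension" commit suggests a decorator-based registry, so that new transforms can be added without touching core code. Below is a hedged sketch of that pattern; `AVAILABLE_TRANSFORMS` and `register_transform` are assumed names, and the `PrefixTransform` body is simplified for illustration, not copied from the PR:

```python
# Sketch of a decorator-based transform registry (names are assumptions).
AVAILABLE_TRANSFORMS = {}

def register_transform(name):
    """Class decorator: add the transform class to the registry under `name`."""
    def wrapper(cls):
        AVAILABLE_TRANSFORMS[name] = cls
        return cls
    return wrapper

@register_transform("prefix")
class PrefixTransform:
    """Toy transform: prepend a control token (e.g. a target-language tag)."""
    def __init__(self, prefix):
        self.prefix = prefix

    def apply(self, tokens):
        return [self.prefix] + tokens

# Config-driven lookup: the name in the YAML config selects the class.
t = AVAILABLE_TRANSFORMS["prefix"]("<to_en>")
print(t.apply(["hello", "world"]))  # ['<to_en>', 'hello', 'world']
```

The registry is what lets a config file name its transform pipeline as plain strings while third-party modules extend the set simply by importing and decorating.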
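The mixing-weight commits ("restore mixing weight schedule counter", "MixingStrategy for DatasetIter to make it interchangeable") imply weighted interleaving of several corpora. One plausible reading is sketched below, deliberately avoiding `itertools.cycle` (which a commit above flags as leaking memory); the function name and the "take `weight` examples per round" semantics are assumptions, not the PR's code:

```python
# Hypothetical weighted mixing: per round, draw `weights[name]` examples
# from each corpus, wrapping around small corpora so they are oversampled.
def weighted_mix(corpora, weights):
    """Endless stream interleaving corpora according to integer weights."""
    positions = {name: 0 for name in corpora}
    while True:
        for name, examples in corpora.items():
            for _ in range(weights[name]):
                yield examples[positions[name] % len(examples)]
                positions[name] += 1

corpora = {"big": ["B1", "B2", "B3"], "small": ["s1"]}
stream = weighted_mix(corpora, {"big": 2, "small": 1})
print([next(stream) for _ in range(6)])  # ['B1', 'B2', 's1', 'B3', 'B1', 's1']
```

Keeping explicit positions instead of cycling iterators makes the state trivially picklable, which matters given the "fix multiprocessing not pickling generator" commit above.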
No description provided.