forked from acoladgroup/OpenNMT-py
Dynamic refactoring #2
Open · francoishernandez wants to merge 16 commits into master from dynamic_wip
Conversation
* bin/train_dynamicdata.py
* opts: --data_config and --bucket_size
* train_single.main_dynamicdata
* dynamicdata package
* broken
* still broken: not enough to get rid of immediate generator
* build_data_loader in producer
* nfeats 0, not 1
* fixes to mismatching batch structure
* looks weird reporting while filling up queue
* bin/preprocess_dynamicdata.py
* enable PrefixTransform to work with only src
* WIP: translator for dynamicdata
* WIP: opened too early
* dynamicdata runs
* removed obsolete function
* restore mixing weight schedule counter
* pop all expired mixing weights at once
* compressed inputs
* allow missing transforms during sharding
* fail fast on missing files
* bad cleverness in vocab paths
* repr for transforms
* SentencepieceTransform
* send cpu tensors from producer to consumer.
  The data loader cannot access the GPU when CUDA Compute Mode is set to Exclusive_Process; trying to do so results in "CUDA error: all CUDA-capable devices are busy or unavailable". As the training process holds the GPU, it is responsible for calling torch.device to move data from cpu to gpu.
* torch.device moved to wrong place; also missing from validation
* prints to logging
* access subword vocab from transforms
* SwitchOut
* WbNoiseTransform
* multiple transform pipelines with share_inputs
* reverse: flips src and trg for backtranslation
* saving and loading data_loader_step
* pre-de-tokenization
* MorfessorEmTabooPeturbedTransform
* logging (on and off)
* parameter seg_n_samples
* protect from clobbering
* extra_prefix
* comment
* bin/frankenstein.py
* prune matching keys when config doesn't match
* insertion transform
* order of _inputs shouldn't matter
* Sentencepiece sampling params
* need to store in pickle
* started writing dynamicdata documentation
* critique of the current dataloader
* minor
* concepts
* usage
* dyndata.md
* updated fig
* better naming
* pep8
* translate opt
* re.groups should not be changed to tasks
* rename group to task in translator
* replaced readme with dynamicdata version
* dynamicdata requirements as optional
* DeterministicSegmentationTransform
* WIP: bin/debug_dynamicdata.py
* bin/debug_dynamicdata.py
* DeterministicSegmentationTransform
* sanity checks to SentencepieceTransform
* update README.md
* formatting
* template_dynamicdata.py
* cleaner structure of sharded directory: separate shards from vocabs/transforms
* pep8
* completed some TODOs
* automatically determine input corpus sizes
* use line counts to ensure shards of even size
* update README.md
* don't die on blanks
* spm_to_vocab.py, minor bugfixes
* missing subword vocab conversion step from SentencePiece usage
* clarified that meta.train.vocab_path points to subword vocab
* morfessor expected counts to vocab
* use Morfessor EM+Prune models directly as vocab
* pretokenize=True silently ignored
* documentation: pretokenization, exclusive_process
* arxiv link
* documentation: dampening, pretokenizing test set
* mention BART
* minor
* note about BPE-Dropout
* reverse README
* fix multiprocess & pooling
* some cleaning
* merge train_single
* fix gpu_rank for cpu
* add adapted dynamic data iteration mechanism
* some cleaning
* clean train script
* remove deprecated code
* remove deprecated code, patch 2
* fix multiprocessing not pickling generator for dynamic training
* fix some mentioned issues & update train config example file
* more transforms
* drop sharding, iterate directly on full corpus
* transform composite and statistics
* add some documentation
* improve transform performance related to sampling
* add ONMTTokenizer Transform & use new interface of sentencepiece
* MixingStrategy for DatasetIter to make it interchangeable
* fix related to valid_iter
* fix translate with dynamic trained model
* misc fix
* add BPE
* save dynamic sample if verbose
* [WIP] Some adaptations to the dynamic data pipeline (#4)
  * wip simplify shards stuff
  * some cleanup
  * rename shard to corpus
  * some cleanup
  * multiple producers, move transform up in code execution
  * apply transform properly in verbose mode
  * do not use cycle, which causes memory leak
  * make one queue per gpu per producer, to preserve reproducibility
  * enable bpe dropout for pyonmttok, v>1.19, fix inverted dropout value
  * terminate producers properly when training processes are done
  * single queue per producer/consumer couple
  * revert semaphore size change
  * make transforms a folder
  * add bart noise as transform
  * register transform to support extension (acoladgroup#6)
  * fix to BART
  * remove preprocess from dynamic (acoladgroup#8)
  * do not make links, make counters
  * add initial build_vocab script
  * merge preprocess into train
  * compute index with stride and offset
  * remove preprocess & some cleaning
  Co-authored-by: François Hernandez <[email protected]>
* enable transforms behavior change between train/valid
* fix minor issue with bart
* fix reproducibility for transforms (acoladgroup#9)
* update doc & config example
Co-authored-by: Stig-Arne Gronroos <[email protected]>
Co-authored-by: François Hernandez <[email protected]>
* drop old-style vocab support
* drop video captioning
* drop speech2text
* drop image2text
* drop previous experimental BART
* gather iterators in inputter.py into a separate file
* minor fix related to flask
* fix rebuild_test_model error
* update test_model
* add back img, audio, vid documentation in legacy
* pack src & tgt into a dict in transform
* add dynamic_dict support to dynamic iter
* add seq_len_trunc options to dynamic
* drop previous preprocess & training modules
* rewrite pr_check and some tests
* fix doc build exception
* add ggnn check & fix ggnn
* add support for align file
* add missing align data config
* raise error when running align with some transforms
* move dynamic train to bin
* move dynamic build_vocab to bin
* fix train entry
* beautify opts.py & rm unused sphinx doc
* remove src_dir related code
* add bart's __repr__
* use config for data_config
* remove unused dependency pretrainedmodels
* update build_vocab opts
* fix subword_type none for onmt_tok
* clean option validation
* minimum testing data.yaml
* better prefix transform
* rename dynamic.vocab to dynamic.fields
* add entry point for bin/build_vocab.py
* exit training after save sample
* fix bart's poisson_lambda description
* fix data_prepare test error due to previous change
* better api for DynamicDatasetIter
* extend n_sample
* silently default valid corpus weight
* make transform class option check extensible
* handle empty line
* clean train script
* embed pretrained embeddings stuff at beginning of train
* fix some flake8 issues
* move functions to embedding module
* remove functions from misc
* move embedding options check to right place
Co-authored-by: Linxiao ZENG <[email protected]>
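The "register transform to support extension" commit suggests a decorator-based registry, so that new transforms can be added without touching core code. Below is a hedged sketch of that pattern; `AVAILABLE_TRANSFORMS` and `register_transform` are assumed names, and the `PrefixTransform` body is simplified for illustration, not copied from the PR:

```python
# Sketch of a decorator-based transform registry (names are assumptions).
AVAILABLE_TRANSFORMS = {}

def register_transform(name):
    """Class decorator: add the transform class to the registry under `name`."""
    def wrapper(cls):
        AVAILABLE_TRANSFORMS[name] = cls
        return cls
    return wrapper

@register_transform("prefix")
class PrefixTransform:
    """Toy transform: prepend a control token (e.g. a target-language tag)."""
    def __init__(self, prefix):
        self.prefix = prefix

    def apply(self, tokens):
        return [self.prefix] + tokens

# Config-driven lookup: the name in the YAML config selects the class.
t = AVAILABLE_TRANSFORMS["prefix"]("<to_en>")
print(t.apply(["hello", "world"]))  # ['<to_en>', 'hello', 'world']
```

The registry is what lets a config file name its transform pipeline as plain strings while third-party modules extend the set simply by importing and decorating.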
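The mixing-weight commits ("restore mixing weight schedule counter", "MixingStrategy for DatasetIter to make it interchangeable") imply weighted interleaving of several corpora. One plausible reading is sketched below, deliberately avoiding `itertools.cycle` (which a commit above flags as leaking memory); the function name and the "take `weight` examples per round" semantics are assumptions, not the PR's code:

```python
# Hypothetical weighted mixing: per round, draw `weights[name]` examples
# from each corpus, wrapping around small corpora so they are oversampled.
def weighted_mix(corpora, weights):
    """Endless stream interleaving corpora according to integer weights."""
    positions = {name: 0 for name in corpora}
    while True:
        for name, examples in corpora.items():
            for _ in range(weights[name]):
                yield examples[positions[name] % len(examples)]
                positions[name] += 1

corpora = {"big": ["B1", "B2", "B3"], "small": ["s1"]}
stream = weighted_mix(corpora, {"big": 2, "small": 1})
print([next(stream) for _ in range(6)])  # ['B1', 'B2', 's1', 'B3', 'B1', 's1']
```

Keeping explicit positions instead of cycling iterators makes the state trivially picklable, which matters given the "fix multiprocessing not pickling generator" commit above.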
No description provided.