
Questions concerning configuring train_lora.py for custom corpus #130

@eshau


Hi! I followed the instructions for fine-tuning on my corpus and (I think) managed to do so successfully after days of debugging. I have A LOT of implementation questions, and the following is half guide, half questions about the process. I want to make it very clear that I am VERY thankful for the existence of this library, but I feel obligated to point out the issues below.



I first created a new conda environment (with conda create -n sat-finetune python=3.9 and conda install pip just to be safe) and ran the following:

git clone https://github.com/segment-any-text/wtpsplit
cd wtpsplit
pip install -r requirements.txt
pip install adapters==0.2.1 --no-dependencies
cd ..

Then I created the .pth dataset as per this format:

import torch

torch.save(
    {
        "language_code": {
            "sentence": {
                "sat-dataset": {
                    "meta": {
                        "train_data": ["train sentence 1", "train sentence 2"],
                    },
                    "data": [
                        "test sentence 1",
                        "test sentence 2",
                    ]
                }
            }
        }
    },
    "<path>/sat-dataset.pth"
)
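
For anyone building this file from their own sentence lists, here is a minimal sketch of the same nesting as a helper. The function name `build_sat_dataset`, the `"en"` language code, and the example sentences are my own placeholders, not part of wtpsplit; only the dictionary layout comes from the format above.

```python
def build_sat_dataset(language_code, train_sentences, test_sentences,
                      dataset_name="sat-dataset"):
    """Hypothetical helper: wrap sentence lists in the nesting shown above."""
    if not train_sentences or not test_sentences:
        raise ValueError("both train and test sentences must be non-empty")
    return {
        language_code: {
            "sentence": {
                dataset_name: {
                    "meta": {"train_data": list(train_sentences)},
                    "data": list(test_sentences),
                }
            }
        }
    }


dataset = build_sat_dataset(
    "en",  # placeholder language code
    ["train sentence 1", "train sentence 2"],
    ["test sentence 1", "test sentence 2"],
)

# Saving requires PyTorch; the path is left as a placeholder, as above.
# import torch
# torch.save(dataset, "<path>/sat-dataset.pth")
```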

My config is below:

{
    "model_name_or_path": "segment-any-text/sat-3l",
    "output_dir": "sat-3l-LL_lora",
    "block_size": 256,
    "eval_stride": 128,
    "do_train": true,
    "do_eval": true,
    "per_device_train_batch_size": 64,
    "per_device_eval_batch_size": 32,
    "gradient_accumulation_steps": 1,
    "eval_accumulation_steps": 8,
    "dataloader_num_workers": 1,
    "preprocessing_num_workers": 1,
    "learning_rate": 3e-4,
    "fp16": false,
    "num_train_epochs": 30,
    "logging_steps": 50,
    "report_to": "wandb",
    "wandb_project": "sentence",
    "save_steps": 100000000,
    "remove_unused_columns": false,
    "do_sentence_training": true,
    "do_auxiliary_training": false,
    "warmup_ratio": 0.1,
    "non_punctuation_sample_ratio": null,
    "prediction_loss_only": true,
    "use_auxiliary": true,
    "ddp_timeout": 3600,
    "use_subwords": true,
    "custom_punctuation_file": "punctuation_xlmr_unk.txt",
    "log_level": "warning",
    "adapter_config": "lora[r=16,alpha=32,intermediate_lora=True]",
    "text_path": "<path>/sat-dataset.pth",
    "weight_decay": 0.0,
    "auxiliary_remove_prob": 0.0,
    "train_adapter": true
}


The first issue that popped up was that wtpsplit wasn't installed. To fix this, I added the path of the wtpsplit dir to train_lora.py:

...
import wandb

# Added this line below
sys.path.insert(0, os.path.abspath('<path>/wtpsplit'))

from wtpsplit.models import SubwordXLMConfig, SubwordXLMForTokenClassification
...

In hindsight, I think this can be fixed by running pip install . in the correct directory, but I wasn't sure.
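
A quick way to tell whether the package is visible to the interpreter, without editing sys.path, is to ask the import system directly. This is a generic standard-library check, not something from wtpsplit; `is_importable` is a name I made up.

```python
import importlib.util


def is_importable(top_level_name):
    """Return True if a top-level module or package can be found for import."""
    return importlib.util.find_spec(top_level_name) is not None


# Prints True once the package is installed in the active environment.
print(is_importable("wtpsplit"))
```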


After this, I received an outdated CUDA version error because the requirements.txt file installs torch 1.7.1 by default. I tried upgrading to the version on my kernel with the recommended torch command (conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia), but this updates numpy to 2.0 among other things and causes version errors. Downgrading to torch=1.13.1+cu117 did not help (from a brief Google search, the version itself is buggy), so I progressively downgraded until torch=1.10.1+cu111 worked.


This made CUDA work but then I got an index error along these lines:

...
(Thousands of the line below)
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [864,0,0], thread: [126,0,0] Assertion `srcIndex < srcSelectDimSize` failed.

...
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: CUDA error: device-side assert triggered

I believe this is because in train_lora.py we add the newline token to the tokenizer:

...
tokenizer = AutoTokenizer.from_pretrained(args.base_model)
# needed since we create labels in collate_fn based on tokens
tokenizer.add_special_tokens({"additional_special_tokens": [AddedToken("\n")]})
custom_token_id = tokenizer.convert_tokens_to_ids("\n")
# used later to filter out special tokens
special_tokens_ids = set(tokenizer.all_special_ids)
special_tokens_ids.discard(custom_token_id)
...

but never update the size of the embedding in the backbone model. This leads to the tokenizer generating the input id corresponding to the newline token (250002) while the embedding matrix is not big enough to accommodate it. I thought this was because I had newlines in my sentences, but even after removing them I still received this error (I also later realized that newlines are added anyway in prepare_dataset). To fix this, I added this line:

...
special_tokens_ids = set(tokenizer.all_special_ids)
special_tokens_ids.discard(custom_token_id)

# Added this line below
backbone.resize_token_embeddings(len(tokenizer))

if "short" in dataset_name:
    one_sample_per_line = True
...
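
Since the CUDA assert above is hard to read, a CPU-side sanity check makes the mismatch obvious before anything reaches the GPU. This is only a sketch: `check_ids_fit` is a hypothetical helper, and the 250002-row embedding size is my inference from the newline token receiving id 250002.

```python
def check_ids_fit(input_ids, num_embeddings):
    """Raise a readable error if any token id falls outside the embedding table."""
    bad = sorted({i for i in input_ids if not 0 <= i < num_embeddings})
    if bad:
        raise IndexError(
            f"token ids {bad} are out of range for an embedding with "
            f"{num_embeddings} rows (valid ids: 0..{num_embeddings - 1})"
        )


# Assumed sizes: 250002 rows before the resize, so id 250002 does not fit.
check_ids_fit([0, 5, 250001], 250002)  # passes silently
try:
    check_ids_fit([0, 250002], 250002)
except IndexError as err:
    print(err)

# After resizing to len(tokenizer) == 250003 (assumed), the same ids fit.
check_ids_fit([0, 250002], 250003)
```

In the real script, the row count could presumably be read from the model (for a Hugging Face model, `backbone.get_input_embeddings().num_embeddings`), but I have not verified that against this codebase.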

This led to another error I will explain further below but at this point I had a few questions:



1. Should we have newlines in our train / valid sentences?


2. Adding the extra embedding for newline feels like a hack and I am pretty sure the original SaT-3l I used as the base model should have had this in the embedding. Was the wrong model used as the base? And is this new embedding for newline changing enough given we freeze the weights of the original model? Furthermore, is this leaking into the TokenClassifier part of the model?





After this, I tried running train_lora.py but the code was stuck. I did some debugging and it was stuck on this line:

...
    **kwargs,
)

# Stuck on this line below
trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)

logger.warning(f"Finished training for {lang} {dataset_name}.")
...

I did some digging and I think it was because I was running on 4 GPUs and the inter-GPU communication was not working. I tried several things (setting `ddp_backend=gloo` in the `config.json` file, setting `os.environ["NCCL_P2P_DISABLE"]="1"` at the top of `train_lora.py`), but the only thing that fixed it was restricting CUDA to one device by setting `os.environ["CUDA_VISIBLE_DEVICES"]="0"` at the top of `train_lora.py`. From my understanding of the paper and the GitHub issues I have read, the paper's experiments were run on TPUs (on Amazon SageMaker) and fine-tuning was done on 1 GPU, so this seems like an oversight. I feel like I have to ask the question here:



1. Is there a way to run fine-tuning on multiple GPUs?
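
For reference, the single-GPU workaround described above looks roughly like this. The one detail that matters is ordering: the variable has to be set before CUDA is initialized, so it should sit above any torch import. The device index "0" is just the one I used.

```python
import os

# Restrict this process to the first GPU. This must run before torch (or
# anything that imports torch) initializes CUDA, hence the placement at the
# very top of the script.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# import torch  # imported only after the environment variable is set

print(os.environ["CUDA_VISIBLE_DEVICES"])
```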





After I fixed this, the code ran but I received another error involving infinite values:


ValueError: Input contains infinity or a value too large for dtype('float64')

When I went through the traceback, I found the error to be in evaluate.py, specifically when compute_metrics in train_lora.py calls evaluate_sentence, which in turn calls get_metrics here:

newline_probs = char_probs[:, positive_index]

# This line below
metrics, info = get_metrics(newline_labels, newline_probs, threshold=threshold)

info["newline_labels"] = newline_labels

This is because of this line in get_metrics:

def get_metrics(labels, preds, threshold: float = 0.01):
    # Compute precision-recall curve and AUC

    # This line below
    precision, recall, thresholds = sklearn.metrics.precision_recall_curve(labels, preds)

    pr_auc = sklearn.metrics.auc(recall, precision)
    ...

because preds contains -np.inf. I did some more digging and found it was because we call token_to_char_probs here in evaluate_sentence:

...
if "xlm" in model.config.model_type:
    tokens = tokenizer.tokenize(text, verbose=False)

    # This line below
    char_probs = token_to_char_probs(text, tokens, logits, tokenizer, offsets_mapping)

else:
    ....

This is because token_to_char_probs in utils/__init__.py initializes the return array char_probs as -np.inf here:

def token_to_char_probs(text, tokens, token_logits, tokenizer, offsets_mapping):
    """Map from token probabalities to character probabilities"""
    
    # This line below
    char_probs = np.full((len(text), token_logits.shape[1]), -np.inf)  # Initialize with very low numbers
    ...

This matters because we only replace rows whose corresponding character is the last character of a non-special token:
...
# Assign the token's probability to the last character of the token
for i in range(valid_offsets.shape[0]):
    start, end = valid_offsets[i]

    # This line below
    char_probs[end - 1] = token_logits[valid_indices[i]]

...

We are left with a lot of -np.inf in the first column when we call get_metrics:

...
# This line below
newline_probs = char_probs[:, positive_index]

metrics, info = get_metrics(newline_labels, newline_probs, threshold=threshold)
...

and sklearn.metrics really does not like that. To fix this, I changed the offending line in get_metrics to pass sigmoid(preds), matching the call to f1_score in the same function:

def get_metrics(labels, preds, threshold: float = 0.01):
    # Compute precision-recall curve and AUC

    # I changed this line below
    precision, recall, thresholds = sklearn.metrics.precision_recall_curve(labels, sigmoid(preds))
    
    ....
    # Compute F1 score for a specific threshold (e.g., 0.01 after applying sigmoid)
    f1_at_specific_threshold = sklearn.metrics.f1_score(labels, sigmoid(preds) > threshold)
    ...
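
To make the mechanism concrete, here is a small pure-Python toy of what I think is happening. It is not the wtpsplit code: `toy_token_to_char_scores`, the offsets, and the scores are all made up. It only mimics the idea that the last character of each token gets a score while every other position keeps -inf, and that a sigmoid maps those -inf values to 0.

```python
import math

NEG_INF = float("-inf")


def toy_token_to_char_scores(text, offsets, token_scores):
    """Toy stand-in: only the last character of each token receives a score."""
    char_scores = [NEG_INF] * len(text)
    for (_, end), score in zip(offsets, token_scores):
        char_scores[end - 1] = score
    return char_scores


def sigmoid(x):
    """Numerically stable logistic function; maps -inf to exactly 0.0."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


text = "Hi there."
offsets = [(0, 2), (3, 8), (8, 9)]  # made-up (start, end) character spans
token_scores = [-4.0, -3.5, 2.0]    # made-up newline logits

raw = toy_token_to_char_scores(text, offsets, token_scores)
probs = [sigmoid(s) for s in raw]

print(sum(math.isinf(s) for s in raw), "of", len(raw), "raw scores are -inf")
print("all finite after sigmoid:", all(math.isfinite(p) for p in probs))
```

One consequence worth noting: after the sigmoid, every position that was never assigned a score has probability 0, so it is treated as a confident "no newline" in the precision-recall curve.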

With all of these changes, running the initial command with train_lora.py and config.json worked. The results look OK but I have another few questions at this point:



1. Is the auxiliary objective function necessary for fine-tuning? I tried setting "use_auxiliary": false in the config.json file but ran into warnings while training, and it errored when loading the result with SaT.


2. What is the difference between one_sample_per_line=True and one_sample_per_line=False? Does it actually make a difference on training results?


3. Is the optimal threshold returned by compute_metrics really the best threshold in your empirical experience?


4. Is corruption mandatory / desirable for validation? From my understanding, the default LoRA configs do not corrupt training samples, but evaluate_sentence has this line:

separator = Constants.SEPARATORS[lang_code]

# This line below
sentences = [corrupt(sentence, do_lowercase, do_remove_punct) for sentence in sentences]
text = separator.join(sentences)



Once again, thank you for reading this and I hope you can answer my questions / concerns.

Labels: documentation, question